Site Reliability Engineering (SRE) & CloudOps Runbooks Automation
CloudOps Automation to spend less time doing operations
- Challenges:
- Application Support and Operations take up to ~30%? of Developer time
- Documentation has advantages, BUT it takes X months to kick off and ~10%? ongoing
- Goals:
- Faster Resolution of Issues
- Simpler Escalations
- Easier Onboarding
- Better Training
- Better Discipline
- Automation:
- Improve Outcomes & Lower MTTR (mean time to repair)
- Reduce Manual DevOps "toil"
- Auto Remediations
- Increase Observability
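To make the MTTR goal above measurable, here is a minimal sketch of the calculation: the average time between detection and resolution across incidents. The function name `mean_time_to_repair` and the sample timestamps are illustrative assumptions, not data from any real system.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average (resolved - detected) duration over (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Illustrative incidents only: (detected, resolved)
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 min
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),   # 90 min
]
mttr = mean_time_to_repair(incidents)
print(f"MTTR: {mttr}")
```

Tracking this number before and after a runbook is automated is one simple way to check whether the automation actually lowers MTTR.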
Why CloudOps Runbooks Automation based on Python & Jupyter Notebooks
- Online, Collaborative: Improve team collaboration
- CloudOps: Python, JupyterLab
- Cloud Governance as Code: CloudCustodian (open-source rules engine)
- Documentation via text/markdown
- Easy Automation
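As a rough illustration of "Cloud Governance as Code", the sketch below builds a Cloud Custodian style policy as a Python dictionary inside a notebook cell. Cloud Custodian policies are normally written in YAML; the keys used here (`policies`, `name`, `resource`, `filters`, `actions`) follow its documented policy layout, while the policy name and the helper `check_policy_shape` are hypothetical and are not part of Cloud Custodian itself.

```python
import json

# Policy layout follows Cloud Custodian's policy file structure.
# The policy name and description are made up for this example.
policy_file = {
    "policies": [
        {
            "name": "ec2-stop-missing-owner-tag",
            "resource": "aws.ec2",
            "description": "Stop EC2 instances that have no Owner tag",
            "filters": [{"tag:Owner": "absent"}],
            "actions": ["stop"],
        }
    ]
}


def check_policy_shape(doc):
    """Hypothetical pre-flight check; NOT Cloud Custodian's own validation."""
    problems = []
    for i, policy in enumerate(doc.get("policies", [])):
        for key in ("name", "resource"):
            if key not in policy:
                problems.append(f"policy {i}: missing '{key}'")
    return problems


# Render the policy so it can be reviewed alongside the runbook's markdown.
rendered = json.dumps(policy_file, indent=2)
print(rendered)
print(check_policy_shape(policy_file))
```

Keeping the policy next to its documentation in the same notebook is the point of the sketch; real validation and execution would still go through the Cloud Custodian tooling.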
CloudOps Runbooks
- Runbooks
- Configurations
- SSO Credentials:
- Input Params: region, AMI ID, etc., listed with Name | Description | Value | Required columns
- Output Params for each Action
- Actions
- Others: GitOps
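The runbook structure above (input params with Name | Description | Value | Required columns, actions, and output params per action) could be modeled along these lines. This is only a sketch under assumptions: the class names `Param`, `Action`, and `Runbook`, the `describe_target` action, and the sample values are all hypothetical, and no cloud API is called.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Param:
    """One row of the input table: Name | Description | Value | Required."""
    name: str
    description: str
    value: Optional[Any] = None
    required: bool = False


@dataclass
class Action:
    """A named step that maps input values to its own output params."""
    name: str
    run: Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Runbook:
    name: str
    inputs: List[Param]
    actions: List[Action] = field(default_factory=list)

    def missing_inputs(self) -> List[str]:
        """Names of required params that have no value yet."""
        return [p.name for p in self.inputs if p.required and p.value is None]

    def execute(self) -> Dict[str, Dict[str, Any]]:
        """Run every action in order; return output params keyed by action name."""
        missing = self.missing_inputs()
        if missing:
            raise ValueError(f"missing required inputs: {', '.join(missing)}")
        values = {p.name: p.value for p in self.inputs}
        return {action.name: action.run(values) for action in self.actions}


def describe_target(values: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical action: only formats its inputs, no cloud call is made."""
    return {"target": f"{values['ami_id']}@{values['region']}"}


runbook = Runbook(
    name="example-runbook",
    inputs=[
        Param("region", "Cloud region to operate in", "us-east-1", required=True),
        Param("ami_id", "Machine image to act on", "ami-example", required=True),
    ],
    actions=[Action("describe_target", describe_target)],
)
outputs = runbook.execute()
print(outputs)
```

Checking required inputs before any action runs keeps a half-configured runbook from executing partway, which matters once actions start changing real resources.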