Certificate Management SOP¶
Standard Operating Procedure for certificate lifecycle management across AWS ACM, IAM server certificates, and Azure Key Vault.
1. Overview¶
This SOP covers the end-to-end lifecycle of TLS/SSL certificates across the enterprise Landing Zone:
| Certificate Source | Auto-Renewal | Scope | Management |
|---|---|---|---|
| AWS ACM (public) | Yes (DNS-validated) | Regional, per-account | Managed by ACM |
| AWS ACM (private) | Yes (via ACM PCA) | Regional, per-account | Managed by ACM PCA |
| IAM Server Certificates | No | Global, per-account | Manual rotation required |
| Azure Key Vault | Configurable | Per-subscription | Managed by Key Vault policy |
Applies to: All AWS accounts in the Landing Zone (31+ accounts) and Azure subscriptions under tenant $AZURE_TENANT_ID.
CLI toolkit: All discovery, triage, and reporting operations use the runbooks cert command group (5 commands, read-only).
2. Renewal Procedures¶
2.1 ACM Auto-Renewal (DNS-Validated)¶
ACM automatically renews certificates that meet all conditions:
- Certificate status is
ISSUED - DNS CNAME validation record exists and resolves correctly
- Certificate is associated with an AWS resource (ELB, CloudFront, API Gateway)
AWSO-143 Lesson Learned
21 expired in-use certificates were discovered across 31 accounts. Root cause: stale DNS CNAME validation records had been removed during DNS cleanup, preventing ACM auto-renewal. Always verify CNAME records exist before assuming auto-renewal will work.
Verification:
# Find certs that should auto-renew but are approaching expiry
runbooks cert expiring --days 45 --all-accounts \
--management-profile $AWS_MANAGEMENT_PROFILE
2.2 ACM DNS Validation (Manual Trigger)¶
When a certificate is stuck in PENDING_VALIDATION:
-
Identify stuck certificates:
-
Extract the required CNAME from ACM (Console or API):
-
Create the CNAME record in Route 53 or your external DNS provider.
-
Wait for validation (typically 5-30 minutes). Re-run inventory to confirm status changes to
ISSUED.
2.3 IAM Server Certificate Manual Rotation¶
IAM server certificates never auto-renew. Follow this procedure:
-
Generate or obtain the new certificate from your CA.
-
Upload the new certificate:
-
Update all resource references (ELB listeners, CloudFront distributions) to point to the new certificate ARN.
-
Verify the old certificate is no longer in use:
-
Delete the old certificate only after confirming no resources reference it.
3. Escalation Matrix¶
Escalation bands align with the runbooks cert expiry bucket system:
| Days to Expiry | Expiry Bucket | Severity | Action | Responsible | Escalation To |
|---|---|---|---|---|---|
| > 90 days | VALID | Info | Add to renewal backlog | SRE team | -- |
| 31-90 days | ATTENTION_90D | Low | Create tracking ticket, verify DNS validation records | SRE team lead | Account owner |
| 8-30 days | WARNING_30D | Medium | Investigate renewal blockers, attempt manual renewal | SRE lead + Account owner | Security lead |
| 1-7 days | CRITICAL_7D | High | Block-and-tackle remediation, daily status updates | Security lead + Platform team | CTO |
| 0 (expired) | EXPIRED | Critical | P1 incident, immediate remediation, stakeholder notification | Incident commander | CTO + CISO |
Triage command (generates escalation-ready output):
runbooks cert triage --days 90 --all-accounts \
--ops-profile $AWS_OPERATIONS_PROFILE \
--mode executive --email
4. Troubleshooting Guide¶
4.1 PENDING_VALIDATION¶
Symptom
Certificate shows PENDING_VALIDATION status for more than 30 minutes.
Root Cause: DNS CNAME validation record is missing, incorrect, or not yet propagated.
Diagnosis:
# List all PENDING_VALIDATION certificates
runbooks cert inventory --status PENDING_VALIDATION \
--ops-profile $AWS_OPERATIONS_PROFILE
Resolution:
- Extract required CNAME from ACM (see Section 2.2)
- Verify the CNAME exists:
dig CNAME _validation-record.example.com - If missing, create it in Route 53 or external DNS
- If using external DNS, check TTL and propagation delay
4.2 EXPIRED Certificates¶
Symptom
Certificate status is EXPIRED or past its NotAfter date.
Root Cause: Auto-renewal failed (stale CNAME) or IAM cert was not manually rotated.
Diagnosis:
# List all expired certificates with in-use status
runbooks cert expiring --days 0 --all-accounts \
--management-profile $AWS_MANAGEMENT_PROFILE
Resolution:
| Cert Type | Fix |
|---|---|
| ACM (DNS-validated) | Restore CNAME record, request new certificate if expired >30 days |
| ACM (email-validated) | Re-validate via email approval |
| IAM Server Cert | Follow manual rotation (Section 2.3) |
4.3 DNS CNAME Missing¶
Symptom
ACM auto-renewal fails silently. Certificate transitions from ISSUED to EXPIRED without PENDING_VALIDATION intermediate state.
Root Cause: The _acme-challenge or ACM validation CNAME was deleted during DNS cleanup, zone migration, or infrastructure-as-code drift.
Resolution:
- Run
runbooks cert inventory --all-accountsto identify affected domains - For each affected certificate, re-extract the CNAME from ACM
- Re-create the CNAME in DNS
- If the certificate is already expired, request a new ACM certificate for the same domain
- Add a tag
managed-by: acm-auto-renewto DNS validation records to prevent accidental cleanup
Prevention
Add ACM validation CNAMEs to your IaC (Terraform/CloudFormation) to prevent drift. Never manually delete DNS records with _ prefix without verifying they are not ACM validation records.
5. CLI Reference¶
All certificate operations use the runbooks cert command group:
| Task | Command | Output |
|---|---|---|
| Check DNS validation records | runbooks cert dns-check |
Table: Domain, CNAME, DNS Status, Action |
| Discover all certificates | runbooks cert inventory |
Rich table with expiry buckets |
| Find expiring certificates | runbooks cert expiring |
Filtered list within N days |
| Executive risk report | runbooks cert report |
Markdown report with risk breakdown |
| Combined triage workflow | runbooks cert triage |
Inventory + expiring + report |
Quick Start Examples¶
6. Monitoring¶
Automated Triage Schedule¶
Set up a weekly certificate triage as a cron job or CI/CD scheduled pipeline:
# Weekly triage (Monday 9am) — adjust cron schedule to your needs
# 0 9 * * 1
runbooks cert triage --days 90 \
--ops-profile $AWS_OPERATIONS_PROFILE \
--mode executive --email \
--output-dir /tmp/cert-reports \
--export-json /tmp/cert-reports/triage-$(date +%Y-%m-%d).json
CloudWatch ACM Expiry Alarms¶
AWS ACM publishes the DaysToExpiry metric to CloudWatch for each certificate. Configure alarms at key thresholds:
| Threshold | Alarm Action | SNS Topic |
|---|---|---|
| 90 days | Informational notification | cert-lifecycle-info |
| 30 days | Warning to SRE channel | cert-lifecycle-warning |
| 7 days | Page on-call SRE | cert-lifecycle-critical |
# Example: CloudWatch alarm for 30-day warning (per certificate)
aws cloudwatch put-metric-alarm \
--alarm-name "cert-expiry-30d-${CERT_DOMAIN}" \
--namespace "AWS/CertificateManager" \
--metric-name "DaysToExpiry" \
--dimensions Name=CertificateArn,Value=$CERT_ARN \
--statistic Minimum \
--period 86400 \
--evaluation-periods 1 \
--threshold 30 \
--comparison-operator LessThanOrEqualToThreshold \
--alarm-actions $SNS_TOPIC_ARN \
--profile $AWS_PROFILE
Dashboard Integration¶
Combine automated triage with the FinOps dashboard for a unified operational view:
- Weekly:
runbooks cert triage --mode executivefor stakeholder summary - Daily:
runbooks cert expiring --days 7for critical alerts - Monthly:
runbooks cert reportfor compliance audit trail