Skip to content

Certificate Management SOP

Standard Operating Procedure for certificate lifecycle management across AWS ACM, IAM server certificates, and Azure Key Vault.

1. Overview

This SOP covers the end-to-end lifecycle of TLS/SSL certificates across the enterprise Landing Zone:

Certificate Source Auto-Renewal Scope Management
AWS ACM (public) Yes (DNS-validated) Regional, per-account Managed by ACM
AWS ACM (private) Yes (via ACM PCA) Regional, per-account Managed by ACM PCA
IAM Server Certificates No Global, per-account Manual rotation required
Azure Key Vault Configurable Per-subscription Managed by Key Vault policy

Applies to: All AWS accounts in the Landing Zone (31+ accounts) and Azure subscriptions under tenant $AZURE_TENANT_ID.

CLI toolkit: All discovery, triage, and reporting operations use the runbooks cert command group (5 commands, read-only).


2. Renewal Procedures

2.1 ACM Auto-Renewal (DNS-Validated)

ACM automatically renews certificates that meet all conditions:

  • Certificate status is ISSUED
  • DNS CNAME validation record exists and resolves correctly
  • Certificate is associated with an AWS resource (ELB, CloudFront, API Gateway)

AWSO-143 Lesson Learned

21 expired in-use certificates were discovered across 31 accounts. Root cause: stale DNS CNAME validation records had been removed during DNS cleanup, preventing ACM auto-renewal. Always verify CNAME records exist before assuming auto-renewal will work.

Verification:

# Find certs that should auto-renew but are approaching expiry
runbooks cert expiring --days 45 --all-accounts \
    --management-profile $AWS_MANAGEMENT_PROFILE

2.2 ACM DNS Validation (Manual Trigger)

When a certificate is stuck in PENDING_VALIDATION:

  1. Identify stuck certificates:

    runbooks cert inventory --status PENDING_VALIDATION \
        --all-accounts --management-profile $AWS_MANAGEMENT_PROFILE
    
  2. Extract the required CNAME from ACM (Console or API):

    aws acm describe-certificate \
        --certificate-arn $CERT_ARN \
        --profile $AWS_PROFILE \
        --query 'Certificate.DomainValidationOptions[].ResourceRecord'
    
  3. Create the CNAME record in Route 53 or your external DNS provider.

  4. Wait for validation (typically 5-30 minutes). Re-run inventory to confirm status changes to ISSUED.

2.3 IAM Server Certificate Manual Rotation

IAM server certificates never auto-renew. Follow this procedure:

  1. Generate or obtain the new certificate from your CA.

  2. Upload the new certificate:

    aws iam upload-server-certificate \
        --server-certificate-name $NEW_CERT_NAME \
        --certificate-body file://cert.pem \
        --private-key file://key.pem \
        --certificate-chain file://chain.pem \
        --profile $AWS_PROFILE
    
  3. Update all resource references (ELB listeners, CloudFront distributions) to point to the new certificate ARN.

  4. Verify the old certificate is no longer in use:

    runbooks cert inventory --in-use-only --profile $AWS_PROFILE
    
  5. Delete the old certificate only after confirming no resources reference it.


3. Escalation Matrix

Escalation bands align with the runbooks cert expiry bucket system:

Days to Expiry Expiry Bucket Severity Action Responsible Escalation To
> 90 days VALID Info Add to renewal backlog SRE team --
31-90 days ATTENTION_90D Low Create tracking ticket, verify DNS validation records SRE team lead Account owner
8-30 days WARNING_30D Medium Investigate renewal blockers, attempt manual renewal SRE lead + Account owner Security lead
1-7 days CRITICAL_7D High Block-and-tackle remediation, daily status updates Security lead + Platform team CTO
0 (expired) EXPIRED Critical P1 incident, immediate remediation, stakeholder notification Incident commander CTO + CISO

Triage command (generates escalation-ready output):

runbooks cert triage --days 90 --all-accounts \
    --ops-profile $AWS_OPERATIONS_PROFILE \
    --mode executive --email

4. Troubleshooting Guide

4.1 PENDING_VALIDATION

Symptom

Certificate shows PENDING_VALIDATION status for more than 30 minutes.

Root Cause: DNS CNAME validation record is missing, incorrect, or not yet propagated.

Diagnosis:

# List all PENDING_VALIDATION certificates
runbooks cert inventory --status PENDING_VALIDATION \
    --ops-profile $AWS_OPERATIONS_PROFILE

Resolution:

  1. Extract required CNAME from ACM (see Section 2.2)
  2. Verify the CNAME exists: dig CNAME _validation-record.example.com
  3. If missing, create it in Route 53 or external DNS
  4. If using external DNS, check TTL and propagation delay

4.2 EXPIRED Certificates

Symptom

Certificate status is EXPIRED or past its NotAfter date.

Root Cause: Auto-renewal failed (stale CNAME) or IAM cert was not manually rotated.

Diagnosis:

# List all expired certificates with in-use status
runbooks cert expiring --days 0 --all-accounts \
    --management-profile $AWS_MANAGEMENT_PROFILE

Resolution:

Cert Type Fix
ACM (DNS-validated) Restore CNAME record, request new certificate if expired >30 days
ACM (email-validated) Re-validate via email approval
IAM Server Cert Follow manual rotation (Section 2.3)

4.3 DNS CNAME Missing

Symptom

ACM auto-renewal fails silently. Certificate transitions from ISSUED to EXPIRED without PENDING_VALIDATION intermediate state.

Root Cause: The _acme-challenge or ACM validation CNAME was deleted during DNS cleanup, zone migration, or infrastructure-as-code drift.

Resolution:

  1. Run runbooks cert inventory --all-accounts to identify affected domains
  2. For each affected certificate, re-extract the CNAME from ACM
  3. Re-create the CNAME in DNS
  4. If the certificate is already expired, request a new ACM certificate for the same domain
  5. Add a tag managed-by: acm-auto-renew to DNS validation records to prevent accidental cleanup

Prevention

Add ACM validation CNAMEs to your IaC (Terraform/CloudFormation) to prevent drift. Never manually delete DNS records with _ prefix without verifying they are not ACM validation records.


5. CLI Reference

All certificate operations use the runbooks cert command group:

Task Command Output
Check DNS validation records runbooks cert dns-check Table: Domain, CNAME, DNS Status, Action
Discover all certificates runbooks cert inventory Rich table with expiry buckets
Find expiring certificates runbooks cert expiring Filtered list within N days
Executive risk report runbooks cert report Markdown report with risk breakdown
Combined triage workflow runbooks cert triage Inventory + expiring + report

Quick Start Examples

runbooks cert triage --profile $AWS_PROFILE --days 90
runbooks cert triage --ops-profile $AWS_OPERATIONS_PROFILE \
    --mode executive --output-dir ./cert-reports
runbooks cert inventory --all-accounts \
    --management-profile $AWS_MANAGEMENT_PROFILE \
    --azure --subscription $AZURE_SUBSCRIPTION_ID
runbooks cert expiring --days 30 --ops-profile $AWS_OPERATIONS_PROFILE \
    --export-csv expiring.csv --export-json expiring.json

6. Monitoring

Automated Triage Schedule

Set up a weekly certificate triage as a cron job or CI/CD scheduled pipeline:

# Weekly triage (Monday 9am) — adjust cron schedule to your needs
# 0 9 * * 1
runbooks cert triage --days 90 \
    --ops-profile $AWS_OPERATIONS_PROFILE \
    --mode executive --email \
    --output-dir /tmp/cert-reports \
    --export-json /tmp/cert-reports/triage-$(date +%Y-%m-%d).json

CloudWatch ACM Expiry Alarms

AWS ACM publishes the DaysToExpiry metric to CloudWatch for each certificate. Configure alarms at key thresholds:

Threshold Alarm Action SNS Topic
90 days Informational notification cert-lifecycle-info
30 days Warning to SRE channel cert-lifecycle-warning
7 days Page on-call SRE cert-lifecycle-critical
# Example: CloudWatch alarm for 30-day warning (per certificate)
aws cloudwatch put-metric-alarm \
    --alarm-name "cert-expiry-30d-${CERT_DOMAIN}" \
    --namespace "AWS/CertificateManager" \
    --metric-name "DaysToExpiry" \
    --dimensions Name=CertificateArn,Value=$CERT_ARN \
    --statistic Minimum \
    --period 86400 \
    --evaluation-periods 1 \
    --threshold 30 \
    --comparison-operator LessThanOrEqualToThreshold \
    --alarm-actions $SNS_TOPIC_ARN \
    --profile $AWS_PROFILE

Dashboard Integration

Combine automated triage with the FinOps dashboard for a unified operational view:

  1. Weekly: runbooks cert triage --mode executive for stakeholder summary
  2. Daily: runbooks cert expiring --days 7 for critical alerts
  3. Monthly: runbooks cert report for compliance audit trail