CentralizeOps: Simplify Operations to Focus on Scaling Workloads

🚀 Introduction to the Next-Gen AWS Systems Manager 🌟

Operational overhead grows quickly when you maintain enterprise-scale, multi-account cloud infrastructure. Manual inventory management, inconsistent patching, and fragmented visibility across hybrid and multi-cloud environments introduce unnecessary complexity and risk.

AWS Systems Manager (SSM) addresses this with a focus on centralization, automation, and ease of use:
  • Core tools in one place: Automation, Run Command, Patch Manager, and Session Manager.
  • Centralized multi-account/multi-region dashboards for inventory and operational insights.
  • Integration with AWS Organizations for centralized management from a single delegated admin account (centralize-ops).
  • Built-in remediation workflows for unmanaged nodes and compliance drift.
  • Visual runbook builders for authoring automation.
  • Diagnose & Remediate functionality to automatically resolve common management and networking issues.

This post takes a hands-on look at the next-generation AWS Systems Manager, walking through its features and architecture from an experienced hybrid-cloud architect’s perspective.


🛠️ Building a Centralized Operations Hub

🚀 Scaling Day-2 Operations with the Next-Gen AWS Systems Manager

Audience: Cloud & DevOps engineers who already automate at least some fleet operations and want deeper guard-rails, multi-account visibility, and low-code runbooks.

  • Simplified deployment across the organization: set up AWS Systems Manager with one click.
  • Maintaining the organization’s infrastructure: enable and schedule Diagnose & Remediate so you can track unmanaged nodes and resolve their issues.
  • Streamlined routine management tasks: automate operations on managed nodes once they are registered.

  • The result: organizations spend less effort on operations and can focus on scaling their workloads.

    • IT Operations Manager: Getting started
    • DevOps Engineer: Scaling up
    • Compliance Manager: Compliance

How do I efficiently manage our environments and nodes as we scale?

Visual tour of the AWS Systems Manager Demo

How do I centralize and automate operations?

Improve visibility and control
  • Execute critical operational tasks
  • Automate operational tasks
  • Streamline complex tasks
  • Safely perform disruptive tasks in bulk

How do we automate to meet compliance and mitigate risks?

Automation runbook builder with low-code visual designer


🎬 Technical Demo Script

Meet the team

🎤 Intro: "Today, let's briefly demonstrate the core new capabilities of AWS Systems Manager’s integrated experience."

Step 1 – Delegated Admin Setup (centralized-ops):

  • "Here’s the AWS Organizations console, with our Delegated Admin account registered."
  • "From the Systems Manager console, click Get Started. Choose All Accounts and All Regions. SSM deploys CloudFormation stack sets automatically, enabling centralized visibility."

Step 2 – Centralized Inventory Dashboard:

  • "Once enabled, the dashboard provides a comprehensive, centralized inventory of nodes across AWS and on-premises environments."
  • "Quickly see managed versus unmanaged nodes, OS breakdowns, agent versions, and detailed node statuses at a glance."

Step 3 – Diagnose & Remediate Unmanaged Nodes:

  • "Clicking on Unmanaged Nodes reveals reasons nodes aren't integrated, such as missing VPC endpoints."
  • "Click Diagnose and Remediate to execute an automated runbook. Remediation logs show real-time progress and outcomes."

Step 4 – Automation Runbook for OS Upgrade:

  • "Let’s illustrate automation with a common task—upgrading Windows Server nodes."
  • "Select nodes running outdated OS versions directly from the inventory dashboard."
  • "Use the built-in AWS automation runbook, specifying rate controls to safely orchestrate the upgrade. AWS automatically snapshots instances, applies updates, and rolls back if needed."

Conclusion & Benefits:

  • "These enhancements—centralized inventory, automated remediation, and powerful runbook automation—significantly streamline operations, improve security posture, and enable effortless management at scale."

1. One Integrated Control-Plane

AWS Systems Manager (SSM) used to feel like a Swiss-army knife: 22 sub-services (Run Command, Patch Manager, Session Manager, Automation, and Quick Setup among them) that you had to wire together yourself. The Next-Gen AWS Systems Manager turns those pieces into an opinionated, organisation-aware dashboard that ships with:

| # | Capability (Job-to-Be-Done) | What Changed in the Next Gen UX | Why It Matters in Practice |
|---|---|---|---|
| 1 | 🔍 Inventory & Drift | One-click delegated-admin setup via AWS Organizations. Cross-region, cross-account inventory renders in <2 min. | No more custom Resource Data Sync pipelines. Immediate view of unmanaged or out-of-date nodes. |
| 2 | 🛠️ Diagnose & Remediate Unmanaged Nodes | Built-in SSM-DiagnoseAndRemediate runbook detects missing VPC endpoints / IAM / agent. | Cuts first-day toil when onboarding 100+ legacy accounts; evidence logged in CloudTrail. |
| 3 | 📦 Patch & Compliance Posture | Patch compliance tiles surface host CVE exposure; drill-down links directly to Automation runbooks. | Ops can prove SLA (≤ 7 days critical-patch) to auditors without spreadsheets. |
| 4 | 🧮 Drag-and-Drop Runbook Builder | Visual designer + Amazon Q security linting; commits YAML to SSM Documents. | Platform teams codify “sudo” playbooks once, then delegate safe execution to app teams. |
| 5 | 🔑 Session Manager Deep-Link | Fleet page now shows “Connect” next to each instance. | Removes last excuse for bastion hosts; sessions are keyless and fully logged. |

1️⃣ Centralized Operations via Delegated Admin

AWS Systems Manager now seamlessly integrates with AWS Organizations to centralize management of resources across:

  • Multiple AWS accounts & AWS regions
  • Hybrid/on-premises nodes
  • Multi-cloud environments

This integration is achieved by registering a Delegated Admin account within AWS Organizations, allowing centralized operational visibility and control.

Practical Benefits:

| Benefit | Impact |
|---|---|
| Single pane visibility | Reduces operational complexity, improves oversight |
| Central inventory management | Provides real-time status and health across environments |
| Standardized operational tasks | Reduces configuration drift and operational errors |

2️⃣ Diagnose & Remediate Unmanaged Nodes

A common operational challenge is managing nodes not properly integrated with SSM due to network or agent issues.

The enhanced SSM experience introduces the Diagnose and Remediate capability to automate troubleshooting and resolve such issues:

  • Identifies networking misconfigurations (e.g., missing VPC endpoints)
  • Automates remedial steps through pre-built runbooks

Example Workflow:

| Step | Action | Automation |
|---|---|---|
| 1 | Identify unmanaged nodes | Inventory Dashboard |
| 2 | Run diagnosis workflow | Built-in automation |
| 3 | Review findings | Automated logs |
| 4 | Execute remediation | Automated runbook |
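
Step 1 of the workflow above (identifying unmanaged nodes) can be approximated outside the console. The following is a minimal sketch, not the logic the dashboard itself uses: it assumes you have already collected EC2 instance IDs and node records shaped like the `InstanceId` / `PingStatus` fields returned by `DescribeInstanceInformation`. The function name `find_unmanaged` and the three bucket names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def find_unmanaged(
    ec2_instance_ids: Iterable[str],
    ssm_nodes: Iterable[Mapping[str, str]],
) -> Dict[str, List[str]]:
    """Split EC2 instance IDs into managed, unreachable, and unmanaged.

    ssm_nodes: records with "InstanceId" and "PingStatus" keys
    (assumed to mirror DescribeInstanceInformation output).
    """
    status_by_id = {n["InstanceId"]: n.get("PingStatus", "Unknown") for n in ssm_nodes}
    report: Dict[str, List[str]] = {"managed": [], "unreachable": [], "unmanaged": []}
    for instance_id in sorted(set(ec2_instance_ids)):
        status = status_by_id.get(instance_id)
        if status is None:
            # Never registered with Systems Manager: candidate for Diagnose & Remediate.
            report["unmanaged"].append(instance_id)
        elif status == "Online":
            report["managed"].append(instance_id)
        else:
            # Registered, but the agent is not currently reporting in.
            report["unreachable"].append(instance_id)
    return report


if __name__ == "__main__":
    nodes = [
        {"InstanceId": "i-a", "PingStatus": "Online"},
        {"InstanceId": "i-b", "PingStatus": "ConnectionLost"},
    ]
    print(find_unmanaged(["i-a", "i-b", "i-c"], nodes))
```

Separating "unreachable" from "unmanaged" is a design choice for this sketch: the first group usually points at agent or network problems on a node that was onboarded once, while the second was never onboarded at all.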

3️⃣ Advanced Automation with Runbook Builder

Automation remains key to scalability. AWS has significantly simplified runbook creation through its new visual drag-and-drop Automation Runbook Builder:

  • Integrates best-practice operational workflows
  • Supports rate-controlled execution across fleets
  • Provides automatic rollbacks on detected failures
  • Ensures auditability and compliance with built-in security checks (via Amazon Q Developer integration)

Practical Use Cases:

| Scenario | Automation Runbook |
|---|---|
| OS upgrades (e.g., Windows Server 2019 to 2022) | Snapshot creation → patching → validation → rollback on error |
| Routine fleet patching | Baseline checks → deployment throttling → compliance reporting |
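
To make "rate-controlled execution" concrete, here is a simplified local simulation of a concurrency limit combined with an error threshold. It is a sketch only, not how the Automation service is implemented: batches run sequentially, the rounding of percentage values is an assumption, and `resolve_limit`, `run_with_rate_control`, and the `action` callback are hypothetical names.

```python
import math
from typing import Callable, Dict, List, Sequence


def resolve_limit(value: str, total: int, minimum: int = 0) -> int:
    """Turn "5" or "10%" into an absolute count (percentages rounded up; an assumption)."""
    if value.endswith("%"):
        count = math.ceil(total * float(value[:-1]) / 100)
    else:
        count = int(value)
    return max(minimum, count)


def run_with_rate_control(
    targets: Sequence[str],
    action: Callable[[str], bool],
    max_concurrency: str = "10%",
    max_errors: str = "0",
) -> Dict[str, List[str]]:
    """Apply action to targets in batches; stop starting new batches once
    the number of failures exceeds the error budget."""
    batch_size = resolve_limit(max_concurrency, len(targets), minimum=1)
    error_budget = resolve_limit(max_errors, len(targets))
    succeeded: List[str] = []
    failed: List[str] = []
    for start in range(0, len(targets), batch_size):
        if len(failed) > error_budget:
            break  # budget exhausted: remaining targets are left untouched
        for target in targets[start:start + batch_size]:
            (succeeded if action(target) else failed).append(target)
    processed = len(succeeded) + len(failed)
    return {"succeeded": succeeded, "failed": failed, "skipped": list(targets[processed:])}


if __name__ == "__main__":
    fleet = [f"i-{n}" for n in range(10)]
    # Pretend the upgrade fails on one node; with max_errors="0" the rollout halts early.
    print(run_with_rate_control(fleet, lambda t: t != "i-1", "20%", "0"))
```

The point of the sketch is the shape of the safety net: a small first batch limits blast radius, and the error budget turns one bad node into a paused rollout rather than a fleet-wide incident.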

📝 Takeaways

  • Start with inventory — you can’t patch what you can’t see.
  • Automate everything twice: once in a sandbox, once in prod under an approval workflow.
  • Guard service-linked roles (AWSServiceRoleForAmazonSSM_*) with SCPs; they’re your cross-account blast-radius.
  • Retire bastions — Session Manager + port-forward handles RDP/SSH use-cases with full audit.
  • Treat runbooks like code: peer review, version in Git, CI-lint with Amazon Q, and restrict who can call ssm:StartAutomationExecution via IAM conditions.

🚨 Technical Best-Practices & Recommendations

From practical experience deploying large-scale environments, these are essential best practices:

  • Service-linked Roles (SLR): Regularly audit SLRs (e.g., AWSServiceRoleForAmazonSSM_AccountDiscovery) ensuring trust policies are restricted to required AWS principals.
  • Operational Automation: Use runbooks extensively, enforce change approval workflows, and integrate with ChatOps for visibility.
  • Immutable Logs: Ensure audit and security logs (CloudTrail, SSM actions) reside in isolated accounts with strict read-only access.
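
The service-linked-role guard-rail mentioned above can be expressed as a service control policy. The sketch below only builds the policy document as JSON; attaching it in AWS Organizations is a separate step. The function name and the pipeline role ARN pattern are hypothetical, and the statement is one possible reading of "deny trust-policy edits on SSM service-linked roles except via a pipeline role", not a vetted policy.

```python
import json
from typing import Any, Dict


def build_slr_guard_scp(pipeline_role_arn_pattern: str) -> Dict[str, Any]:
    """Return an SCP document that denies iam:UpdateAssumeRolePolicy on roles
    under /aws-service-role/ssm.amazonaws.com/ unless the caller matches
    the given pipeline role ARN pattern."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUpdateAssumeRolePolicyOnSsmServiceRoles",
                "Effect": "Deny",
                "Action": "iam:UpdateAssumeRolePolicy",
                "Resource": "arn:aws:iam::*:role/aws-service-role/ssm.amazonaws.com/*",
                "Condition": {
                    # Exempt only the deployment pipeline (hypothetical role name).
                    "ArnNotLike": {"aws:PrincipalArn": pipeline_role_arn_pattern}
                },
            }
        ],
    }


if __name__ == "__main__":
    scp = build_slr_guard_scp("arn:aws:iam::*:role/pipeline/guardrail-deployer")
    print(json.dumps(scp, indent=2))
```

Generating the document from code rather than hand-editing JSON keeps the exemption pattern in one reviewed place, which fits the "treat runbooks like code" takeaway.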

Architecture Pattern — What Good Looks Like


flowchart TD
  subgraph Org
    direction LR
    MGMT[(Management Account)]
    AUDIT[(Audit / Log Archive)]
    OPS[(Delegated Admin: Central Ops)]
  end
  MemberAcc1[(Workload A)] -->|Quick Setup StackSets| OPS
  MemberAcc2[(Workload B)] --> OPS
  OPS --OpsData Sync--> S3Inventory[(S3 Inventory Bucket)]
  OPS --Automation--> MemberAcc1
  OPS --Run Command--> MemberAcc2
  AUDIT <--CloudTrail & Config--> MemberAcc1

Key points:

  • Enable trusted access for ssm.amazonaws.com once in the management account; then promote a single delegated admin (Central Ops).
  • Quick Setup pushes the agent, IAM roles, and VPC interface endpoints into every new account automatically (StackSets with drift detection).
  • Audit/log archive account keeps immutable CloudTrail and Config data; Central Ops queries it but cannot delete.

Deep-Dive Action Plan 🗓️

| Week | Owner | Concrete Deliverable | Success Metric |
|---|---|---|---|
| Jun 02–06 | Platform Eng | Enable trusted-access & set CentralOps as SSM delegated-admin. | Ops console shows green banner “Organization setup complete”. |
| Jun 09–13 | NetSec | Deploy interface VPC endpoints (com.amazonaws.<region>.ssm*, ec2messages, ssmmessages) to all VPCs via AWS Firewall Manager. | Diagnose & Remediate report shows 0 network errors. |
| Jun 16–20 | IAM Team | Create SCP: DenyUpdateAssumeRolePolicy on path /aws-service-role/ssm.amazonaws.com/ (except via pipeline role). | GuardDuty & CloudTrail have no role-policy edits outside pipeline. |
| Jun 23–27 | SRE Guild | Write three golden runbooks with the low-code builder: Rollback-AMI, Quarantine-Instance, Patch-Kernel-Zero-Day. | Runbooks pass Amazon Q lint; approval workflow gate via Service Catalog. |
| Jul 01 ☀️ | Compliance | Turn on SSM patch policies (critical ≤ 7 days) & baseline exceptions list. | Patch compliance widget ≥ 95% across all prod accounts. |
| Jul 07–11 | App Teams | Migrate remaining SSH bastion workflows to Session Manager + port-forward plug-in. | Bastion SG inbound rules = 0. |
| Jul 14–18 | FinOps | Tag CentralOps automation executions; export OpsCost Athena view. | Showback report links 80% of SSM automation cost to cost-centre tag. |
| Jul 21–25 | Security Ops | Configure EventBridge → Chatbot alerts for UpdateAssumeRolePolicy & failed automation steps. | Mean time-to-ack ≤ 5 min. |
| Jul 28–31 | All | GameDay: simulate kernel CVE; validate runbooks, patch pipeline, rollback path. | Pass criteria: zero customer impact, < 2 h full fleet patch. |
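
The "patch compliance ≥ 95%" success metric in the plan above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the records are shaped like the `MissingCount` and `FailedCount` fields of `DescribeInstancePatchStates`, a node counts as compliant only when both are zero, and an empty fleet is treated as 100% compliant. How the console widget computes its figure may differ; the function names are hypothetical.

```python
from typing import Iterable, Mapping


def patch_compliance_percent(patch_states: Iterable[Mapping[str, int]]) -> float:
    """Percentage of nodes with no missing and no failed patches."""
    states = list(patch_states)
    if not states:
        return 100.0  # assumption: an empty fleet is vacuously compliant
    compliant = sum(
        1
        for s in states
        if s.get("MissingCount", 0) == 0 and s.get("FailedCount", 0) == 0
    )
    return round(100.0 * compliant / len(states), 1)


def meets_target(patch_states: Iterable[Mapping[str, int]], target_percent: float = 95.0) -> bool:
    """True when fleet compliance reaches the target from the action plan."""
    return patch_compliance_percent(patch_states) >= target_percent


if __name__ == "__main__":
    fleet = [{"MissingCount": 0, "FailedCount": 0}] * 19 + [{"MissingCount": 2, "FailedCount": 0}]
    print(patch_compliance_percent(fleet), meets_target(fleet))
```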

📅 To-Do Action Items (June–July)

| # | Action | Owner | Deadline |
|---|---|---|---|
| 1 | Register Delegated Admin account in AWS Organizations | Cloud Team | 15 June |
| 2 | Enable SSM next-gen experience across all production regions/accounts | DevOps Team | 20 June |
| 3 | Deploy Inventory Data Sync to centralized S3 bucket | DevOps Team | 22 June |
| 4 | Perform node discovery and automated remediation (Diagnose & Remediate) | CloudOps | 30 June |
| 5 | Create and test automation runbooks for Windows Server upgrade (QA → Prod) | Infrastructure Team | 10 July |
| 6 | Integrate AWS Systems Manager with AWS Chatbot (Teams) for operational alerts | DevSecOps Team | 15 July |
| 7 | Document new operational runbooks in GitOps workflows | DevOps Team | 25 July |
| 8 | Conduct organization-wide operational training workshop on next-gen SSM | Cloud Team | 30 July |
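
For the alerting items (the EventBridge → Chatbot rule for UpdateAssumeRolePolicy in the plan, and item 6 above), the rule needs an event pattern. The sketch below builds one for IAM API calls recorded by CloudTrail. It is an illustrative starting point: the helper name is hypothetical, and failed automation steps would need a separate pattern that is not shown here.

```python
import json
from typing import Any, Dict, Iterable


def iam_api_call_pattern(event_names: Iterable[str]) -> Dict[str, Any]:
    """Build an EventBridge event pattern matching the given IAM API calls
    as recorded by CloudTrail."""
    return {
        "source": ["aws.iam"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["iam.amazonaws.com"],
            "eventName": sorted(set(event_names)),
        },
    }


if __name__ == "__main__":
    pattern = iam_api_call_pattern(["UpdateAssumeRolePolicy"])
    # The JSON string is what you would supply as the rule's event pattern.
    print(json.dumps(pattern, indent=2))
```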

Demo Script 🎥

Target audience: Ops Leads.

Goal: Show end-to-end flow: enable org-wide view → detect unmanaged node → remediate → verify compliance.

| Timeline | Narration | Live Clicks (or pre-recorded) |
|---|---|---|
| 00:00 | “We’re logged into the CentralOps delegated-admin account. Notice the orange banner prompting us to Get Started with the new Systems Manager experience.” | Browser on https://console.aws.amazon.com/systems-manager |
| 00:15 | “One click, pick All Regions, hit Enable. StackSets fire behind the scenes; no manual IAM/VPC work.” | Click Enable. Show progress modal. |
| 00:45 | “After ~90 seconds the dashboard lights up. Top-left tile shows 148 Managed / 17 Unmanaged nodes.” | Refresh dashboard. |
| 01:00 | “Let’s drill into unmanaged nodes and run the built-in Diagnose & Remediate.” | Click Unmanaged ➜ Diagnose. |
| 01:20 | “SSM auto-detects missing VPC interface endpoints and proposes a runbook. We’ll accept defaults. Note the rate control set to 5 nodes at a time.” | Show runbook parameters, press Execute. |
| 01:50 | “Execution logs stream in real-time; CloudTrail is capturing the API calls.” | Open Automation Executions tab. |
| 02:10 | “Refresh: unmanaged count is 0. Every node now reports an agent heartbeat.” | Back to dashboard. |
| 02:25 | “Next, filter Windows 2016 boxes. We plan to uplift them before end-of-support.” | Use Operating System filter. |
| 02:40 | “Exporting the CSV gives application owners a heads-up.” | Click Download CSV. |
| 02:50 | “Now we upgrade just 10% to start. I select the nodes and invoke the AWS-provided AWSEC2-CloneInstanceAndUpgradeWindows runbook.” | Bulk select ➜ Automation. |
| 03:20 | “Notice the visual graph: snapshot, detach, upgrade, reboot, verify. Rollback path is baked in.” | Scroll runbook designer. |
| 03:40 | “Fast-forward: execution succeeded. Patch compliance tile turned green; CIS dashboard in Security Hub updates automatically.” | Show compliance tile. |
| 04:00 | “Finally, zero bastions: I open Session Manager directly from the node list, no inbound ports required.” | Click Connect. |
| 04:30 | “All actions are logged to S3 & CloudWatch; auditors can replay every keystroke.” | Show CloudTrail event & session log. |
| 04:45 | “In under five minutes we: 1) onboarded org-wide, 2) healed connectivity, 3) upgraded legacy OS, 4) proved compliance, 5) killed bastions.” | Recap slide. |
| 05:00 | “Questions?” | End demo. |

CLI / IaC Snippets 📄

## Enable SSM trusted access & set delegated admin
## Run from the Organizations management account; the account ID is a placeholder.
CENTRAL_OPS="111122223333"

aws organizations enable-aws-service-access \
  --service-principal ssm.amazonaws.com

aws organizations register-delegated-administrator \
  --account-id "$CENTRAL_OPS" \
  --service-principal ssm.amazonaws.com

## Terraform: VPC Interface Endpoints (excerpt)
module "ssm_endpoints" {
  source = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  vpc_id = module.vpc.vpc_id

  # Interface endpoints attach to subnets; assumes module.vpc exposes private_subnets.
  subnet_ids = module.vpc.private_subnets

  endpoints = {
    ssm         = { service = "ssm",         private_dns_enabled = true }
    ssmmessages = { service = "ssmmessages", private_dns_enabled = true }
    ec2messages = { service = "ec2messages", private_dns_enabled = true }
  }
}
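
As a companion to the Terraform excerpt, a quick check for missing endpoints can be scripted. This is a minimal sketch, assuming you already have the list of endpoint service names present in a VPC (for example from your own inventory export). The helper names are hypothetical, and the three services are the ones named in this post.

```python
from typing import Iterable, List

# Interface endpoint services this post lists for Systems Manager connectivity.
SSM_ENDPOINT_SUFFIXES = ("ssm", "ssmmessages", "ec2messages")


def required_endpoint_services(region: str) -> List[str]:
    """Full service names for the given region, e.g. com.amazonaws.eu-west-1.ssm."""
    return [f"com.amazonaws.{region}.{suffix}" for suffix in SSM_ENDPOINT_SUFFIXES]


def missing_endpoint_services(region: str, existing_service_names: Iterable[str]) -> List[str]:
    """Return required services that are absent from the VPC's existing endpoints."""
    existing = set(existing_service_names)
    return [name for name in required_endpoint_services(region) if name not in existing]


if __name__ == "__main__":
    present = ["com.amazonaws.eu-west-1.ssm", "com.amazonaws.eu-west-1.s3"]
    print(missing_endpoint_services("eu-west-1", present))
```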