Summary of "AWS re:Invent 2025 - AI-powered resilience testing and disaster recovery (COP420)"

High-level summary

Topic: Using generative AI plus AWS Fault Injection Service (FIS) to automate and speed resilience testing and disaster‑recovery validation.

Goal: Automatically discover failure scenarios, convert past incidents (RCAs) into reproducible tests, and run safe, controlled chaos experiments so teams can validate mitigations in days rather than weeks. The presenters claimed a reduction in experiment time of up to roughly 90%.

Key technological concepts and products

AWS Fault Injection Service (FIS)
- Orchestrates resilience experiments.
- Provides native actions (CPU/memory stress, latency, power interruption) and reusable templates.
AWS Systems Manager
- Inventory
  - Acts as a “blueprint reader” to discover installed software, services, network configuration, DB connections, and which EC2 instances are online/reporting.
- Automation Documents (SSM documents)
  - Used to implement non‑native impairments (e.g., stop IIS, drop files, manipulate OS state).
  - Emphasis on modular, idempotent documents and state restoration.
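The Inventory "blueprint reader" step above can be sketched as follows. This is a minimal illustration, not the presenters' code: it assumes an object that behaves like `boto3.client("ssm")` and uses the Systems Manager `describe_instance_information` and `list_inventory_entries` calls. The function name `discover_online_inventory` is hypothetical, and pagination is left out for brevity.

```python
from typing import Any


def discover_online_inventory(ssm: Any, type_name: str = "AWS:Application") -> dict[str, list[str]]:
    """Map each online managed instance to the names of its inventoried items.

    `ssm` is expected to behave like boto3.client("ssm"). It is passed in
    (rather than created here) so the function can be exercised with a stub.
    """
    info = ssm.describe_instance_information()
    # Only instances whose SSM agent is currently reporting can be inspected.
    online = [
        item["InstanceId"]
        for item in info.get("InstanceInformationList", [])
        if item.get("PingStatus") == "Online"
    ]

    inventory: dict[str, list[str]] = {}
    for instance_id in online:
        resp = ssm.list_inventory_entries(InstanceId=instance_id, TypeName=type_name)
        names = [entry["Name"] for entry in resp.get("Entries", []) if "Name" in entry]
        inventory[instance_id] = sorted(names)
    return inventory
```

Against a real account this would be called as `discover_online_inventory(boto3.client("ssm"))`; the resulting mapping is the kind of context an inventory analysis agent could reason over.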
Generative AI / agentic capabilities
- Bedrock-backed agents (e.g., inventory analysis agent, document generator agent) for discovery, hypothesis generation, and automated SSM document generation.
- AWS Strands framework (the Strands Agents SDK) for building and running agents: model selection, tools that expose AWS APIs and tags, and prompts plus callbacks. The presenters said only a few lines of code are needed.
- Proposed multi‑agent architecture:
  1. Hypothesis generator
  2. Prioritization agent
  3. Experiment designer (SSM doc agent)
  4. Experiment executor
  5. Monitoring / iteration agents
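The five-stage architecture above can be sketched in plain Python to show how the stages hand work to each other. This is an assumption-laden stand-in: every function here is a deterministic placeholder for what would be an LLM-backed agent, and the names (`Hypothesis`, `RISK_WEIGHTS`, `run_pipeline`) are hypothetical rather than taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical risk weights; a real prioritization agent would reason over
# incident history and architecture context instead of a lookup table.
RISK_WEIGHTS = {"IIS": 8, "ODBC Driver": 6}


@dataclass
class Hypothesis:
    instance_id: str
    component: str
    description: str
    risk: int
    document: str = ""             # filled in by the experiment designer
    passed: Optional[bool] = None  # filled in by the experiment executor


def generate_hypotheses(inventory: dict[str, list[str]]) -> list[Hypothesis]:
    """Stage 1: one naive hypothesis per discovered component."""
    return [
        Hypothesis(iid, comp, f"{comp} on {iid} becomes unavailable", RISK_WEIGHTS.get(comp, 1))
        for iid, comps in inventory.items()
        for comp in comps
    ]


def prioritize(hypotheses: list[Hypothesis], limit: int) -> list[Hypothesis]:
    """Stage 2: keep only the highest-risk hypotheses."""
    return sorted(hypotheses, key=lambda h: h.risk, reverse=True)[:limit]


def design_document(h: Hypothesis) -> str:
    """Stage 3: placeholder for SSM document generation."""
    return f"ssm-document-draft: impair {h.component} on {h.instance_id}"


def run_pipeline(
    inventory: dict[str, list[str]],
    execute: Callable[[Hypothesis], bool],
    limit: int = 2,
) -> list[Hypothesis]:
    selected = prioritize(generate_hypotheses(inventory), limit)
    for h in selected:
        h.document = design_document(h)
        h.passed = execute(h)  # Stage 4: True if steady state held
    # Stage 5: hypotheses that failed become findings for the next iteration.
    return [h for h in selected if not h.passed]
```

Passing `execute` in as a callable keeps the executor swappable, so the same flow can run as a dry run first and against FIS later.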
Observability & operations
- Tie experiment results into monitoring and incident tooling (AWS DevOps Agent and CloudWatch investigations were referenced).
- Aim to reduce MTTR and improve operational response.
Resilience lifecycle
- Phases: set objectives → design/implement → evaluate/test → operate → learn/respond.
- AI can assist across the design, testing, and learning phases.

Demo and practical workflow

Inventory agent runs against an online EC2 instance and discovers IIS, ODBC drivers, DB dependencies, and Auto Scaling and health check information.
Agent generates failure hypotheses (e.g., block DB port via security group, impair IIS/app pool, create latency).
Document generator builds an SSM Automation document (PowerShell for Windows), including:
- Preconditions
- Safety checks
- Cleanup (state restoration, idempotency)
FIS experiment template ties native FIS actions and the SSM Automation document together; the experiment is then started via console, API, or CLI.
Teams validate the produced documents and iterate.
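As a rough illustration of the last steps of this workflow, the sketch below builds an FIS experiment template that runs a custom SSM Automation document and then a native FIS action. It is not the template from the demo. The action IDs `aws:ssm:start-automation-execution` and `aws:ec2:stop-instances` are real FIS actions; the function name, the action and target labels, the `ChaosReady` tag, and the durations are assumptions made for the example.

```python
import json


def build_experiment_template(
    role_arn: str,
    alarm_arn: str,
    document_arn: str,
    instance_id: str,
    tag_key: str = "ChaosReady",  # hypothetical opt-in tag
    tag_value: str = "true",
) -> dict:
    """Return keyword arguments for the FIS CreateExperimentTemplate API."""
    return {
        "description": "Impair IIS via SSM Automation, then stop the instance",
        "roleArn": role_arn,
        # Halt the experiment automatically if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "TaggedInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit blast radius to one instance
            }
        },
        "actions": {
            # Custom, non-native impairment implemented as an Automation document.
            "ImpairIis": {
                "actionId": "aws:ssm:start-automation-execution",
                "parameters": {
                    "documentArn": document_arn,
                    "documentParameters": json.dumps({"InstanceId": instance_id}),
                    "maxDuration": "PT10M",
                },
            },
            # Native FIS action, sequenced after the custom impairment.
            "StopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "TaggedInstances"},
                "startAfter": ["ImpairIis"],
            },
        },
    }
```

With credentials in place, the dictionary could be passed to `boto3.client("fis").create_experiment_template(**template)` and the returned template ID to `start_experiment`; reviewing the generated template before that step matches the validate-and-iterate loop described above.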

Prompt engineering and agent guardrails

Prompts function like detailed job descriptions:
- Define agent persona, allowed tools/actions, what to ignore, and output format.
Explicit “do not” constraints:
- Do not impair protected services (e.g., Systems Manager agent).
- Do not break management connectivity.
- Never modify critical production infrastructure beyond the permitted scope.
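The "job description" structure and the explicit constraints above could be assembled into a system prompt along these lines. This is only a sketch: the section labels, the helper name `build_system_prompt`, and the tool names are hypothetical, while the three constraints are the ones listed above.

```python
def build_system_prompt(
    persona: str,
    allowed_tools: list[str],
    ignore: list[str],
    do_not: list[str],
    output_format: str,
) -> str:
    """Assemble persona, tools, exclusions, constraints, and output format."""

    def bullets(items: list[str]) -> list[str]:
        return [f"- {item}" for item in items]

    sections = [
        f"You are {persona}.",
        "Allowed tools:", *bullets(allowed_tools),
        "Ignore:", *bullets(ignore),
        "Hard constraints:", *bullets(do_not),
        f"Output format: {output_format}",
    ]
    return "\n".join(sections)


EXAMPLE_PROMPT = build_system_prompt(
    persona="an SSM document generator for resilience experiments",
    allowed_tools=["read_inventory", "write_ssm_document"],  # hypothetical tool names
    ignore=["software unrelated to the target application"],
    do_not=[
        "Do not impair protected services (e.g., Systems Manager agent).",
        "Do not break management connectivity.",
        "Never modify critical production infrastructure beyond the permitted scope.",
    ],
    output_format="a single SSM Automation document in YAML",
)
```

Keeping the constraints as data rather than free text makes it easier to reuse the same guardrails across several agents and to review them separately from the persona.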
Preconditions to validate before running experiments:
- Instance is online.
- Target service is running.
- Sufficient disk space.
- Required modules are installed.
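The four preconditions above amount to a gate that must pass before any impairment runs. A minimal sketch of such a gate follows, assuming the facts about the instance have already been collected into a snapshot. The `InstanceSnapshot` fields and the 5 GB default threshold are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSnapshot:
    """Hypothetical point-in-time facts gathered about a target instance."""
    ping_status: str           # e.g. "Online", as reported by Systems Manager
    service_status: str        # e.g. "Running" for the target service
    free_disk_gb: float
    installed_modules: frozenset


def failed_preconditions(
    snapshot: InstanceSnapshot,
    required_modules: set,
    min_free_disk_gb: float = 5.0,  # assumed threshold
) -> list[str]:
    """Return human-readable reasons not to run; an empty list means proceed."""
    failures = []
    if snapshot.ping_status != "Online":
        failures.append("instance is not online")
    if snapshot.service_status != "Running":
        failures.append("target service is not running")
    if snapshot.free_disk_gb < min_free_disk_gb:
        failures.append("insufficient disk space")
    missing = sorted(required_modules - snapshot.installed_modules)
    if missing:
        failures.append("missing modules: " + ", ".join(missing))
    return failures
```

Returning every failed check rather than stopping at the first gives the reviewer the full picture in one pass, which also fits the emphasis on logging.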
Emphasis on safety:
- Isolation boundaries and limiting targets.
- Rollback and restoration steps.
- Logging and explicit preconditions.

Best practices for SSM documents and experiments

Make SSM documents modular and idempotent; include preconditions and restoration flows.
Use native FIS actions where available; author custom SSM documents only for OS- or application‑level actions not covered by FIS.
Validate automation in lower-risk, production-like environments before running full experiments.
Keep experiments focused and controlled — practice chaos engineering with purpose rather than random destruction.
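To make the precondition and restoration guidance concrete, here is a sketch of a small SSM Automation document built as a Python dictionary. It assumes a Windows target running IIS (service name `W3SVC`) and uses the standard `aws:runCommand` action with `AWS-RunPowerShellScript`. The step names, the 60-second default, and the helper function are illustrative choices, and an `assumeRole` entry is omitted for brevity.

```python
import json


def build_iis_impairment_document(duration_seconds: int = 60) -> dict:
    """Automation document: check precondition, stop IIS, always restore."""

    def run_powershell(name: str, commands: list, **step_options) -> dict:
        return {
            "name": name,
            "action": "aws:runCommand",
            "inputs": {
                "DocumentName": "AWS-RunPowerShellScript",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {"commands": commands},
            },
            **step_options,
        }

    return {
        "schemaVersion": "0.3",
        "description": "Stop IIS for a fixed period, then restore it",
        "parameters": {
            "InstanceId": {"type": "String", "description": "Target EC2 instance"}
        },
        "mainSteps": [
            # Precondition: abort before touching anything if IIS is not running.
            run_powershell(
                "CheckPreconditions",
                ["if ((Get-Service W3SVC).Status -ne 'Running') { throw 'IIS is not running' }"],
                onFailure="Abort",
            ),
            # Impairment: on success or failure, control moves to the restore step.
            run_powershell(
                "StopIis",
                ["Stop-Service W3SVC -Force", f"Start-Sleep -Seconds {duration_seconds}"],
                onFailure="step:RestoreIis",
                nextStep="RestoreIis",
            ),
            # Restoration is idempotent: it starts IIS only if it is not running.
            run_powershell(
                "RestoreIis",
                ["if ((Get-Service W3SVC).Status -ne 'Running') { Start-Service W3SVC }"],
                isEnd=True,
            ),
        ],
    }
```

The output of `json.dumps(build_iis_impairment_document())` could be registered with `boto3.client("ssm").create_document(..., DocumentType="Automation", DocumentFormat="JSON")`. Routing `onFailure` to the restore step is one way to keep the restoration flow from being skipped when the impairment itself errors.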

Benefits and impact

Large time savings in discovery, test generation, and document creation — engineers shift from “writing plumbing” to validating and iterating.
Faster discovery of unknown dependencies and single points of failure.
Enables turning RCAs into automated tests to verify mitigations and disaster‑recovery playbooks.
Supports human–AI collaboration: AI accelerates draft tests; humans validate and refine.

“Everything fails all the time.” — Dr. Werner Vogels (quoted as a resiliency principle referenced in the presentation)

Guides, resources, and next steps

Resilience lifecycle framework (released Oct 2023) — north star for resilient application design.
FIS native actions and template library + best‑practices blog (used as context for agent prompts).
AWS Systems Manager Inventory and Automation documents guidance.
AWS Strands framework for building agents.
Reusable multi-agent chaos engineering code shown in the demo, with the caveat "test at your own risk; validate and adapt."
AWS Resilience Analysis Framework and AWS Fault Isolation Boundaries guidance.
Onsite resources: re:Invent booth (demos, Q&A).