Summary of "AWS re:Invent 2025 - AI-powered resilience testing and disaster recovery (COP420)"

High-level summary

Topic: Using generative AI together with AWS Fault Injection Service (FIS) to automate and speed up resilience testing and disaster-recovery validation.

Goal: Automatically discover failure scenarios, convert past incidents (root cause analyses, RCAs) into reproducible tests, and run safe, controlled chaos experiments so teams can validate mitigations in days rather than weeks. The presentation claimed a reduction in experiment time of up to about 90%.


Key technological concepts and products


Demo and practical workflow

  1. An inventory agent runs on an online EC2 instance and discovers IIS, ODBC drivers, database dependencies, and Auto Scaling and health-check configuration.
  2. The agent generates failure hypotheses (e.g., blocking the database port via a security group, impairing IIS or an app pool, injecting latency).
  3. A document generator builds an SSM Automation document (PowerShell for Windows), including:
    • Preconditions
    • Safety checks
    • Cleanup (state restoration, idempotency)
  4. An FIS experiment template ties native FIS actions and the SSM Automation document together; the experiment is then started from the console, API, or CLI.
  5. Teams validate the produced documents and iterate.
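The summary does not reproduce the documents generated in the demo, so the following is only a minimal sketch of what steps 3 and 4 might produce, written as plain Python dictionaries. The app pool name, tag, alarm, ARNs, instance ID, and step names are all hypothetical placeholders. The fault shown (stopping an IIS app pool) is one of the hypotheses from step 2, chosen here for brevity. The exact value shape that FIS expects inside `documentParameters` is an assumption and should be checked against the FIS documentation.

```python
import json


def powershell_step(name, command, **step_options):
    """One Automation step that runs a PowerShell command through Run Command."""
    return {
        "name": name,
        "action": "aws:runCommand",
        "inputs": {
            "DocumentName": "AWS-RunPowerShellScript",
            "InstanceIds": ["{{ InstanceId }}"],
            "Parameters": {"commands": [command]},
        },
        **step_options,
    }


def build_automation_document(app_pool):
    """Runbook skeleton: precondition -> safety check -> inject -> hold -> cleanup."""
    inject = f"Import-Module WebAdministration; Stop-WebAppPool -Name '{app_pool}'"
    # Idempotent cleanup: only start the pool if it is not already started.
    restore = (
        "Import-Module WebAdministration; "
        f"if ((Get-WebAppPoolState -Name '{app_pool}').Value -ne 'Started') "
        f"{{ Start-WebAppPool -Name '{app_pool}' }}"
    )
    return {
        "schemaVersion": "0.3",
        "description": f"Impair IIS app pool {app_pool} and restore it (illustrative)",
        "assumeRole": "{{ AutomationAssumeRole }}",
        "parameters": {
            "InstanceId": {"type": "String"},
            "AlarmName": {"type": "String"},
            "AutomationAssumeRole": {"type": "String"},
        },
        "mainSteps": [
            {   # Precondition: the target instance must be running.
                "name": "AssertInstanceRunning",
                "action": "aws:assertAwsResourceProperty",
                "inputs": {
                    "Service": "ec2",
                    "Api": "DescribeInstanceStatus",
                    "InstanceIds": ["{{ InstanceId }}"],
                    "PropertySelector": "$.InstanceStatuses[0].InstanceState.Name",
                    "DesiredValues": ["running"],
                },
            },
            {   # Safety check: do not inject if the health alarm is not OK.
                "name": "AssertAlarmOk",
                "action": "aws:assertAwsResourceProperty",
                "inputs": {
                    "Service": "cloudwatch",
                    "Api": "DescribeAlarms",
                    "AlarmNames": ["{{ AlarmName }}"],
                    "PropertySelector": "$.MetricAlarms[0].StateValue",
                    "DesiredValues": ["OK"],
                },
            },
            # If injection fails partway, jump straight to state restoration.
            powershell_step("InjectFault", inject, onFailure="step:Cleanup"),
            {"name": "HoldFault", "action": "aws:sleep", "inputs": {"Duration": "PT2M"}},
            powershell_step("Cleanup", restore, isEnd=True),
        ],
    }


def build_experiment_template(document_arn, role_arn, alarm_arn, document_parameters):
    """FIS template combining the custom runbook with one native FIS action."""
    return {
        "description": "App pool impairment followed by an instance stop (illustrative)",
        "roleArn": role_arn,
        # Guardrail: FIS stops the experiment if this alarm goes into ALARM.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "WebInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"App": "demo-web"},  # hypothetical tag
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "ImpairAppPool": {
                "actionId": "aws:ssm:start-automation-execution",
                "parameters": {
                    "documentArn": document_arn,
                    # Assumed shape: a JSON string of runbook parameters.
                    "documentParameters": json.dumps(document_parameters),
                    "maxDuration": "PT10M",
                },
            },
            "StopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "WebInstances"},
                "startAfter": ["ImpairAppPool"],
            },
        },
    }
```

With boto3 installed and credentials configured, the runbook could be registered with `boto3.client("ssm").create_document(Content=json.dumps(doc), Name=..., DocumentType="Automation", DocumentFormat="JSON")`, and the template passed to `boto3.client("fis").create_experiment_template(**template)` followed by `start_experiment(experimentTemplateId=...)`. Keeping both artifacts as plain data before any API call makes them easy to review, which matches step 5, where teams validate the generated documents before running them.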

Prompt engineering and agent guardrails


Best practices for SSM documents and experiments


Benefits and impact

“Everything fails all the time.” (Dr. Werner Vogels, cited in the presentation as a resilience principle)


Guides, resources, and next steps


Speakers / main sources


