RISE 2026 Cloud Resilience Benchmark

RISE 2026 is a benchmark for teams that design, operate, and govern resilient cloud and hybrid infrastructure. This edition keeps the original benchmark format while updating the guidance for current operating realities: identity-first security, immutable recovery patterns, deeper observability, policy-driven operations, automated remediation, and continuous resilience testing.

The benchmark is intended to be used together with the control table and the DORA mapping. The approach remains benchmark-first, with DORA as the primary regulatory reference: the controls describe what resilient operations should look like in practice, and the mapping shows how those controls align to the Digital Operational Resilience Act.

Control structure

Each control in the benchmark follows the same structure so that teams can assess maturity consistently across backup, network, security, platform, and operational disciplines.

  1. Control Title: A short statement describing the expected resilience practice.
  2. Control Description: A concise explanation of why the control matters and which resilience outcome it supports.
  3. Control Implementation: A practical set of actions that teams can use to establish the control.
  4. Control Maturity Levels: A five-level scale that shows the progression from ad hoc adoption to optimized, measured, and continuously improved operation.
  5. Control Recommendations: Additional suggestions that help teams strengthen implementation quality and long-term sustainability.

The maturity levels are interpreted consistently throughout the benchmark:

  • Level 1: Initial/Ad Hoc: The control exists in isolated or informal ways, with inconsistent ownership and limited repeatability.
  • Level 2: Defined: The control is documented, scoped, and assigned to named owners, but execution is still uneven.
  • Level 3: Managed: The control is implemented consistently, reviewed regularly, and supported by operational processes.
  • Level 4: Measurable: Teams track meaningful metrics, exceptions, and outcomes to understand whether the control is working.
  • Level 5: Optimized: The control is continuously improved through automation, testing, feedback loops, and lessons learned.

Governance and Risk Management

Enterprise resilience depends on explicit governance, decision ownership, and clear tolerance for operational risk. In 2026, resilient organizations treat resilience as a managed capability with accountable leaders, measurable objectives, and structured exception handling rather than as a purely technical concern.

Define a formal resilience governance model with clear accountability

Control Description

This control establishes who owns resilience decisions, how they are reviewed, and how responsibilities are shared across business, engineering, security, operations, and leadership. Clear governance prevents critical resilience work from becoming fragmented or optional.

Control Implementation

  1. Define governance forums, decision owners, and escalation paths for resilience matters.
  2. Assign accountable owners for major resilience domains, including infrastructure, identity, third-party risk, and continuity planning.
  3. Document how resilience decisions are approved, reviewed, and communicated.
  4. Review governance roles after major organizational, platform, or regulatory change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Resilience responsibilities are informal and depend on individual initiative.
  • Level 2: Defined: Governance roles are documented, but decision-making remains uneven.
  • Level 3: Managed: Resilience accountability is assigned clearly and used consistently.
  • Level 4: Measurable: Teams track governance coverage, decision timeliness, and unresolved ownership gaps.
  • Level 5: Optimized: Governance is continuously improved based on incidents, audits, and changing business needs.

Control Recommendations

  1. Keep accountability specific enough that each critical resilience area has a named owner.
  2. Include both technical and business stakeholders in governance decisions for high-impact services.
  3. Use governance records as evidence for risk, audit, and regulatory review.

Perform regular resilience risk assessments and maintain a risk register

Control Description

Resilience risk assessments help organizations identify material weaknesses before they trigger disruption. A durable risk register supports prioritization, ownership, and follow-through across technical and business teams.

Control Implementation

  1. Define a methodology for identifying resilience risks, vulnerabilities, dependencies, and potential impact.
  2. Assess critical services, platforms, and suppliers on a regular cadence and after major change.
  3. Record identified risks, owners, treatment plans, and residual risk decisions in a maintained register.
  4. Review open risks regularly in governance forums and escalate stalled items.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Risks are recorded inconsistently and mainly after incidents.
  • Level 2: Defined: Risk assessment methods exist, but coverage and updates are incomplete.
  • Level 3: Managed: Resilience risk assessments are performed routinely and captured in a maintained register.
  • Level 4: Measurable: Teams track risk age, treatment progress, and recurring risk themes.
  • Level 5: Optimized: Risk assessment is integrated with architecture change, continuity planning, and investment decisions.

Control Recommendations

  1. Include third-party dependencies, concentration risks, and people/process dependencies in assessments.
  2. Separate accepted risk from untreated risk so status remains clear.
  3. Reuse risk data in testing plans and continuity prioritization.

Define resilience objectives and tolerances for critical services

Control Description

Organizations need explicit resilience expectations for critical services in order to design, operate, and improve them consistently. Service-level tolerances translate abstract resilience goals into actionable targets.

Control Implementation

  1. Identify critical services and the business processes that depend on them.
  2. Define resilience objectives such as recovery priorities, outage tolerance, integrity expectations, and dependency assumptions.
  3. Align architectural, operational, and continuity decisions to those objectives.
  4. Review objectives when business criticality, architecture, or external obligations change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Resilience expectations are implicit or vary by team.
  • Level 2: Defined: Objectives are documented for some critical services, but not used consistently.
  • Level 3: Managed: Critical services have clear resilience targets that guide operational practice.
  • Level 4: Measurable: Teams track whether service operation and testing align with defined tolerances.
  • Level 5: Optimized: Objectives are reviewed continuously and refined using incidents, testing, and business feedback.

Control Recommendations

  1. Define objectives at the service level, not only at the infrastructure level.
  2. Make tradeoffs explicit where cost, speed, and resilience compete.
  3. Use the same objectives across backup, continuity, and testing activities.

Establish exception management and compensating controls for unmet requirements

Control Description

Not every resilience requirement can be met immediately. This control ensures that deviations are visible, time-bound, and protected by compensating measures rather than quietly becoming the operating norm.

Control Implementation

  1. Define a formal process for requesting, approving, and expiring resilience exceptions.
  2. Require business justification, owner assignment, impact assessment, and target remediation dates.
  3. Identify compensating controls that reduce risk while the exception remains open.
  4. Review and renew or close exceptions on a defined cadence.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Exceptions exist informally and are rarely revisited.
  • Level 2: Defined: Exception handling is documented, but evidence and expiry discipline vary.
  • Level 3: Managed: Exceptions are approved formally, tracked, and linked to compensating controls.
  • Level 4: Measurable: Teams track exception age, recurrence, and unresolved high-risk deviations.
  • Level 5: Optimized: Exception management drives systematic remediation and prevents long-term policy drift.

Control Recommendations

  1. Require visible expiry dates for high-impact exceptions.
  2. Avoid blanket exceptions that mask multiple distinct risks.
  3. Review whether repeated exceptions indicate a benchmark or platform gap that should be fixed centrally.
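
To make the expiry discipline concrete, the sketch below shows one minimal way to keep exceptions time-bound and reviewable in code. It assumes Python and a hypothetical data model; the field names and example values are illustrative, not part of the benchmark.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ResilienceException:
    # All field names are illustrative, not prescribed by the benchmark.
    control_id: str
    owner: str
    justification: str
    compensating_controls: list[str]
    expires: date

def overdue(exceptions: list[ResilienceException], today: date) -> list[ResilienceException]:
    """Return exceptions past their expiry date so they can be renewed or closed."""
    return [e for e in exceptions if e.expires < today]

# Example: flag an expired exception during a governance review.
register = [
    ResilienceException("BKP-02", "alice", "legacy system lacks immutability",
                        ["offline copy", "extra monitoring"], date(2026, 1, 31)),
]
for e in overdue(register, date(2026, 3, 1)):
    print(f"EXPIRED: {e.control_id} owned by {e.owner}, renew or close")
```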

Report resilience posture and remediation progress to leadership regularly

Control Description

Leadership reporting keeps resilience visible as a business priority and supports informed decisions about risk, investment, and remediation sequencing.

Control Implementation

  1. Define a regular reporting cadence for resilience posture, material risks, and remediation progress.
  2. Include critical service status, testing outcomes, open exceptions, and major dependency issues in reporting.
  3. Present decision points and tradeoffs that require leadership action or budget.
  4. Track whether reported actions are completed and whether exposure is reduced over time.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Leadership receives resilience updates only after major incidents.
  • Level 2: Defined: Reporting expectations exist, but information is inconsistent or incomplete.
  • Level 3: Managed: Leadership receives regular, decision-oriented resilience reporting.
  • Level 4: Measurable: Teams track reporting completeness, action closure, and issue recurrence.
  • Level 5: Optimized: Reporting is concise, evidence-based, and directly informs resilience strategy and funding.

Control Recommendations

  1. Focus leadership reporting on risk posture and required decisions rather than raw operational detail.
  2. Distinguish current exposure from planned improvement.
  3. Use trend views so leaders can see whether resilience is improving over time.

Third-Party and SaaS Resilience

Critical services increasingly depend on external providers, SaaS platforms, and shared ecosystems. A wider resilience benchmark must therefore account for supplier concentration, contractual recovery assumptions, and the practical ability to continue operations when a third party degrades or fails.

Maintain an inventory of critical third-party and SaaS dependencies

Control Description

Organizations cannot manage supplier resilience if they do not know which third parties are operationally critical. This control creates the dependency visibility needed for risk assessment, continuity planning, and response.

Control Implementation

  1. Record external providers, SaaS platforms, and managed services that support critical or important functions.
  2. Link each dependency to the services, processes, and data it affects.
  3. Capture ownership, service scope, contractual status, and operational importance for each dependency.
  4. Review the inventory after onboarding new vendors, adding new critical services, or retiring legacy systems.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Knowledge of third-party dependencies is fragmented and incomplete.
  • Level 2: Defined: An inventory exists, but coverage and ownership are uneven.
  • Level 3: Managed: Critical third-party and SaaS dependencies are inventoried and maintained consistently.
  • Level 4: Measurable: Teams track inventory completeness, ownership gaps, and stale entries.
  • Level 5: Optimized: Dependency inventory is continuously updated and integrated with risk, continuity, and incident processes.

Control Recommendations

  1. Track shared providers that support multiple business-critical services.
  2. Include both direct vendors and materially important subcontracted or embedded services where known.
  3. Link inventory records to continuity plans and incident contacts.

Assess concentration risk and exit readiness for critical providers

Control Description

Supplier resilience is weakened when too many critical services depend on one provider or when exit from that provider is impractical. Concentration and exit readiness should be assessed deliberately rather than discovered during crisis.

Control Implementation

  1. Identify providers whose failure, degradation, or contractual breakdown would affect multiple critical services.
  2. Assess concentration risk by provider, region, control plane, and operational dependency.
  3. Define feasible exit, migration, or fallback options for the most critical dependencies.
  4. Review concentration and exit assumptions after major sourcing or architecture changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Concentration risk is understood informally, with little exit planning.
  • Level 2: Defined: Assessment methods exist, but only some critical providers are reviewed.
  • Level 3: Managed: Critical providers have documented concentration and exit assessments.
  • Level 4: Measurable: Teams track concentration exposure, exit readiness, and unresolved provider lock-in risks.
  • Level 5: Optimized: Concentration management informs sourcing strategy, architecture, and continuity design.

Control Recommendations

  1. Evaluate management-plane concentration separately from workload concentration.
  2. Be explicit about where exit is strategic, partial, or currently unrealistic.
  3. Use test exercises to validate whether fallback options are usable in practice.

Define minimum resilience and security requirements for suppliers

Control Description

Suppliers that support critical services should meet baseline expectations for resilience, security, recovery, and incident cooperation. This control ensures those expectations are defined before disruption occurs.

Control Implementation

  1. Define minimum supplier requirements for availability, security, backup, recovery, notification, and evidence.
  2. Apply those requirements during vendor selection, onboarding, and contract review.
  3. Record justified exceptions where suppliers cannot meet the standard.
  4. Review requirements as internal resilience expectations and regulations evolve.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Supplier resilience expectations are inconsistent or undocumented.
  • Level 2: Defined: Minimum requirements are documented, but not applied uniformly.
  • Level 3: Managed: Critical suppliers are assessed against defined resilience and security requirements.
  • Level 4: Measurable: Teams track supplier compliance, exceptions, and remediation progress.
  • Level 5: Optimized: Supplier requirements are continuously improved using incidents, audits, and market learning.

Control Recommendations

  1. Include requirements for notification timing, crisis contacts, and testing support.
  2. Align supplier standards with internal recovery objectives for dependent services.
  3. Coordinate legal, procurement, security, and operations stakeholders when defining the baseline.

Monitor supplier performance, incidents, and contractual recovery commitments

Control Description

Third-party resilience cannot be managed as a one-time onboarding task. Ongoing monitoring helps organizations detect service deterioration, repeated incidents, or unsupported assumptions in provider commitments.

Control Implementation

  1. Track operational performance, incident history, and support responsiveness for critical suppliers.
  2. Review whether providers meet contractual availability, continuity, and recovery commitments.
  3. Escalate recurring or material supplier issues through governance and vendor-management channels.
  4. Update continuity plans and risk treatment when supplier performance changes materially.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Supplier issues are reviewed mainly when major failures occur.
  • Level 2: Defined: Monitoring expectations exist, but evidence is incomplete or inconsistent.
  • Level 3: Managed: Critical suppliers are monitored regularly against operational and contractual expectations.
  • Level 4: Measurable: Teams track incident frequency, performance trend, and unresolved supplier issues.
  • Level 5: Optimized: Supplier monitoring is integrated with resilience reporting, risk review, and sourcing decisions.

Control Recommendations

  1. Track provider incidents even when they do not yet create customer-visible impact.
  2. Compare contractual commitments with actual recovery performance during provider events.
  3. Ensure monitoring includes both technical and commercial ownership.

Test contingency plans for third-party and SaaS disruption

Control Description

Contingency plans for third-party failure are only useful if they have been exercised. Testing validates whether teams can actually continue operations or recover service when an external dependency is unavailable.

Control Implementation

  1. Define disruption scenarios for critical suppliers, SaaS platforms, and managed services.
  2. Exercise fallback workflows, alternate providers, manual workarounds, or degraded modes where applicable.
  3. Measure operational impact, decision quality, and execution speed during the exercise.
  4. Use findings to improve contingency plans, architecture, and contractual strategy.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Third-party contingency planning is largely theoretical.
  • Level 2: Defined: Plans exist, but testing is infrequent or narrow.
  • Level 3: Managed: Critical supplier disruption plans are exercised on a defined cadence.
  • Level 4: Measurable: Teams track test coverage, fallback effectiveness, and unresolved gaps.
  • Level 5: Optimized: Supplier contingency testing is realistic, repeatable, and integrated with continuity governance.

Control Recommendations

  1. Include scenarios where provider support is delayed or unavailable.
  2. Test communications, data export access, and emergency administrative actions.
  3. Update provider concentration assessments using the results of contingency exercises.

Data Backup and Recovery

Modern backup programs must protect more than databases alone. In 2026, resilient recovery means covering data, configuration, identities, secrets, and critical SaaS platforms while defending against ransomware, operator error, supplier disruption, and regional failures.

Establish a regular backup schedule for critical data

Control Description

This control ensures that critical data is backed up on a schedule aligned with business impact, recovery objectives, and change frequency. A reliable schedule reduces recovery uncertainty and limits the blast radius of data loss events.

Control Implementation

  1. Classify critical data sets, systems, and supporting configuration according to business impact.
  2. Define backup frequency, retention, and recovery point objectives for each class of data.
  3. Automate backup jobs and policy enforcement wherever possible.
  4. Review failed, skipped, or degraded backup runs and assign owners to remediation.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Backups happen irregularly and rely on individual effort.
  • Level 2: Defined: Backup schedules are documented for key systems, but coverage gaps remain.
  • Level 3: Managed: Backup schedules are automated, owned, and followed consistently.
  • Level 4: Measurable: Teams track backup success rates, policy drift, and recovery objective adherence.
  • Level 5: Optimized: Backup schedules adapt to business criticality, change patterns, and recovery test results.

Control Recommendations

  1. Cover infrastructure state, configuration repositories, secrets metadata, and critical SaaS exports in addition to core data stores.
  2. Align schedules with documented RPO and RTO targets rather than using the same cadence everywhere.
  3. Expose backup health in operational dashboards so failures are visible alongside production telemetry.
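
As an illustration of implementation steps 1 and 2, the following Python sketch maps hypothetical data-classification tiers to backup frequency, retention, and recovery point objectives, and checks that each schedule can actually satisfy its RPO. The tier names and values are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class BackupPolicy:
    # Illustrative policy attributes; tiers and durations are examples only.
    tier: str
    frequency: timedelta   # how often backups run
    retention: timedelta   # how long copies are kept
    rpo: timedelta         # maximum tolerable data loss

POLICIES = {
    "tier-1": BackupPolicy("tier-1", timedelta(hours=1), timedelta(days=90), timedelta(hours=1)),
    "tier-2": BackupPolicy("tier-2", timedelta(hours=6), timedelta(days=35), timedelta(hours=6)),
    "tier-3": BackupPolicy("tier-3", timedelta(days=1),  timedelta(days=14), timedelta(days=1)),
}

def meets_rpo(policy: BackupPolicy) -> bool:
    """A schedule can only satisfy its RPO if backups run at least that often."""
    return policy.frequency <= policy.rpo

assert all(meets_rpo(p) for p in POLICIES.values())
```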

Store backups in multiple locations (offsite and/or cloud-based storage)

Control Description

This control reduces the risk that a single provider, location, or compromised administrative domain can destroy all recoverable copies. Separate failure domains are essential for both cyber resilience and disaster recovery.

Control Implementation

  1. Store backups in at least two distinct locations or administrative domains.
  2. Ensure one copy is isolated from day-to-day production access patterns.
  3. Validate retention, replication, and deletion protections for each storage location.
  4. Review location diversity whenever infrastructure topology or provider strategy changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Most backups are stored in a single environment or trust boundary.
  • Level 2: Defined: Multiple storage locations are documented, but isolation and coverage are incomplete.
  • Level 3: Managed: Backups are routinely stored across separate failure domains with clear ownership.
  • Level 4: Measurable: Teams track replication timeliness, location coverage, and isolation exceptions.
  • Level 5: Optimized: Storage strategy includes immutable or offline patterns and evolves with threat modeling.

Control Recommendations

  1. Keep at least one backup copy protected by immutability, delayed deletion, or offline custody.
  2. Avoid concentrating all backup copies inside the same cloud account, subscription, or management plane.
  3. Periodically confirm that access controls for backup storage are stricter than those of the production environment.
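
One minimal way to check rules like these in code is sketched below, assuming a hypothetical inventory record per dataset. It flags datasets whose copies all share a single administrative domain or that lack an immutable copy.

```python
def storage_gaps(dataset: dict) -> list[str]:
    """Flag datasets whose backup copies violate the location-diversity
    expectations above. The dict shape is an assumption for this sketch."""
    copies = dataset["copies"]  # e.g. [{"domain": "prod-account", "immutable": False}, ...]
    gaps = []
    if len({c["domain"] for c in copies}) < 2:
        gaps.append("all copies share one administrative domain")
    if not any(c["immutable"] for c in copies):
        gaps.append("no immutable or delete-protected copy")
    return gaps

example = {"name": "orders-db",
           "copies": [{"domain": "prod-account", "immutable": False},
                      {"domain": "vault-account", "immutable": True}]}
print(storage_gaps(example))  # [] -> this example satisfies both checks
```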

Implement a versioning system to track and restore previous versions of data

Control Description

Versioning improves recovery precision by allowing teams to restore data from a known-good point in time. It is especially valuable when corruption, ransomware, or deployment errors are discovered after the initial event.

Control Implementation

  1. Enable versioning for backup systems, object stores, and data platforms that support point-in-time recovery.
  2. Define retention windows that balance operational recovery needs, legal obligations, and storage cost.
  3. Protect version history from unauthorized pruning or tampering.
  4. Include version selection in recovery runbooks and exercises.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Teams rely on the latest backup copy with limited historical recovery options.
  • Level 2: Defined: Versioning exists for some systems, but policies are inconsistent.
  • Level 3: Managed: Versioning is implemented according to a documented policy across critical systems.
  • Level 4: Measurable: Teams monitor retention compliance, restore accuracy, and version-coverage gaps.
  • Level 5: Optimized: Versioning strategy is refined using incident learning, legal requirements, and recovery drills.

Control Recommendations

  1. Prefer solutions that support point-in-time restore, immutable snapshots, and policy-based retention.
  2. Extend versioning to schemas, infrastructure definitions, and application configuration where practical.
  3. Test restoration from older recovery points, not only the most recent backup.
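
Restoring from a known-good point in time reduces to selecting the newest version strictly older than the corruption event. A minimal Python sketch, with illustrative timestamps:

```python
from datetime import datetime
from bisect import bisect_left

def restore_point(versions: list[datetime], corrupted_at: datetime) -> datetime | None:
    """Given sorted version timestamps, return the newest version strictly
    before the corruption was introduced. Names are illustrative."""
    i = bisect_left(versions, corrupted_at)
    return versions[i - 1] if i else None

snaps = [datetime(2026, 2, d) for d in (1, 2, 3, 4)]
print(restore_point(snaps, datetime(2026, 2, 3, 12)))  # 2026-02-03 00:00:00
```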

Encrypt backups to protect sensitive data

Control Description

Backup data often contains the most complete copy of sensitive information in the estate. Encrypting backups protects confidentiality and reduces the impact of storage compromise or media loss.

Control Implementation

  1. Define which backup classes require encryption based on sensitivity and regulatory requirements.
  2. Use strong encryption for stored backups and for any transfer between systems or locations.
  3. Protect encryption keys with dedicated access controls, rotation procedures, and recovery safeguards.
  4. Verify that restored data can be decrypted under normal and contingency conditions.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Some backups are encrypted, but coverage is incomplete and unmanaged.
  • Level 2: Defined: Encryption requirements and key-handling expectations are documented.
  • Level 3: Managed: Backup encryption is consistently enforced and key access is controlled.
  • Level 4: Measurable: Teams track encryption coverage, key-rotation compliance, and decryption test results.
  • Level 5: Optimized: Backup encryption practices are automated, audited, and integrated with broader data protection strategy.

Control Recommendations

  1. Use centrally governed key management rather than embedding unmanaged keys in backup tooling.
  2. Separate backup operators from key administrators where practical.
  3. Include key-loss and emergency-access scenarios in recovery testing.
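
A minimal encrypt-before-store sketch using the widely available cryptography package is shown below. It only demonstrates the mechanics of implementation steps 2 and 4; in a real deployment, keys would come from managed key infrastructure rather than being generated inline.

```python
# pip install cryptography
from cryptography.fernet import Fernet

# In practice the key would come from a managed KMS with rotation and
# recovery safeguards; generating it inline is purely for demonstration.
key = Fernet.generate_key()
f = Fernet(key)

backup_bytes = b"...serialized backup archive..."
ciphertext = f.encrypt(backup_bytes)  # what gets written to backup storage

# Restore-time verification: confirm the copy can actually be decrypted,
# mirroring implementation step 4 above.
assert f.decrypt(ciphertext) == backup_bytes
```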

Test backup and recovery processes periodically to ensure data integrity

Control Description

Backups only create resilience when recovery has been proven. Regular recovery exercises validate integrity, timing, dependencies, and the operational readiness of the teams that must execute the restore.

Control Implementation

  1. Establish a recovery-testing schedule based on system criticality and change frequency.
  2. Perform restore exercises that validate data completeness, application usability, and supporting dependencies.
  3. Record actual recovery times and compare them to stated objectives.
  4. Feed issues from tests into backlog, risk, and control-improvement processes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Recovery tests are rare or limited to basic file checks.
  • Level 2: Defined: Recovery testing is planned, but scope and evidence are inconsistent.
  • Level 3: Managed: Critical systems are tested on a regular cadence with documented outcomes.
  • Level 4: Measurable: Teams track recovery success, integrity defects, and RTO/RPO performance.
  • Level 5: Optimized: Recovery testing is scenario-based, repeatable, and continuously improved through automation and exercises.

Control Recommendations

  1. Test full-service recovery, not only raw data restore.
  2. Include partial corruption, ransomware, and cross-region recovery scenarios.
  3. Require remediation owners and deadlines for failed recovery objectives.
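
Implementation step 3 reduces to comparing measured recovery times against stated objectives. A minimal sketch, with illustrative service names and durations:

```python
from datetime import timedelta

def rto_report(results: dict[str, timedelta], objectives: dict[str, timedelta]) -> None:
    """Compare measured recovery times against stated RTOs (step 3 above).
    Service names and values are examples only."""
    for service, actual in results.items():
        target = objectives[service]
        status = "OK" if actual <= target else "MISSED"
        print(f"{service}: actual {actual} vs RTO {target} -> {status}")

rto_report({"payments": timedelta(hours=3)},
           {"payments": timedelta(hours=2)})  # payments: ... -> MISSED
```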

Network Redundancy and Failover

Resilient networks are designed to absorb provider, path, and device failures without prolonged service degradation. In 2026, that means combining path diversity, health-aware traffic management, and repeatable failover validation.

Implement redundant network connections to prevent single points of failure

Control Description

This control ensures that the loss of a single carrier, path, or connection does not isolate critical services. Redundant connectivity protects availability and supports graceful degradation during infrastructure incidents.

Control Implementation

  1. Identify network dependencies that can interrupt critical services if they fail.
  2. Provision redundant links, circuits, or cloud connectivity paths for those dependencies.
  3. Separate redundant paths by provider, device, or physical route where feasible.
  4. Document ownership, failover logic, and operational limits for each redundant path.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Critical connectivity relies on single paths or informal fallback arrangements.
  • Level 2: Defined: Redundant connections are planned for key systems, but not consistently deployed.
  • Level 3: Managed: Critical services use deliberate redundancy with documented ownership and support.
  • Level 4: Measurable: Teams monitor redundancy coverage, dependency concentration, and failover outcomes.
  • Level 5: Optimized: Network redundancy is continuously reviewed against growth, provider risk, and incident learning.

Control Recommendations

  1. Favor provider and route diversity over duplicating the same failure domain.
  2. Include DNS, load-balancer, and edge dependencies in redundancy analysis.
  3. Review concentration risk when multiple services share the same underlying carrier or cloud interconnect.

Use load balancers to distribute traffic evenly across resources

Control Description

Load balancing helps services remain responsive during node failures, demand spikes, and rolling changes. It also creates a controlled point for health-aware traffic shaping and failover behavior.

Control Implementation

  1. Deploy load balancers for internet-facing and internal services that require horizontal resilience.
  2. Define health checks, routing policies, and drain behavior for planned maintenance and failure handling.
  3. Distribute traffic across independent instances, zones, or regions according to service design.
  4. Review balancing rules after significant architecture, traffic, or dependency changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Traffic distribution is manual or concentrated on a small number of nodes.
  • Level 2: Defined: Load-balancing patterns exist, but health checks and routing behavior are basic.
  • Level 3: Managed: Load balancers are consistently configured, monitored, and used for critical services.
  • Level 4: Measurable: Teams track saturation, failover behavior, and balancing effectiveness under load.
  • Level 5: Optimized: Traffic management is tuned using service-level objectives, resilience tests, and real traffic patterns.

Control Recommendations

  1. Configure graceful connection draining for deployments and node replacement.
  2. Use health checks that validate application readiness, not just port reachability.
  3. Periodically test behavior during partial zone or instance failure.
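
As a toy model of health-aware routing, the sketch below rotates traffic across backends while skipping unhealthy ones. It is an illustration only; real load balancers also handle draining, weights, and zone awareness, and the backend addresses here are made up.

```python
import itertools

def healthy_round_robin(backends, is_healthy):
    """Yield backends in rotation, skipping any that fail the health check.
    A toy model: assumes at least one backend stays healthy."""
    for b in itertools.cycle(backends):
        if is_healthy(b):
            yield b

down = {"10.0.0.2"}
picker = healthy_round_robin(["10.0.0.1", "10.0.0.2", "10.0.0.3"],
                             lambda b: b not in down)
print([next(picker) for _ in range(4)])  # ['10.0.0.1', '10.0.0.3', '10.0.0.1', '10.0.0.3']
```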

Employ network failover solutions (e.g., redundant routers, switches)

Control Description

Failover components reduce dependency on individual devices and allow services to continue when routers, firewalls, switches, or virtual network functions become unavailable.

Control Implementation

  1. Identify network devices and services whose failure would cause major service interruption.
  2. Deploy redundant devices or managed failover services for those critical points.
  3. Define state synchronization, configuration management, and operating procedures for failover pairs or clusters.
  4. Validate replacement and failover behavior after changes to firmware, policy, or topology.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Critical network devices are single points of failure.
  • Level 2: Defined: Failover designs exist on paper or for a limited subset of infrastructure.
  • Level 3: Managed: Redundant network components are implemented and operationally maintained.
  • Level 4: Measurable: Teams track failover readiness, device drift, and recovery outcomes.
  • Level 5: Optimized: Failover configuration is automated and continuously improved through testing and standardization.

Control Recommendations

  1. Keep primary and secondary device configurations under version control.
  2. Use out-of-band management paths for recovery from control-plane failure.
  3. Review firmware lifecycle and support windows as part of resilience planning.

Monitor network performance and latency to detect potential issues

Control Description

Network telemetry provides early warning for congestion, path instability, and dependency issues that often precede full outages. Monitoring enables teams to detect and remediate performance degradation before it becomes service loss.

Control Implementation

  1. Collect latency, packet-loss, throughput, and path health telemetry for critical services.
  2. Define thresholds and baselines that distinguish normal variance from meaningful degradation.
  3. Correlate network telemetry with application and platform monitoring.
  4. Escalate sustained anomalies to the teams responsible for network, platform, and service operations.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Network performance is only reviewed after incidents or user complaints.
  • Level 2: Defined: Basic monitoring exists, but coverage and alert quality are inconsistent.
  • Level 3: Managed: Critical paths are monitored continuously with clear ownership and response expectations.
  • Level 4: Measurable: Teams analyze trends, threshold quality, and incident correlation to improve detection.
  • Level 5: Optimized: Monitoring is proactive, service-aware, and tuned using resilience and performance testing.

Control Recommendations

  1. Monitor north-south and east-west traffic paths for critical workloads.
  2. Track external dependencies such as DNS, CDN, or carrier performance where they materially affect services.
  3. Review noisy alerts regularly to keep signal quality high.
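
One simple way to express step 2, separating normal variance from meaningful degradation, is a baseline-plus-deviation threshold. A minimal sketch with illustrative latency samples; the specific thresholding strategy is an assumption, not a prescription.

```python
from statistics import mean, stdev

def degraded(samples: list[float], window: list[float], k: float = 3.0) -> bool:
    """Flag sustained latency above the baseline mean plus k standard
    deviations (step 2 above). k and the window are illustrative choices."""
    baseline = mean(window)
    band = k * stdev(window)
    return mean(samples) > baseline + band

history = [20.1, 19.8, 20.5, 21.0, 20.2, 19.9]  # ms, normal variance
recent = [31.0, 29.5, 30.2]                      # sustained elevation
print(degraded(recent, history))                 # True
```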

Test network redundancy and failover processes to ensure proper functioning

Control Description

Failover design should be proven under realistic conditions, not assumed from configuration alone. Testing validates timing, routing behavior, and operational readiness during degraded or emergency states.

Control Implementation

  1. Schedule failover tests for critical network paths, devices, and traffic-management layers.
  2. Execute controlled scenarios that remove or impair selected network dependencies.
  3. Measure impact on service availability, latency, and operator response.
  4. Capture findings and remediation work in resilience improvement plans.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Failover is largely untested outside of real incidents.
  • Level 2: Defined: Test plans exist, but execution is infrequent or narrow.
  • Level 3: Managed: Critical failover paths are exercised on a documented cadence.
  • Level 4: Measurable: Teams track failover timing, user impact, and recurring defects.
  • Level 5: Optimized: Testing is repeatable, low-friction, and used to harden both design and operations.

Control Recommendations

  1. Test at least one scenario that affects shared dependencies, not only single devices.
  2. Include rollback and recovery-to-primary procedures in the exercise.
  3. Review whether failover causes secondary bottlenecks elsewhere in the system.

Infrastructure Monitoring and Alerting

Resilient operations depend on fast, trustworthy signals. In 2026, monitoring should unify metrics, logs, traces, and topology context so that teams can detect, triage, and respond with less guesswork across internal systems, critical vendors, and business-priority services.

Implement a monitoring system to track the health and performance of cloud infrastructure

Control Description

This control establishes the foundational visibility needed to operate resilient infrastructure. A unified monitoring system helps teams detect unhealthy states, capacity issues, and service degradation before they escalate.

Control Implementation

  1. Instrument critical infrastructure components with metrics and health telemetry.
  2. Centralize dashboards and operational views for infrastructure, platform, and service owners.
  3. Define ownership for monitored signals and their operational relevance.
  4. Review monitoring coverage whenever new critical systems or dependencies are introduced.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Monitoring is fragmented and focused on only a few systems.
  • Level 2: Defined: Core monitoring tools and ownership are documented, but coverage is uneven.
  • Level 3: Managed: Critical infrastructure is monitored consistently using shared operational standards.
  • Level 4: Measurable: Teams track monitoring coverage, detection quality, and data freshness.
  • Level 5: Optimized: Monitoring evolves with architecture changes and is integrated into engineering delivery practices.

Control Recommendations

  1. Include managed services and control-plane dependencies, not just self-managed hosts and containers.
  2. Use standardized telemetry tags so cross-team analysis remains reliable.
  3. Keep dashboards focused on operational decisions rather than vanity metrics.

Set up alerts for critical events and performance thresholds

Control Description

Alerting turns raw telemetry into actionable operational response. Effective alerts help teams notice important problems quickly without overwhelming them with noise.

Control Implementation

  1. Define alerts for events and thresholds that require operator action.
  2. Route alerts to the right teams based on ownership, severity, and time-of-day expectations.
  3. Include runbook links, service context, and likely impact in alert payloads where practical.
  4. Review alert quality after incidents, near misses, and major system changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Alerts are sparse, overly noisy, or lack clear ownership.
  • Level 2: Defined: Alert rules and escalation paths exist, but tuning is inconsistent.
  • Level 3: Managed: Critical alerts are actionable, routed correctly, and maintained as part of operations.
  • Level 4: Measurable: Teams track false positives, missed detections, and acknowledgement times.
  • Level 5: Optimized: Alerting is continuously tuned using incident analysis, SLOs, and automation.

Control Recommendations

  1. Prefer symptom-based alerts for customer impact in addition to component-level threshold alerts.
  2. Retire or tune alerts that rarely lead to action.
  3. Distinguish paging conditions from informational notifications.
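
A minimal sketch of severity- and impact-based routing follows; the routing table, teams, and channels are hypothetical. It reflects recommendations 1 and 3 by paging on customer-visible symptoms and keeping low-severity signals off the pager.

```python
# Hypothetical routing table: severities, teams, and channels are examples only.
ROUTES = {
    "page":   {"team": "oncall",   "channel": "pager"},
    "ticket": {"team": "platform", "channel": "queue"},
    "info":   {"team": "platform", "channel": "dashboard"},
}

def route(alert: dict) -> dict:
    """Map an alert to a destination, separating paging conditions from
    informational notifications (recommendation 3 above)."""
    if alert.get("customer_impact"):
        return ROUTES["page"]
    return ROUTES["ticket"] if alert["severity"] >= 3 else ROUTES["info"]

print(route({"severity": 2, "customer_impact": True}))   # paged: symptom-based
print(route({"severity": 1, "customer_impact": False}))  # dashboard only
```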

Monitor resource usage to identify potential bottlenecks and capacity issues

Control Description

Resource usage monitoring helps teams see stress before it becomes failure. It supports both short-term incident avoidance and longer-term capacity planning.

Control Implementation

  1. Track utilization, saturation, throttling, and queueing signals for compute, storage, and network resources.
  2. Define service-specific thresholds and trend reviews for critical workloads.
  3. Correlate resource pressure with deployments, scheduled jobs, and dependency behavior.
  4. Escalate persistent bottlenecks into engineering, capacity, or architecture workstreams.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Resource pressure is noticed late and handled reactively.
  • Level 2: Defined: Teams know which resources matter, but monitoring depth varies.
  • Level 3: Managed: Critical workloads have regular resource monitoring and ownership.
  • Level 4: Measurable: Teams measure saturation trends, bottleneck recurrence, and forecasting accuracy.
  • Level 5: Optimized: Resource monitoring informs autoscaling, workload placement, and architecture decisions.

Control Recommendations

  1. Monitor limits and quotas alongside raw utilization.
  2. Include managed database, messaging, and storage services in resource reviews.
  3. Use historical trend windows that can reveal seasonal behavior, not only real-time views.

Establish a centralized logging system to collect and analyze logs from various components

Control Description

Centralized logging supports investigation, detection, and post-incident learning by making operational evidence searchable across systems and teams.

Control Implementation

  1. Aggregate logs from infrastructure, platforms, applications, and security-relevant services.
  2. Standardize timestamps, identifiers, and metadata needed for correlation.
  3. Define retention, access controls, and integrity expectations for operational logs.
  4. Test log ingestion and searchability when new services or schemas are introduced.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Logs remain scattered across individual systems with limited retention.
  • Level 2: Defined: Central logging exists for some systems, but format and access standards vary.
  • Level 3: Managed: Critical logs are centralized, retained appropriately, and available for investigation.
  • Level 4: Measurable: Teams monitor pipeline health, coverage, query performance, and missing-log conditions.
  • Level 5: Optimized: Logging is structured, policy-driven, and continuously improved for detection and response.

Control Recommendations

  1. Capture audit logs from identity, control-plane, and deployment systems alongside runtime logs.
  2. Protect sensitive data in logs through redaction and access control.
  3. Validate that log retention meets operational and compliance requirements.
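
Standardized identifiers and metadata (step 2 above) are easiest to enforce when logs are emitted as structured records. A minimal sketch using Python's standard logging module; the field names are illustrative, not a required schema.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the correlation fields that
    implementation step 2 above calls for. Field names are illustrative."""
    def format(self, record):
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("demo")
log.addHandler(handler)
log.warning("restore lag above threshold", extra={"service": "backup", "trace_id": "abc123"})
```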

Regularly review monitoring data to identify trends and improvement opportunities

Control Description

Telemetry becomes more valuable when teams review it deliberately for patterns, not only during incidents. Trend analysis helps identify weak signals, recurring failure modes, and opportunities for hardening.

Control Implementation

  1. Hold recurring reviews of monitoring, alerting, and incident trend data.
  2. Identify repeated failure modes, recurring noisy alerts, and slow-burn capacity or reliability problems.
  3. Convert review findings into tracked remediation or architecture work.
  4. Revisit service objectives and monitoring coverage in light of observed trends.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Monitoring data is mainly used for reactive troubleshooting.
  • Level 2: Defined: Trend reviews are planned, but follow-through is inconsistent.
  • Level 3: Managed: Teams regularly review telemetry and act on resilience findings.
  • Level 4: Measurable: Review outcomes, recurring issues, and improvement completion are tracked.
  • Level 5: Optimized: Monitoring review is embedded in continuous improvement and resilience governance.

Control Recommendations

  1. Include operations, engineering, and service owners in the review loop.
  2. Compare telemetry trends before and after major architecture changes.
  3. Keep a durable record of findings so improvements are visible over time.

Incident Response Planning

When incidents happen, resilience depends on prepared people and clear decisions. A modern response capability should support fast coordination, accurate communication, repeatable recovery, and disciplined learning. This section focuses on operational incident handling; broader business continuity and crisis-management expectations are covered separately.

Develop a formal incident response plan, including roles and responsibilities

Control Description

A formal incident response plan gives teams a shared operating model during service disruption. It reduces confusion by clarifying who leads, who communicates, and who restores service.

Control Implementation

  1. Define incident severity, activation criteria, roles, and decision authority.
  2. Document the end-to-end workflow for detection, triage, escalation, response, and recovery.
  3. Assign primary and backup owners for key response roles.
  4. Review the plan after major organizational, architectural, or dependency changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Incident response depends on informal knowledge and heroics.
  • Level 2: Defined: A documented plan exists, but role clarity and adoption vary.
  • Level 3: Managed: Teams use a formal plan consistently during significant incidents.
  • Level 4: Measurable: Teams track activation quality, role coverage, and plan adherence.
  • Level 5: Optimized: The plan is continuously refined through drills, incidents, and organizational learning.

Control Recommendations

  1. Keep the first page of the plan short and operationally useful.
  2. Include third-party dependency scenarios in escalation guidance.
  3. Ensure backup personnel are named for every critical response role.

Establish a communication plan for internal and external stakeholders during incidents

Control Description

Communication plans help organizations share accurate information with the right audience at the right time. They reduce confusion, protect trust, and support coordinated response across technical and non-technical stakeholders.

Control Implementation

  1. Define communication audiences, approval paths, and update channels for incident conditions.
  2. Prepare templates for executive, customer, regulator, and internal team communications as needed.
  3. Assign clear ownership for status updates and message approvals.
  4. Review the plan after incidents to improve clarity, speed, and consistency.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Incident communication is improvised and inconsistent.
  • Level 2: Defined: Communication pathways and templates exist, but use is uneven.
  • Level 3: Managed: Stakeholder communication follows a documented plan during incidents.
  • Level 4: Measurable: Teams track timeliness, accuracy, and stakeholder feedback on communications.
  • Level 5: Optimized: Communication plans are rehearsed, streamlined, and improved through real-event learning.

Control Recommendations

  1. Separate technical coordination channels from stakeholder-facing updates.
  2. Use pre-approved language where response speed matters.
  3. Keep customer and leadership updates aligned to the same source of operational truth.

Perform regular incident response drills to test and refine the plan

Control Description

Drills expose weaknesses in plans, tooling, and coordination before a real incident does. They build shared muscle memory across engineering, operations, security, and business leadership.

Control Implementation

  1. Run scheduled tabletop and hands-on exercises for representative incident scenarios.
  2. Include dependency failures, cyber events, and operational mistakes in the exercise catalog.
  3. Measure readiness factors such as activation speed, coordination quality, and decision clarity.
  4. Feed drill findings into tracked plan and tooling improvements.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Response drills are rare or limited to discussion-only exercises.
  • Level 2: Defined: Drill cadence and scenario types are documented, but execution is uneven.
  • Level 3: Managed: Teams run regular drills and use findings to improve readiness.
  • Level 4: Measurable: Teams track exercise coverage, participation, and identified improvement items.
  • Level 5: Optimized: Drills are realistic, cross-functional, and closely tied to observed risks and incidents.

Control Recommendations

  1. Alternate between tabletop exercises and technical simulations.
  2. Include communications and leadership coordination in at least some drills.
  3. Re-test high-severity findings after remediation.

Document lessons learned from incidents and update the incident response plan accordingly

Control Description

Post-incident learning turns disruption into operational improvement. This control ensures that identified issues lead to better plans, systems, and working practices rather than being forgotten once service is restored.

Control Implementation

  1. Conduct structured post-incident reviews for material events and near misses.
  2. Capture technical causes, contributing factors, decision points, and communication observations.
  3. Assign owners and due dates to follow-up actions.
  4. Update plans, runbooks, and training materials based on completed learning.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Incident learning is informal and rarely preserved.
  • Level 2: Defined: Review processes exist, but action tracking is inconsistent.
  • Level 3: Managed: Post-incident learning is documented and linked to concrete improvements.
  • Level 4: Measurable: Teams track action completion, repeat issues, and learning-cycle quality.
  • Level 5: Optimized: Incident learning feeds a durable culture of resilience improvement across teams.

Control Recommendations

  1. Focus reviews on learning and system improvement rather than blame.
  2. Track recurring themes across incidents, not only single-event details.
  3. Share relevant lessons with adjacent teams that depend on the same patterns or platforms.

Provide training for staff on incident response processes and best practices

Control Description

Training helps people understand how to operate during pressure, use the right tools, and collaborate effectively when service risk is high.

Control Implementation

  1. Define training expectations for responders, support teams, and relevant business stakeholders.
  2. Cover incident roles, escalation paths, communication norms, and key tooling.
  3. Refresh training after major process or platform changes.
  4. Record participation and verify that training reaches backup role holders as well as primaries.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Staff learn incident response informally through experience.
  • Level 2: Defined: Training content exists, but completion and refresh cycles vary.
  • Level 3: Managed: Response training is delivered consistently to the right audiences.
  • Level 4: Measurable: Teams track completion, readiness gaps, and training effectiveness.
  • Level 5: Optimized: Training is role-specific, continuously updated, and reinforced through drills and reviews.

Control Recommendations

  1. Train new joiners early on core incident response expectations.
  2. Tailor deeper training for incident commanders, communications leads, and platform specialists.
  3. Use short refreshers between larger formal training events.

Business Continuity and Crisis Management

Operational resilience requires more than technical incident response. Organizations also need business impact analysis, continuity decision-making, manual fallback arrangements, and a crisis-management model that can coordinate technical teams, leadership, customers, regulators, and critical suppliers during severe disruption.

Perform business impact analysis for critical services and processes

Control Description

Business impact analysis (BIA) identifies which services, processes, assets, and dependencies matter most during severe disruption. It creates the prioritization foundation for continuity planning, recovery objectives, and resilience investment.

Control Implementation

  1. Identify critical business services, supporting processes, information assets, and dependencies.
  2. Assess the operational, financial, customer, and regulatory impact of severe disruption.
  3. Define recovery priorities and tolerances using qualitative and quantitative criteria.
  4. Review the BIA after major organizational, product, supplier, or platform change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Recovery priorities are assumed informally and vary by stakeholder.
  • Level 2: Defined: A BIA approach exists, but critical services or dependencies are not fully covered.
  • Level 3: Managed: Critical services and processes have an up-to-date BIA that informs continuity planning.
  • Level 4: Measurable: Teams track BIA coverage, review cadence, and alignment with recovery practice.
  • Level 5: Optimized: The BIA is continuously improved using incidents, exercises, and changing business context.

Control Recommendations

  1. Include third-party dependencies and manual processes in the impact analysis.
  2. Use the BIA to drive redundancy, recovery testing, and service-priority decisions.
  3. Reconcile conflicting stakeholder priorities before a disruption occurs.

Define business continuity plans and manual workaround procedures

Control Description

Business continuity plans describe how critical services and processes continue or recover during severe disruption. Manual workarounds provide an important fallback where automation, SaaS platforms, or infrastructure are unavailable.

Control Implementation

  1. Develop continuity plans for critical services and important supporting processes.
  2. Define manual or degraded-mode procedures for the highest-priority workflows where feasible.
  3. Record activation criteria, owners, dependencies, and communication paths for each plan.
  4. Review and update plans after major process, platform, or staffing changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Continuity arrangements depend on informal team knowledge.
  • Level 2: Defined: Plans exist for some critical areas, but depth and ownership vary.
  • Level 3: Managed: Critical continuity plans and workarounds are documented, owned, and maintained.
  • Level 4: Measurable: Teams track plan coverage, stale procedures, and workaround feasibility.
  • Level 5: Optimized: Continuity plans evolve continuously using tests, incidents, and business feedback.

Control Recommendations

  1. Keep workarounds realistic for the teams who must execute them under pressure.
  2. Document the limits of degraded-mode operation, not only the intended steps.
  3. Link continuity plans to dependency inventories, runbooks, and crisis communication plans.

Establish a crisis management structure for severe disruptions

Control Description

Severe disruption requires a leadership and coordination model that goes beyond normal incident response. Crisis management helps align business decisions, public communication, regulatory obligations, and recovery prioritization.

Control Implementation

  1. Define crisis activation criteria, decision authority, and core crisis-management roles.
  2. Establish procedures for executive coordination, cross-functional escalation, and stakeholder communication.
  3. Assign backup role holders and ensure crisis contacts are maintained.
  4. Review the structure after exercises, severe incidents, or organizational change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Crisis leadership is improvised during major disruption.
  • Level 2: Defined: Crisis roles and processes are documented, but readiness is inconsistent.
  • Level 3: Managed: The organization has a usable crisis-management structure for severe events.
  • Level 4: Measurable: Teams track role readiness, activation quality, and decision-flow effectiveness.
  • Level 5: Optimized: Crisis management is rehearsed, continuously refined, and aligned with continuity and incident practice.

Control Recommendations

  1. Keep crisis-management roles distinct from day-to-day technical execution roles where possible.
  2. Define how crisis leadership interacts with legal, compliance, and communications functions.
  3. Ensure severe supplier failures can be handled inside the same crisis framework.

Exercise continuity and crisis scenarios with business and technical stakeholders

Control Description

Exercises reveal whether continuity plans and crisis arrangements are workable under real pressure. They help build shared understanding between technical and non-technical teams before an actual severe event occurs.

Control Implementation

  1. Run regular continuity and crisis exercises for scenarios that threaten critical services or important functions.
  2. Include business, operations, engineering, security, supplier, and leadership stakeholders as needed.
  3. Measure activation speed, coordination quality, workaround usability, and decision clarity.
  4. Track remediation actions for issues discovered during exercises.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Continuity and crisis exercises are rare or mostly discussion-based.
  • Level 2: Defined: Exercise expectations exist, but scenarios or participation are limited.
  • Level 3: Managed: Continuity and crisis scenarios are exercised regularly with relevant stakeholders.
  • Level 4: Measurable: Teams track exercise coverage, findings, and remediation completion.
  • Level 5: Optimized: Exercises are realistic, cross-functional, and tightly aligned to major business risks.

Control Recommendations

  1. Include provider failure and communications breakdown scenarios in the exercise catalog.
  2. Test manual fallback procedures, not only escalation conversations.
  3. Re-run scenarios after major remediation to confirm improvement.

Review continuity assumptions and recovery priorities after major change

Control Description

Continuity plans become stale when services, dependencies, and business priorities change. This control keeps recovery assumptions aligned to the organization’s current operating model.

Control Implementation

  1. Trigger continuity reviews after major platform, supplier, business, or regulatory change.
  2. Reassess recovery priorities, dependencies, and workaround feasibility against current reality.
  3. Update continuity documentation, crisis procedures, and recovery sequencing where gaps exist.
  4. Communicate changes to teams that own affected services and processes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Continuity assumptions change slowly and mainly after visible disruption.
  • Level 2: Defined: Review triggers are known, but execution is inconsistent.
  • Level 3: Managed: Continuity assumptions and priorities are reviewed after material change.
  • Level 4: Measurable: Teams track review cadence, stale assumptions, and update completion.
  • Level 5: Optimized: Continuity review is embedded into change governance and resilience planning.

Control Recommendations

  1. Revalidate changes against the business impact analysis, not only technical documentation.
  2. Pay particular attention to supplier, staffing, and recovery-sequencing changes.
  3. Retain a record of historical decisions so teams can understand why priorities changed.

Capacity planning and scaling

Capacity resilience is about more than avoiding exhaustion. Teams need enough headroom, elasticity, and insight to handle growth, bursts, maintenance, and failure-driven redistribution of traffic.

Regularly assess infrastructure capacity and plan for growth

Control Description

This control ensures that capacity is reviewed against business demand, resilience targets, and architecture changes. Regular planning reduces the chance of service failure caused by predictable growth or shifting usage patterns.

Control Implementation

  1. Identify critical services and the resources that constrain their growth.
  2. Review current demand, historical trends, and planned business or product changes.
  3. Define headroom targets that account for failures, maintenance, and recovery scenarios.
  4. Update capacity plans whenever assumptions materially change.
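
As a hedged illustration of the headroom targets in step 3, the sketch below runs a simple N-1 check: whether the surviving zones can absorb total demand after any single failure domain is lost. The zone layout, capacities, and the 80% utilization ceiling are invented example values, not benchmark requirements.

```python
# Illustrative N-1 headroom check: can the surviving zones absorb total
# demand after any single zone fails, while staying under a safe
# utilization ceiling? All figures are hypothetical examples.

def n_minus_one_ok(zone_capacity: dict[str, float],
                   total_demand: float,
                   ceiling: float = 0.80) -> bool:
    """True if demand fits in the remaining zones after any one zone
    failure without exceeding the utilization ceiling."""
    for failed in zone_capacity:
        surviving = sum(c for z, c in zone_capacity.items() if z != failed)
        if total_demand > surviving * ceiling:
            return False
    return True

# Three zones of 1000 req/s each, 1900 req/s of demand: losing one zone
# leaves 2000 * 0.8 = 1600 req/s of safe capacity, so the check fails.
zones = {"zone-a": 1000.0, "zone-b": 1000.0, "zone-c": 1000.0}
print(n_minus_one_ok(zones, total_demand=1900.0))  # False
```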

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Capacity planning happens only after services show stress.
  • Level 2: Defined: Capacity review practices are documented, but not applied consistently.
  • Level 3: Managed: Critical services follow a regular capacity planning cycle.
  • Level 4: Measurable: Teams track forecast accuracy, headroom, and capacity-related incident trends.
  • Level 5: Optimized: Capacity plans are continuously refined using demand signals, testing, and resilience objectives.

Control Recommendations

  1. Plan for degraded-mode operation, not only steady-state demand.
  2. Include third-party quotas and service limits in capacity analysis.
  3. Review whether capacity assumptions still hold after architecture simplification or consolidation.

Implement auto-scaling strategies to handle fluctuating workloads

Control Description

Autoscaling improves resilience by adapting supply to demand more quickly than manual intervention can. It is most useful when paired with sensible limits, strong observability, and service-aware scaling triggers.

Control Implementation

  1. Identify workloads that benefit from horizontal or vertical autoscaling.
  2. Configure scaling rules based on meaningful demand or saturation signals.
  3. Define guardrails for minimum capacity, maximum expansion, and scale-in safety.
  4. Test autoscaling behavior under burst, failure, and dependency degradation scenarios.
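
The guardrails described in step 3 can be expressed as a small decision function. The sketch below is illustrative only; in practice the logic lives in a platform autoscaler, and the target utilization, replica bounds, and scale-in step are assumed example values. Scale-out is deliberately allowed to move faster than scale-in, which is one common way to avoid thrashing after demand drops.

```python
# Hypothetical scaling decision with guardrails: hard minimum and
# maximum replica counts, plus conservative scale-in to reduce
# thrashing. Targets and bounds are example values.
import math

def desired_replicas(current: int, utilization: float,
                     target: float = 0.60,
                     min_replicas: int = 2,
                     max_replicas: int = 20,
                     max_scale_in_step: int = 1) -> int:
    """Scale toward the target utilization; expand freely up to the
    ceiling, but shrink at most one replica per evaluation cycle."""
    proposed = math.ceil(current * utilization / target)
    if proposed < current:                    # scale-in: move slowly
        proposed = max(proposed, current - max_scale_in_step)
    return max(min_replicas, min(proposed, max_replicas))

print(desired_replicas(current=4, utilization=0.90))  # 6: burst absorbed
print(desired_replicas(current=6, utilization=0.20))  # 5: gradual scale-in
```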

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Scaling depends on manual action during or after demand spikes.
  • Level 2: Defined: Autoscaling exists for some workloads, but rules and limits are basic.
  • Level 3: Managed: Autoscaling is consistently implemented for appropriate workloads with clear ownership.
  • Level 4: Measurable: Teams track scaling responsiveness, thrashing, and cost-resilience tradeoffs.
  • Level 5: Optimized: Scaling strategy is tuned continuously using testing, demand analysis, and service objectives.

Control Recommendations

  1. Use warm capacity or queue-based patterns where cold starts are operationally risky.
  2. Prevent autoscaling from amplifying downstream bottlenecks or quota exhaustion.
  3. Review scale-in behavior carefully to avoid instability after demand drops.

Use load testing to identify capacity limits and potential bottlenecks

Control Description

Load testing helps teams understand where systems break, saturate, or degrade. It validates assumptions before peak demand or recovery events force the answer in production.

Control Implementation

  1. Define representative test scenarios for steady load, burst load, and failure-redistribution conditions.
  2. Measure throughput, latency, error behavior, and dependency impact during tests.
  3. Identify the first limiting component and record acceptable operating ranges.
  4. Feed findings into capacity plans, architecture decisions, and scaling rules.
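
A minimal closed-loop version of steps 1 and 2 is sketched below: concurrent workers drive a fixed-duration load and report throughput, tail latency, and errors. The target URL is a placeholder assumption; real programs would normally use dedicated load-testing tooling, but the measurements are the same.

```python
# Minimal load-test sketch: concurrent workers issue requests for a
# fixed duration, then throughput, p95 latency, and error counts are
# reported. The target URL is a placeholder assumption.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://localhost:8080/health"   # hypothetical endpoint
DURATION_S = 10
WORKERS = 16

def worker() -> tuple[list[float], int]:
    latencies, errors = [], 0
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET, timeout=2) as resp:
                resp.read()
        except Exception:
            errors += 1
        else:
            latencies.append(time.monotonic() - start)
    return latencies, errors

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    futures = [pool.submit(worker) for _ in range(WORKERS)]
    results = [f.result() for f in futures]

lat = [l for ls, _ in results for l in ls]
errors = sum(e for _, e in results)
if len(lat) >= 2:
    print(f"throughput: {len(lat) / DURATION_S:.1f} req/s")
    print(f"p95 latency: {statistics.quantiles(lat, n=20)[18] * 1000:.1f} ms")
print(f"errors: {errors}")
```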

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Load testing is rare or limited to small-scale checks.
  • Level 2: Defined: Test approaches are documented, but realism or coverage is limited.
  • Level 3: Managed: Critical services are load tested on a regular basis.
  • Level 4: Measurable: Teams track capacity limits, regression risk, and performance trends over time.
  • Level 5: Optimized: Load testing is integrated into resilience validation and change planning.

Control Recommendations

  1. Include downstream systems so tests reveal end-to-end bottlenecks.
  2. Use production-like data sizes and topology where feasible.
  3. Re-run tests after major runtime, storage, or architecture changes.

Monitor resource usage to anticipate and address potential capacity issues

Control Description

This control turns raw utilization into early action. Continuous monitoring helps teams spot unsafe trends before they create performance incidents or constrain recovery operations.

Control Implementation

  1. Track leading indicators such as queue depth, latency, saturation, and quota consumption.
  2. Define thresholds for warning and intervention based on criticality.
  3. Review trends in regular operations and capacity forums.
  4. Escalate persistent pressure to engineering or platform planning work.
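
One way to act on the leading indicators in steps 1 and 2 is a simple days-to-exhaustion estimate. The sketch below fits a linear trend through recent utilization samples; the data points are invented for illustration, and the linear fit is deliberately crude: its only job is to trigger a capacity review early.

```python
# Hypothetical days-to-exhaustion estimate from daily utilization
# samples (0.0 to 1.0). Requires at least two samples.

def days_to_exhaustion(samples: list[float], limit: float = 1.0):
    """Fit a straight line through (day, utilization) points and return
    the projected days until the limit is crossed, or None if usage is
    flat or falling."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) \
            / sum((x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None
    return (limit - samples[-1]) / slope

# Example: disk utilization creeping up by roughly 1% per day.
usage = [0.70, 0.71, 0.72, 0.72, 0.74, 0.75, 0.76]
print(f"{days_to_exhaustion(usage):.0f} days of headroom left")  # ~24
```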

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Capacity risk is noticed late through incidents or complaints.
  • Level 2: Defined: Monitoring exists for major resources, but thresholds and ownership vary.
  • Level 3: Managed: Teams monitor capacity risk consistently and respond before failure occurs.
  • Level 4: Measurable: Forecast variance, threshold quality, and recurring bottlenecks are tracked.
  • Level 5: Optimized: Monitoring data actively informs automation, investment, and architecture choices.

Control Recommendations

  1. Watch resource pressure during maintenance windows and failover events, not just normal traffic.
  2. Include cloud service quotas and API limits in monitoring scope.
  3. Treat repeated near-saturation as a resilience problem, not only a performance problem.

Review and update capacity plans based on changing business requirements and growth

Control Description

Capacity planning must evolve as services, traffic patterns, and business priorities change. Static plans become stale quickly in modern cloud environments.

Control Implementation

  1. Revisit capacity assumptions after major product, customer, geographic, or platform changes.
  2. Compare actual growth and operational behavior against prior forecasts.
  3. Update scaling, procurement, or architecture work based on revised demand expectations.
  4. Communicate plan changes to teams that own dependent services or shared platforms.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Plans are revisited only after shortages become visible.
  • Level 2: Defined: Review triggers are documented, but updates are irregular.
  • Level 3: Managed: Capacity plans are reviewed and updated as part of normal governance.
  • Level 4: Measurable: Teams track revision cadence, forecast drift, and change adoption.
  • Level 5: Optimized: Capacity planning is tightly linked to product planning, resilience targets, and engineering delivery.

Control Recommendations

  1. Include business and finance stakeholders when capacity shifts have material cost or customer implications.
  2. Reassess plans after consolidation onto shared platforms or common clusters.
  3. Keep historical versions so teams can learn from forecast misses.

Identity, Secrets, and Administrative Access

Identity systems, privileged access, and machine credentials are now a primary resilience dependency. A modern benchmark should therefore treat administrative identity, secrets handling, and break-glass access as first-class controls rather than as secondary details inside broader security guidance.

Centralize and harden privileged identity administration

Control Description

Privileged identity administration should be tightly controlled, observable, and resistant to misuse. Centralization reduces sprawl and makes privileged access easier to govern during both normal operations and crisis.

Control Implementation

  1. Centralize privileged identity and administrative access management where feasible.
  2. Separate high-impact administrative roles from standard user access paths.
  3. Protect privileged administration with stronger authentication, logging, and change controls.
  4. Review administrative identity architecture after major platform or directory changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Privileged administration is distributed across unmanaged or weakly governed paths.
  • Level 2: Defined: Administrative identity standards exist, but centralization is incomplete.
  • Level 3: Managed: Privileged identity administration is centrally governed for critical systems.
  • Level 4: Measurable: Teams track admin-path coverage, exception use, and privileged change visibility.
  • Level 5: Optimized: Administrative identity architecture is continuously hardened and aligned with resilience objectives.

Control Recommendations

  1. Reduce the number of systems that can modify core identity or access policy.
  2. Keep privileged identity workflows separate from routine operational access wherever practical.
  3. Monitor for emergency changes to identity policy and administrative groups.

Use short-lived credentials and just-in-time access for privileged operations

Control Description

Standing privilege creates persistent exposure and increases the impact of credential theft. Short-lived access and just-in-time elevation reduce that risk while preserving operational agility.

Control Implementation

  1. Identify privileged operations that can be performed using temporary rather than standing access.
  2. Implement short-lived credentials, session limits, or just-in-time elevation mechanisms for those operations.
  3. Record approvals, duration, and activity for privileged sessions.
  4. Review residual standing privilege and reduce it over time.
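
As one concrete pattern for step 2, most cloud platforms expose short-lived credential APIs. The sketch below uses AWS STS via boto3 purely as an example; the role ARN and session name are placeholders, and equivalent mechanisms exist on other platforms.

```python
# Requesting short-lived credentials instead of using standing keys,
# here via AWS STS (boto3). The role ARN and session name are
# placeholders; other platforms expose equivalent APIs.
import boto3

sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ops-breakfix",  # hypothetical
    RoleSessionName="jit-session-example",
    DurationSeconds=900,        # 15 minutes, the shortest STS session
)
creds = resp["Credentials"]     # expire automatically at creds["Expiration"]
print("session expires:", creds["Expiration"])
```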

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Privileged access is mostly standing and weakly time-bounded.
  • Level 2: Defined: Temporary access patterns are documented, but adoption is limited.
  • Level 3: Managed: Critical privileged operations use short-lived or just-in-time access consistently.
  • Level 4: Measurable: Teams track standing privilege reduction, temporary access use, and policy exceptions.
  • Level 5: Optimized: Privileged access is risk-aware, automated, and continuously tuned using operational evidence.

Control Recommendations

  1. Start with the most sensitive administrative domains, not the easiest systems.
  2. Keep temporary access requests usable during incidents without bypassing accountability.
  3. Review whether tooling or process friction is causing teams to retain standing privilege.

Manage secrets with controlled storage, rotation, and access policies

Control Description

Secrets such as API keys, tokens, certificates, and passwords are critical to both security and service continuity. Poor secrets management can turn routine incidents into severe, long-duration outages.

Control Implementation

  1. Store secrets in controlled systems rather than in source code, local files, or unmanaged configuration.
  2. Define access controls, lifecycle ownership, and rotation expectations for secret classes.
  3. Rotate high-impact secrets on a defined cadence and after compromise, personnel change, or platform events.
  4. Audit secret access and review stale, duplicated, or overexposed secrets regularly.
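
The stale-secret review in step 4 can start from something as small as an age check over a secret inventory. The inventory records and age limits below are invented for illustration; real data would come from a secrets manager's listing API.

```python
# Hypothetical staleness check over a secret inventory. The records and
# limits are invented; real data would come from a secrets manager's
# listing API.
from datetime import datetime, timedelta, timezone

MAX_AGE = {"human": timedelta(days=90), "service": timedelta(days=30)}

inventory = [
    {"name": "ci-deploy-token", "kind": "service",
     "rotated_at": datetime(2025, 11, 1, tzinfo=timezone.utc)},
    {"name": "dba-password", "kind": "human",
     "rotated_at": datetime(2026, 1, 15, tzinfo=timezone.utc)},
]

now = datetime.now(timezone.utc)
for secret in inventory:
    age = now - secret["rotated_at"]
    limit = MAX_AGE[secret["kind"]]
    if age > limit:
        print(f"ROTATE {secret['name']}: {age.days} days old "
              f"(limit {limit.days} days)")
```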

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Secrets are scattered across unmanaged locations and rotated inconsistently.
  • Level 2: Defined: Secret handling standards exist, but coverage and enforcement are partial.
  • Level 3: Managed: Critical secrets are centrally governed with controlled access and rotation procedures.
  • Level 4: Measurable: Teams track secret coverage, rotation compliance, and exposure findings.
  • Level 5: Optimized: Secret lifecycle management is automated, continuously monitored, and integrated with platform policy.

Control Recommendations

  1. Distinguish between human, service, and emergency-use secrets when defining rotation policy.
  2. Include SaaS administrative tokens and integration keys in scope.
  3. Test secret rotation procedures for services with low operational tolerance for failure.

Protect and test emergency access and break-glass procedures

Control Description

Emergency access is necessary when normal identity paths fail or when immediate privileged action is required. Those procedures must be both well protected and genuinely usable under severe conditions.

Control Implementation

  1. Define emergency access accounts, credentials, approvals, and activation conditions for critical systems.
  2. Protect break-glass mechanisms with strong custody, monitoring, and post-use review.
  3. Test emergency access procedures regularly under realistic failure scenarios.
  4. Rotate or re-establish emergency credentials after use or when control assumptions change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Emergency access exists informally or is untested.
  • Level 2: Defined: Break-glass procedures are documented, but protection or test discipline is inconsistent.
  • Level 3: Managed: Emergency access is controlled, monitored, and tested for critical systems.
  • Level 4: Measurable: Teams track test success, access integrity, and post-use review completion.
  • Level 5: Optimized: Emergency access is robust, low-friction in crisis, and continuously improved through testing.

Control Recommendations

  1. Ensure break-glass access does not depend on the same identity path it is meant to replace.
  2. Include identity-provider outage scenarios in testing.
  3. Review whether emergency access creates unmanaged standing risk outside of crisis conditions.

Govern machine identities and service credentials across workloads

Control Description

Machine identities increasingly control communication between services, platforms, and automation systems. Governing them explicitly reduces hidden trust relationships and improves resilience when credentials or workload boundaries fail.

Control Implementation

  1. Identify service accounts, workload identities, certificates, and automation credentials used by critical systems.
  2. Define ownership, issuance, rotation, and revocation processes for machine identities.
  3. Reduce long-lived shared credentials in favor of scoped, workload-specific identities where feasible.
  4. Review machine identity exposure after architecture, platform, or supply-chain change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Machine identities are poorly inventoried and often shared broadly.
  • Level 2: Defined: Governance expectations exist, but implementation remains incomplete.
  • Level 3: Managed: Critical workloads use governed machine identities with defined lifecycle controls.
  • Level 4: Measurable: Teams track identity coverage, credential age, and revocation effectiveness.
  • Level 5: Optimized: Machine identity governance is automated and aligned to workload architecture and policy.

Control Recommendations

  1. Include CI/CD, backup, and infrastructure automation credentials in scope.
  2. Prefer identity models that reduce secret distribution where platform support exists.
  3. Investigate repeated credential-sharing patterns as architecture or tooling gaps.

Security and access controls

Cloud resilience depends on security controls that preserve trustworthy operation under attack and under change. A 2026 program should emphasize least privilege, strong encryption, rapid exposure reduction, and evidence-driven testing. Privileged identity administration, secrets management, and emergency access are covered in the dedicated identity section that precedes this one.

Implement strong authentication and authorization mechanisms

Control Description

Strong authentication and authorization reduce the risk of unauthorized access to critical systems, data, and administrative paths. They are foundational to both security and operational resilience.

Control Implementation

  1. Require strong authentication for privileged, remote, and high-impact access.
  2. Use role-based or policy-based authorization aligned to least-privilege principles.
  3. Centralize identity where practical and integrate lifecycle events with access changes.
  4. Review authentication methods and privileged access patterns regularly.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Access relies heavily on static credentials and broad permissions.
  • Level 2: Defined: Authentication and authorization standards are documented, but adoption is incomplete.
  • Level 3: Managed: Strong authentication and least-privilege access are consistently enforced for critical systems.
  • Level 4: Measurable: Teams track privileged access exceptions, MFA coverage, and role drift.
  • Level 5: Optimized: Identity controls are automated, risk-aware, and tightly integrated with operations.

Control Recommendations

  1. Prefer phishing-resistant MFA for privileged access where feasible.
  2. Reduce standing privilege using short-lived or just-in-time elevation patterns.
  3. Include machine identities and service-to-service authorization in scope.

Regularly review and update user access permissions

Control Description

Periodic access review helps organizations remove stale permissions, enforce least privilege, and maintain confidence that users only retain the access they still need.

Control Implementation

  1. Define a review cadence for privileged and business-critical access.
  2. Validate that current permissions match active roles and approved responsibilities.
  3. Remove or reduce unnecessary access promptly.
  4. Record approvals, exceptions, and remediation actions for auditability.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Access review is informal and often triggered only by incidents.
  • Level 2: Defined: Review procedures exist, but execution and evidence are inconsistent.
  • Level 3: Managed: Access reviews are performed regularly with accountable owners.
  • Level 4: Measurable: Teams track stale access findings, review completion, and remediation timing.
  • Level 5: Optimized: Access review is automated where possible and continuously informed by identity risk signals.

Control Recommendations

  1. Prioritize privileged, break-glass, and dormant accounts in each review cycle.
  2. Tie access changes to joiner, mover, and leaver workflows.
  3. Review permissions for third-party operators and contractors as carefully as internal staff.

Enable encryption for data at rest and in transit

Control Description

Encryption protects sensitive data and operational secrets as they move through systems and while they are stored. It also helps reduce the impact of unauthorized access to infrastructure components or media.

Control Implementation

  1. Define encryption requirements for data classes, system boundaries, and transport paths.
  2. Enable encryption for stored data and enforce secure transport for sensitive communications.
  3. Manage certificates, keys, and trust material through controlled lifecycle processes.
  4. Validate that encryption settings remain enforced after platform and dependency changes.
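
Part of the verification in step 4 can be automated with a lightweight probe. The sketch below checks the negotiated TLS version and certificate expiry for a host using only the Python standard library; the hostname is a placeholder.

```python
# Probe a host for negotiated TLS version and certificate expiry using
# only the standard library. The hostname is a placeholder assumption.
import socket
import ssl
import time

HOST, PORT = "example.com", 443

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse downgraded transport

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        print("negotiated:", tls.version())     # e.g. 'TLSv1.3'
        not_after = tls.getpeercert()["notAfter"]
        days_left = (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400
        print(f"certificate expires in {days_left:.0f} days")
```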

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Encryption is used inconsistently and relies on manual configuration.
  • Level 2: Defined: Encryption standards exist, but coverage and verification are incomplete.
  • Level 3: Managed: Encryption is consistently enabled for critical data paths and stores.
  • Level 4: Measurable: Teams track encryption coverage, certificate hygiene, and policy exceptions.
  • Level 5: Optimized: Encryption posture is automated, continuously verified, and integrated into platform governance.

Control Recommendations

  1. Extend encryption review to backups, snapshots, replicas, and message transports.
  2. Rotate keys and certificates on a defined schedule and after relevant risk events.
  3. Monitor for expired certificates and downgraded transport configurations.

Apply security patches and updates promptly

Control Description

Prompt patching reduces the window in which known weaknesses can be exploited. Resilient organizations treat patching as a disciplined risk-reduction process rather than an occasional maintenance task.

Control Implementation

  1. Inventory software, platforms, and dependencies that require patch management.
  2. Classify updates by urgency and business impact.
  3. Test patches appropriately and deploy them within defined timelines.
  4. Track exceptions, compensating controls, and overdue remediation items.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Patching is reactive and lacks clear timelines or ownership.
  • Level 2: Defined: Patch processes and priorities are documented, but compliance is uneven.
  • Level 3: Managed: Patches are applied consistently according to risk-based timelines.
  • Level 4: Measurable: Teams track exposure age, overdue items, and exception trends.
  • Level 5: Optimized: Patch management is automated where appropriate and tightly linked to vulnerability intelligence.

Control Recommendations

  1. Include base images, managed services, network devices, and supporting tooling in patch scope.
  2. Use staged rollout patterns that limit blast radius for urgent updates.
  3. Make exception approvals time-bound and visible.

Conduct regular vulnerability assessments and penetration testing

Control Description

Assessment and testing help teams find exploitable weaknesses before attackers or operational stress do. They validate whether controls are effective in practice, not only in policy.

Control Implementation

  1. Define assessment scope, frequency, and testing methodology for critical systems.
  2. Combine automated scanning with manual validation for meaningful findings.
  3. Prioritize remediation based on exploitability and business impact.
  4. Re-test high-risk findings and use results to strengthen preventive controls.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Assessments occur irregularly and findings are not managed consistently.
  • Level 2: Defined: Assessment methods and responsibilities are documented, but coverage gaps remain.
  • Level 3: Managed: Regular vulnerability assessment and penetration testing are part of normal operations.
  • Level 4: Measurable: Teams track remediation time, recurring weaknesses, and testing coverage.
  • Level 5: Optimized: Testing scope is risk-driven, intelligence-informed, and tightly linked to engineering improvement.

Control Recommendations

  1. Focus manual testing on high-impact trust boundaries and privileged workflows.
  2. Include cloud configuration and identity attack paths, not only application endpoints.
  3. Share recurring vulnerability themes with platform and engineering teams for systemic fixes.

Software Delivery and Supply Chain Resilience

Software delivery systems are themselves critical resilience infrastructure. Source control, build systems, dependencies, artifact integrity, and rollback capability all shape whether teams can change safely and recover quickly when release tooling or supply-chain components are compromised or unavailable.

Protect source code, build systems, and deployment pipelines from unauthorized change

Control Description

Delivery systems have privileged control over production behavior. Protecting them from unauthorized change reduces the risk of both malicious compromise and high-impact operator error.

Control Implementation

  1. Identify source repositories, build systems, and deployment pipelines that influence critical services.
  2. Apply strong access control, separation of duties, and change protection to those systems.
  3. Monitor administrative changes, pipeline configuration changes, and unauthorized execution attempts.
  4. Review delivery-system protection after major tooling, platform, or organizational changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Delivery systems are protected inconsistently and share broad administrative access.
  • Level 2: Defined: Protection standards exist, but not all critical delivery paths follow them.
  • Level 3: Managed: Critical code and deployment systems are protected with clear controls and ownership.
  • Level 4: Measurable: Teams track pipeline protection coverage, privileged changes, and control exceptions.
  • Level 5: Optimized: Delivery-system protection is continuously hardened using testing, incidents, and governance feedback.

Control Recommendations

  1. Treat pipeline administration as privileged access with the same rigor as production administration.
  2. Protect branch, tag, and release controls for critical services explicitly.
  3. Investigate shadow deployment paths that bypass normal controls.

Maintain traceability and integrity for build artifacts and releases

Control Description

Teams need to know what was built, from which source, by which process, and whether the resulting artifact is trustworthy. Traceability and integrity are essential for both secure release and confident rollback.

Control Implementation

  1. Record source, build, approval, and release metadata for critical artifacts.
  2. Use integrity checks, signing, or equivalent provenance mechanisms for high-impact releases where practical.
  3. Retain artifact and release history long enough to support rollback, investigation, and audit.
  4. Review traceability gaps after incidents or release failures.
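
A minimal starting point for steps 1 and 2 is recording a digest plus source metadata for each artifact. The record fields and paths below are assumptions for illustration; mature pipelines typically layer signing and attestation tooling on top of a bare hash.

```python
# Minimal provenance record for a build artifact: a SHA-256 digest plus
# source and build metadata. Field names and paths are illustrative.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(artifact_path: str, git_commit: str,
                      build_id: str) -> dict:
    digest = hashlib.sha256()
    with open(artifact_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "artifact": artifact_path,
        "sha256": digest.hexdigest(),
        "source_commit": git_commit,
        "build_id": build_id,
        "built_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage inside a build step:
# print(json.dumps(provenance_record("dist/service.tar.gz",
#                                    "abc123", "build-42"), indent=2))
```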

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Artifact history and provenance are incomplete or difficult to reconstruct.
  • Level 2: Defined: Traceability expectations are documented, but enforcement varies.
  • Level 3: Managed: Critical build artifacts and releases are traceable and integrity-checked consistently.
  • Level 4: Measurable: Teams track provenance coverage, release-history completeness, and integrity exceptions.
  • Level 5: Optimized: Artifact trust and release traceability are automated and embedded into delivery governance.

Control Recommendations

  1. Focus first on artifacts that directly affect production or critical recovery tooling.
  2. Ensure rollback paths rely on the same trustworthy artifact history used for forward releases.
  3. Align traceability records with incident and audit evidence needs.

Control dependency and base image risk through continuous inventory and update processes

Control Description

Application and platform resilience depend heavily on the quality of external components. Dependency and base image risk should be understood, prioritized, and reduced continuously rather than only after urgent vulnerability events.

Control Implementation

  1. Inventory critical dependencies, runtime components, and base images used by important services.
  2. Classify update urgency based on exploitability, operational impact, and service criticality.
  3. Establish routines to review, update, or retire outdated components.
  4. Record exceptions and compensating controls for dependencies that cannot be updated promptly.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Dependency updates are irregular and weakly prioritized.
  • Level 2: Defined: Inventory and update practices exist, but coverage is incomplete.
  • Level 3: Managed: Critical dependencies and base images are tracked and updated through defined processes.
  • Level 4: Measurable: Teams track exposure age, stale component count, and remediation velocity.
  • Level 5: Optimized: Dependency risk management is continuous, risk-driven, and tightly integrated with release engineering.

Control Recommendations

  1. Include build plugins, CI/CD components, and recovery tooling dependencies in scope.
  2. Distinguish between security urgency and operational compatibility risk when sequencing updates.
  3. Use inventory data to prioritize the dependencies with the greatest concentration risk first.

Design deployments for safe rollback and progressive release

Control Description

Resilient delivery minimizes blast radius when change goes wrong. Progressive release and tested rollback patterns make it easier to restore service quickly without improvisation.

Control Implementation

  1. Define deployment strategies that support staged rollout, validation, and rollback for critical services.
  2. Ensure rollback procedures, previous artifacts, and relevant configuration states remain available.
  3. Use health signals and guardrails to pause or reverse unhealthy releases.
  4. Review deployment safety after incidents, architecture changes, or tooling migration.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Releases are broad, risky, and difficult to reverse quickly.
  • Level 2: Defined: Safer rollout patterns are known, but adoption is inconsistent.
  • Level 3: Managed: Critical services use progressive release and defined rollback procedures.
  • Level 4: Measurable: Teams track rollback success, release blast radius, and validation quality.
  • Level 5: Optimized: Deployment safety is continuously tuned using telemetry, incidents, and release evidence.

Control Recommendations

  1. Validate rollback not only in theory but during real exercises and controlled release failures.
  2. Make sure rollback does not depend on unavailable external tooling or missing artifacts.
  3. Align release guardrails with service-level objectives and customer-impact signals.

Test CI/CD recovery and release continuity during platform disruption

Control Description

Teams should be able to recover delivery capability during severe disruption, not only recover running services. This control validates whether release pipelines, artifact access, and deployment authority remain usable when platforms or providers fail.

Control Implementation

  1. Define disruption scenarios affecting source control, build systems, artifact stores, and deployment tooling.
  2. Test recovery or fallback procedures for the delivery systems that support critical services.
  3. Measure restoration time, operational workarounds, and residual release constraints.
  4. Use results to improve tooling architecture, documentation, and continuity planning.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Delivery-tool recovery is largely untested.
  • Level 2: Defined: Recovery plans for CI/CD exist, but testing is limited.
  • Level 3: Managed: Critical delivery paths are tested for continuity and recovery on a regular basis.
  • Level 4: Measurable: Teams track recovery performance, dependency weaknesses, and unresolved CI/CD gaps.
  • Level 5: Optimized: Delivery continuity testing is repeatable, prioritized, and integrated with broader resilience exercises.

Control Recommendations

  1. Include identity, artifact, and approval-path dependencies in CI/CD recovery tests.
  2. Test whether emergency changes can still be deployed safely during platform disruption.
  3. Align delivery continuity plans with crisis, continuity, and supplier-disruption scenarios.

Application resiliency and fault tolerance

Applications should degrade gracefully, recover quickly, and limit the blast radius of local failures. A resilient design couples strong runtime patterns with feedback from production behavior.

Design applications to be stateless and horizontally scalable

Control Description

Stateless, horizontally scalable services are easier to replace, redistribute, and recover during infrastructure disruptions. They reduce dependence on individual instances and simplify elasticity.

Control Implementation

  1. Minimize instance-local state and move durable data to dedicated stores.
  2. Externalize configuration, session data, and service discovery concerns.
  3. Design deployment and runtime patterns that support safe multi-instance operation.
  4. Review stateful dependencies and justify where stateless design is not practical.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Services depend heavily on instance-local state and manual scaling.
  • Level 2: Defined: Stateless patterns are understood, but adoption is inconsistent.
  • Level 3: Managed: Critical applications are designed for safe horizontal scaling wherever practical.
  • Level 4: Measurable: Teams track scaling behavior, replacement success, and state-related failure patterns.
  • Level 5: Optimized: Stateless architecture is reinforced through platform defaults and continuous design review.

Control Recommendations

  1. Document justified stateful exceptions and their recovery expectations.
  2. Avoid hidden state in local caches, temporary files, or node affinity assumptions.
  3. Validate that deployments can replace instances without customer-visible disruption.

Implement circuit breakers and retries to handle transient faults

Control Description

Circuit breakers and controlled retry patterns help applications survive short-lived dependency failures without amplifying them into broader outages.

Control Implementation

  1. Identify remote dependencies where transient failure is common or high impact.
  2. Implement retries with bounded backoff, timeouts, and idempotency awareness.
  3. Use circuit breakers or equivalent protections to stop cascading failure.
  4. Observe and tune fault-handling behavior using production telemetry and testing.
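
A hedged sketch of steps 2 and 3 follows: bounded retries with exponential backoff and jitter, wrapped in a minimal circuit breaker that fails fast while a dependency is unhealthy. All thresholds and timings are illustrative; production services would usually rely on a maintained resilience library.

```python
# Bounded retries with backoff and jitter plus a minimal circuit
# breaker. Thresholds are illustrative; production code would normally
# use a maintained resilience library.
import random
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_after = reset_after_s
        self.opened_at = None          # monotonic time when tripped

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None      # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

def call_with_retry(fn, breaker: CircuitBreaker, attempts: int = 3):
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            breaker.record(success=False)
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter; base delay capped at 2 s.
            time.sleep(min(2.0, 0.1 * 2 ** attempt) * random.uniform(0.5, 1.5))
        else:
            breaker.record(success=True)
            return result
```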

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Applications rely on default client behavior or unbounded retries.
  • Level 2: Defined: Fault-handling patterns are documented, but implementation varies by team.
  • Level 3: Managed: Critical dependency paths use consistent retry and circuit-breaker patterns.
  • Level 4: Measurable: Teams track retry amplification, fallback use, and dependency recovery behavior.
  • Level 5: Optimized: Fault-tolerance patterns are standardized, tested, and tuned continuously.

Control Recommendations

  1. Pair retries with strict timeouts and request budgets.
  2. Prefer graceful degradation over repeated calls to an unhealthy dependency.
  3. Validate that retries do not worsen saturation during incidents.

Use health checks and load balancing to distribute traffic among instances

Control Description

Health-aware traffic distribution ensures that broken or degraded instances receive less traffic and healthy capacity is used effectively during both steady state and recovery.

Control Implementation

  1. Define readiness and liveness checks that reflect real application availability.
  2. Integrate health results with service discovery or load-balancing layers.
  3. Remove unhealthy instances automatically and reintroduce them only when safe.
  4. Review health-check sensitivity to avoid false positives and flapping.
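
Step 1 distinguishes liveness (should the process be restarted?) from readiness (can it serve useful traffic?). The sketch below wires both into a tiny HTTP handler; the dependency probe is a hypothetical placeholder and should stay cheap so the endpoint resists overload.

```python
# Minimal liveness/readiness handler. Readiness verifies a required
# dependency; the probe below is a hypothetical placeholder and should
# stay cheap so the endpoint resists overload.
from http.server import BaseHTTPRequestHandler, HTTPServer

def database_reachable() -> bool:
    return True    # placeholder: replace with a cheap dependency probe

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":
            self.send_response(200)    # process is up; restart not needed
        elif self.path == "/readyz":
            # Ready only if required dependencies can actually be used.
            self.send_response(200 if database_reachable() else 503)
        else:
            self.send_response(404)
        self.end_headers()

# HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```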

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Traffic continues to reach degraded instances with little control.
  • Level 2: Defined: Health checks exist, but they are simplistic or inconsistently wired into routing.
  • Level 3: Managed: Health-aware routing is consistently used for critical services.
  • Level 4: Measurable: Teams track false health signals, failover speed, and traffic distribution quality.
  • Level 5: Optimized: Health and routing behavior are continuously tuned using incident and test evidence.

Control Recommendations

  1. Make readiness checks validate dependencies that are required for useful service.
  2. Keep health endpoints lightweight and resistant to overload.
  3. Test the interaction between health checks, autoscaling, and deployment workflows.

Isolate application components to limit the impact of failures

Control Description

Isolation reduces the chance that one failing component, tenant, or workload can destabilize the whole service. It is a core blast-radius control for resilient application design.

Control Implementation

  1. Identify components whose failure or overload can cascade broadly.
  2. Separate those components using process, runtime, network, or tenancy boundaries as appropriate.
  3. Define resource limits and dependency boundaries that protect shared platforms.
  4. Reassess isolation design after major architecture or traffic changes.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Failures in one component often affect unrelated parts of the service.
  • Level 2: Defined: Isolation principles are known, but implementation is incomplete.
  • Level 3: Managed: Critical application components are deliberately isolated to reduce blast radius.
  • Level 4: Measurable: Teams track cross-component impact, noisy-neighbor issues, and containment effectiveness.
  • Level 5: Optimized: Isolation strategy is refined continuously using incidents, testing, and platform capabilities.

Control Recommendations

  1. Consider tenant isolation needs alongside technical component isolation.
  2. Use quotas and rate limits to protect shared dependencies.
  3. Review whether shared caches, messaging systems, or worker pools create hidden coupling.

Monitor application performance and error rates to identify potential issues

Control Description

Application-level telemetry reveals customer impact and dependency stress that infrastructure monitoring alone may miss. It provides the operational signal needed to detect degradation early and make better response decisions.

Control Implementation

  1. Instrument critical user journeys and service endpoints for latency, throughput, and error behavior.
  2. Define service-level indicators and alert thresholds for material degradation.
  3. Correlate application telemetry with dependency and infrastructure signals.
  4. Review performance regressions after releases and platform changes.
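
The service-level indicators in step 2 can be as simple as the fraction of good requests over a rolling window. The sketch below uses invented counts and an illustrative alert threshold; real values would come from the metrics pipeline.

```python
# Availability SLI over a rolling window: the fraction of good
# requests. Counts and the alert threshold are invented examples.
from collections import deque

class AvailabilitySLI:
    def __init__(self, window: int = 1000):
        self.events = deque(maxlen=window)    # True = good request

    def record(self, ok: bool) -> None:
        self.events.append(ok)

    def value(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 1.0

sli = AvailabilitySLI(window=100)
for i in range(100):
    sli.record(i % 25 != 0)                # 4% synthetic errors
print(f"availability: {sli.value():.2%}")  # 96.00%
if sli.value() < 0.99:                     # illustrative objective
    print("ALERT: availability below objective")
```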

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Application issues are mainly discovered through support tickets or user reports.
  • Level 2: Defined: Key metrics are known, but instrumentation and review are inconsistent.
  • Level 3: Managed: Critical applications are monitored with clear ownership and response expectations.
  • Level 4: Measurable: Teams track SLI health, regression frequency, and mean time to detect issues.
  • Level 5: Optimized: Performance monitoring is embedded into release, resilience, and product feedback loops.

Control Recommendations

  1. Track customer-impact metrics in addition to component-level counters.
  2. Keep telemetry cardinality under control so critical signals remain usable during incidents.
  3. Use traces where they materially improve cross-service diagnosis.

Data center and geographic redundancy

Regional and site-level failures remain high-impact events. Teams should design service placement, data replication, and traffic management so that critical workloads can survive location loss with understood tradeoffs.

Deploy infrastructure across multiple data centers or availability zones

Control Description

Distributing infrastructure across multiple failure domains reduces exposure to localized outages and maintenance events. It is a foundational control for highly available systems.

Control Implementation

  1. Identify critical workloads that require zonal or site redundancy.
  2. Deploy those workloads across independent availability zones or data centers.
  3. Validate that dependencies, configuration, and capacity are available in each location.
  4. Document the expected service behavior if one location becomes unavailable.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Critical workloads depend on a single location.
  • Level 2: Defined: Multi-location design exists for some services, but coverage is incomplete.
  • Level 3: Managed: Critical workloads are deployed across multiple failure domains with documented operation.
  • Level 4: Measurable: Teams track redundancy coverage, location drift, and failover readiness.
  • Level 5: Optimized: Placement strategy is refined continuously based on risk, growth, and real test results.

Control Recommendations

  1. Confirm that control-plane or management dependencies are not single-location bottlenecks.
  2. Ensure spare capacity exists to absorb traffic after a location loss.
  3. Reassess resilience when consolidating workloads onto shared platforms.

Use geo-replication to store data redundantly across different regions

Control Description

Geo-replication protects against regional data loss and supports cross-region recovery. It is especially important for services that cannot tolerate long restoration times from a single site.

Control Implementation

  1. Define which data sets require cross-region replication and the acceptable consistency tradeoffs.
  2. Enable replication using platform capabilities or controlled data pipelines.
  3. Monitor replication lag, integrity, and failure conditions.
  4. Include replica promotion or restore procedures in recovery documentation.
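
The lag monitoring in step 3 reduces to comparing the newest write marker at the primary with the newest marker applied at the replica. The sketch below is schematic: the fetch functions are placeholders for datastore-specific calls, and real checks would usually read the datastore's own replication metrics.

```python
# Schematic replication-lag check: compare the newest write marker at
# the primary with the newest marker applied at the replica. The fetch
# functions are placeholders for datastore-specific calls.
from datetime import datetime, timedelta, timezone

LAG_ALERT = timedelta(seconds=30)        # illustrative threshold

def primary_marker() -> datetime:        # placeholder
    return datetime.now(timezone.utc)

def replica_marker() -> datetime:        # placeholder
    return datetime.now(timezone.utc) - timedelta(seconds=12)

lag = primary_marker() - replica_marker()
print(f"replication lag: {lag.total_seconds():.0f}s")
if lag > LAG_ALERT:
    print("ALERT: replica falling behind; check channel health and capacity")
```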

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Critical data is stored primarily in one region.
  • Level 2: Defined: Geo-replication requirements are documented, but adoption is partial.
  • Level 3: Managed: Critical data sets are replicated across regions according to policy.
  • Level 4: Measurable: Teams track replication lag, integrity, and failover readiness.
  • Level 5: Optimized: Replication strategy is tuned continually for resilience, cost, and recovery performance.

Control Recommendations

  1. Validate application behavior when reading from or promoting replicated data.
  2. Protect replication channels and credentials with the same rigor as primary data paths.
  3. Review legal and sovereignty constraints before choosing replica regions.

Implement global load balancing to distribute traffic across data centers

Control Description

Global load balancing helps services steer traffic away from impaired regions and distribute demand across available capacity. It is a key control for multi-region availability.

Control Implementation

  1. Define how traffic should be routed across sites during normal and degraded operation.
  2. Configure global traffic-management mechanisms with health-aware failover behavior.
  3. Align DNS, certificate, and edge configurations with multi-location routing needs.
  4. Test routing policy changes and emergency steering procedures periodically.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Cross-site traffic steering is manual or limited.
  • Level 2: Defined: Global routing patterns are documented, but automation and testing are limited.
  • Level 3: Managed: Global load balancing is in place for services that require geographic resilience.
  • Level 4: Measurable: Teams track routing behavior, failover speed, and regional saturation risks.
  • Level 5: Optimized: Traffic steering is continuously improved using tests, telemetry, and resilience objectives.

Control Recommendations

  1. Keep health signals and routing decisions aligned to actual user experience where possible.
  2. Ensure failover does not exceed capacity or quota in the surviving regions.
  3. Review TTLs and propagation characteristics for DNS-based routing strategies.

Test failover processes between data centers to ensure smooth recovery

Control Description

Site failover tests validate both system design and the teams responsible for operating it. Without testing, geographic redundancy often contains hidden assumptions that only appear during real disruption.

Control Implementation

  1. Schedule failover exercises for services that depend on multi-site resilience.
  2. Practice traffic movement, data validation, and recovery-role coordination.
  3. Measure service impact, completion time, and rollback complexity.
  4. Use findings to improve architecture, automation, and documentation.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Data center failover is largely theoretical.
  • Level 2: Defined: Failover plans are documented, but tests are rare or narrow.
  • Level 3: Managed: Critical multi-site failover paths are tested on a defined cadence.
  • Level 4: Measurable: Teams track failover success, timing, and unresolved defects.
  • Level 5: Optimized: Failover exercises are realistic, repeatable, and closely tied to service objectives.

Control Recommendations

  1. Include control-plane, identity, and observability dependencies in failover tests.
  2. Test both planned and unplanned failover paths where feasible.
  3. Confirm that failback procedures are as well understood as failover procedures.

Regularly review and update data center redundancy strategies based on evolving needs

Control Description

Geographic resilience strategies should change as traffic patterns, products, dependency models, and risk tolerance change. A design that was adequate two years ago may no longer match current reality.

Control Implementation

  1. Review location strategy after major growth, acquisition, architecture, or regulatory changes.
  2. Compare current redundancy design to recovery objectives and observed incident patterns.
  3. Update placement, replication, or traffic-management approaches where gaps exist.
  4. Communicate changes to platform, service, and business stakeholders who depend on them.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Geographic resilience strategy changes only after major disruption.
  • Level 2: Defined: Review expectations exist, but updates are inconsistent.
  • Level 3: Managed: Redundancy strategy is reviewed regularly and updated with clear ownership.
  • Level 4: Measurable: Teams track review cadence, risk gaps, and improvement completion.
  • Level 5: Optimized: Geographic resilience is actively tuned as part of broader platform and business planning.

Control Recommendations

  1. Reevaluate concentration risk in shared providers and shared services.
  2. Include financial and operational tradeoffs when updating redundancy design.
  3. Document what level of service is expected during a full-region disruption.

Regular resilience testing and validation

Resilience claims should be backed by evidence. Testing and validation confirm whether controls work under realistic stress and whether teams can recover within the objectives they publish, including in scenarios driven by supplier failure, continuity decisions, and software delivery disruption.

Conduct regular disaster recovery and failover tests

Control Description

Disaster recovery tests validate that services, teams, and dependencies can recover from severe disruption. They prove whether recovery plans are operationally viable, not just documented.

Control Implementation

  1. Define disaster recovery scenarios for critical services and supporting platforms.
  2. Schedule tests based on business impact and dependency criticality.
  3. Measure recovery timing, coordination quality, and service usability after restoration.
  4. Track remediation for gaps uncovered during testing.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Disaster recovery is largely untested.
  • Level 2: Defined: Recovery tests are planned, but they are infrequent or incomplete.
  • Level 3: Managed: Critical services undergo regular recovery and failover validation.
  • Level 4: Measurable: Teams track recovery performance, scope coverage, and unresolved defects.
  • Level 5: Optimized: Disaster recovery testing is repeatable, evidence-based, and integrated with service governance.

Control Recommendations

  1. Include dependency loss and degraded external services in at least some scenarios.
  2. Test service usability after restore, not only infrastructure availability.
  3. Align test evidence with stated recovery objectives and stakeholder expectations.

Use chaos engineering techniques to simulate failures and test system resilience

Control Description

Controlled failure injection helps teams understand how systems behave under stress and whether resilience controls work as expected in live-like conditions.

Control Implementation

  1. Identify safe, high-value failure scenarios for controlled experimentation.
  2. Define blast-radius limits, success criteria, and rollback rules before each experiment.
  3. Run experiments against representative systems or environments with appropriate safeguards.
  4. Use results to improve architecture, automation, and operational procedures.
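
The blast-radius limits in step 2 can be enforced inside the injection mechanism itself. Below is a hedged sketch of a fault-injecting wrapper with a bounded failure rate and an explicit kill switch; the rates, delay, and wrapped call are invented experiment parameters.

```python
# Controlled fault injection: a wrapper that fails or slows a bounded
# fraction of calls, gated by an explicit kill switch. All parameters
# are illustrative experiment settings.
import random
import time

EXPERIMENT_ENABLED = True    # kill switch: set False to stop instantly
FAILURE_RATE = 0.05          # blast-radius limit: at most ~5% of calls
INJECTED_DELAY_S = 0.200     # simulated dependency slowness

def with_fault_injection(fn):
    def wrapper(*args, **kwargs):
        if EXPERIMENT_ENABLED and random.random() < FAILURE_RATE:
            time.sleep(INJECTED_DELAY_S)
            raise TimeoutError("injected fault (chaos experiment)")
        return fn(*args, **kwargs)
    return wrapper

@with_fault_injection
def fetch_profile(user_id: str) -> dict:   # hypothetical dependency call
    return {"user": user_id}
```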

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Failure injection is rarely used or treated as an exceptional activity.
  • Level 2: Defined: Chaos testing methods are documented, but the program is limited in scope.
  • Level 3: Managed: Teams run controlled resilience experiments on a planned cadence.
  • Level 4: Measurable: Teams track experiment coverage, observed weaknesses, and remediation follow-through.
  • Level 5: Optimized: Chaos practices are systematic, safe, and tightly linked to service resilience objectives.

Control Recommendations

  1. Start with low-risk dependency failures before progressing to wider scenarios.
  2. Use observability data to confirm whether the system behaved as expected.
  3. Avoid experiments that lack clear learning goals or safety boundaries.

Test backup and recovery processes to validate data integrity

Control Description

This control focuses specifically on validating that backup data remains usable, complete, and trustworthy. It complements broader disaster recovery testing by emphasizing the integrity of restored information.

Control Implementation

  1. Select representative data sets and recovery points for integrity testing.
  2. Restore data and verify completeness, correctness, and application compatibility.
  3. Test both routine and exceptional recovery paths.
  4. Capture discrepancies and remediate systemic causes.
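
The completeness check in step 2 can begin with digest comparison between source and restored data before deeper application-level validation. The paths in the sketch below are placeholder assumptions.

```python
# Compare digests of source and restored files as a first integrity
# gate; application-level validation should follow. Paths are
# placeholder assumptions.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths whose restored content is missing or
    does not match the source."""
    mismatches = []
    for src in source_dir.rglob("*"):
        if src.is_file():
            rel = src.relative_to(source_dir)
            restored = restored_dir / rel
            if not restored.is_file() or sha256_of(src) != sha256_of(restored):
                mismatches.append(str(rel))
    return mismatches

# Hypothetical usage after a test restore:
# print(verify_restore(Path("/data/live"), Path("/restore/test")))
```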

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Integrity is assumed based on backup completion alone.
  • Level 2: Defined: Integrity testing is planned, but depth and frequency vary.
  • Level 3: Managed: Critical backup sets are tested regularly for usable recovery.
  • Level 4: Measurable: Teams track integrity failures, restore confidence, and coverage gaps.
  • Level 5: Optimized: Integrity validation is automated where practical and refined through recovery evidence.

Control Recommendations

  1. Validate schema, permissions, and metadata as well as raw data content.
  2. Test the restoration of dependent configurations needed to use the data successfully.
  3. Retain evidence from integrity tests for audit and control review purposes.

Perform load and stress tests to identify capacity limits and potential bottlenecks

Control Description

Load and stress testing reveal how systems behave near and beyond expected operating conditions. They help teams understand graceful degradation, recovery behavior, and unsafe operating limits.

Control Implementation

  1. Define test scenarios for sustained load, sudden bursts, and overload conditions.
  2. Observe system behavior across application, infrastructure, and dependency layers.
  3. Record the thresholds where performance degrades or recovery becomes unstable.
  4. Use findings to improve scaling, traffic management, and capacity planning.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Stress behavior is discovered mainly in production.
  • Level 2: Defined: Load testing plans exist, but realism or cadence is limited.
  • Level 3: Managed: Critical services undergo regular load and stress validation.
  • Level 4: Measurable: Teams track performance envelopes, bottleneck recurrence, and regression risk.
  • Level 5: Optimized: Stress testing is used systematically to guide resilience and architecture decisions.

Control Recommendations

  1. Include recovery from overload, not only the onset of overload.
  2. Exercise upstream and downstream protections such as rate limits and queues.
  3. Keep test scenarios aligned with current traffic patterns and business events.

Use the results of testing to inform updates and improvements to infrastructure resilience

Control Description

Testing creates value only when the results change systems, plans, or priorities. This control ensures that evidence from validation activities is converted into real resilience improvement.

Control Implementation

  1. Record findings from resilience tests in a durable, reviewable format.
  2. Prioritize improvements based on business impact, recurrence risk, and implementation effort.
  3. Assign owners and deadlines to remediation actions.
  4. Re-test material fixes to confirm that risk has been reduced.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Test findings are noted informally, and follow-up is often left incomplete.
  • Level 2: Defined: Follow-up processes exist, but prioritization and ownership vary.
  • Level 3: Managed: Test results are routinely converted into tracked improvement work.
  • Level 4: Measurable: Teams track completion rates, repeat findings, and residual risk over time.
  • Level 5: Optimized: Testing outcomes directly shape resilience roadmaps, platform standards, and engineering priorities.

Control Recommendations

  1. Review open findings in operational governance forums until resolved or explicitly accepted.
  2. Distinguish systemic improvements from one-off fixes.
  3. Keep evidence that demonstrates closed-loop improvement for key resilience controls.

Documentation and Knowledge Sharing

Resilience improves when critical knowledge is documented, accessible, and shared across teams. In 2026, this includes architecture intent, runbooks, recovery steps, dependency maps, supplier context, governance decisions, and the operational context needed to act quickly.

Document architecture, processes, and best practices for cloud resilience

Control Description

Clear documentation helps teams understand how systems are supposed to work and how they should be operated under stress. It reduces reliance on tribal knowledge and shortens time to effective action.

Control Implementation

  1. Document critical architectures, dependencies, recovery processes, and resilience design decisions.
  2. Keep operational runbooks and ownership details close to the systems they describe.
  3. Define a review cadence for high-impact documentation.
  4. Retire or update obsolete material when systems or processes change.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Knowledge is mostly informal and scattered across individuals.
  • Level 2: Defined: Core documentation exists, but it is incomplete or inconsistently maintained.
  • Level 3: Managed: Critical resilience documentation is current, owned, and routinely used.
  • Level 4: Measurable: Teams track document freshness, coverage, and operational usefulness.
  • Level 5: Optimized: Documentation quality is continuously improved based on incidents, drills, and user feedback.

Control Recommendations

  1. Prioritize the systems whose failure would create the greatest operational disruption.
  2. Keep diagrams and narrative explanations aligned so they support real decision-making.
  3. Make documentation easy to update as part of engineering change.

Maintain a centralized knowledge base for easy access to documentation

Control Description

A centralized knowledge base improves discoverability and reduces the time spent searching across disconnected tools during incidents, audits, and operational reviews.

Control Implementation

  1. Choose a primary location for resilience-related documentation and operational references.
  2. Organize content using consistent naming, tagging, and ownership metadata.
  3. Define access controls that balance availability with sensitivity.
  4. Review searchability and navigation quality regularly.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Documentation is scattered across multiple tools and personal locations.
  • Level 2: Defined: A central repository exists, but adoption and structure are inconsistent.
  • Level 3: Managed: Teams use a shared knowledge base for critical resilience information.
  • Level 4: Measurable: Teams track usage, discoverability issues, and outdated-content trends.
  • Level 5: Optimized: The knowledge base is curated continuously and integrated into day-to-day operations.

Control Recommendations

  1. Link runbooks, diagrams, service ownership, and incident history where possible.
  2. Archive obsolete content rather than leaving its status ambiguous.
  3. Make sure responders can access critical content during an incident.

Regularly review and update documentation to reflect changes and improvements

Control Description

Outdated documentation can be as harmful as missing documentation. Regular review keeps operational knowledge aligned to the systems and processes teams actually run.

Control Implementation

  1. Define review intervals for critical documents based on impact and rate of change (a policy sketch follows this list).
  2. Trigger extra reviews after incidents, migrations, and major platform updates.
  3. Assign accountable owners for document maintenance.
  4. Track and remediate stale or conflicting content.

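One way to encode steps 1 and 2 is a small review policy that derives the next review date from an impact tier and pulls the review forward whenever a related incident or major change occurs. The tiers, intervals, and trigger rule below are illustrative assumptions, not benchmark requirements.

```python
"""Derive document review due dates from impact tier and incident triggers.

A minimal sketch; the tiers, intervals, and trigger rule are illustrative
assumptions.
"""
from datetime import date, timedelta

# Hypothetical policy: higher-impact documents get shorter review intervals.
REVIEW_INTERVAL_DAYS = {"critical": 90, "high": 180, "standard": 365}


def review_is_due(impact: str, last_reviewed: date,
                  incident_since_review: bool, today: date | None = None) -> bool:
    """A review is due when the interval has elapsed, or immediately after a
    related incident, migration, or major platform change."""
    today = today or date.today()
    if incident_since_review:
        return True
    interval = REVIEW_INTERVAL_DAYS.get(impact, 365)
    return today >= last_reviewed + timedelta(days=interval)


if __name__ == "__main__":
    # True: a critical doc last reviewed 2025-11-01 is past its 90-day interval.
    print(review_is_due("critical", date(2025, 11, 1),
                        incident_since_review=False, today=date(2026, 3, 1)))
```
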
Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Documentation is updated only when a visible problem occurs.
  • Level 2: Defined: Review expectations exist, but compliance is inconsistent.
  • Level 3: Managed: Critical documentation is reviewed and updated on a regular cadence.
  • Level 4: Measurable: Teams track freshness, overdue reviews, and content-quality issues.
  • Level 5: Optimized: Documentation updates are embedded into operational and engineering workflows.

Control Recommendations

  1. Make document review part of incident follow-up and change completion criteria.
  2. Flag high-risk stale documents prominently until corrected.
  3. Encourage brief, frequent updates over large, infrequent rewrites.

Encourage knowledge sharing and collaboration among team members

Control Description

Resilience improves when operational knowledge is shared broadly enough that critical work does not depend on a small number of individuals. Collaboration also makes it easier to spot gaps and standardize good practice.

Control Implementation

  1. Create regular forums for sharing incident learnings, architecture patterns, and operational practices.
  2. Pair teams across platform, security, and service domains on relevant resilience topics.
  3. Encourage peer review of runbooks, recovery plans, and high-impact changes.
  4. Reduce single-person ownership for critical operational knowledge areas (see the concentration-risk sketch after this list).

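The concentration risk mentioned in step 4, and in the Level 4 maturity signal below, can be tracked with something as simple as counting the distinct people who can cover each knowledge area. The area names and minimum-coverage threshold in this sketch are illustrative assumptions.

```python
"""Flag knowledge areas with single-person concentration risk.

A minimal sketch; area names and the minimum-coverage threshold are
illustrative assumptions.
"""
from collections import defaultdict

MIN_COVERAGE = 2  # hypothetical floor: at least two people per critical area


def concentration_risks(assignments: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Given (person, area) pairs, return areas covered by fewer than
    MIN_COVERAGE people, mapped to the people who currently hold that knowledge."""
    people_by_area: dict[str, set[str]] = defaultdict(set)
    for person, area in assignments:
        people_by_area[area].add(person)
    return {area: sorted(people)
            for area, people in people_by_area.items()
            if len(people) < MIN_COVERAGE}


if __name__ == "__main__":
    coverage = [("alice", "dns-failover"), ("bob", "dns-failover"),
                ("alice", "backup-restore")]
    print(concentration_risks(coverage))  # {'backup-restore': ['alice']}
```

The input pairs could come from an ownership map, on-call rosters, or runbook review history; the point is to make single-person dependencies visible before they become incidents.
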
Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Knowledge sharing depends on informal relationships and individual initiative.
  • Level 2: Defined: Teams value knowledge sharing, but practices are sporadic.
  • Level 3: Managed: Knowledge-sharing routines are part of normal operational life.
  • Level 4: Measurable: Teams track participation, coverage, and concentration risk in key knowledge areas.
  • Level 5: Optimized: Collaboration patterns are deliberate, cross-functional, and continuously improved.

Control Recommendations

  1. Use post-incident reviews as teaching moments for teams beyond those directly involved.
  2. Rotate operational responsibilities thoughtfully to broaden experience.
  3. Capture recurring questions and turn them into durable guidance.

Provide training and resources to help staff stay informed about resilience

Control Description

Training ensures that resilience practices stay current as platforms, threats, and operating models evolve. It helps teams apply the benchmark consistently rather than treating it as a static document.

Control Implementation

  1. Identify the resilience skills required for engineering, operations, security, and leadership roles.
  2. Provide structured learning resources, internal guidance, and practice opportunities.
  3. Refresh content as technologies, threats, and internal standards change.
  4. Measure participation and close critical learning gaps.

Control Maturity Levels

  • Level 1: Initial/Ad Hoc: Resilience knowledge depends mostly on self-directed learning.
  • Level 2: Defined: Training resources exist, but participation and content refreshes are uneven.
  • Level 3: Managed: Staff receive regular resilience training relevant to their roles.
  • Level 4: Measurable: Teams track training completion, capability gaps, and role coverage.
  • Level 5: Optimized: Training is continuously updated and reinforced through drills, reviews, and practical application.

Control Recommendations

  1. Blend formal training with drills, simulations, and practical exercises.
  2. Tailor content for both specialist responders and general engineering audiences.
  3. Review training gaps after incidents to keep the program grounded in real needs.