Synopsis

This chapter provides standard playbooks for incidents in multi-CDN environments. It covers detection, triage, scoping by geography and network, isolation and reroute choices, change control during active events, communication, restoration, and post-incident analysis. The objective is to protect users first, keep changes reversible, and leave an audit trail that improves future responses.

Principles

Incident handling favors user outcomes over internal metrics. Actions modify the smallest scope that achieves protection. All changes must be reversible. Each action records who acted, what changed, and why. Telemetry drives decisions and distinguishes symptoms from causes. Providers are treated as interchangeable routes unless a risk register documents exceptions.

Detection and triage

Incidents enter through symptom alerts such as elevated error ratio, slow time to first byte, or streaming stall rates. Confirmation uses multiple signals that include real user measurement, synthetic probes, edge and origin logs, and any provider health feeds. Triage determines whether the event is limited to a region or network, whether it correlates with a recent deployment, and whether it aligns with known provider issues.
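
The sketch below shows one way to gate incident creation on agreement between independent signals, so that a single noisy source does not open an incident on its own. The thresholds and field names are illustrative assumptions, not values from this chapter.

```python
# Minimal sketch: confirm a symptom alert only when independent signals agree.
# Thresholds and field names (rum_error_ratio, synthetic_failure_ratio) are
# illustrative assumptions, not values taken from this chapter.
from dataclasses import dataclass

@dataclass
class SignalSnapshot:
    rum_error_ratio: float          # errors / requests from real user measurement
    synthetic_failure_ratio: float  # failed probes / total probes
    provider_reports_issue: bool    # from the provider health feed, if any

def confirm_incident(s: SignalSnapshot,
                     rum_threshold: float = 0.02,
                     synthetic_threshold: float = 0.05) -> bool:
    """Require at least two agreeing signals before opening an incident."""
    signals = [
        s.rum_error_ratio >= rum_threshold,
        s.synthetic_failure_ratio >= synthetic_threshold,
        s.provider_reports_issue,
    ]
    return sum(signals) >= 2

# Example: RUM and synthetics agree, so triage proceeds.
snapshot = SignalSnapshot(rum_error_ratio=0.04, synthetic_failure_ratio=0.08,
                          provider_reports_issue=False)
print(confirm_incident(snapshot))  # True
```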

Scoping

Scope is defined by provider, geography, and ASN. The smallest credible scope is preferred, for example a single region on one provider rather than a global provider block. Scoping uses dashboards that segment outcomes by provider, region, ASN, and protocol. Where scope is uncertain, a conservative block at the regional level applies until additional evidence narrows or widens the area.
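
A scope can be captured as a small record that names the provider, the region, and optionally the networks involved. The sketch below assumes that shape; the field names and example values are illustrative.

```python
# Minimal sketch of a scope record, assuming scope is expressed as
# provider + region + optional ASN list. Names and values are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentScope:
    provider: str                     # e.g. "cdn-a" (hypothetical label)
    region: str                       # e.g. "eu-west"
    asns: Optional[list[int]] = None  # None means the whole region

    def describe(self) -> str:
        target = f"ASNs {self.asns}" if self.asns else "all networks"
        return f"{self.provider} / {self.region} / {target}"

# Start conservative at the regional level, then narrow when evidence allows.
scope = IncidentScope(provider="cdn-a", region="eu-west")
print(scope.describe())         # cdn-a / eu-west / all networks
scope.asns = [64500, 64501]     # evidence narrows the blast radius
print(scope.describe())         # cdn-a / eu-west / ASNs [64500, 64501]
```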

Decision model

Incidents follow a simple decision model. If the issue is within the application or origin, steering changes will not help and the playbook shifts to origin protection. If the issue is provider specific and user harm exceeds thresholds, traffic is pinned to an alternate provider within the affected scope. If both providers degrade, reduce the feature set, prefer stale over error where correctness allows, and protect origin capacity.

```mermaid
flowchart TD
    A[Symptom alert] --> B{Provider correlated?}
    B -- no --> ORI[Investigate origin or app]
    B -- yes --> SCOPE[Define scope: region and ASN]
    SCOPE --> HARM{User harm above threshold?}
    HARM -- no --> OBS[Observe, raise sampling, notify]
    HARM -- yes --> REROUTE[Pin to alternate provider]
    REROUTE --> VERIFY[Verify outcomes in scope]
    VERIFY --> STABLE{Improved and stable?}
    STABLE -- yes --> COMM[Communicate and monitor]
    STABLE -- no --> ESC[Widen scope or rollback]
```
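
The same decision model can be expressed as a small function that maps the two triage answers to a next action. The sketch below assumes boolean inputs derived from the segmented dashboards; the action names are illustrative.

```python
# A minimal sketch of the decision model above. The enum values and the
# harm_above_threshold flag are illustrative; real inputs would come from
# the segmented dashboards described earlier.
from enum import Enum, auto

class Action(Enum):
    INVESTIGATE_ORIGIN = auto()
    OBSERVE_AND_NOTIFY = auto()
    PIN_TO_ALTERNATE = auto()

def decide(provider_correlated: bool, harm_above_threshold: bool) -> Action:
    """Choose the next playbook step from two triage answers."""
    if not provider_correlated:
        return Action.INVESTIGATE_ORIGIN   # origin or application problem
    if not harm_above_threshold:
        return Action.OBSERVE_AND_NOTIFY   # raise sampling, keep watching
    return Action.PIN_TO_ALTERNATE         # scoped reroute, then verify

print(decide(provider_correlated=True, harm_above_threshold=True))
# Action.PIN_TO_ALTERNATE
```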

Isolation and reroute

Isolation changes routing only within the scoped area. Options include pinning traffic to an alternate provider, removing a provider from candidate lists for the scope, or reducing exposure to a candidate that remains partially healthy. If stickiness is in use for media sessions, new sessions move first and existing sessions remain unless error rates justify forced migration. DNS steering uses resolver based controls and shortens answer TTLs within safe limits. Proxy steering changes per request policy and can react faster, but it must be observed for cache and origin effects.
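
For proxy steering, the scoped change can be expressed as a per request policy. The sketch below assumes a hypothetical policy shape and a stickiness rule that moves new sessions first; it is not a specific vendor API.

```python
# Minimal sketch of a scoped reroute policy for a request-level steering proxy.
# The policy shape and the session stickiness rule are assumptions for
# illustration, not a specific vendor API.
from dataclasses import dataclass

@dataclass
class SteeringPolicy:
    scope_region: str
    scope_asns: set[int]
    alternate_provider: str
    migrate_existing_sessions: bool = False   # new sessions move first

def choose_provider(policy: SteeringPolicy, region: str, asn: int,
                    default_provider: str, has_session: bool) -> str:
    """Return the provider for one request under the scoped policy."""
    in_scope = region == policy.scope_region and asn in policy.scope_asns
    if not in_scope:
        return default_provider
    if has_session and not policy.migrate_existing_sessions:
        return default_provider   # keep sticky media sessions where they are
    return policy.alternate_provider

policy = SteeringPolicy("eu-west", {64500}, alternate_provider="cdn-b")
print(choose_provider(policy, "eu-west", 64500, "cdn-a", has_session=False))  # cdn-b
print(choose_provider(policy, "eu-west", 64500, "cdn-a", has_session=True))   # cdn-a
```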

Protecting origin and caches

Reroute decisions can change cache hit rate and origin load. Before widening scope, confirm shield and origin capacity, enable stale-if-error and stale-while-revalidate where configured, and prefer soft purge over hard purge. For dynamic APIs, reduce allowed concurrency toward the origin if a load surge is observed. For large static objects, verify that range requests remain served from cache on the remaining provider.
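
The stale-if-error and stale-while-revalidate directives come from RFC 5861. The sketch below composes a Cache-Control value that prefers stale over error during an incident; the numeric windows are illustrative.

```python
# Minimal sketch: compose Cache-Control directives that allow serving stale
# content during the incident. The directive names are standard (RFC 5861);
# the numeric windows here are illustrative, not recommended values.
def incident_cache_control(max_age: int = 60,
                           swr: int = 300,
                           sie: int = 3600) -> str:
    """Build a Cache-Control header value that prefers stale over error."""
    return (f"max-age={max_age}, "
            f"stale-while-revalidate={swr}, "
            f"stale-if-error={sie}")

print(incident_cache_control())
# max-age=60, stale-while-revalidate=300, stale-if-error=3600
```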

Change control during incidents

All incident changes use a fast path that still records the diff and the scope. Each change includes an identifier that links to dashboards with prefiltered segments. Manual overrides expire automatically after a defined window unless renewed. This prevents long lived drift that outlives the incident.
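
A manual override can be modeled as a record that carries the identifier, scope, diff, actor, and reason, and that expires unless renewed. The sketch below assumes a one hour default window; the fields and values are illustrative.

```python
# Minimal sketch of an auto-expiring manual override record. The fields and
# the one hour default window are assumptions for illustration.
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Override:
    change_id: str   # links to dashboards with prefiltered segments
    scope: str       # e.g. "cdn-a / eu-west / AS64500"
    diff: str        # what changed, in reviewable form
    actor: str       # who acted
    reason: str      # why
    expires_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc) + timedelta(hours=1))

    def is_expired(self, now: Optional[datetime] = None) -> bool:
        return (now or datetime.now(timezone.utc)) >= self.expires_at

    def renew(self, hours: int = 1) -> None:
        """Renewal is explicit; otherwise the override lapses on its own."""
        self.expires_at = datetime.now(timezone.utc) + timedelta(hours=hours)

ov = Override("INC-123-chg-1", "cdn-a / eu-west / AS64500",
              diff="pin new sessions to cdn-b", actor="oncall",
              reason="elevated 5xx in scope")
print(ov.is_expired())   # False until the window passes without renewal
```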

Communication

Communication distinguishes user impact, scope, and current actions. Internal updates go to engineering, support, and leadership channels. External updates follow public status policies when uptime commitments apply. Provider contacts receive concise evidence that includes regions, ASNs, error types, timestamps, and a representative set of request ids. Commitments from providers are recorded with times for the next update.
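
Provider evidence travels best as a compact, machine readable bundle. The sketch below assumes a hypothetical JSON layout; the field names and sample values are illustrative.

```python
# Minimal sketch of the evidence bundle handed to a provider contact.
# The JSON layout and field names are illustrative assumptions.
import json
from datetime import datetime, timezone

def provider_evidence(regions, asns, error_types, request_ids, window):
    """Assemble concise, machine-readable evidence for a provider ticket."""
    return json.dumps({
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "regions": regions,
        "asns": asns,
        "error_types": error_types,
        "time_window": window,                   # start and end timestamps
        "sample_request_ids": request_ids[:20],  # a representative subset
    }, indent=2)

print(provider_evidence(["eu-west"], [64500],
                        ["504", "tls_handshake_failure"],
                        ["req-001", "req-002"],
                        ["2024-01-01T10:00:00Z", "2024-01-01T10:30:00Z"]))
```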

Verification

Each isolation step is followed by verification that compares user facing indicators before and after the change. Improvement must be visible in the scoped region and ASN subsets. Absence of improvement triggers rollback of the last step and a change in hypothesis. Verification also checks cache hit rate, origin error rate, and protocol mix to avoid secondary harm.

```mermaid
sequenceDiagram
    participant Det as Detection
    participant Ctrl as Routing control
    participant CDN as Provider edges
    participant Mon as Monitoring
    Det->>Ctrl: Request pin to alternate in region X
    Ctrl->>CDN: Apply scoped policy change
    Mon-->>Det: Segment outcomes by provider and ASN
    alt Improvement
        Det->>Ctrl: Maintain change and widen cautiously
    else No improvement
        Det->>Ctrl: Roll back change and reassess
    end
```
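
Verification can be reduced to a before and after comparison within the scoped segment. The sketch below assumes a single error ratio indicator and an illustrative minimum relative improvement of twenty percent.

```python
# Minimal sketch of the verification step: compare a user facing indicator
# before and after the change within the scoped segment, and require a clear
# improvement before keeping the change. The 20 percent minimum is an
# illustrative assumption.
def verify_step(before_error_ratio: float, after_error_ratio: float,
                min_relative_improvement: float = 0.2) -> str:
    """Return 'keep', or 'rollback' when improvement is absent or too small."""
    if before_error_ratio <= 0:
        return "keep"                   # nothing to improve on
    improvement = (before_error_ratio - after_error_ratio) / before_error_ratio
    return "keep" if improvement >= min_relative_improvement else "rollback"

print(verify_step(before_error_ratio=0.06, after_error_ratio=0.01))   # keep
print(verify_step(before_error_ratio=0.06, after_error_ratio=0.055))  # rollback
```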

Restoration

Restoration removes temporary controls when providers recover. Restoration proceeds in reverse order of changes. First unpin new sessions, then remove scoped blocks, and finally restore default routing. Each step waits for verification in the same segmented views used during isolation. Restoration avoids sudden full flips that would cause cold caches and origin spikes.
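
Reverse order restoration behaves like unwinding a stack of change identifiers, verifying each step before taking the next. The sketch below assumes a verification callback; the identifiers are placeholders.

```python
# Minimal sketch: restore by undoing incident changes in reverse order,
# verifying each step before the next. The verify callback and change
# identifiers are illustrative placeholders.
from typing import Callable

def restore(applied_changes: list[str], verify: Callable[[str], bool]) -> list[str]:
    """Undo changes last-in-first-out; stop and reassess if verification fails."""
    restored = []
    for change_id in reversed(applied_changes):
        if not verify(change_id):
            break                 # hold here rather than flip everything at once
        restored.append(change_id)
    return restored

applied = ["INC-123-chg-1", "INC-123-chg-2", "INC-123-chg-3"]
print(restore(applied, verify=lambda change_id: True))
# ['INC-123-chg-3', 'INC-123-chg-2', 'INC-123-chg-1']
```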

Special scenarios

Partial TLS failures require certificate and protocol checks on the affected provider before route changes. Purge and cache consistency failures require verification of purge APIs, propagation delays, and validator behavior. Regional routing anomalies may require BGP or transit analysis and temporary routing constraints that keep traffic within stable paths.
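
A certificate and protocol check against the affected provider can be run with the standard library before any route change. The sketch below assumes the edge presents the production certificate when connected by hostname; testing a specific provider edge may require connecting to its address with the hostname supplied for SNI.

```python
# Minimal sketch: inspect the certificate and TLS version served to clients.
# The hostname used in the commented call is a placeholder; pointing the check
# at one provider's edge may require connecting to that edge's address while
# still passing the production hostname for SNI and verification.
import socket
import ssl
from datetime import datetime, timezone

def check_certificate(hostname: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Return the served certificate's expiry, subject, and TLS version."""
    context = ssl.create_default_context()   # verifies chain and hostname
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
            protocol = tls.version()          # e.g. 'TLSv1.3'
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return {"subject": cert.get("subject"),
            "not_after": not_after.isoformat(),
            "protocol": protocol}

# print(check_certificate("www.example.com"))   # run against the affected edge
```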

Records and post-incident review

The incident record includes the trigger, scope, timeline of actions, verification evidence, provider communications, and restoration steps. The review identifies the primary cause, contributing factors, detection and response gaps, and any differences between providers that increased risk. Follow up work includes playbook changes, alert tuning, and provider parity fixes. Reviews are time bound and tracked to closure.

Templates

Playbook templates include a provider outage template, a regional degradation template, a cache collapse template, and a license or key service outage template for media. Each template lists default thresholds, scoped actions, verification queries, and restoration steps. Templates live with configuration in version control and receive updates after each review.
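
A template can be kept as structured data next to the routing configuration. The sketch below assumes a hypothetical shape carrying the four lists named above; the example thresholds and steps are illustrative.

```python
# Minimal sketch of a playbook template as data kept in version control.
# The field names and example values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PlaybookTemplate:
    name: str
    default_thresholds: dict[str, float]
    scoped_actions: list[str]
    verification_queries: list[str]
    restoration_steps: list[str]

regional_degradation = PlaybookTemplate(
    name="regional-degradation",
    default_thresholds={"error_ratio": 0.02, "ttfb_p95_ms": 1500},
    scoped_actions=["pin new sessions to the alternate provider in scope"],
    verification_queries=["error_ratio by provider, region, asn"],
    restoration_steps=["unpin new sessions", "remove scoped block",
                       "restore default routing"],
)
print(regional_degradation.name)
```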

Drills

Preparedness requires drills that simulate provider loss, regional packet loss, purge failures, and certificate rollover mistakes. Drills run during staffed hours with clear success criteria and record the same artefacts as real incidents. Findings feed directly into changes to thresholds, dashboards, and provider contacts.

For routing policy and precedence see /multicdn/traffic-steering/. For monitoring and alerting that drive decisions see /multicdn/monitoring-slos/. For cache behavior under stress see /multicdn/cache-consistency/. For origin protections see /multicdn/origin-architecture/.

Further reading

Site reliability engineering material on incident management provides background on roles and timelines. Provider status and postmortem archives help anticipate common failure modes. Public BGP repositories and looking glass tools assist with regional routing investigations.