Synopsis
This chapter describes monitoring and service level objectives for multi-CDN operation. It defines user-facing indicators, explains stable aggregation and alerting, and outlines dashboards that connect routing decisions to outcomes. The aim is to detect harm early, confirm improvements with evidence, and support clear decisions during change and incident response.
Objectives and scope
Monitoring must show whether content is correct, whether latency and reliability meet commitments, and whether routing decisions help users. It must separate symptoms from causes, include regional and network context, and expose differences between providers. It should remain simple enough for on-call engineers to act without guesswork, yet detailed enough to support post-incident analysis.
Indicators and objectives
Service level indicators should reflect real user experience. Typical indicators include successful request ratio, time to first byte, complete page or API latency, and for media, startup delay and stall time. Objectives assign targets to these indicators over defined windows. Error budgets quantify allowable failure and provide a control on change velocity. Budgets should exist per provider and per major region so that localized regressions do not hide behind global averages.
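To make the budget concrete, the sketch below works through the arithmetic for a hypothetical 99.9 percent success-ratio objective over a 30-day window; every number here is illustrative, not a recommendation.

    # Error budget arithmetic for a success-ratio SLO (illustrative values).
    slo_target = 0.999              # 99.9% of requests must succeed in the window
    window_days = 30
    total_requests = 120_000_000    # hypothetical traffic for one provider+region

    budget_fraction = 1 - slo_target                     # 0.1% of requests may fail
    budget_requests = total_requests * budget_fraction   # 120,000 allowed failures

    failed_requests = 54_000        # observed failures so far in the window
    budget_consumed = failed_requests / budget_requests  # 0.45 -> 45% consumed

    print(f"Budget: {budget_requests:.0f} failures, consumed: {budget_consumed:.0%}")

Computing the same figures per provider and per major region, rather than only globally, is what keeps a localized regression from disappearing into the aggregate.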
Measurement domains
Evidence comes from several domains. Real user measurement reflects actual networks and devices. Synthetic probes provide controlled coverage and faster cadence for early warning. Edge telemetry shows cache status, protocol mix, TLS handshakes, and upstream selection. Origin telemetry shows authentication results, backend timings, and error codes. The routing control plane exposes health evaluations, policy choices, and versioned configuration. A coherent view requires joining these domains by time, region, ASN, and a stable request id.
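A minimal sketch of such a join, assuming invented field names (request_id, region, asn, cache_status, ttfb_ms) rather than any vendor schema:

    # Joining edge telemetry and RUM records on a shared request id.
    # Field names are assumptions for illustration, not a vendor schema.
    edge_events = [
        {"request_id": "r-1", "region": "eu-west", "asn": 3320, "cache_status": "HIT"},
        {"request_id": "r-2", "region": "eu-west", "asn": 3320, "cache_status": "MISS"},
    ]
    rum_events = [
        {"request_id": "r-1", "ttfb_ms": 80},
        {"request_id": "r-2", "ttfb_ms": 420},
    ]

    rum_by_id = {e["request_id"]: e for e in rum_events}
    joined = [
        {**edge, "ttfb_ms": rum_by_id[edge["request_id"]]["ttfb_ms"]}
        for edge in edge_events
        if edge["request_id"] in rum_by_id
    ]
    # Each joined row now carries both routing context and user-visible timing,
    # ready to be grouped by region and ASN.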
Definitions and aggregation
Stable definitions prevent spurious movement in the metrics. Latency should use medians or high percentiles with fixed sampling. Error ratios should exclude client aborts when the application defines them as non-actionable. Aggregation windows should align with steering cadence: DNS-based steering needs longer windows than a layer 7 proxy. Cold-cache and warm-cache behavior should be separated in synthetic tests. Protocol, device class, and object type should be recorded so that regressions do not hide behind mix shifts.
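One way to pin down a stable definition, assuming an invented sample structure, is a deterministic nearest-rank p95 that drops non-actionable client aborts before ranking:

    # Stable latency definition: nearest-rank p95 TTFB over a fixed window,
    # excluding client aborts the application marks as non-actionable.
    import math

    def p95(samples):
        """Nearest-rank p95: deterministic, no interpolation, stable across runs."""
        ordered = sorted(samples)
        rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
        return ordered[rank - 1]

    window = [
        {"ttfb_ms": 120, "client_abort": False},
        {"ttfb_ms": 480, "client_abort": True},   # excluded: user navigated away
        {"ttfb_ms": 95,  "client_abort": False},
        {"ttfb_ms": 300, "client_abort": False},
    ]
    actionable = [s["ttfb_ms"] for s in window if not s["client_abort"]]
    print(p95(actionable))  # 300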
Alerting
Alerts should page on user-facing symptoms first. Elevated error ratio, degraded time to first byte, or streaming stall rate should open incidents before probe-only metrics do. Multi-signal confirmation reduces false positives. Each alert must include the suspected scope by geography, ASN, and provider, and must link to a runbook and a pre-filtered dashboard. Rate limits, minimum durations, and cooldowns avoid alert storms. Cause-level alerts, such as provider health feed changes or purge failures, remain secondary and inform investigations.
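A sketch of the confirmation-and-cooldown logic, with assumed signal names and thresholds; an incident opens only when RUM and synthetic evidence agree and the cooldown has elapsed:

    # Multi-signal confirmation with a cooldown (thresholds are assumptions).
    import time

    COOLDOWN_S = 600
    _last_page = 0.0

    def should_page(rum_error_ratio, synthetic_error_ratio, now=None):
        """Page only on user-facing symptoms confirmed by a second signal."""
        global _last_page
        now = now if now is not None else time.time()
        symptom = rum_error_ratio > 0.02        # user-facing threshold (assumed)
        confirmation = synthetic_error_ratio > 0.02
        if symptom and confirmation and now - _last_page > COOLDOWN_S:
            _last_page = now
            return True
        return False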
Dashboards
Dashboards should answer three questions: Is the service healthy for users? What changed in routing and configuration? Where is the fault located? An overview page presents global health and error budget burn segmented by provider. An operator page presents per-region and per-ASN health with cache hit rate, upstream errors, and protocol mix. An investigation page correlates routing decisions with RUM and synthetic outcomes and annotates deployments and incidents.
Error budgets and policy
Error budgets connect reliability and change. Fast burn conditions require immediate reduction of change and may justify pinning traffic to a single provider. Slow burn conditions allow targeted rollback and continued experimentation at reduced scope. Budgets should be computed per region and provider, with a separate global view for executive reporting. Burn alerts should trigger at rates that reflect user harm, not minor variance.
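The fast-burn and slow-burn distinction can be expressed as burn rates, following the multi-window pattern common in SRE literature; the windows and thresholds below are illustrative, not prescriptive:

    # Burn-rate evaluation across a short and a long window (illustrative values).
    def burn_rate(error_ratio, slo_target):
        """How fast the budget burns relative to a service exactly on target."""
        return error_ratio / (1 - slo_target)

    def classify(short_window_ratio, long_window_ratio, slo_target=0.999):
        short = burn_rate(short_window_ratio, slo_target)
        long_ = burn_rate(long_window_ratio, slo_target)
        if short > 14 and long_ > 14:
            return "fast-burn: freeze change, consider pinning traffic"
        if short > 3 and long_ > 3:
            return "slow-burn: targeted rollback, reduce experiment scope"
        return "within budget"

    # 2% errors against a 99.9% target burns the budget 20x too fast.
    print(classify(short_window_ratio=0.02, long_window_ratio=0.016))

Requiring both windows to exceed the threshold is what distinguishes sustained user harm from minor variance.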
Normalisation and schema
Providers name fields differently, so a normalised schema avoids query drift and broken comparisons. Required fields include request id, route and provider, geography and ASN, protocol and TLS version, cache status, validator result, response class, and timing fields. A translation layer maps vendor-specific names to the schema at ingestion. The schema should be versioned and documented so that dashboards are reproducible.
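A sketch of such a translation layer, with invented vendor field names standing in for the real names documented in each provider's log reference:

    # Translation layer at ingestion. Vendor field names are placeholders.
    SCHEMA_VERSION = "1.0"

    FIELD_MAPS = {
        "vendor_a": {"reqId": "request_id", "geo": "region", "edgeStatus": "cache_status"},
        "vendor_b": {"id": "request_id", "clientRegion": "region", "cacheResult": "cache_status"},
    }

    def normalise(vendor, raw_event):
        """Map a vendor event onto the shared, versioned schema."""
        mapping = FIELD_MAPS[vendor]
        event = {schema_key: raw_event[vendor_key]
                 for vendor_key, schema_key in mapping.items()
                 if vendor_key in raw_event}
        event["schema_version"] = SCHEMA_VERSION
        return event

    print(normalise("vendor_a", {"reqId": "r-1", "geo": "eu-west", "edgeStatus": "HIT"}))

Stamping the schema version onto every event is what lets dashboards state exactly which definitions a given graph was built against.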
Correlation and traceability
Joining events across systems requires a stable id. A request id should be stamped at edge entry and propagated to the origin. Routing decisions should record the id and the inputs used. During incidents this makes it possible to verify whether a change in route preceded or followed a change in outcome. Annotations for deployments, purge operations, certificate rollovers, and policy updates provide context in graphs.
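A sketch of stamping and recording, assuming the conventional but non-standard X-Request-Id header:

    # Stamp an id at edge entry and record it with each routing decision.
    # The header name X-Request-Id is a common convention, assumed here.
    import uuid

    def stamp_request_id(headers):
        """Reuse an id if one arrived at the edge; otherwise mint one."""
        if "X-Request-Id" not in headers:
            headers["X-Request-Id"] = str(uuid.uuid4())
        return headers["X-Request-Id"]

    def record_routing_decision(request_id, provider, inputs):
        """Log the decision together with the id and the inputs that drove it."""
        return {"request_id": request_id, "provider": provider, "inputs": inputs}

    headers = {}
    rid = stamp_request_id(headers)
    decision = record_routing_decision(rid, "provider_a",
                                       {"health": "ok", "rtt_ms": 42})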
Telemetry health
Monitoring must monitor itself. Loss of a signal, delayed pipelines, and clock skew can mislead operators. Health checks for each feed and pipeline step should raise alerts that do not page the primary on-call unless user-facing alerting is at risk. When a required signal is stale, routing should fall back to a default path and record the fallback.
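A sketch of a staleness guard, with an assumed freshness threshold, that falls back to a default path and records that the fallback happened:

    # Staleness guard for a routing signal (threshold is an assumption).
    import time

    MAX_AGE_S = 120  # assumed freshness requirement for this feed

    def choose_route(signal_timestamp, preferred, default, log):
        """Use the preferred route only while its driving signal is fresh."""
        age = time.time() - signal_timestamp
        if age > MAX_AGE_S:
            log.append({"event": "stale_signal_fallback", "age_s": round(age)})
            return default
        return preferred

    log = []
    route = choose_route(time.time() - 300, "provider_a", "provider_default", log)
    # route == "provider_default"; the log records the fallback for later review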
Cost and retention
Retention policies balance analysis needs and storage costs. Raw logs support deep investigations for a short period. Aggregated series support trend analysis for longer periods. If cost pressure rises, sampling rates should be reduced before dimensions are dropped, because missing dimensions weaken investigations.
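One way to keep that ordering explicit is to encode the policy as data, so that sampling is the first knob under cost pressure and dimensions are never silently removed; the tiers and numbers below are illustrative:

    # Retention and sampling policy as data (tiers and values are illustrative).
    RETENTION_POLICY = {
        "raw_logs":   {"retention_days": 7,   "sample_rate": 1.0},
        "aggregates": {"retention_days": 395, "sample_rate": 1.0},
    }
    REQUIRED_DIMENSIONS = ["provider", "region", "asn", "protocol", "cache_status"]

    def apply_cost_pressure(policy, factor):
        """Cut sampling across tiers; REQUIRED_DIMENSIONS is never touched."""
        for tier in policy.values():
            tier["sample_rate"] = max(0.01, tier["sample_rate"] * factor)
        return policy

    apply_cost_pressure(RETENTION_POLICY, factor=0.5)  # halve sampling everywhere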
Privacy and residency
Telemetry can contain identifiers and location. Collection and retention must follow policy and law. Storage and processing locations should match residency requirements. Aggregation should minimize exposure by avoiding unnecessary personal data. When client hints or headers influence routing, only the minimal fields required for decisions and auditing should be retained.
Operations
Monitoring and dashboards require exercises. Quiet periods should include drills that disable a signal, roll a policy, and simulate a provider failure to confirm that alerts fire and that dashboards guide operators to the root cause. Definitions and queries should be reviewed and kept under version control so that future changes remain auditable.
Related chapters
For inputs and pipelines see /multicdn/signals-telemetry/. For routing behavior influenced by alerts and objectives see /multicdn/traffic-steering/. For cache effects that drive hit rate and latency see /multicdn/cache-consistency/.
Further reading
Site reliability engineering material on SLOs and error budgets provides background on objectives and burn policies. W3C performance timing specifications describe user visible measurements. Provider references should be consulted for field names and limits in logging and metrics exports.