Synopsis

This chapter explains how to collect, process, and apply telemetry for multi-CDN routing. It covers real user measurements, synthetic probes, provider health and routing data, logs from the service stack, and the aggregation and alerting that turn signals into safe decisions. The goal is to make routing reflect user experience and to change paths only when evidence supports a better outcome.

Measurement goals

All measurement should support a small set of goals. Confirm that users receive correct content with acceptable latency and reliability. Detect faults and degradations fast enough to protect users. Provide data that is stable enough for routing but sensitive enough to catch regressions. Keep the cost and complexity of the system proportional to its value.

Data sources

Real user measurement records performance from actual sessions. It captures last mile conditions and device effects that synthetic tests miss. It requires sampling, privacy controls, and careful aggregation so that a few heavy clients do not dominate results.

Synthetic probes provide controlled measurements from known vantage points. They are easy to reason about and can run on tight schedules. They do not see last mile variance and can be biased by probe networks and cache warmth.

Provider health feeds and status pages can be useful as advisory inputs, but they are not a source of truth.

Logs and metrics from the edge, origin, and application stack provide the ground truth for correctness and outcomes such as error rates, time to first byte, and completion rates.

Routing and internet data such as BGP updates or looking glass outputs can help explain regional anomalies and inform safeguards.
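One way to keep a few heavy clients from dominating RUM results is to cap how many samples each client contributes before summarizing. The sketch below is illustrative only: the `(client_id, latency_ms)` tuple shape, the function name, and the default cap are assumptions, not part of any specific RUM product.

```python
from collections import defaultdict
from statistics import median


def capped_median(samples, max_per_client=5):
    """Median latency with a per-client cap on contributed samples.

    `samples` is an iterable of (client_id, latency_ms) pairs in arrival
    order. Both the shape and the default cap are assumptions made for
    this sketch.
    """
    seen = defaultdict(int)
    kept = []
    for client_id, latency_ms in samples:
        # Keep only the first `max_per_client` samples from each client.
        if seen[client_id] < max_per_client:
            seen[client_id] += 1
            kept.append(latency_ms)
    return median(kept)
```

With one client sending 100 slow samples and three clients sending one fast sample each, an uncapped median reflects only the heavy client, while a cap of 2 lets the other clients influence the result.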

Coverage and vantage points

Coverage should reflect where users are. Probe regions and networks should map to the top user populations and to networks that often deviate from averages. For RUM, sampling should be high enough to provide timely data in busy regions and still produce usable results in smaller regions. For synthetic tests, a mix of global clouds and regional providers avoids a single footprint bias.
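A simple coverage check compares the share of probes in each region with the share of users there. The following is a minimal sketch under assumed inputs: both dictionaries map a region label to a fraction of the total, and the `tolerance` ratio is an arbitrary illustrative threshold.

```python
def coverage_gaps(user_share, probe_share, tolerance=0.5):
    """Return regions whose probe share is below tolerance * user share.

    Regions are returned in descending order of user share so the
    largest under-covered populations come first.
    """
    gaps = []
    for region, share in sorted(user_share.items(), key=lambda kv: -kv[1]):
        if probe_share.get(region, 0.0) < tolerance * share:
            gaps.append(region)
    return gaps
```

The same comparison can be keyed by network instead of region to find ASNs that carry many users but few probes.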

Data quality and sampling

Measurements are noisy. Metric definitions must be consistent and each client or probe version should be recorded. RUM sampling must meet privacy and storage limits while still yielding statistically meaningful results per region and network. For synthetic tests, cold cache and warm cache runs should be separated to avoid mixing effects. Probes must be validated for clock skew and network throttling. Outliers should be removed only with a documented and stable method.
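Sampling is easier to audit when the decision for a given session is deterministic. The sketch below hashes a session identifier into a bucket and compares it with the sampling rate. The function name, the `salt` value, and the use of SHA-256 are assumptions chosen for illustration; any stable hash would serve.

```python
import hashlib


def in_sample(session_id: str, rate: float, salt: str = "rum-v1") -> bool:
    """Deterministically decide whether a session is sampled.

    The same session_id and salt always give the same answer, so a
    session is either fully in or fully out of the sample.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    # Compare as integers so rate=1.0 always samples and rate=0.0 never does.
    return bucket < int(rate * 2**64)
```

Changing the salt reshuffles which sessions are selected, which can help when a sample needs to be rotated without changing the rate.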

Aggregation and windows

Routing should operate on time windows that smooth noise without hiding real changes. Windows should align with the pace of the steering layer. DNS based steering needs longer windows because changes propagate slowly. A layer 7 proxy can use shorter windows but still needs protection against flapping. Window lengths and summary statistics that drive decisions should be published so engineers can reason about behavior. Medians are robust to outliers for latency and throughput. Higher percentiles, such as p95, are useful when tied to clear thresholds.
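A sliding time window with a median summary can be sketched as follows. The class name, the `min_samples` guard, and the assumption that samples arrive in timestamp order are all illustrative choices, not a prescribed design.

```python
from collections import deque
from statistics import median


class WindowedMedian:
    """Median over samples from the last `window_s` seconds.

    Assumes samples are added in non-decreasing timestamp order.
    """

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, ts: float, value: float) -> None:
        self.samples.append((ts, value))
        self._evict(ts)

    def _evict(self, now: float) -> None:
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()

    def value(self, now: float, min_samples: int = 3):
        """Return the median, or None when too few samples remain."""
        self._evict(now)
        if len(self.samples) < min_samples:
            return None
        return median(v for _, v in self.samples)
```

Returning `None` when the window is thin lets the caller fall back to a default instead of acting on one or two data points. A longer `window_s` would suit DNS based steering, a shorter one a layer 7 proxy.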

Anomaly detection and alerting

Alerting should prioritize user-facing outcomes. Error rate and time to first byte anomalies should page operators before internal probe scores do. Multi-signal confirmation reduces false positives. Alerts should include suspected scope by geography, ASN, and CDN so responders can apply targeted controls. Stability requires rate limits, minimum durations, and cooldown periods.
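Minimum durations and cooldown periods can be combined in a small guard that sits between detection and action. This is a sketch under stated assumptions: the class name, its parameters, and the idea that the caller supplies a single boolean `degraded` signal are all hypothetical.

```python
class SwitchGuard:
    """Allow a routing switch only after sustained degradation,
    and not more often than once per cooldown period."""

    def __init__(self, min_bad_s: float, cooldown_s: float):
        self.min_bad_s = min_bad_s      # how long degradation must persist
        self.cooldown_s = cooldown_s    # minimum gap between switches
        self.bad_since = None
        self.last_switch = None

    def should_switch(self, now: float, degraded: bool) -> bool:
        if not degraded:
            self.bad_since = None       # recovery resets the timer
            return False
        if self.bad_since is None:
            self.bad_since = now
        if now - self.bad_since < self.min_bad_s:
            return False                # not degraded for long enough yet
        if self.last_switch is not None and now - self.last_switch < self.cooldown_s:
            return False                # still cooling down from the last switch
        self.last_switch = now
        self.bad_since = None
        return True
```

For example, with `SwitchGuard(min_bad_s=30, cooldown_s=300)`, degradation first seen at t=0 permits a switch at t=30, and any further switch is held back until at least t=330.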

Control plane and pipelines

Telemetry flows into a control plane that computes routing decisions and policies. This control plane should be treated like production software: it needs versioned schemas and idempotent updates, and its observability should expose the inputs, decisions, and outputs applied to each steering layer. When expected update intervals are exceeded, missing data must trigger safe defaults: a known route and a clear stale marking.
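The safe default for missing or stale data can be sketched as a freshness check in front of the decision. The `Signal` fields, the scoring convention (higher is better), and the function name are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    cdn: str
    score: float        # hypothetical health score, higher is better
    updated_at: float   # timestamp of the last update, in seconds


def choose_route(signals, now, max_age_s, default_cdn):
    """Return (cdn, stale). Falls back to default_cdn when no signal is fresh."""
    fresh = [s for s in signals if now - s.updated_at <= max_age_s]
    if not fresh:
        # Nothing arrived within the expected interval: use the known
        # route and mark the decision as stale.
        return default_cdn, True
    best = max(fresh, key=lambda s: s.score)
    return best.cdn, False
```

This sketch ignores stale signals when at least one fresh signal exists. A stricter policy could treat any stale input as a reason to fall back.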

Privacy and compliance

RUM can include user identifiers and header data that require careful handling. Collect only what is needed for routing and SLOs. Aggregate at the coarsest granularity that still supports decisions, such as region and ASN, and avoid storing unnecessary personal data. Storage and processing locations should be documented, with regional controls applied when required.
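Aggregating to region and ASN can be sketched as a group-by that discards every other field and suppresses small groups. The beacon field names (`region`, `asn`, `ttfb_ms`) and the `min_count` threshold are assumptions for illustration, not a compliance recommendation.

```python
from collections import defaultdict
from statistics import median


def aggregate(beacons, min_count=3):
    """Summarize beacons per (region, asn), dropping all other fields.

    Groups with fewer than `min_count` beacons are omitted so that a
    summary row never describes only one or two users.
    """
    groups = defaultdict(list)
    for b in beacons:
        # Only the grouping keys and the metric are read; identifiers
        # such as user ids never reach the output.
        groups[(b["region"], b["asn"])].append(b["ttfb_ms"])
    return {
        key: {"count": len(values), "median_ttfb_ms": median(values)}
        for key, values in groups.items()
        if len(values) >= min_count
    }
```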

Instrumentation at edge and origin

Correctness requires instrumentation across the path. At the edge, record request and response metadata, cache status, and upstream selection. At the origin, record authentication results, response codes, and backend timings. Requests should be correlated across systems with stable request ids. The ability to trace a request from user to origin and back is essential when a routing change reduces cache hit rate or exposes an origin bottleneck.
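Correlating edge and origin records by request id amounts to a join on that id. The sketch below assumes both log streams are already parsed into dictionaries. The field names (`request_id`, `cache_status`, `upstream`, `status`, `backend_ms`) are hypothetical.

```python
def correlate(edge_logs, origin_logs):
    """Join edge records to origin records on request_id.

    Edge records with no origin match (for example cache hits) are
    kept, with origin fields set to None.
    """
    origin_by_id = {r["request_id"]: r for r in origin_logs}
    joined = []
    for e in edge_logs:
        o = origin_by_id.get(e["request_id"])
        joined.append({
            "request_id": e["request_id"],
            "cache_status": e["cache_status"],
            "upstream": e["upstream"],
            "origin_status": o["status"] if o else None,
            "backend_ms": o["backend_ms"] if o else None,
        })
    return joined
```

Grouping the joined rows by `upstream` and `cache_status` is one way to see whether a routing change lowered the hit rate or shifted load onto a slow backend.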

Validation and correlation

Every change to routing should be validated against user visible metrics. RUM and synthetic data should be correlated so differences are understood. A synthetic improvement that does not move RUM may indicate a last mile effect or a cache effect. Routing decisions should be correlated with cache hit rates and origin load to catch unintended side effects.
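A before and after comparison of RUM and synthetic medians can flag the case where only synthetic data improves. This is a rough sketch: the 5% threshold, the label strings, and the function names are assumptions, and a real evaluation would also account for sample size and time of day.

```python
from statistics import median


def relative_change(before, after):
    """Relative change in median latency; negative means faster."""
    b, a = median(before), median(after)
    return (a - b) / b


def classify_change(rum_before, rum_after, syn_before, syn_after, threshold=0.05):
    """Label a routing change by whether RUM confirms a synthetic improvement."""
    rum = relative_change(rum_before, rum_after)
    syn = relative_change(syn_before, syn_after)
    if rum <= -threshold:
        return "confirmed by RUM"
    if syn <= -threshold:
        # Synthetic improved but users did not: suspect a last mile
        # effect or a cache effect.
        return "synthetic-only improvement"
    return "no improvement"
```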

Storage and retention

Raw data should be stored briefly to support deep investigations, with aggregated views retained longer for trend analysis and capacity planning. Schemas should remain stable and documented. If cost is an issue, reduce sampling rates before dropping essential dimensions such as region or ASN.

Operations

Telemetry systems should be exercised during quiet periods. Loss of a signal should trigger a safe routing fallback. Dashboards should be reproducible from documented queries. Telemetry health must be part of incident playbooks so responders can trust the data used for decisions.

flowchart LR
  RUM[Real user measurement] --> AGG[Aggregator]
  SYN[Synthetic probes] --> AGG
  HEALTH[Provider health feeds] --> AGG
  BGP[BGP and routing data] --> AGG
  LOGS[Edge, origin, app logs] --> AGG
  AGG --> SLO[SLO evaluation]
  AGG --> DEC[Decision engine]
  DEC --> POL[Policy store]
  POL --> DNS[DNS steering]
  POL --> L7[L7 proxy]
  POL --> CLI[Client logic]

For the overview see /multicdn/. For design choices at each layer see /multicdn/architecture-patterns/. For policy design and rollout see /multicdn/traffic-steering/.

Further reading

W3C Resource Timing, Navigation Timing, and Server Timing describe standard RUM signals. RFC 9110 defines HTTP semantics that apply to timing and correctness. Public BGP data sets and looking glass tools can aid in diagnosing regional routing issues.