Synopsis

This chapter explains how to design and operate traffic steering for multi-CDN delivery. It covers inputs and data quality, common policy types, precedence rules, health and failover, stability controls, cost awareness, and rollout practices.

Inputs and data

Effective policies need inputs that reflect user experience and provider health. Synthetic measurements provide controlled and repeatable data but can miss last-mile conditions. Real user measurements (RUM) capture actual paths and networks but require sampling, privacy controls, and careful aggregation. Health signals from providers are useful but should not be trusted without verification. Logs and metrics from the service stack provide the ground truth for outcomes.

Policy types

Geography- or ASN-based policies direct users to preferred CDNs in each region or network. Latency and throughput policies select the provider with the best recent performance. RUM-driven policies use data from real sessions to guide choices where variance is high. Cost-aware policies prefer the lowest-cost option that still meets performance objectives. Policies can be combined, but they need clear precedence rules so the result is predictable.

Precedence and conflicts

A simple order of evaluation reduces ambiguity. A common pattern is to enforce required constraints first, such as jurisdiction or allowlist rules, then consider health, then performance, then cost. The order should be published and kept stable. When two CDNs have similar scores, the tie-break should be stable rather than flip with every small change in score. Minimum dwell times and hysteresis reduce oscillation.
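The evaluation order and a non-flapping tie-break can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Candidate` fields, the `choose` function, and the 15 ms margin are all hypothetical. Candidates within the margin of the best latency are treated as tied, the current provider is kept if it is among them (hysteresis), and cost only decides between tied candidates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    allowed: bool       # passes required constraints (jurisdiction, allowlist)
    healthy: bool       # result of independent health checks
    latency_ms: float   # recent performance score, lower is better
    cost_per_gb: float  # assumed cost figure from a cost model


def choose(candidates: List[Candidate],
           current: Optional[str] = None,
           margin_ms: float = 15.0) -> Optional[str]:
    """Apply constraints, then health, then performance, then cost."""
    # 1 and 2: required constraints and health are hard filters.
    eligible = [c for c in candidates if c.allowed and c.healthy]
    if not eligible:
        return None  # caller is expected to fall back to the default path

    # 3: performance. Scores within margin_ms of the best count as a tie.
    best = min(c.latency_ms for c in eligible)
    tied = [c for c in eligible if c.latency_ms - best <= margin_ms]

    # Hysteresis: stay on the current provider while it remains in the tie.
    for c in tied:
        if c.name == current:
            return c.name

    # 4: cost breaks the tie; the name makes the result deterministic.
    return min(tied, key=lambda c: (c.cost_per_gb, c.name)).name
```

A minimum dwell time is not shown here; it would be an additional check on the time of the last switch before the function is allowed to return a different provider.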

Health and failover

Health checks and external probes detect outages and severe degradation. Provider status feeds are advisory and should be confirmed with independent probes. When a provider is marked unhealthy, that state should be held long enough to protect users from rapid toggling. A manual override must exist for critical incidents and its use must be recorded.
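One way to hold an unhealthy state and record manual overrides is sketched below. The class name, the failure threshold, and the hold-down duration are assumptions chosen for illustration; probe results are expected to come from independent checks rather than from a provider status feed.

```python
class HealthState:
    """Health of one provider, with hold-down and an audited manual override."""

    def __init__(self, fail_threshold=3, hold_down_s=300.0):
        self.fail_threshold = fail_threshold  # consecutive failures to trip
        self.hold_down_s = hold_down_s        # seconds to stay unhealthy
        self.consecutive_failures = 0
        self.hold_until = None                # None means healthy
        self.override = None                  # None, "force_healthy", "force_unhealthy"
        self.audit_log = []                   # (time, operator, value, reason)

    def record_probe(self, ok, now):
        if ok:
            self.consecutive_failures = 0
            # Recover only once the hold-down period has fully elapsed.
            if self.hold_until is not None and now >= self.hold_until:
                self.hold_until = None
        else:
            self.consecutive_failures += 1
            tripped = self.consecutive_failures >= self.fail_threshold
            # While unhealthy, any further failure restarts the hold-down.
            if tripped or self.hold_until is not None:
                self.hold_until = now + self.hold_down_s

    def set_override(self, value, operator, reason, now):
        # Every use of the override is recorded for later review.
        self.override = value
        self.audit_log.append((now, operator, value, reason))

    def is_healthy(self):
        if self.override == "force_healthy":
            return True
        if self.override == "force_unhealthy":
            return False
        return self.hold_until is None
```

Passing `now` explicitly keeps the sketch deterministic and easy to test; a real deployment would likely read a monotonic clock instead.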

Stability and safety

Routing that changes too often harms caches and user experience. Change rate should be controlled with dampening, percent limits, and staged rollouts. Small steps are preferred, with verification before scope increases. A default path that is known to work must exist when measurements are missing or inconclusive.
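A percent limit on change rate might look like the sketch below, which moves traffic shares toward a target by a bounded step per evaluation cycle. The function name `next_weights` and the 5% default step are hypothetical. All shares move by the same fraction of their remaining distance, so the total stays constant.

```python
def next_weights(current, target, max_step=0.05):
    """Move traffic shares toward target; no share changes by more than max_step.

    current and target map provider name to share of traffic (each sums to 1.0).
    """
    names = set(current) | set(target)
    deltas = {n: target.get(n, 0.0) - current.get(n, 0.0) for n in names}
    largest = max((abs(d) for d in deltas.values()), default=0.0)

    # Close enough: land exactly on the target.
    if largest <= max_step:
        return {n: target.get(n, 0.0) for n in names}

    # Otherwise scale every delta by the same factor so the sum is preserved.
    scale = max_step / largest
    return {n: current.get(n, 0.0) + scale * deltas[n] for n in names}
```

When measurements are missing or inconclusive, the caller could pass the known-good default weights as `target`, so traffic drifts back to the default path at the same bounded rate.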

Cost-aware routing

Costs vary by region, traffic type, and contract terms. A cost model accurate enough for routing decisions should be maintained. Cost should be applied only after performance and health constraints are satisfied. Guard rails ensure that savings do not degrade user experience.
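The ordering described above, with cost applied last and a guard rail on regression, can be illustrated with a short sketch. The `Offer` fields, the SLO value, and the `max_regression_ms` guard rail are assumptions for the example, not values from any particular contract or provider.

```python
from typing import List, NamedTuple, Optional


class Offer(NamedTuple):
    name: str
    healthy: bool
    p95_ms: float       # recent 95th percentile latency
    cost_per_gb: float  # from the cost model for this region and traffic type


def pick_by_cost(offers: List[Offer],
                 slo_ms: float = 250.0,
                 max_regression_ms: float = 40.0) -> Optional[str]:
    """Pick the cheapest provider among those that are healthy and meet the SLO."""
    usable = [o for o in offers if o.healthy and o.p95_ms <= slo_ms]
    if not usable:
        return None  # cost is not considered when nothing meets the objectives

    # Guard rail: a cheaper option may not be much slower than the fastest one.
    fastest = min(o.p95_ms for o in usable)
    guarded = [o for o in usable if o.p95_ms - fastest <= max_regression_ms]

    return min(guarded, key=lambda o: (o.cost_per_gb, o.p95_ms, o.name)).name
```

Widening `max_regression_ms` trades user experience for savings, so it is the parameter that most deserves review when the cost model changes.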

Implementation approaches

DNS providers can implement geography or ASN policies and limited health weighting. Layer 7 proxies can evaluate complex rules per request and react quickly. Client-side logic can reflect last-mile conditions and prefer the best endpoint for a given user. Mixed approaches are common, but the interaction between layers must be clear and observable.

Measurement pitfalls

Measurements can be biased by resolver behavior, cache warmth, and uneven sample sizes. Cold and warm cache tests should be separated. Measurement windows should be long enough to be stable but short enough to detect problems. Policy changes should be validated against user-visible metrics such as page load or video start time, not only probe scores.
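A minimal aggregation sketch that respects these points is shown below. It keeps cold and warm cache samples in separate groups, looks only at a recent window, and declines to report a score when a group has too few samples. The tuple layout, the function name `summarize`, and the default thresholds are assumptions.

```python
from collections import defaultdict
from statistics import median


def summarize(samples, now, window_s=300.0, min_samples=30):
    """Median latency per (cdn, cache_state) over a recent window.

    samples: iterable of (timestamp, cdn, cache_state, latency_ms) tuples,
    where cache_state is "cold" or "warm". Groups with fewer than
    min_samples points map to None instead of an unreliable number.
    """
    groups = defaultdict(list)
    for ts, cdn, cache_state, latency_ms in samples:
        if now - window_s <= ts <= now:
            groups[(cdn, cache_state)].append(latency_ms)

    return {
        key: (median(values) if len(values) >= min_samples else None)
        for key, values in groups.items()
    }
```

Returning `None` for thin groups lets the steering logic treat the measurement as inconclusive rather than act on a handful of data points.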

Rollout and verification

New rules should be introduced behind flags. A small percentage of sessions should be pinned to the new path and outcomes compared to control. Promotion should be gradual, with the ability to revert. Decisions and results should be recorded to build a history of observed effects.
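Pinning a small percentage of sessions to a new rule is often done with deterministic hashing, so a session stays in the same group across requests. The sketch below assumes a string session identifier and a hypothetical rule name used as a salt; neither comes from a specific product.

```python
import hashlib


def in_rollout(session_id: str, rule: str, percent: float) -> bool:
    """Return True if this session is pinned to the new rule.

    percent is in the range 0 to 100. Salting the hash with the rule name
    gives each rule an independent assignment of sessions.
    """
    digest = hashlib.sha256(f"{rule}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Because the bucket for a session never changes, raising `percent` only adds sessions to the new path and setting it to 0 reverts everyone, which fits gradual promotion with the ability to revert.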

flowchart TD
    Start[Request] --> Guard[Constraints met]
    Guard -- no --> Default[Use default path]
    Guard -- yes --> Health{Healthy?}
    Health -- no --> Alt[Fail over]
    Health -- yes --> Perf{Meets SLO?}
    Perf -- yes --> Chosen[Use preferred CDN]
    Perf -- no --> Cost{Within cost limits?}
    Cost -- yes --> Cheap[Select cheaper CDN]
    Cost -- no --> Chosen

See the overview at /multicdn/. For inputs and data, read /multicdn/signals-telemetry/. For architecture choices, see /multicdn/architecture-patterns/.

Further reading

RFC 9110 for HTTP semantics. W3C Resource Timing and Navigation Timing for RUM signals.