Scaling a Live Stream to a Billion Viewers
A live broadcast turns one source into millions of simultaneous viewers in seconds. The hard part is not the video - it is keeping a viral spike from melting your origin. Here are the patterns that make it work.
flowchart LR
    B["broadcaster"] -->|"RTMPS"| EN["encode + segment<br/>multi-bitrate, ~1s, manifest"]
    EN -->|"push segments"| O["origin shield"]
    O -->|"1 coalesced fetch"| P["edge cache (POP)<br/>coalesce + cache-block"]
    V["millions of viewers"] -->|"many requests"| P
    P -->|"cached segment"| V
    classDef key fill:#e8f1fb,stroke:#1e1e1e,color:#1e1e1e
    class EN,P key
    linkStyle 2 stroke:#2383E2,stroke-width:2px,color:#2383E2
The problem: one source, millions of viewers, all at once
Live video is not video-on-demand with a clock attached. The defining difference is that demand is synchronized. With on-demand content, viewers arrive spread across time and a cache warms up gently. With a live event, a stream can go from a few hundred viewers to millions in seconds, and they all want the exact same bytes at the exact same moment.
That synchronization is the whole challenge. The bytes themselves are not the problem - a few seconds of compressed video is small. The problem is what happens on a cache miss at the moment a stream goes viral: if every one of a million near-simultaneous requests for the same not-yet-cached segment is forwarded to the origin, you have built a distributed denial-of-service attack against your own encoders. The design goal is to absorb a correlated spike of that shape without the origin ever seeing more than a trickle.
Make the stream cacheable: segment it and push it to the edge
The first move is to turn a continuous live feed into something a commodity cache can serve. The broadcaster sends the stream to a nearby edge location, which routes it to a central encoder. The encoder transcodes the feed into several bitrates and packages it into short segments - typically around a second each - described by a rolling manifest that lists the handful of segments currently watchable and updates as new ones are produced.
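As a rough illustration of the rolling manifest, here is a minimal Python sketch. The class names, the URL scheme, and the window of five segments are assumptions made for the example, not details of any particular packager; a real system would emit a standard playlist format instead of a plain list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    sequence: int      # monotonically increasing segment number
    bitrate_kbps: int  # rendition this segment belongs to
    duration_s: float  # about one second, per the text above

    @property
    def url(self) -> str:
        # Hypothetical URL scheme: unique per (bitrate, sequence), so a
        # published segment never changes and can be cached freely.
        return f"/live/{self.bitrate_kbps}k/seg-{self.sequence}.ts"


class RollingManifest:
    """Lists only the most recent `window` segments of one rendition."""

    def __init__(self, bitrate_kbps: int, window: int = 5) -> None:
        self.bitrate_kbps = bitrate_kbps
        # A bounded deque drops the oldest entry when a new one is appended.
        self._segments = deque(maxlen=window)

    def publish(self, sequence: int, duration_s: float = 1.0) -> Segment:
        segment = Segment(sequence, self.bitrate_kbps, duration_s)
        self._segments.append(segment)
        return segment

    def render(self) -> list:
        # The URLs a player may currently request, oldest first.
        return [segment.url for segment in self._segments]
```

Each call to `publish` slides the window forward by one segment, which is the "rolling" behavior: the manifest stays small no matter how long the broadcast runs.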
That packaging is what unlocks scale. A segment is just a small, immutable file with a stable URL, so it can be cached and served by an ordinary HTTP edge hierarchy, exactly like any static asset. Adaptive bitrate falls out of the same design: a client on a weak connection simply requests the lower-bitrate version of the next segment. The continuous, stateful problem of live video has been reframed as the well-understood problem of serving small cacheable files to a lot of people.
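The client side of adaptive bitrate can be sketched in a few lines. The bitrate ladder, the 0.8 safety factor, and the URL scheme below are illustrative assumptions; production players use more elaborate throughput and buffer heuristics.

```python
# Hypothetical bitrate ladder produced by the encoder, in kbit/s.
RENDITIONS_KBPS = [400, 1200, 2500, 5000]


def pick_rendition(measured_kbps: float, safety: float = 0.8) -> int:
    """Return the highest rendition that fits within a margin of measured throughput."""
    budget = measured_kbps * safety
    affordable = [r for r in RENDITIONS_KBPS if r <= budget]
    # If nothing fits, fall back to the lowest rendition instead of stalling.
    return max(affordable) if affordable else min(RENDITIONS_KBPS)


def next_segment_url(sequence: int, measured_kbps: float) -> str:
    # Switching bitrate is just asking for a different file for the same sequence number.
    return f"/live/{pick_rendition(measured_kbps)}k/seg-{sequence}.ts"
```

The server holds no per-viewer state here: the decision lives entirely in the client, and every rendition of every segment stays an ordinary cacheable file.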
Beat the thundering herd: coalesce requests and block on the miss
Segmentation makes the content cacheable, but it does not by itself solve the synchronized-miss problem. The single highest-leverage pattern is request coalescing. When many viewers request the same segment that the edge does not yet have, the edge does not forward all of them upstream. It sends one request, holds the rest behind a short cache-blocking timeout, and when that one fetch returns, it populates the cache and every waiting request is served from it. A million-request stampede becomes one origin fetch.
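The coalescing behavior described above can be sketched with threads standing in for viewer requests. This is a minimal illustration, not the implementation of any specific cache: `CoalescingCache`, `fetch_origin`, and the two-second default timeout are hypothetical, and raising an error when the timeout expires is one possible policy among several.

```python
import threading
import time


class CoalescingCache:
    """One upstream fetch per missing key; concurrent requesters wait for it."""

    def __init__(self, fetch_origin, block_timeout_s: float = 2.0) -> None:
        self._fetch_origin = fetch_origin          # hypothetical upstream call
        self._block_timeout_s = block_timeout_s    # cache-blocking timeout
        self._lock = threading.Lock()
        self._cache = {}     # key -> segment bytes
        self._inflight = {}  # key -> Event set when the leader's fetch ends

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            done = self._inflight.get(key)
            leader = done is None
            if leader:
                # First requester for this key becomes the leader.
                done = threading.Event()
                self._inflight[key] = done

        if leader:
            try:
                value = self._fetch_origin(key)
                with self._lock:
                    self._cache[key] = value
                return value
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                done.set()  # wake every waiting follower

        # Follower: block until the leader finishes or the timeout expires.
        if not done.wait(self._block_timeout_s):
            raise TimeoutError(f"gave up waiting for {key}")
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        raise LookupError(f"upstream fetch for {key} failed")


def demo(viewers: int = 200) -> int:
    """Fire many concurrent requests for one segment; return the origin call count."""
    origin_calls = []

    def slow_origin(key):
        origin_calls.append(key)
        time.sleep(0.05)  # simulate a slow upstream fetch
        return f"bytes-of-{key}".encode()

    cache = CoalescingCache(slow_origin)
    threads = [threading.Thread(target=cache.get, args=("seg-100",)) for _ in range(viewers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(origin_calls)
```

The leader writes the cache entry before it removes the in-flight marker, so a request that arrives at any point either hits the cache or joins the existing fetch. That ordering is what keeps the origin call count at one in this sketch.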
This is reinforced by a tiered cache hierarchy: viewer requests hit a nearby edge cache first, edges fall back to a smaller number of regional origin caches, and only those talk to the encoders. Each tier shields the next, and coalescing happens at every level, so the encoders see at most about one fetch per segment from each regional cache. Origin load is therefore bounded by the number of caches, not the number of viewers, and stays roughly flat with respect to audience size: a hundred viewers and a hundred million viewers look almost the same from the encoder's point of view, which is exactly the property you need.
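A small simulation makes the shielding effect concrete. The tier counts (20 edges, 4 regional caches) and the round-robin assignment of viewers to edges are assumptions chosen for the example, and the model is sequential: it assumes coalescing already collapses concurrent misses at each cache, so every cache fetches a given segment upstream at most once.

```python
class Tier:
    """A cache that asks its upstream only on the first request for a key."""

    def __init__(self, fetch_upstream) -> None:
        self._fetch_upstream = fetch_upstream
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._fetch_upstream(key)
        return self._cache[key]


def origin_fetches(viewers: int, edges: int = 20, regionals: int = 4) -> int:
    """Count encoder fetches for one segment requested by `viewers` viewers."""
    calls = 0

    def encoder(key):
        nonlocal calls
        calls += 1
        return b"segment"

    # Regional caches talk to the encoder; edges talk to a regional cache.
    regional_tiers = [Tier(encoder) for _ in range(regionals)]
    edge_tiers = [Tier(regional_tiers[i % regionals].get) for i in range(edges)]

    for viewer in range(viewers):
        edge_tiers[viewer % edges].get("seg-100")
    return calls
```

Under these assumptions `origin_fetches(100)` and `origin_fetches(100_000)` return the same small number, because the encoder only ever hears from the regional caches.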
Trade milliseconds for survivability
These patterns cost latency, and that trade is the central design decision. Segmenting into one-second chunks with a rolling manifest adds several seconds of glass-to-glass delay compared to a tightly coupled sub-second transport. For the overwhelming majority of live use cases, a few seconds behind real time is completely acceptable, and in exchange you get cacheability, commodity HTTP transport, and the coalescing that makes a billion-viewer spike survivable. It is the same instinct that shows up across streaming systems: choosing seconds over milliseconds is what lets you put a durable, cacheable layer in the middle.
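To see where the seconds come from, here is a back-of-the-envelope model. Every default below is an illustrative assumption, not a measurement, and the list of contributions is deliberately simplified.

```python
def glass_to_glass_s(
    segment_s: float = 1.0,      # segment duration, per the text above
    buffered_segments: int = 3,  # assumed: segments a player holds before starting
    encode_s: float = 1.0,       # assumed: transcode and packaging overhead
    manifest_ttl_s: float = 1.0, # assumed: worst-case staleness of a cached manifest
    network_s: float = 0.2,      # assumed: ingest plus delivery transit
) -> float:
    """Rough upper-bound estimate of delay from camera to screen, in seconds."""
    # A segment cannot be published until it has been fully produced.
    packaging = segment_s
    # The player stays a few segments behind the live edge to ride out jitter.
    player_buffer = segment_s * buffered_segments
    return encode_s + packaging + manifest_ttl_s + network_s + player_buffer
```

With these placeholder numbers the estimate lands at a little over six seconds, and most of it scales with segment duration. Shrinking the segment or the player buffer lowers the delay, at the cost of more requests and less tolerance for network jitter.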
The remaining decisions are about freshness and isolation. Manifest updates can use a simple time-to-live (easy, slightly stale) or a push mechanism (tighter latency, more moving parts), and you pick based on how interactive the stream needs to be. Regional cache isolation means a spike or a failure in one region stays local instead of propagating to the origin. And under genuine overload, graceful degradation - serving a lower bitrate, accepting a little more delay - keeps the stream up rather than letting it fall over.
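The time-to-live option for manifests might look like the sketch below. `TtlManifestCache`, `fetch_manifest`, and the one-second default are hypothetical names and values; the clock is injectable only so the behavior can be exercised without sleeping.

```python
import time


class TtlManifestCache:
    """Serve a cached manifest until it is older than `ttl_s`, then refetch."""

    def __init__(self, fetch_manifest, ttl_s: float = 1.0, clock=time.monotonic) -> None:
        self._fetch_manifest = fetch_manifest  # hypothetical upstream call
        self._ttl_s = ttl_s
        self._clock = clock
        self._value = None
        self._fetched_at = None  # None means nothing has been fetched yet

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl_s
        if expired:
            self._value = self._fetch_manifest()
            self._fetched_at = now
        return self._value
```

The trade is visible in the single `ttl_s` parameter: a viewer can see a manifest that is up to one TTL out of date, and in return the upstream receives at most one manifest fetch per TTL from each cache.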
What transfers to other systems
Very few of us build live-video platforms, but the shape of this problem is everywhere: a read-heavy system facing correlated demand for the same thing at the same instant - a hot key, a viral post, a flash sale, a celebrity tweet. The lessons port directly.
Find the durable, cacheable unit and make your content that. Put request coalescing in front of any expensive origin so a synchronized miss collapses to a single fetch rather than a stampede. Build a cache hierarchy where each tier shields the one behind it, and isolate regions so a local spike stays local. And be willing to spend a little latency to buy a lot of resilience. None of these are specific to video; they are the standard toolkit for surviving correlated load, and the live-streaming case is just the most extreme and most legible example of all of them at once.
Takeaways
- Live video's hard part is synchronized demand, not the bytes: a viral stream can go from hundreds to millions of viewers in seconds, and a single cache miss becomes a thundering herd on the origin.
- Segment the stream into short, immutable files with a rolling manifest so live video becomes ordinary cacheable HTTP, servable by a commodity edge hierarchy with adaptive bitrate.
- Request coalescing is the highest-leverage pattern: collapse concurrent requests for the same segment into one upstream fetch and hold the rest behind a cache-blocking timeout, turning a stampede into a single origin hit.
- A tiered cache hierarchy plus regional isolation keeps origin load roughly flat with audience size and stops a local spike or outage from propagating upstream.
- Choosing seconds of latency over milliseconds is what unlocks caching, coalescing, and commodity transport - the same seconds-not-milliseconds trade shows up across streaming systems, and the patterns transfer to any read-heavy system facing correlated demand.