System Design: Video Streaming UI
You've Watched a Million Videos. Now Build the Player.
Every frontend engineer has used YouTube, Netflix, or Twitch. But almost nobody understands what actually happens between clicking "play" and seeing pixels move on screen. The video player is one of the most complex UI components you'll ever build — it juggles network requests, binary data decoding, real-time buffer management, adaptive quality switching, accessibility, and platform-specific quirks, all while looking effortless.
This is a full RADIO framework case study. By the end, you'll be able to whiteboard the architecture of a production video streaming UI in a system design interview — and actually build one.
Think of a video player like a restaurant kitchen during dinner rush. The manifest file is the menu — it tells you what dishes (quality levels) are available. The segment fetcher is the waiter running orders to the kitchen (network). The buffer is the prep counter where partially-ready dishes wait. The decoder is the chef assembling final plates. And the renderer is the server carrying the finished dish to your table (screen). If the waiter is slow (bad network), the chef has nothing to cook, and you sit there staring at a spinner. Adaptive bitrate is the kitchen switching from filet mignon to a burger when the waiter can't keep up — you still get fed, just at lower fidelity.
The RADIO Framework
We'll walk through each layer of RADIO — Requirements, Architecture, Data Model, Interface, Optimizations — to design a video streaming UI from scratch.
R — Requirements
Before writing a single line of code, you need to lock down what you're building and how well it needs to work. This is where most candidates fumble in interviews — they jump straight to components without clarifying scope.
Functional Requirements
These are the things a user can do:
- Video playback — play, pause, seek, volume, mute, playback speed
- Custom controls — fully custom UI overlaying the native video element (YouTube-style, not browser defaults)
- Quality selection — manual quality picker (Auto, 1080p, 720p, 480p, 360p) with adaptive bitrate as the default
- Subtitles and captions — multiple language tracks, toggleable, customizable appearance (font size, color, background)
- Picture-in-Picture — floating mini-player when scrolling away or switching tabs
- Chapters and timeline preview — chapter markers on the progress bar, thumbnail preview on hover/scrub
- Theater mode and fullscreen — toggle between normal, theater (wider), and fullscreen layouts
- Mini-player — persistent small player in corner when navigating away from the video page
- Comments — threaded comments below the video (VOD), or live chat alongside the player (live streams)
- Keyboard shortcuts — spacebar for play/pause, arrow keys for seek, f for fullscreen, m for mute, c for captions
Non-Functional Requirements
These are how well the system performs:
- Buffer startup time — under 2 seconds from click-to-play on a 4G connection
- Smooth quality switching — no visible stutter or gap when ABR changes quality levels
- Keyboard accessible — every control reachable via keyboard, visible focus indicators, screen reader announcements for state changes
- Responsive — works on mobile (touch controls, swipe to seek), desktop (mouse hover controls), and TV (d-pad navigation, 10-foot UI)
- Resilient — graceful degradation on slow networks (lower quality, not a crash), meaningful error states ("Video unavailable" not a blank screen)
- Memory efficient — no unbounded buffer growth, proper cleanup on unmount (revoke object URLs, abort pending fetches, detach MediaSource)
A — Architecture
Here's where you decompose the system into components with clear responsibilities. The key insight is separating the video engine from the UI controls — they change for completely different reasons and at different rates.
Component Tree
VideoPlayerRoot
├── VideoEngine (headless — no UI)
│ ├── ManifestParser (HLS .m3u8 / DASH .mpd)
│ ├── SegmentFetcher (HTTP range requests)
│ ├── BufferManager (MediaSource + SourceBuffer)
│ ├── ABRController (bandwidth estimation + quality switching)
│ └── SubtitleEngine (WebVTT / TTML parsing)
│
├── VideoSurface
│ ├── HTMLVideoElement (actual rendering)
│ └── SubtitleOverlay (positioned captions)
│
├── CustomControls
│ ├── PlayPauseButton
│ ├── ProgressBar
│ │ ├── BufferIndicator
│ │ ├── ChapterMarkers
│ │ └── ThumbnailPreview (on hover)
│ ├── VolumeControl (slider + mute toggle)
│ ├── TimeDisplay (current / duration)
│ ├── PlaybackSpeedSelector
│ ├── QualitySelector
│ ├── SubtitleSelector
│ ├── PiPButton
│ ├── TheaterModeButton
│ └── FullscreenButton
│
├── MiniPlayer (portal to corner of viewport)
└── CommentsSection / LiveChat
Why This Separation Matters
The VideoEngine is a headless state machine. It has zero UI. It manages the actual video pipeline — fetching manifests, downloading segments, feeding them to MediaSource, estimating bandwidth, switching quality levels. You could swap the entire UI layer without touching the engine, or reuse the engine in a different product (embedded player, mobile webview, TV app).
The CustomControls layer is purely presentational. It reads state from the engine (is the video playing? what's the current time? which quality levels are available?) and dispatches commands back to it (play, pause, seek to 45s, switch to 720p). This is the classic command/query separation pattern.
// The engine exposes a clean interface — no UI concerns
interface VideoEngine {
play(): void
pause(): void
seek(time: number): void
setQuality(level: QualityLevel | 'auto'): void
setPlaybackRate(rate: number): void
setSubtitleTrack(trackId: string | null): void
getState(): VideoState
subscribe(listener: (state: VideoState) => void): () => void
destroy(): void
}
When you separate the engine from the UI, you can explain that the same engine powers the main player, the mini-player, and an embedded iframe player — just with different control overlays. This shows the interviewer you think about reuse and scalability.
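The subscribe/getState pair is what keeps every UI surface in sync. Here's a minimal sketch of that mechanism — a plain listener set over the VideoState shape defined in the Data Model section (createStateStore is an illustrative name, not a library API):

```ts
function createStateStore(initial: VideoState) {
  let state = initial
  const listeners = new Set<(s: VideoState) => void>()

  return {
    getState: () => state,
    // The engine calls setState internally; the UI never mutates state directly
    setState(patch: Partial<VideoState>) {
      state = { ...state, ...patch }
      listeners.forEach((listener) => listener(state))
    },
    // Returns an unsubscribe function, matching the VideoEngine interface
    subscribe(listener: (s: VideoState) => void) {
      listeners.add(listener)
      return () => {
        listeners.delete(listener)
      }
    },
  }
}
```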
State Machine
The video player has well-defined states. Modeling them as a state machine prevents impossible transitions (you can't seek a video that hasn't loaded yet).
IDLE → LOADING → READY → PLAYING ⇌ PAUSED
PLAYING / PAUSED → BUFFERING → (back to the previous state)
PLAYING → ENDED
any state → ERROR
Every UI element derives from this state. The play button shows a play icon in PAUSED, a pause icon in PLAYING, a spinner in BUFFERING, and is disabled in IDLE or ERROR. No ad-hoc boolean flags — just one source of truth.
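One way to make the "no impossible transitions" rule concrete is a transition table — a sketch using the PlaybackStatus union from the Data Model section (the exact set of allowed transitions is an assumption; tune it to your product):

```ts
const transitions: Record<PlaybackStatus, PlaybackStatus[]> = {
  idle: ['loading'],
  loading: ['ready', 'error'],
  ready: ['playing', 'error'],
  playing: ['paused', 'buffering', 'ended', 'error'],
  paused: ['playing', 'buffering', 'error'],
  buffering: ['playing', 'paused', 'error'],
  ended: ['playing'], // replay restarts playback
  error: [],          // terminal until the player is re-created
}

function transition(current: PlaybackStatus, next: PlaybackStatus): PlaybackStatus {
  if (!transitions[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} → ${next}`)
  }
  return next
}
```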
D — Data Model
Let's define the data structures flowing through the system.
Video Metadata
interface VideoMetadata {
id: string
title: string
description: string
duration: number
thumbnailUrl: string
thumbnailSpriteUrl: string
manifestUrl: string
subtitleTracks: SubtitleTrack[]
chapters: Chapter[]
publishedAt: string
viewCount: number
isLive: boolean
}
interface SubtitleTrack {
id: string
language: string
label: string
url: string
isDefault: boolean
}
interface Chapter {
startTime: number
endTime: number
title: string
thumbnailUrl?: string
}
Playback State
type PlaybackStatus =
| 'idle'
| 'loading'
| 'ready'
| 'playing'
| 'paused'
| 'buffering'
| 'ended'
| 'error'
interface VideoState {
status: PlaybackStatus
currentTime: number
duration: number
bufferedRanges: Array<{ start: number; end: number }>
volume: number
isMuted: boolean
playbackRate: number
activeQuality: QualityLevel | 'auto'
availableQualities: QualityLevel[]
activeSubtitleTrack: string | null
isFullscreen: boolean
isPiP: boolean
error: MediaError | null
}
interface QualityLevel {
height: number
width: number
bitrate: number
codec: string
label: string
}
Manifest Structure
This is what HLS and DASH manifests boil down to after parsing:
interface ParsedManifest {
type: 'hls' | 'dash'
isLive: boolean
duration: number
levels: QualityLevel[]
segments: Map<QualityLevel, Segment[]>
}
interface Segment {
index: number
url: string
duration: number
byteRange?: { start: number; end: number }
}
HLS vs DASH — what actually differs?
Both HLS and DASH solve the same problem — adaptive bitrate streaming over HTTP. The differences are mostly in format and ecosystem:
HLS (HTTP Live Streaming) was created by Apple. It uses .m3u8 text playlists (a master playlist pointing to per-quality media playlists) and .ts (MPEG-2 Transport Stream) or .fmp4 (fragmented MP4) segments. It's natively supported in Safari and iOS, and supported everywhere else via hls.js, which uses the Media Source Extensions API.
DASH (Dynamic Adaptive Streaming over HTTP) is the ISO standard. It uses .mpd XML manifests (Media Presentation Description) and .fmp4 segments. It's not natively supported in any browser — you always use a library like dash.js or Shaka Player.
CMAF (Common Media Application Format) is the bridge — it defines a single segment format (fMP4 with CENC encryption) that works with both HLS and DASH manifests. This means you encode your video once and serve it to both protocols, cutting storage and encoding costs in half.
In practice, most large platforms use CMAF segments with both HLS and DASH manifests, and their players auto-detect which protocol to use based on the browser.
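A sketch of that detection step — the capability checks are the standard ones hls.js and dash.js rely on, though the pickProtocol wrapper is illustrative:

```ts
function pickProtocol(video: HTMLVideoElement): 'native-hls' | 'mse' | 'unsupported' {
  // Safari and iOS answer 'maybe'/'probably' here — hand them the .m3u8 URL directly
  if (video.canPlayType('application/vnd.apple.mpegurl')) {
    return 'native-hls'
  }
  // Everywhere else, hls.js or dash.js builds the stream on top of MSE
  if ('MediaSource' in window) {
    return 'mse'
  }
  return 'unsupported'
}
```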
| Feature | HLS | DASH |
|---|---|---|
| Created by | Apple (2009) | MPEG / ISO (2012) |
| Manifest format | .m3u8 (text playlist) | .mpd (XML) |
| Segment format | .ts or .fmp4 | .fmp4 (always) |
| Native browser support | Safari, iOS, macOS | None |
| JS library needed | hls.js (other browsers) | dash.js or Shaka Player |
| Low-latency variant | LL-HLS (Apple, 2019) | LL-DASH |
| DRM support | FairPlay (Apple), Widevine, PlayReady via fMP4 | Widevine, PlayReady, ClearKey |
| Segment duration | Typically 6s (LL-HLS: sub-second parts) | Typically 2-6s |
| Market share | Dominant (iOS + default fallback) | Growing (Android, Smart TVs) |
| CMAF compatible | Yes (with fMP4 segments) | Yes (native fMP4) |
The Streaming Pipeline
Here's the end-to-end flow of how a video goes from server to screen:
Manifest fetch → parse quality levels → ABR picks a level → segment download → demux → buffer append (MSE SourceBuffer) → browser decode → render to screen
I — Interface (APIs and Contracts)
The interfaces define how components talk to each other and to the outside world.
Media Source Extensions (MSE)
The MediaSource API is what makes adaptive streaming possible in the browser. Instead of giving the video element a static URL, you create a MediaSource object and programmatically feed it binary data:
// Point the video element at a MediaSource instead of a static file URL
const mediaSource = new MediaSource()
const video = document.querySelector('video')
video.src = URL.createObjectURL(mediaSource)

mediaSource.addEventListener('sourceopen', () => {
  // The codec string must match how the segments were actually encoded
  const sourceBuffer = mediaSource.addSourceBuffer(
    'video/mp4; codecs="avc1.42E01E, mp4a.40.2"'
  )
  // fetchSegment is an illustrative helper that resolves to an ArrayBuffer
  fetchSegment('/segment-001.m4s').then((data) => {
    sourceBuffer.appendBuffer(data)
  })
})
This is what libraries like hls.js do under the hood. They parse the HLS manifest, decide which segments to download based on bandwidth, fetch them, and pipe them into SourceBuffer.
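One wrinkle the snippet above glosses over: appendBuffer is asynchronous, and calling it again while sourceBuffer.updating is true throws an InvalidStateError. Production players queue appends and flush on updateend — a minimal sketch (createAppendQueue is an illustrative name):

```ts
function createAppendQueue(sourceBuffer: SourceBuffer) {
  const queue: ArrayBuffer[] = []

  const flush = () => {
    // appendBuffer throws if a previous append is still in flight
    if (!sourceBuffer.updating && queue.length > 0) {
      sourceBuffer.appendBuffer(queue.shift()!)
    }
  }

  sourceBuffer.addEventListener('updateend', flush)

  return (data: ArrayBuffer) => {
    queue.push(data)
    flush()
  }
}
```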
HTMLMediaElement Events
The HTMLMediaElement fires events that your custom controls subscribe to:
const events = {
play: () => updateState({ status: 'playing' }),
pause: () => updateState({ status: 'paused' }),
waiting: () => updateState({ status: 'buffering' }),
timeupdate: () => updateState({ currentTime: video.currentTime }),
ended: () => updateState({ status: 'ended' }),
error: (e) => updateState({ status: 'error', error: video.error }),
loadedmetadata: () => updateState({
duration: video.duration,
status: 'ready',
}),
progress: () => updateState({
bufferedRanges: getBufferedRanges(video.buffered),
}),
volumechange: () => updateState({
volume: video.volume,
isMuted: video.muted,
}),
}
Object.entries(events).forEach(([event, handler]) => {
video.addEventListener(event, handler)
})
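The getBufferedRanges helper referenced in the progress handler converts the browser's TimeRanges object — which isn't iterable — into the plain array VideoState expects (a sketch):

```ts
function getBufferedRanges(buffered: TimeRanges): Array<{ start: number; end: number }> {
  const ranges: Array<{ start: number; end: number }> = []
  // TimeRanges must be indexed via start(i)/end(i)
  for (let i = 0; i < buffered.length; i++) {
    ranges.push({ start: buffered.start(i), end: buffered.end(i) })
  }
  return ranges
}
```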
Media Session API
The Media Session API lets you integrate with the OS media controls — the lock screen player on mobile, the media keys on keyboards, the system notification on desktop:
if ('mediaSession' in navigator) {
navigator.mediaSession.metadata = new MediaMetadata({
title: video.title,
artist: video.channelName,
artwork: [
{ src: video.thumbnailUrl, sizes: '512x512', type: 'image/jpeg' },
],
})
navigator.mediaSession.setActionHandler('play', () => engine.play())
navigator.mediaSession.setActionHandler('pause', () => engine.pause())
navigator.mediaSession.setActionHandler('seekbackward', (details) => {
engine.seek(engine.getState().currentTime - (details.seekOffset ?? 10))
})
navigator.mediaSession.setActionHandler('seekforward', (details) => {
engine.seek(engine.getState().currentTime + (details.seekOffset ?? 10))
})
}
PostMessage for Embeds
If your player is embedded in an iframe (like YouTube embeds), communication happens via postMessage:
// Inside the iframe player
window.addEventListener('message', (event) => {
if (event.origin !== allowedOrigin) return
const { command, args } = event.data
switch (command) {
case 'play': engine.play(); break
case 'pause': engine.pause(); break
case 'seek': engine.seek(args.time); break
case 'setVolume': engine.setVolume(args.volume); break
}
})
// Emit state changes back to the parent
function notifyParent(state: VideoState) {
window.parent.postMessage(
{ type: 'playerStateChange', state },
allowedOrigin
)
}
Origin validation is not optional
Never skip the event.origin check in postMessage handlers. Without it, any page can embed your player in an iframe and control it — or worse, inject commands that trigger navigation, data exfiltration, or XSS. Always whitelist allowed origins.
O — Optimizations
This is where a good player becomes a great one. Performance separates YouTube from a hobby project.
1. Preload Strategy
The preload attribute on the video element controls how much data the browser fetches before the user hits play:
preload="none"— fetch nothing. Best for pages with many videos (feed, search results). Zero wasted bandwidth.preload="metadata"— fetch just enough to know duration, dimensions, and first frame. Good default for "above-the-fold" hero videos.preload="auto"— browser decides how much to buffer. Only use for the primary player on a dedicated video page.
For a video feed with 20 thumbnails, using preload="auto" on all of them would hammer the CDN and waste the user's data plan. Use preload="none" with IntersectionObserver to upgrade to preload="metadata" only when a video scrolls into view.
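A sketch of that upgrade path, assuming each feed video starts with preload="none":

```ts
const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue
    const video = entry.target as HTMLVideoElement
    video.preload = 'metadata' // now fetch duration, dimensions, first frame
    observer.unobserve(video)  // the upgrade only needs to happen once
  }
})

document
  .querySelectorAll<HTMLVideoElement>('video[preload="none"]')
  .forEach((video) => observer.observe(video))
```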
2. Thumbnail Sprites for Timeline Scrubbing
When users hover over the progress bar, they expect to see a thumbnail preview of that timestamp. Loading individual images for every second of a 2-hour video would be thousands of HTTP requests.
The solution: thumbnail sprite sheets. Generate a single image containing a grid of thumbnails (e.g., one per 10 seconds), then use CSS background-position to show the right frame:
interface SpriteConfig {
  url: string
  interval: number    // seconds between thumbnails
  columns: number     // thumbnails per row in the sprite grid
  thumbWidth: number  // px
  thumbHeight: number // px
}

function getThumbnailPosition(time: number, sprite: SpriteConfig) {
  const index = Math.floor(time / sprite.interval)
  const col = index % sprite.columns
  const row = Math.floor(index / sprite.columns)
  return {
    backgroundImage: `url(${sprite.url})`,
    backgroundPosition: `-${col * sprite.thumbWidth}px -${row * sprite.thumbHeight}px`,
    width: `${sprite.thumbWidth}px`,
    height: `${sprite.thumbHeight}px`,
  }
}
3. Bandwidth Estimation for ABR
The ABR controller needs to estimate available bandwidth to pick the right quality level. The simplest approach: measure how long each segment takes to download:
function estimateBandwidth(
segmentBytes: number,
downloadTimeMs: number,
previousEstimate: number
): number {
const measuredBps = (segmentBytes * 8 * 1000) / downloadTimeMs
const smoothingFactor = 0.7
return smoothingFactor * previousEstimate + (1 - smoothingFactor) * measuredBps
}
The smoothing factor prevents wild swings from a single slow segment (maybe the user's elevator briefly lost signal). Exponentially weighted moving average (EWMA) is what most production players use.
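The selection half of the loop picks the best rendition the estimate can sustain. A hypothetical sketch reusing the QualityLevel shape from the Data Model (the 0.8 safety factor is an assumed headroom value, and levels is assumed non-empty):

```ts
function pickQuality(levels: QualityLevel[], estimatedBps: number): QualityLevel {
  const safetyFactor = 0.8 // leave headroom so one slow segment doesn't stall playback
  const affordable = levels
    .filter((level) => level.bitrate <= estimatedBps * safetyFactor)
    .sort((a, b) => b.bitrate - a.bitrate)

  // Fall back to the lowest-bitrate rendition if nothing fits the budget
  return affordable[0] ?? levels.reduce((min, l) => (l.bitrate < min.bitrate ? l : min))
}
```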
4. Resume Playback Position
Store the user's watch progress so they can pick up where they left off:
function saveProgress(videoId: string, currentTime: number, duration: number) {
if (currentTime < 5 || duration - currentTime < 10) return
const progress = { time: currentTime, timestamp: Date.now() }
localStorage.setItem(`watch-progress:${videoId}`, JSON.stringify(progress))
}
Skip saving if the user just started (under 5 seconds) or is near the end (within 10 seconds) — they probably finished the video and you shouldn't resume at the credits.
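The read side is symmetric — a sketch assuming the same localStorage key scheme (a real implementation might also expire stale entries using the stored timestamp):

```ts
function getResumeTime(videoId: string): number | null {
  const raw = localStorage.getItem(`watch-progress:${videoId}`)
  if (!raw) return null
  try {
    const { time } = JSON.parse(raw) as { time: number; timestamp: number }
    return time
  } catch {
    return null // corrupted entry — start from the beginning
  }
}
```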
5. Lazy Load Player Below Fold
Don't load the full video player for videos below the viewport. Use IntersectionObserver to detect when the player area scrolls into view:
import { useEffect, useState, type RefObject } from 'react'

function useIntersection(ref: RefObject<HTMLElement>, rootMargin = '200px') {
const [isVisible, setIsVisible] = useState(false)
useEffect(() => {
const element = ref.current
if (!element) return
const observer = new IntersectionObserver(
([entry]) => {
if (entry.isIntersecting) {
setIsVisible(true)
observer.disconnect()
}
},
{ rootMargin }
)
observer.observe(element)
return () => observer.disconnect()
}, [ref, rootMargin])
return isVisible
}
The rootMargin of 200px starts loading the player slightly before it enters the viewport, so the user doesn't see a loading flash.
6. Picture-in-Picture API
The PiP API lets you pop the video out into a floating window managed by the OS:
async function togglePiP(video: HTMLVideoElement) {
if (document.pictureInPictureElement) {
await document.exitPictureInPicture()
} else if (document.pictureInPictureEnabled) {
await video.requestPictureInPicture()
}
}
7. Reduced Data Mode
Respect the user's data preferences via the NetworkInformation API:
function getMaxQuality(): number {
const connection = (navigator as Navigator & {
connection?: { saveData: boolean; effectiveType: string }
}).connection
if (connection?.saveData) return 360
if (connection?.effectiveType === '2g') return 240
if (connection?.effectiveType === '3g') return 480
return Infinity
}
8. Cleanup on Unmount
Video players are notorious for memory leaks. Every resource you create must be cleaned up:
function destroyPlayer(
video: HTMLVideoElement,
mediaSource: MediaSource,
objectUrl: string,
abortController: AbortController
) {
abortController.abort()
video.pause()
video.removeAttribute('src')
video.load()
if (mediaSource.readyState === 'open') {
mediaSource.endOfStream()
}
URL.revokeObjectURL(objectUrl)
}
Forgetting URL.revokeObjectURL is one of the most common memory leaks in media applications. Each createObjectURL allocates a blob reference that persists until revoked or the page unloads.
Accessibility — Not an Afterthought
A video player without keyboard controls and screen reader support is broken for a huge population of users. Here's the minimum:
Keyboard Controls
| Key | Action |
|---|---|
| Space / K | Play / Pause |
| Left Arrow | Seek back 5 seconds |
| Right Arrow | Seek forward 5 seconds |
| J | Seek back 10 seconds |
| L | Seek forward 10 seconds |
| Up Arrow | Volume up 5% |
| Down Arrow | Volume down 5% |
| M | Toggle mute |
| F | Toggle fullscreen |
| C | Toggle captions |
| Escape | Exit fullscreen |
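A sketch of that key map wired to the VideoEngine interface from the Architecture section. Mute, captions, and fullscreen are elided because the engine interface shown earlier doesn't expose them — treat those as assumed methods if you extend this:

```ts
function handleKeydown(e: KeyboardEvent, engine: VideoEngine) {
  // Never hijack keys while the user is typing (search box, comment field)
  const target = e.target as HTMLElement
  if (target.tagName === 'INPUT' || target.tagName === 'TEXTAREA' || target.isContentEditable) {
    return
  }

  const { status, currentTime } = engine.getState()
  switch (e.key.toLowerCase()) {
    case ' ':
    case 'k':
      if (status === 'playing') engine.pause()
      else engine.play()
      break
    case 'arrowleft': engine.seek(currentTime - 5); break
    case 'arrowright': engine.seek(currentTime + 5); break
    case 'j': engine.seek(currentTime - 10); break
    case 'l': engine.seek(currentTime + 10); break
    default:
      return // let unhandled keys bubble normally
  }
  e.preventDefault() // e.g. stop spacebar from scrolling the page
}
```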
Screen Reader Announcements
Use aria-live regions to announce state changes:
<div aria-live="polite" className="sr-only">
{status === 'playing' && 'Video playing'}
{status === 'paused' && 'Video paused'}
{status === 'buffering' && 'Video buffering'}
</div>
Every interactive control needs an aria-label:
<button
aria-label={isPlaying ? 'Pause video' : 'Play video'}
onClick={togglePlayPause}
>
{isPlaying ? <PauseIcon /> : <PlayIcon />}
</button>
Live Streaming Differences
If you're extending this design for live streaming (Twitch-style), the key differences are:
- Manifest refresh — Live HLS manifests must be re-fetched periodically (every target duration) because new segments keep appearing; VOD manifests are static (see the polling sketch after this list)
- No total duration — You can't show a total time or let users seek to the end. The progress bar becomes a "live edge" indicator with limited DVR-style rewind.
- Latency target — Standard HLS has 6-30s latency. LL-HLS (Low-Latency HLS) brings it to 2-5s using partial segments and preload hints.
- Live chat — Chat messages arrive via WebSocket alongside the video stream. Synchronizing chat with the live video edge requires timestamp coordination.
- DVR window — Users can rewind the live stream within a defined window (e.g., last 2 hours). Outside that window, segments are evicted from the CDN.
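A minimal sketch of the manifest-refresh loop from the first point above (pollLiveManifest and the 6-second default are illustrative — real players read the target duration out of the playlist itself):

```ts
async function pollLiveManifest(
  url: string,
  onUpdate: (manifestText: string) => void,
  signal: AbortSignal,
  targetDurationMs = 6000
) {
  while (!signal.aborted) {
    try {
      const res = await fetch(url, { signal })
      onUpdate(await res.text())
    } catch {
      if (signal.aborted) return // player was destroyed — stop polling
    }
    // HLS clients shouldn't reload more often than once per target duration
    await new Promise((resolve) => setTimeout(resolve, targetDurationMs))
  }
}
```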
Don't confuse "live" with "real-time." Even LL-HLS has 2-5 seconds of latency. For truly real-time interaction (under 500ms), you need WebRTC — but WebRTC doesn't scale to thousands of viewers without an SFU (Selective Forwarding Unit) infrastructure. Most "live" platforms use HLS/DASH with acceptable latency.
Common Mistakes
| What developers do | Why it hurts | What they should do |
|---|---|---|
| Using a single monolithic component for the entire video player | A monolithic player can't be reused across surfaces (main player, mini-player, embed). Separation lets you swap UIs without duplicating video logic. | Separating the headless video engine from the UI controls layer |
| Setting preload='auto' on all videos in a feed or gallery | preload='auto' on 20 videos triggers parallel downloads that waste bandwidth, inflate CDN costs, and drain mobile batteries. | Using preload='none' with IntersectionObserver to lazy-load |
| Forgetting to call URL.revokeObjectURL after destroying the player | Each createObjectURL keeps a reference to the MediaSource blob in memory. Without revokeObjectURL, you leak memory every time a player mounts and unmounts. | Revoking blob URLs, aborting pending fetches, and calling endOfStream on unmount |
| Implementing custom controls without keyboard support or aria-labels | Custom controls hide the browser's native accessible controls. If you don't rebuild that accessibility layer, keyboard and screen reader users get nothing. | Full keyboard navigation and screen reader announcements for every control |
| Using raw bandwidth measurement without smoothing for ABR decisions | A single slow segment (elevator, tunnel) would cause an immediate quality drop. Smoothing prevents oscillation between quality levels, which is more jarring than consistently lower quality. | Applying EWMA (Exponentially Weighted Moving Average) to bandwidth estimates |
Key Rules
1. Separate the video engine (headless state machine) from the UI controls — they change for different reasons
2. Use MediaSource Extensions and SourceBuffer for adaptive streaming — never set a static src for production video
3. Model player state as a state machine (idle, loading, ready, playing, paused, buffering, ended, error) — no ad-hoc booleans
4. Preload='none' for off-screen videos, 'metadata' for visible ones, 'auto' only for the primary player
5. Every custom control needs keyboard support and an aria-label — hiding native controls means rebuilding accessibility
6. Always clean up on unmount: revoke object URLs, abort fetches, call endOfStream, remove event listeners
7. Smooth bandwidth estimates with EWMA — raw measurements cause quality oscillation
8. Use thumbnail sprite sheets for timeline preview — individual images per timestamp would be thousands of requests
Putting It All Together
Here's the 30-second whiteboard pitch:
The video player separates a headless engine (manifest parsing, segment fetching, buffer management, ABR) from a UI controls layer (custom play/pause, progress bar, quality selector). The engine exposes a subscribe/dispatch interface — controls read state and send commands. The streaming pipeline flows from manifest fetch through segment download, demuxing, buffer append (via MSE), browser decode, to screen render. Key optimizations: lazy preloading with IntersectionObserver, thumbnail sprites for scrubbing, EWMA-smoothed ABR, resume playback via localStorage, and rigorous cleanup on unmount. Accessibility is built in — full keyboard controls and ARIA announcements for every state change.
That's the design. Whether the interviewer asks about YouTube, Netflix, Twitch, or "design a video player," this architecture covers all the bases.