
System Design: Video Streaming UI

Advanced · 30 min read

You've Watched a Million Videos. Now Build the Player.

Every frontend engineer has used YouTube, Netflix, or Twitch. But almost nobody understands what actually happens between clicking "play" and seeing pixels move on screen. The video player is one of the most complex UI components you'll ever build — it juggles network requests, binary data decoding, real-time buffer management, adaptive quality switching, accessibility, and platform-specific quirks, all while looking effortless.

This is a full RADIO framework case study. By the end, you'll be able to whiteboard the architecture of a production video streaming UI in a system design interview — and actually build one.

Mental Model

Think of a video player like a restaurant kitchen during dinner rush. The manifest file is the menu — it tells you what dishes (quality levels) are available. The segment fetcher is the waiter running orders to the kitchen (network). The buffer is the prep counter where partially-ready dishes wait. The decoder is the chef assembling final plates. And the renderer is the server carrying the finished dish to your table (screen). If the waiter is slow (bad network), the chef has nothing to cook, and you sit there staring at a spinner. Adaptive bitrate is the kitchen switching from filet mignon to a burger when the waiter can't keep up — you still get fed, just at lower fidelity.

The RADIO Framework

We'll walk through each layer of RADIO — Requirements, Architecture, Data Model, Interface, Optimizations — to design a video streaming UI from scratch.

R — Requirements

Before writing a single line of code, you need to lock down what you're building and how well it needs to work. This is where most candidates fumble in interviews — they jump straight to components without clarifying scope.

Functional Requirements

These are the things a user can do:

  • Video playback — play, pause, seek, volume, mute, playback speed
  • Custom controls — fully custom UI overlaying the native video element (YouTube-style, not browser defaults)
  • Quality selection — manual quality picker (Auto, 1080p, 720p, 480p, 360p) with adaptive bitrate as the default
  • Subtitles and captions — multiple language tracks, toggleable, customizable appearance (font size, color, background)
  • Picture-in-Picture — floating mini-player when scrolling away or switching tabs
  • Chapters and timeline preview — chapter markers on the progress bar, thumbnail preview on hover/scrub
  • Theater mode and fullscreen — toggle between normal, theater (wider), and fullscreen layouts
  • Mini-player — persistent small player in corner when navigating away from the video page
  • Comments — threaded comments below the video (VOD), or live chat alongside the player (live streams)
  • Keyboard shortcuts — spacebar for play/pause, arrow keys for seek, f for fullscreen, m for mute, c for captions

Non-Functional Requirements

These are how well the system performs:

  • Buffer startup time — under 2 seconds from click-to-play on a 4G connection
  • Smooth quality switching — no visible stutter or gap when ABR changes quality levels
  • Keyboard accessible — every control reachable via keyboard, visible focus indicators, screen reader announcements for state changes
  • Responsive — works on mobile (touch controls, swipe to seek), desktop (mouse hover controls), and TV (d-pad navigation, 10-foot UI)
  • Resilient — graceful degradation on slow networks (lower quality, not a crash), meaningful error states ("Video unavailable" not a blank screen)
  • Memory efficient — no unbounded buffer growth, proper cleanup on unmount (revoke object URLs, abort pending fetches, detach MediaSource)

A — Architecture

Here's where you decompose the system into components with clear responsibilities. The key insight is separating the video engine from the UI controls — they change for completely different reasons and at different rates.

Component Tree

VideoPlayerRoot
├── VideoEngine (headless — no UI)
│   ├── ManifestParser (HLS .m3u8 / DASH .mpd)
│   ├── SegmentFetcher (HTTP range requests)
│   ├── BufferManager (MediaSource + SourceBuffer)
│   ├── ABRController (bandwidth estimation + quality switching)
│   └── SubtitleEngine (WebVTT / TTML parsing)
│
├── VideoSurface
│   ├── HTMLVideoElement (actual rendering)
│   └── SubtitleOverlay (positioned captions)
│
├── CustomControls
│   ├── PlayPauseButton
│   ├── ProgressBar
│   │   ├── BufferIndicator
│   │   ├── ChapterMarkers
│   │   └── ThumbnailPreview (on hover)
│   ├── VolumeControl (slider + mute toggle)
│   ├── TimeDisplay (current / duration)
│   ├── PlaybackSpeedSelector
│   ├── QualitySelector
│   ├── SubtitleSelector
│   ├── PiPButton
│   ├── TheaterModeButton
│   └── FullscreenButton
│
├── MiniPlayer (portal to corner of viewport)
└── CommentsSection / LiveChat

Why This Separation Matters

The VideoEngine is a headless state machine. It has zero UI. It manages the actual video pipeline — fetching manifests, downloading segments, feeding them to MediaSource, estimating bandwidth, switching quality levels. You could swap the entire UI layer without touching the engine, or reuse the engine in a different product (embedded player, mobile webview, TV app).

The CustomControls layer is purely presentational. It reads state from the engine (is the video playing? what's the current time? which quality levels are available?) and dispatches commands back to it (play, pause, seek to 45s, switch to 720p). This is the classic command/query separation pattern.

// The engine exposes a clean interface — no UI concerns
interface VideoEngine {
  play(): void
  pause(): void
  seek(time: number): void
  setQuality(level: QualityLevel | 'auto'): void
  setPlaybackRate(rate: number): void
  setSubtitleTrack(trackId: string | null): void
  getState(): VideoState
  subscribe(listener: (state: VideoState) => void): () => void
  destroy(): void
}
Interview gold

When you separate the engine from the UI, you can explain that the same engine powers the main player, the mini-player, and an embedded iframe player — just with different control overlays. This shows the interviewer you think about reuse and scalability.

State Machine

The video player has well-defined states. Modeling them as a state machine prevents impossible transitions (you can't seek a video that hasn't loaded yet).

IDLE → LOADING → READY → PLAYING ⇌ PAUSED
                            ⇅
                        BUFFERING
                            ↓
                          ERROR

PLAYING → ENDED (when playback reaches the end)

Every UI element derives from this state. The play button shows a play icon in PAUSED, a pause icon in PLAYING, a spinner in BUFFERING, and is disabled in IDLE or ERROR. No ad-hoc boolean flags — just one source of truth.
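That single source of truth can be encoded as a transition table. Here's a minimal sketch (the event names are illustrative, not from any specific library): events with no entry simply leave the state unchanged, which is what makes impossible transitions impossible.

```typescript
type PlaybackStatus =
  | 'idle' | 'loading' | 'ready' | 'playing'
  | 'paused' | 'buffering' | 'ended' | 'error'

type PlayerEvent =
  | 'LOAD' | 'LOADED' | 'PLAY' | 'PAUSE'
  | 'STALL' | 'RECOVER' | 'END' | 'FAIL'

// Only the transitions listed here are legal; anything else is a no-op.
const transitions: Partial<Record<PlaybackStatus, Partial<Record<PlayerEvent, PlaybackStatus>>>> = {
  idle: { LOAD: 'loading' },
  loading: { LOADED: 'ready', FAIL: 'error' },
  ready: { PLAY: 'playing' },
  playing: { PAUSE: 'paused', STALL: 'buffering', END: 'ended' },
  paused: { PLAY: 'playing' },
  buffering: { RECOVER: 'playing', FAIL: 'error' },
  ended: { PLAY: 'playing' },
}

function transition(status: PlaybackStatus, event: PlayerEvent): PlaybackStatus {
  return transitions[status]?.[event] ?? status
}
```

A PLAY event in IDLE is silently ignored: the video hasn't loaded yet, so the UI never needs a guard boolean for it.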


D — Data Model

Let's define the data structures flowing through the system.

Video Metadata

interface VideoMetadata {
  id: string
  title: string
  description: string
  duration: number
  thumbnailUrl: string
  thumbnailSpriteUrl: string
  manifestUrl: string
  subtitleTracks: SubtitleTrack[]
  chapters: Chapter[]
  publishedAt: string
  viewCount: number
  isLive: boolean
}

interface SubtitleTrack {
  id: string
  language: string
  label: string
  url: string
  isDefault: boolean
}

interface Chapter {
  startTime: number
  endTime: number
  title: string
  thumbnailUrl?: string
}

Playback State

type PlaybackStatus =
  | 'idle'
  | 'loading'
  | 'ready'
  | 'playing'
  | 'paused'
  | 'buffering'
  | 'ended'
  | 'error'

interface VideoState {
  status: PlaybackStatus
  currentTime: number
  duration: number
  bufferedRanges: Array<{ start: number; end: number }>
  volume: number
  isMuted: boolean
  playbackRate: number
  activeQuality: QualityLevel | 'auto'
  availableQualities: QualityLevel[]
  activeSubtitleTrack: string | null
  isFullscreen: boolean
  isPiP: boolean
  error: MediaError | null
}

interface QualityLevel {
  height: number
  width: number
  bitrate: number
  codec: string
  label: string
}

Manifest Structure

This is what HLS and DASH manifests boil down to after parsing:

interface ParsedManifest {
  type: 'hls' | 'dash'
  isLive: boolean
  duration: number
  levels: QualityLevel[]
  segments: Map<QualityLevel, Segment[]>
}

interface Segment {
  index: number
  url: string
  duration: number
  byteRange?: { start: number; end: number }
}
HLS vs DASH — what actually differs?

Both HLS and DASH solve the same problem — adaptive bitrate streaming over HTTP. The differences are mostly in format and ecosystem:

HLS (HTTP Live Streaming) was created by Apple. It uses .m3u8 text playlists (a master playlist pointing to per-quality media playlists) and .ts (MPEG-2 Transport Stream) or .fmp4 (fragmented MP4) segments. It's natively supported in Safari and iOS, and supported everywhere else via hls.js which uses the Media Source Extensions API.

DASH (Dynamic Adaptive Streaming over HTTP) is the ISO standard. It uses .mpd XML manifests (Media Presentation Description) and .fmp4 segments. It's not natively supported in any browser — you always use a library like dash.js or Shaka Player.

CMAF (Common Media Application Format) is the bridge — it defines a single segment format (fMP4, with CENC common encryption for DRM) that works with both HLS and DASH manifests. This means you encode and store each rendition once and serve it to both protocols, roughly halving storage and encoding costs.

In practice, most large platforms use CMAF segments with both HLS and DASH manifests, and their players auto-detect which protocol to use based on the browser.
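That auto-detection boils down to a small decision function. In this sketch the inputs stand in for `video.canPlayType('application/vnd.apple.mpegurl')` (non-empty in Safari) and `Hls.isSupported()` from hls.js; the function itself is illustrative, not any library's API.

```typescript
type Strategy = 'native-hls' | 'mse-hls' | 'unsupported'

// canPlayNative: result of video.canPlayType('application/vnd.apple.mpegurl'),
// which is '' | 'maybe' | 'probably'. hlsJsSupported: result of Hls.isSupported().
function pickHlsStrategy(canPlayNative: string, hlsJsSupported: boolean): Strategy {
  if (canPlayNative !== '') return 'native-hls' // Safari / iOS: let the browser handle HLS
  if (hlsJsSupported) return 'mse-hls'          // everywhere else: hls.js over MSE
  return 'unsupported'                          // no MSE at all: fall back or show an error
}
```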

| Feature | HLS | DASH |
| --- | --- | --- |
| Created by | Apple (2009) | MPEG / ISO (2012) |
| Manifest format | .m3u8 (text playlist) | .mpd (XML) |
| Segment format | .ts or .fmp4 | .fmp4 (always) |
| Native browser support | Safari, iOS, macOS | None |
| JS library needed | hls.js (other browsers) | dash.js or Shaka Player |
| Low-latency variant | LL-HLS (Apple, 2019) | LL-DASH |
| DRM support | FairPlay (Apple); Widevine, PlayReady via fMP4 | Widevine, PlayReady, ClearKey |
| Segment duration | Typically 6s (LL-HLS: sub-second parts) | Typically 2-6s |
| Market share | Dominant (iOS + default fallback) | Growing (Android, Smart TVs) |
| CMAF compatible | Yes (with fMP4 segments) | Yes (native fMP4) |

The Streaming Pipeline

Here's the end-to-end flow of how a video goes from server to screen:

manifest fetch → segment download → demux → buffer append (via MSE) → browser decode → render to screen

I — Interface (APIs and Contracts)

The interfaces define how components talk to each other and to the outside world.

Media Source Extensions (MSE)

The MediaSource API is what makes adaptive streaming possible in the browser. Instead of giving the video element a static URL, you create a MediaSource object and programmatically feed it binary data:

const mediaSource = new MediaSource()
const video = document.querySelector('video')
video.src = URL.createObjectURL(mediaSource)

mediaSource.addEventListener('sourceopen', () => {
  const sourceBuffer = mediaSource.addSourceBuffer(
    'video/mp4; codecs="avc1.42E01E, mp4a.40.2"'
  )

  fetchSegment('/segment-001.m4s').then((data) => {
    sourceBuffer.appendBuffer(data)
  })
})

This is what libraries like hls.js do under the hood. They parse the HLS manifest, decide which segments to download based on bandwidth, fetch them, and pipe them into SourceBuffer.

HTMLMediaElement Events

The HTMLMediaElement fires events that your custom controls subscribe to:

const events = {
  play: () => updateState({ status: 'playing' }),
  pause: () => updateState({ status: 'paused' }),
  waiting: () => updateState({ status: 'buffering' }),
  timeupdate: () => updateState({ currentTime: video.currentTime }),
  ended: () => updateState({ status: 'ended' }),
  error: (e) => updateState({ status: 'error', error: video.error }),
  loadedmetadata: () => updateState({
    duration: video.duration,
    status: 'ready',
  }),
  progress: () => updateState({
    bufferedRanges: getBufferedRanges(video.buffered),
  }),
  volumechange: () => updateState({
    volume: video.volume,
    isMuted: video.muted,
  }),
}

Object.entries(events).forEach(([event, handler]) => {
  video.addEventListener(event, handler)
})
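The `progress` handler above references a `getBufferedRanges` helper. A sketch of it: the element's `buffered` property is a TimeRanges object, and this flattens it into the plain array stored in VideoState (a structural type stands in for the DOM interface).

```typescript
// Structural stand-in for the DOM TimeRanges interface.
interface TimeRangesLike {
  length: number
  start(index: number): number
  end(index: number): number
}

// Convert TimeRanges into a plain array the buffer indicator can render.
function getBufferedRanges(buffered: TimeRangesLike): Array<{ start: number; end: number }> {
  const ranges: Array<{ start: number; end: number }> = []
  for (let i = 0; i < buffered.length; i++) {
    ranges.push({ start: buffered.start(i), end: buffered.end(i) })
  }
  return ranges
}
```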

Media Session API

The Media Session API lets you integrate with the OS media controls — the lock screen player on mobile, the media keys on keyboards, the system notification on desktop:

if ('mediaSession' in navigator) {
  navigator.mediaSession.metadata = new MediaMetadata({
    title: video.title,
    artist: video.channelName,
    artwork: [
      { src: video.thumbnailUrl, sizes: '512x512', type: 'image/jpeg' },
    ],
  })

  navigator.mediaSession.setActionHandler('play', () => engine.play())
  navigator.mediaSession.setActionHandler('pause', () => engine.pause())
  navigator.mediaSession.setActionHandler('seekbackward', (details) => {
    engine.seek(engine.getState().currentTime - (details.seekOffset ?? 10))
  })
  navigator.mediaSession.setActionHandler('seekforward', (details) => {
    engine.seek(engine.getState().currentTime + (details.seekOffset ?? 10))
  })
}

PostMessage for Embeds

If your player is embedded in an iframe (like YouTube embeds), communication happens via postMessage:

// Inside the iframe player
window.addEventListener('message', (event) => {
  if (event.origin !== allowedOrigin) return

  const { command, args } = event.data
  switch (command) {
    case 'play': engine.play(); break
    case 'pause': engine.pause(); break
    case 'seek': engine.seek(args.time); break
    case 'setVolume': engine.setVolume(args.volume); break
  }
})

// Emit state changes back to the parent
function notifyParent(state: VideoState) {
  window.parent.postMessage(
    { type: 'playerStateChange', state },
    allowedOrigin
  )
}
Common Trap

Origin validation is not optional

Never skip the event.origin check in postMessage handlers. Without it, any page can embed your player in an iframe and control it — or worse, inject commands that trigger navigation, data exfiltration, or XSS. Always whitelist allowed origins.


O — Optimizations

This is where a good player becomes a great one. Performance separates YouTube from a hobby project.

1. Preload Strategy

The preload attribute on the video element controls how much data the browser fetches before the user hits play:

  • preload="none" — fetch nothing. Best for pages with many videos (feed, search results). Zero wasted bandwidth.
  • preload="metadata" — fetch just enough to know duration, dimensions, and first frame. Good default for "above-the-fold" hero videos.
  • preload="auto" — browser decides how much to buffer. Only use for the primary player on a dedicated video page.

For a video feed with 20 thumbnails, using preload="auto" on all of them would hammer the CDN and waste the user's data plan. Use preload="none" with IntersectionObserver to upgrade to preload="metadata" only when a video scrolls into view.
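A sketch of that upgrade strategy, assuming you have the feed's video elements in hand: everything starts at preload="none" and is promoted to "metadata" exactly once, shortly before entering the viewport.

```typescript
// Start every feed video at preload="none"; upgrade to "metadata" when it
// comes within rootMargin of the viewport. Returns a cleanup function.
function lazyUpgradePreload(videos: HTMLVideoElement[], rootMargin = '200px'): () => void {
  const observer = new IntersectionObserver(
    (entries) => {
      for (const entry of entries) {
        if (entry.isIntersecting) {
          const video = entry.target as HTMLVideoElement
          video.preload = 'metadata'
          observer.unobserve(video) // upgrade once, then stop watching
        }
      }
    },
    { rootMargin }
  )

  for (const video of videos) {
    video.preload = 'none'
    observer.observe(video)
  }
  return () => observer.disconnect()
}
```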

2. Thumbnail Sprites for Timeline Scrubbing

When users hover over the progress bar, they expect to see a thumbnail preview of that timestamp. Loading individual images for every second of a 2-hour video would be thousands of HTTP requests.

The solution: thumbnail sprite sheets. Generate a single image containing a grid of thumbnails (e.g., one per 10 seconds), then use CSS background-position to show the right frame:

interface SpriteConfig {
  url: string
  interval: number    // seconds of video per thumbnail
  columns: number
  thumbWidth: number  // px
  thumbHeight: number // px
}

function getThumbnailPosition(time: number, sprite: SpriteConfig) {
  const index = Math.floor(time / sprite.interval)
  const col = index % sprite.columns
  const row = Math.floor(index / sprite.columns)

  return {
    backgroundImage: `url(${sprite.url})`,
    backgroundPosition: `-${col * sprite.thumbWidth}px -${row * sprite.thumbHeight}px`,
    width: sprite.thumbWidth,
    height: sprite.thumbHeight,
  }
}

3. Bandwidth Estimation for ABR

The ABR controller needs to estimate available bandwidth to pick the right quality level. The simplest approach: measure how long each segment takes to download:

function estimateBandwidth(
  segmentBytes: number,
  downloadTimeMs: number,
  previousEstimate: number
): number {
  const measuredBps = (segmentBytes * 8 * 1000) / downloadTimeMs
  const smoothingFactor = 0.7
  return smoothingFactor * previousEstimate + (1 - smoothingFactor) * measuredBps
}

The smoothing factor prevents wild swings from a single slow segment (maybe the user's elevator briefly lost signal). Exponentially weighted moving average (EWMA) is what most production players use.
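Given a smoothed estimate, the quality pick itself is simple: choose the highest bitrate that fits within a safety fraction of the estimated bandwidth. The 0.8 factor here is an illustrative default, not a standard value.

```typescript
interface Level {
  height: number
  bitrate: number // bits per second
}

// Pick the highest rung whose bitrate fits within a safety margin of the
// bandwidth estimate; never drop below the lowest available rung.
function pickQuality(levels: Level[], estimatedBps: number, safetyFactor = 0.8): Level {
  const sorted = [...levels].sort((a, b) => a.bitrate - b.bitrate)
  let choice = sorted[0]
  for (const level of sorted) {
    if (level.bitrate <= estimatedBps * safetyFactor) choice = level
  }
  return choice
}
```

The safety margin leaves headroom so a small bandwidth dip doesn't immediately starve the buffer at the chosen bitrate.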

4. Resume Playback Position

Store the user's watch progress so they can pick up where they left off:

function saveProgress(videoId: string, currentTime: number, duration: number) {
  if (currentTime < 5 || duration - currentTime < 10) return

  const progress = { time: currentTime, timestamp: Date.now() }
  localStorage.setItem(`watch-progress:${videoId}`, JSON.stringify(progress))
}

Skip saving if the user just started (under 5 seconds) or is near the end (within 10 seconds) — they probably finished the video and you shouldn't resume at the credits.
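The counterpart on page load reads the stored position back. This sketch adds a hypothetical 30-day staleness cutoff (not from the saving code above) and treats corrupt entries as absent.

```typescript
// Read a saved position for a video, ignoring stale or corrupt entries.
function loadProgress(videoId: string, maxAgeMs = 30 * 24 * 60 * 60 * 1000): number | null {
  const raw = localStorage.getItem(`watch-progress:${videoId}`)
  if (!raw) return null
  try {
    const { time, timestamp } = JSON.parse(raw)
    if (Date.now() - timestamp > maxAgeMs) return null // assumed staleness cutoff
    return typeof time === 'number' ? time : null
  } catch {
    return null // corrupt entry — start from the beginning
  }
}
```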

5. Lazy Load Player Below Fold

Don't load the full video player for videos below the viewport. Use IntersectionObserver to detect when the player area scrolls into view:

function useIntersection(ref: RefObject<HTMLElement>, rootMargin = '200px') {
  const [isVisible, setIsVisible] = useState(false)

  useEffect(() => {
    const element = ref.current
    if (!element) return

    const observer = new IntersectionObserver(
      ([entry]) => {
        if (entry.isIntersecting) {
          setIsVisible(true)
          observer.disconnect()
        }
      },
      { rootMargin }
    )

    observer.observe(element)
    return () => observer.disconnect()
  }, [ref, rootMargin])

  return isVisible
}

The rootMargin of 200px starts loading the player slightly before it enters the viewport, so the user doesn't see a loading flash.

6. Picture-in-Picture API

The PiP API lets you pop the video out into a floating window managed by the OS:

async function togglePiP(video: HTMLVideoElement) {
  if (document.pictureInPictureElement) {
    await document.exitPictureInPicture()
  } else if (document.pictureInPictureEnabled) {
    await video.requestPictureInPicture()
  }
}
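PiP state also needs to flow back into VideoState, because the user can close the floating window from the OS side. The video element fires `enterpictureinpicture` and `leavepictureinpicture` events; a small tracker sketch:

```typescript
// Keep isPiP in sync with the OS-managed window. Returns a cleanup function,
// consistent with the cleanup-on-unmount rule later in this section.
function trackPiP(video: HTMLVideoElement, onChange: (isPiP: boolean) => void): () => void {
  const enter = () => onChange(true)
  const leave = () => onChange(false)
  video.addEventListener('enterpictureinpicture', enter)
  video.addEventListener('leavepictureinpicture', leave)
  return () => {
    video.removeEventListener('enterpictureinpicture', enter)
    video.removeEventListener('leavepictureinpicture', leave)
  }
}
```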

7. Reduced Data Mode

Respect the user's data preferences via the NetworkInformation API:

function getMaxQuality(): number {
  const connection = (navigator as Navigator & {
    connection?: { saveData: boolean; effectiveType: string }
  }).connection

  if (connection?.saveData) return 360
  if (connection?.effectiveType === '2g') return 240
  if (connection?.effectiveType === '3g') return 480
  return Infinity
}

8. Cleanup on Unmount

Video players are notorious for memory leaks. Every resource you create must be cleaned up:

function destroyPlayer(
  video: HTMLVideoElement,
  mediaSource: MediaSource,
  objectUrl: string,
  abortController: AbortController
) {
  abortController.abort()
  video.pause()
  video.removeAttribute('src')
  video.load()

  if (mediaSource.readyState === 'open') {
    mediaSource.endOfStream()
  }

  URL.revokeObjectURL(objectUrl)
}

Forgetting URL.revokeObjectURL is one of the most common memory leaks in media applications. Each createObjectURL allocates a blob reference that persists until revoked or the page unloads.


Accessibility — Not an Afterthought

A video player without keyboard controls and screen reader support is broken for a huge population of users. Here's the minimum:

Keyboard Controls

| Key | Action |
| --- | --- |
| Space / K | Play / Pause |
| Left Arrow | Seek back 5 seconds |
| Right Arrow | Seek forward 5 seconds |
| J | Seek back 10 seconds |
| L | Seek forward 10 seconds |
| Up Arrow | Volume up 5% |
| Down Arrow | Volume down 5% |
| M | Toggle mute |
| F | Toggle fullscreen |
| C | Toggle captions |
| Escape | Exit fullscreen |
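The table above maps onto a small dispatcher. This sketch covers a subset of the keys; `setMuted` is not part of the engine interface shown earlier and is assumed here. Callers should skip dispatch when focus is in a text input, so typing a comment doesn't pause the video, and call preventDefault whenever this returns true.

```typescript
// Minimal subset of the engine surface, with an assumed setMuted command.
interface ShortcutEngine {
  play(): void
  pause(): void
  seek(time: number): void
  setMuted(muted: boolean): void
}

// Dispatch one keydown; returns true if the key was handled.
function handleShortcut(
  key: string,
  engine: ShortcutEngine,
  state: { status: string; currentTime: number; isMuted: boolean }
): boolean {
  switch (key.toLowerCase()) {
    case ' ':
    case 'k':
      if (state.status === 'playing') engine.pause()
      else engine.play()
      return true
    case 'arrowleft': engine.seek(state.currentTime - 5); return true
    case 'arrowright': engine.seek(state.currentTime + 5); return true
    case 'j': engine.seek(state.currentTime - 10); return true
    case 'l': engine.seek(state.currentTime + 10); return true
    case 'm': engine.setMuted(!state.isMuted); return true
    default: return false
  }
}
```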

Screen Reader Announcements

Use aria-live regions to announce state changes:

<div aria-live="polite" className="sr-only">
  {status === 'playing' && 'Video playing'}
  {status === 'paused' && 'Video paused'}
  {status === 'buffering' && 'Video buffering'}
</div>

Every interactive control needs an aria-label:

<button
  aria-label={isPlaying ? 'Pause video' : 'Play video'}
  onClick={togglePlayPause}
>
  {isPlaying ? <PauseIcon /> : <PlayIcon />}
</button>

Live Streaming Differences

If you're extending this design for live streaming (Twitch-style), the key differences are:

  • Manifest refresh — Live HLS manifests must be re-fetched periodically (every target duration) because new segments keep appearing. VOD manifests are static.
  • No total duration — You can't show a total time or let users seek to the end. The progress bar becomes a "live edge" indicator with limited DVR-style rewind.
  • Latency target — Standard HLS has 6-30s latency. LL-HLS (Low-Latency HLS) brings it to 2-5s using partial segments and preload hints.
  • Live chat — Chat messages arrive via WebSocket alongside the video stream. Synchronizing chat with the live video edge requires timestamp coordination.
  • DVR window — Users can rewind the live stream within a defined window (e.g., last 2 hours). Outside that window, segments are evicted from the CDN.
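The DVR window constraint turns into a simple seek clamp. A sketch with times in seconds: the two-hour window from the bullet above becomes dvrWindowSec = 7200.

```typescript
// Clamp a requested seek into [liveEdge - dvrWindowSec, liveEdge].
// Segments behind that window have been evicted from the CDN.
function clampLiveSeek(requested: number, liveEdge: number, dvrWindowSec: number): number {
  const earliest = Math.max(0, liveEdge - dvrWindowSec)
  return Math.min(Math.max(requested, earliest), liveEdge)
}
```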
Live latency trap

Don't confuse "live" with "real-time." Even LL-HLS has 2-5 seconds of latency. For truly real-time interaction (under 500ms), you need WebRTC — but WebRTC doesn't scale to thousands of viewers without an SFU (Selective Forwarding Unit) infrastructure. Most "live" platforms use HLS/DASH with acceptable latency.

Common Mistakes

| What developers do | Why it's a problem | What they should do |
| --- | --- | --- |
| Use a single monolithic component for the entire video player | A monolithic player can't be reused across surfaces (main player, mini-player, embed). Separation lets you swap UIs without duplicating video logic. | Separate the headless video engine from the UI controls layer |
| Set preload="auto" on all videos in a feed or gallery | preload="auto" on 20 videos triggers parallel downloads that waste bandwidth, inflate CDN costs, and drain mobile batteries. | Use preload="none" with IntersectionObserver to lazy-load |
| Forget to call URL.revokeObjectURL after destroying the player | Each createObjectURL keeps a reference to the MediaSource blob in memory. Without revokeObjectURL, you leak memory every time a player mounts and unmounts. | Revoke blob URLs, abort pending fetches, and call endOfStream on unmount |
| Implement custom controls without keyboard support or aria-labels | Custom controls hide the browser's native accessible controls. If you don't rebuild that accessibility layer, keyboard and screen reader users get nothing. | Provide full keyboard navigation and screen reader announcements for every control |
| Use raw bandwidth measurements without smoothing for ABR decisions | A single slow segment (elevator, tunnel) would cause an immediate quality drop. Oscillation between quality levels is more jarring than consistently lower quality. | Apply EWMA (Exponentially Weighted Moving Average) to bandwidth estimates |

Key Rules

  1. Separate the video engine (headless state machine) from the UI controls — they change for different reasons
  2. Use Media Source Extensions and SourceBuffer for adaptive streaming — never set a static src for production video
  3. Model player state as a state machine (idle, loading, ready, playing, paused, buffering, ended, error) — no ad-hoc booleans
  4. Use preload="none" for off-screen videos, "metadata" for visible ones, "auto" only for the primary player
  5. Every custom control needs keyboard support and an aria-label — hiding native controls means rebuilding accessibility
  6. Always clean up on unmount: revoke object URLs, abort fetches, call endOfStream, remove event listeners
  7. Smooth bandwidth estimates with EWMA — raw measurements cause quality oscillation
  8. Use thumbnail sprite sheets for timeline preview — individual images per timestamp would mean thousands of requests

Putting It All Together

Here's the 30-second whiteboard pitch:

The video player separates a headless engine (manifest parsing, segment fetching, buffer management, ABR) from a UI controls layer (custom play/pause, progress bar, quality selector). The engine exposes a subscribe/dispatch interface — controls read state and send commands. The streaming pipeline flows from manifest fetch through segment download, demuxing, buffer append (via MSE), browser decode, to screen render. Key optimizations: lazy preloading with IntersectionObserver, thumbnail sprites for scrubbing, EWMA-smoothed ABR, resume playback via localStorage, and rigorous cleanup on unmount. Accessibility is built in — full keyboard controls and ARIA announcements for every state change.

That's the design. Whether the interviewer asks about YouTube, Netflix, Twitch, or "design a video player," this architecture covers all the bases.
