System Design: Collaborative Editor

Advanced · 35 min read

The Problem That Breaks Most Interviews

You sit down in a system design round and the interviewer says: "Design Google Docs." Your heart rate spikes. This is one of the hardest frontend system design problems because it touches everything at once: real-time networking, conflict resolution algorithms, rich text data models, cursor synchronization, undo/redo stacks, offline support, and performance under concurrent edits.

Most candidates fumble it because they jump straight to WebSockets and hand-wave the hard part: what actually happens when two people type at the same cursor position at the same time?

We are going to tear this problem apart using the RADIO framework (Requirements, Architecture, Data Model, Interface, Optimizations) and build a mental model so solid you could actually implement this.

Mental Model

Think of a collaborative editor like a shared whiteboard in a room full of people. Everyone has their own marker and can draw anywhere. The magic isn't in the drawing itself -- it's in the rules that prevent chaos when two people try to draw in the same spot. OT (Operational Transformation) solves this by having a referee who reorders and adjusts strokes in real time. CRDTs solve it by giving every stroke a unique identity that can be merged in any order without a referee. Different approaches, same goal: everyone sees the same whiteboard.

R — Requirements

Before sketching any architecture, nail down exactly what you are building. In an interview, spend 3-5 minutes here. It shows maturity.

Functional Requirements

Rich text editing — Bold, italic, headings, lists, code blocks, images, tables. Not just plain text.
Real-time collaboration — Multiple users editing simultaneously. Changes appear within 50-100ms for co-located users.
Presence awareness — See who is viewing the document, where their cursor is, and what they have selected. Each user gets a distinct color.
Version history — Browse and restore previous versions. Snapshots at meaningful intervals, not every keystroke.
Comments and suggestions — Anchored to specific text ranges. Resolve/unresolve. Reply threads.
Formatting toolbar — Context-aware toolbar that reflects current selection state.
Offline editing — Users can edit without connection. Changes merge when they reconnect.

Non-Functional Requirements

Conflict resolution — Concurrent edits must converge to the same document state on all clients. No data loss.
Local-first latency — Keystrokes must feel instant (sub-16ms to render locally). Network round-trips must not block the editing experience.
Per-user undo/redo — Undoing my changes should not undo your changes. Each user has an independent undo stack.
Scalability — Support 50+ concurrent editors on a single document without degradation.
Data integrity — No lost edits, no phantom characters, no corrupted formatting.

Quiz

Why is 'sub-16ms local latency' a non-functional requirement rather than a functional one?

A — Architecture

Editor Engine: The Foundation Choice

The editor engine determines everything downstream — data model, plugin system, collaboration integration. This is the most consequential architectural decision.

Engine	Approach	Schema	Collab Support	Best For
ProseMirror / TipTap	Schema-driven, transactions	Strict, declarative	Yjs plugin (mature)	Structured documents, Notion-like
Slate	React-first, nested JSON	Flexible, custom	Yjs plugin (community)	Highly custom editors
Lexical (Meta)	EditorState snapshots	Node-based, typed	Yjs plugin (growing)	Performance-critical, Facebook-scale
CodeMirror 6	Functional, immutable state	Text-only (no rich blocks)	Yjs plugin (native)	Code editors, not docs

For a Google Docs-style editor, ProseMirror (or TipTap, which wraps it) + Yjs is the proven combination. TipTap gives you a batteries-included React integration. ProseMirror gives you the raw power. Both use the same underlying engine.

Component Tree

Here is the high-level component architecture:

CollaborativeEditor
├── Toolbar                    // Formatting controls, responds to selection
├── EditorContainer
│   ├── EditorContent          // ProseMirror/TipTap editor view
│   ├── CursorOverlay          // Remote user cursors + selections
│   └── CommentAnchors         // Inline comment markers
├── PresenceBar                // Avatars of active users
├── CommentSidebar             // Comment threads, anchored to text ranges
└── VersionHistory             // Timeline of snapshots, diff viewer

Separation of Concerns

Each component has a clear, single job:

Toolbar reads the current selection state and dispatches ProseMirror commands. It never touches the document model directly.
EditorContent owns the ProseMirror EditorView. All document mutations flow through ProseMirror transactions.
CursorOverlay subscribes to the Yjs awareness protocol. It renders absolutely-positioned cursor markers and translucent selection highlights. It never modifies the document.
PresenceBar shows connected users. It reads from awareness state, not from the document.
CommentSidebar manages comment CRUD through REST. Comment anchors are stored as marks in the document schema.
VersionHistory fetches snapshots from the server and renders a diff view. It is lazy-loaded because most users never open it.

Quiz

Why should CursorOverlay be a separate component from EditorContent instead of rendering cursors inside the editor?

D — Data Model

This is where collaborative editing gets genuinely hard. The data model is not just about representing the document — it is about representing concurrent changes to the document in a way that always converges.

Document Model: Block-Based

Modern editors use a block-based document model, not flat text. A document is a tree of typed nodes:

type Document = {
  type: 'doc'
  content: Block[]
}

type Block =
  | { type: 'paragraph'; content: InlineContent[] }
  | { type: 'heading'; attrs: { level: 1 | 2 | 3 }; content: InlineContent[] }
  | { type: 'codeBlock'; attrs: { language: string }; content: TextNode[] }
  | { type: 'image'; attrs: { src: string; alt: string } }
  | { type: 'bulletList'; content: ListItem[] }

type InlineContent = TextNode | HardBreak

type TextNode = {
  type: 'text'
  text: string
  marks?: Mark[]
}

type Mark =
  | { type: 'bold' }
  | { type: 'italic' }
  | { type: 'link'; attrs: { href: string } }
  | { type: 'comment'; attrs: { commentId: string } }

ProseMirror enforces this schema at the transaction level — you literally cannot create an invalid document state. This is critical for collaboration because it means malformed edits from any client are rejected before they corrupt the shared state.

The Core Problem: Concurrent Edits

Here is the scenario that breaks naive implementations. Two users are editing the same sentence:

Original:   "The cat sat on the mat"
User A:     Deletes "cat" → "The sat on the mat"
User B:     Bolds "cat" → "The **cat** sat on the mat"

If you just apply both operations in order, you get nonsense. User B's bold operation targets a range that no longer exists after User A's delete. This is the conflict resolution problem, and there are two fundamentally different approaches.
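To see the divergence concretely, here is a minimal sketch that applies two concurrent edits by raw index in both possible orders. The `Op` type and `applyOp` helper are hypothetical names invented for this illustration, and User B's bold is swapped for a plain-text insert so the example can stay within strings.

```typescript
// Hypothetical plain-text operations, used only for illustration.
type Op =
  | { kind: 'insert'; pos: number; text: string }
  | { kind: 'delete'; pos: number; length: number };

// Apply an operation by absolute index, with no awareness of concurrent edits.
function applyOp(doc: string, op: Op): string {
  if (op.kind === 'insert') {
    return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
  }
  return doc.slice(0, op.pos) + doc.slice(op.pos + op.length);
}

const original = 'The cat sat on the mat';

// User A deletes "cat " (4 characters starting at index 4).
const opA: Op = { kind: 'delete', pos: 4, length: 4 };
// User B, concurrently, inserts "big " just before "mat" (index 19).
const opB: Op = { kind: 'insert', pos: 19, text: 'big ' };

// Each site applies its own edit first, then the other user's edit unchanged.
const siteA = applyOp(applyOp(original, opA), opB);
const siteB = applyOp(applyOp(original, opB), opA);

console.log(JSON.stringify(siteA)); // "The sat on the matbig "
console.log(JSON.stringify(siteB)); // "The sat on the big mat"
```

Site A ends up with the inserted word in the wrong place because index 19 meant something different once the delete had shifted the text. Both OT and CRDTs exist to remove that dependence on arrival order.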

OT vs CRDT: The Two Paradigms

Dimension	OT (Operational Transformation)	CRDT (Conflict-free Replicated Data Type)
Core idea	Transform operations against each other so they converge	Data structure that merges automatically regardless of order
Server requirement	Central server required for transformation ordering	No central server needed (peer-to-peer possible)
Offline support	Complex — must queue and rebase operations	Native — merge whenever you reconnect
Implementation complexity	Simpler data model, harder algorithm (transform functions)	Complex data model, simpler merge (automatic)
Production examples	Google Docs, Google Sheets	Figma (custom), Notion (partial), Linear
Dominant library	ShareDB, OT.js	Yjs, Automerge
Undo/redo	Transform-based inversion	Per-user undo via CRDT tombstones
Performance at scale	Linear with op count per transform	Memory grows with edit history (tombstones)
Correctness guarantee	Requires careful transform function pairs (O(n^2) pairs for n op types)	Mathematically proven to converge

How OT Works

OT keeps a linear history of operations. When two clients submit concurrent operations, the server transforms one against the other so they produce the same final state regardless of application order.

Server state: "ABCD" (version 3)

Client A (at v3): insert('X', position 1) → "AXBCD"
Client B (at v3): delete(position 2)      → "ABD"

Server receives A first → applies insert('X', 1) → "AXBCD" (v4)
Server receives B → but B was based on v3, not v4!

Transform: B.delete(pos 2) against A.insert(pos 1)
Since A inserted before position 2, shift B's position: delete(pos 3)
Result: "AXBD" — both clients converge to this
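The walkthrough above can be sketched as a small transform function for single-character inserts and deletes. This is a simplified illustration under stated assumptions, not how ShareDB or Google Docs implement OT: the `Op` shape, the `transform` signature, and the `opWinsTie` tie-break flag are all hypothetical.

```typescript
type Op =
  | { kind: 'insert'; pos: number; ch: string }
  | { kind: 'delete'; pos: number };

function apply(doc: string, op: Op): string {
  return op.kind === 'insert'
    ? doc.slice(0, op.pos) + op.ch + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + 1);
}

// Rewrite `op` so it can run AFTER the concurrent operation `against`.
// Returns null when `op` becomes a no-op (both sides deleted the same character).
function transform(op: Op, against: Op, opWinsTie: boolean): Op | null {
  if (against.kind === 'insert') {
    if (op.kind === 'insert') {
      // Two inserts at the same index need a deterministic tie-break.
      const shift = against.pos < op.pos || (against.pos === op.pos && !opWinsTie);
      return shift ? { ...op, pos: op.pos + 1 } : op;
    }
    // An insert at or before the deleted character pushes it one slot right.
    return against.pos <= op.pos ? { ...op, pos: op.pos + 1 } : op;
  }
  if (op.kind === 'insert') {
    // A delete strictly before the insertion point pulls it one slot left.
    return against.pos < op.pos ? { ...op, pos: op.pos - 1 } : op;
  }
  if (against.pos === op.pos) return null;
  return against.pos < op.pos ? { ...op, pos: op.pos - 1 } : op;
}

const base = 'ABCD';
const opFromA: Op = { kind: 'insert', pos: 1, ch: 'X' };
const opFromB: Op = { kind: 'delete', pos: 2 };

// Server order: A's op first, then B's op transformed against it.
const bPrime = transform(opFromB, opFromA, false);
const serverDoc = bPrime ? apply(apply(base, opFromA), bPrime) : apply(base, opFromA);

// Client B order: its own op first, then A's op transformed against it.
const aPrime = transform(opFromA, opFromB, true);
const clientBDoc = aPrime ? apply(apply(base, opFromB), aPrime) : apply(base, opFromB);

console.log(serverDoc, clientBDoc); // AXBD AXBD
```

Even with only two operation types, the function needs four cases plus tie-breaking, which gives a feel for how the number of cases grows as operation types are added.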

The problem with OT is that every pair of operation types needs a transform function. For a rich text editor with 20+ operation types, that is 400+ transform function pairs to get right. Google invested years of engineering into getting this correct for Docs.

How CRDTs Work

CRDTs take a completely different approach. Instead of transforming operations, they use a data structure where concurrent edits automatically merge to the same state, regardless of the order they are applied.

Yjs (the dominant CRDT library for web editors) uses a structure where every character has a unique ID based on the client that created it and a logical clock. Insertions are placed relative to their neighbors, not at absolute positions. Deletions mark characters as tombstones rather than removing them.

Client A inserts 'X' after position 1:
  ID: (clientA, clock:7)
  origin: character at position 1
  content: 'X'

Client B deletes the character at position 2:
  Looks up that character's ID, here (clientB, clock:3)
  Marks that item as deleted (tombstone)

These operations commute — apply them in any order,
you get the same result. No transformation needed.

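The trade-off is memory. A deleted character cannot simply vanish, because other items may still be anchored to it, so tombstones accumulate. For a document edited heavily over months, the CRDT state can grow significantly larger than the visible text. Yjs mitigates this with garbage collection (enabled by default) that discards the content of deleted items and compacts runs of tombstones into small placeholder records, but the deletion metadata itself remains, so it is a real concern at scale.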
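The ideas above (unique IDs, insertion relative to a neighbor, tombstones) can be sketched as a toy sequence CRDT. This is a simplified RGA-style illustration, not the Yjs algorithm or its data layout. The `Replica` class and its methods are hypothetical, and the sketch assumes causal delivery, meaning an operation arrives only after the items it refers to.

```typescript
type Id = { client: number; clock: number };
type Item = { id: Id; ch: string; deleted: boolean };
type Op =
  | { kind: 'insert'; id: Id; origin: Id | null; ch: string }
  | { kind: 'delete'; target: Id };

const sameId = (a: Id, b: Id): boolean => a.client === b.client && a.clock === b.clock;
// Total order on IDs: higher clock wins, client number breaks ties.
const isGreater = (a: Id, b: Id): boolean =>
  a.clock > b.clock || (a.clock === b.clock && a.client > b.client);

class Replica {
  private items: Item[] = [];
  private clock = 0; // Lamport clock: at least as high as every ID seen so far
  private client: number;
  constructor(client: number) {
    this.client = client;
  }

  text(): string {
    return this.visible().map((i) => i.ch).join('');
  }
  private visible(): Item[] {
    return this.items.filter((i) => !i.deleted);
  }

  // Insert `ch` so that it becomes the visible character at `index`.
  localInsert(index: number, ch: string): Op {
    const origin = index === 0 ? null : this.visible()[index - 1].id;
    const id = { client: this.client, clock: this.clock + 1 };
    const op: Op = { kind: 'insert', id, origin, ch };
    this.apply(op);
    return op;
  }

  localDelete(index: number): Op {
    const op: Op = { kind: 'delete', target: this.visible()[index].id };
    this.apply(op);
    return op;
  }

  apply(op: Op): void {
    if (op.kind === 'delete') {
      const target = op.target;
      const item = this.items.find((i) => sameId(i.id, target));
      if (item) item.deleted = true; // tombstone: hidden, but kept as an anchor
      return;
    }
    const { id, origin, ch } = op;
    if (this.items.some((i) => sameId(i.id, id))) return; // already applied
    this.clock = Math.max(this.clock, id.clock);
    let pos = origin === null ? 0 : this.items.findIndex((i) => sameId(i.id, origin)) + 1;
    // Skip concurrent items with a greater ID so every replica picks the same slot.
    while (pos < this.items.length && isGreater(this.items[pos].id, id)) pos++;
    this.items.splice(pos, 0, { id, ch, deleted: false });
  }
}

// Both replicas start from the same "ABCD".
const seed = new Replica(0);
const initial = 'ABCD'.split('').map((ch, i) => seed.localInsert(i, ch));
const replicaA = new Replica(1);
const replicaB = new Replica(2);
initial.forEach((op) => {
  replicaA.apply(op);
  replicaB.apply(op);
});

const insertX = replicaA.localInsert(1, 'X'); // Client A: 'X' after the first character
const deleteC = replicaB.localDelete(2); // Client B: delete 'C'
replicaA.apply(deleteC); // each side receives the other's op after its own
replicaB.apply(insertX);

console.log(replicaA.text(), replicaB.text()); // AXBD AXBD
```

Neither replica had to adjust an index: the insert names its left neighbor and the delete names its target, so the order of arrival does not matter.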

Quiz

A startup is building a collaborative note-taking app that must work offline-first (like a local desktop app that syncs). Which approach should they choose?

Awareness Protocol

Separate from the document model, collaborative editors need an awareness protocol for presence information. Yjs includes this out of the box:

type AwarenessState = {
  user: {
    name: string
    color: string
    avatar?: string
  }
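  // Positions are plain numbers here for readability. In practice, store positions
  // that survive concurrent edits (see Common Pitfalls).
  cursor: {
    anchor: number   // where the selection started
    head: number     // where the caret currently is
  } | null
  selection: {
    from: number     // lower bound of the selected range
    to: number       // upper bound of the selected range
  } | null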
  lastActive: number
}

Awareness state is ephemeral — it is not persisted and is not part of the document. When a user disconnects, their awareness state is automatically removed after a timeout. This is broadcast through a separate channel (typically the same WebSocket connection but a distinct message type) at a lower priority than document operations.
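As a rough illustration of "ephemeral state removed after a timeout", the sketch below keeps presence in an in-memory map and prunes silent clients. It is not the Yjs awareness implementation: `PresenceRegistry`, its methods, and the 30-second timeout are assumptions chosen for the example, and time is passed in explicitly so the behavior is easy to follow.

```typescript
type Presence = {
  name: string;
  color: string;
  cursor: { anchor: number; head: number } | null;
};

class PresenceRegistry {
  // Never persisted: this map is the only place presence lives.
  private entries = new Map<number, { presence: Presence; lastSeen: number }>();
  private timeoutMs: number;
  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Record a cursor move or heartbeat from a client.
  update(clientId: number, presence: Presence, now: number): void {
    this.entries.set(clientId, { presence, lastSeen: now });
  }

  // Drop clients that have been silent for longer than the timeout.
  prune(now: number): number[] {
    const removed: number[] = [];
    this.entries.forEach((entry, clientId) => {
      if (now - entry.lastSeen > this.timeoutMs) removed.push(clientId);
    });
    removed.forEach((clientId) => this.entries.delete(clientId));
    return removed;
  }

  activeNames(): string[] {
    return Array.from(this.entries.values()).map((e) => e.presence.name);
  }
}

const registry = new PresenceRegistry(30000);
registry.update(1, { name: 'Ada', color: '#e91e63', cursor: { anchor: 4, head: 9 } }, 0);
registry.update(2, { name: 'Lin', color: '#2196f3', cursor: null }, 0);
registry.update(1, { name: 'Ada', color: '#e91e63', cursor: { anchor: 12, head: 12 } }, 20000);

const dropped = registry.prune(35000); // Lin has been silent for 35 seconds
console.log(dropped, registry.activeNames()); // [ 2 ] [ 'Ada' ]
```

Because nothing here touches the document, losing this state on a server restart costs only a brief flicker of cursors, which is why it can travel at lower priority than document updates.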

I — Interface (APIs and Protocols)

WebSocket: The Real-Time Backbone

Document sync and awareness both flow over WebSocket. Here is the message protocol:

// Client to Server
type ClientMessage =
  | { type: 'sync-step-1'; payload: Uint8Array }     // Initial state vector
  | { type: 'sync-step-2'; payload: Uint8Array }     // State diff response
  | { type: 'update'; payload: Uint8Array }           // Document update (CRDT encoded)
  | { type: 'awareness'; payload: Uint8Array }        // Cursor/presence update

// Server to Client
type ServerMessage =
  | { type: 'sync-step-1'; payload: Uint8Array }
  | { type: 'sync-step-2'; payload: Uint8Array }
  | { type: 'update'; payload: Uint8Array }
  | { type: 'awareness'; payload: Uint8Array }
  | { type: 'error'; code: string; message: string }

With Yjs, the sync protocol is handled automatically by y-websocket. The three-step handshake works like this:

Phase 1 of 3, sync-step-1 (client to server): The client sends its state vector (a compact summary of which updates it has seen) to the server.
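To make the state vector idea concrete, here is a minimal sketch of how a receiver could use it to work out which updates the sender is missing. The `Update` shape and both helper functions are hypothetical, and the real Yjs encoding is binary rather than plain objects. The sketch assumes each client's clocks start at 0 and have no gaps.

```typescript
type Update = { client: number; clock: number; payload: string };
// For each client ID, the next clock value the holder expects to see.
type StateVector = Map<number, number>;

function stateVectorOf(updates: Update[]): StateVector {
  const vector: StateVector = new Map();
  for (const u of updates) {
    vector.set(u.client, Math.max(vector.get(u.client) ?? 0, u.clock + 1));
  }
  return vector;
}

// Everything in `local` that the holder of `remote` has not seen yet.
function missingUpdates(remote: StateVector, local: Update[]): Update[] {
  return local.filter((u) => u.clock >= (remote.get(u.client) ?? 0));
}

const serverLog: Update[] = [
  { client: 1, clock: 0, payload: 'H' },
  { client: 1, clock: 1, payload: 'i' },
  { client: 1, clock: 2, payload: '!' },
  { client: 2, clock: 0, payload: '?' },
];
// The reconnecting client has only the first two updates.
const clientLog: Update[] = serverLog.slice(0, 2);

const clientVector = stateVectorOf(clientLog); // client 1 -> 2
const diff = missingUpdates(clientVector, serverLog);

console.log(diff.map((u) => u.payload)); // [ '!', '?' ]
```

The vector stays small however long the history is, since it holds one number per client rather than one entry per update.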

REST API: Everything Else

Not everything needs real-time. Use REST for operations that are infrequent and do not need instant propagation:

GET    /api/documents/:id              → Document metadata + initial CRDT state
POST   /api/documents                  → Create new document
GET    /api/documents/:id/versions     → List version snapshots
GET    /api/documents/:id/versions/:v  → Specific version snapshot
POST   /api/documents/:id/comments     → Create comment
PATCH  /api/documents/:id/comments/:c  → Resolve/update comment
DELETE /api/documents/:id/comments/:c  → Delete comment
POST   /api/documents/:id/upload       → Upload image, returns URL

Why Two Protocols?

WebSocket handles the hot path: keystrokes, cursor movements, awareness. REST handles the cold path: document CRUD, version history, comments. Mixing them on the same channel works but creates priority issues — you do not want a large version history response blocking a keystroke sync.

Quiz

A team proposes using Server-Sent Events (SSE) instead of WebSocket for the collaboration sync channel. What is the main problem with this approach?

O — Optimizations

This is where a good design becomes a great one. These optimizations are the difference between a sluggish prototype and a production-quality editor.

1. Local-First: Apply Before You Sync

The single most important optimization. Never wait for the server to acknowledge an edit before showing it to the user.

User types 'H':
  1. Apply to local CRDT state immediately      (0ms — feels instant)
  2. Re-render the editor view                   (< 16ms — next frame)
  3. Encode the CRDT update                      (< 1ms)
  4. Send over WebSocket                         (fire and forget)
  5. Server receives, persists, broadcasts       (50-200ms round trip)
  6. Other clients receive and merge             (no conflict possible with CRDT)

The user experiences step 1-2. Steps 3-6 happen in the background. This is why CRDTs pair so well with local-first — since merges are automatic, there is no risk that your local state will diverge permanently from the server.

Edit lifecycle, phase 1 of 5 (User types, 0ms): The keystroke is captured by the editor engine and triggers a ProseMirror transaction.

2. Cursor Interpolation

Remote cursors update at network frequency (every 50-100ms). Without interpolation, they jump jerkily across the screen. Smooth them with CSS transitions or spring animations:

/* CursorOverlay renders remote cursors with smooth transitions. */
/* CSS handles the interpolation between position updates. */
.remote-cursor {
  transition: transform 80ms ease-out;
}

The key insight: cursor position updates are low-priority. You can throttle awareness broadcasts to every 100ms without any perceived quality loss. Document updates, on the other hand, should be sent as soon as possible.
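One way to throttle awareness broadcasts without losing the final cursor position is to send the first update immediately and hold only the latest one until the interval has elapsed. The sketch below is an illustration under stated assumptions: `createThrottledSender` is a hypothetical helper, and timestamps are passed in explicitly instead of being read from a timer so the behavior is deterministic.

```typescript
function createThrottledSender<T>(intervalMs: number, send: (value: T) => void) {
  let lastSentAt = -Infinity;
  let pending: { value: T } | null = null;

  function emit(value: T, now: number): void {
    lastSentAt = now;
    pending = null;
    send(value);
  }

  return {
    // Called on every local cursor move.
    push(value: T, now: number): void {
      if (now - lastSentAt >= intervalMs) emit(value, now);
      else pending = { value }; // keep only the most recent position
    },
    // Called periodically (for example from a timer) to flush a held value.
    tick(now: number): void {
      if (pending !== null && now - lastSentAt >= intervalMs) emit(pending.value, now);
    },
  };
}

const sent: number[] = [];
const cursorSender = createThrottledSender<number>(100, (pos) => {
  sent.push(pos);
});

cursorSender.push(10, 0); // sent immediately
cursorSender.push(11, 30); // held back
cursorSender.push(12, 60); // replaces the held value
cursorSender.tick(100); // interval elapsed: the latest held value goes out

console.log(sent); // [ 10, 12 ]
```

Intermediate positions are dropped on purpose. Remote users only need to see where the cursor ended up, and the CSS transition fills in the movement between updates.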

3. Lazy Loading Document Blocks

For long documents (100+ pages), do not load the entire document into the editor at once. Virtualize it:

Load the visible viewport plus a buffer above and below
As the user scrolls, fetch and hydrate additional blocks
Detach blocks that scroll far out of view to free memory

This is the same principle as virtualized lists (react-window, @tanstack/virtual), applied to document blocks. Notion uses this approach for their long-form documents.
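The windowing step can be sketched as a function that maps a scroll position to the range of blocks worth keeping mounted. This is a minimal illustration, assuming block heights are already measured. `visibleBlockRange` and its parameters are hypothetical names, and a real editor would also need to handle selections that span unloaded blocks.

```typescript
type BlockRange = { start: number; end: number }; // inclusive block indexes

function visibleBlockRange(
  blockHeights: number[],
  scrollTop: number,
  viewportHeight: number,
  bufferBlocks: number,
): BlockRange | null {
  const viewportBottom = scrollTop + viewportHeight;
  let top = 0;
  let first = -1;
  let last = -1;
  for (let i = 0; i < blockHeights.length; i++) {
    const bottom = top + blockHeights[i];
    // A block is visible when it overlaps the viewport at all.
    if (bottom > scrollTop && top < viewportBottom) {
      if (first === -1) first = i;
      last = i;
    }
    top = bottom;
  }
  if (first === -1) return null; // scrolled past the end of the document
  return {
    start: Math.max(0, first - bufferBlocks),
    end: Math.min(blockHeights.length - 1, last + bufferBlocks),
  };
}

// Ten blocks of 100px each, a 300px viewport scrolled to 250px, one buffer block per side.
const heights = new Array<number>(10).fill(100);
const range = visibleBlockRange(heights, 250, 300, 1);

console.log(range); // { start: 1, end: 6 }
```

Blocks 2 through 5 overlap the viewport, and the buffer extends the range by one block in each direction so a small scroll does not immediately trigger a fetch.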

4. Per-User Undo Stack

Naive undo (Ctrl+Z) in a collaborative editor is a nightmare. If User A types "hello" and User B types "world", then User A presses Ctrl+Z, should it undo "hello" (User A's action) or "world" (the most recent action globally)?

The answer is clear: undo should only affect the current user's actions. This requires a per-user undo stack:

type UndoManager = {
  userId: string
  undoStack: UndoItem[]
  redoStack: UndoItem[]
}

type UndoItem = {
  // Only tracks operations from this specific user
  inverseOperations: CRDTUpdate[]
  timestamp: number
}

Yjs ships a Y.UndoManager class that handles this out of the box. It records changes by transaction origin, so when each client's local edits carry their own origin, undo inverts only that user's changes and leaves everyone else's edits intact.
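The "hello"/"world" scenario can be sketched with a toy shared document that keeps one undo stack per author. This is an illustration of the idea only, not how Yjs implements undo: `SharedDoc` and its methods are hypothetical, and only inserts are undoable here to keep the example short.

```typescript
type Item = { id: string; author: string; text: string; deleted: boolean };

class SharedDoc {
  private items: Item[] = [];
  // One stack per author, holding the IDs of the items that author inserted.
  private undoStacks = new Map<string, string[]>();
  private nextId = 0;

  insert(author: string, text: string): void {
    const id = `${author}:${this.nextId++}`;
    this.items.push({ id, author, text, deleted: false });
    const stack = this.undoStacks.get(author) ?? [];
    stack.push(id);
    this.undoStacks.set(author, stack);
  }

  // Undo the most recent insert made by `author`, ignoring everyone else's edits.
  undo(author: string): boolean {
    const id = this.undoStacks.get(author)?.pop();
    if (id === undefined) return false;
    const item = this.items.find((i) => i.id === id);
    // The inverse targets the item by ID, so later edits cannot shift it.
    if (item) item.deleted = true;
    return true;
  }

  text(): string {
    return this.items.filter((i) => !i.deleted).map((i) => i.text).join(' ');
  }
}

const doc = new SharedDoc();
doc.insert('userA', 'hello');
doc.insert('userB', 'world');

const before = doc.text(); // "hello world"
doc.undo('userA'); // removes "hello", not the globally latest "world"
const after = doc.text(); // "world"

console.log(before, '->', after);
```

Undo works here because the inverse refers to an item by identity. With index-based operations, User B's concurrent typing would have moved the text that User A wants to remove.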

5. Image Upload with Placeholder

When a user drags an image into the editor, do not block the editing experience while it uploads:

Insert a placeholder block immediately with a local blob: URL and a loading indicator
Upload the image to your CDN in the background
When the upload completes, replace the blob: URL with the permanent CDN URL via a ProseMirror transaction
If the upload fails, show an error state on the placeholder with a retry button

This is optimistic UI applied to media insertion. The document state is always valid — the placeholder is a real node in the document tree, not a hack.
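The four steps above amount to a small state machine on the image node. Here is a minimal sketch of it as a pure transition function. The `ImageNode` and `UploadEvent` shapes, the `nextImageState` name, and the example URLs are assumptions for illustration; in a real editor each transition would be applied through a ProseMirror transaction.

```typescript
type ImageNode =
  | { status: 'uploading'; src: string } // src is a local blob: URL
  | { status: 'ready'; src: string } // src is the permanent CDN URL
  | { status: 'failed'; src: string; error: string };

type UploadEvent =
  | { type: 'uploaded'; cdnUrl: string }
  | { type: 'failed'; error: string }
  | { type: 'retry' };

// Events that do not apply to the current status leave the node unchanged.
function nextImageState(node: ImageNode, event: UploadEvent): ImageNode {
  if (event.type === 'uploaded') {
    return node.status === 'uploading' ? { status: 'ready', src: event.cdnUrl } : node;
  }
  if (event.type === 'failed') {
    return node.status === 'uploading'
      ? { status: 'failed', src: node.src, error: event.error }
      : node;
  }
  // retry: only meaningful after a failure; keeps showing the local preview
  return node.status === 'failed' ? { status: 'uploading', src: node.src } : node;
}

const placeholder: ImageNode = { status: 'uploading', src: 'blob:local-preview' };
const afterFailure = nextImageState(placeholder, { type: 'failed', error: 'network' });
const afterRetry = nextImageState(afterFailure, { type: 'retry' });
const finalNode = nextImageState(afterRetry, {
  type: 'uploaded',
  cdnUrl: 'https://cdn.example.com/cat.png',
});

console.log(afterFailure.status, afterRetry.status, finalNode.status); // failed uploading ready
```

Every state is a valid node with a usable `src`, which is what keeps the document renderable for collaborators while the upload is still in flight.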

6. Version Snapshots

Do not snapshot every keystroke. Use heuristics:

Snapshot after N minutes of inactivity (e.g., 5 minutes after the last edit)
Snapshot when a user explicitly saves (Ctrl+S habit)
Snapshot before destructive operations (large deletes, bulk formatting changes)
Store snapshots as full CRDT state snapshots, not as operation logs. Restoration then costs one load of a binary blob, independent of how many operations produced it, instead of a replay of the whole log
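The first three heuristics can be combined into a single decision function. The sketch below is one possible reading of them, not a prescribed policy: the `SnapshotSignal` fields, the `shouldSnapshot` name, and the 1000-character threshold for a "large delete" are assumptions made for the example.

```typescript
type SnapshotSignal = {
  now: number; // current time in ms
  lastEditAt: number; // time of the most recent edit
  lastSnapshotAt: number; // time of the most recent snapshot
  explicitSave: boolean; // the user pressed Ctrl+S
  pendingDeleteChars: number; // size of a delete that is about to be applied
};

const IDLE_MS = 5 * 60 * 1000; // "5 minutes after the last edit"
const LARGE_DELETE_CHARS = 1000; // assumed threshold for a destructive operation

function shouldSnapshot(s: SnapshotSignal): boolean {
  // Nothing changed since the last snapshot, so a new one would be a duplicate.
  if (s.lastEditAt <= s.lastSnapshotAt) return false;
  if (s.explicitSave) return true;
  if (s.pendingDeleteChars >= LARGE_DELETE_CHARS) return true;
  return s.now - s.lastEditAt >= IDLE_MS;
}

const minute = 60 * 1000;
const stillTyping = shouldSnapshot({
  now: 11 * minute, lastEditAt: 10 * minute, lastSnapshotAt: 0,
  explicitSave: false, pendingDeleteChars: 0,
});
const wentIdle = shouldSnapshot({
  now: 16 * minute, lastEditAt: 10 * minute, lastSnapshotAt: 0,
  explicitSave: false, pendingDeleteChars: 0,
});
const bigDelete = shouldSnapshot({
  now: 11 * minute, lastEditAt: 10 * minute, lastSnapshotAt: 0,
  explicitSave: false, pendingDeleteChars: 5000,
});

console.log(stillTyping, wentIdle, bigDelete); // false true true
```

Checking for unsaved edits first keeps an idle document from producing a new, identical snapshot every time the check runs.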

Why Not Just Replay Operations for Version History?

Replaying a log of operations to reconstruct a historical version sounds elegant, but it is O(n) where n is the total number of operations since the snapshot. For a document with 6 months of active editing, that could be millions of operations. A full CRDT state snapshot is a single binary blob that can be loaded in milliseconds. The trade-off is storage space, but storage is cheap and user patience is not.

Putting It All Together

Here is the full data flow for a single keystroke in the collaborative editor:

Execution Trace

1. Keystroke: User presses the 'A' key. The browser keydown event is captured by ProseMirror.
2. Transaction: ProseMirror creates a transaction: insert('A', pos 42). It is validated against the document schema.
3. Local CRDT: The transaction is applied to the local Yjs document. Yjs assigns a unique ID: (client3, clock:157).
4. View update: ProseMirror re-renders the changed paragraph. Only the affected DOM nodes update (< 1ms).
5. Encode: Yjs encodes the update as a binary diff, typically 20-50 bytes for a single character insert.
6. WebSocket send: The binary update is sent to the collaboration server. Fire-and-forget, no ack needed for the UI.
7. Server broadcast: The server applies the update and broadcasts it to 8 other clients. The server CRDT state is the source of truth for persistence.
8. Remote merge: Other clients receive the update, merge it into their local CRDT, and re-render. The merge is automatic and conflict-free.

Common Pitfalls

What developers do	What they should do	Why
Using absolute character positions for cursor sync	Use relative positions (Yjs RelativePosition) that survive concurrent insertions and deletions	Absolute positions shift when other users insert or delete text before the cursor. A cursor at position 10 might point to completely different text after a remote insertion at position 5. Relative positions are anchored to specific CRDT items and remain valid regardless of concurrent edits.
Sending the full document state on every edit	Send only incremental CRDT updates (typically 20-100 bytes per edit)	A full document sync on every keystroke would mean sending megabytes of data per second with multiple active editors. Incremental updates are orders of magnitude smaller and can be applied without re-parsing the entire document.
Using a global undo stack for all users	Implement per-user undo stacks that only revert the current user's operations	With a global undo stack, pressing Ctrl+Z might undo someone else's edit from 200ms ago. Users expect undo to reverse their own last action. Yjs provides UndoManager, which tracks operation ownership automatically.
Storing collaboration state in React component state	Use the CRDT document (Yjs Doc) as the single source of truth, with React as a rendering layer	React state and CRDT state would constantly drift apart. The CRDT document must be the authoritative source. React subscribes to CRDT changes and re-renders, but never owns the document data. This is the same principle as treating a database as the source of truth, not your UI cache.
Implementing custom conflict resolution from scratch	Use a battle-tested library like Yjs or Automerge	Conflict resolution algorithms have subtle edge cases that take years to discover and fix. Yjs has been battle-tested across thousands of production deployments. Rolling your own CRDT or OT implementation is like writing your own crypto: technically possible, practically a terrible idea.

Key Principles

1. Local-first always: apply edits instantly, sync in the background. Never let the network block the UI.
2. CRDT for new projects, OT only if you are extending an existing OT-based system. CRDTs are mathematically simpler and handle offline natively.
3. Separate document sync from presence/awareness. They have different priorities, frequencies, and persistence requirements.
4. Per-user undo stacks are non-negotiable in collaborative editors. Global undo breaks user expectations.
5. Schema-driven document models (ProseMirror) prevent invalid states from ever entering the CRDT, eliminating an entire class of bugs.
6. Version snapshots should be full state captures, not operation logs. Restoration cost must not depend on how many operations the document has accumulated.

Quiz

In a Yjs-based collaborative editor, what happens to deleted characters in the CRDT state?

Quiz

You are designing the awareness protocol for a collaborative editor with 30 concurrent users. Each user sends cursor position updates. What is the best throttling strategy?

Interview Tips

When you get this question in an interview, here is how to structure your 35-40 minutes:

Requirements (3-5 min) — Clarify scope. Ask: "Should we support rich text or just plain text? Real-time or near-real-time? How many concurrent editors?" This shows you do not make assumptions.
Architecture (5-7 min) — Draw the component tree. Explain the editor engine choice. Mention the separation between document sync and awareness.
Data Model (10-12 min) — This is where you spend the most time. Explain OT vs CRDT trade-offs. Walk through a concrete conflict scenario. Show the document model structure.
Interface (5-7 min) — WebSocket protocol, REST endpoints, the sync handshake. Binary encoding for efficiency.
Optimizations (5-10 min) — Local-first, cursor interpolation, per-user undo, lazy loading, image placeholders. Each one shows depth.

The biggest differentiator: walking through a concrete conflict scenario step-by-step. Most candidates say "we use CRDTs" and move on. The ones who get hired can explain exactly what happens when User A inserts at position 5 while User B deletes position 3, and why the CRDT guarantees convergence.
