System Design: Collaborative Editor
The Problem That Breaks Most Interviews
You sit down in a system design round and the interviewer says: "Design Google Docs." Your heart rate spikes. This is one of the hardest frontend system design problems because it touches everything at once: real-time networking, conflict resolution algorithms, rich text data models, cursor synchronization, undo/redo stacks, offline support, and performance under concurrent edits.
Most candidates fumble it because they jump straight to WebSockets and hand-wave the hard part: what actually happens when two people type at the same cursor position at the same time?
We are going to tear this problem apart using the RADIO framework (Requirements, Architecture, Data Model, Interface, Optimizations) and build a mental model so solid you could actually implement this.
Think of a collaborative editor like a shared whiteboard in a room full of people. Everyone has their own marker and can draw anywhere. The magic isn't in the drawing itself -- it's in the rules that prevent chaos when two people try to draw in the same spot. OT (Operational Transformation) solves this by having a referee who reorders and adjusts strokes in real time. CRDTs solve it by giving every stroke a unique identity that can be merged in any order without a referee. Different approaches, same goal: everyone sees the same whiteboard.
R — Requirements
Before sketching any architecture, nail down exactly what you are building. In an interview, spend 3-5 minutes here. It shows maturity.
Functional Requirements
- Rich text editing — Bold, italic, headings, lists, code blocks, images, tables. Not just plain text.
- Real-time collaboration — Multiple users editing simultaneously. Changes appear within 50-100ms for co-located users.
- Presence awareness — See who is viewing the document, where their cursor is, and what they have selected. Each user gets a distinct color.
- Version history — Browse and restore previous versions. Snapshots at meaningful intervals, not every keystroke.
- Comments and suggestions — Anchored to specific text ranges. Resolve/unresolve. Reply threads.
- Formatting toolbar — Context-aware toolbar that reflects current selection state.
- Offline editing — Users can edit without connection. Changes merge when they reconnect.
Non-Functional Requirements
- Conflict resolution — Concurrent edits must converge to the same document state on all clients. No data loss.
- Local-first latency — Keystrokes must feel instant (sub-16ms to render locally). Network round-trips must not block the editing experience.
- Per-user undo/redo — Undoing my changes should not undo your changes. Each user has an independent undo stack.
- Scalability — Support 50+ concurrent editors on a single document without degradation.
- Data integrity — No lost edits, no phantom characters, no corrupted formatting.
A — Architecture
Editor Engine: The Foundation Choice
The editor engine determines everything downstream — data model, plugin system, collaboration integration. This is the most consequential architectural decision.
| Engine | Approach | Schema | Collab Support | Best For |
|---|---|---|---|---|
| ProseMirror / TipTap | Schema-driven, transactions | Strict, declarative | Yjs plugin (mature) | Structured documents, Notion-like |
| Slate | React-first, nested JSON | Flexible, custom | Yjs plugin (community) | Highly custom editors |
| Lexical (Meta) | EditorState snapshots | Node-based, typed | Yjs plugin (growing) | Performance-critical, Facebook-scale |
| CodeMirror 6 | Functional, immutable state | Text-only (no rich blocks) | Yjs plugin (native) | Code editors, not docs |
For a Google Docs-style editor, ProseMirror (or TipTap, which wraps it) + Yjs is the proven combination. TipTap gives you a batteries-included React integration. ProseMirror gives you the raw power. Both use the same underlying engine.
Component Tree
Here is the high-level component architecture:
CollaborativeEditor
├── Toolbar // Formatting controls, responds to selection
├── EditorContainer
│ ├── EditorContent // ProseMirror/TipTap editor view
│ ├── CursorOverlay // Remote user cursors + selections
│ └── CommentAnchors // Inline comment markers
├── PresenceBar // Avatars of active users
├── CommentSidebar // Comment threads, anchored to text ranges
└── VersionHistory // Timeline of snapshots, diff viewer
Separation of Concerns
Each component has a clear, single job:
- Toolbar reads the current selection state and dispatches ProseMirror commands. It never touches the document model directly.
- EditorContent owns the ProseMirror EditorView. All document mutations flow through ProseMirror transactions.
- CursorOverlay subscribes to the Yjs awareness protocol. It renders absolutely-positioned cursor markers and translucent selection highlights. It never modifies the document.
- PresenceBar shows connected users. It reads from awareness state, not from the document.
- CommentSidebar manages comment CRUD through REST. Comment anchors are stored as marks in the document schema.
- VersionHistory fetches snapshots from the server and renders a diff view. It is lazy-loaded because most users never open it.
D — Data Model
This is where collaborative editing gets genuinely hard. The data model is not just about representing the document — it is about representing concurrent changes to the document in a way that always converges.
Document Model: Block-Based
Modern editors use a block-based document model, not flat text. A document is a tree of typed nodes:
type Document = {
  type: 'doc'
  content: Block[]
}

type Block =
  | { type: 'paragraph'; content: InlineContent[] }
  | { type: 'heading'; attrs: { level: 1 | 2 | 3 }; content: InlineContent[] }
  | { type: 'codeBlock'; attrs: { language: string }; content: TextNode[] }
  | { type: 'image'; attrs: { src: string; alt: string } }
  | { type: 'bulletList'; content: ListItem[] }

type InlineContent = TextNode | HardBreak

type TextNode = {
  type: 'text'
  text: string
  marks?: Mark[]
}

type Mark =
  | { type: 'bold' }
  | { type: 'italic' }
  | { type: 'link'; attrs: { href: string } }
  | { type: 'comment'; attrs: { commentId: string } }
ProseMirror checks document changes against this schema as part of its transaction system, so an editor instance should never end up holding a document that violates it. This is critical for collaboration because malformed edits from any client can be rejected before they corrupt the document that users see and edit.
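To make the idea of schema enforcement concrete, here is a minimal sketch of a validator for the block model above. It is not ProseMirror's API: `SketchNode`, `ALLOWED_CHILDREN`, and `findSchemaErrors` are hypothetical names, and the `listItem` and `hardBreak` type names are assumptions, since the source does not define them.

```typescript
// Minimal node shape for this sketch; real editor nodes also carry attrs, text, and marks.
type SketchNode = { type: string; content?: SketchNode[] }

// Allowed child types per node type ('listItem' and 'hardBreak' are assumed names).
const ALLOWED_CHILDREN: Record<string, string[]> = {
  doc: ['paragraph', 'heading', 'codeBlock', 'image', 'bulletList'],
  paragraph: ['text', 'hardBreak'],
  heading: ['text', 'hardBreak'],
  codeBlock: ['text'],
  bulletList: ['listItem'],
  listItem: ['paragraph'],
  image: [],
  text: [],
  hardBreak: [],
}

// Returns human-readable problems; an empty array means the tree fits the schema.
function findSchemaErrors(node: SketchNode, path: string = node.type): string[] {
  const allowed = ALLOWED_CHILDREN[node.type]
  if (allowed === undefined) return [path + ': unknown node type']
  let errors: string[] = []
  const children = node.content || []
  for (let i = 0; i < children.length; i++) {
    const child = children[i]
    const childPath = path + ' > ' + child.type + '[' + i + ']'
    if (allowed.indexOf(child.type) === -1) {
      errors.push(childPath + ': not allowed inside ' + node.type)
    } else {
      // Only descend into children that are themselves allowed here.
      errors = errors.concat(findSchemaErrors(child, childPath))
    }
  }
  return errors
}
```

A collaboration layer could run a check like this on incoming content and drop or repair anything that returns errors, instead of letting it reach the editor view.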
The Core Problem: Concurrent Edits
Here is the scenario that breaks naive implementations. Two users are editing the same sentence:
Original: "The cat sat on the mat"
User A: Deletes "cat" → "The sat on the mat"
User B: Bolds "cat" → "The **cat** sat on the mat"
If you just apply both operations in order, you get nonsense. User B's bold operation targets a range that no longer exists after User A's delete. This is the conflict resolution problem, and there are two fundamentally different approaches.
OT vs CRDT: The Two Paradigms
| Dimension | OT (Operational Transformation) | CRDT (Conflict-free Replicated Data Type) |
|---|---|---|
| Core idea | Transform operations against each other so they converge | Data structure that merges automatically regardless of order |
| Server requirement | Central server required for transformation ordering | No central server needed (peer-to-peer possible) |
| Offline support | Complex — must queue and rebase operations | Native — merge whenever you reconnect |
| Implementation complexity | Simpler data model, harder algorithm (transform functions) | Complex data model, simpler merge (automatic) |
| Production examples | Google Docs, Google Sheets | Figma (custom, CRDT-inspired rather than a pure CRDT), Notion (partial), Linear |
| Dominant library | ShareDB, OT.js | Yjs, Automerge |
| Undo/redo | Transform-based inversion | Per-user undo via CRDT tombstones |
| Performance at scale | Linear with op count per transform | Memory grows with edit history (tombstones) |
| Correctness guarantee | Requires careful transform function pairs (O(n^2) pairs for n op types) | Mathematically proven to converge |
How OT Works
OT keeps a linear history of operations. When two clients submit concurrent operations, the server transforms one against the other so they produce the same final state regardless of application order.
Server state: "ABCD" (version 3)
Client A (at v3): insert('X', position 1) → "AXBCD"
Client B (at v3): delete(position 2) → "ABD"
Server receives A first → applies insert('X', 1) → "AXBCD" (v4)
Server receives B → but B was based on v3, not v4!
Transform: B.delete(pos 2) against A.insert(pos 1)
Since A inserted before position 2, shift B's position: delete(pos 3)
Result: "AXBD" — both clients converge to this
The problem with OT is that every pair of operation types needs a transform function. For a rich text editor with 20+ operation types, that is 400+ transform function pairs to get right. Google invested years of engineering into getting this correct for Docs.
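The walkthrough above can be sketched for plain text with just two operation types. This is a simplified illustration, not how any production OT system is implemented: `TextOp`, `applyOp`, and `transformOp` are hypothetical names, positions are 0-based, deletes remove a single character, and ties between two inserts at the same position are broken by comparing client IDs (an assumption of this sketch).

```typescript
type TextOp =
  | { kind: 'insert'; client: string; pos: number; text: string }
  | { kind: 'delete'; client: string; pos: number } // removes one character

function applyOp(doc: string, op: TextOp): string {
  return op.kind === 'insert'
    ? doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + 1)
}

// Rewrites `op` so it can run after `applied` has already been applied.
// Returns null when `op` has become a no-op.
function transformOp(op: TextOp, applied: TextOp): TextOp | null {
  if (applied.kind === 'insert') {
    // An earlier insert pushes us right. At the same position, a delete always
    // shifts (its target character moved); two inserts are ordered by client ID.
    const shifts =
      applied.pos < op.pos ||
      (applied.pos === op.pos && (op.kind === 'delete' || applied.client < op.client))
    return shifts ? { ...op, pos: op.pos + applied.text.length } : op
  }
  // `applied` is a delete.
  if (op.kind === 'delete' && op.pos === applied.pos) return null // both removed the same character
  return applied.pos < op.pos ? { ...op, pos: op.pos - 1 } : op
}

// Reproduce the example: both application orders end in the same text.
const opA: TextOp = { kind: 'insert', client: 'A', pos: 1, text: 'X' }
const opB: TextOp = { kind: 'delete', client: 'B', pos: 2 }

const bOnServer = transformOp(opB, opA) // delete shifted to position 3
const serverText = bOnServer ? applyOp(applyOp('ABCD', opA), bOnServer) : applyOp('ABCD', opA)

const aOnClientB = transformOp(opA, opB) // insert is unaffected
const clientBText = aOnClientB ? applyOp(applyOp('ABCD', opB), aOnClientB) : applyOp('ABCD', opB)
```

Even with two operation types, the transform function already has four cases plus a tie-break, which gives a feel for how the number of pairs grows in a rich text editor.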
How CRDTs Work
CRDTs take a completely different approach. Instead of transforming operations, they use a data structure where concurrent edits automatically merge to the same state, regardless of the order they are applied.
Yjs (the dominant CRDT library for web editors) uses a structure where every character has a unique ID based on the client that created it and a logical clock. Insertions are placed relative to their neighbors, not at absolute positions. Deletions mark characters as tombstones rather than removing them.
Client A inserts 'X' after position 1:
ID: (clientA, clock:7)
origin: character at position 1
content: 'X'
Client B deletes the character at position 2:
Marks that character's item, e.g. ID (clientB, clock:3) if client B typed it earlier, as deleted (tombstone)
These operations commute — apply them in any order,
you get the same result. No transformation needed.
The trade-off is memory. A CRDT cannot simply forget deleted items, because later operations may still refer to them, so tombstones accumulate. For a document edited heavily over months, the CRDT state can grow significantly larger than the visible text. Yjs mitigates this with garbage collection that discards the content of deleted items and keeps only compact metadata, but it is a real concern at scale.
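The ideas in this section (unique IDs, insertion relative to a neighbor, tombstones) fit in a small sequence CRDT. The sketch below is RGA-style and deliberately simplified; it is not the algorithm Yjs actually uses. `TinySequence` and its helpers are hypothetical names, `clock` is assumed to be a Lamport timestamp (always larger than any clock the client has seen), and operations are assumed to arrive after the items they refer to.

```typescript
type ItemId = { client: string; clock: number }
type SeqItem = { id: ItemId; origin: ItemId | null; char: string; deleted: boolean }

function sameId(a: ItemId, b: ItemId): boolean {
  return a.client === b.client && a.clock === b.clock
}

// Total order on IDs: higher clock wins, client ID breaks ties.
function isNewer(a: ItemId, b: ItemId): boolean {
  return a.clock > b.clock || (a.clock === b.clock && a.client > b.client)
}

class TinySequence {
  private items: SeqItem[] = []

  private indexOfId(id: ItemId): number {
    for (let i = 0; i < this.items.length; i++) {
      if (sameId(this.items[i].id, id)) return i
    }
    return -1
  }

  // Insert `char` directly after the item `origin` (null means document start).
  insert(id: ItemId, origin: ItemId | null, char: string): void {
    if (this.indexOfId(id) !== -1) return // already integrated (duplicate delivery)
    let i = origin === null ? 0 : this.indexOfId(origin) + 1
    // Skip items with newer IDs so concurrent inserts at one spot
    // end up in the same order on every replica.
    while (i < this.items.length && isNewer(this.items[i].id, id)) i++
    this.items.splice(i, 0, { id, origin, char, deleted: false })
  }

  // Deleting only marks the item; it stays in the list as a tombstone.
  remove(id: ItemId): void {
    const i = this.indexOfId(id)
    if (i !== -1) this.items[i].deleted = true
  }

  text(): string {
    return this.items.filter(it => !it.deleted).map(it => it.char).join('')
  }

  tombstoneCount(): number {
    return this.items.filter(it => it.deleted).length
  }
}
```

Because `remove` never shrinks `items`, the list only grows, which is the memory trade-off described above in miniature.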
Awareness Protocol
Separate from the document model, collaborative editors need an awareness protocol for presence information. Yjs includes this out of the box:
type AwarenessState = {
  user: {
    name: string
    color: string
    avatar?: string
  }
  cursor: {
    anchor: number
    head: number
  } | null
  selection: {
    from: number
    to: number
  } | null
  lastActive: number
}
Awareness state is ephemeral — it is not persisted and is not part of the document. When a user disconnects, their awareness state is automatically removed after a timeout. This is broadcast through a separate channel (typically the same WebSocket connection but a distinct message type) at a lower priority than document operations.
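The timeout behavior can be sketched as a small in-memory registry. This is an illustration of the idea rather than the Yjs awareness implementation: `PresenceRegistry` and its methods are hypothetical, and the clock is passed in explicitly so the logic is easy to test.

```typescript
type PresenceEntry = { name: string; color: string; lastSeen: number }

class PresenceRegistry {
  private entries: Record<string, PresenceEntry> = {}
  private timeoutMs: number

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs
  }

  // Called whenever any awareness message arrives from a client.
  heartbeat(clientId: string, name: string, color: string, now: number): void {
    this.entries[clientId] = { name, color, lastSeen: now }
  }

  // Drops clients that have been silent longer than the timeout.
  // Returns the removed client IDs so the UI can hide their cursors.
  prune(now: number): string[] {
    const removed: string[] = []
    const ids = Object.keys(this.entries)
    for (let i = 0; i < ids.length; i++) {
      if (now - this.entries[ids[i]].lastSeen > this.timeoutMs) {
        delete this.entries[ids[i]]
        removed.push(ids[i])
      }
    }
    return removed
  }

  activeClients(): string[] {
    return Object.keys(this.entries).sort()
  }
}
```

Nothing here is written to storage, which matches the ephemeral nature of presence: a server restart simply starts with an empty registry and clients repopulate it with their next heartbeat.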
I — Interface (APIs and Protocols)
WebSocket: The Real-Time Backbone
Document sync and awareness both flow over WebSocket. Here is the message protocol:
// Client to Server
type ClientMessage =
  | { type: 'sync-step-1'; payload: Uint8Array } // Initial state vector
  | { type: 'sync-step-2'; payload: Uint8Array } // State diff response
  | { type: 'update'; payload: Uint8Array }      // Document update (CRDT encoded)
  | { type: 'awareness'; payload: Uint8Array }   // Cursor/presence update

// Server to Client
type ServerMessage =
  | { type: 'sync-step-1'; payload: Uint8Array }
  | { type: 'sync-step-2'; payload: Uint8Array }
  | { type: 'update'; payload: Uint8Array }
  | { type: 'awareness'; payload: Uint8Array }
  | { type: 'error'; code: string; message: string }
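With Yjs, the sync protocol is handled automatically by y-websocket. The handshake works like this: on connect, the client sends sync step 1 (its state vector, a summary of the updates it already has); the server answers with sync step 2 (the updates the client is missing) and sends its own sync step 1, which the client answers the same way; from then on, both sides exchange incremental update messages.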
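The state-vector exchange behind sync steps 1 and 2 can be modeled without any binary encoding. The following is a conceptual sketch, not the y-protocols wire format: `LoggedUpdate`, `stateVectorOf`, and `missingFor` are hypothetical names, and each update is assumed to be identified by a client ID plus a per-client counter.

```typescript
// For each client: how many of its updates we hold (the next clock we expect).
type StateVector = Record<string, number>
type LoggedUpdate = { client: string; clock: number; payload: string }

// Sync step 1: summarize what this side already has.
function stateVectorOf(log: LoggedUpdate[]): StateVector {
  const sv: StateVector = {}
  for (let i = 0; i < log.length; i++) {
    const u = log[i]
    sv[u.client] = Math.max(sv[u.client] || 0, u.clock + 1)
  }
  return sv
}

// Sync step 2: reply with only the updates the other side lacks.
function missingFor(log: LoggedUpdate[], remote: StateVector): LoggedUpdate[] {
  return log.filter(u => u.clock >= (remote[u.client] || 0))
}

// A client that edited offline (update C0) reconnects to a server that moved on.
const serverLog: LoggedUpdate[] = [
  { client: 'A', clock: 0, payload: 'a0' },
  { client: 'A', clock: 1, payload: 'a1' },
  { client: 'A', clock: 2, payload: 'a2' },
  { client: 'B', clock: 0, payload: 'b0' },
]
const clientLog: LoggedUpdate[] = [
  { client: 'A', clock: 0, payload: 'a0' },
  { client: 'C', clock: 0, payload: 'c0' },
]

const toClient = missingFor(serverLog, stateVectorOf(clientLog)) // a1, a2, b0
const toServer = missingFor(clientLog, stateVectorOf(serverLog)) // c0
```

Each side sends only what the other lacks, so a reconnect after a short offline period costs roughly the size of the edits made in the meantime, not the size of the document.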
REST API: Everything Else
Not everything needs real-time. Use REST for operations that are infrequent and do not need instant propagation:
GET /api/documents/:id → Document metadata + initial CRDT state
POST /api/documents → Create new document
GET /api/documents/:id/versions → List version snapshots
GET /api/documents/:id/versions/:v → Specific version snapshot
POST /api/documents/:id/comments → Create comment
PATCH /api/documents/:id/comments/:c → Resolve/update comment
DELETE /api/documents/:id/comments/:c → Delete comment
POST /api/documents/:id/upload → Upload image, returns URL
Why Two Protocols?
WebSocket handles the hot path: keystrokes, cursor movements, awareness. REST handles the cold path: document CRUD, version history, comments. Mixing them on the same channel works but creates priority issues — you do not want a large version history response blocking a keystroke sync.
O — Optimizations
This is where a good design becomes a great one. These optimizations are the difference between a sluggish prototype and a production-quality editor.
1. Local-First: Apply Before You Sync
The single most important optimization. Never wait for the server to acknowledge an edit before showing it to the user.
User types 'H':
1. Apply to local CRDT state immediately (0ms — feels instant)
2. Re-render the editor view (< 16ms — next frame)
3. Encode the CRDT update (< 1ms)
4. Send over WebSocket (fire and forget)
5. Server receives, persists, broadcasts (50-200ms round trip)
6. Other clients receive and merge (no conflict possible with CRDT)
The user only experiences steps 1 and 2. Steps 3-6 happen in the background. This is why CRDTs pair so well with local-first: since merges are automatic, there is no risk that your local state will diverge permanently from the server.
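The ordering of those steps (apply locally first, send later) can be sketched with a tiny buffer that also covers the offline case. This is an assumption-laden illustration: `LocalFirstBuffer` is a hypothetical class, plain strings stand in for encoded CRDT updates, and the `send` callback stands in for the WebSocket.

```typescript
class LocalFirstBuffer {
  text = ''
  private pending: string[] = []
  private online = false
  private send: (update: string) => void

  constructor(send: (update: string) => void) {
    this.send = send
  }

  // Steps 1-2: the local state changes immediately, with no network involved.
  // Steps 3-4: the update is queued and sent only if a connection exists.
  type(chars: string): void {
    this.text += chars
    this.pending.push(chars)
    this.flush()
  }

  // Reconnecting drains everything that was typed while offline, in order.
  setOnline(online: boolean): void {
    this.online = online
    this.flush()
  }

  private flush(): void {
    if (!this.online) return
    while (this.pending.length > 0) {
      this.send(this.pending.shift() as string)
    }
  }
}

// Usage: typing offline still updates the local text right away.
const sentUpdates: string[] = []
const buffer = new LocalFirstBuffer(update => { sentUpdates.push(update) })
buffer.type('H')
buffer.type('i')
const sentWhileOffline = sentUpdates.length // nothing has left the client yet
buffer.setOnline(true) // queued updates go out now
buffer.type('!')
```

The same structure covers the offline-editing requirement: the queue is simply allowed to grow until the connection comes back.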
2. Cursor Interpolation
Remote cursors update at network frequency (every 50-100ms). Without interpolation, they jump jerkily across the screen. Smooth them with CSS transitions or spring animations:
// CursorOverlay renders remote cursors with smooth transitions
// CSS handles the interpolation between position updates
//
// .remote-cursor {
// transition: transform 80ms ease-out;
// }
The key insight: cursor position updates are low-priority. You can throttle awareness broadcasts to every 100ms without any perceived quality loss. Document updates, on the other hand, should be sent as soon as possible.
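Throttling awareness broadcasts while still delivering the latest cursor position could look like the sketch below. `ThrottledBroadcaster` is a hypothetical helper, and time is passed in as a parameter instead of using timers so the behavior stays deterministic; in a real client, `flush` would be driven by a timer or animation frame.

```typescript
class ThrottledBroadcaster<T> {
  private lastSentAt = -Infinity
  private pending: T | undefined = undefined
  private hasPending = false
  private intervalMs: number
  private send: (value: T) => void

  constructor(intervalMs: number, send: (value: T) => void) {
    this.intervalMs = intervalMs
    this.send = send
  }

  // Record the newest value; older unsent values are overwritten,
  // because only the latest cursor position matters.
  offer(value: T, now: number): void {
    this.pending = value
    this.hasPending = true
    this.flush(now)
  }

  // Sends the pending value if the interval has elapsed since the last send.
  flush(now: number): void {
    if (!this.hasPending || now - this.lastSentAt < this.intervalMs) return
    this.send(this.pending as T)
    this.hasPending = false
    this.lastSentAt = now
  }
}

// Usage: three cursor moves within 100ms produce two broadcasts, not three.
const broadcasts: number[] = []
const cursorSender = new ThrottledBroadcaster<number>(100, pos => { broadcasts.push(pos) })
cursorSender.offer(10, 0)   // sent immediately
cursorSender.offer(11, 30)  // held back
cursorSender.offer(12, 60)  // replaces 11
cursorSender.flush(100)     // sends 12
```

Document updates would bypass a helper like this entirely, since they should go out as soon as possible.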
3. Lazy Loading Document Blocks
For long documents (100+ pages), do not load the entire document into the editor at once. Virtualize it:
- Load the visible viewport plus a buffer above and below
- As the user scrolls, fetch and hydrate additional blocks
- Detach blocks that scroll far out of view to free memory
This is the same principle as virtualized lists (react-window, @tanstack/virtual), applied to document blocks. Notion uses this approach for their long-form documents.
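The core of block virtualization is deciding which blocks intersect the viewport plus its buffer. A minimal sketch, assuming block heights in pixels are already known or estimated (`visibleBlockRange` is a hypothetical helper, not part of any editor library):

```typescript
// Returns the inclusive index range of blocks to keep mounted,
// or { start: -1, end: -1 } when nothing intersects the window.
function visibleBlockRange(
  heights: number[],
  scrollTop: number,
  viewportHeight: number,
  bufferPx: number
): { start: number; end: number } {
  const top = scrollTop - bufferPx
  const bottom = scrollTop + viewportHeight + bufferPx
  let start = -1
  let end = -1
  let offset = 0 // top edge of the current block
  for (let i = 0; i < heights.length; i++) {
    const blockBottom = offset + heights[i]
    // A block is kept if any part of it overlaps [top, bottom).
    if (blockBottom > top && offset < bottom) {
      if (start === -1) start = i
      end = i
    }
    offset = blockBottom
  }
  return { start, end }
}

// Ten blocks of 100px each, a 200px viewport scrolled to 250px, 100px buffer.
const tenBlocks = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
const blockRange = visibleBlockRange(tenBlocks, 250, 200, 100) // { start: 1, end: 5 }
```

A linear scan is fine for a sketch; with thousands of blocks, a prefix-sum array and binary search would avoid walking the whole list on every scroll event.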
4. Per-User Undo Stack
Naive undo (Ctrl+Z) in a collaborative editor is a nightmare. If User A types "hello" and User B types "world", then User A presses Ctrl+Z, should it undo "hello" (User A's action) or "world" (the most recent action globally)?
The answer is clear: undo should only affect the current user's actions. This requires a per-user undo stack:
type UndoManager = {
  userId: string
  undoStack: UndoItem[]
  redoStack: UndoItem[]
}

type UndoItem = {
  // Only tracks operations from this specific user
  inverseOperations: CRDTUpdate[]
  timestamp: number
}
Yjs ships a Y.UndoManager class that handles this out of the box. It records which changes came from the origins it is configured to track (typically the local user's edits) and only reverts those on undo, leaving everyone else's edits intact.
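The per-user behavior can be illustrated with a toy model in which every insertion remembers its author. This is a sketch of the concept only, not how Yjs implements undo: `SharedTextLog` is hypothetical, and undo is modeled as marking the author's latest entry as removed.

```typescript
type LogEntry = { author: string; text: string; removed: boolean }

class SharedTextLog {
  private entries: LogEntry[] = []
  private undoStacks: Record<string, number[]> = {}
  private redoStacks: Record<string, number[]> = {}

  private stackFor(stacks: Record<string, number[]>, author: string): number[] {
    if (!stacks[author]) stacks[author] = []
    return stacks[author]
  }

  insert(author: string, text: string): void {
    this.entries.push({ author, text, removed: false })
    this.stackFor(this.undoStacks, author).push(this.entries.length - 1)
    this.redoStacks[author] = [] // a new edit invalidates this author's redo history
  }

  // Reverts only the calling author's most recent entry.
  undo(author: string): boolean {
    const index = this.stackFor(this.undoStacks, author).pop()
    if (index === undefined) return false
    this.entries[index].removed = true
    this.stackFor(this.redoStacks, author).push(index)
    return true
  }

  redo(author: string): boolean {
    const index = this.stackFor(this.redoStacks, author).pop()
    if (index === undefined) return false
    this.entries[index].removed = false
    this.stackFor(this.undoStacks, author).push(index)
    return true
  }

  text(): string {
    return this.entries.filter(e => !e.removed).map(e => e.text).join('')
  }
}

// The scenario from above: A types "hello ", B types "world", A presses Ctrl+Z.
const sharedLog = new SharedTextLog()
sharedLog.insert('A', 'hello ')
sharedLog.insert('B', 'world')
sharedLog.undo('A')
const afterUndo = sharedLog.text() // B's text survives A's undo
```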
5. Image Upload with Placeholder
When a user drags an image into the editor, do not block the editing experience while it uploads:
- Insert a placeholder block immediately with a local blob: URL and a loading indicator
- Upload the image to your CDN in the background
- When the upload completes, replace the blob: URL with the permanent CDN URL via a ProseMirror transaction
- If the upload fails, show an error state on the placeholder with a retry button
This is optimistic UI applied to media insertion. The document state is always valid — the placeholder is a real node in the document tree, not a hack.
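The placeholder lifecycle is a small state machine over the image node's attributes. The sketch below assumes a `status` attribute on the image node, which the document model above does not define; `ImageAttrs`, `UploadEvent`, and `nextImageAttrs` are hypothetical names, and the URLs are placeholders.

```typescript
type ImageAttrs = { src: string; status: 'uploading' | 'ready' | 'failed' }

type UploadEvent =
  | { kind: 'succeeded'; cdnUrl: string }
  | { kind: 'failed' }
  | { kind: 'retried' }

// Pure transition function: the result would be written to the node
// through a regular editor transaction, so the document stays valid throughout.
function nextImageAttrs(attrs: ImageAttrs, event: UploadEvent): ImageAttrs {
  switch (event.kind) {
    case 'succeeded':
      // Swap the local blob: URL for the permanent one.
      return { src: event.cdnUrl, status: 'ready' }
    case 'failed':
      // Keep the local preview so the user still sees what they dropped.
      return attrs.status === 'uploading' ? { src: attrs.src, status: 'failed' } : attrs
    case 'retried':
      return attrs.status === 'failed' ? { src: attrs.src, status: 'uploading' } : attrs
  }
}

// Usage: a failed upload followed by a successful retry.
let imageAttrs: ImageAttrs = { src: 'blob:local-preview', status: 'uploading' }
imageAttrs = nextImageAttrs(imageAttrs, { kind: 'failed' })
const statusAfterFailure = imageAttrs.status
imageAttrs = nextImageAttrs(imageAttrs, { kind: 'retried' })
imageAttrs = nextImageAttrs(imageAttrs, { kind: 'succeeded', cdnUrl: 'https://cdn.example.com/cat.png' })
```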
6. Version Snapshots
Do not snapshot every keystroke. Use heuristics:
- Snapshot every N minutes of inactivity (e.g., 5 minutes after the last edit)
- Snapshot when a user explicitly saves (Ctrl+S habit)
- Snapshot before destructive operations (large deletes, bulk formatting changes)
- Store snapshots as full CRDT state snapshots, not as operation logs. Restoring a version then means loading one blob, so the cost depends on the snapshot's size and not on how many operations came before it
Why Not Just Replay Operations for Version History?
Replaying a log of operations to reconstruct a historical version sounds elegant, but it is O(n) where n is the total number of operations since the snapshot. For a document with 6 months of active editing, that could be millions of operations. A full CRDT state snapshot is a single binary blob that can be loaded in milliseconds. The trade-off is storage space, but storage is cheap and user patience is not.
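The snapshot heuristics listed above can be collapsed into a single decision function. This is a sketch under stated assumptions: `SnapshotSignals` and `shouldSnapshot` are hypothetical, the 5-minute idle window comes from the example above, and the 500-character threshold for a "large delete" is an arbitrary illustrative value.

```typescript
type SnapshotSignals = {
  now: number                // current time in ms
  lastEditAt: number         // time of the most recent edit
  lastSnapshotAt: number     // time of the most recent snapshot
  explicitSave: boolean      // user pressed Ctrl+S
  pendingDeleteChars: number // size of a delete that is about to be applied
}

const IDLE_MS = 5 * 60 * 1000
const LARGE_DELETE_CHARS = 500 // assumed threshold, tune per product

function shouldSnapshot(s: SnapshotSignals): boolean {
  if (s.explicitSave) return true
  // Capture the state before a destructive operation lands.
  if (s.pendingDeleteChars >= LARGE_DELETE_CHARS) return true
  // Idle rule: only snapshot if something changed since the last snapshot.
  const hasUnsavedEdits = s.lastEditAt > s.lastSnapshotAt
  return hasUnsavedEdits && s.now - s.lastEditAt >= IDLE_MS
}
```

Checking `hasUnsavedEdits` keeps an idle document from producing a stream of identical snapshots.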
Putting It All Together
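Here is the full data flow for a single keystroke in the collaborative editor: the keystroke becomes a ProseMirror transaction, which is applied to the local Yjs document and rendered immediately; the encoded update is sent over the WebSocket; the server persists it and broadcasts it; and the other clients merge it into their own Yjs documents and re-render. Awareness updates travel alongside as separate, lower-priority messages.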
Common Pitfalls
| What developers do | What they should do | Why |
|---|---|---|
| Using absolute character positions for cursor sync | Use relative positions (Yjs RelativePosition) that survive concurrent insertions and deletions | Absolute positions shift when other users insert or delete text before the cursor. A cursor at position 10 might point to completely different text after a remote insertion at position 5. Relative positions are anchored to specific CRDT items and remain valid regardless of concurrent edits. |
| Sending the full document state on every edit | Send only incremental CRDT updates (typically 20-100 bytes per edit) | A full document sync on every keystroke would mean sending megabytes of data per second with multiple active editors. Incremental updates are orders of magnitude smaller and can be applied without re-parsing the entire document. |
| Using a global undo stack for all users | Implement per-user undo stacks that only revert the current user's operations | With a global undo stack, pressing Ctrl+Z might undo someone else's edit from 200ms ago. Users expect undo to reverse their own last action. Yjs provides UndoManager, which tracks operation ownership automatically. |
| Storing collaboration state in React component state | Use the CRDT document (Yjs Doc) as the single source of truth, with React as a rendering layer | React state and CRDT state would constantly drift apart. The CRDT document must be the authoritative source. React subscribes to CRDT changes and re-renders, but never owns the document data. This is the same principle as treating a database as the source of truth, not your UI cache. |
| Implementing custom conflict resolution from scratch | Use a battle-tested library like Yjs or Automerge | Conflict resolution algorithms have subtle edge cases that take years to discover and fix. Yjs has been battle-tested across thousands of production deployments. Rolling your own CRDT or OT implementation is like writing your own crypto: technically possible, practically a terrible idea. |
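The first pitfall in the table is easy to reproduce. The sketch below contrasts a stored index with a cursor anchored to a character's ID; `AnchoredChar` and `resolveAnchor` are hypothetical names for illustration. In Yjs, the equivalent job is done by its relative-position helpers (`Y.createRelativePositionFromTypeIndex` and `Y.createAbsolutePositionFromRelativePosition`).

```typescript
type AnchoredChar = { id: number; char: string }

// Resolve an anchored cursor to its current index (-1 if the anchor is unknown).
function resolveAnchor(doc: AnchoredChar[], anchorId: number): number {
  for (let i = 0; i < doc.length; i++) {
    if (doc[i].id === anchorId) return i
  }
  return -1
}

// The document "abc", with the cursor sitting just before 'c'.
const anchoredDoc: AnchoredChar[] = [
  { id: 1, char: 'a' },
  { id: 2, char: 'b' },
  { id: 3, char: 'c' },
]
const absoluteCursor = 2 // index of 'c' at the time it was recorded
const anchoredCursor = 3 // ID of 'c'

// A remote user inserts 'Z' at the start of the document.
anchoredDoc.splice(0, 0, { id: 4, char: 'Z' })

const staleChar = anchoredDoc[absoluteCursor].char                         // the index now points at 'b'
const liveChar = anchoredDoc[resolveAnchor(anchoredDoc, anchoredCursor)].char // the anchor still finds 'c'
```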
Key Principles
1. Local-first always: apply edits instantly, sync in the background. Never let the network block the UI.
2. CRDT for new projects, OT only if you are extending an existing OT-based system. CRDT merges are automatic and handle offline natively.
3. Separate document sync from presence/awareness. They have different priorities, frequencies, and persistence requirements.
4. Per-user undo stacks are non-negotiable in collaborative editors. Global undo breaks user expectations.
5. Schema-driven document models (ProseMirror) keep invalid states out of the editor, eliminating an entire class of bugs.
6. Version snapshots should be full state captures, not operation logs. Restoring a version must not require replaying the edit history.
Interview Tips
When you get this question in an interview, here is how to structure your 35-40 minutes:
- Requirements (3-5 min): Clarify scope. Ask: "Should we support rich text or just plain text? Real-time or near-real-time? How many concurrent editors?" This shows you do not make assumptions.
- Architecture (5-7 min): Draw the component tree. Explain the editor engine choice. Mention the separation between document sync and awareness.
- Data Model (10-12 min): This is where you spend the most time. Explain OT vs CRDT trade-offs. Walk through a concrete conflict scenario. Show the document model structure.
- Interface (5-7 min): WebSocket protocol, REST endpoints, the sync handshake. Binary encoding for efficiency.
- Optimizations (5-10 min): Local-first, cursor interpolation, per-user undo, lazy loading, image placeholders. Each one shows depth.
The biggest differentiator: walking through a concrete conflict scenario step-by-step. Most candidates say "we use CRDTs" and move on. The ones who get hired can explain exactly what happens when User A inserts at position 5 while User B deletes position 3, and why the CRDT guarantees convergence.