
Feature Flags and Trunk-Based Development

Intermediate · 20 min read

The Branch Problem Nobody Talks About

You have been there. A feature branch lives for three weeks. During that time, main moves forward — new components, refactored utilities, updated dependencies. When you finally try to merge, the conflict resolution takes longer than the feature itself. Half the team is blocked waiting for the merge. QA retests everything because the merge changed code they already validated.

Long-lived branches are a productivity trap disguised as safety. They feel safe because "my changes are isolated." But that isolation is the problem. The longer a branch lives, the more it diverges from reality, and the more painful the inevitable reconciliation becomes.

Mental Model

Think of branches like lanes on a highway. If two cars drive in parallel lanes for 30 seconds and then merge, it is easy. If they drive in parallel lanes for 3 hours and end up in different cities, merging is a nightmare. Trunk-based development keeps everyone on the same road. Feature flags are your turn signals — you signal your intent (the flag) without actually turning yet (releasing to users).

Trunk-Based Development: Everyone Ships to Main

In trunk-based development, there is one primary branch — main (or trunk). All developers commit to it directly or through very short-lived branches (hours, not weeks). The key rule: main is always deployable.

This sounds terrifying at first. "What if someone commits broken code to main?" That is exactly what your CI pipeline prevents. If lint, typecheck, tests, and build all pass, the code is safe to merge. If it introduces a user-facing change that is not ready, you wrap it in a feature flag.

Execution Trace

Traditional branching: create feature/auth-redesign branch → develop for 2 weeks → resolve 14 merge conflicts → QA retests everything → merge to main → deploy. Result: high risk, long feedback loop, big-bang merge.

Trunk-based + flags: commit auth redesign behind a flag to main daily → CI validates each commit → flag is off in production → enable flag for internal team → gradually roll out → remove flag. Result: low risk, continuous integration, incremental delivery.

Why Trunk-Based Scales

The math is straightforward. With long-lived branches:

  • Merge conflict probability climbs rapidly with branch lifetime and team size
  • Integration testing is deferred until merge, when it is most expensive to fix
  • Code review is harder because PRs are massive (try reviewing a 2000-line diff meaningfully)
  • Deployment risk is high because you are deploying weeks of untested-together changes

With trunk-based development:

  • Small, frequent commits mean conflicts are rare and trivial
  • Continuous integration catches problems immediately
  • PRs are small — 50-200 lines, reviewable in 10 minutes
  • Deployment risk is low because each deployment contains hours of changes, not weeks
Quiz
Your team has 8 developers, each working on a separate feature branch for 2 weeks. On merge day, 5 branches have conflicts with each other. What is the root cause?

Feature Flags: Decoupling Deploy from Release

Here is the key insight: deploying code and releasing a feature are two different things. You can deploy code to production with a feature flag wrapping it, and zero users see it until you flip the flag.

function CourseNavigation() {
  const showRedesign = useFeatureFlag('course-nav-redesign');

  if (showRedesign) {
    return <NewCourseNav />;
  }

  return <LegacyCourseNav />;
}

The code for NewCourseNav is in production. It went through CI, code review, and deployment. But users see LegacyCourseNav until you enable the flag. This separation gives you superpowers:

  • Deploy anytime — no freeze windows, no "wait until the feature is ready"
  • Roll back instantly — disable the flag, no deployment needed
  • Test in production — enable the flag for your team only
  • Gradual rollout — show to 1% of users, then 10%, then 50%, then 100%
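Gradual rollout is typically implemented by hashing a stable user ID into a bucket, so each user gets a consistent decision and raising the percentage only ever adds users. A minimal sketch of the idea; the hash function and helper names are illustrative, not any particular SDK's API:

```typescript
// Deterministically bucket a user into 0-99 for a given flag. The same
// user + flag always lands in the same bucket, so raising the rollout
// percentage adds users without flipping anyone back and forth.
function bucketFor(userId: string, flagKey: string): number {
  let hash = 0;
  for (const char of `${flagKey}:${userId}`) {
    hash = (hash * 31 + char.charCodeAt(0)) | 0; // simple 32-bit string hash
  }
  return Math.abs(hash) % 100;
}

function isEnabledFor(
  userId: string,
  flagKey: string,
  rolloutPercent: number, // 0-100
): boolean {
  return bucketFor(userId, flagKey) < rolloutPercent;
}

// At 10%, roughly one user in ten sees the feature; at 100%, everyone does.
const enabled = isEnabledFor('user-42', 'course-nav-redesign', 10);
```

Real flag services use better-distributed hashes, but the property that matters is the same: the decision is deterministic per user, not a fresh coin flip per request.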

Client-Side vs Server-Side Flags

Client-side flags are evaluated in the browser. The flag service sends a configuration blob to the client, and your React code checks it. Pros: no server involvement, fast evaluation. Cons: the flag payload is visible in network requests (users can see which flags exist), and the old/new code both ship in the bundle.

Server-side flags are evaluated on the server before rendering. In a Next.js app, you check the flag in a Server Component or middleware and only send the relevant HTML/JS to the client. Pros: users cannot see flag configurations, unused code paths are not shipped. Cons: adds server latency for flag evaluation.

async function CoursePage() {
  // getFeatureFlags reads the user's cookies/headers on the server
  const flags = await getFeatureFlags();

  return (
    <main>
      {flags.showNewPricing ? (
        <NewPricingTable />
      ) : (
        <LegacyPricingTable />
      )}
    </main>
  );
}

For most frontend features, server-side evaluation is better. It avoids the flash of content switching (user sees the old UI for a split second before the flag loads) and keeps your bundle smaller.

Quiz
You implement a feature flag client-side. Users occasionally see a brief flash of the old UI before the new UI appears. What causes this?

The Flag Lifecycle

Feature flags are not a set-and-forget tool. They have a lifecycle, and managing it is critical to avoiding the dreaded "flag debt."

Naming Conventions

Bad flag names are the first step toward flag debt. When you have 200 flags and half of them are named temp-fix, new-thing, or test-123, nobody knows which are safe to remove.

Good flag names follow a pattern: [scope]-[feature]-[intent]

  • course-nav-redesign — clear what it controls
  • pricing-table-ab-test — indicates it is an experiment with a planned end
  • checkout-paypal-integration — specific feature being gated
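The [scope]-[feature]-[intent] pattern is easy to enforce mechanically, for example as a check in CI when a flag is created. A hypothetical validator; the regex and function name are this article's convention, not a standard API:

```typescript
// Matches lowercase kebab-case names with at least two segments,
// e.g. "course-nav-redesign" or "pricing-table-ab-test".
const FLAG_NAME_PATTERN = /^[a-z0-9]+(-[a-z0-9]+)+$/;

function isValidFlagName(name: string): boolean {
  return FLAG_NAME_PATTERN.test(name);
}

isValidFlagName('course-nav-redesign'); // true
isValidFlagName('test_123');            // false: underscore, no clear scope
isValidFlagName('newthing');            // false: single segment, no scope
```

A regex cannot catch a vague-but-well-formed name like temp-fix, so naming still belongs in code review; the check just eliminates the worst offenders automatically.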

The 30-Day Rule

Every flag should have a removal date set at creation time. A common rule of thumb is 30 days from full rollout to removal: if a flag has been at 100% for 30 days with no issues, the flag is no longer protecting you; it is just dead code with extra indirection.
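The 30-day rule can be checked automatically if you record rollout dates. A sketch, assuming you can export flag metadata from your flag service; the `FlagMeta` shape is invented for illustration:

```typescript
interface FlagMeta {
  key: string;
  owner: string;
  fullRolloutAt: Date | null; // when the flag reached 100%, if it has
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Flags that have been at 100% for more than 30 days are removal candidates.
function staleFlags(flags: FlagMeta[], now: Date = new Date()): FlagMeta[] {
  return flags.filter(
    (flag) =>
      flag.fullRolloutAt !== null &&
      now.getTime() - flag.fullRolloutAt.getTime() > THIRTY_DAYS_MS,
  );
}
```

Run this in a scheduled CI job and open a cleanup ticket (or fail the build) for each stale flag, and the rule enforces itself.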

Common Trap

The most dangerous flag debt is not the flags you know about — it is the interactions between flags. If flag A changes the navigation and flag B changes the layout, what happens when both are enabled? What about when only one is? The number of possible states grows exponentially with the number of active flags. Every stale flag doubles your testing surface area. Clean them up aggressively.
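You can see the combinatorial blowup directly by enumerating flag states for a test matrix. A small illustrative helper, not part of any SDK:

```typescript
// Generate every on/off combination for a set of flags: 2^n states.
function flagCombinations(flagKeys: string[]): Record<string, boolean>[] {
  const total = 2 ** flagKeys.length;
  const combos: Record<string, boolean>[] = [];
  for (let i = 0; i < total; i++) {
    const combo: Record<string, boolean> = {};
    flagKeys.forEach((key, bit) => {
      combo[key] = Boolean(i & (1 << bit)); // read bit `bit` of counter i
    });
    combos.push(combo);
  }
  return combos;
}

flagCombinations(['nav-redesign', 'layout-grid']).length; // 4 states
// With 10 active flags that is 1024 states to reason about.
```

Exhaustively testing every combination stops being feasible very quickly, which is the practical argument for removing flags aggressively rather than testing around them.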

Flag Services: Choosing Your Tool

LaunchDarkly

The enterprise standard. Supports complex targeting rules (by user segment, geography, percentage), multivariate flags (not just on/off but A/B/C/D), and extensive audit logging. Has SDKs for React, Next.js, Node.js, and practically every other platform. The React SDK provides a useFlags() hook and an LDProvider context.

Statsig

Strong experimentation platform that doubles as a feature flag service. Excels at A/B testing with statistical rigor — it calculates statistical significance automatically and tells you when an experiment has enough data to make a decision. Acquired by OpenAI in 2025, which means deep AI integration is likely coming.

Unleash

Open-source feature flag service you can self-host. No vendor lock-in, full data control. The tradeoff is operational overhead — you run the infrastructure. Good for teams with strong DevOps capabilities who want to avoid SaaS costs.

Flagsmith

Another open-source option with a managed cloud offering. Supports remote configuration (changing values, not just on/off) and has a clean REST API. Simpler than LaunchDarkly, which can be an advantage if you do not need enterprise governance.

| Feature | LaunchDarkly | Statsig | Unleash | Flagsmith |
| --- | --- | --- | --- | --- |
| Pricing model | Per-seat, enterprise pricing | Free tier, per-event pricing | Free (self-hosted), paid cloud | Free tier, per-seat cloud |
| A/B testing | Yes, with metrics | Best-in-class experimentation | Basic, via strategies | Yes, with segments |
| Self-hosting | No | No | Yes (primary model) | Yes |
| React SDK | Yes (hooks + context) | Yes (hooks + context) | Yes (hooks + context) | Yes (hooks + context) |
| Server-side evaluation | Yes | Yes | Yes | Yes |
| Best for | Enterprise teams needing governance | Product teams running experiments | Teams wanting full data control | Teams wanting simplicity |

Dark Launching

Dark launching is deploying a feature to production and routing real traffic through it — but without showing the result to users. It is the ultimate production test.

async function searchCourses(query: string) {
  const start = performance.now();
  const results = await currentSearchEngine.search(query);
  const currentLatency = performance.now() - start;

  if (featureFlags.isEnabled('new-search-engine-shadow')) {
    const shadowStart = performance.now();
    // Fire-and-forget: the shadow request never blocks the user's response
    newSearchEngine
      .search(query)
      .then((shadowResults) => {
        trackSearchComparison({
          query,
          currentCount: results.length,
          shadowCount: shadowResults.length,
          latencyDelta: performance.now() - shadowStart - currentLatency,
        });
      })
      .catch(() => {
        // Swallow shadow failures: they must never surface to users
      });
  }

  return results;
}

Users see results from the current search engine. But behind the scenes, the new engine also processes the query, and you log the comparison. After a week of shadow traffic, you know exactly how the new engine performs — latency, result quality, error rates — without any user ever seeing it.

Quiz
Your dark launch of a new search engine shows that it returns results 200ms faster but misses 5% of relevant results compared to the current engine. What should you do?

A/B Testing with Feature Flags

Feature flags naturally support A/B testing. Instead of a binary on/off, you create a multivariate flag that splits users into groups:

function PricingPage() {
  const variant = useFeatureFlag('pricing-layout-experiment');

  switch (variant) {
    case 'control':
      return <CurrentPricingLayout />;
    case 'horizontal':
      return <HorizontalPricingLayout />;
    case 'comparison-table':
      return <ComparisonPricingLayout />;
    default:
      return <CurrentPricingLayout />;
  }
}

The flag service handles the randomization and ensures users consistently see the same variant (sticky sessions). Your analytics tracks conversion rates per variant, and the flag service (especially Statsig) can calculate statistical significance automatically.
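Sticky assignment works the same way as percentage rollout: hash the user ID into a stable bucket, then map buckets to variants by weight, so a user never switches groups mid-experiment. A minimal sketch; the hash and weights are illustrative, and real SDKs such as Statsig's handle this for you:

```typescript
interface Variant {
  name: string;
  weight: number; // weights across all variants sum to 100
}

// Map a user deterministically to one weighted variant.
// The same user + experiment always yields the same variant (sticky).
function assignVariant(
  userId: string,
  experiment: string,
  variants: Variant[],
): string {
  let hash = 0;
  for (const char of `${experiment}:${userId}`) {
    hash = (hash * 31 + char.charCodeAt(0)) | 0; // simple 32-bit string hash
  }
  const bucket = Math.abs(hash) % 100;

  let cumulative = 0;
  for (const variant of variants) {
    cumulative += variant.weight;
    if (bucket < cumulative) return variant.name;
  }
  return variants[0].name; // fallback to the first (control) variant
}

const variant = assignVariant('user-42', 'pricing-layout-experiment', [
  { name: 'control', weight: 34 },
  { name: 'horizontal', weight: 33 },
  { name: 'comparison-table', weight: 33 },
]);
```

Keying the hash on the experiment name as well as the user ID means assignments in one experiment are independent of assignments in another.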

Key Rules

  1. Always have a control group that sees the current experience — never test two new variants without a baseline
  2. Let experiments run until statistical significance is reached — do not peek at early results and make decisions
  3. One experiment per user flow at a time — overlapping experiments produce confounded results
  4. Log the flag variant with every analytics event so you can segment all metrics by experiment group
What developers do, and what they should do instead:

  • Creating flags without an owner or removal date. Ownerless flags become orphans that nobody dares to remove, and they accumulate into flag debt that makes the codebase increasingly fragile. Instead: every flag gets an owner and a planned removal date at creation time.
  • Evaluating flags on every render without caching. Per-render evaluation adds latency and can cause UI flickering; most flag SDKs cache values after the initial fetch and use server-sent events for updates. Instead: initialize the flag SDK once and cache flag values for the session.
  • Testing only the flag-on path and ignoring the flag-off path. The flag-off path is what 100% of users see until rollout begins, so a bug there is a production bug affecting everyone. Instead: test both paths explicitly in your test suite.
  • Using feature flags for permanent configuration. Feature flags are temporary by design; using them for permanent config means they never get cleaned up and clutter the flag service forever. Instead: use environment variables or config files for permanent settings.

The Discipline of Flag Hygiene

Feature flags are powerful, but they require discipline. Without it, you end up with a codebase riddled with conditional branches, impossible-to-test state combinations, and a flag service with 500 flags where nobody knows which ones are still needed.

The teams that succeed with flags treat them like borrowed tools — useful for the job, but you return them when you are done. Every flag has a creator, an expiration date, and a cleanup PR linked in the flag service. CI can even enforce this: a lint rule that flags any feature flag reference older than 30 days post-full-rollout.

The payoff is enormous. Your team ships to production multiple times a day. Rollbacks take seconds instead of hours. New features reach users gradually instead of all-at-once. And when something goes wrong, you flip a switch instead of reverting a deployment. That is the power of decoupling deploy from release.