What It Is

Statistical comparison of two or more design variants using real traffic. When you have enough users and a clear metric, this removes opinion from design decisions and lets behavior speak.

How It Works

Define the metric you're optimizing for. Create variants. Split traffic randomly. Decide your sample size up front, run until you reach it, then analyze the results for statistical significance. The discipline is in not peeking early and not stopping the test the moment one variant looks like it's winning; that's how you get false positives.
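The "analyze results" step can be sketched with a standard two-proportion z-test. This is a minimal illustration using only the standard library; the function name and all the traffic numbers are hypothetical, and a real analysis would use a vetted stats package.

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pool the two samples to estimate the common rate under H0.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, built from erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/2400 conversions on A, 150/2400 on B.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=150, n_b=2400)
significant = p < 0.05  # evaluate only once the planned sample size is reached
```

Note that the significance check runs once, at the end: calling it after every day's traffic and stopping on the first `p < 0.05` is exactly the peeking the section warns against.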

What It Requires

Enough traffic to reach significance in a reasonable timeframe. A clear metric. And variants that are different enough to produce a measurable effect. If you're testing two nearly identical button colors, you'll need millions of visitors to detect a difference that probably doesn't matter anyway.
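The traffic requirement can be made concrete with the standard per-variant sample size formula for comparing two proportions. This is a rough sketch: the constants assume a two-sided alpha of 0.05 (z = 1.96) and 80% power (z = 0.84), and the baseline rates are hypothetical.

```python
from math import ceil

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant to detect p1 vs p2."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# A meaningful lift, 5% -> 6% conversion: on the order of thousands per variant.
n_big = sample_size_per_variant(0.05, 0.06)
# A trivial lift, 5.00% -> 5.05% (near-identical button colors):
# on the order of millions per variant.
n_tiny = sample_size_per_variant(0.05, 0.0505)
```

The effect size appears squared in the denominator, which is why halving the detectable difference quadruples the required traffic and why tiny cosmetic changes are rarely worth testing.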

When Not to Use It

Low-traffic sites. Qualitative questions that can't be reduced to a metric. When variants differ in several ways at once, so you can't isolate which change drove the result. For those, use qualitative methods instead.