
A/B test statistical significance calculator

Enter visitors and conversions for two ad variants. Get a plain-English verdict — winner, loser, or needs more data — plus the numbers behind it.

A/B test significance calculator

Control — Ad A
Variant — Ad B

Conversion rates

Ad A (control)

Ad B (variant)

Absolute lift
Relative lift
Z-score
P-value (two-tailed)
Confidence

When to stop a test

The biggest mistake in A/B testing is peeking: checking the results daily and stopping as soon as significance appears. Each look at which you would be willing to stop gives random noise another chance to cross the threshold, so the false-positive rate climbs above the nominal 5%. Best practice: decide your required sample size before starting the test (using the output from this calculator), and only look at the result once you have reached that number of visitors.
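The effect of peeking can be illustrated with a small simulation. The sketch below is not part of this tool: it runs hypothetical A/A tests (both arms share the same true conversion rate, so any "significant" result is a false positive) and compares stopping at the first significant look against checking only once at the end. The function names, the 10% rate, and the number of looks are all assumptions chosen to keep the run short.

```python
import math
import random


def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Pooled two-proportion z-test for equal arm sizes n (two-tailed, 95%)."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variance, nothing to test
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_a / n - conv_b / n) / se > z_crit


def simulate(trials=400, looks=5, visitors_per_look=200, rate=0.10, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        any_look_significant = False
        hit = False
        for _ in range(looks):
            # Both arms draw from the SAME true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            hit = is_significant(conv_a, conv_b, n)
            any_look_significant = any_look_significant or hit
        peeking_hits += any_look_significant  # would have stopped at some look
        final_hits += hit                     # significance at the last look only
    return peeking_hits / trials, final_hits / trials


peek_rate, final_rate = simulate()
print(f"false positives when peeking:    {peek_rate:.1%}")
print(f"false positives with one check:  {final_rate:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the single-check rate; in practice it tends to be noticeably higher.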

Run every test for at least one full business cycle (typically 1–2 weeks) even if you hit sample size faster — day-of-week effects in traffic and conversion are real and will bias a test that runs only on weekdays or only over a weekend.

About this tool

Enter the visitor count and conversion count for your control (A) and variant (B). The calculator runs a two-proportion z-test and reports whether the difference is statistically significant at the 95% confidence level, along with the p-value and the relative lift. If the test isn't significant yet, it also estimates how many more visitors you need before you can call a winner.
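For readers who want to check the numbers by hand, here is a minimal sketch of a two-proportion z-test. It assumes the common pooled-variance form of the test; this page does not say which variant the calculator uses, and the function name and the example counts are hypothetical.

```python
import math


def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Pooled two-proportion z-test with a two-tailed p-value."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        raise ValueError("no variance: conversion rate is 0% or 100% in both arms")
    z = (rate_b - rate_a) / se
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "absolute_lift": rate_b - rate_a,
        "relative_lift": (rate_b - rate_a) / rate_a,
        "z": z,
        "p_value": p_value,
        "significant_95": p_value < 0.05,
    }


# Hypothetical inputs: 10,000 visitors per ad, 200 vs 250 conversions.
result = two_proportion_z_test(10_000, 200, 10_000, 250)
for name, value in result.items():
    print(f"{name}: {value}")
```

With these made-up inputs (2.0% vs 2.5%) the z-score is roughly 2.38 and the two-tailed p-value roughly 0.017, so the difference would count as significant at 95% confidence.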

Frequently asked questions

What does "statistically significant at 95% confidence" mean?

It means that if there were truly no difference between A and B, you'd see a result this extreme (or more extreme) less than 5% of the time by chance. It doesn't prove B is better forever — it means you have strong enough evidence to act. The standard threshold in most A/B testing is 95% confidence (p < 0.05), though some teams accept 90% for lower-stakes decisions.

How many visitors do I need before the test is valid?

The required sample size depends on three things: your baseline conversion rate, the minimum effect size you care about detecting, and the confidence and power levels you want. For a 2% baseline conversion rate and a 10% relative lift (detecting a change from 2% to 2.2%), you need roughly 80,000 visitors per variant at 95% confidence and 80% power. The calculator estimates the required sample size from your observed rates. Do not stop a test early just because it looks significant: early stopping is a well-known source of false positives.
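The dependence on baseline rate, effect size, and confidence/power can be sketched with the standard normal-approximation formula for comparing two proportions. This is an illustration, not necessarily the formula this calculator uses; the function name is hypothetical and the 80% power default is an assumption, since the page does not state a power level.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed PER VARIANT for a two-tailed test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


# 2% baseline, detecting a 10% relative lift (2% -> 2.2%).
print(required_sample_size(0.02, 0.10))
# Same baseline, but only caring about a 20% relative lift (2% -> 2.4%).
print(required_sample_size(0.02, 0.20))
```

With a 2% baseline and a 10% relative lift this comes out to roughly 80,000 visitors per variant; doubling the lift you care about to 20% cuts the requirement to roughly a quarter of that, because the sample size scales with the inverse square of the difference.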

Should I use a one-tailed or two-tailed test?

Two-tailed (this tool's default) is almost always the right choice. A one-tailed test only looks for B being *better* than A and cannot flag a variant that is worse. But bad variants do exist, and deploying a variant that's significantly worse would be harmful. Two-tailed tests are more conservative: you're testing whether there's any difference, in either direction, so a result needs stronger evidence to count as significant.

What's the difference between statistical significance and practical significance?

Statistical significance tells you the effect is unlikely to be chance; practical significance tells you it's worth acting on. A test with 10,000,000 visitors per variant might show that B converts at 2.02% vs A's 2.00%: statistically significant, but a relative lift of only 1%, which may be meaningless in practice. Always check the absolute lift and ask whether the improvement justifies the engineering cost of rolling out the variant and the risk of unintended side effects.
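One way to keep the two questions separate is to check them independently. The sketch below is an illustration under stated assumptions: the function name, the hypothetical counts, and the 5% minimum relative lift are all made up, and the right threshold depends on what a rollout costs you.

```python
import math


def assess(visitors_a, conversions_a, visitors_b, conversions_b,
           alpha=0.05, min_relative_lift=0.05):
    """Report statistical and practical significance as separate flags."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed
    relative_lift = (rate_b - rate_a) / rate_a
    return {
        "p_value": p_value,
        "absolute_lift": rate_b - rate_a,
        "relative_lift": relative_lift,
        "statistically_significant": p_value < alpha,
        # Assumed business rule: ignore lifts smaller than min_relative_lift.
        "practically_significant": abs(relative_lift) >= min_relative_lift,
    }


# Hypothetical huge test: 10M visitors per ad, 2.00% vs 2.02% conversion.
result = assess(10_000_000, 200_000, 10_000_000, 202_000)
print(result)
```

For these hypothetical counts the difference passes the statistical check but fails the practical one: the sample is large enough to detect a 1% relative lift, yet that lift is below the assumed 5% bar for acting on it.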

What if my test keeps running but never reaches significance?

If you've hit the required sample size and the result is still not significant, the most likely explanation is that the variant has a very small effect (or none at all). Consider: the change may not be impactful enough to move the metric, the true effect may be smaller than the effect size you hypothesised when sizing the test, or the metric you're testing may be too noisy. At this point, either conclude that you found no meaningful difference or run a new test with a more substantial change.
