Definition
A holdout test is a randomized experiment that measures how much a marketing channel causes conversions, as distinct from how many conversions happen through it. The method: assign a random slice of the channel's traffic (say 10%) to a control group that's not exposed to the channel's treatment, compare conversion outcomes between treated and held-out groups, and attribute the difference to the channel itself.
Holdout tests are the gold standard for incrementality measurement. They apply the same experimental logic as the randomized controlled trials pharmaceutical companies use to test drugs.
How a holdout works in affiliate marketing
In an affiliate context, the mechanism is simple:
- Partner sends traffic. A publisher redirects a customer to the brand through a tracked affiliate link.
- Tracker splits traffic deterministically. A hash of the click (visitor fingerprint + timestamp) is compared against the rule's holdout percentage. Clicks in the holdout slice are flagged `is_holdout = true`.
- Everyone still sees the offer. Clicks in both the treated and holdout slices redirect to the advertiser normally. The customer experience is identical. What differs is whether downstream conversions are counted toward the partner's reported performance.
- Daily aggregation. Every night, the system compares the conversion rate in the held-out slice to the rate in the treated slice for each partner.
- Statistical test. A two-proportion z-test gives a p-value; a Wald 95% CI bounds the lift estimate. If the CI excludes zero, the result is statistically significant.
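The traffic-splitting step can be sketched with a deterministic hash. This is a minimal illustration, not the tracker's actual implementation; the fingerprint format and the 10% default are assumptions:

```python
import hashlib

def assign_holdout(fingerprint: str, holdout_pct: float = 0.10) -> bool:
    """Deterministically map a click to the holdout slice.

    The same fingerprint always lands in the same bucket, and the
    hash spreads clicks uniformly over [0, 1].
    """
    digest = hashlib.sha256(fingerprint.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return bucket < holdout_pct

# Roughly 10% of clicks should be flagged is_holdout = true
clicks = [f"visitor-{i}|2024-01-01T00:00:{i % 60:02d}" for i in range(10_000)]
held_out = sum(assign_holdout(c) for c in clicks)
```

Because assignment depends only on the hash, a returning visitor stays in the same slice, which keeps the treated and holdout populations stable over the life of the test.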
What makes a good holdout
Randomization must be real
The deterministic-hash approach above gives every visitor the same probability of landing in either group, independent of who they are. Non-random assignment — e.g., "holdout everyone from California" — is a geo test, not a holdout, and is vulnerable to regional confounders.
The holdout must be blinded from the partner
If the partner knows their traffic is in a measurement test, they'll send their best customers to the treated slice. Holdouts only work when the partner can't see the split.
Volume has to be sufficient
A holdout on 500 clicks a month will never reach significance. Rough rule of thumb for detecting a 10% lift at 95% confidence:
| Baseline conversion rate | Monthly clicks needed |
|---|---|
| 1% | ~130,000 |
| 3% | ~45,000 |
| 10% | ~15,000 |
For partners below these thresholds, the holdout will still give you a directional signal, but you'll need to run it longer — and accept that borderline results aren't statistically conclusive.
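Thresholds like those in the table come from the standard two-proportion sample-size formula. The sketch below assumes equal treated/holdout groups and 80% power; the table's figures likely bake in different power and allocation assumptions, so expect the same order of magnitude rather than identical numbers:

```python
from math import sqrt

def clicks_per_group(base_cvr: float, rel_lift: float = 0.10,
                     z_alpha: float = 1.96,  # 95% confidence, two-sided
                     z_beta: float = 0.84) -> int:  # 80% power
    """Clicks needed in EACH group to detect a relative lift in CVR."""
    p1 = base_cvr
    p2 = base_cvr * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    delta = p2 - p1
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return int(n) + 1

for cvr in (0.01, 0.03, 0.10):
    print(f"{cvr:.0%} baseline: ~{clicks_per_group(cvr):,} clicks per group")
```

The key driver is the baseline rate: at 1% CVR the same relative lift is a much smaller absolute difference, so the required sample grows roughly tenfold.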
Holdouts need a clean comparison window
If a partner's creative changes mid-test, or your site does a major redesign, or the customer base shifts (seasonality), the holdout is comparing apples to oranges. Either end the test and start a new one, or segment the analysis.
How to pick the holdout percentage
The trade-off: bigger holdouts reach significance faster, but forgo more attributable conversions while the test runs.
- 5–10% is the standard range
- 10% is a good default — balanced between speed-to-significance and revenue impact
- 20%+ only makes sense if you genuinely doubt the partner's contribution and want a fast answer
A 10% holdout on a $100k/month partner means $10k in potential commissions is redirected into measurement. If the test reveals the partner drives zero incremental conversions, cutting them saves the remaining ~$90k/month going forward. That's a good trade.
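The arithmetic above, as a small sketch (the spend figure and flat commission model are illustrative assumptions):

```python
def holdout_economics(monthly_commissions: float, holdout_pct: float):
    """Commissions withheld during the test vs. spend at risk if the partner
    turns out to be non-incremental."""
    withheld = monthly_commissions * holdout_pct          # cost of measurement
    still_paid = monthly_commissions * (1 - holdout_pct)  # saved if lift is zero
    return withheld, still_paid

# $100k/month partner, 10% holdout
withheld, still_paid = holdout_economics(100_000, 0.10)
```

Doubling the holdout to 20% doubles the measurement cost but also roughly halves the time to significance, which is why a larger slice can make sense when you already doubt the partner.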
Reading the results
The Incrementality dashboard in most platforms (including Trcker) reports:
- Treated CVR — conversion rate among traffic NOT held out
- Holdout CVR — conversion rate in the random control slice
- Lift % — (treated − holdout) / holdout, positive if the channel is incremental
- p-value — probability of seeing a difference at least this large if the channel truly had no effect
- 95% CI — range of plausible true-lift values
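The dashboard metrics above can be reproduced from raw counts with a pooled two-proportion z-test and a Wald interval. The conversion counts here are made up for illustration:

```python
from math import erf, sqrt

def holdout_readout(treated_conv, treated_n, holdout_conv, holdout_n):
    """Lift %, two-sided p-value, and Wald 95% CI on the absolute CVR difference."""
    p_t = treated_conv / treated_n
    p_h = holdout_conv / holdout_n
    lift_pct = (p_t - p_h) / p_h * 100

    # Pooled two-proportion z-test
    p_pool = (treated_conv + holdout_conv) / (treated_n + holdout_n)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / treated_n + 1 / holdout_n))
    z = (p_t - p_h) / se_pool
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail

    # Wald 95% CI on the absolute difference in conversion rates
    se_wald = sqrt(p_t * (1 - p_t) / treated_n + p_h * (1 - p_h) / holdout_n)
    ci = (p_t - p_h - 1.96 * se_wald, p_t - p_h + 1.96 * se_wald)
    return lift_pct, p_value, ci

# 90k treated clicks at 3.22% CVR vs. 10k held-out clicks at 2.8% CVR
lift, p, (lo, hi) = holdout_readout(2_900, 90_000, 280, 10_000)
```

With these counts the CI excludes zero, so the partner's lift would be reported as significant; shrink the holdout sample and the same lift estimate drops out of significance.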
The decision rule is straightforward:
| Result | Action |
|---|---|
| Significant + positive lift | Keep the partner — they're driving incremental revenue |
| Significant + zero or negative lift | Pause or renegotiate — you're paying for conversions that happen anyway |
| Not significant, positive lift | Keep running the test — directionally positive but unproven |
| Not significant, negative lift | Consider pausing — at best the partner's neutral, at worst anti-incremental |
Common pitfalls
- Peeking early. Looking at results every day and stopping when they look significant inflates your false-positive rate. Commit to a minimum sample size before you start.
- Running too many tests simultaneously. If you have 20 active holdouts, you'll see one "significant" result at p < 0.05 purely by chance. Correct for multiple comparisons or stagger tests.
- Forgetting about novelty effects. A new creative's first week of performance is never representative. Run holdouts on stable creative, not launches.
- Confusing correlation with causation. A holdout measures causation within its sample — it doesn't tell you what would happen if you scaled the channel up 10x or moved budget around. That's a different test.
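One standard correction for the multiple-comparisons pitfall is the Holm–Bonferroni procedure. A minimal sketch, assuming you already have one p-value per active holdout:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Flag which tests stay significant while controlling the family-wise
    error rate: test the smallest p-value at alpha/m, the next at alpha/(m-1),
    and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # all remaining (larger) p-values fail too
    return significant

# 20 partner holdouts: two borderline p-values and one genuinely small one
p_vals = [0.001] + [0.04] * 2 + [0.5] * 17
flags = holm_bonferroni(p_vals)
```

Under naive per-test thresholds the two p = 0.04 results would look significant; after correction only the p = 0.001 partner survives, which is exactly the false-positive control the pitfall calls for.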
Holdouts vs. A/B tests
A/B tests compare two treatments — two creatives, two pricing pages. Holdout tests compare one treatment against no treatment. Every A/B test should technically include a holdout as a third arm so you can distinguish "B is better than A" from "both A and B are worse than showing nothing."
Related concepts
- Incrementality — the thing holdout tests measure
- Attribution — the correlational alternative to causal measurement
- Multi-touch attribution — credit distribution across every touch
- CPA — cost per acquisition, which looks different once you measure incrementality