January 20, 2026
The last project I worked on involved a lot of LLM API calls. One subtask seemed simple: count elements from a specific list. Straightforward, right? Not quite.
This needed production-level accuracy. But the simple API approach wasn't cutting it. After testing 50 cases, I was only hitting around 75% accuracy (37 out of 50). For production, that's a non-starter.
The LLM was doing the task correctly for some instances but missing elements in others. Sometimes it would catch all 10 items, other times only 7 or 8. The pattern was clear: when it failed, it undercounted. It never hallucinated extra elements or went above the true count. It just missed things.
This directional bias turned out to be the key insight.
I decided to apply the "wisdom of crowds" principle. The same concept that makes Random Forest work. Instead of relying on a single API call, use multiple calls and aggregate intelligently.
The evaluation rule was simple: Max(API_call_1, API_call_2, …, API_call_n)
Example: If there are 10 elements and three API calls return [7, 10, 3], the final output is 10.
Why this works: The undercounting errors get filtered out. The max function naturally finds the correct answer as long as at least one call succeeds. Since the LLM never overcounts, the highest value is almost always the right one.
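In code, the whole trick is a handful of lines. Here's a minimal sketch; `count_elements` is a hypothetical wrapper for whatever single prompt-plus-parse API call you already make, since the original setup isn't shown:

```python
def count_elements(text: str, targets: list[str]) -> int:
    """Hypothetical single-call wrapper: prompt the LLM to count how many
    items from `targets` appear in `text`, then parse the integer it returns.
    Swap in your own provider/client here."""
    raise NotImplementedError

def ensemble_count(text: str, targets: list[str], n_calls: int = 4) -> int:
    """Make the same call n_calls times and keep the max. This only works
    because the model undercounts when it fails, so the highest answer is
    the best estimate of the true count."""
    return max(count_elements(text, targets) for _ in range(n_calls))
```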
Here's how the two approaches compare:

With a single API call, the question is: What's the probability of success?
With ensemble, it becomes: What's the probability of at least one success?
The math changes drastically:
P(at least one correct) = 1 - P(all calls wrong)
For n=3 calls with a p=0.75 success rate:
P(at least one correct) = 1 - (1 - 0.75)^3 = 1 - 0.25^3 ≈ 0.984
Going from 75% to 98.4% with just 3 calls? Not bad at all.
But I couldn't just pick any number. Each API call costs money and adds latency. I needed to balance accuracy against cost.
Hereâs how the numbers break down:
| n calls | Accuracy | Cost Multiplier |
|---|---|---|
| 1 | 75.0% | 1x |
| 2 | 93.8% | 2x |
| 3 | 98.4% | 3x |
| 4 | 99.6% | 4x |
| 5 | 99.9% | 5x |
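If you want to sanity-check the table, it's just the formula above applied for n = 1 through 5, assuming the calls fail independently:

```python
# Expected ensemble accuracy: correct whenever at least one of the
# n calls is correct, i.e. 1 - (1 - p) ** n, with p = 0.75 per call.
p = 0.75
for n in range(1, 6):
    accuracy = 1 - (1 - p) ** n
    print(f"{n} calls: {accuracy:.1%} accuracy, {n}x cost")
```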


The diminishing returns kick in hard after n=3. Going from 98.4% to 99.6% costs an entire extra API call for just 1.2 percentage points. But for production-level reliability, I decided that extra margin was worth it.
I settled on n=4: 99.6% accuracy at 4x the cost.
This approach only works because my LLM had a directional bias (undercounting, in this case). The evaluation function has to match your error pattern.
The key is understanding how your model fails, not just that it fails. Directional biases can take many forms - summarization models that are consistently too brief, classifiers that favor certain categories, extractors that miss edge cases. Each needs its own aggregation strategy.
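As a rough illustration (the failure-mode labels and the function itself are mine, not a general recipe), the aggregator simply mirrors the bias:

```python
from collections import Counter

def aggregate(results: list[int], failure_mode: str) -> int:
    """Pick the aggregation that cancels out the model's known bias."""
    if failure_mode == "undercounts":   # model misses items, never invents them
        return max(results)
    if failure_mode == "overcounts":    # model invents items, never misses them
        return min(results)
    # No directional bias: fall back to a majority vote across the calls.
    return Counter(results).most_common(1)[0][0]
```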
If you don't understand your failure mode, you're just burning money on redundant calls.
(Now that I think about it, designing eval functions probably deserves its own blog post.)
Sometimes the best solution isn't a better prompt or a bigger model. It's understanding your failure mode and exploiting it mathematically.
A single API call gave me 75% accuracy. Four calls with a simple Max() aggregator got me to 99.6%. Same model, same prompt. Just a smarter approach.
The real lesson? When you can't improve the model's performance, improve how you use it. Working within those constraints is what makes the problem interesting.