
Research · January 2026

A Small Model Matches the Best: How Multi-Model Consultation Achieves Frontier Performance

We show that Claude Haiku 4.5—a lightweight, fast model—can match the performance of Claude Opus 4.6 on a rigorous coding benchmark, at 62% lower cost. The key? Asking multiple AI models for their perspective when stuck.

Venkata Subrahmanyam Ghanta · ASU & Polydev AI
Pujitha Sri Lakshmi Paladugu · Microsoft

The Big Picture

Everyone assumes you need the biggest, most expensive AI model to get the best results. We tested this assumption on SWE-bench Verified—a benchmark of 500 real GitHub issues that AI systems must solve by writing actual code.

Our finding: A small model (Haiku) that asks other AI models for help performs just as well as a large model (Opus) working alone—and costs 62% less.

How We Compare

SWE-bench Verified Leaderboard (December 2025)

Rank  Model                            % Solved
1     Claude Opus 4.6                  74.4%
2     Gemini 3 Pro Preview             74.2%
3     GPT-5.2 (high reasoning)         71.8%
4     Claude Sonnet 4.5                70.6%
5     GPT-5.2                          69.0%
-     Ours (Haiku 4.5 + Consultation)  74.6%

Our approach uses Claude Haiku 4.5 as the base model, with GPT-5.2 Codex and Gemini 3 Flash Preview as consultants.

How It Works

The principle is simple: when your AI agent faces a difficult decision, let it ask other AI models for their perspective. Different models see different solutions.

1. Your agent works. Any AI agent (coding assistant, chatbot, research tool) encounters a complex problem.

2. Consults Polydev. With one API call, it gets perspectives from GPT, Claude, Gemini, and Grok simultaneously. A single polydev.getPerspectives() call returns something like:

     GPT-5.2: "Try approach A because..."
     Claude Sonnet 4.5: "Consider option B for..."
     Gemini 3 Pro: "The pattern here is..."
     Grok 4.1: "Watch out for edge..."

3. Better decisions. Your agent synthesizes the perspectives and makes a more informed choice. In our benchmark, this improved the success rate from 64.6% to 74.6%. (A code sketch of this loop follows below.)
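To make the loop concrete, here is a minimal sketch in TypeScript. Only the `getPerspectives` call is taken from the setup snippet later in this article; the `Perspective` shape, the stubbed wrapper, and the synthesis step are our own illustrative assumptions, not Polydev's actual interface.

```typescript
// Minimal sketch of the consult-then-synthesize loop.
// ASSUMPTION: the Perspective shape and the stubbed getPerspectives wrapper
// are illustrative; Polydev's real MCP tool interface may differ.

interface Perspective {
  model: string;   // e.g. "GPT-5.2"
  advice: string;  // that model's suggested approach
}

// Stand-in for the polydev.getPerspectives MCP tool call.
async function getPerspectives(question: string): Promise<Perspective[]> {
  return [
    { model: "GPT-5.2", advice: "Try approach A because..." },
    { model: "Gemini 3 Pro", advice: "The pattern here is..." },
  ];
}

async function decide(question: string, looksHard: boolean): Promise<string> {
  if (!looksHard) {
    // Simple fixes: skip consultation (see "When Does Consultation Help Most?").
    return `proceed with own plan: ${question}`;
  }
  const views = await getPerspectives(question);
  // Synthesis here is just flattening the advice back into the agent's
  // context; a real agent would reason over the perspectives before acting.
  return views.map(v => `${v.model}: ${v.advice}`).join("\n");
}
```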

Why Different Models Help Each Other

Different AI models are trained on different data and have different strengths. When we analyzed which problems each approach solved:

  • 283 problems solved by both approaches
  • 40 solved only by Haiku working alone
  • 50 solved only with consultation

Key insight: 24% of our successes (90 of the 373 problems solved) came from one approach solving problems the other couldn't; the headline 74.6% counts a problem as solved if either run solved it. The models have genuinely different blind spots, so combining them covers more ground than using either alone.
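All of the reported rates follow directly from the overlap counts above; a quick check:

```typescript
// Sanity-check the reported rates from the overlap counts.
const both = 283, haikuOnly = 40, consultOnly = 50, total = 500;

const baseline = (both + haikuOnly) / total;             // 0.646 → 64.6% (Haiku alone)
const union = (both + haikuOnly + consultOnly) / total;  // 0.746 → 74.6% (headline)
const exclusive = (haikuOnly + consultOnly) / (both + haikuOnly + consultOnly);
// 90 / 373 ≈ 0.241 → the "24% of successes" figure
```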

When Does Consultation Help Most?

Consultation isn't always beneficial. Here's what we found:

Most helpful for:
  • Ambiguous requirements (85% helpful)
  • Multi-file changes (78% helpful)
  • Complex algorithms (81% helpful)
Less helpful for:
  • Simple one-line fixes (42% helpful)
  • Clear, well-specified bugs (65% helpful)

Takeaway: Use multi-model consultation for hard problems. For simple fixes, a single model is often faster and just as effective.
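One way to act on this finding is a cheap routing heuristic that decides whether a task is worth a consultation round. The signals and thresholds below are illustrative assumptions, not the classifier used in our experiments.

```typescript
// Illustrative routing heuristic: consult other models only when the task
// looks hard. Signals and thresholds are assumptions, not measured values.

interface Task {
  filesTouched: number;       // estimated files the fix will span
  requirementsClear: boolean; // does the issue fully specify expected behavior?
  algorithmic: boolean;       // does the fix involve non-trivial algorithm work?
}

function shouldConsult(task: Task): boolean {
  if (!task.requirementsClear) return true; // ambiguous specs: 85% helpful above
  if (task.filesTouched > 1) return true;   // multi-file changes: 78% helpful
  if (task.algorithmic) return true;        // complex algorithms: 81% helpful
  return false; // simple, well-specified fixes: consultation rarely pays off
}

// Example: a one-line, clearly specified bug fix skips consultation.
console.log(shouldConsult({ filesTouched: 1, requirementsClear: true, algorithmic: false })); // false
```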

The Cost Advantage

Same performance, much lower cost:

Approach                     % Solved  Cost per Problem
Claude Opus 4.6 (frontier)   74.4%     $0.97
Ours (Haiku + Consultation)  74.6%     $0.37

A 62% cost reduction while matching performance. The $0.37 per-problem figure includes running both the baseline and the consultation passes.
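The headline percentage follows from the per-problem figures in the table:

```typescript
// The "62%" headline from the per-problem costs above.
const opus = 0.97;                       // Opus 4.6, dollars per problem
const ours = 0.37;                       // Haiku + consultation
const reduction = (opus - ours) / opus;  // ≈ 0.619, i.e. ~62%
```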

Use This in Your Own Agents

The same multi-model consultation that powers our research is available through Polydev MCP. If you're building AI agents with Claude Code, Cursor, Windsurf, or any MCP-compatible tool, you can add this capability in minutes.

Quick Setup

```
# Install Polydev MCP
npx polydev-ai@latest

# In your agent, when stuck:
polydev.getPerspectives("How should I approach this bug?")
```
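If your client reads a standard MCP configuration file (as Claude Code and Cursor do), registration would look roughly like the sketch below. The server name and the assumption that polydev-ai runs as a stdio MCP server via npx are ours; check Polydev's documentation for the exact entry.

```json
{
  "mcpServers": {
    "polydev": {
      "command": "npx",
      "args": ["polydev-ai@latest"]
    }
  }
}
```

The "polydev" key and the command/args pair are placeholders inferred from the install command above, not a verified configuration.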
Multi-Model Perspectives. Get insights from GPT-5.2, Claude, Gemini, and Grok, all through one API call.

Works with Claude Code. Built on the Model Context Protocol (MCP). Drop-in compatible with your existing setup.

Why this matters: Our research shows that model diversity—not just model size—can unlock frontier performance. Polydev makes it easy to add this pattern to your own AI workflows.

What This Means

You don't always need the biggest model. With the right approach—giving smaller models more time to think and access to other perspectives—you can match frontier performance at a fraction of the cost.

This has practical implications: developers can get top-tier AI coding assistance without paying top-tier prices. The trick is knowing when to ask for help.

Reproducibility

All our code, predictions, and reasoning traces for 500 problems are available:

Benchmark: SWE-bench Verified (500 instances)
Base model: Claude Haiku 4.5
Consultation: GPT-5.2 Codex, Gemini 3 Flash
Total cost: $136.34 (both approaches)

Try multi-model consultation today

Add the same technique that achieves 74.6% on SWE-bench to your own AI agents and workflows.