
Deep Dive · February 2026

How the AI Model Leaderboard Works

A technical look at how Polydev ranks AI models using Elo ratings, pairwise comparisons, prompt classification across 6 dimensions, and data from real developer workflows.

Why Another Leaderboard?

Most AI benchmarks test models on standardized datasets in controlled environments. That tells you how well a model solves pre-selected problems, but not how it performs on your actual coding tasks—the debugging sessions, architecture decisions, and implementation challenges that fill a real developer's day.

The Polydev leaderboard is different. Every ranking comes from real developer workflows: when you use get_perspectives to consult multiple AI models, the responses are automatically compared. Over time, this produces a ranking that reflects how models actually perform on the tasks developers care about.

Three Sources of Ranking Data

The leaderboard combines three types of comparisons, each weighted differently:

1. Organic Comparisons (auto_quality · 0.8x weight)

Every get_perspectives call queries multiple models simultaneously. The responses are automatically compared using quality heuristics—content length, code blocks, structured formatting. From N model responses, we generate C(N,2) pairwise comparisons. If one response scores at least 15% higher than the other, it wins; otherwise the pair is recorded as a tie.
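
As a rough illustration, a heuristic comparison along these lines could look like the sketch below. The scoring function and names are illustrative, not Polydev's exact implementation; only the signals (length, code blocks, structure) and the "15% or it's a tie" rule come from the description above.

```python
from itertools import combinations

def quality_score(response: str) -> float:
    """Toy quality heuristic: rewards content length, code blocks, and structure."""
    score = len(response) / 100.0                     # content length
    score += 5 * response.count("```")                # fenced code blocks
    score += 2 * sum(line.startswith(("- ", "#", "1.")) for line in response.splitlines())
    return score

def compare_pair(resp_a: str, resp_b: str) -> str:
    """A response wins only with a 15%+ higher quality score; otherwise it's a tie."""
    a, b = quality_score(resp_a), quality_score(resp_b)
    if a > 1.15 * b:
        return "a_wins"
    if b > 1.15 * a:
        return "b_wins"
    return "tie"

def organic_comparisons(responses: dict[str, str]) -> list[tuple[str, str, str]]:
    """Expand N model responses into C(N, 2) pairwise outcomes."""
    return [(model_a, model_b, compare_pair(resp_a, resp_b))
            for (model_a, resp_a), (model_b, resp_b) in combinations(responses.items(), 2)]
```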

2. User Rankings (base_model_ranking · 1.5x weight)

When your coding agent calls rank_perspectives after reviewing the responses, it provides an explicit ordering (best to worst). This carries the highest weight because it reflects the informed judgment of the AI model that actually read and evaluated all the responses in context.

3. Synthetic Benchmarks (cross_model · 1.2x weight)

We periodically run curated coding questions through all models, then use cross-model judging: each model evaluates every other model's response (but never its own). This helps bootstrap rankings for new models and provides a controlled baseline. Self-judging bias is reduced because no model rates its own output.
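
The judging loop itself is easy to picture. A minimal sketch, assuming a hypothetical judge(judge_model, question, answer) callable that returns a numeric score:

```python
from itertools import permutations
from typing import Callable

def cross_model_scores(question: str, answers: dict[str, str],
                       judge: Callable[[str, str, str], float]) -> dict[str, list[float]]:
    """Every model scores every other model's answer, but never its own."""
    scores: dict[str, list[float]] = {model: [] for model in answers}
    for judge_model, candidate in permutations(answers, 2):  # ordered pairs, judge != candidate
        scores[candidate].append(judge(judge_model, question, answers[candidate]))
    return scores
```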

Prompt Classification

Before any comparison happens, every prompt is classified across 6 independent dimensions. This powers the leaderboard's filtering system—you can see which model performs best specifically for Python debugging, or React architecture, or Rust performance optimization.

Classification is done by a lightweight model (GPT-4.1-nano) that returns structured JSON. The prompt text itself is never stored—only the classification result.

Dimension | What It Captures | Example Values
--- | --- | ---
Task Type | The nature of the coding task | debugging_runtime, implementation_algorithm, architecture_system_design
Language | Primary programming language | python, typescript, rust, go
Framework | Libraries and frameworks involved | react, nextjs, fastapi, django
Complexity | How difficult the task is | trivial, moderate, challenging, expert_level
Domain | Application area | web_frontend, systems_distributed, ml_llm_agents
Intent | What the developer wants to accomplish | write_feature, fix_bug_logic, explain_architecture

Each dimension has dozens of possible values organized into groups. For example, "Task Type" includes groups like Debugging (10 types), Implementation (10 types), Architecture (8 types), and more. This granularity means you can filter for extremely specific scenarios.
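
For illustration, the stored result of classifying one prompt might look like the record below. The field names are hypothetical; the values are drawn from the example values in the table.

```python
# Only these six labels are persisted for a prompt, never the prompt text itself.
classification = {
    "task_type": "architecture_system_design",
    "language": "typescript",
    "framework": "react",
    "complexity": "moderate",
    "domain": "web_frontend",
    "intent": "explain_architecture",
}
```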

The Elo Rating System

We use a modified Elo system (the same mathematical framework used in chess rankings) to convert pairwise win/loss/tie outcomes into a single numerical rating per model.

How Elo works in brief: After each pairwise comparison, the winner gains rating points and the loser loses them. The number of points exchanged depends on the expected outcome—an upset (low-rated model beating a high-rated one) causes a larger rating swing than a result that was already expected.

Ties split the point exchange. All models start at 1500.
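
As a concrete sketch, a single update with the standard Elo formulas looks like this; ties score 0.5 for both sides, and the adaptive K-factor and judge weights described below are layered on top.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 20) -> tuple[float, float]:
    """score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# An upset: a 1400-rated model beats a 1600-rated one, causing a large swing.
print(update_elo(1400, 1600, score_a=1.0))  # approximately (1415.2, 1584.8)
```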

Adaptive K-Factor

The K-factor controls how much each comparison shifts the ratings. We use an adaptive scheme so new models converge quickly while established models remain stable:

Comparisons | K-Factor | Effect
--- | --- | ---
< 30 | K = 40 | New models settle quickly
30 – 100 | K = 20 | Moderate adjustment
> 100 | K = 10 | Established models stay stable
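
In code, the schedule is a direct translation of the table (sketch):

```python
def k_factor(comparisons_played: int) -> float:
    """Adaptive K: new models converge quickly, established models stay stable."""
    if comparisons_played < 30:
        return 40
    if comparisons_played <= 100:
        return 20
    return 10
```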

Judge Method Weights

Not all comparisons are equal. The K-factor is further multiplied by a weight that depends on how the comparison was generated:

Judge Method | Weight | Rationale
--- | --- | ---
Base model ranking | 1.5x | Informed judgment from the model that used the responses
User ranking | 1.3x | Direct developer feedback
Cross-model judging | 1.2x | Controlled benchmark, no self-judging bias
Auto quality | 0.8x | Heuristic-based, less reliable than human/model judgment
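
Putting the two together, the effective K for a single comparison is the adaptive K scaled by the judge weight. A sketch reusing the k_factor helper above; the method keys are illustrative labels, not confirmed identifiers:

```python
JUDGE_WEIGHTS = {
    "base_model_ranking": 1.5,
    "user_ranking": 1.3,
    "cross_model": 1.2,
    "auto_quality": 0.8,
}

def effective_k(comparisons_played: int, judge_method: str) -> float:
    """K-factor scaled by how trustworthy the judging method is."""
    return k_factor(comparisons_played) * JUDGE_WEIGHTS[judge_method]
```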

From N Models to Pairwise Comparisons

When get_perspectives returns responses from N models, we generate all possible pairs—C(N,2) comparisons. For 5 models, that's 10 pairwise comparisons from a single query.

Example: 4 models respond to a single query
# C(4,2) = 6 pairwise comparisons:
Claude vs GPT → Claude wins
Claude vs Gemini → tie
Claude vs Grok → Claude wins
GPT vs Gemini → Gemini wins
GPT vs Grok → tie
Gemini vs Grok → Gemini wins

Each comparison is stored with the classification dimensions from the prompt. This means filtering the leaderboard by "Python + debugging" recalculates rankings using only comparisons that originated from Python debugging prompts.
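
A sketch of how each pair might be tagged (the record shape is illustrative; compare_pair is the heuristic from the organic-comparison sketch above):

```python
from itertools import combinations

def pairwise_records(responses: dict[str, str],
                     classification: dict[str, str]) -> list[dict]:
    """Expand N responses into C(N, 2) comparison records, each carrying
    the prompt's classification so rankings can be filtered later."""
    records = []
    for (model_a, resp_a), (model_b, resp_b) in combinations(responses.items(), 2):
        records.append({
            "model_a": model_a,
            "model_b": model_b,
            "outcome": compare_pair(resp_a, resp_b),
            **classification,  # task_type, language, framework, complexity, domain, intent
        })
    return records  # 5 responses yield C(5, 2) = 10 records from one query
```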

The Recording Pipeline

Here's what happens behind the scenes every time you call get_perspectives:

1. Models respond

Your local CLIs (Claude Code, Codex, Gemini) and remote API models generate responses in parallel.

2. Classify the prompt

GPT-4.1-nano classifies the prompt into 6 dimensions. The prompt text is discarded; only the classification is stored.

3. Generate pairwise comparisons

C(N,2) pairs are created from the N valid responses. Quality heuristics determine the winner of each pair.

4. Store comparisons

Each comparison is written to the database with outcome, latency, tokens, and all 6 classification dimensions.

5. Update Elo ratings

Each comparison updates the Elo ratings for the two models involved, both overall and per-filter.

6. Update raw stats

Win/loss/tie counts, preference percentages, and performance metrics are incremented.

Fire-and-forget: The recording pipeline runs asynchronously after responses are returned to you. It never blocks or delays the response you see.
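
Conceptually, the fire-and-forget behavior amounts to scheduling the recording work without awaiting it. A simplified asyncio sketch, with hypothetical helpers standing in for Polydev internals and pairwise_records reused from the earlier sketch:

```python
import asyncio

# Hypothetical stand-ins for internal pipeline steps.
async def query_models(prompt: str) -> dict[str, str]: ...       # step 1
async def classify_prompt(prompt: str) -> dict[str, str]: ...    # step 2
async def store_and_update(records: list[dict]) -> None: ...     # steps 4-6

async def record_comparisons(prompt: str, responses: dict[str, str]) -> None:
    """Steps 2-6: classify, build C(N,2) pairs, store them, update Elo and raw stats."""
    classification = await classify_prompt(prompt)     # prompt text is discarded afterwards
    records = pairwise_records(responses, classification)
    await store_and_update(records)

async def handle_get_perspectives(prompt: str) -> dict[str, str]:
    responses = await query_models(prompt)                        # models respond in parallel
    asyncio.create_task(record_comparisons(prompt, responses))    # fire-and-forget
    return responses                                              # never blocked by recording
```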

Dimension-Based Filtering

Because every comparison carries its prompt's classification, you can filter the leaderboard to see rankings for specific scenarios. This makes the leaderboard actionable—instead of one global ranking, you get context-specific answers.

Example filters

  • Language: rust
  • Task: performance_concurrency
  • Complexity: challenging

What you see

Rankings recalculated using only comparisons from challenging Rust concurrency tasks. The best model for React components might not be the best for Rust concurrency.

The leaderboard UI shows the first 3 filters by default (task type, language, framework) with an option to expand all 6. Active filters appear as pills that can be individually removed.
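
Recalculating a filtered ranking is conceptually just Elo replayed over the matching subset of comparison records. A sketch, reusing update_elo from earlier and a fixed K for brevity:

```python
def filtered_elo(records: list[dict], **filters: str) -> dict[str, float]:
    """Replay Elo using only comparisons whose classification matches every filter."""
    ratings: dict[str, float] = {}
    for rec in records:
        if any(rec.get(dim) != value for dim, value in filters.items()):
            continue
        a, b = rec["model_a"], rec["model_b"]
        ratings.setdefault(a, 1500.0)
        ratings.setdefault(b, 1500.0)
        score_a = {"a_wins": 1.0, "b_wins": 0.0, "tie": 0.5}[rec["outcome"]]
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)
    return ratings

# e.g. filtered_elo(records, language="rust",
#                   task_type="performance_concurrency", complexity="challenging")
```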

Head-to-Head Comparisons

Beyond the overall rankings table, the leaderboard includes a head-to-head view where you can select any two models and see their direct matchup record:

  • Overall win/loss/tie record between the two models
  • Breakdown by judge method (which judges favor which model)
  • Breakdown by task category (where each model excels)

This is particularly useful when you're deciding between two models for a specific use case—the category breakdown shows exactly where each model has an edge.
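
Under the hood, a head-to-head view is an aggregation over the same comparison records. A sketch using the illustrative record shape from earlier, broken down by a single dimension:

```python
from collections import Counter

def head_to_head(records: list[dict], model_x: str, model_y: str,
                 by: str = "task_type") -> dict:
    """Win/loss/tie record between two models, plus a per-dimension breakdown."""
    overall: Counter = Counter()
    breakdown: dict[str, Counter] = {}
    for rec in records:
        if {rec["model_a"], rec["model_b"]} != {model_x, model_y}:
            continue
        if rec["outcome"] == "tie":
            result = "tie"
        else:
            winner = rec["model_a"] if rec["outcome"] == "a_wins" else rec["model_b"]
            result = f"{winner}_wins"
        overall[result] += 1
        breakdown.setdefault(rec[by], Counter())[result] += 1
    return {"overall": dict(overall),
            f"by_{by}": {key: dict(counts) for key, counts in breakdown.items()}}
```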

Metrics at a Glance

Each model in the ranking table shows these columns:

Column | Meaning
--- | ---
Elo | Bradley-Terry rating on the Elo scale (center 1500). The primary ranking metric.
Win % | Percentage of pairwise comparisons won across all matchups.
Record | Wins-Losses-Ties from all pairwise comparisons.
Pref % | How often this model is preferred in organic (non-benchmark) comparisons.
Latency | Average time to generate a response.
Tokens | Average output tokens per response. Higher generally means more detailed.
Trend | Elo rating change over the last 7 days. Shows if a model is improving or declining.

Privacy

The leaderboard is designed with privacy as a constraint, not an afterthought:

  • Prompts are never stored. Only the 6-dimension classification is kept.
  • Responses are never stored. Only the comparison outcome (win/loss/tie) and metadata (latency, tokens) are recorded.
  • User IDs are hashed. Rankings are tied to accounts for aggregation but not personally identifiable in the public leaderboard.

See the rankings

Explore the live leaderboard with real data from developer workflows. Filter by language, task type, framework, and more.