Deep Dive · February 2026
How the AI Model Leaderboard Works
A technical look at how Polydev ranks AI models using Elo ratings, pairwise comparisons, prompt classification across 6 dimensions, and data from real developer workflows.
Why Another Leaderboard?
Most AI benchmarks test models on standardized datasets in controlled environments. That tells you how well a model solves pre-selected problems, but not how it performs on your actual coding tasks—the debugging sessions, architecture decisions, and implementation challenges that fill a real developer's day.
The Polydev leaderboard is different. Every ranking comes from real developer workflows: when you use get_perspectives to consult multiple AI models, the responses are automatically compared. Over time, this produces a ranking that reflects how models actually perform on the tasks developers care about.
Three Sources of Ranking Data
The leaderboard combines three types of comparisons, each weighted differently:
1. Automatic quality comparisons. Every get_perspectives call queries multiple models simultaneously, and the responses are automatically compared using quality heuristics: content length, code blocks, structured formatting. From N model responses we generate C(N,2) pairwise comparisons. If one response scores at least 15% higher than the other, it wins; otherwise the pair is recorded as a tie (a sketch of this heuristic follows this list).
2. Base model rankings. When your coding agent calls rank_perspectives after reviewing the responses, it provides an explicit ordering from best to worst. This carries the highest weight because it reflects the informed judgment of the AI model that actually read and evaluated all the responses in context.
3. Curated benchmark runs. We periodically run curated coding questions through all models, then use cross-model judging: each model evaluates every other model's response, but never its own. This helps bootstrap rankings for new models and provides a controlled baseline; because no model rates its own output, self-judging bias is reduced.
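To make the automatic comparison concrete, here is a minimal sketch in Python, assuming a simple additive heuristic and the 15% threshold described above. The function names and scoring weights are invented for illustration; the production heuristic is not public.

```python
def quality_score(response: str) -> float:
    """Toy heuristic: rewards length, fenced code blocks, and structured lines.
    The real weights are internal; these numbers are illustrative only."""
    score = len(response) / 1000.0                              # content length
    score += 2.0 * (response.count("`" * 3) // 2)               # complete code blocks
    score += 0.5 * sum(line.lstrip().startswith(("-", "#", "|"))
                       for line in response.splitlines())       # lists, headings, tables
    return score

def compare_pair(model_a: str, text_a: str, model_b: str, text_b: str,
                 threshold: float = 0.15) -> str | None:
    """Return the winning model name, or None for a tie (scores within 15%)."""
    sa, sb = quality_score(text_a), quality_score(text_b)
    if sa == sb:
        return None
    if sa >= sb * (1 + threshold):
        return model_a
    if sb >= sa * (1 + threshold):
        return model_b
    return None
```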
Prompt Classification
Before any comparison happens, every prompt is classified across 6 independent dimensions. This powers the leaderboard's filtering system—you can see which model performs best specifically for Python debugging, or React architecture, or Rust performance optimization.
Classification is done by a lightweight model (GPT-4.1-nano) that returns structured JSON. The prompt text itself is never stored—only the classification result.
| Dimension | What It Captures | Example Values |
|---|---|---|
| Task Type | The nature of the coding task | debugging_runtime, implementation_algorithm, architecture_system_design |
| Language | Primary programming language | python, typescript, rust, go |
| Framework | Libraries and frameworks involved | react, nextjs, fastapi, django |
| Complexity | How difficult the task is | trivial, moderate, challenging, expert_level |
| Domain | Application area | web_frontend, systems_distributed, ml_llm_agents |
| Intent | What the developer wants to accomplish | write_feature, fix_bug_logic, explain_architecture |
Each dimension has dozens of possible values organized into groups. For example, "Task Type" includes groups like Debugging (10 types), Implementation (10 types), Architecture (8 types), and more. This granularity means you can filter for extremely specific scenarios.
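As an illustration, a classification result for a hypothetical prompt might look like the structure below. The keys mirror the dimension table above, but the exact schema and value strings are assumptions, not the published format.

```python
# Hypothetical classification for a prompt like
# "Why does my FastAPI endpoint return a 422 on valid input?"
# Keys mirror the six dimensions; value strings are illustrative.
classification = {
    "task_type": "debugging_runtime",
    "language": "python",
    "framework": "fastapi",
    "complexity": "moderate",
    "domain": "web_backend",      # assumed value; the full taxonomy is larger
    "intent": "fix_bug_logic",
}
```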
The Elo Rating System
We use a modified Elo system (the same mathematical framework used in chess rankings) to convert pairwise win/loss/tie outcomes into a single numerical rating per model.
How Elo works in brief: After each pairwise comparison, the winner gains rating points and the loser loses them. The number of points exchanged depends on the expected outcome—an upset (low-rated model beating a high-rated one) causes a larger rating swing than a result that was already expected.
Ties split the point exchange. All models start at 1500.
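In code, the standard update described above looks roughly like this. It is a sketch of the textbook Elo formula, not the production implementation, which also applies the adaptive K-factor and judge weights covered next.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 20.0) -> tuple[float, float]:
    """score_a is 1.0 for an A win, 0.0 for a B win, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

With both models at 1500 the expected score is 0.5, so a win at K = 20 moves 10 points; if a 1400-rated model beats a 1600-rated one, the expectation is about 0.24 and the same K moves roughly 15 points.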
Adaptive K-Factor
The K-factor controls how much each comparison shifts the ratings. We use an adaptive scheme so new models converge quickly while established models remain stable:
| Comparisons | K-Factor | Effect |
|---|---|---|
| < 30 | K = 40 | New models settle quickly |
| 30 – 100 | K = 20 | Moderate adjustment |
| > 100 | K = 10 | Established models stay stable |
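A direct translation of the table, assuming the 30 and 100 boundaries fall in the rows shown:

```python
def adaptive_k(comparisons_played: int) -> float:
    """K-factor thresholds from the table above."""
    if comparisons_played < 30:
        return 40.0
    if comparisons_played <= 100:
        return 20.0
    return 10.0
```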
Judge Method Weights
Not all comparisons are equal. The K-factor is further multiplied by a weight that depends on how the comparison was generated:
| Judge Method | Weight | Rationale |
|---|---|---|
| Base model ranking | 1.5x | Informed judgment from the model that used the responses |
| User ranking | 1.3x | Direct developer feedback |
| Cross-model judging | 1.2x | Controlled benchmark, no self-judging bias |
| Auto quality | 0.8x | Heuristic-based, less reliable than human/model judgment |
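Combining the two tables, the effective K-factor for a single comparison could be computed as below, reusing adaptive_k from the sketch above. The dictionary keys are illustrative identifiers, not the actual judge-method names in the database.

```python
# Weights from the table above; keys are illustrative identifiers.
JUDGE_WEIGHTS = {
    "base_model_ranking": 1.5,
    "user_ranking": 1.3,
    "cross_model_judging": 1.2,
    "auto_quality": 0.8,
}

def effective_k(comparisons_played: int, judge_method: str) -> float:
    """Effective K-factor for one comparison: adaptive K scaled by judge weight."""
    return adaptive_k(comparisons_played) * JUDGE_WEIGHTS[judge_method]
```

Under this scheme a brand-new model compared via auto quality moves by at most 40 × 0.8 = 32 points per comparison, while an established model judged the same way moves by at most 10 × 0.8 = 8 points.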
From N Models to Pairwise Comparisons
When get_perspectives returns responses from N models, we generate all possible pairs—C(N,2) comparisons. For 5 models, that's 10 pairwise comparisons from a single query.
Each comparison is stored with the classification dimensions from the prompt. This means filtering the leaderboard by "Python + debugging" recalculates rankings using only comparisons that originated from Python debugging prompts.
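A sketch of the pair generation, reusing the compare_pair helper from the earlier heuristic sketch and a classification dict like the one shown above; the record structure and names are assumptions.

```python
from itertools import combinations

def build_comparisons(responses: dict[str, str], classification: dict) -> list[dict]:
    """One record per model pair, each carrying the prompt's classification."""
    records = []
    for model_a, model_b in combinations(responses, 2):     # C(N, 2) pairs
        records.append({
            "model_a": model_a,
            "model_b": model_b,
            "winner": compare_pair(model_a, responses[model_a],
                                   model_b, responses[model_b]),  # None = tie
            **classification,   # task_type, language, framework, complexity, domain, intent
        })
    return records
```

For five responses this yields ten records, each of which can later be selected by any combination of the six dimensions.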
The Recording Pipeline
Here's what happens behind the scenes every time you call get_perspectives:
1. Your local CLIs (Claude Code, Codex, Gemini) and remote API models generate responses in parallel.
2. GPT-4.1-nano classifies the prompt into 6 dimensions. The prompt text is discarded; only the classification is stored.
3. C(N,2) pairs are created from the N valid responses. Quality heuristics determine the winner of each pair.
4. Each comparison is written to the database with outcome, latency, tokens, and all 6 classification dimensions.
5. Each comparison updates the Elo ratings for the two models involved, both overall and per-filter.
6. Win/loss/tie counts, preference percentages, and performance metrics are incremented.
Fire-and-forget: The recording pipeline runs asynchronously after responses are returned to you. It never blocks or delays the response you see.
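A minimal sketch of the fire-and-forget pattern, assuming asyncio on the server side; query_models_in_parallel, classify_prompt, and store_and_update_ratings are placeholder names for internal steps, not real API calls.

```python
import asyncio

async def get_perspectives(prompt: str) -> dict[str, str]:
    responses = await query_models_in_parallel(prompt)   # step 1 (assumed helper)
    # Fire-and-forget: steps 2-6 run in the background and never delay the caller.
    asyncio.create_task(record_pipeline(prompt, responses))
    return responses

async def record_pipeline(prompt: str, responses: dict[str, str]) -> None:
    classification = await classify_prompt(prompt)        # prompt text not persisted
    comparisons = build_comparisons(responses, classification)
    await store_and_update_ratings(comparisons)           # DB write + Elo/stat updates
```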
Dimension-Based Filtering
Because every comparison carries its prompt's classification, you can filter the leaderboard to see rankings for specific scenarios. This makes the leaderboard actionable—instead of one global ranking, you get context-specific answers.
For example, consider this filter combination:

- Language: rust
- Task: performance_concurrency
- Complexity: challenging

With these filters active, rankings are recalculated using only comparisons from challenging Rust concurrency tasks. The best model for React components might not be the best for Rust concurrency.
The leaderboard UI shows the first 3 filters by default (task type, language, framework) with an option to expand all 6. Active filters appear as pills that can be individually removed.
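Conceptually, a filtered ranking is a replay of the Elo updates over only the matching comparisons. The sketch below reuses elo_update and effective_k from the earlier sketches and simplifies by sharing one K between both sides of each comparison; the production system also keeps per-filter ratings incrementally, as noted in the pipeline above.

```python
def filtered_ratings(comparisons: list[dict], **filters) -> dict[str, float]:
    """Replay Elo over only the comparisons whose dimensions match the filters."""
    ratings: dict[str, float] = {}
    played: dict[str, int] = {}
    for c in comparisons:
        if any(c.get(dim) != value for dim, value in filters.items()):
            continue
        a, b = c["model_a"], c["model_b"]
        ra = ratings.setdefault(a, 1500.0)
        rb = ratings.setdefault(b, 1500.0)
        score_a = 0.5 if c["winner"] is None else (1.0 if c["winner"] == a else 0.0)
        # Simplification: one shared K per comparison, based on the less-played side.
        k = effective_k(min(played.get(a, 0), played.get(b, 0)),
                        c.get("judge_method", "auto_quality"))
        ratings[a], ratings[b] = elo_update(ra, rb, score_a, k)
        played[a] = played.get(a, 0) + 1
        played[b] = played.get(b, 0) + 1
    return ratings

# Example: rankings for challenging Rust concurrency tasks only.
# filtered_ratings(all_comparisons, language="rust",
#                  task_type="performance_concurrency", complexity="challenging")
```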
Head-to-Head Comparisons
Beyond the overall rankings table, the leaderboard includes a head-to-head view where you can select any two models and see their direct matchup record:
- Overall win/loss/tie record between the two models
- Breakdown by judge method (which judges favor which model)
- Breakdown by task category (where each model excels)
This is particularly useful when you're deciding between two models for a specific use case—the category breakdown shows exactly where each model has an edge.
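The aggregation behind that view can be pictured as a single pass over the stored comparison records; this sketch reuses the illustrative record structure from above and is not the actual query.

```python
from collections import Counter

def head_to_head(comparisons: list[dict], model_x: str, model_y: str):
    """Direct matchup record between two models, overall and per task type."""
    overall: Counter = Counter()
    by_task: dict[str, Counter] = {}
    for c in comparisons:
        if {c["model_a"], c["model_b"]} != {model_x, model_y}:
            continue
        outcome = ("tie" if c["winner"] is None
                   else "x_wins" if c["winner"] == model_x else "y_wins")
        overall[outcome] += 1
        by_task.setdefault(c.get("task_type", "unknown"), Counter())[outcome] += 1
    return overall, by_task
```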
Metrics at a Glance
Each model in the ranking table shows these columns:
| Column | Meaning |
|---|---|
| Elo | The model's Elo rating, updated per comparison as described above (all models start at 1500). The primary ranking metric. |
| Win % | Percentage of pairwise comparisons won across all matchups. |
| Record | Wins-Losses-Ties from all pairwise comparisons. |
| Pref % | How often this model is preferred in organic (non-benchmark) comparisons. |
| Latency | Average time to generate a response. |
| Tokens | Average output tokens per response. Higher generally means more detailed. |
| Trend | Elo rating change over the last 7 days. Shows if a model is improving or declining. |
Privacy
The leaderboard is designed with privacy as a constraint, not an afterthought:
- Prompts are never stored. Only the 6-dimension classification is kept.
- Responses are never stored. Only the comparison outcome (win/loss/tie) and metadata (latency, tokens) are recorded.
- User IDs are hashed. Rankings are tied to accounts for aggregation but not personally identifiable in the public leaderboard.
See the rankings
Explore the live leaderboard with real data from developer workflows. Filter by language, task type, framework, and more.