How are WMArena rankings calculated?

Each blind pairwise vote is a head-to-head outcome between two anonymous models. WMArena fits a regularized Bradley-Terry model to all valid votes, which yields a strength estimate per model that is scaled like an Elo rating, with a 95% confidence interval that narrows as more votes accumulate.

Why does WMArena use human preference instead of automated metrics?

Automated video-quality metrics correlate weakly with what people actually find convincing and are easy to over-optimize. Blind pairwise human preference measures perceived quality directly and is hard to game because matchups are anonymous and randomized.

How does WMArena keep voting fair?

Model identities are hidden until after the vote, sides are randomized, and both clips are withheld until both are ready so neither model is favored by load time. Votes flagged as suspicious by abuse heuristics are excluded from the ranking without being deleted from the audit record.

How WMArena Works

WMArena ranks video world models by blind human preference. People watch two anonymous models render the next moment of the same scene and vote for the better one. Those votes feed a regularized Bradley-Terry model that produces an Elo-scaled leaderboard with confidence intervals. No automated quality metric decides the ranking; people do.

1. The blind pairwise vote

Every battle starts from the same inputs: a starting image and an action (for example, "the camera pushes forward" or "the subject turns left"). Two image-to-video models each render a clip of what happens next. You see both clips side by side, with identities hidden and sides randomized, and pick the one that renders the action more convincingly, or call it a tie. The two models are revealed only after you vote.
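The side randomization and the mapping from a side-based choice back to a model-based outcome can be sketched as follows. This is a minimal illustration, not WMArena's implementation: the Battle class, the function names, and the "left"/"right"/"tie" choice labels are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Battle:
    # Both names stay hidden from the voter until the vote is cast.
    left_model: str
    right_model: str


def make_battle(model_a: str, model_b: str, rng: random.Random) -> Battle:
    """Assign the two models to the left and right sides at random."""
    if rng.random() < 0.5:
        return Battle(model_a, model_b)
    return Battle(model_b, model_a)


def resolve_vote(battle: Battle, choice: str) -> tuple[str, str, float]:
    """Turn a side-based choice into (left_model, right_model, left_score).

    left_score is 1.0 if the left clip won, 0.0 if the right clip won,
    and 0.5 for a tie.
    """
    scores = {"left": 1.0, "right": 0.0, "tie": 0.5}
    if choice not in scores:
        raise ValueError(f"unknown choice: {choice!r}")
    return battle.left_model, battle.right_model, scores[choice]


rng = random.Random(0)
battle = make_battle("model-x", "model-y", rng)
result = resolve_vote(battle, "left")
```

Because the voter only ever picks a side, the model names enter the record after the choice is made, which is what keeps the vote blind.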

2. From votes to rankings: regularized Bradley-Terry

Each vote is a head-to-head outcome. The Bradley-Terry model is the standard statistical method for turning many pairwise outcomes into a single strength score per competitor — the same family of method that underpins chess Elo and modern LLM arenas. WMArena fits a regularized variant (regularization keeps brand-new, barely-voted models from getting wild scores) over all valid votes and scales the result like an Elo rating around a 1200 baseline.
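To make the idea concrete, here is a minimal sketch of a regularized Bradley-Terry fit in plain Python. Several details are assumptions for illustration, not WMArena's documented choices: the L2 penalty and its strength, counting a tie as half a win for each side, fitting by gradient ascent, and the 400-point Elo scale factor. Only the 1200 baseline comes from the description above.

```python
import math


def fit_bradley_terry(votes, l2=1.0, steps=5000):
    """Fit L2-regularized Bradley-Terry log-strengths by gradient ascent.

    votes: list of (model_a, model_b, score_a), where score_a is 1.0 if
    model_a won, 0.0 if model_b won, and 0.5 for a tie (an assumption).
    """
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    theta = {m: 0.0 for m in models}
    # Conservative step size: the curvature of the penalized log-likelihood
    # is at most 0.5 * len(votes) + l2, so this step cannot overshoot.
    lr = 1.0 / (0.5 * len(votes) + l2)
    for _ in range(steps):
        # The L2 penalty pulls every log-strength toward 0.
        grad = {m: -l2 * theta[m] for m in models}
        for a, b, score_a in votes:
            # Modeled probability that model a beats model b.
            p_a = 1.0 / (1.0 + math.exp(theta[b] - theta[a]))
            grad[a] += score_a - p_a
            grad[b] -= score_a - p_a
        for m in models:
            theta[m] += lr * grad[m]
    return theta


def to_elo_scale(theta, baseline=1200.0, scale=400.0):
    """Map log-strengths onto an Elo-like scale centered on the baseline."""
    return {m: baseline + scale / math.log(10) * t for m, t in theta.items()}


# A beat B in 8 of 10 votes; newcomer C has a single vote, a win over A.
votes = [("A", "B", 1.0)] * 8 + [("A", "B", 0.0)] * 2 + [("C", "A", 1.0)]
theta = fit_bradley_terry(votes)
ratings = to_elo_scale(theta)
```

Without the penalty, model C's single unbeaten vote would push its estimated strength toward infinity. With the penalty, its rating stays finite and close to the baseline until more votes arrive.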

Every model also carries a 95% confidence interval. A wide interval means "we don't have enough votes to be sure yet"; it narrows as votes accumulate. Rankings should always be read together with their intervals: when two models' intervals overlap heavily, the votes so far cannot reliably separate them, and they should be treated as tied.
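The narrowing effect is easy to see in a small experiment. The sketch below uses a percentile bootstrap on a plain win rate as a stand-in statistic; how WMArena actually computes its intervals is not specified here, so the method, the function name, and the sample data are assumptions for illustration only.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `outcomes`.

    outcomes: list of per-vote scores (1.0 win, 0.5 tie, 0.0 loss).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the votes with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


few = [1.0] * 12 + [0.0] * 8   # 20 votes, 60% win rate
many = few * 25                # 500 votes, same 60% win rate

lo_few, hi_few = bootstrap_ci(few)
lo_many, hi_many = bootstrap_ci(many)
width_few = hi_few - lo_few
width_many = hi_many - lo_many
```

Both samples have the same win rate, but the interval from 500 votes is much narrower than the one from 20 votes. The same resampling idea extends to refitting a full rating model on each resample.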

3. What voters are judging

Because both clips answer the same starting image and action, the comparison isolates what matters for a world model: does the result look right, and does it actually do what the action asked? In practice voters weigh visual fidelity (does it look real and coherent), action alignment (did the requested change happen), and temporal consistency (does the scene hold together across the clip rather than warping or flickering).

4. Vote integrity

Fairness is structural. Identities are masked until the vote is cast, sides are randomized, and both clips are withheld until both are ready, so a faster model can't win on load time. Votes flagged as suspicious by abuse heuristics are excluded from the ranking but kept in the audit record, so the leaderboard reflects genuine preference while the underlying data stays complete and inspectable.
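The "exclude but keep" rule amounts to filtering at read time instead of deleting at write time. A minimal sketch, assuming a hypothetical Vote record with a flagged field set by the abuse heuristics (the real schema is not described here):

```python
from dataclasses import dataclass


@dataclass
class Vote:
    vote_id: int
    winner: str
    loser: str
    flagged: bool = False  # hypothetical field, set by abuse heuristics


def votes_for_ranking(audit_log):
    """Return only unflagged votes; the audit log itself is not modified."""
    return [vote for vote in audit_log if not vote.flagged]


audit_log = [
    Vote(1, "model-x", "model-y"),
    Vote(2, "model-y", "model-x", flagged=True),
    Vote(3, "model-x", "model-y"),
]
valid_votes = votes_for_ranking(audit_log)
```

The ranking is fitted on valid_votes, while audit_log still holds all three records, including the flagged one, for later inspection.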

5. Why human preference

Automated video metrics — FVD and the like — correlate only loosely with what people find convincing, and they are easy to over-optimize. A blind, pairwise, human-preference arena measures perceived quality directly: the thing you ultimately care about. Because matchups are anonymous and randomized, it is also hard to game. This is why human-preference arenas became the reference for ranking generative models, and WMArena brings the method to world models — starting with video.

6. Transparency

The ranking method is open by design. The live standings and per-model vote counts are on the leaderboard, and the methodology summary is available at /api/methodology. Transparent, inspectable methodology is part of what makes a leaderboard worth trusting — and worth citing.

Frequently asked questions

What does a model's score mean?

It's a Bradley-Terry strength estimate scaled like Elo. Higher means people preferred that model more often, head to head. Read it alongside the confidence interval and vote count.
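One way to read a score gap is as an expected head-to-head win rate. The sketch below uses the conventional Elo formula with a 400-point scale; that scale factor is an assumption, since the description above only says the scores are scaled like Elo.

```python
def expected_win_probability(rating_a, rating_b, scale=400.0):
    """Expected share of head-to-head votes won by model a over model b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


p_equal = expected_win_probability(1200, 1200)
p_small_gap = expected_win_probability(1250, 1200)
p_large_gap = expected_win_probability(1600, 1200)
```

Under this assumption, equal scores mean an even split, a 50-point lead corresponds to winning about 57% of votes, and a 400-point lead to about 91%.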

How many votes before a ranking is meaningful?

There's no hard cutoff, but confidence intervals tell the story: a model with few votes has a wide interval and an unsettled rank. WMArena keeps thinly voted models out of views where their rank could be mistaken for a settled result until enough votes have accumulated.

Is this the same method LMArena uses?

The statistical core, blind pairwise votes aggregated with Bradley-Terry, is the same family popularized for LLMs by Chatbot Arena (now LMArena). WMArena applies it to world models and video generation. See WMArena vs LMArena for the differences.
