WMArena ranks world-model video models by blind human preference. People watch two anonymous models render the next moment of the same scene, vote for the better one, and those votes feed a regularized Bradley-Terry model that produces an Elo-scaled leaderboard with confidence intervals. No automated quality metric decides the ranking — people do.
Every battle starts from the same inputs: a starting image and an action ("the camera pushes forward," "the subject turns left"). Two image-to-video models each render the next-world clip. You see both, side by side, with identities hidden and sides randomized, and pick the one that renders the action more convincingly — or call it a tie. Only after you vote are the two models revealed.
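The battle flow above can be sketched in a few lines. This is a minimal illustration, not WMArena's implementation: the `Battle` record, the function names, and the `model-x` / `model-y` identifiers are all hypothetical. It shows sides being assigned at random, a voter-facing view that carries no model identity, and the reveal happening only when a vote is cast.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Battle:
    image: str        # starting image shown to the voter
    action: str       # e.g. "the camera pushes forward"
    left_model: str   # kept server-side until the vote is cast
    right_model: str  # kept server-side until the vote is cast


def new_battle(image, action, model_x, model_y, rng):
    """Assign the two models to the left and right sides at random."""
    pair = [model_x, model_y]
    rng.shuffle(pair)
    return Battle(image, action, pair[0], pair[1])


def public_view(battle):
    """What the voter sees before voting: the inputs and two unlabeled sides."""
    return {"image": battle.image, "action": battle.action, "sides": ["left", "right"]}


def cast_vote(battle, choice):
    """Record the vote, then reveal which model was on which side."""
    if choice not in ("left", "right", "tie"):
        raise ValueError(f"unknown choice: {choice!r}")
    winner = {"left": battle.left_model, "right": battle.right_model, "tie": None}[choice]
    return {"left": battle.left_model, "right": battle.right_model, "winner": winner}


rng = random.Random(0)
battle = new_battle("scene.png", "the camera pushes forward", "model-x", "model-y", rng)
print(public_view(battle))        # no model names appear here
print(cast_vote(battle, "left"))  # identities are revealed only at this point
```

Keeping the model names out of `public_view` entirely, rather than hiding them in the interface, means the voter's client never receives them before the vote.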
Each vote is a head-to-head outcome. The Bradley-Terry model is the standard statistical method for turning many pairwise outcomes into a single strength score per competitor, and it belongs to the same family of methods that underpins chess Elo and modern LLM arenas. WMArena fits a regularized variant over all valid votes (regularization keeps brand-new, barely-voted models from getting wild scores) and scales the result like an Elo rating around a 1200 baseline.
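To make the idea concrete, here is a small sketch of a regularized Bradley-Terry fit followed by Elo-style scaling. Several choices are assumptions made for illustration and are not taken from WMArena's published method: the L2 penalty on log-strengths, counting a tie as half a win for each side, plain gradient ascent as the optimizer, and the conventional Elo factor of 400 points per tenfold change in odds. Only the 1200 baseline comes from the description above.

```python
import math


def fit_bradley_terry(votes, l2=0.5, lr=0.5, steps=5000):
    """Fit log-strengths by gradient ascent on an L2-penalized log-likelihood.

    votes: list of (model_a, model_b, outcome), outcome in {"a", "b", "tie"}.
    A tie is counted as half a win for each side (an assumption).
    """
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    theta = {m: 0.0 for m in models}
    score_for_a = {"a": 1.0, "tie": 0.5, "b": 0.0}
    for _ in range(steps):
        # The L2 penalty pulls every log-strength back toward 0.
        grad = {m: -l2 * theta[m] for m in models}
        for a, b, outcome in votes:
            # Bradley-Terry win probability for model a over model b.
            p_a = 1.0 / (1.0 + math.exp(theta[b] - theta[a]))
            residual = score_for_a[outcome] - p_a
            grad[a] += residual
            grad[b] -= residual
        for m in models:
            theta[m] += lr * grad[m] / len(votes)
    return theta


def to_elo_scale(theta, baseline=1200.0):
    """Map log-strengths to an Elo-like scale centered on the baseline."""
    mean = sum(theta.values()) / len(theta)
    scale = 400.0 / math.log(10)  # 400 points correspond to 10x odds
    return {m: baseline + scale * (t - mean) for m, t in theta.items()}


# A beats B in 8 of 10 battles; newcomer C has a single win over A.
votes = [("A", "B", "a")] * 8 + [("A", "B", "b")] * 2 + [("C", "A", "a")]
ratings = to_elo_scale(fit_bradley_terry(votes))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Without the penalty, model C's single unbeaten vote would push its estimated strength toward infinity. With it, the score stays finite and close to the baseline until more votes arrive.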
Every model also carries a 95% confidence interval. A wide interval means "we don't have enough votes to be sure yet"; it narrows as votes accumulate. Rankings should always be read together with their intervals: two models whose intervals overlap heavily cannot be reliably separated and should be treated as tied.
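One common way to obtain such intervals is a percentile bootstrap over the votes. The text above does not say how WMArena computes its intervals, so treat the following as an illustrative assumption. To keep the example short, a simple win-rate estimator stands in for the Bradley-Terry fit; any function that maps a list of votes to per-model scores could be passed instead. All function names here are hypothetical.

```python
import random


def win_rate(votes):
    """Stand-in estimator: share of each model's battles that it won (tie = 0.5)."""
    wins, games = {}, {}
    for a, b, outcome in votes:
        score_a = {"a": 1.0, "tie": 0.5, "b": 0.0}[outcome]
        for model, score in ((a, score_a), (b, 1.0 - score_a)):
            wins[model] = wins.get(model, 0.0) + score
            games[model] = games.get(model, 0) + 1
    return {m: wins[m] / games[m] for m in games}


def bootstrap_intervals(votes, estimator, n_resamples=200, level=0.95, seed=0):
    """Percentile bootstrap: resample votes with replacement and re-estimate each time."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_resamples):
        resample = rng.choices(votes, k=len(votes))
        for model, value in estimator(resample).items():
            samples.setdefault(model, []).append(value)
    tail = (1.0 - level) / 2.0
    intervals = {}
    for model, values in samples.items():
        values.sort()
        lo = values[int(tail * len(values))]
        hi = values[min(len(values) - 1, int((1.0 - tail) * len(values)))]
        intervals[model] = (lo, hi)
    return intervals


def overlaps(interval_1, interval_2):
    """True if the two (low, high) intervals share any point."""
    return interval_1[0] <= interval_2[1] and interval_2[0] <= interval_1[1]


few = [("A", "B", "a")] * 7 + [("A", "B", "b")] * 3  # 10 votes, A wins 70%
many = few * 100                                      # 1000 votes, same win rate
few_iv = bootstrap_intervals(few, win_rate)
many_iv = bootstrap_intervals(many, win_rate)
print("10 votes:  ", few_iv["A"], "overlaps B:", overlaps(few_iv["A"], few_iv["B"]))
print("1000 votes:", many_iv["A"], "overlaps B:", overlaps(many_iv["A"], many_iv["B"]))
```

Both vote sets have the same 70% win rate for A, but only the larger one produces intervals narrow enough to separate A from B.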
Because both clips answer the same starting image and action, the comparison isolates what matters for a world model: does the result look right, and does it actually do what the action asked? In practice voters weigh visual fidelity (does it look real and coherent), action alignment (did the requested change happen), and temporal consistency (does the scene hold together across the clip rather than warping or flickering).
Fairness is structural. Identities are masked until the vote is cast, sides are randomized, and neither clip is shown until both are ready, so a faster model can't win on load time. Votes flagged as suspicious by abuse-detection heuristics are excluded from the ranking but kept in the audit record, so the leaderboard reflects genuine preference while the underlying data stays complete and inspectable.
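The "excluded but not deleted" rule can be pictured as a filter over an append-only log. The sketch below is an assumption about how that might be modeled, not a description of WMArena's storage: the `Vote` record, the `flag_reason` field, and the example flag text are hypothetical, and the heuristics themselves are not shown because the source does not describe them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vote:
    model_a: str
    model_b: str
    outcome: str                       # "a", "b", or "tie"
    flag_reason: Optional[str] = None  # set by an abuse heuristic; None if clean


def ranking_votes(audit_log):
    """Return only unflagged votes; the audit log itself is never modified."""
    return [vote for vote in audit_log if vote.flag_reason is None]


audit_log = [
    Vote("model-x", "model-y", "a"),
    Vote("model-x", "model-y", "a", flag_reason="hypothetical: burst of identical votes"),
    Vote("model-y", "model-x", "tie"),
]
clean = ranking_votes(audit_log)
print(len(audit_log), "votes recorded,", len(clean), "used for ranking")
```

Because flagged votes stay in the log with their reason attached, a later review can overturn a flag and the ranking can be refit from the same record.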
Automated video metrics — FVD and the like — correlate only loosely with what people find convincing, and they are easy to over-optimize. A blind, pairwise, human-preference arena measures perceived quality directly: the thing you ultimately care about. Because matchups are anonymous and randomized, it is also hard to game. This is why human-preference arenas became the reference for ranking generative models, and WMArena brings the method to world models — starting with video.
The ranking method is open by design. The live standings and per-model vote counts are on the leaderboard, and the methodology summary is available at /api/methodology. Transparent, inspectable methodology is part of what makes a leaderboard worth trusting, and worth citing.
A model's score is a Bradley-Terry strength estimate scaled like Elo. A higher score means people preferred that model more often, head to head. Read it alongside the confidence interval and vote count.
There's no hard cutoff on the number of votes, but confidence intervals tell the story: a model with few votes has a wide interval and an unsettled rank. WMArena withholds thin rankings from places where they could be mistaken for settled results until they have enough votes to mean something.
The statistical core — blind pairwise votes aggregated with Bradley-Terry — is the same family popularized by Chatbot Arena for LLMs. WMArena applies it to world models and video generation. See WMArena vs LMArena for the differences.