A World Model Arena ranks world models by human preference. People watch two anonymous models turn the same starting image and action into the next moment of the world and vote for the better one without knowing which is which. Those votes feed a public leaderboard. WMArena is an open, live World Model Arena. Video generation is its first section, because video-generating world models are the most productized category today.
A world model is a system that learns an internal model of an environment and predicts how it changes. The idea traces back to Kenneth Craik, who in 1943 described minds reasoning through "small-scale models" of reality; in reinforcement learning it is formalized through the partially observable Markov decision process (POMDP) loop of agent → action → state → observation.
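The agent → action → state → observation loop can be sketched with a toy example. Everything here is a hypothetical illustration, not part of WMArena: `ToyWorld` hides an integer position, the agent only sees a noisy reading of it, and `predict_next_observation` is a hand-written stand-in for what a learned world model would do.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical environment: the hidden state is a position on a line."""
    position: int = 0
    noise: float = 0.5

    def step(self, action: int, rng: random.Random) -> float:
        # State transition: the action moves the hidden position.
        self.position += action
        # Partial observability: the agent only gets a noisy reading.
        return self.position + rng.gauss(0.0, self.noise)


def predict_next_observation(observation: float, action: int) -> float:
    """Stand-in for a learned world model: guess the next observation."""
    return observation + action


def run_loop(world, policy, steps, rng):
    """Run the agent -> action -> state -> observation loop for `steps` turns."""
    observation = world.step(0, rng)  # initial reading, no movement
    trajectory = []
    for _ in range(steps):
        action = policy(observation)            # agent picks an action
        observation = world.step(action, rng)   # state updates, new observation
        trajectory.append((action, observation))
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    # Simple policy: walk toward position 3.
    path = run_loop(ToyWorld(), lambda obs: 1 if obs < 3 else -1, 5, rng)
    print(path)
```

A real world model replaces the one-line predictor with a learned function, and for a video model the observation is a frame rather than a number.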
In a 2026 essay, Fei-Fei Li and the World Labs team proposed a functional taxonomy that splits world models into three types (renderers, simulators, and planners), each a different projection of that same loop. The renderer is defined as follows:
"A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity." — Fei-Fei Li / World Labs, A Functional Taxonomy of World Models
This is the key framing for WMArena: video-generation models are world models of the renderer type. They are the most consumer-ready world models that exist right now, which is why an arena for world models naturally starts with video.
A World Model Arena is a crowdsourced, blind, human-preference benchmark for world models. Rather than scoring outputs with an automated metric, it asks people a simple question — which of these two is better? — and turns thousands of those judgments into a ranking. WMArena applies this to world models, with Video Arena (renderer evaluation) as the first live section. Future sections can extend the same method to simulators and planners as those categories productize.
Automated video-quality metrics — FVD and the like — correlate only loosely with what people actually find convincing, and they are easy to over-optimize. A blind, pairwise, human-preference arena measures perceived quality directly: the thing you ultimately care about. Because matchups are anonymous and randomized, it is also hard to game. This is the same reason human-preference arenas became the de facto reference for ranking large language models — WMArena brings the method to world models.
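As a minimal sketch of what "anonymous and randomized" can mean in practice, the code below draws two models at random, assigns them to left and right slots in random order, and shows the voter only neutral labels. The function names, the `"Model A"` / `"Model B"` labels, and the dictionary layout are assumptions for illustration, not WMArena's actual implementation.

```python
import random


def make_blind_matchup(models, rng=None):
    """Pick two distinct models and hide their identities from the voter.

    Returns (public, secret): `public` is safe to show to the voter,
    `secret` stays server-side until the vote is recorded.
    """
    rng = rng or random.Random()
    pool = sorted(set(models))
    if len(pool) < 2:
        raise ValueError("need at least two distinct models")
    # sample() returns the pair in random order, so slot assignment is random too.
    left, right = rng.sample(pool, 2)
    matchup_id = f"{rng.getrandbits(64):016x}"
    public = {"matchup_id": matchup_id, "left": "Model A", "right": "Model B"}
    secret = {"matchup_id": matchup_id, "left": left, "right": right}
    return public, secret


def resolve_vote(secret, choice):
    """Turn a 'left' or 'right' vote into a (winner, loser) pair of model names."""
    if choice not in ("left", "right"):
        raise ValueError("choice must be 'left' or 'right'")
    other = "right" if choice == "left" else "left"
    return secret[choice], secret[other]


if __name__ == "__main__":
    public, secret = make_blind_matchup(["model-x", "model-y", "model-z"])
    print(public)                        # what the voter sees
    print(resolve_vote(secret, "left"))  # revealed only after the vote
```

Keeping the identity mapping out of the public payload is what makes the vote blind. Randomizing the slot order keeps a voter's left/right bias from systematically favoring one model.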
| Leaderboard | What it ranks | Method |
|---|---|---|
| WMArena (World Model Arena) | World models — starting with video-generation renderers | Live, blind, crowdsourced human preference (Bradley-Terry) |
| LMArena / Arena.ai | Large language models, across many modalities | Human preference; video is one section of a generalist board |
| Artificial Analysis | AI video models | Mix of automated metrics and human comparison |
| Academic benchmarks (VBench, WorldModelBench, …) | Video / world-model quality | Automated, static — not a live human arena |
The distinction WMArena owns is the world-model framing: not "which video looks nice," but "given a world and an action, which model renders what happens next more convincingly." No other live, crowdsourced, human-preference arena is built around that question.
Video-generation models are world models, specifically renderers. In the functional taxonomy above, image-to-video and text-to-video models output pixel observations optimized for visual fidelity, which is exactly the renderer role. They are world models that prioritize how the next moment looks over exact physical structure.
The arena covers current and emerging image-to-video models from across the field. The live standings, and the number of votes behind each model, are on the leaderboard.
A model's leaderboard score is a Bradley-Terry strength estimate, scaled like an Elo rating and derived from blind pairwise votes. Higher means people preferred that model more often, head to head; the confidence interval shows how settled the estimate is.
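To make the rating concrete, here is a minimal sketch of fitting Bradley-Terry strengths from `(winner, loser)` votes and mapping them onto an Elo-like scale. It uses the standard iterative (minorization-maximization) update for Bradley-Terry. The base rating of 1000, the 400-point scale, and all function names are assumptions for illustration; the source does not state WMArena's exact fitting procedure or constants.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iters=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Strengths are normalised so their geometric mean is 1.
    Assumes each model has at least one win; a model with none is
    floored at a tiny strength rather than exactly zero.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # comparisons per unordered pair
    for winner, loser in votes:
        if winner == loser:
            continue  # ignore malformed self-comparisons
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
    models = sorted({m for pair in games for m in pair})
    if not models:
        return {}
    strength = {m: 1.0 for m in models}
    floor = 1e-12
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = max(wins[i] / denom, floor) if denom > 0 else strength[i]
        # Fix the overall scale: geometric mean of strengths equals 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        scale = math.exp(log_mean)
        updated = {m: v / scale for m, v in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


def to_elo_scale(strength, base=1000.0, scale=400.0):
    """Map strengths to Elo-like ratings (base and scale are assumed values)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


def win_probability(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the Elo-scaled model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    votes = [("A", "B")] * 7 + [("B", "A")] * 3 + [("B", "C")] * 6 + [("C", "B")] * 4
    ratings = to_elo_scale(fit_bradley_terry(votes))
    for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(name, round(rating, 1))
```

On this scale a 400-point gap corresponds to 10-to-1 odds of being preferred. A confidence interval could be approximated by refitting on bootstrap resamples of the votes; that step is left out here.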