What Is a World Model Arena?

A World Model Arena ranks AI world models by human preference. People watch two anonymous models turn the same starting image and action into the next moment of the world and vote for the better one without knowing which is which; those votes feed a public leaderboard. WMArena is an open, live World Model Arena. Video generation is its first section because video-generating world models are the most productized category today.

What is a world model?

A world model is a system that learns an internal model of an environment and predicts how it changes. The idea traces back to Kenneth Craik, who in 1943 described minds reasoning through "small-scale models" of reality; in reinforcement learning it is formalized through the partially observable Markov decision process (POMDP) loop of agent → action → state → observation.

In a 2026 essay, Fei-Fei Li and the World Labs team proposed a functional taxonomy that splits world models into three types, which are different projections of that same loop:

Renderer — outputs observations as pixels meant for human eyes, where the quality that matters most is visual fidelity. Text-to-video and image-to-video models (Sora, Veo, Kling, Genie-style systems) are renderers.
Simulator — outputs geometrically and physically faithful structure that humans and algorithms can compute on.
Planner — predicts actions given observations and goals, closing the perception-action loop.

"A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity." — Fei-Fei Li / World Labs, A Functional Taxonomy of World Models

This is the key framing for WMArena: video-generation models are world models of the renderer type. They are the most consumer-ready world models that exist right now, which is why an arena for world models naturally starts with video.
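To make the three roles concrete, here is a deliberately tiny sketch in Python. It is an illustration only, not something taken from the essay: the one-dimensional world, the `State` class, and the functions `draw`, `simulator`, `renderer`, and `planner` are all hypothetical names invented for this example. The renderer takes a starting frame and an action and returns the next frame, which mirrors the image-plus-action setup used in the arena.

```python
from dataclasses import dataclass

WIDTH = 8  # number of cells in the toy 1-D world (assumption for this sketch)


@dataclass(frozen=True)
class State:
    position: int  # the hidden state: which cell the object occupies


def draw(state):
    """Turn a state into a row of 'pixels': '#' marks the object."""
    return "".join("#" if i == state.position else "." for i in range(WIDTH))


def simulator(state, action):
    """Simulator role: return the next state as structure code can compute on."""
    return State(max(0, min(WIDTH - 1, state.position + action)))


def renderer(frame, action):
    """Renderer role: starting frame + action -> next frame, meant to be looked at."""
    state = State(frame.index("#"))  # toy shortcut: read the state off the frame
    return draw(simulator(state, action))


def planner(frame, goal):
    """Planner role: choose an action (-1, 0, +1) from an observation and a goal."""
    position = frame.index("#")
    return (goal > position) - (goal < position)


# Close the loop: observe, plan an action, render what happens next.
goal = 5
frames = [draw(State(2))]
while frames[-1].index("#") != goal:
    action = planner(frames[-1], goal)
    frames.append(renderer(frames[-1], action))
print("\n".join(frames))
```

In this toy the renderer can lean on an exact simulator. A real video model has no such shortcut and must produce the next frames directly from pixels, which is why its output is judged on how convincing it looks.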

What is a World Model Arena, specifically?

A World Model Arena is a crowdsourced, blind, human-preference benchmark for world models. Rather than scoring outputs with an automated metric, it asks people a simple question — which of these two is better? — and turns thousands of those judgments into a ranking. WMArena applies this to world models, with Video Arena (renderer evaluation) as the first live section. Future sections can extend the same method to simulators and planners as those categories productize.

How does WMArena work?

1. Pick a starting world. Choose a starting image and describe an action: what should happen next.
2. Two anonymous models respond. Two image-to-video models each generate the next-world clip. You are not told which models they are.
3. Vote blind. Watch both clips and pick the one that renders the action more convincingly, or call it a tie.
4. Identities are revealed. After you vote, the two models are unmasked.
5. The leaderboard updates. Votes are aggregated with a regularized Bradley-Terry model into an Elo-scaled leaderboard, with confidence intervals that tighten as more people vote.
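The aggregation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not WMArena's actual implementation: it assumes the regularizer is an L2 penalty on the log-strengths, counts a tie as half a win for each side, and maps strengths onto an Elo-like scale with a base of 1000 and 400 points per tenfold change in odds. The function names, model names, votes, and constants are all hypothetical.

```python
import math


def fit_bradley_terry(votes, l2=0.5, iters=2000, lr=1.0):
    """Fit log-strengths by gradient ascent on an L2-penalized log-likelihood.

    votes: list of (model_a, model_b, outcome), where outcome is
    1.0 if model_a was preferred, 0.0 if model_b was, 0.5 for a tie.
    """
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    theta = {m: 0.0 for m in models}
    n = len(votes)
    for _ in range(iters):
        # Gradient of the penalty term -(l2 / 2) * theta^2.
        grad = {m: -l2 * theta[m] for m in models}
        for a, b, y in votes:
            # Bradley-Terry probability that a is preferred over b.
            p_a = 1.0 / (1.0 + math.exp(theta[b] - theta[a]))
            grad[a] += y - p_a
            grad[b] -= y - p_a
        for m in models:
            theta[m] += lr * grad[m] / n
    return theta


def to_elo_scale(theta, base=1000.0, scale=400.0):
    """Rescale log-strengths so that `scale` points equal 10:1 odds."""
    return {m: base + scale * t / math.log(10) for m, t in theta.items()}


# Hypothetical votes: (left model, right model, outcome for the left model).
votes = (
    [("model_a", "model_b", 1.0)] * 8 + [("model_a", "model_b", 0.0)] * 2
    + [("model_b", "model_c", 1.0)] * 7 + [("model_b", "model_c", 0.0)] * 3
    + [("model_a", "model_c", 1.0)] * 9 + [("model_a", "model_c", 0.5)] * 1
)

ratings = to_elo_scale(fit_bradley_terry(votes))
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

The penalty pulls every estimate toward the shared baseline, so a model with only a handful of votes cannot jump to an extreme rating. A production system would more likely use a standard logistic-regression solver and estimate confidence intervals, for example by bootstrapping the votes.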

Why human preference instead of an automated benchmark?

Automated video-quality metrics such as Fréchet Video Distance (FVD) correlate only loosely with what people actually find convincing, and they are easy to over-optimize. A blind, pairwise, human-preference arena measures perceived quality directly, which is the thing you ultimately care about. Because matchups are anonymous and randomized, it is also hard to game. This is the same reason human-preference arenas became the de facto reference for ranking large language models; WMArena brings the method to world models.
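One way to picture the anonymous, randomized matchup is the Python sketch below. It is an assumption-laden illustration rather than a description of WMArena's backend: `make_matchup`, `public_view`, and `resolve_vote` are hypothetical helpers, and the uniform sampling of pairs is a simplification.

```python
import random


def make_matchup(models, rng):
    """Pick two distinct models; sample() also randomizes which side each lands on."""
    left, right = rng.sample(models, 2)
    return {"left": left, "right": right}


def public_view(matchup):
    """What the voter sees: neutral labels only, no model names."""
    return {"left": "Model A", "right": "Model B"}


def resolve_vote(matchup, choice):
    """After the vote, unmask the models and emit a (left, right, outcome) record."""
    outcome = {"left": 1.0, "right": 0.0, "tie": 0.5}[choice]
    return (matchup["left"], matchup["right"], outcome)


rng = random.Random(0)  # seeded only so the example is repeatable
matchup = make_matchup(["model_a", "model_b", "model_c"], rng)
print(public_view(matchup))           # shown before voting
print(resolve_vote(matchup, "left"))  # revealed after voting
```

Because the side assignment is random and the labels carry no identity, a voter cannot systematically favor a particular model without recognizing its output on sight.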

How is a World Model Arena different from other AI leaderboards?

Leaderboard	What it ranks	Method
WMArena (World Model Arena)	World models — starting with video-generation renderers	Live, blind, crowdsourced human preference (Bradley-Terry)
LMArena / Arena.ai	Large language models, across many modalities	Human preference; video is one section of a generalist board
Artificial Analysis	AI video models	Mix of automated metrics and human comparison
Academic benchmarks (VBench, WorldModelBench, …)	Video / world-model quality	Automated, static — not a live human arena

The distinction WMArena owns is the world-model framing: not "which video looks nice," but "given a world and an action, which model renders what happens next more convincingly." No other live, crowdsourced, human-preference arena is built around that question.

Frequently asked questions

Is a video generation model a world model?

Yes — a renderer. In the functional taxonomy above, image-to-video and text-to-video models output pixel observations optimized for visual fidelity, which is exactly the renderer role. They are world models that prioritize how the next moment looks over exact physical structure.

What models does WMArena rank?

Current and emerging image-to-video models from across the field. The live standings — and how many votes back each model — are on the leaderboard.

What does the score mean?

It is a Bradley-Terry strength estimate, scaled like an Elo rating, derived from blind pairwise votes. Higher means people preferred that model more often, head to head; the confidence interval shows how settled the estimate is.
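As a worked illustration of reading such a score, the sketch below converts a rating gap into the probability that one model is preferred over another. It assumes the conventional Elo constants (base 10, with 400 points per tenfold change in odds). The source does not state which constants WMArena uses, so `scale` and the function name `expected_preference` are assumptions.

```python
def expected_preference(rating_a, rating_b, scale=400.0):
    """Probability that model A is preferred over model B, given Elo-scaled ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


print(f"{expected_preference(1000, 1000):.2f}")  # equal ratings
print(f"{expected_preference(1100, 1000):.2f}")  # a 100-point lead
```

Under these assumptions, equal ratings mean a coin flip, and a 100-point lead corresponds to being preferred in roughly 64% of head-to-head votes. When two models' confidence intervals overlap, the gap between them may not be settled yet.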
