What is the difference between WMArena and LMArena?

Both rank AI by blind human preference using Bradley-Terry ratings, but for different domains. WMArena is purpose-built for world-model video generation — image-to-video models that render the next moment of a scene. LMArena (now Arena.ai), formerly Chatbot Arena from UC Berkeley's LMSYS, is the dominant arena for large language models, and it offers video as one section of a multi-modality board.

Is WMArena related to LMArena?

They are independent. WMArena applies the same proven blind, pairwise, Bradley-Terry method that Chatbot Arena popularized for LLMs, but focuses it on world models and video generation rather than text.

Which arena should I use to compare AI video models?

Use WMArena for world-model and image-to-video rankings driven entirely by human preference on the next-world clip. Use LMArena for large language model rankings. For a broad automated video benchmark, Artificial Analysis is another reference.

WMArena vs LMArena

Both are blind, human-preference arenas that rank AI with Bradley-Terry ratings — but they rank different things. WMArena is purpose-built for world-model video generation: people vote on which model better renders the next moment of a scene. LMArena (now Arena.ai) is the dominant arena for large language models, and it includes video as one section of a generalist, multi-modality board.

Short version: for video / world-model rankings, use WMArena. For LLM rankings, use LMArena. They are independent projects built on the same human-preference method.

What is LMArena?

LMArena began as Chatbot Arena, built by UC Berkeley's Sky Computing Lab and the LMSYS group, and became the reference leaderboard for large language models. Users compare two anonymous model responses and vote; the votes feed a Bradley-Terry (Elo-scaled) ranking. It was later productized as lmarena.ai and, in early 2026, rebranded to Arena.ai, expanding into many modalities: text, code, web development, image, and a video section. Its authority is anchored by the widely cited "Chatbot Arena" paper. In short, it is a broad, generalist arena whose center of gravity is LLMs.
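
For readers who want the math behind "Bradley-Terry (Elo-scaled)", the model can be written as below. The symbols (strength p_i, rating R_i, offset C) and the 400-point scale are the conventional Elo-style choices and are assumptions here, not constants confirmed for either arena.

```latex
P(i \text{ beats } j) = \frac{p_i}{p_i + p_j},
\qquad
R_i = 400 \log_{10} p_i + C
\;\Longrightarrow\;
P(i \text{ beats } j) = \frac{1}{1 + 10^{(R_j - R_i)/400}}
```

Under this scaling, a 400-point rating gap corresponds to roughly 10-to-1 odds that the higher-rated model wins a single blind vote.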

What is WMArena?

WMArena is a World Model Arena — a human-preference benchmark for world models, starting with the most productized category: video generation (the "renderer" type of world model). You pick a starting image and an action, two anonymous image-to-video models each render the next-world clip, you vote blind, identities are revealed, and a Bradley-Terry leaderboard updates. WMArena is not a generalist board with video bolted on; video and world models are the whole point.
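
The matchup flow described above can be sketched roughly as follows. This is a minimal illustration, not WMArena's actual implementation: the `Matchup` class, the `create_matchup` and `record_vote` functions, and the "A"/"B" labels are all hypothetical names chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Matchup:
    """One blind comparison (hypothetical structure, for illustration only)."""
    image: str   # starting image chosen by the voter
    action: str  # action the models are asked to render
    left: str    # hidden model identity shown to the voter as "A"
    right: str   # hidden model identity shown to the voter as "B"


def create_matchup(models, image, action, rng=random):
    """Pick two distinct models and randomize which side each appears on."""
    left, right = rng.sample(models, 2)
    return Matchup(image=image, action=action, left=left, right=right)


def record_vote(matchup, choice):
    """Turn a blind vote ("A" or "B") into a (winner, loser) pair of model names.

    Identities are only looked up here, after the vote has been cast.
    """
    if choice not in ("A", "B"):
        raise ValueError("choice must be 'A' or 'B'")
    if choice == "A":
        return matchup.left, matchup.right
    return matchup.right, matchup.left
```

Each resolved (winner, loser) pair is the raw input a Bradley-Terry leaderboard is fitted on. Randomizing which side a model appears on keeps position bias from favoring any one model.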

Side by side

	WMArena	LMArena / Arena.ai
Primary focus	World-model video generation (image-to-video)	Large language models, across many modalities
Video coverage	The core product — purpose-built	One section of a generalist board
Task	Starting image + action → next-world clip	Prompt → model response (text, code, image, video, …)
Method	Blind, pairwise, crowdsourced human preference (Bradley-Terry)	Blind, pairwise, crowdsourced human preference (Bradley-Terry)
Best for	Ranking AI video / world models	Ranking LLMs
Origin	Independent, world-model focused	UC Berkeley / LMSYS (Chatbot Arena)

Which one should you use?

You want to know which AI video model is best → WMArena. It is built end-to-end around image-to-video and world models, so the matchups, dimensions, and leaderboard are all about video quality and how convincingly a model renders an action.
You want to know which LLM is best → LMArena / Arena.ai. It is the established reference for language models and has the scale and history there.
You want a broad automated video benchmark → Artificial Analysis is another reference, though it leans on automated metrics alongside human comparison.

Are they the same thing?

No. They are independent projects. What they share is a method: blind, pairwise, human-preference ranking with Bradley-Terry, which Chatbot Arena popularized for LLMs and which WMArena applies to world models. The reason this method spread is simple: automated metrics often correlate only weakly with what people actually prefer, while a blind crowdsourced arena measures perceived quality directly and is harder to game than a fixed automated benchmark.
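
To make the shared method concrete, here is a minimal sketch of fitting Bradley-Terry strengths from a list of (winner, loser) votes and converting them to Elo-style ratings. It uses the standard minorization-maximization (MM) update. The function names, the 1000-point base, the 400-point scale, and the small floor for winless models are assumptions for illustration; neither arena's production pipeline is shown here.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of votes per unordered pair
    models = set()
    for winner, loser in votes:
        if winner == loser:
            raise ValueError("a model cannot be compared with itself")
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            # Floor keeps winless models strictly positive (assumption for this sketch).
            updated[m] = max(wins[m] / denom, 1e-9)
        # Rescale so the strengths average to 1; only ratios matter in the model.
        total = sum(updated.values())
        strength = {m: v * len(updated) / total for m, v in updated.items()}
    return strength


def to_elo(strength, base=1000.0, scale=400.0):
    """Map strengths to Elo-style ratings (base and scale are illustrative)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}
```

With only two models, the fitted win probability equals the observed win rate. With many models, the fit also accounts for the strength of the opponents each model happened to face, which a raw win rate does not.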
