• D.C.
  • BXL
  • Lagos
  • Riyadh
  • Beijing
  • SG

Intelligence for the New World Economy


Dungeons & Dragons puts top AI models to the test

Jan 21, 2026, 3:12pm EST
People playing Dungeons and Dragons.
Ray Stubblebine/Wizards of the Coast/Handout/Reuters

All those years spent playing Dungeons & Dragons in your parents’ basement finally have a real-life use case, at least when it comes to AI.

Scientists at the University of California San Diego used the classic fantasy game to build a grading system for how well large language models function independently, both as teammates and as opponents, over extended periods of time. LLMs currently lack strong benchmarks for evaluating long-term tasks, but D&D, with its intricate rules, multiplayer environment, and demands for intense planning and teamwork that can stretch for days, offers a good testing ground.

The scientists pitted the most popular models against each other and against more than 2,000 D&D players to see how well the LLMs could stay in character, determine the best actions, handle combat, strategize, and follow the game’s rules. Among the findings: Claude Haiku 3.5 was the most tactical player when compared with GPT-4o and DeepSeek-V3.

They also discovered that certain models would develop distinct personalities for their characters in combat. DeepSeek would aggressively use “monster taunts” on opponents, while Claude would build on specific player styles, like “Pack Hunter” and “Brutish Enforcer,” in a game where you can develop your own warlocks, wizards, and other characters. The responses show AI’s growing ability to adapt and interact independently of humans in contained environments. The study also says the simulator can be adapted to deploy agents in other complex real-life situations, such as legal cases and multi-party negotiations.
