AI still fails at completing real-life work, study finds

The News

With all the talk of AI replacing humans at work, a new study suggests the technology couldn’t if it tried. Scale AI and the Center for AI Safety (CAIS) tested AI agents on a range of real-world freelance projects, including tasks in product design, game development, data analysis, and scientific writing.

Manus performed best, but a panel of 40 judges found that only 2.5% of its work would be considered acceptable by a reasonable client. Gemini 2.5 Pro came in last, with only 0.8% of its work meeting expectations. The data suggests that while models are improving on benchmarks, significant work remains before they meet on-the-ground quality demands.

A chart showing how many freelance projects AI agents successfully completed.

Know More

Dan Hendrycks, CAIS director and advisor to Elon Musk’s xAI, expects the combination of human and AI labor to outperform either working alone, but “as with chess it will eventually be more efficient to use AIs alone,” he told Semafor via email.

Meanwhile, AI in the workplace is actually slowing some workers down, according to a survey by coaching platform BetterUp and Stanford University’s Social Media Lab. Forty percent of 1,150 US workers reported receiving “workslop” (low-quality, AI-generated work) in the last month, forcing them or a colleague to spend time redoing the task. For a workforce of 10,000, the organizations estimate the cost at $9 million in lost productivity per year.
