Date: 2024-12-18

The problem with AI benchmarks is that the scores are often inflated. That’s because training data used by AI models can be contaminated with the benchmarks themselves, akin to giving someone the answers to a test before they take it.

Despite the grade inflation, a coding test called SWE-bench has proven challenging for AI models. According to its website, the highest-performing model currently scores only 55% on the evaluation, which consists of real-world software problems posted to the popular coding repository site GitHub.

Konwinski said in an interview that the concern with SWE-bench was that the coding problems could simply be downloaded from the internet. He worked with SWE-bench and the machine learning site Kaggle, which often posts similar contests online, to come up with a special test that couldn’t be gamed, he said.

Essentially, Konwinski and SWE-bench will build the test only after the AI models have been submitted, ensuring the answers can’t be included in the training data.

The contest should provide the most accurate assessment yet of how well AI models can code.

“Better benchmarks could be very much at the heart of better technology,” he said.