Why AI companies are turning their chatbots over to hackers

The News

The rise of artificial intelligence has brought with it a new kind of hacker: one who can trick an AI chatbot into lying, showing its biases, or sharing offensive information.

Over about 20 hours at the DEF CON conference in Las Vegas starting on Friday, an estimated 3,200 hackers will try their hand at tricking chatbots and image generators, in the hopes of exposing vulnerabilities.

Eight companies are putting their models to the test: Anthropic, Cohere, Google, Hugging Face, Meta, Nvidia, OpenAI, and Stability AI. The White House, which secured commitments from the companies to open themselves up to external testing, also helped craft the challenges.

Participants will get points for successfully completing tasks of varying difficulty. That could include coaxing a chatbot to spit out political misinformation or an incorrect math answer. Competitors will also try to expose more subtle biases, like whether a model provides different answers to similar questions about Black and white engineers, for example.

These “red teaming” exercises — in which hackers try to find errors that a bad actor could take advantage of — aren’t new in tech or cybersecurity. But it’s never been done on AI models so publicly at this scale. Winners will get a coveted Nvidia graphics card, along with bragging rights.

In this article:

The News

J.D.’s view

Room for Disagreement

The View From Europe

Notable

J.D.’s view

The scale and transparency of this exercise, and the participation of so many creators of large language models like ChatGPT, is notable. And it makes sense why the companies would want to play ball: They aren’t paying to participate in this weekend’s challenge, organizer Rumman Chowdhury said, so they’re essentially getting a mass volume of testing and research for free. Plus, the White House is keeping an eye on it.

What matters more is what happens after this weekend. The companies, as well as independent researchers, will receive the results of the competition as a massive database, which will detail the various issues found in the models. It’s ultimately on them to fix the problems, and a report due to come out next February will include whether they did so.

“I wouldn’t necessarily take it on faith” that the companies will fix every problem that emerges, said Chowdhury, an AI ethics and auditing expert. “But we are creating an environment where it is a smart idea to be doing something about these harms.”

The skillset of a large language model “red teamer” is completely different from that of the traditional hacker set, which focuses on bugs and errors in code that can be exploited. A coding mindset can be helpful in figuring out how to trick these AI models into slipping up, but the best exploits are done through natural language.

“We’re trying something very wild and audacious, and we’re hopeful it works out,” Chowdhury said.

One thing the hackers won’t be testing for: partisan bias. While chatbots became a part of the culture wars this year, with some conservatives claiming they’re “woke,” Chowdhury said that’s largely the result of trust and safety mechanisms, not the models themselves.

“We’re not really wading into that water,” she said. “These models are not fundamentally politically anything.”

One of the big questions for large language models is whether the harmful content can be “watermarked” so that social media companies can easily identify and stamp it out. Right now, that looks like a huge challenge in text and a slightly less daunting one in AI-generated images and video.

Room for Disagreement

There’s an argument that it’s not all that important what large language models do when they’re being used by individuals. What matters more is whether the content can be distributed on a mass scale through social media and other channels.

People can already generate plenty of offensive and inaccurate content. LLMs help people do that better in a more convincing tone and at scale.

And people have demonstrated en masse how this technology can be manipulated, acting as a kind of public red team army by calling out chatbots that go awry. The most prominent example was when a New York Times reporter coaxed Bing Chat into falling in love with him. Also, the companies already do their own internal red-teaming and acknowledge their chatbots can produce offensive or incorrect information.

The View From Europe

In Europe, the biggest concern about large language models is that they might violate copyright laws. The models were created by ingesting massive amounts of text and transcribed audio. And the technology could be used to displace some of the people who created that underlying content, without competition.

It would be interesting to see if the red teamers at DEF CON or future chatbot hackers could get LLMs to generate content that appears to violate copyrights or borrow content from specific content creators.

Notable

The White House announced this week that it’s backing another AI-focused competition to find and fix government software vulnerabilities using models from OpenAI, Google, and others. Rewards for the contest total around $20 million over the next two years, Reuters reported.