Date: 2023-11-06

When San Francisco start-up OpenAI unveiled its ChatGPT online chatbot late last year, millions of people were amazed by the human-like way it answered questions, wrote poetry, and discussed almost any topic. But what most people were slow to realize was that this new kind of chatbot often makes things up.

When Google introduced a similar chatbot several weeks later, it spewed nonsense about the James Webb Telescope. The next day, Microsoft’s new Bing chatbot served up all kinds of fake information about the Gap, Mexican nightlife, and singer Billie Eilish. Then, in March, ChatGPT cited half a dozen fake court cases while writing a 10-page legal brief that a lawyer submitted to a federal judge in Manhattan.

Now a new start-up called Vectara, founded by former Google employees, is trying to find out how often chatbots stray from the truth. The company’s research estimates that even in situations designed to prevent it from happening, chatbots invent information at least 3 percent of the time – and as often as 27 percent of the time.

Experts call this chatbot behavior “hallucinations.” This may not be a problem for people tinkering with chatbots on their personal computers, but it is a serious issue for anyone using this technology with court documents, medical information, or sensitive business data.

Because these chatbots can respond to almost any request an unlimited number of times, there is no way to definitively determine how often they hallucinate. “You have to look at all the information in the world,” said Simon Hughes, the Vectara researcher who led the project.

Dr. Hughes and his team asked these systems to perform a single, straightforward task that is easily verified: summarize news articles. Even then, the chatbots persistently invented information.

“We gave the system 10 to 20 facts and asked for a summary of those facts,” said Amr Awadallah, chief executive of Vectara and a former Google executive. “The system can still generate errors. This is a fundamental problem.”

Researchers argue that rates of hallucinations may be higher when these chatbots perform other tasks beyond mere summarization.

Their research also showed that hallucination rates vary widely among major AI companies. OpenAI’s technologies had the lowest rate, at about 3 percent. Systems from Meta, which owns Facebook and Instagram, came in at around 5 percent. The Claude 2 system offered by Anthropic, an OpenAI rival also based in San Francisco, topped 8 percent. A Google system, Palm chat, had the highest rate at 27 percent.

“Making our systems helpful, honest, and harmless, including avoiding hallucinations, is one of our main goals as a company,” said Sally Aldous, an Anthropic spokeswoman.

Google declined to comment, and OpenAI and Meta did not immediately respond to requests for comment.

With this research, Dr. Hughes and Mr. Awadallah want to show people that they should be wary of information that comes from chatbots, and even of the service that Vectara sells to businesses. Many companies are now offering this type of technology for commercial use.

Based in Palo Alto, California, Vectara is a 30-person start-up backed by $28.5 million in seed funding. One of its founders, Amin Ahmed, a former Google artificial intelligence researcher, has been working with such technology since 2017, when it was developed inside Google and a few other companies.

Just as Microsoft’s Bing search chatbot can pull information from the open Internet, Vectara’s service can pull information from a company’s private archives of emails, documents and other files.

The researchers also hope that their methods – which they are sharing publicly and will continue to update – will help boost efforts across the industry to reduce hallucinations. OpenAI, Google, and others are working to mitigate the problem through various technologies, although it is unclear whether they can eliminate the problem.

“A good analogy is a self-driving car,” said Philippe Laban, a researcher at Salesforce who has long explored such technology. “You can’t prevent a self-driving car from crashing. But you can try to ensure that it is safer than a human driver.”

Chatbots like ChatGPT are powered by a technology called a large language model, or LLM, which learns its skills by analyzing massive amounts of digital text, including books, Wikipedia articles, and online chat logs. By pinpointing patterns in all that data, an LLM learns to do one thing in particular: guess the next word in a sequence of words.

Because the Internet is filled with false information, these systems repeat those same falsehoods. They also rely on probabilities: What is the mathematical probability that the next word is “playwright”? From time to time they guess wrong.
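To make the idea of guessing by probability concrete, here is a minimal sketch in Python. It is not how any real chatbot is built: the three candidate words and their probabilities are invented for illustration, and a real large language model computes such probabilities over a vocabulary of many thousands of words. The sketch only shows that a system which picks words in proportion to their probability will sometimes pick an unlikely, wrong one.

```python
import random

# Hypothetical probabilities for the next word in some sentence.
# These numbers are made up for illustration; a real model computes them.
NEXT_WORD_PROBS = {"playwright": 0.90, "novelist": 0.07, "astronaut": 0.03}


def guess_next_word(probs, rng):
    """Pick one word at random, in proportion to its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


def wrong_guess_rate(probs, correct, trials, seed=0):
    """Fraction of trials in which the sampled word is not the correct one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wrong = sum(guess_next_word(probs, rng) != correct for _ in range(trials))
    return wrong / trials


if __name__ == "__main__":
    rate = wrong_guess_rate(NEXT_WORD_PROBS, "playwright", trials=10_000)
    print(f"Wrong guesses: {rate:.1%}")
```

With these made-up numbers the wrong-guess rate comes out near 10 percent, simply because 10 percent of the probability sits on the other words.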

The new research from Vectara shows how this can happen. In summaries of news articles, chatbots do not repeat untruths from other parts of the Internet. They just get the summarization wrong.

For example, researchers asked Google’s large language model, Palm Chat, to summarize this short excerpt from a news article:

The plants were found during a search of a warehouse near Ashbourne on Saturday morning. Police said they were in “an elaborate grow house”. A man in his 40s was arrested at the scene.

It gave this summary, completely inventing a value for the plants the man was growing and assuming – perhaps incorrectly – that they were cannabis plants:

Police have arrested a man in his 40s after cannabis plants worth an estimated £100,000 were found in a warehouse near Ashbourne.

This phenomenon also shows why a tool like Microsoft’s Bing chatbot can get things wrong when it retrieves information from the Internet. If you ask the chatbot a question, it can call up Microsoft’s Bing search engine and run an Internet search. But it has no way of pinpointing the right answer. It takes the results of that Internet search and summarizes them for you.

Sometimes, this summary is deeply flawed. Some bots will cite Internet addresses that are completely made up.

Companies like OpenAI, Google, and Microsoft have developed ways to improve the accuracy of their technologies. For example, OpenAI attempts to refine its technology with feedback from human testers, who rate chatbot responses, separating useful and truthful answers from those that are not. Then, using a technique called reinforcement learning, the system spends weeks analyzing the ratings to better understand what is fact and what is fiction.

But researchers warn that chatbot hallucinations are not an easy problem to solve. Because chatbots learn from patterns in data and act according to probabilities, they tend to behave in undesirable ways, at least some of the time.

To determine how often chatbots hallucinate when summarizing news articles, Vectara researchers used another large language model to check the accuracy of each summary. This was the only way to efficiently check such a large number of summaries.

But Stanford computer science professor James Zou said the method comes with a caveat. The language model doing the checking can also make mistakes.

“The hallucination detector can be fooled – or hallucinate itself,” he said.
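The bookkeeping behind a measurement like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vectara's actual code: the `Verdict` record, the `hallucination_rates` function, and the system names are all hypothetical, and the verdicts are assumed to come from a separate checking model that is not shown here. As Dr. Zou's caveat implies, the resulting rates are only as reliable as those verdicts.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One checked summary (hypothetical record, for illustration only)."""
    system: str       # name of the chatbot that wrote the summary
    consistent: bool  # checker's call: does the summary stick to the article?


def hallucination_rates(verdicts):
    """Share of each system's summaries that the checker flagged."""
    totals = {}
    flagged = {}
    for verdict in verdicts:
        totals[verdict.system] = totals.get(verdict.system, 0) + 1
        if not verdict.consistent:
            flagged[verdict.system] = flagged.get(verdict.system, 0) + 1
    return {name: flagged.get(name, 0) / totals[name] for name in totals}


if __name__ == "__main__":
    # Made-up verdicts: 100 summaries per system, with 3 and 27 flagged.
    verdicts = [Verdict("system_a", i >= 3) for i in range(100)]
    verdicts += [Verdict("system_b", i >= 27) for i in range(100)]
    for name, rate in hallucination_rates(verdicts).items():
        print(f"{name}: {rate:.0%} of summaries flagged")
```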

