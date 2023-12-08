Michael Jordan #23, shooting guard of the Chicago Bulls, prepares to take a shot during a Central Division game in the Eastern Conference of the National Basketball Association (NBA) 1988–1989 season at Chicago Stadium, Chicago, United States. (Photo by Jonathan Daniels/Allsport/Getty Images)

I talk to dozens of people per week about AI, and I always ask about their concerns regarding its adoption in their companies or governments. One of the concerns new users mention most often is that AI can sometimes “hallucinate” and give nonsensical answers.

Beyond hallucinations, another concern is AI’s tendency to produce inconsistent outputs when the same question is presented to different large language models (LLMs). This article briefly addresses both concerns and explains their causes. To make the second one concrete, I share a practical example that I recently tested myself.

First, on the topic of “hallucinations”, what are they and why do they occur?

Hallucinations occur when AI produces an unexpected or nonsensical output.

Professionals in highly specialized fields often consider AI unreliable because it can literally make things up. The most famous recent example involves a lawyer who was sanctioned for citing fake, AI-generated case law in his legal arguments.

AI models are created by “training” on huge datasets. These datasets include a wide variety of texts, from literature to online articles, allowing models to learn language patterns and contextual information. Even so, there are instances where a model produces false results, and when it does it can be a big deal.

Hallucinations happen for a few different reasons:

Not enough training data: AI outputs are generated probabilistically, meaning the AI is predicting what it thinks the best output should be based on your prompt and the data available to it. If the AI hasn’t been trained on enough data, it may not be equipped to answer your question.

Overfitting: The model is tuned so closely to answers in a highly specialized area that it does not generalize well to new areas.

Underfitting: The opposite problem. A model that is too general may give poor answers to more specific questions.
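To make the first cause concrete, here is a minimal sketch of how a model might choose among candidate answers. Everything in it is an assumption for illustration: the candidate years, their scores, and the helper names `softmax` and `sample` are hypothetical and do not come from any real model. It only shows the general idea that when the scores are close together, as they might be on a thinly covered topic, a lower-ranked answer gets picked far more often.

```python
import math
import random

def softmax(scores):
    """Turn raw candidate scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {token: math.exp(s - peak) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

def sample(probs, rng):
    """Draw one candidate at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical scores for completing "The case was decided in ...".
# Well-covered topic: one candidate clearly dominates.
well_trained = {"1998": 6.0, "2003": 1.0, "1987": 0.5}
# Thinly covered topic: the model has little to go on, so scores are close.
thinly_trained = {"1998": 1.2, "2003": 1.0, "1987": 0.9}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for name, scores in [("well trained", well_trained),
                     ("thinly trained", thinly_trained)]:
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    draws = [sample(probs, rng) for _ in range(1000)]
    other = sum(d != best for d in draws) / len(draws)
    print(f"{name}: top candidate {best} (p={probs[best]:.2f}), "
          f"another candidate drawn {other:.0%} of the time")
```

In this toy setup, both cases rank the same candidate first, but the thinly trained case gives it well under half of the probability, so a sampled answer is usually one of the alternatives. Real models score vastly more candidates at every step, so treat this as intuition only.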

There are certainly other contributing factors, including whether the input is too complex or too long for the AI to handle. Criticisms of AI hallucinations are valid, but AI models are getting better very rapidly.

But what about inconsistent output?

In addition to hallucinating, LLMs can give quite different responses from one another when presented with the same question. To illustrate this point, I asked Anthropic’s Claude, Google’s Bard, and OpenAI’s ChatGPT (GPT-4) the following:

“I want you to tell me who you think were the 5 greatest NBA players of all time, based on the statistical method you invented. First, tell me the players you think are the best, then explain the methodology step by step.”

Anthropic’s answer:

1. Michael Jordan
2. LeBron James
3. Kareem Abdul-Jabbar
4. Bill Russell
5. Wilt Chamberlain

Bard’s answer:

1. Michael Jordan
2. Kareem Abdul-Jabbar
3. Bill Russell
4. LeBron James
5. Wilt Chamberlain

OpenAI’s answer:

1. Michael Jordan
2. LeBron James
3. Kareem Abdul-Jabbar
4. Bill Russell
5. Magic Johnson
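One rough way to put numbers on how far the three lists above diverge is to count, for each pair of models, how many players they share and how many sit in the same position. The sketch below does only that. The helper names `shared_players` and `same_position` are mine, chosen for illustration, and this is one simple measure among many.

```python
from itertools import combinations

# The three top-five lists reported above, in ranked order.
rankings = {
    "Anthropic": ["Michael Jordan", "LeBron James", "Kareem Abdul-Jabbar",
                  "Bill Russell", "Wilt Chamberlain"],
    "Bard": ["Michael Jordan", "Kareem Abdul-Jabbar", "Bill Russell",
             "LeBron James", "Wilt Chamberlain"],
    "OpenAI": ["Michael Jordan", "LeBron James", "Kareem Abdul-Jabbar",
               "Bill Russell", "Magic Johnson"],
}

def shared_players(a, b):
    """Players that appear in both lists, regardless of position."""
    return set(a) & set(b)

def same_position(a, b):
    """Number of slots where both lists name the same player."""
    return sum(x == y for x, y in zip(a, b))

# Map each pair of models to (players in common, identical slots).
comparison = {}
for (name_a, list_a), (name_b, list_b) in combinations(rankings.items(), 2):
    comparison[(name_a, name_b)] = (
        len(shared_players(list_a, list_b)),
        same_position(list_a, list_b),
    )

for (name_a, name_b), (shared, aligned) in comparison.items():
    print(f"{name_a} vs {name_b}: {shared} of 5 players in common, "
          f"{aligned} of 5 in the same slot")
```

By this measure the Anthropic and OpenAI lists differ only in the fifth slot, while Bard’s list contains the same five names as Anthropic’s but agrees with it on position only at the first and last slots.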

Beyond the differences in the lists themselves, the methods the models used to arrive at their answers were also very different. So, leaving aside the general consensus that Michael Jordan is number one, why is there such a difference in model output?

This variation can be attributed to three things. First, model architecture: the specific algorithms each LLM uses to process information can vary significantly between LLMs. Second, training data: if these LLMs are trained on different data sources, their responses may differ significantly. Finally, training methodology: the techniques used to train or tune the models can also contribute to variability in the output.
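The training-data point is the easiest of the three to show in miniature. The sketch below is a deliberately tiny stand-in for an LLM, assuming nothing more than word-pair counting: two “models” built by the same code but fed different made-up sentences give different answers to the same question. The corpora and the function names `train_bigram_model` and `predict_next` are hypothetical and exist only for this illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = model.get(word.lower())
    return candidates.most_common(1)[0][0] if candidates else None

# Two made-up training sets that disagree about who was "the greatest".
corpus_a = [
    "the greatest scorer was jordan",
    "the greatest scorer was jordan again",
    "the greatest center was russell",
]
corpus_b = [
    "the greatest scorer was chamberlain",
    "the greatest center was chamberlain",
    "the greatest passer was johnson",
]

model_a = train_bigram_model(corpus_a)
model_b = train_bigram_model(corpus_b)

# Same code, same question, different training data.
print("model A:", predict_next(model_a, "was"))
print("model B:", predict_next(model_b, "was"))
# A word one model never saw yields no answer at all.
print("model A on 'passer':", predict_next(model_a, "passer"))
```

Real LLMs differ in architecture and tuning as well, so this isolates only one of the three causes, but it shows why identical prompts need not produce identical answers.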

As the LLM war heats up between OpenAI, Google, Anthropic and others, output variability and the frequency of hallucinations are going to become big issues. It will be interesting to see how the major players address them.