Generative Artificial Intelligence (AI) is notorious for factual errors. So, what do you do when you’ve asked ChatGPT to generate 150 inferred facts and you don’t want to spend an entire weekend verifying each one by hand?

Well, in my case, I turned to other AIs. In this article, I’ll explain the project, consider how each AI performed in a fact-checking demonstration, and offer some final thoughts and cautions if you too want to venture into this maze of twisty little passages that are all alike.

Project

Last week, we published a really fun project where we ran DALL-E 3 inside ChatGPT to create 50 picturesque images that it believed represented each US state. I also had ChatGPT list “The Three Most Interesting Facts You Should Know About the State”. The results, as my editor wrote in the article’s headline, were “brilliantly strange.”

ChatGPT placed the Golden Gate Bridge somewhere in Canada. The tool placed Lady Liberty both somewhere in the midwestern US and on the island of Manhattan. And it built two Empire State Buildings. In short, ChatGPT got its Abstract Expressionism funk on, but the results were pretty great.

As far as the individual facts were concerned, they were mostly on target. I’m pretty good at US geography and history, and few of the facts generated by ChatGPT struck me as wildly inaccurate. But I did not do any independent fact checking. I just read the results and found them to be quite good.

But what if we really want to know the accuracy of those 150 fact bullets? This kind of question seems like an ideal project for AI.

Methodology

So here’s the thing. If GPT-4, the OpenAI Large Language Model (LLM) used by ChatGPT Plus, generated the fact statements, I was not entirely convinced that it should be the one to check them. It’s like asking high school students to write a history paper without using any references and then letting them correct their own work. They’re already starting out with questionable information, and then you’re letting them correct themselves? No, that doesn’t seem right to me.

But what if we feed those facts to other LLMs inside other AIs? Google’s Bard and Anthropic’s Claude both have their own LLMs. Bing uses GPT-4, but I thought I’d test its responses just to be a completionist.

As you’ll see, I got the best response from Bard, so I fed its responses back into ChatGPT in a round-robin perversion of the natural order of the universe. It was a good project.

Anthropic Claude

Claude uses the Claude 2 LLM, which is also used inside Notion’s AI implementation. Claude allowed me to feed it a PDF containing the full set of facts (without pictures). This is what I got back:

Screenshot by David Gewirtz/ZDNET

Overall, Claude found the fact list to be mostly accurate, but it had some clarifications for three items. I had limited how long the ChatGPT facts could be, and that limit prevented nuance in the fact descriptions. Claude’s fact check took issue with some of that lack of nuance.

Overall, it was an encouraging response.

Copilot… or nopilot?

Then we reach Microsoft’s Copilot, the renamed Bing Chat AI. Copilot doesn’t allow uploading PDFs, so I tried pasting in the text of the facts for all 50 states. This approach failed immediately, as Copilot only accepts prompts up to 2,000 characters:

Screenshot by David Gewirtz/ZDNET
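For readers who run into a similar 2,000-character ceiling, here is a minimal Python sketch of one way to work within it: packing per-state blocks of text into several prompts that each stay under the limit. This is not what was done in the test described here. The function name `pack_prompts`, the placeholder state blocks, and the shortened instruction are all assumptions made for illustration.

```python
def pack_prompts(blocks, instruction, limit=2000):
    """Group text blocks into prompts of at most `limit` characters each.

    Every prompt starts with `instruction`, and a block is never split
    across two prompts. (Hypothetical helper, not part of any AI tool.)
    """
    prompts = []
    current = instruction
    for block in blocks:
        candidate = current + "\n\n" + block
        if len(candidate) <= limit:
            # The block still fits in the prompt being built.
            current = candidate
        else:
            # A block that cannot fit even on its own is an error.
            if len(instruction) + 2 + len(block) > limit:
                raise ValueError("a single block exceeds the limit")
            prompts.append(current)
            current = instruction + "\n\n" + block
    if current != instruction:
        prompts.append(current)
    return prompts


# Placeholder blocks: in practice each one would hold a state name and its three facts.
state_blocks = [f"State {i}: " + "fact text " * 30 for i in range(1, 11)]
instruction = "Please check the facts and identify any errors for each state."

prompts = pack_prompts(state_blocks, instruction)
for n, prompt in enumerate(prompts, start=1):
    print(f"prompt {n}: {len(prompt)} characters")
```

Keeping each state’s facts together in one prompt means the checker never sees half of a state’s entry, at the cost of sending more prompts.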

I asked Copilot the following:

The following text consists of the names of the states followed by three facts for each state. Please check the facts and identify any errors for that state

This is what I got back:

Screenshot by David Gewirtz/ZDNET

It largely repeated back the fact data I had asked it to verify. So, I tried to guide it with a more forceful prompt:

Screenshot by David Gewirtz/ZDNET

Once again, it gave me back the data I asked it to verify. I found this output very strange because Copilot uses the same LLM as ChatGPT. Obviously, Microsoft has tuned it differently than ChatGPT.

I gave up and moved on to Bard.

Bard

Google has just announced its new Gemini LLM. I don’t have access to Gemini yet, so I ran these tests on Google’s PaLM 2 model.

In comparison to Claude and Copilot, Bard knocked it out of the park, or, to be more Shakespearean, it “doth bestride the narrow world like a Colossus.”

Check out the results below:

Screenshot by David Gewirtz/ZDNET

It is important to note that many state facts aren’t agreed upon even by the states themselves, or come with nuances. As I’ll show you in the next section, I sent this list back to ChatGPT and found two discrepancies in the Alaska and Ohio answers.

But there are other misses here too. In some ways, Bard overcompensated for the assignment. For example, Bard is right that states other than Maine produce lobsters. But Maine goes all out in its lobster production. I’ve never been to any other state that has miniature lobster traps as one of the most popular tourist trap trinkets.

Or let’s pick Nevada and Area 51. “Top-secret military base, rumored UFO sightings,” ChatGPT said. Bard tried to correct this by saying, “UFO sightings at Area 51 are not just rumors. It is an actual top-secret military facility, and its purpose is unknown.” They are saying almost the same thing. Bard simply missed the nuance that comes from strict word limits.

Another place where Bard picked on ChatGPT without understanding the context was Minnesota. Yes, there are a lot of lakes in Wisconsin too. But ChatGPT did not claim that Minnesota had the most lakes. It just described Minnesota as “the land of 10,000 lakes”, one of Minnesota’s most common slogans.

Bard also got hung up on Kansas. ChatGPT said that Kansas is “home to the geographic center of contiguous America.” Bard claimed it was South Dakota. And that would be true if you took Alaska and Hawaii into account. But ChatGPT said “contiguous”, and that honor goes to a point near Lebanon, Kansas.

I could go on, and I will in the next section, but you get the point. Bard’s fact-checking seems impressive, but it often misses the point and gets things wrong just like any other AI.

Before we proceed with ChatGPT’s limited fact check of Bard, I want to point out that most of Bard’s entries were either wrong or wrong-headed. And yet, Google puts its AI answers at the front of most search results. Does that bother you? It definitely worries me.

Such a wonder, my lords and ladies, is not to be spoken of.

ChatGPT

Right off the bat, I could tell that Bard got one of its facts wrong: Alaska is much bigger than Texas. So, I thought, let’s see if ChatGPT can fact check Bard. For a moment, I thought this bit of AI tail-chasing might knock the Moon out of Earth’s orbit, but then I decided to risk the entire structure of our universe because I knew you’d want to know what happened.

Here’s what I fed ChatGPT:

Screenshot by David Gewirtz/ZDNET

And here’s what ChatGPT said (and, for clarity, the Moon remained in orbit):

Screenshot by David Gewirtz/ZDNET

As you can see, ChatGPT took issue with Bard’s incorrect claim that Texas is the largest state. It also raised a bit of a fuss over Ohio vs. Kansas as the birthplace of aviation, which is more controversial than most schools teach.

It is generally accepted that Wilbur and Orville Wright flew the first airplane (actually in Kitty Hawk, North Carolina), although they built their Wright Flyer in Dayton, Ohio. However, Sir George Cayley (1804), Henri Giffard (1852), Félix du Temple (1874), Clément Ader (1890), Otto Lilienthal (1891), Samuel Langley (1896), Gustave Whitehead (1901), and Richard Pearse (1902) – from New Zealand, the UK, France, Germany, and other parts of the US – all have some degree of legitimate claim to having made the first flight.

But we’ll give the point to ChatGPT, because it only had 10 words to make the claim, and Ohio was where the Wright Brothers had their bike shop.

Conclusion and warnings

Let’s get something out of the way up front: If you’re presenting a paper or document where you need your facts to be correct, do your own fact-checking. Otherwise, your Texas-sized ambitions may get buried under an Alaska-sized problem.

As we saw in our tests, the results (as with Bard) can look quite impressive, but be completely or partially wrong. Overall, it was interesting to ask different AIs to crosscheck each other, and it’s a process I’ll probably explore further, but the results were only conclusive in how inconclusive they were.

Copilot gave up completely and simply asked to go back to its nap. Claude took issue with the nuance of a few answers. Bard came down hard on a whole variety of answers, but, apparently, to err is not only human, it’s AI too.
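If you do run cross-checks like these, one cautious way to use them is as triage for your own fact-checking rather than as a verdict. The Python sketch below assumes you have already recorded each checker’s answer per fact by hand; the function name `triage`, the verdict labels, and the sample checker and fact names are all placeholders, not results from the tests above.

```python
def triage(verdicts):
    """Return the facts that need a human look.

    `verdicts` maps a checker name to {fact_id: "ok" | "disputed" | None},
    where None means the checker gave no usable answer.
    The result maps each flagged fact_id to the checkers that disputed it;
    an empty list means no checker actually confirmed or disputed the fact.
    """
    all_ids = set()
    for results in verdicts.values():
        all_ids.update(results)

    flagged = {}
    for fact_id in sorted(all_ids):
        answers = {name: results.get(fact_id) for name, results in verdicts.items()}
        disputers = [name for name, answer in answers.items() if answer == "disputed"]
        if disputers:
            flagged[fact_id] = disputers
        elif not any(answer == "ok" for answer in answers.values()):
            flagged[fact_id] = []
    return flagged


# Placeholder data for illustration only.
sample = {
    "checker_a": {"fact-1": "ok", "fact-2": "disputed", "fact-3": None},
    "checker_b": {"fact-1": "ok", "fact-2": "ok", "fact-3": None},
}
for fact_id, disputers in triage(sample).items():
    print(fact_id, "flagged by", disputers or "nobody (unchecked)")
```

A flag here does not mean the fact is wrong, since a checker’s objection can itself be mistaken. It only tells you where to spend your own verification time first.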

Finally, let me quote the real Bard and say, “Confusion now hath made his masterpiece!”

What do you think? What kinds of serious errors have you seen from your favorite AI? Are you satisfied with relying on AI for facts, or would you rather do your own fact-checking process now? Let us know in the comments below.

