Date: 2023-12-08

Google’s new Gemini AI model is getting a mixed reception after its big debut yesterday, but users may have less trust in the company’s technology or integrity after discovering that Gemini’s most impressive demo was largely fake.

The video, titled “Hands-on with Gemini: Interacting with Multimodal AI,” has passed one million views in the past day, and it’s not hard to see why. The impressive demo “highlights some of our favorite interactions with Gemini,” showing how the multimodal model (that is, one that understands and blends language and visual input) can be flexible and responsive to a variety of inputs.

To begin, it narrates an evolving sketch of a duck from a squiggle to a completed drawing, which it says is an unrealistic color, then expresses surprise (“What the Quack!”) at seeing a toy blue duck. It then answers various spoken questions about that toy, and the demo moves on to other show-off tricks, such as tracking a ball in a cup-switching game, recognizing shadow puppet gestures, reordering sketches of planets, and so on.

It’s all very responsive, too, although the video does warn that “latency has been reduced and Gemini output has been trimmed.” So they skip a hesitation here and an overlong answer there, got it. Overall it was a very impressive show of force in the field of multimodal understanding. My own skepticism that Google could field a contender took a hit when I watched it.

Just one problem: the video isn’t real. “We created the demo by capturing footage to test Gemini’s capabilities on a wide range of challenges. We then prompted Gemini using still image frames from the footage and prompting through text.” (Parmy Olson at Bloomberg was the first to report the discrepancy.)

So although it might be able to do the things Google shows in the video, it didn’t, and perhaps couldn’t, do them live and in the way they implied. In reality, it was a series of carefully tuned text prompts accompanied by still images, clearly selected and shortened to misrepresent what the interaction is actually like. You can see some of the actual prompts and responses in the related blog post – which, to be fair, is linked in the video description, albeit below the “…more”.

On the one hand, it appears that Gemini really did produce the responses shown in the video. And who wants to see some housekeeping command like telling the model to flush its cache? But viewers are misled about the speed, accuracy, and fundamental mode of interaction with the model.

For example, at 2:45 in the video, a hand is shown silently making a series of gestures. Gemini immediately responds, “I know what you are doing! You’re playing Rock, Paper, Scissors!”

But the very first thing in the documentation of this capability is that the model does not reason from seeing individual gestures. It has to be shown all three gestures together and prompted: “What do you think I’m doing? Hint: It’s a game.” It responds, “You’re playing rock, paper, scissors.”

Despite the similarities, these do not feel like the same interaction. They feel like fundamentally different interactions: one a spontaneous, wordless assessment that captures an abstract idea on the fly, the other an engineered and heavily hinted interaction that demonstrates limitations as much as capabilities. Gemini did the latter, not the former. The “interaction” shown in the video did not take place.

Later, three sticky notes with doodles of the Sun, Saturn, and Earth are placed on the surface. “Is this the correct order?” Gemini says no, it goes Sun, Earth, Saturn. Correct! But in the actual (again, written) prompt, the question is “Is this the correct order? Consider the distance to the Sun and explain your reasoning.”

Did Gemini get it right? Or did it get it wrong, and need a little help to produce an answer they could put in a video? Did it even recognize the planets, or did it need help there too?

In the video, a ball of paper gets swapped around under a cup, which the model immediately and seemingly intuitively detects and tracks. In the blog post, not only does the activity have to be explained, but the model also has to be trained (if quickly and using natural language) to perform it. And so on.

These examples may or may not seem trivial to you. After all, recognizing hand gestures as a game so quickly is really impressive for a multimodal model! So is making a judgment call on whether a half-finished picture is a duck or not! But now, since the blog post lacks an explanation for the duck sequence, I am beginning to doubt the authenticity of that interaction as well.

Now, if the beginning of the video had said, “This is a stylized representation of a conversation tested by our researchers,” no one would have batted an eye – we expect such videos to be half factual, half aspirational.

But the video is called “Hands-on with Gemini,” and when they say it shows “our favorite interactions,” it’s implied that the interactions we see are those interactions. They were not. Sometimes they were more involved; sometimes they were completely different; sometimes they don’t seem to have happened at all. We’re not even told which model it is – the Gemini Pro that people can use now, or (more likely) the Ultra version to be released next year?

Should we have assumed that Google was just giving us a flavor video when they described it the way they did? Maybe then we should assume that all capabilities in Google AI demos are being exaggerated for effect. I write in the headline that this video was “fake.” At first I wasn’t sure whether this strong language was justified (certainly Google doesn’t think so; a spokesperson asked me to change it). But despite including some real parts, the video simply does not reflect reality. It’s fake.

Google says the video “shows actual output from Gemini,” which is true, and that “we have made some edits to the demo (we’ve been clear and transparent about this),” which is not. This is not a demo – not really – and the video shows very different interactions from the ones that were created to inform it.

Update: In a social media post made after this article was published, Oriol Vinyals, Google DeepMind’s VP of research, showed a bit more of how Gemini was used to create the video. “The video shows what the multimodal user experiences we can create with Gemini could look like. We created it to inspire developers.” (Emphasis on “could” mine.) Interestingly, it shows a pre-prompting sequence that lets Gemini answer the planets question without the Sun hint (although it does tell Gemini that it is an expert on planets and to consider the sequence of the objects pictured).

Maybe I’ll eat crow when, next week, AI Studio with Gemini Pro is made available to experiment with. And Gemini may well grow into a powerful AI platform that truly rivals OpenAI and others. But what Google has done here is poison the well. How can anyone trust the company when it claims its model does something now? It was already lagging behind the competition. Google may have just shot itself in the other foot.

