I tested GPT-5.4, Claude, Gemini, and Grok on the viral Netanyahu coffee shop video. One called it AI-generated, one changed its answer 3 times, one hallucinated the year 2028. Only one got it right consistently.

Evaluating AI’s Reliability in Media Verification: A Critical Examination of Leading Language Models

In an era where misinformation proliferates at an unprecedented pace, the reliance on artificial intelligence (AI) tools for media authentication has become commonplace. From verifying viral videos to cross-checking headlines, countless users turn to AI-driven chatbots and models under the assumption of infallibility. But how dependable are these tools when scrutinized under rigorous, real-world conditions? To address this pressing question, I conducted a comprehensive evaluation of four advanced AI models—ChatGPT 5.4, Claude Opus 4.6, Google Gemini 3.1 Pro, and Grok 4.2 Expert—using the viral Netanyahu coffee shop video as a stress test.

The Context: The Netanyahu Coffee Shop Video and Its Viral Debate

The video in question has captivated over 10 million viewers, sparking viral clips, in-depth analysis threads, and conspiracy theories. Key claims include:

The coffee foam defies physics, appearing unnaturally solid.
Netanyahu’s hand purportedly shows six fingers.
A point-of-sale (POS) screen in the background displays a date from 2024.

Skepticism exists around these details, prompting the question: can AI tools reliably verify such claims without bias or hallucination?

Methodology: Neutral, Non-Political Testing

To simulate real-world fact-checking, I subjected all four models to identical, unbiased prompts without political framing. The models’ responses were evaluated across a series of tasks designed to probe their analytical consistency and understanding.
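The identical-prompt setup can be sketched in a few lines. This is a minimal illustration, not the harness actually used for the tests: the `ModelFn` type, `run_identical_prompts`, and the two placeholder models are hypothetical names, and a real run would replace the placeholders with calls to each vendor's chat interface.

```python
from typing import Callable, Dict, List

# Assumption: a "model" is any function that maps a prompt string to a reply string.
ModelFn = Callable[[str], str]


def run_identical_prompts(
    models: Dict[str, ModelFn], prompts: List[str]
) -> Dict[str, List[str]]:
    """Send every prompt, unchanged, to every model and collect the replies."""
    return {name: [ask(p) for p in prompts] for name, ask in models.items()}


# Placeholder models for illustration only; they do not reflect any real system.
def always_real(prompt: str) -> str:
    return "real"


def always_fake(prompt: str) -> str:
    return "ai-generated"


results = run_identical_prompts(
    {"model_a": always_real, "model_b": always_fake},
    ["Is this video AI-generated?"],
)
print(results)
```

Keeping the prompt list in one place is what guarantees that no model receives a differently worded or politically framed question.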

Key Findings from the Evaluation

1. Assessing the Coffee Foam Claim

Initial testing involved asking each model whether the video was AI-generated. Grok responded with a detailed analysis citing “unrealistic liquid physics,” alleging that the foam’s behavior defies fluid dynamics, and pointing to “hand anomalies” and skin texture issues. These observations misread the nature of cappuccino foam, which does not behave like water, and relied on superficial cues to declare the video manipulated. When I challenged these points, Grok reversed its stance and produced an equally elaborate argument for the opposite conclusion. This shows how fragile the verdict was and how easily it shifted with the framing of the prompt.

2. Deciphering the Blurry Date in the POS Screen

In this test, I presented images of the POS display cropped at different levels and asked the models to interpret the final two digits of the date. Results varied starkly:

Gemini confidently read the date as 2028, hallucinating a future year.
Claude suggested 2026.
ChatGPT leaned toward 2026 with moderate confidence.
Grok confidently identified 2024, aligning with a plausible date.

The models’ inconsistent conclusions, especially Gemini’s confident reading of a future year, highlight how unreliable AI visual interpretation is when the image is unclear.
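A quick tally makes the spread concrete. The sketch below uses only the four readings listed above; the variable names are illustrative, and the agreement figure measures how many models share the most common answer, not whether that answer is correct.

```python
from collections import Counter

# Final date reading per model, as listed in the results above.
readings = {"Gemini": "2028", "Claude": "2026", "ChatGPT": "2026", "Grok": "2024"}

# Count how many models gave each year.
counts = Counter(readings.values())

# The most common reading and the share of models that gave it.
top_year, top_votes = counts.most_common(1)[0]
agreement = top_votes / len(readings)

print(f"Most common reading: {top_year} ({top_votes} of {len(readings)} models)")
print(f"Agreement: {agreement:.0%}")
```

Four models produced three different years, and the most common reading is shared by only half of them.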

3. Challenging the B-Roll Hypothesis

Next, I asked what would explain the discrepancy if the POS screen shows a 2024 date but the video was filmed in 2026. The models split:

Gemini considered archival footage the most probable explanation, echoing a speculative conspiracy theory.
Claude, ChatGPT, and Grok pointed toward clock misconfiguration, such as incorrect system dates, which is a well-known issue for POS systems. Grok provided specific technical insights, citing hardware and software factors.

This disparity underscores how models can drift from technical plausibility to speculative narratives depending on prompts.

4. Incorporating Political and Contextual Information

After I added context about the conspiracy theory surrounding Netanyahu’s death and the video’s background, the models adjusted their assessments:

Gemini retracted its earlier date reading, insisting the screen didn’t say 2024, despite visual evidence.
Grok acknowledged its initial “AI-generated” label as an over-interpretation, emphasizing its capacity to revise conclusions with further input.
Claude emphasized the unfalsifiability of the scenario, noting if Netanyahu appears, he’s alive; if not, he’s dead.
ChatGPT utilized external sources like Reuters and PolitiFact, providing measured, properly sourced evaluations.

5. The Mirror Test and Self-Assessment

Finally, I asked the models to identify which of four anonymous models displayed the most reliable methodology, based solely on their behavior:

Model A (Gemini): repeatedly changed its visual interpretation, fabricated sources, and displayed inconsistent reasoning.
Model B (Grok): initially labeled real content as AI-generated, then reversed and claimed to be “truth-seeking.”
Model C (Claude): maintained consistency, acknowledging the limits of falsifiability.
Model D (ChatGPT): consistently provided balanced, sourced, and calibrated responses.

When asked to identify which model they were, Gemini was transparent, acknowledging its flaws. Grok showed some self-awareness but failed to recognize itself among the anonymized descriptions. Claude and ChatGPT professed confidence with nuanced caveats.

Overall Performance Rankings

ChatGPT 5.4 — Demonstrated the highest reliability: consistent reasoning, external referencing, and balanced confidence.
Claude Opus 4.6 — Strong in logical reasoning, but lacked external search capabilities.
Grok 4.2 Expert — Started with significant errors but gave strong technical explanations and acknowledged its limitations.
Gemini 3.1 Pro — Showed multiple self-awareness failures and fabricated details, yet was the only model transparent about its identity and flaws.

Implications for Media Verification Today

This experiment exposes critical weaknesses in current AI tools when applied to media verification:

Questionable Consistency: Models often change their conclusions based solely on prompt variations.
Hallucinations: Fabrication of plausible yet unverified sources or insights is common.
Lack of External Verification: Most models do not perform real-time fact-checking or consider the broader context when relying solely on training data.
Self-Awareness and Self-Critique: Few models honestly acknowledge their fallibility; others exhibit hubris or self-deception.
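The consistency problem can be quantified by counting how often a model's verdict changes across rewordings of the same question. The following is a minimal sketch: `count_flips` is a hypothetical helper, and the verdict sequence is made up for illustration rather than taken from the transcripts.

```python
from typing import List


def count_flips(verdicts: List[str]) -> int:
    """Count how many times consecutive verdicts differ from each other."""
    return sum(1 for prev, cur in zip(verdicts, verdicts[1:]) if prev != cur)


# Illustrative sequence only: one model's answers to four rewordings of a question.
example = ["ai-generated", "real", "real", "ai-generated"]

print(count_flips(example))
```

A model whose evidence has not changed should score zero; any higher count means the wording of the prompt, not the video, moved the answer.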

In their own words, the models warn users:

“Use adversarial testing to evaluate AI reliability.” — Claude
“Treat AI confidence as fallible; verify with independent checks.” — ChatGPT
“Always seek external validation; models should ground judgments in real-world data.” — Gemini
“Avoid trusting isolated visual interpretations; corroborate with live searches.” — Grok

Conclusion: Proceed with Caution

While AI models can assist in media scrutiny, they are far from foolproof. Their judgments are susceptible to prompt-driven biases, hallucinations, and overconfidence. Relying solely on AI for verification—especially in high-stakes contexts—can be misleading and dangerous.

Users and media professionals must treat AI responses as preliminary hypotheses, always supplemented by independent research and critical analysis. No model currently offers infallible certainty, and understanding each tool’s limitations is vital.

Interested in the full transcripts or detailed exchanges? Comments are open—ask and I’ll share the complete data to support these findings.
