Imagine a world where AI can flawlessly mimic surgical procedures, creating videos so convincing they could train future doctors. Sounds revolutionary, right? But here's the shocking truth: Google's Veo-3, a cutting-edge video AI, can generate surgical footage that looks eerily real, yet it fails miserably when it comes to understanding the actual medical logic behind these procedures.
Researchers decided to put Veo-3 to the test, showing it a single frame from real surgical footage and asking it to predict the next eight seconds of the procedure. To evaluate its performance, they created the SurgVeo benchmark, a rigorous test built from 50 authentic videos of abdominal and brain surgeries. Four seasoned surgeons then scored the AI-generated clips on four criteria: visual appearance, instrument use, tissue feedback, and medical plausibility.
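The scoring setup can be sketched as a simple aggregation, assuming each of the four surgeons assigns a 1-to-5 plausibility rating per clip and per criterion. The criterion names mirror the article; the ratings below are purely illustrative, not the study's data:

```python
from statistics import mean

# Four evaluation criteria, as described in the article.
CRITERIA = ["visual", "instrument", "tissue", "logic"]

# Hypothetical ratings: per clip, four surgeons each score
# every criterion on a 1-5 scale (made-up numbers).
ratings = {
    "clip_01": {"visual": [4, 4, 3, 4], "instrument": [2, 1, 2, 2],
                "tissue": [2, 2, 1, 2], "logic": [1, 2, 2, 1]},
    "clip_02": {"visual": [4, 3, 4, 4], "instrument": [2, 2, 1, 2],
                "tissue": [1, 2, 2, 1], "logic": [2, 1, 1, 2]},
}

def criterion_means(ratings):
    """Average each criterion over all clips and all raters."""
    return {
        c: round(mean(s for clip in ratings.values() for s in clip[c]), 2)
        for c in CRITERIA
    }

print(criterion_means(ratings))
# e.g. {'visual': 3.75, 'instrument': 1.75, 'tissue': 1.62, 'logic': 1.5}
```

Even in this toy version, the pattern the surgeons reported shows up: visual plausibility averages high while the medically grounded criteria sit near the bottom of the scale.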
And this is the part most people miss: While Veo-3's visuals were initially impressive—with some surgeons praising the clarity—the content crumbled under scrutiny. In abdominal surgery tests, it scored a respectable 3.72 out of 5 for visual plausibility after just one second. But when medical accuracy was required, its performance plummeted. Instrument handling scored a mere 1.78, tissue response 1.64, and surgical logic a dismal 1.61. The AI could create stunning images, but it couldn't replicate the intricate realities of an operating room.
The gaps widened even further in brain surgery footage. Neurosurgery demands fine precision, and Veo-3 struggled from the start: instrument handling scored just 2.77 points, and surgical logic bottomed out at 1.13 after eight seconds. Over 93% of errors were rooted in medical logic: the AI invented tools, imagined impossible tissue responses, or performed actions with no clinical basis. Only a tiny fraction of errors (6.2% for abdominal and 2.8% for brain surgery) concerned image quality.
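That error breakdown amounts to a simple tally, assuming each flagged error carries a category label. The labels and counts here are illustrative stand-ins, not the study's actual error log:

```python
from collections import Counter

# Hypothetical error log: each flagged error tagged with a category.
errors = (
    ["medical_logic"] * 30 +   # invented tools, impossible tissue responses
    ["image_quality"] * 2      # purely visual artifacts
)

counts = Counter(errors)
total = sum(counts.values())
shares = {cat: round(100 * n / total, 1) for cat, n in counts.items()}
print(shares)  # {'medical_logic': 93.8, 'image_quality': 6.2}
```

With these made-up counts the split happens to match the abdominal figures quoted above, which is the point: the failures cluster overwhelmingly in medical reasoning, not rendering.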
Researchers tried giving Veo-3 more context, such as the type of surgery or its phase, but the results showed no significant improvement. The issue, they concluded, isn’t the lack of information but the model’s inability to process and understand it. This raises a controversial question: Can AI ever truly grasp the complexities of medical procedures, or are we fooling ourselves by relying on visually convincing but fundamentally flawed simulations?
The SurgVeo study highlights just how far current video AI is from achieving real medical understanding. While future systems might one day assist in surgical training or planning, today’s models are far from ready. They produce videos that look real but lack the knowledge to make safe or meaningful decisions. This isn’t just an academic concern—it’s a potential danger. If AI-generated videos depict medically incorrect procedures, they could mislead trainees or robots, leading to harmful outcomes.
But here's where it gets even more controversial: Unlike Nvidia's AI, which uses synthetic videos to train robots for general tasks, healthcare demands precision and accuracy. AI hallucinations in this field aren’t just useless—they’re risky. The concept of video models as “world models,” capable of understanding physical and anatomical logic, remains a distant dream. Current systems can mimic appearance and movement but fail to capture the cause-and-effect relationships that define surgery.
Meanwhile, text-based AI is making strides in medicine. Microsoft’s “MAI Diagnostic Orchestrator,” for instance, demonstrated diagnostic accuracy four times higher than experienced general practitioners in complex cases, though with noted methodological limitations. This contrast underscores the divide between visual and cognitive AI capabilities.
The researchers plan to release the SurgVeo benchmark on GitHub, inviting others to test and improve their models. But the study’s implications are clear: while AI can dazzle us with its visual prowess, it’s the understanding beneath the surface that truly matters. What do you think? Can AI ever bridge this gap, or are we asking too much of it? Share your thoughts in the comments—let’s spark a debate!