It’s 2 a.m. You’ve got a weird pain in your side. It’s probably nothing, right? But the "what if" thoughts start swirling. In the past, you’d turn to "Dr. Google" and fall down a rabbit hole of terrifying, worst-case-scenario forum posts.
Today, you have another option. You can open an app and ask an AI.
It feels like every tech giant is suddenly getting into the AI healthcare game. Microsoft just rolled out Copilot Health, a special space in its app where you can ask questions about your medical records. Amazon announced its Health AI tool is now widely available. And this is all on top of OpenAI’s ChatGPT Health, which has been around since January.
Let's be honest, the appeal is obvious. Getting a real doctor's appointment can feel like a quest from a fantasy novel. So, having a non-judgmental bot available 24/7 to answer your health questions sounds pretty great. The big question, though, is a simple one: How well do these things actually work? And more importantly, are they safe?
So, Why the Sudden Rush to AI Health?
If you ask the developers, they’ll tell you it’s a perfect storm of two things: the technology is finally getting good enough, and we, the users, are practically begging for it.
Dominic King, a former surgeon who’s now a VP at Microsoft AI, says the incredible progress in generative AI is the main reason they’re diving in. The models can now give surprisingly good answers to health questions.
But that’s only half the story. The other half is you. And me. And everyone else. Microsoft says it gets a staggering 50 million health-related questions on Copilot every single day. It’s the most popular topic on the Copilot mobile app. OpenAI noticed the same thing. Karan Singhal, who leads its Health AI team, described a "rapid, rapid increase" in people using ChatGPT for health questions long before the company built a dedicated health product.
Frankly, it’s not that surprising. As Dr. Girish Nadkarni from Mount Sinai points out, "access to health care is hard, and it’s particularly hard for certain populations." When the official system is slow and difficult to navigate, people are going to look for alternatives.
The Dream vs. The Reality
The best-case scenario for these AI health bots is genuinely exciting. Imagine a world where an AI can help you figure out if that stomachache is just bad takeout or something that needs a trip to the ER. Sorting cases by urgency like this is called "triage."
If it works, it could be amazing. People who need urgent care might get there faster, while those with minor issues could get advice on how to manage at home, freeing up doctors and emergency rooms for those who truly need them. It could be a huge win for an over-stressed healthcare system.
But here’s the catch. What if the AI gets it wrong?
A recent, much-talked-about study from Dr. Nadkarni and his team at Mount Sinai found that ChatGPT Health sometimes overreacted to mild problems, but even more worryingly, it sometimes failed to identify emergencies. That’s the nightmare scenario, isn’t it? An AI telling you to "rest and drink fluids" when you actually need to be in an ambulance.
Of course, all these apps have big, bold disclaimers saying they’re not for diagnosis or treatment. But we all know how that goes. As Dr. Adam Rodman, a physician and researcher, puts it, “We all know that people are going to use it for diagnosis and management.” It’s just human nature.
The Million-Dollar Question: Who's Actually Testing These Things?
This brings us to the real heart of the issue. In a high-stakes field like medicine, we need to know these tools are rigorously tested. But who’s doing the testing? Right now, it’s mostly the companies themselves.
OpenAI, for example, created a benchmark called HealthBench to score its models. And yes, their latest model, GPT-4, did much better than the older ones. But trusting a company to grade its own homework feels… a little dicey.
Andrew Bean, a researcher at the Oxford Internet Institute, puts it perfectly: “It’s entirely plausible to me that these models have reached a point where they’re actually worth rolling out. But the evidence base really needs to be there.”
And there are big blind spots in the companies' internal tests. Bean and his colleagues ran a fascinating study. They found that even if an AI could correctly identify a medical condition from a written description, a regular person using that AI might only figure it out a third of the time.
Why? Because most of us aren't doctors. We don’t know which details are important to include in our prompts. If the AI isn't great at asking follow-up questions to get the information it needs, it might give you useless, or just plain wrong, advice. It’s a huge potential weakness.
Is There a Better Way? A Look at Google's Slow-and-Steady Approach
Interestingly, not everyone is rushing their products out the door. Google has been working on a medical AI chatbot called AMIE, but you can’t use it yet.
Instead of just releasing it, they ran a proper study. They had real patients talk to AMIE about their symptoms and then see a human doctor. The results were pretty encouraging: AMIE’s diagnoses were just as accurate as the physicians’, and no major safety issues popped up.
So why isn’t Google releasing it? Because they know there’s still more work to do. Alan Karthikesalingam, a research scientist at Google DeepMind, said there are "significant limitations that must be addressed before real-world translation," including more safety and fairness testing.
It’s a slower, more cautious approach, and it shows that careful evaluation with real patients is possible before a product ships.
We Need a Trusted Referee
Ultimately, this is what it all comes down to. We need a trusted third party. Someone who isn't trying to sell a product.
Dr. Rodman, who actually worked on that Google AMIE study, asks the right question: “Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”
Some folks are trying to build exactly that. A team at Stanford led by Professor Nigam Shah created a framework called MedHELM to test medical AI models on a bunch of different tasks. It's a great start, and right now, OpenAI's GPT-4 holds the top score.
But even Shah admits it has limits. MedHELM tests single responses, but a real medical conversation is a back-and-forth. It’s a complex dance of questions and answers. He’s working on a way to evaluate those conversations, but it will take time and money.
Shah’s take is blunt and a little sobering. “You and I have zero ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they damn please,” he says. “The only thing people like us can do is find a way to fund the benchmark.”
No one is asking for perfection here. Human doctors make mistakes, too. For someone who can rarely see a doctor, a pretty-good AI that’s always available could be a massive improvement over nothing.
But right now, we’re flying blind. We simply don’t have the independent, unbiased evidence to know for sure if these tools are a net positive. We don’t know if their benefits truly outweigh their very real risks. And when it comes to your health, "we think it's probably safe" just isn't good enough.




