Research findings: nearly half of AI medical advice has issues, Grok is the worst, and OpenAI is still expanding its medical ambitions
According to the latest study published in BMJ Open, about 50% of the answers that five major AI chatbots give to medical questions contain problems, and nearly 20% are rated as "highly problematic." Bloomberg noted that the study reveals systemic risks in AI medical applications, arriving at a particularly ironic moment as OpenAI and Anthropic simultaneously expand their healthcare footprints.
(Background: Don’t hand your medical records to chatbots? The privacy gamble behind ChatGPT Health’s medical ambitions)
(Additional background: University of California research on the "AI brain fog" phenomenon: 14% of office workers are driven to distraction by AI agents and automation, with 40% higher intent to quit)
More than 230 million people ask ChatGPT health and medical questions every week, but nearly half of the answers they get may be problematic. According to a study published this week in the medical journal BMJ Open, researchers from the United States, Canada, and the United Kingdom conducted a systematic evaluation of five major platforms: ChatGPT, Gemini, Meta AI, Grok, and DeepSeek. They posed the same 10 questions, spanning five medical categories, to each platform.
The results are not encouraging: about 50% of the responses were deemed problematic, and nearly 20% were rated "highly problematic."
Grok performs the worst, and ChatGPT is not far behind
Bloomberg reports that performance differences among the platforms are substantial, but none of them passes the test. Measured by problem rate, Grok fares worst at 58%; ChatGPT follows closely at 52%; Meta AI stands at 50%.
Researchers observed that the chatbots performed relatively better on closed-ended questions and on topics related to vaccines and cancer, while their performance declined noticeably on open-ended questions and topics such as stem cells and nutrition. There were only two refusals to answer in the entire study, both from Meta AI (somewhat ironically, knowing when not to answer has become a rare advantage).
More concerning, these AIs often deliver their answers with confidence: an affirmative tone, no reservations. The researchers specifically emphasized that none of the chatbots, under any prompt, could provide a complete and accurate list of references. This means that even when an AI appears "well-grounded," the sources it cites are often unverifiable, or may not exist at all.
The more confidently AI speaks, the higher the risk
The researchers wrote in the paper that these systems can generate responses that “sound authoritative but may actually have flaws,” highlighting the “significant behavioral limitations” of AI chatbots in public-facing health and medical communication, as well as “the need to reassess deployment approaches.”
Bloomberg also quoted the research team's warning: in the absence of public education and regulatory mechanisms, the biggest risk of deploying chatbots at scale is that they will facilitate the spread of incorrect medical information.
This is not an isolated finding: a JAMA study indicates that AI's failure rate in preliminary diagnosis cases exceeds 80%, and Oxford University issued a warning in February 2026 urging that the systemic risks of AI chatbots providing medical advice be taken seriously.
OpenAI and Anthropic: researchers apply the brakes, but business presses the accelerator
The timing of this study's release is quite dramatic. Just a few months earlier, in January 2026, OpenAI rolled out ChatGPT Health with great fanfare. The feature lets users connect electronic medical records, wearable devices, and health applications, and the company also launched a professional version of the tool for clinicians. OpenAI has publicly stated that 40 million people use ChatGPT to look up health information every day.
Almost at the same time, Anthropic announced Claude for Healthcare, officially entering the healthcare market with HIPAA compliance.
These platforms have neither medical licenses nor clinical judgment, yet they are expanding into healthcare at an astonishing pace. The tension between this commercial expansion and the research findings exposes a regulatory vacuum: at present, there is no clear safeguard between the marketing of AI medical tools and actual medical safety.
Trust AI, but only with conditions
This is not the first time AI medical applications have been singled out, but each study reaches the same conclusion: AI chatbots are fundamentally language models. What they excel at is "sounding correct," not "being correct." The problem is that when users turn to them with genuine health anxieties, the appearance of correctness is often enough to influence decisions.
As companies such as OpenAI and Anthropic deepen their involvement in medical settings, regulation and public education are clearly not keeping pace with technological expansion. Until clear guardrails are established, this study may serve as a reminder: AI can be a gateway to health information, but it should not be the endpoint.