Generative AI Chatbots and Medical Misinformation: An Audit of Accuracy, Referencing, and Readability

14 Apr 2026

GIST

Objectives

Artificial intelligence (AI)-driven chatbots have been rapidly adopted across research, education, business, marketing and medicine. However, most interactions come from non-experts who use chatbots as search engines, including for everyday health and medical queries.

Design

We conducted an original study to audit chatbot responses to questions in health and medical fields prone to misinformation.

Methods

Five popular chatbots were assessed: Gemini (Google), DeepSeek (High-Flyer), Meta AI (Meta), ChatGPT (OpenAI) and Grok (xAI).

In February 2025, each chatbot was prompted with 10 questions in each of five categories: cancer, vaccines, stem cells, nutrition and athletic performance. We used an adversarial-style framework, with open- and closed-ended prompts designed to push the models toward misinformation or contraindicated advice.

Two experts in each category rated each response as ‘non-problematic’, ‘somewhat problematic’ or ‘highly problematic’ using a coding matrix based on objective, predefined criteria. Citations were scored for accuracy and completeness, and each response was assigned a Flesch Reading Ease score.
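The Flesch Reading Ease score is computed from average sentence length and average syllables per word: FRE = 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). Higher scores indicate easier text. The sketch below is illustrative only. The abstract does not say which tool the authors used, and the regex-based sentence splitting and vowel-group syllable counter are rough heuristics assumed here, not the study's method.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups, drop a silent trailing 'e'.

    Production readability tools use pronunciation dictionaries or richer rules.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing 'e' is usually silent ("make"), but not in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier-to-read text."""
    # Naive sentence split on terminal punctuation; empty fragments are discarded.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )
```

Scores are conventionally read on a 0 to 100 scale (very short, simple sentences can exceed 100), with roughly 60 to 70 considered plain English and scores below 30 considered very difficult. Long sentences and polysyllabic medical terms both pull the score down.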

Results

Nearly half (49.6%) of responses were problematic: 30% were somewhat problematic and 19.6% highly problematic.
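The reported percentages can be checked for internal consistency. Assuming each chatbot answered 10 questions in each of the five categories (5 × 5 × 10 = 250 responses, a total the abstract does not state outright), both percentages correspond to whole numbers of responses, whereas a total of 50 would not. The helper name `implied_counts` is hypothetical and used only for illustration.

```python
# Assumption (not stated explicitly in the abstract): 5 chatbots x 5 categories x 10 questions.
TOTAL_RESPONSES = 5 * 5 * 10  # 250

# Percentages as reported in the Results.
REPORTED_PERCENT = {
    "somewhat problematic": 30.0,
    "highly problematic": 19.6,
}


def implied_counts(percentages: dict[str, float], total: int) -> dict[str, int]:
    """Convert reported percentages into response counts.

    Raises ValueError if a percentage does not correspond to a whole number
    of responses for the given total (a sign the assumed total is wrong).
    """
    counts = {}
    for label, pct in percentages.items():
        raw = pct * total / 100
        n = round(raw)
        if abs(raw - n) > 1e-9:  # tolerate floating-point noise only
            raise ValueError(f"{pct}% of {total} is not a whole number ({label})")
        counts[label] = n
    return counts


counts = implied_counts(REPORTED_PERCENT, TOTAL_RESPONSES)
problematic = sum(counts.values())
print(counts, problematic, round(100 * problematic / TOTAL_RESPONSES, 1))
```

Under this assumption the script prints 75 somewhat and 49 highly problematic responses, 124 in total, which matches the reported 49.6% of 250. Calling `implied_counts(REPORTED_PERCENT, 50)` raises `ValueError`, because 19.6% of 50 is 9.8 responses.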