Objectives To assess and compare the performance of four contemporary frontier large language models (LLMs), GPT-5.2 (OpenAI), Gemini 3 Pro (Google DeepMind), Claude Sonnet 4.6 (Anthropic) and Grok 4.1 (xAI), on a simulated Fellowship of The Royal College of Surgeons Urology (FRCS(Urol)) Part A examination, evaluating overall accuracy, subspecialty-level performance, output consistency and response time.
Design Controlled comparative evaluation study using a standardised simulation framework with repeated independent testing runs per model.
Setting All models were accessed via their respective consumer-facing interfaces. No clinical setting or patient data were involved. Testing was conducted under uniform conditions with conversational memory disabled across all sessions.
Participants Four large language models were evaluated. No human participants were involved. Models were selected to represent the current frontier of publicly accessible LLMs from four distinct commercial developers. No models were excluded following selection.
Interventions Each model was presented with 240 FRCS(Urol) Part A single best answer questions, mapped to the Joint Committee on Intercollegiate Examinations' Urology Syllabus Blueprint (2023). A standardised prompt was delivered at the start of each session.
BMJ Open published a clinical update in Research Highlights on 13 May 2026.
The item focuses on the performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination.