Evaluation of ChatGPT-5 reliability in urodynamic study interpretation

Nadav Dekel,
Shachar Aharoni,
Chen Shenhar,
Amir Buchler

Abstract

Publication: AUA26, May 2026

https://www.auajournals.org/doi/10.1097/01.JU.0001191476.64741.7c.04

Introduction and objectives

The use of artificial intelligence (AI) in medicine is rapidly expanding. Urodynamic studies (UDS) are a cornerstone of functional urology, yet their interpretation is complex and highly dependent on expertise. ChatGPT-5 may serve as an assistive tool for interpreting such studies, supporting physicians who do not specialize in functional urology. This study aimed to evaluate the reliability of ChatGPT-5 in interpreting UDS and to compare its clinical recommendations with those of human experts.

Methods

High-resolution images of urodynamic studies performed at a single center during 2025 were uploaded to the ChatGPT-5 platform, each accompanied by a brief demographic and clinical summary. All studies had been retrospectively interpreted by three independent experts. The model was instructed to complete a structured interpretation form comprising 32 categories covering filling, voiding, and summary parameters, and to propose a treatment plan. Model outputs were compared with expert interpretations, and a reliability score was calculated for each case based on categorical agreement or on numerical agreement within a 10% tolerance margin. The model was not trained between cases; each case was entered and analyzed as a new, independent instance without inter-case learning.
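
The per-case comparison described above can be illustrated with a minimal sketch. It is not the authors' scoring code. Several details are assumptions, because the abstract does not specify them: the 10% tolerance is taken relative to the expert value, categorical fields are matched as case-insensitive labels, missing model answers count as disagreement, and the per-field weights default to 1. The field names and values in the example are hypothetical and are not study data.

```python
TOLERANCE = 0.10  # 10% tolerance margin stated in the Methods


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def numeric_agrees(model_value, expert_value, tolerance=TOLERANCE):
    # Assumption: the margin is relative to the expert's value.
    if expert_value == 0:
        return model_value == 0
    return abs(model_value - expert_value) <= tolerance * abs(expert_value)


def categorical_agrees(model_value, expert_value):
    # Assumption: categories match when their labels are equal,
    # ignoring case and surrounding whitespace.
    return str(model_value).strip().lower() == str(expert_value).strip().lower()


def reliability_score(model_form, expert_form, weights=None):
    """Return the weighted percentage of expert fields the model matched."""
    weights = weights or {}  # hypothetical weights; every field defaults to 1
    total = 0.0
    agreed = 0.0
    for field, expert_value in expert_form.items():
        weight = weights.get(field, 1.0)
        total += weight
        if field not in model_form:
            continue  # assumption: a missing answer counts as disagreement
        model_value = model_form[field]
        if is_number(expert_value) and is_number(model_value):
            match = numeric_agrees(model_value, expert_value)
        else:
            match = categorical_agrees(model_value, expert_value)
        if match:
            agreed += weight
    if total == 0:
        raise ValueError("expert form has no scorable fields")
    return 100.0 * agreed / total


# Made-up values for illustration only (Qmax in mL/s, PVR in mL).
example_expert = {"compliance": "normal", "stability": "stable",
                  "Qmax": 12.0, "PVR": 100.0}
example_model = {"compliance": "Normal", "stability": "unstable",
                 "Qmax": 13.0, "PVR": 150.0}

print(reliability_score(example_model, example_expert))
```

In this made-up case the model matches the expert on compliance and on Qmax (13 is within 10% of 12) but not on stability or PVR, so two of the four equally weighted fields agree.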

Results

Sixty-four UDS were analyzed (52% male), with a median patient age of 69 years (IQR 53–77). The median weighted reliability score of ChatGPT-5 was 73% (IQR 57–88). The highest scores were observed in female stress urinary incontinence (SUI) cases (mean 79.5%). No significant association was found between age, sex, or diagnosis and the reliability score. In the cystometrogram phase, categorical agreement was observed for sensation (74%), stability (79%), compliance (91%), and leakage (72%). In voiding parameters, the model correctly identified a voiding phase in 81% of cases, with 65% categorical agreement for voiding type. Numerical parameters showed lower agreement (MCC 58%, Qmax 57%, BOOI 54%, PVR 33%) than categorical parameters, which averaged 76% (p = 0.001). Agreement on the clinical plan was achieved in 66% of cases, with the highest concordance in female SUI (88%) and neurogenic bladder (70%) and the lowest in male LUTS (45%), a significant difference across diagnoses (p = 0.03). Agreement was strongest when experts recommended surgery or advanced therapies (Botox/neuromodulation), reaching 70–80%.

Conclusions

This study demonstrates preliminary yet promising potential of AI-based tools in UDS interpretation. While categorical interpretations and treatment recommendations were often consistent with expert opinions, quantitative reliability remains limited. Further research and model refinement are needed to improve diagnostic accuracy and clinical applicability.

Source of Funding

None