Comparative evaluation of the accuracy and completeness of responses provided by ChatGPT-5, Gemini 2.5 Flash, Grok 4, and DeepSeek-R1 to open-ended questions on cracked and fractured teeth

Makbule Taşyürek(1), Özkan Adıgüzel(2), Suzan Cangül(3), Hatice Ortaç(4)
(1) Dicle University, Faculty of Dentistry, Department of Endodontics, Diyarbakır, Türkiye,
(2) Dicle University, Faculty of Dentistry, Department of Endodontics, Diyarbakır, Türkiye,
(3) Dicle University, Faculty of Dentistry, Department of Restorative Dentistry, Diyarbakır, Türkiye,
(4) Dicle University, Faculty of Medicine, Department of Biostatistics, Diyarbakır, Türkiye

Abstract

Aim: This study aimed to evaluate and compare the accuracy and completeness of responses to open-ended questions on cracked and fractured teeth given by four recently introduced large language models (LLMs): ChatGPT-5, Grok 4, Gemini 2.5 Flash, and DeepSeek-R1-0528.


Methods: A total of 25 open-ended questions on cracked and fractured teeth were prepared. The LLMs' responses to these questions were evaluated for accuracy and completeness by restorative dentistry and endodontics specialists, each with 15 years of experience. Accuracy was rated on a 5-point Likert scale and completeness on a 3-point Likert scale. The intraclass correlation coefficient (ICC), Shapiro-Wilk test, Kruskal-Wallis test, Dunn-Bonferroni post hoc test, and Spearman correlation analysis were used for the statistical analyses.
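The omnibus-then-post-hoc workflow described above (Kruskal-Wallis across the four models, followed by pairwise Dunn tests with Bonferroni correction) can be sketched as follows. The Likert scores below are made up for illustration, and the minimal Dunn implementation (without tie correction) is an assumption of this sketch, not the study's actual analysis code.

```python
import numpy as np
from scipy import stats

# Hypothetical 5-point Likert accuracy scores for 25 questions per model
rng = np.random.default_rng(0)
scores = {
    "ChatGPT-5":        np.full(25, 5),
    "Gemini 2.5 Flash": rng.choice([4, 5], size=25, p=[0.2, 0.8]),
    "Grok 4":           rng.choice([4, 5], size=25, p=[0.3, 0.7]),
    "DeepSeek-R1":      rng.choice([3, 4, 5], size=25, p=[0.1, 0.3, 0.6]),
}

# Omnibus test: do the score distributions of the four models differ?
h_stat, p_omnibus = stats.kruskal(*scores.values())

def dunn_bonferroni(groups):
    """Pairwise Dunn z-tests on pooled ranks, Bonferroni-adjusted.
    Minimal version without the tie correction used by full implementations."""
    names = list(groups)
    pooled = np.concatenate([groups[n] for n in names])
    ranks = stats.rankdata(pooled)
    n_total = len(pooled)
    mean_rank, sizes, start = {}, {}, 0
    for n in names:
        k = len(groups[n])
        mean_rank[n] = ranks[start:start + k].mean()
        sizes[n] = k
        start += k
    m = len(names) * (len(names) - 1) // 2  # number of pairwise comparisons
    results = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            se = np.sqrt(n_total * (n_total + 1) / 12.0
                         * (1.0 / sizes[a] + 1.0 / sizes[b]))
            z = abs(mean_rank[a] - mean_rank[b]) / se
            p = 2 * stats.norm.sf(z)           # two-sided p-value
            results[(a, b)] = min(p * m, 1.0)  # Bonferroni adjustment
    return results

pairwise = dunn_bonferroni(scores)
```

With four models, six pairwise comparisons are produced; each raw Dunn p-value is multiplied by six and capped at 1, which is the Bonferroni adjustment referred to in the Methods.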


Results: The accuracy scores for questions on the diagnosis, causes, and treatment of cracked and fractured teeth differed significantly among the four artificial intelligence (AI) models (p=0.015). In the subgroup analysis, the mean score of ChatGPT-5 (5±0) was significantly higher than that of DeepSeek-R1-0528 (4.48±0.77) (p=0.013); no significant difference was observed in the other pairwise comparisons (p>0.05). The completeness scores also differed among the models (p=0.028): the mean score of ChatGPT-5 (3±0) was significantly higher than that of DeepSeek-R1-0528 (2.76±0.44) (p=0.046), with no significant difference in the other pairwise comparisons (p>0.05).


Conclusion: Differences in accuracy and completeness were observed among the LLMs' responses to open-ended questions about cracked and fractured teeth. DeepSeek-R1-0528 obtained the lowest mean scores, followed by Grok 4 and Gemini 2.5 Flash, while ChatGPT-5 obtained the highest. It should be borne in mind that these results may vary with the subject, the form of the questions, and their content; therefore, further studies covering both similar and different subjects are needed. It should also not be overlooked that misinformation in healthcare services can lead to serious consequences.


Authors

Makbule Taşyürek
makbuletasyurek63@gmail.com (Primary Contact)
Özkan Adıgüzel
Suzan Cangül
Hatice Ortaç
Article Details

How to Cite

1.
Taşyürek M, Adıgüzel Ö, Cangül S, Ortaç H. Comparative evaluation of the accuracy and completeness of responses provided by ChatGPT-5, Gemini 2.5 Flash, Grok 4, and DeepSeek-R1 to open-ended questions on cracked and fractured teeth. J Med Dent Invest. 2025;6:e250161. doi:10.5577/jomdi.e250161
