Comparative evaluation of the accuracy and completeness of responses provided by ChatGPT-5, Gemini 2.5 Flash, Grok 4, and DeepSeek-R1 to open-ended questions on cracked and fractured teeth
Abstract
Aim: The aim of this study was to evaluate and compare the accuracy and completeness of responses to open-ended questions on cracked and fractured teeth provided by four recently introduced large language models (LLMs): ChatGPT-5, Grok 4, Gemini 2.5 Flash, and DeepSeek-R1-0528.
Methods: A total of 25 open-ended questions on cracked and fractured teeth were prepared. The models' responses to these questions were evaluated for accuracy and completeness by restorative and endodontic specialists with 15 years of experience. Accuracy was rated on a 5-point Likert scale and completeness on a 3-point Likert scale. The intraclass correlation coefficient (ICC), Shapiro-Wilk test, Kruskal-Wallis test, Dunn-Bonferroni post hoc test, and Spearman correlation analysis were used for the statistical analyses (a minimal sketch of this workflow, on hypothetical data, follows the abstract).
Results: The accuracy scores for questions on the diagnosis, causes, and treatment of cracked and fractured teeth differed significantly among the four artificial intelligence (AI) models (p=0.015). In the subgroup analysis, the mean accuracy score of ChatGPT-5 (5±0) was significantly higher than that of DeepSeek-R1-0528 (4.48±0.77) (p=0.013); no significant differences were observed in the other pairwise comparisons (p>0.05). The completeness scores also differed among the models (p=0.028): the mean score of ChatGPT-5 (3±0) was significantly higher than that of DeepSeek-R1-0528 (2.76±0.44) (p=0.046), with no significant differences in the other pairwise comparisons (p>0.05).
Conclusion: Differences in accuracy and completeness were observed among the LLMs' responses to open-ended questions about cracked and fractured teeth. DeepSeek-R1-0528 obtained the lowest mean scores, followed by Grok 4 and Gemini 2.5 Flash, while ChatGPT-5 obtained the highest. These results may vary with the topic, the phrasing of the questions, and the content; therefore, further studies covering both similar and different topics are needed. It should also be kept in mind that misinformation in healthcare can lead to serious consequences.
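
The following is a minimal sketch of the statistical workflow named in the Methods (ICC for inter-rater agreement, Shapiro-Wilk, Kruskal-Wallis with Dunn-Bonferroni post hoc comparisons, and Spearman correlation). It is not the authors' analysis code: all scores are randomly generated placeholders, the rater and model labels are illustrative, and the libraries used (pandas, SciPy, scikit-posthocs, pingouin) are assumptions about one reasonable toolchain.

```python
"""Illustrative sketch of the abstract's statistical workflow on hypothetical data."""
import numpy as np
import pandas as pd
from scipy import stats
import scikit_posthocs as sp
import pingouin as pg

rng = np.random.default_rng(0)
models = ["ChatGPT-5", "Gemini 2.5 Flash", "Grok 4", "DeepSeek-R1-0528"]

# Hypothetical 5-point Likert accuracy ratings: 25 questions x 2 raters per model.
records = []
for model in models:
    for question in range(25):
        for rater in ("rater_1", "rater_2"):
            records.append({
                "model": model,
                "item": f"{model}-Q{question}",   # unique target per model/question
                "question": question,
                "rater": rater,
                "accuracy": int(rng.integers(3, 6)),  # placeholder scores in 3..5
            })
df = pd.DataFrame(records)

# Inter-rater agreement on the raw ratings (intraclass correlation coefficient).
icc = pg.intraclass_corr(data=df, targets="item", raters="rater", ratings="accuracy")
print(icc[["Type", "ICC", "CI95%"]])

# Mean rating per question per model (averaging the two raters).
per_q = df.groupby(["model", "question"], as_index=False)["accuracy"].mean()

# Normality check per model (Shapiro-Wilk).
for model, grp in per_q.groupby("model"):
    w, p = stats.shapiro(grp["accuracy"])
    print(f"{model}: Shapiro-Wilk W={w:.3f}, p={p:.3f}")

# Omnibus comparison of the four models (Kruskal-Wallis).
groups = [grp["accuracy"].values for _, grp in per_q.groupby("model")]
h, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis H={h:.3f}, p={p_kw:.4f}")

# Pairwise post hoc comparisons (Dunn test with Bonferroni correction).
dunn = sp.posthoc_dunn(per_q, val_col="accuracy", group_col="model",
                       p_adjust="bonferroni")
print(dunn.round(4))

# Spearman correlation, e.g. between accuracy and a hypothetical 3-point completeness score.
completeness = rng.integers(2, 4, size=len(per_q))
rho, p_s = stats.spearmanr(per_q["accuracy"], completeness)
print(f"Spearman rho={rho:.3f}, p={p_s:.3f}")
```

With real data, the placeholder ratings would be replaced by the specialists' scores for each question, and the same pipeline would be run separately for the accuracy and completeness scales.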