Rapid Evolution of Large Language Models in Medical Education: Comparative Performance of ChatGPT-3.5, ChatGPT-5, and DeepSeek on Medical Microbiology MCQs
DOI:
https://doi.org/10.61360/BoniCETR252018770801
Keywords:
ChatGPT-5, artificial intelligence, large language models, medical education, medical microbiology, assessment
Abstract
Rapid advances in large language models (LLMs) warrant specialty-specific benchmarking to assess their educational potential and limitations. We evaluated the newly released generative artificial intelligence (genAI) model ChatGPT-5, alongside DeepSeek-R1 and the earlier ChatGPT-3.5, on 80 multiple-choice questions (MCQs) from a medical microbiology course examination, weighted for midterm and final components. Items were classified according to the revised Bloom’s taxonomy. Performance was compared with that of more than 150 Doctor of Dental Surgery students. Content quality was assessed independently by two consultants in clinical microbiology using the validated CLEAR tool, modified to assess the completeness, accuracy, and relevance of AI-generated content. The mean total scores were 80.5 for ChatGPT-3.5, 96.0 for ChatGPT-5, and 95.5 for DeepSeek-R1, versus a student mean of 86.21/100. ChatGPT-5 and DeepSeek-R1 significantly outperformed ChatGPT-3.5 in completeness and accuracy scores, with no differences between the two. ChatGPT-5 maintained high accuracy across lower- and higher-order cognitive Bloom’s domains, whereas DeepSeek-R1 showed a significant drop on higher-order items. For ChatGPT-3.5, items answered incorrectly had longer answer-choice word counts. CLEAR scores were significantly higher for correct versus incorrect responses in all models (p < 0.001). This study showed that currently available LLMs can exceed average student performance in medical microbiology while providing high-quality explanations. Regular benchmarking is essential to ensure responsible integration of genAI into educational, pedagogical, and assessment tools.
License
Copyright (c) 2025 Malik Sallam, Amal Irshaid, Johan Snygg, Rula Albadri, Mohammed Sallam

This work is licensed under a Creative Commons Attribution 4.0 International License.