Rapid Evolution of Large Language Models in Medical Education: Comparative Performance of ChatGPT-3.5, ChatGPT-5, and DeepSeek on Medical Microbiology MCQs

Contemporary Education and Teaching Research

2737-4203 2737-4335

BONI FUTURE DIGITAL PUBLISHING CO.,LIMITED

https://ojs.bonfuturepress.com/index.php/CETR/article/view/1877 6 8 2025

2025-08-25

Malik Sallam, Amal Irshaid, Johan Snygg, Rula Albadri, Mohammed Sallam

Rapid advances in large language models (LLMs) warrant specialty-specific benchmarking to assess their educational potential and limitations. We evaluated the newly released generative artificial intelligence (genAI) model ChatGPT-5, together with DeepSeek-R1 and the earlier ChatGPT-3.5, on 80 multiple-choice questions (MCQs) from a medical microbiology course examination, weighted for midterm and final components. Items were classified according to the revised Bloom’s taxonomy. Performance was compared with that of more than 150 Doctor of Dental Surgery students. Content quality was assessed independently by two consultants in clinical microbiology using the validated CLEAR tool, modified to assess AI content completeness, accuracy, and relevance. The mean total scores out of 100 were 80.5 for ChatGPT-3.5, 96.0 for ChatGPT-5, and 95.5 for DeepSeek-R1, versus a student mean of 86.21. ChatGPT-5 and DeepSeek-R1 significantly outperformed ChatGPT-3.5 in completeness and accuracy scores, with no significant differences between the two. ChatGPT-5 maintained high accuracy across lower- and higher-order cognitive domains of Bloom’s taxonomy, whereas DeepSeek-R1 showed a significant drop on higher-order items. For ChatGPT-3.5, incorrect responses were associated with longer answer-choice word counts. CLEAR scores were significantly higher for correct than for incorrect responses in all models (p < 0.001). This study showed that currently available LLMs can exceed average student performance in medical microbiology while providing high-quality explanations. Regular benchmarking is essential to ensure the responsible integration of genAI into educational, pedagogical, and assessment tools.

ChatGPT-5, artificial intelligence, large language models, medical education, medical microbiology, assessment
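The abstract reports total scores out of 100 that are weighted across midterm and final components and then compared with the student mean. The following Python sketch illustrates one way such a weighted total could be computed. It is not the authors' scoring procedure: the class and function names, the 40/40 item split, the 40/60 weights, and the correct-answer counts are all hypothetical values chosen for illustration. Only the student mean (86.21) and the ChatGPT-3.5 score (80.5) used in the comparison come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExamComponent:
    """One exam component; the values used below are illustrative, not study data."""
    name: str
    n_items: int      # number of MCQs in the component
    n_correct: int    # number answered correctly
    weight: float     # points this component contributes to the total


def weighted_total(components: list[ExamComponent]) -> float:
    """Sum each component's proportion correct multiplied by its weight."""
    total = 0.0
    for c in components:
        if c.n_items <= 0 or not 0 <= c.n_correct <= c.n_items:
            raise ValueError(f"invalid counts for component {c.name!r}")
        total += c.weight * c.n_correct / c.n_items
    return total


def exceeds_benchmark(score: float, benchmark_mean: float) -> bool:
    """Return True when a weighted score is above the human benchmark mean."""
    return score > benchmark_mean


# Hypothetical split of the 80 items and hypothetical weights (not from the paper).
example = [
    ExamComponent("midterm", n_items=40, n_correct=36, weight=40.0),
    ExamComponent("final", n_items=40, n_correct=38, weight=60.0),
]
score = weighted_total(example)

STUDENT_MEAN = 86.21  # student mean reported in the abstract
print(round(score, 2), exceeds_benchmark(score, STUDENT_MEAN))
```

With weights that sum to 100, the result is directly comparable to the reported scores; for instance, `exceeds_benchmark(80.5, STUDENT_MEAN)` is false for the ChatGPT-3.5 total given in the abstract.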

10.61360/BoniCETR252018770801
