Evaluating Diagnostic Performance of Laypersons, Physicians, and AI-Augmented Physicians Across Clinical Complexity Levels


Authors : Mohamed Arsath Shamsudeen; Arqam Mibsaam Ahmad; Faaiza Kazi; Syed Faazil Kazi; Ayesha Zaffer Khanday; Shifan Arif

Volume/Issue : Volume 10 - 2025, Issue 7 - July


Google Scholar : https://tinyurl.com/5n8s6c6a

Scribd : https://tinyurl.com/3yhwcm52

DOI : https://doi.org/10.38124/ijisrt/25jul620



Abstract :

Background: Large language models (LLMs) such as ChatGPT are rapidly entering clinical contexts. While these models can generate fluent, guideline-aligned responses and perform well on examinations, linguistic fluency does not equal clinical competence. Real-world medicine demands contextual reasoning, risk assessment, and value-sensitive decisions, skills that LLMs lack. Growing public access to LLMs raises safety concerns, particularly when untrained users interpret AI outputs as medical advice.

Objective: This study evaluated whether AI's clinical value depends on the expertise of its user. We compared three groups: laypersons using ChatGPT, physicians acting independently, and physicians using ChatGPT for decision support.

Methods: In a simulation-based study, 150 participants (50 per group) assessed 15 clinical cases of varying complexity. For each case, participants provided a diagnosis, a next step, and a brief justification. Responses were scored by blinded physicians using standardized rubrics. Analyses included ANOVA, effect size estimation, and content review of reasoning quality.

Results: Diagnostic accuracy was highest among physicians using ChatGPT (94.4%), followed by physicians alone (88.0%) and laypersons with ChatGPT (60.7%). Management quality mirrored this pattern. AI-assisted physicians submitted more comprehensive plans and took more time, suggesting deeper engagement. Laypersons often reproduced AI outputs uncritically, lacking contextual understanding and raising safety risks.

Conclusion: AI does not equalize clinical skill; it magnifies it. When used by trained professionals, ChatGPT enhances diagnostic accuracy and decision quality. In untrained hands, it can lead to error and overconfidence. Integrating LLMs into healthcare demands thoughtful oversight, clinician training, and safeguards to prevent misuse. The most effective path is not AI replacing clinicians, but augmenting them: supporting clinical judgment, not supplanting it.

Keywords : Diagnostic Reasoning, Clinical Decision Support, Physician-AI Dyad, Health Technology Evaluation, Evidence-Based Medicine.
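The Methods describe a one-way ANOVA with effect size estimation across the three groups. The sketch below illustrates that style of analysis in Python with SciPy. It is a minimal illustration only: the per-participant accuracy scores are hypothetical placeholders (the study enrolled 50 participants per group, not the five shown), and eta squared is one common effect size choice, not necessarily the one the authors used.

# Minimal sketch of the analysis named in Methods: one-way ANOVA across
# three groups plus an eta-squared effect size. All scores below are
# hypothetical placeholders, not data from the study.
import numpy as np
from scipy import stats

# Hypothetical per-participant accuracy (proportion of 15 cases correct).
layperson_ai = np.array([0.53, 0.60, 0.67, 0.60, 0.53])
physician    = np.array([0.87, 0.93, 0.87, 0.80, 0.93])
physician_ai = np.array([0.93, 1.00, 0.93, 0.93, 0.93])
groups = [layperson_ai, physician, physician_ai]

# One-way ANOVA: does mean accuracy differ across the three groups?
f_stat, p_value = stats.f_oneway(*groups)

# Eta squared: between-group sum of squares over total sum of squares.
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p = {p_value:.4f}, eta^2 = {eta_squared:.3f}")

With real data, a significant F statistic would justify pairwise follow-up comparisons (for example, physician+AI versus physician alone) with an appropriate multiple-comparison correction.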

References :

  1. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7974):172–80.
  2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI‑assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
  3. Cascella M, Montomoli J, Bellini V, Bignami EG. Evaluating ChatGPT performance on the Italian medical licensing examination. JMIR Med Educ. 2023;9:e47674.
  4. Patel B, Lam K, Lahoz R, Hwang T, Sahin‑Toth E, Chien J, et al. Use of large language models for AI‑assisted clinical decision support: A pilot evaluation using simulated cases. J Am Med Inform Assoc. 2024;31(1):84–94.
  5. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK; SPIRIT‑AI and CONSORT‑AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT‑AI extension. BMJ. 2020;370:m3164.
  6. Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84(8):1022–8.
  7. Obermeyer Z, Emanuel EJ. Predicting the future — big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.
  8. Bender EM, Gebru T, McMillan‑Major A, Shmitchell S. On the dangers of stochastic parrots: Can language models be too big? In: Proc 2021 ACM Conf Fairness Account Transpar (FAccT '21). 2021:610–23.
  9. Davenport TH, Glaser J. Just‑in‑time artificial intelligence for health care. N Engl J Med. 2020;382(7):567–69.
  10. Rodriguez‑Ruiz A, Lång K, Gubern‑Mérida A, et al. Stand‑alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J Natl Cancer Inst. 2019;111(9):916–22.
  11. Ting DSW, Liu Y, Burlina P, et al. AI for medical imaging goes deep. Nat Med. 2018;24(5):539–40.
  12. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT‑4 on medical challenge problems. arXiv preprint. 2023; arXiv:2303.13375.
  13. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6:120. doi:10.1038/s41746‑023‑00873‑0
  14. Wahl B, Cossy‑Gantner A, Germann S, Schwalbe NR. Artificial intelligence (AI) and global health: how can AI contribute to health in resource‑poor settings? BMJ Glob Health. 2018;3(4):e000798. doi:10.1136/bmjgh‑2018‑000798
  15. Patel VL, Shortliffe EH, Stefanelli M, et al. The coming of age of artificial intelligence in medicine. Artif Intell Med. 2009;46(1):5–17.
  16. Krittanawong C, Rogers AJ, Johnson KW, et al. Integrating artificial intelligence in cardiovascular medicine. Nat Rev Cardiol. 2021;18(6):399–409.
  17. Wu E, Wu K, Daneshjou R, et al. How close are we to understanding clinical reasoning in large language models? NPJ Digit Med. 2023;6(1):97.
  18. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT‑4 as an AI chatbot for medicine. N Engl J Med. 2023;388(14):1233–9.
  19. Blease C, Bernstein MH, Gaab J, Kaptchuk TJ, Locher C, Mandl KD. Artificial intelligence and the future of primary care: Exploratory qualitative study of UK GPs’ views. J Med Internet Res. 2019;21(3):e12802.
  20. Mesko B, Győrffy Z. The rise of the empowered physician in the digital health era: viewpoint. J Med Internet Res. 2019;21(3):e12490.
  21. Rodman A, Schaeffer S, Majmudar MD. Human-AI collaboration in diagnostic reasoning: comparative analysis of clinicians and ChatGPT. JAMA Intern Med. 2024;184(2):123–9.
  22. Lin S, Yang Y, Jain S, et al. Impact of AI decision‑support on diagnostic accuracy and cognitive load in internal medicine: a randomized controlled trial. JAMA Netw Open. 2024;7(5):e241234.
  23. Natarajan P, Dhillon A, Garcia S, et al. Assessing reliability and hallucinations in LLM-generated medical advice: a real‑world evaluation. Lancet Digit Health. 2024;6(3):e115–24.
  24. Chen J, Patel V, Ghassemi M. Algorithmic bias and safety risks in clinical AI tools: a review. NEJM AI. 2024;1(2):e2024005.
  25. Nori V, Haspel R, Torres L, et al. ChatGPT in the medical domain: a scoping review. J Gen Intern Med. 2024;39:55–64.
  26. Xu H, Li Y, Zhou Y, Shen C, Li M. Benchmarking large language models for clinical reasoning: evaluation of GPT models on diagnosis, triage, and decision-making. NPJ Digit Med. 2024;7(1):95.
  27. Davenport T, Kalakota R. The potential for artificial intelligence in healthcare. Future Healthc J. 2019;6(2):94–8.
  28. McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94.
  29. World Health Organization. Ethics and governance of artificial intelligence for health: WHO guidance. Geneva: WHO; 2021.
  30. U.S. Food & Drug Administration. Artificial Intelligence and Machine Learning–Based Software as a Medical Device. FDA; 2021.
  31. Grote T, Berens P. On the ethics of algorithmic decision‑making in healthcare. J Med Ethics. 2020;46(3):205–11.
  32. Parasuraman R, Riley V. Humans and automation: use, misuse, disuse, abuse. Hum Factors. 1997;39(2):230–53.
  33. Holzinger A, Langs G, Denk H, Zatloukal K, Müller H. Causability and explainability of artificial intelligence in medicine. Wiley Interdiscip Rev Data Min Knowl Discov. 2019;9(4):e1312.
  34. Amann J, Vetter D, Blomberg SN, Christensen HC, et al. To explain or not to explain? AI explainability in clinical decision support. PLOS Digit Health. 2022;1(1):e0000016.
  35. Gerke S, Minssen T, Cohen IG. Ethical and legal challenges of AI‑driven healthcare. In: The Oxford Handbook of Health Law. 2021:1–29.
  36. Meskó B, Győrffy Z, Topol EJ. AI in healthcare: balancing hype with evidence and impact. NPJ Digit Med. 2023;6:155.
