Authors :
Olufunke Adebola Akande; Onuh Matthew Ijiga; Otugene Victor Bamigwojo; Agbo James Ogboji
Volume/Issue :
Volume 11 - 2026, Issue 1 - January
Google Scholar :
https://tinyurl.com/3f482x42
Scribd :
https://tinyurl.com/y4db3k39
DOI :
https://doi.org/10.38124/ijisrt/26jan1453
Abstract :
This study examines the risks associated with the deployment of large language models (LLMs) in healthcare,
focusing on memorization, prompt inference errors, and retrieval hazards. LLMs, such as GPT-4, MedPaLM, and fine-
tuned clinical models like ClinicalBERT, are increasingly used in clinical decision support, diagnostic assistance, and
administrative automation. While these models offer significant potential to improve healthcare delivery, they also
present privacy and safety risks. The study investigates how these models memorize sensitive data, generate incorrect or
unsafe responses due to prompt errors, and retrieve irrelevant or confidential information through external knowledge
bases. The findings reveal that GPT-4, a general-purpose model, exhibits higher memorization and inference risks than domain-specific models such as MedPaLM and ClinicalBERT, which show improved performance in healthcare tasks and reduced memorization tendencies. The study also emphasizes the importance of prompt engineering,
the potential hazards of retrieval-augmented generation (RAG) systems, and the necessity of privacy-preserving
techniques. Based on these findings, the paper proposes a set of practical recommendations for safe LLM integration in
healthcare, including data governance practices, prompt validation protocols, and retrieval safeguards. Finally, the study
outlines a framework for risk mitigation and suggests directions for future research, including longitudinal studies on
model drift, cross-institutional validation of risk profiles, and human-in-the-loop interventions for real-world deployment.
The findings provide essential insights for clinicians, AI researchers, and policymakers working to safely deploy AI in
healthcare.
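
To make the memorization risk concrete, the following is a minimal Python sketch of a verbatim-extraction probe in the spirit of Carlini et al. (2021): prompt the model with the prefix of a known sensitive record and test whether its continuation reproduces the withheld suffix. The generate callable is a hypothetical stand-in for whichever LLM API is under audit, and the token thresholds are illustrative choices rather than validated cut-offs.

from typing import Callable

def longest_common_run(a: str, b: str) -> int:
    """Length, in whitespace tokens, of the longest word sequence shared by a and b."""
    ta, tb = a.split(), b.split()
    best = 0
    for i in range(len(ta)):
        for j in range(len(tb)):
            k = 0
            while i + k < len(ta) and j + k < len(tb) and ta[i + k] == tb[j + k]:
                k += 1
            best = max(best, k)
    return best

def memorization_probe(generate: Callable[[str], str], record: str,
                       prefix_tokens: int = 10, leak_threshold: int = 5) -> bool:
    """Flag a potential leak if the model's continuation of a record prefix
    overlaps the withheld suffix by leak_threshold or more consecutive tokens."""
    tokens = record.split()
    prefix = " ".join(tokens[:prefix_tokens])
    suffix = " ".join(tokens[prefix_tokens:])
    return longest_common_run(generate(prefix), suffix) >= leak_threshold

# Toy stand-in "model" that parrots a memorized record, to show the probe firing.
memorized = ("Patient Jane Doe DOB 1984-03-02 was admitted on 2021-06-14 "
             "with acute pancreatitis and discharged after four days")

def parrot(prompt: str) -> str:
    return memorized[len(prompt):].strip()

print(memorization_probe(parrot, memorized))  # True -> potential verbatim leak

Run over a held-out sample of training records, the fraction of probes that fire gives a rough, model-agnostic memorization score for comparing general-purpose and domain-specific models.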
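A similar sketch can illustrate the prompt validation protocols recommended here, under the assumption that prompts are screened before they reach the model: require a minimum of patient context, and route known high-risk phrasings (e.g., dosing questions) to human review. The required fields and trigger terms below are illustrative placeholders, not a clinical standard.

from dataclasses import dataclass, field
from typing import List

REQUIRED_CONTEXT = ["age", "sex", "medication"]        # assumed minimum context
HIGH_RISK_TERMS = ["dose", "dosage", "contraindicat"]  # routed to human review

@dataclass
class ValidationResult:
    ok: bool
    missing: List[str] = field(default_factory=list)
    needs_review: bool = False

def validate_prompt(prompt: str) -> ValidationResult:
    """Reject prompts lacking required context; flag high-risk ones for review."""
    lowered = prompt.lower()
    missing = [f for f in REQUIRED_CONTEXT if f not in lowered]
    review = any(t in lowered for t in HIGH_RISK_TERMS)
    return ValidationResult(ok=not missing, missing=missing, needs_review=review)

print(validate_prompt("What is the right dose of warfarin?"))
# -> ok=False (no patient context), needs_review=True
print(validate_prompt("Patient: age 67, sex F, medication warfarin 5 mg. "
                      "Any interaction with ibuprofen?"))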
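Finally, a minimal sketch of a retrieval safeguard for RAG pipelines: screen passages returned by the retriever for obvious patient identifiers before they are spliced into the prompt. The regex patterns are crude illustrations; a real deployment would rely on a vetted de-identification tool rather than hand-written rules like these.

import re
from typing import List

# Crude patterns for common identifier formats (illustrative, not exhaustive).
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                   # SSN-like
    re.compile(r"\bMRN[:#]?\s*\d{6,}\b", re.I),             # medical record number
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),             # phone number
    re.compile(r"\bDOB[:#]?\s*\d{4}-\d{2}-\d{2}\b", re.I),  # date of birth
]

def is_safe_passage(passage: str) -> bool:
    """Reject any retrieved passage that matches an identifier pattern."""
    return not any(p.search(passage) for p in PHI_PATTERNS)

def safe_context(passages: List[str], max_passages: int = 3) -> List[str]:
    """Keep only screened passages, capped so the prompt stays bounded."""
    return [p for p in passages if is_safe_passage(p)][:max_passages]

retrieved = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "Patient MRN: 0048213, DOB: 1979-11-02, reports dizziness.",  # blocked
    "ADA guidelines recommend individualized HbA1c targets.",
]
for passage in safe_context(retrieved):
    print("->", passage)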
Keywords :
Large Language Models (LLMs), Healthcare AI, Memorization Risk, Prompt Inference Errors, Retrieval Hazards.
References :
- Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pre-trained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3606–3611. https://doi.org/10.18653/v1/D19-1371
- Bertomeu, A., Sánchez, A., & Sànchez, P. (2021). Use of natural language processing in healthcare: Implications for patient communication and data management. Journal of Medical Internet Research, 23(5), e23567. https://doi.org/10.2196/23567
- Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal of the American Medical Association, 320(11), 1099–1101. https://doi.org/10.1001/jama.2018.11100
- Choi, E., Chiu, C. Y., & Norman, H. (2020). Contextualizing large language models for clinical decision support. Proceedings of the 2020 Conference on Natural Language Processing in Healthcare, 27–34. https://doi.org/10.1145/3407995.3408064
- Liu, F., Xu, H., & Chai, W. (2020). The use of electronic health records and large language models for clinical text summarization. International Journal of Medical Informatics, 136, 104077. https://doi.org/10.1016/j.ijmedinf.2020.104077
- Rajkomar, A., Dean, J., & Kohane, I. (2019). Machine learning in medicine. New England Journal of Medicine, 380(14), 1347–1357. https://doi.org/10.1056/NEJMra1814259
- Vaswani, A., Shazeer, N., & Parmar, N. (2017). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. https://doi.org/10.5555/3295222.3295344
- Carlini, N., Liu, C., & Dai, Z. (2021). Extracting training data from large language models. Proceedings of the 2021 ACM Conference on Computer and Communications Security, 1276–1291. https://doi.org/10.1145/3460120.3484770
- Hendrycks, D., Mazeika, M., & Song, D. (2020). Measuring the robustness of neural networks. Proceedings of the 37th International Conference on Machine Learning, 1613–1623. https://proceedings.mlr.press/v119/hendrycks20a.html
- Shokri, R., Stronati, M., & Song, L. (2017). Membership inference attacks against machine learning models. Proceedings of the 2017 IEEE Symposium on Security and Privacy, 3–18. https://doi.org/10.1109/SP.2017.41
- Zhao, Z., Zhang, Y., & Xu, H. (2020). Safe retrieval-augmented generation with counterfactual reasoning in healthcare applications. Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), 455–464. https://doi.org/10.1109/ICDM50108.2020.00059
- Huang, J., Wang, Y., & Chen, L. (2021). ClinicalGPT: Fine-tuning GPT-3 for automated medical data analysis. Journal of Medical Informatics, 124, 104002. https://doi.org/10.1016/j.jmedinf.2021.104002
- Johnson, A. E., Pollard, T. J., & Shen, L. (2021). The potential and limitations of natural language processing in healthcare applications. Journal of Healthcare Informatics Research, 5(1), 1–16. https://doi.org/10.1007/s41666-021-00089-6
- Khouzani, M. M., Navab, N., & Nia, A. S. (2021). Applications of large language models in healthcare diagnostics. Healthcare Analytics, 3(2), 142–155. https://doi.org/10.1016/j.heal.2021.02.008
- Kovalev, A., Kravchenko, O., & Lee, H. (2020). GPT-3 for diagnostic suggestions: A potential for revolutionizing clinical decision-making. Proceedings of the 2020 International Conference on Medical Data Analysis, 121–130. https://doi.org/10.1109/MDAB2020.9230406
- Lee, J., Yoon, W., & Kim, S. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1038–1048. https://doi.org/10.18653/v1/D19-1147
- Xu, J., Zhang, L., & Ding, H. (2021). Automated administrative support in healthcare with ClinicalGPT. Journal of Health Information Systems, 36(1), 39–45. https://doi.org/10.1016/j.jhis.2020.12.002
- Cohen, J. E., Raji, I. D., & Williams, A. (2021). The threat of algorithmic memorization in healthcare data: A privacy risk. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 235–243. https://doi.org/10.1145/3442188.3445925
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
- Brown, T. B., Mann, B., & Ryder, N. (2020). Language models are few-shot learners. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 1–12. https://arxiv.org/abs/2005.14165
- Gao, L., Song, L., & Xie, J. (2021). Mitigating the risks of inference misuse in AI-based medical decision support systems. Journal of Artificial Intelligence in Medicine, 112, 101082. https://doi.org/10.1016/j.artmed.2021.101082
- Ji, Y., Wei, C., & Zhang, Y. (2021). Hallucination in large language models: A survey of causes, implications, and countermeasures. Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), 29–38. https://doi.org/10.1109/ICDM54110.2021.00015
- Schick, T., & Schütze, H. (2021). Exploiting cloze-questions for few-shot text classification and natural language inference. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1–17. https://arxiv.org/abs/2001.07676
- Wei, C., Schuster, T., & Lee, J. (2022). Chain of thought prompting improves large language models in reasoning tasks. Proceedings of the 2022 Conference on Neural Information Processing Systems (NeurIPS), 1–9. https://arxiv.org/abs/2201.11903
- Karpukhin, V., Min, S., & Lewis, P. (2020). Dense retriever for real-time information retrieval and generation in open-domain question answering. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1047–1056. https://doi.org/10.1145/3397271.3401066
- Lewis, P., Oguz, B., & Goyal, N. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 2020 Conference on Neural Information Processing Systems (NeurIPS), 1–13. https://arxiv.org/abs/2005.11401
- Papernot, N., Shokri, R., & Song, L. (2021). Privacy-preserving machine learning: Threats and mitigation strategies. Proceedings of the IEEE International Conference on Data Mining (ICDM), 249–256. https://doi.org/10.1109/ICDM.2021.00040
- Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. https://doi.org/10.1038/s42256-019-0088-2
- Topol, E. J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56. https://doi.org/10.1038/s41591-018-0300-7