Assessment of Memorization, Prompt Inference, and Retrieval Risks in Healthcare Large Language Models


Authors : Olufunke Adebola Akande; Onuh Matthew Ijiga; Otugene Victor Bamigwojo; Agbo James Ogboji

Volume/Issue : Volume 11 - 2026, Issue 1 - January


Google Scholar : https://tinyurl.com/3f482x42

Scribd : https://tinyurl.com/y4db3k39

DOI : https://doi.org/10.38124/ijisrt/26jan1453



Abstract : This study examines the risks associated with deploying large language models (LLMs) in healthcare, focusing on memorization, prompt inference errors, and retrieval hazards. LLMs such as GPT-4, MedPaLM, and fine-tuned clinical models like ClinicalBERT are increasingly used in clinical decision support, diagnostic assistance, and administrative automation. While these models offer significant potential to improve healthcare delivery, they also present privacy and safety risks. The study investigates how these models memorize sensitive data, generate incorrect or unsafe responses due to prompt errors, and retrieve irrelevant or confidential information through external knowledge bases. The findings reveal that GPT-4, a general-purpose model, exhibits higher memorization and inference risks than domain-specific models like MedPaLM and ClinicalBERT, which showed improved performance on healthcare tasks and reduced memorization tendencies. The study also emphasizes the importance of prompt engineering, the potential hazards of retrieval-augmented generation (RAG) systems, and the necessity of privacy-preserving techniques. Based on these findings, the paper proposes a set of practical recommendations for safe LLM integration in healthcare, including data governance practices, prompt validation protocols, and retrieval safeguards. Finally, the study outlines a framework for risk mitigation and suggests directions for future research, including longitudinal studies on model drift, cross-institutional validation of risk profiles, and human-in-the-loop interventions for real-world deployment. The findings provide essential insights for clinicians, AI researchers, and policymakers working to deploy AI safely in healthcare.
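The memorization risk summarized above can be made concrete with a small probe. The Python sketch below is illustrative only and is not the paper's methodology: it assumes a generic text-completion callable (generate) and a list of candidate training records, and, in the spirit of the training-data extraction attacks of Carlini et al. (2021), it measures how often the model reproduces a record's held-out suffix verbatim when prompted with only the record's prefix.

# Minimal sketch of a verbatim-memorization probe for a clinical LLM.
# Assumptions (not from the paper): `generate` is any text-completion
# callable, and `records` are strings the model may have seen in training.
from typing import Callable, List

def memorization_rate(
    generate: Callable[[str], str],
    records: List[str],
    prefix_len: int = 50,
    match_len: int = 30,
) -> float:
    """Fraction of records whose held-out suffix the model reproduces
    verbatim when prompted with only the record's prefix."""
    leaked = 0
    for record in records:
        prefix, suffix = record[:prefix_len], record[prefix_len:]
        completion = generate(prefix)
        # Conservative check: does the completion contain the first
        # match_len characters of the true suffix verbatim?
        if suffix[:match_len] and suffix[:match_len] in completion:
            leaked += 1
    return leaked / len(records) if records else 0.0

A non-negligible rate on records containing patient identifiers would signal exactly the kind of leakage the study warns about.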

Keywords : Large Language Models (LLMs), Healthcare AI, Memorization Risk, Prompt Inference Errors, Retrieval Hazard.
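The retrieval safeguards recommended in the abstract can likewise be sketched. The following Python fragment is a hedged illustration, not the paper's implementation: it assumes a generic retrieve callable and a few hypothetical PHI regexes, and it drops any retrieved passage matching those patterns before the passage can be appended to a RAG prompt, so that confidential text in an external knowledge base is not surfaced verbatim.

# Minimal sketch of a RAG retrieval safeguard: screen retrieved passages
# for personally identifiable health information (PHI) before they reach
# the model prompt. Patterns and `retrieve` are illustrative assumptions.
import re
from typing import Callable, List

PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-style identifiers
    re.compile(r"\bMRN[:\s]*\d{6,}\b", re.I),    # medical record numbers
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # explicit dates
]

def safe_context(retrieve: Callable[[str], List[str]], query: str) -> List[str]:
    """Return retrieved passages, dropping any that match a PHI pattern."""
    return [p for p in retrieve(query)
            if not any(pat.search(p) for pat in PHI_PATTERNS)]

In production such a filter would be one layer among several (access control, audit logging, de-identified indexes), but even this simple screen blocks the most obvious route by which a retriever can leak identifiers into generated text.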

References :

  1. Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pre-trained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3606–3611. https://doi.org/10.18653/v1/D19-1371
  2. Bertomeu, A., Sánchez, A., & Sànchez, P. (2021). Use of natural language processing in healthcare: Implications for patient communication and data management. Journal of Medical Internet Research, 23(5), e23567. https://doi.org/10.2196/23567
  3. Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal of the American Medical Association, 320(11), 1099–1101. https://doi.org/10.1001/jama.2018.11100
  4. Choi, E., Chiu, C. Y., & Norman, H. (2020). Contextualizing large language models for clinical decision support. Proceedings of the 2020 Conference on Natural Language Processing in Healthcare, 27–34. https://doi.org/10.1145/3407995.3408064
  5. Liu, F., Xu, H., & Chai, W. (2020). The use of electronic health records and large language models for clinical text summarization. International Journal of Medical Informatics, 136, 104077. https://doi.org/10.1016/j.ijmedinf.2020.104077
  6. Rajkomar, A., Dean, J., & Kohane, I. (2019). Machine learning in medicine. New England Journal of Medicine, 380(14), 1347–1357. https://doi.org/10.1056/NEJMra1814259
  7. Vaswani, A., Shazeer, N., & Parmar, N. (2017). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. https://doi.org/10.5555/3295222.3295344
  8. Carlini, N., Liu, C., & Dai, Z. (2021). Extracting training data from large language models. Proceedings of the 2021 ACM Conference on Computer and Communications Security, 1276–1291. https://doi.org/10.1145/3460120.3484770
  9. Hendrycks, D., Mazeika, M., & Song, D. (2020). Measuring the robustness of neural networks. Proceedings of the 37th International Conference on Machine Learning, 1613–1623. https://proceedings.mlr.press/v119/hendrycks20a.html
  10. Shokri, R., Stronati, M., & Song, L. (2017). Membership inference attacks against machine learning models. Proceedings of the 2017 IEEE Symposium on Security and Privacy, 3–18. https://doi.org/10.1109/SP.2017.41
  11. Zhao, Z., Zhang, Y., & Xu, H. (2020). Safe retrieval-augmented generation with counterfactual reasoning in healthcare applications. Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), 455–464. https://doi.org/10.1109/ICDM50108.2020.00059
  12. Huang, J., Wang, Y., & Chen, L. (2021). ClinicalGPT: Fine-tuning GPT-3 for automated medical data analysis. Journal of Medical Informatics, 124, 104002. https://doi.org/10.1016/j.jmedinf.2021.104002
  13. Johnson, A. E., Pollard, T. J., & Shen, L. (2021). The potential and limitations of natural language processing in healthcare applications. Journal of Healthcare Informatics Research, 5(1), 1–16. https://doi.org/10.1007/s41666-021-00089-6
  14. Khouzani, M. M., Navab, N., & Nia, A. S. (2021). Applications of large language models in healthcare diagnostics. Healthcare Analytics, 3(2), 142–155. https://doi.org/10.1016/j.heal.2021.02.008
  15. Kovalev, A., Kravchenko, O., & Lee, H. (2020). GPT-3 for diagnostic suggestions: A potential for revolutionizing clinical decision-making. Proceedings of the 2020 International Conference on Medical Data Analysis, 121–130. https://doi.org/10.1109/MDAB2020.9230406
  16. Lee, J., Yoon, W., & Kim, S. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1038–1048. https://doi.org/10.18653/v1/D19-1147
  17. Xu, J., Zhang, L., & Ding, H. (2021). Automated administrative support in healthcare with ClinicalGPT. Journal of Health Information Systems, 36(1), 39–45. https://doi.org/10.1016/j.jhis.2020.12.002
  18. Cohen, J. E., Raji, I. D., & Williams, A. (2021). The threat of algorithmic memorization in healthcare data: A privacy risk. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 235–243. https://doi.org/10.1145/3442188.3445925
  19. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
  20. Brown, T. B., Mann, B., & Ryder, N. (2020). Language models are few-shot learners. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 1–12. https://arxiv.org/abs/2005.14165
  21. Gao, L., Song, L., & Xie, J. (2021). Mitigating the risks of inference misuse in AI-based medical decision support systems. Journal of Artificial Intelligence in Medicine, 112, 101082. https://doi.org/10.1016/j.artmed.2021.101082
  22. Ji, Y., Wei, C., & Zhang, Y. (2021). Hallucination in large language models: A survey of causes, implications, and countermeasures. Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), 29–38. https://doi.org/10.1109/ICDM54110.2021.00015
  23. Schick, T., & Schütze, H. (2021). Exploiting cloze-questions for few-shot text classification and natural language inference. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1–17. https://arxiv.org/abs/2001.07676
  24. Wei, C., Schuster, T., & Lee, J. (2022). Chain of thought prompting improves large language models in reasoning tasks. Proceedings of the 2022 Conference on Neural Information Processing Systems (NeurIPS), 1–9. https://arxiv.org/abs/2201.11903
  25. Karpukhin, V., Min, S., & Lewis, P. (2020). Dense retriever for real-time information retrieval and generation in open-domain question answering. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1047–1056. https://doi.org/10.1145/3397271.3401066
  26. Lewis, P., Oguz, B., & Goyal, N. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 2020 Conference on Neural Information Processing Systems (NeurIPS), 1–13. https://arxiv.org/abs/2005.11401
  27. Papernot, N., Shokri, R., & Song, L. (2021). Privacy-preserving machine learning: Threats and mitigation strategies. Proceedings of the IEEE International Conference on Data Mining (ICDM), 249–256. https://doi.org/10.1109/ICDM.2021.00040
  28. Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. https://doi.org/10.1038/s42256-019-0088-2
  29. Topol, E. J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56. https://doi.org/10.1038/s41591-018-0300-7

