Authors :
Olufunke Adebola Akande; Onuh Matthew Ijiga; Otugene Victor Bamigwojo; Agbo James Ogboji
Volume/Issue :
Volume 11 - 2026, Issue 1 - January
Google Scholar :
https://tinyurl.com/3f482x42
Scribd :
https://tinyurl.com/y4db3k39
DOI :
https://doi.org/10.38124/ijisrt/26jan1453
Abstract :
This study examines the risks associated with the deployment of large language models (LLMs) in healthcare,
focusing on memorization, prompt inference errors, and retrieval hazards. LLMs, such as GPT-4, MedPaLM, and fine-
tuned clinical models like ClinicalBERT, are increasingly used in clinical decision support, diagnostic assistance, and
administrative automation. While these models offer significant potential to improve healthcare delivery, they also
present privacy and safety risks. The study investigates how these models memorize sensitive data, generate incorrect or
unsafe responses due to prompt errors, and retrieve irrelevant or confidential information through external knowledge
bases. The findings reveal that GPT-4, a general-purpose model, exhibits higher memorization and inference risks than domain-specific models such as MedPaLM and ClinicalBERT, which show improved performance in healthcare tasks and reduced memorization tendencies. The study also emphasizes the importance of prompt engineering,
the potential hazards of retrieval-augmented generation (RAG) systems, and the necessity of privacy-preserving
techniques. Based on these findings, the paper proposes a set of practical recommendations for safe LLM integration in
healthcare, including data governance practices, prompt validation protocols, and retrieval safeguards. Finally, the study
outlines a framework for risk mitigation and suggests directions for future research, including longitudinal studies on
model drift, cross-institutional validation of risk profiles, and human-in-the-loop interventions for real-world deployment.
The findings provide essential insights for clinicians, AI researchers, and policymakers working to safely deploy AI in
healthcare.
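
To make the memorization risk concrete, the following is a minimal Python sketch of a verbatim-extraction probe in the spirit of Carlini et al. (2021): prompt the model with the prefix of a known sensitive record and test whether its continuation reproduces the withheld suffix. The generate callable is a hypothetical stand-in for whichever LLM API is under audit, and the token thresholds are illustrative choices rather than validated cut-offs.

from typing import Callable

def longest_common_run(a: str, b: str) -> int:
    """Length, in whitespace tokens, of the longest word sequence shared by a and b."""
    ta, tb = a.split(), b.split()
    best = 0
    for i in range(len(ta)):
        for j in range(len(tb)):
            k = 0
            while i + k < len(ta) and j + k < len(tb) and ta[i + k] == tb[j + k]:
                k += 1
            best = max(best, k)
    return best

def memorization_probe(generate: Callable[[str], str], record: str,
                       prefix_tokens: int = 10, leak_threshold: int = 5) -> bool:
    """Flag a potential leak if the model's continuation of a record prefix
    overlaps the withheld suffix by leak_threshold or more consecutive tokens."""
    tokens = record.split()
    prefix = " ".join(tokens[:prefix_tokens])
    suffix = " ".join(tokens[prefix_tokens:])
    return longest_common_run(generate(prefix), suffix) >= leak_threshold

# Toy stand-in "model" that parrots a memorized record, to show the probe firing.
memorized = ("Patient Jane Doe DOB 1984-03-02 was admitted on 2021-06-14 "
             "with acute pancreatitis and discharged after four days")

def parrot(prompt: str) -> str:
    return memorized[len(prompt):].strip()

print(memorization_probe(parrot, memorized))  # True -> potential verbatim leak

Run over a held-out sample of training records, the fraction of probes that fire gives a rough, model-agnostic memorization score for comparing general-purpose and domain-specific models.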
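A similar sketch can illustrate the prompt validation protocols recommended here, under the assumption that prompts are screened before they reach the model: require a minimum of patient context, and route known high-risk phrasings (e.g., dosing questions) to human review. The required fields and trigger terms below are illustrative placeholders, not a clinical standard.

from dataclasses import dataclass, field
from typing import List

REQUIRED_CONTEXT = ["age", "sex", "medication"]        # assumed minimum context
HIGH_RISK_TERMS = ["dose", "dosage", "contraindicat"]  # routed to human review

@dataclass
class ValidationResult:
    ok: bool
    missing: List[str] = field(default_factory=list)
    needs_review: bool = False

def validate_prompt(prompt: str) -> ValidationResult:
    """Reject prompts lacking required context; flag high-risk ones for review."""
    lowered = prompt.lower()
    missing = [f for f in REQUIRED_CONTEXT if f not in lowered]
    review = any(t in lowered for t in HIGH_RISK_TERMS)
    return ValidationResult(ok=not missing, missing=missing, needs_review=review)

print(validate_prompt("What is the right dose of warfarin?"))
# -> ok=False (no patient context), needs_review=True
print(validate_prompt("Patient: age 67, sex F, medication warfarin 5 mg. "
                      "Any interaction with ibuprofen?"))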
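Finally, a minimal sketch of a retrieval safeguard for RAG pipelines: screen passages returned by the retriever for obvious patient identifiers before they are spliced into the prompt. The regex patterns are crude illustrations; a real deployment would rely on a vetted de-identification tool rather than hand-written rules like these.

import re
from typing import List

# Crude patterns for common identifier formats (illustrative, not exhaustive).
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                   # SSN-like
    re.compile(r"\bMRN[:#]?\s*\d{6,}\b", re.I),             # medical record number
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),             # phone number
    re.compile(r"\bDOB[:#]?\s*\d{4}-\d{2}-\d{2}\b", re.I),  # date of birth
]

def is_safe_passage(passage: str) -> bool:
    """Reject any retrieved passage that matches an identifier pattern."""
    return not any(p.search(passage) for p in PHI_PATTERNS)

def safe_context(passages: List[str], max_passages: int = 3) -> List[str]:
    """Keep only screened passages, capped so the prompt stays bounded."""
    return [p for p in passages if is_safe_passage(p)][:max_passages]

retrieved = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "Patient MRN: 0048213, DOB: 1979-11-02, reports dizziness.",  # blocked
    "ADA guidelines recommend individualized HbA1c targets.",
]
for passage in safe_context(retrieved):
    print("->", passage)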
Keywords :
Large Language Models (LLMs), Healthcare AI, Memorization Risk, Prompt Inference Errors, Retrieval Hazards.
References :
- Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pre-trained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3606–3611. https://doi.org/10.18653/v1/D19-1371
- Bertomeu, A., Sánchez, A., & Sànchez, P. (2021). Use of natural language processing in healthcare: Implications for patient communication and data management. Journal of Medical Internet Research, 23(5), e23567. https://doi.org/10.2196/23567
- Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. Journal of the American Medical Association, 320(11), 1099–1101. https://doi.org/10.1001/jama.2018.11100
- Choi, E., Chiu, C. Y., & Norman, H. (2020). Contextualizing large language models for clinical decision support. Proceedings of the 2020 Conference on Natural Language Processing in Healthcare, 27–34. https://doi.org/10.1145/3407995.3408064
- Liu, F., Xu, H., & Chai, W. (2020). The use of electronic health records and large language models for clinical text summarization. International Journal of Medical Informatics, 136, 104077. https://doi.org/10.1016/j.ijmedinf.2020.104077
- Rajkomar, A., Dean, J., & Kohane, I. (2019). Machine learning in medicine. New England Journal of Medicine, 380(14), 1347–1357. https://doi.org/10.1056/NEJMra1814259
- Vaswani, A., Shazeer, N., & Parmar, N. (2017). Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. https://doi.org/10.5555/3295222.3295344
- Carlini, N., Liu, C., & Dai, Z. (2021). Extracting training data from large language models. Proceedings of the 2021 ACM Conference on Computer and Communications Security, 1276–1291. https://doi.org/10.1145/3460120.3484770
- Hendrycks, D., Mazeika, M., & Song, D. (2020). Measuring the robustness of neural networks. Proceedings of the 37th International Conference on Machine Learning, 1613–1623. https://proceedings.mlr.press/v119/hendrycks20a.html
- Shokri, R., Stronati, M., & Song, L. (2017). Membership inference attacks against machine learning models. Proceedings of the 2017 IEEE Symposium on Security and Privacy, 3–18. https://doi.org/10.1109/SP.2017.41
- Zhao, Z., Zhang, Y., & Xu, H. (2020). Safe retrieval-augmented generation with counterfactual reasoning in healthcare applications. Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), 455–464. https://doi.org/10.1109/ICDM50108.2020.00059
- Huang, J., Wang, Y., & Chen, L. (2021). ClinicalGPT: Fine-tuning GPT-3 for automated medical data analysis. Journal of Medical Informatics, 124, 104002. https://doi.org/10.1016/j.jmedinf.2021.104002
- Johnson, A. E., Pollard, T. J., & Shen, L. (2021). The potential and limitations of natural language processing in healthcare applications. Journal of Healthcare Informatics Research, 5(1), 1–16. https://doi.org/10.1007/s41666-021-00089-6
- Khouzani, M. M., Navab, N., & Nia, A. S. (2021). Applications of large language models in healthcare diagnostics. Healthcare Analytics, 3(2), 142–155. https://doi.org/10.1016/j.heal.2021.02.008
- Kovalev, A., Kravchenko, O., & Lee, H. (2020). GPT-3 for diagnostic suggestions: A potential for revolutionizing clinical decision-making. Proceedings of the 2020 International Conference on Medical Data Analysis, 121–130. https://doi.org/10.1109/MDAB2020.9230406
- Lee, J., Yoon, W., & Kim, S. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1038–1048. https://doi.org/10.18653/v1/D19-1147
- Xu, J., Zhang, L., & Ding, H. (2021). Automated administrative support in healthcare with ClinicalGPT. Journal of Health Information Systems, 36(1), 39–45. https://doi.org/10.1016/j.jhis.2020.12.002
- Cohen, J. E., Raji, I. D., & Williams, A. (2021). The threat of algorithmic memorization in healthcare data: A privacy risk. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 235–243. https://doi.org/10.1145/3442188.3445925
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
- Brown, T. B., Mann, B., & Ryder, N. (2020). Language models are few-shot learners. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 1–12. https://arxiv.org/abs/2005.14165
- Gao, L., Song, L., & Xie, J. (2021). Mitigating the risks of inference misuse in AI-based medical decision support systems. Journal of Artificial Intelligence in Medicine, 112, 101082. https://doi.org/10.1016/j.artmed.2021.101082
- Ji, Y., Wei, C., & Zhang, Y. (2021). Hallucination in large language models: A survey of causes, implications, and countermeasures. Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), 29–38. https://doi.org/10.1109/ICDM54110.2021.00015
- Schick, T., & Schütze, H. (2021). Exploiting cloze-questions for few-shot text classification and natural language inference. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1–17. https://arxiv.org/abs/2001.07676
- Wei, C., Schuster, T., & Lee, J. (2022). Chain of thought prompting improves large language models in reasoning tasks. Proceedings of the 2022 Conference on Neural Information Processing Systems (NeurIPS), 1–9. https://arxiv.org/abs/2201.11903
- Karpukhin, V., Min, S., & Lewis, P. (2020). Dense retriever for real-time information retrieval and generation in open-domain question answering. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1047–1056. https://doi.org/10.1145/3397271.3401066
- Lewis, P., Oguz, B., & Goyal, N. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 2020 Conference on Neural Information Processing Systems (NeurIPS), 1–13. https://arxiv.org/abs/2005.11401
- Papernot, N., Shokri, R., & Song, L. (2021). Privacy-preserving machine learning: Threats and mitigation strategies. Proceedings of the IEEE International Conference on Data Mining (ICDM), 249–256. https://doi.org/10.1109/ICDM.2021.00040
- Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. https://doi.org/10.1038/s42256-019-0088-2
- Topol, E. J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56. https://doi.org/10.1038/s41591-018-0300-7