Towards Autonomous Kubernetes: A Framework for AI-Driven Operations (AIOps)


Authors : Kishan Raj Bellala

Volume/Issue : Volume 10 - 2025, Issue 9 - September


Google Scholar : https://tinyurl.com/27zxaphd

Scribd : https://tinyurl.com/4terns6x

DOI : https://doi.org/10.38124/ijisrt/25sep1016



Abstract : The increasing adoption of Kubernetes has introduced substantial operational complexity, because manual management of its many dynamic components (pods, nodes, networks) is slow, error-prone, and unsustainable at scale. This research investigates how AIOps (Artificial Intelligence for IT Operations) principles can go beyond Kubernetes' native automation to enable fully autonomous cluster management. The proposed framework uses machine learning to detect anomalies, identify root causes, and predict scaling needs, and then executes automated remediation steps. Our methodology indicates that, through closed-loop observation-action cycles, AIOps can improve system reliability, reduce operational toil, and optimize resource efficiency, leading to self-healing Kubernetes ecosystems that require minimal human intervention.

Keywords : Kubernetes, AIOps (Artificial Intelligence for IT Operations), Autonomous Operations, Self-Healing Systems, Anomaly Detection, Root Cause Analysis (RCA), Predictive Scaling, Automated Remediation, Operational Complexity, Machine Learning for IT Operations, Container Orchestration, Site Reliability Engineering (SRE).
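The abstract describes a closed observe-detect-act loop but gives no implementation details. The following is a minimal sketch under our own assumptions and is not the author's framework. A rolling z-score stands in for the machine-learning anomaly detector, and a least-squares linear trend stands in for the scaling predictor. The action names ("restart", "scale"), thresholds, and per-pod capacity are hypothetical, and root cause analysis is omitted. The code does not call the Kubernetes API. In a real controller, the returned action would be applied through the API server, and the next observation would close the loop.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Action:
    kind: str      # hypothetical action names: "none", "restart", or "scale"
    replicas: int  # replica count after the action
    reason: str


def zscore_anomaly(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` population std devs from the history mean."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def forecast_next(history):
    """One-step-ahead forecast by fitting a least-squares linear trend to the window."""
    n = len(history)
    if n == 1:
        return float(history[0])
    x_bar, y_bar = (n - 1) / 2, mean(history)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(range(n), history))
    slope = sxy / sxx
    # Evaluate the fitted line at the next time step, x = n.
    return y_bar + slope * (n - x_bar)


def desired_replicas(forecast_load, per_pod_capacity, min_r=1, max_r=20):
    """Replicas needed to serve the forecast load, clamped to [min_r, max_r]."""
    needed = math.ceil(forecast_load / per_pod_capacity)
    return max(min_r, min(max_r, needed))


def decide(error_history, error_now, load_history, current_replicas, per_pod_capacity):
    """One iteration of the loop: detect first, then scale proactively, else do nothing."""
    if zscore_anomaly(error_history, error_now):
        return Action("restart", current_replicas, "error-rate anomaly")
    target = desired_replicas(forecast_next(load_history), per_pod_capacity)
    if target != current_replicas:
        return Action("scale", target, f"forecast requires {target} replicas")
    return Action("none", current_replicas, "healthy")


if __name__ == "__main__":
    errors = [1, 1, 2, 1, 1, 2, 1, 1]
    print(decide(errors, 50, [300] * 4, 3, 100))             # error spike: restart
    print(decide(errors, 1, [100, 200, 300, 400], 3, 100))  # rising load: scale to 5
```

In this sketch, anomaly handling takes priority over scaling, so the loop does not add capacity to a workload that is already failing. The ceiling-based replica calculation follows the spirit of the Kubernetes Horizontal Pod Autoscaler's ceil-based formula. Unlike that formula, it acts on a forecast rather than on the current metric value, which reflects the "predict before acting" idea in the abstract.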


