Authors :
Kishan Raj Bellala
Volume/Issue :
Volume 10 - 2025, Issue 9 - September
Google Scholar :
https://tinyurl.com/27zxaphd
Scribd :
https://tinyurl.com/4terns6x
DOI :
https://doi.org/10.38124/ijisrt/25sep1016
Abstract :
The growing adoption of Kubernetes has brought substantial operational complexity: manual management
of its many dynamic components (pods, nodes, networks) is slow, error-prone, and unsustainable at scale. This research
investigates how AIOps (Artificial Intelligence for IT Operations) principles can move beyond Kubernetes' native automation
to establish fully autonomous cluster management. The proposed framework uses machine learning to detect anomalies, identify
root causes, and predict scaling needs before executing automated remediation steps. Our methodology demonstrates that AIOps
can enhance system reliability, reduce operational toil, and optimize resource efficiency through closed-loop
observation-action cycles, leading to self-healing Kubernetes ecosystems that require minimal human intervention.
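As a rough illustration of one closed-loop observation-action cycle, the sketch below chains anomaly detection, a load forecast, and a remediation decision. It is not the paper's framework. The z-score detector, the linear extrapolation, the action names (`restart_pod`, `scale`), and every function and parameter name are assumptions chosen for brevity. No Kubernetes API is called; the function only returns the action a controller might take.

```python
from math import ceil
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple z-score test, assumed here)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > threshold


def forecast_next(history):
    """Naive linear extrapolation: last value plus the average step."""
    if len(history) < 2:
        return history[-1]
    avg_step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + avg_step


def desired_replicas(predicted_load, capacity_per_replica, min_replicas=1):
    """Replica count needed to serve the predicted load."""
    return max(min_replicas, ceil(predicted_load / capacity_per_replica))


def control_step(history, latest, capacity_per_replica):
    """One observe -> analyze -> act pass over a load metric.

    Returns a hypothetical (action, argument) pair instead of touching a cluster.
    """
    if is_anomalous(history, latest):
        return ("restart_pod", None)
    predicted = forecast_next(history + [latest])
    return ("scale", desired_replicas(predicted, capacity_per_replica))


if __name__ == "__main__":
    # Steady growth in requests per second: no anomaly, so scale ahead of demand.
    print(control_step([100, 110, 120, 130], 140, capacity_per_replica=50))
    # Sudden spike far outside the recent range: treated as an anomaly.
    print(control_step([100, 101, 99, 100], 500, capacity_per_replica=50))
```

In a real deployment the metric history would come from a monitoring system and the returned action would be handed to a controller. Both are left out here because the abstract does not specify them.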
Keywords :
Kubernetes, AIOps (Artificial Intelligence for IT Operations), Autonomous Operations, Self-Healing Systems, Anomaly Detection, Root Cause Analysis (RCA), Predictive Scaling, Automated Remediation, Operational Complexity, Machine Learning for IT Operations, Container Orchestration, Site Reliability Engineering (SRE).