Authors :
Germain Uwiringiyedata
Volume/Issue :
Volume 10 - 2025, Issue 9 - September
Google Scholar :
https://tinyurl.com/5n7ew8er
Scribd :
https://tinyurl.com/yrjbv4ph
DOI :
https://doi.org/10.38124/ijisrt/25sep1309
Abstract :
The rapid expansion of telecommunications services produces enormous quantities of Call Detail Records
(CDRs), which require real-time ingestion, storage, and analysis to support billing operations, fraud detection
systems, and network optimization. CDRs are generated as high-volume event streams that require low-latency
ingestion, durable storage, and dependable analytics. This paper presents an end-to-end, containerized big data
pipeline that integrates Apache Kafka, Kafka Connect, Hadoop Distributed File System (HDFS), PySpark, and
Apache Airflow within a reproducible Docker environment. Unlike conventional batch-oriented approaches, the
proposed architecture demonstrates low-latency ingestion, fault-tolerant storage, and scalable processing of
high-throughput CDR streams. Experimental results show zero delivery loss at 25 records per second (RPS),
balanced throughput across partitions, and immediate analytical readiness, with roaming traffic analysis and
cell-level usage statistics produced in seconds. The work contributes a practical reference model for telecom
streaming pipelines, highlighting the advantages of containerized deployment, automated orchestration, and
reproducible analytics, and it outlines directions for scaling and production integration.
Keywords :
Kafka; Kafka Connect; HDFS; PySpark; Airflow; Docker; Streaming; CDR.
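The checks named in the abstract (delivery loss, partition balance, roaming traffic, and cell-level usage) can be illustrated with a small in-memory sketch. It is not the paper's implementation and does not touch Kafka, HDFS, or PySpark. The record fields (`call_id`, `caller`, `cell_id`, `duration_s`, `roaming`), the partition count of 3, the 10% roaming share, and the CRC32 key hash are all assumptions made for illustration. Only the 25 RPS rate comes from the abstract.

```python
import random
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 3  # assumed; the abstract does not state a partition count


def make_cdr(i, rng):
    """Build one synthetic CDR. All field names are hypothetical."""
    return {
        "call_id": i,
        "caller": f"2507{rng.randrange(10**8):08d}",
        "cell_id": f"CELL-{rng.randrange(5):02d}",
        "duration_s": rng.randrange(1, 600),
        "roaming": rng.random() < 0.1,  # assumed roaming share
    }


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stable key-hash partitioning (CRC32 here, not Kafka's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def summarize(records):
    """Return per-partition counts, roaming call count, and per-cell usage."""
    per_partition = Counter(partition_for(r["caller"]) for r in records)
    roaming_calls = sum(1 for r in records if r["roaming"])
    cell_usage = defaultdict(lambda: {"calls": 0, "total_duration_s": 0})
    for r in records:
        stats = cell_usage[r["cell_id"]]
        stats["calls"] += 1
        stats["total_duration_s"] += r["duration_s"]
    return per_partition, roaming_calls, dict(cell_usage)


rng = random.Random(42)  # fixed seed so the run is repeatable
# 25 records per second over 60 simulated seconds
produced = [make_cdr(i, rng) for i in range(25 * 60)]
per_partition, roaming_calls, cell_usage = summarize(produced)

# Delivery check: every produced record is counted in exactly one partition.
delivered = sum(per_partition.values())
print("produced:", len(produced), "delivered:", delivered)
print("per-partition counts:", dict(per_partition))
print("roaming calls:", roaming_calls)
for cell, stats in sorted(cell_usage.items()):
    print(cell, stats)
```

In a real deployment the produced and delivered counts would presumably be compared between the producer and the files written to HDFS, and the aggregations would run as PySpark jobs; the sketch only shows the shape of those comparisons.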