International Journal of Innovative Science and Research Technology

Real-Time Streaming of Call Detail Records to HDFS: An End-to-End Big Data Pipeline Using Kafka Connect, Apache Airflow, and Apache Spark

Authors : Germain Uwiringiyedata

Volume/Issue : Volume 10 - 2025, Issue 9 - September

Google Scholar : https://tinyurl.com/5n7ew8er

Scribd : https://tinyurl.com/yrjbv4ph

DOI : https://doi.org/10.38124/ijisrt/25sep1309


Abstract : The rapid expansion of telecommunications services produces enormous quantities of Call Detail Records (CDRs), which must be ingested, stored, and analyzed in real time to support billing, fraud detection, and network optimization. CDRs are generated as high-volume event streams that require low-latency ingestion, durable storage, and dependable analytics. This paper presents an end-to-end, containerized big data pipeline that integrates Apache Kafka, Kafka Connect, Hadoop Distributed File System (HDFS), PySpark, and Apache Airflow within a reproducible Docker environment. Unlike conventional batch-oriented approaches, the proposed architecture demonstrates low-latency ingestion, fault-tolerant storage, and scalable processing of high-throughput CDR streams. Experimental results show zero delivery loss at 25 records per second (RPS), balanced partition throughput, and immediate analytical readiness, with roaming traffic analysis and cell-level usage statistics produced in seconds. The work contributes a practical reference model for telecom streaming pipelines, highlighting the advantages of containerized deployment, automated orchestration, and reproducible analytics, and it outlines directions for scaling and production integration.

Keywords : Kafka; Kafka Connect; HDFS; PySpark; Airflow; Docker; Streaming; CDR.
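
The abstract mentions roaming traffic analysis and cell-level usage statistics as the analytical outputs of the pipeline. The sketch below illustrates the kind of aggregation this implies. It is not the paper's implementation: the paper runs its analytics in PySpark over data landed in HDFS, whereas this example uses only the Python standard library on synthetic in-memory records. The record fields (`subscriber_id`, `cell_id`, `duration_s`, `is_roaming`), the cell names, and the 20% roaming rate are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_cdrs(n, seed=0):
    """Generate n synthetic CDRs. Every field name and value is hypothetical."""
    rng = random.Random(seed)
    cells = ["CELL-A", "CELL-B", "CELL-C"]  # assumed cell identifiers
    return [
        {
            "subscriber_id": f"SUB-{rng.randrange(10**6):06d}",
            "cell_id": rng.choice(cells),
            "duration_s": rng.randint(1, 600),
            "is_roaming": rng.random() < 0.2,  # assumed roaming rate
        }
        for _ in range(n)
    ]


def roaming_share(cdrs):
    """Return the fraction of records flagged as roaming (0.0 for no records)."""
    if not cdrs:
        return 0.0
    return sum(1 for r in cdrs if r["is_roaming"]) / len(cdrs)


def cell_usage(cdrs):
    """Return per-cell call count and total call duration in seconds."""
    stats = defaultdict(lambda: {"calls": 0, "total_duration_s": 0})
    for r in cdrs:
        entry = stats[r["cell_id"]]
        entry["calls"] += 1
        entry["total_duration_s"] += r["duration_s"]
    return dict(stats)


if __name__ == "__main__":
    # 250 records corresponds to ten seconds of traffic at 25 records per second.
    batch = make_cdrs(250)
    print(f"roaming share: {roaming_share(batch):.2%}")
    for cell, s in sorted(cell_usage(batch).items()):
        print(cell, s)
```

In a PySpark setting the same two results would typically come from a filter-and-count and a `groupBy` on the cell column; the plain-Python version is used here only so the logic can be read and run without a cluster.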


