Real-Time Streaming of Call Detail Records to HDFS: An End-to-End Big Data Pipeline Using Kafka Connect, Apache Airflow, and Apache Spark


Authors : Germain Uwiringiyedata

Volume/Issue : Volume 10 - 2025, Issue 9 - September


Google Scholar : https://tinyurl.com/5n7ew8er

Scribd : https://tinyurl.com/yrjbv4ph

DOI : https://doi.org/10.38124/ijisrt/25sep1309


Abstract : The rapid expansion of telecommunications services produces enormous quantities of Call Detail Records (CDRs), requiring real-time ingestion, storage, and analysis to support billing operations, fraud detection systems, and network optimization. CDRs are generated as high-volume event streams that require low-latency ingestion, durable storage, and dependable analytics. This paper presents an end-to-end, containerized big data pipeline that integrates Apache Kafka, Kafka Connect, Hadoop Distributed File System (HDFS), PySpark, and Apache Airflow within a reproducible Docker environment. Unlike conventional batch-oriented approaches, the proposed architecture demonstrates low-latency ingestion, fault-tolerant storage, and scalable processing of high-throughput CDR streams. Experimental results show zero delivery loss at 25 records per second (RPS), balanced partition throughput, and immediate analytical readiness, with roaming traffic analysis and cell-level usage statistics produced in seconds. The work contributes a practical reference model for telecom streaming pipelines, highlighting the advantages of containerized deployment, automated orchestration, and reproducible analytics, and it outlines directions for scaling and production integration.

Keywords : Kafka; Kafka Connect; HDFS; PySpark; Airflow; Docker; Streaming; CDR.
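
As a rough illustration of the kind of workload the abstract describes, the following minimal sketch generates synthetic CDRs, assigns them to partitions by key, and computes a roaming count and per-cell usage totals. It uses only the Python standard library and does not touch Kafka, HDFS, or Spark. The record fields (`call_id`, `caller`, `cell_id`, `duration_s`, `is_roaming`), the partition count of 3, the 20% roaming share, and the CRC32 hash are all assumptions made for illustration; none of them is taken from the paper.

```python
import random
import zlib
from collections import Counter, defaultdict

NUM_PARTITIONS = 3  # assumed partition count, not stated in the paper


def make_cdr(rng, seq):
    """Build one synthetic CDR; every field name here is hypothetical."""
    return {
        "call_id": seq,
        "caller": f"2507{rng.randrange(10**8):08d}",
        "cell_id": f"CELL-{rng.randrange(1, 6):02d}",
        "duration_s": rng.randrange(1, 600),
        "is_roaming": rng.random() < 0.2,  # assumed roaming share
    }


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Stable key-hash partitioning (CRC32 here; Kafka's Java client uses murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def summarize(records):
    """Count roaming calls and sum call seconds per cell."""
    roaming = sum(1 for r in records if r["is_roaming"])
    per_cell = defaultdict(int)
    for r in records:
        per_cell[r["cell_id"]] += r["duration_s"]
    return roaming, dict(per_cell)


rng = random.Random(42)  # fixed seed so runs are repeatable
records = [make_cdr(rng, i) for i in range(250)]  # ten seconds of traffic at 25 RPS
per_partition = Counter(partition_for(r["caller"]) for r in records)
roaming_calls, seconds_per_cell = summarize(records)

print("records per partition:", dict(per_partition))
print("roaming calls:", roaming_calls)
print("call seconds per cell:", seconds_per_cell)
```

In a pipeline like the one described, the aggregation step would more plausibly run in PySpark over files landed in HDFS; the plain-Python version above only shows the shape of the computation.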
