Authors :
Dr. N. Leelavathy; Sudipta Akash; Md. Sohail Ansari; Rayudu Navya Sri
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/45b4pvpa
Scribd :
https://tinyurl.com/bdpspywn
DOI :
https://doi.org/10.38124/ijisrt/26apr358
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
Multi-camera target tracking remains a fundamental challenge in intelligent video surveillance, particularly when maintaining consistent identity associations across multiple, often spatially disjoint, camera views with overlapping or non-overlapping fields of view. This paper introduces VisionNet, a unified deep learning framework that integrates state-of-the-art detection, single-camera tracking, and cross-camera re-identification into a cohesive and computationally efficient pipeline. VisionNet employs YOLOv8 for high-accuracy object detection, Deep Simple Online and Realtime Tracking (DeepSORT) for intra-camera trajectory persistence, and the Omni-Scale Network (OSNet) for discriminative cross-camera identity association. The framework leverages GPU-accelerated inference to sustain real-time throughput while preserving tracking fidelity under challenging surveillance conditions such as occlusion, lighting variation, and crowded scenes. A Flask-based monitoring dashboard provides operators with live visualization of tracked targets, inter-camera handoff events, and aggregate behavioral analytics. Extensive evaluation on benchmark datasets including MOT17, DukeMTMC-reID, and Market-1501 demonstrates that VisionNet achieves an IDF1 score of 76.4%, a MOTA of 73.9%, and a Rank-1 re-identification accuracy of 96.2%, outperforming competing baselines on identity-switch minimization. The modular architecture facilitates seamless integration with existing closed-circuit television (CCTV) infrastructure and supports diverse application domains, including intelligent security systems, crowd flow analytics, retail behavior monitoring, and multi-athlete sports event tracking.
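The paper does not publish its matching code, but the cross-camera association step described in the abstract — comparing OSNet appearance embeddings of a newly detected person against a gallery of known identities — can be sketched as a cosine-similarity lookup. The function names, gallery layout, and 0.6 threshold below are illustrative assumptions, not details taken from the paper:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two appearance-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def associate_identity(query_emb, gallery, threshold=0.6):
    """Match a query embedding against a gallery of (track_id, embedding)
    pairs from other cameras. Returns the best-matching track ID, or None
    when no gallery entry clears the similarity threshold (i.e. the
    detection should open a new identity)."""
    best_id, best_sim = None, threshold
    for track_id, emb in gallery:
        sim = cosine_similarity(query_emb, emb)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id
```

In a full pipeline the embeddings would come from an OSNet forward pass on each DeepSORT track crop, and a production system would typically average embeddings over a track's lifetime before matching.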
Keywords :
Multi-Camera Tracking, Person Re-Identification, YOLOv8, DeepSORT, OSNet, Video Surveillance, Deep Learning.
References :
- G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” GitHub repository, 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
- C. Luo, X. Zhao, W. Nie, and Y. Cui, “Graph Neural Network-Based Multi-Camera Multi-Target Tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 5, pp. 2302–2315, May 2023.
- A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Phoenix, AZ, USA, 2016, pp. 3464–3468.
- N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Beijing, China, 2017, pp. 3645–3649.
- K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Seoul, Korea, 2019, pp. 3702–3712.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
- Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “FairMOT: On the fairness of detection and re-identification in multiple object tracking,” Int. J. Comput. Vis., vol. 129, no. 11, pp. 3069–3087, Nov. 2021.
- Z. Zhang, W. Cheng, X. Hou, L. Qi, Y. Lu, J. Shi, and C. Change Loy, “ByteTrack: Multi-object tracking by associating every detection box,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Tel Aviv, Israel, 2022, pp. 1–18.
- A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” arXiv:1603.00831, 2016.
- E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Proc. Eur. Conf. Comput. Vis. Workshops (ECCVW), Amsterdam, The Netherlands, 2016, pp. 17–35.
- Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Munich, Germany, 2018, pp. 501–518.
- G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in Proc. ACM Int. Conf. Multimedia (ACM MM), Seoul, Korea, 2018, pp. 274–282.
- L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, 2015, pp. 1116–1124.
- W. Li, R. Zhao, T. Xiao, and X. Wang, “DeepReID: Deep filter pairing neural network for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Columbus, OH, USA, 2014, pp. 152–159.