A Hybrid Deep Learning Approach for Video Object Detection


Authors : Priyanka Panchal; Dr. Dinesh J. Prajapati

Volume/Issue : Volume 10 - 2025, Issue 1 - January


Google Scholar : https://tinyurl.com/3chjr8ty

Scribd : https://tinyurl.com/5d44shxb

DOI : https://doi.org/10.5281/zenodo.14637077


Abstract : The rapid growth of video data across many domains has increased the demand for effective and efficient methods to analyze videos and extract useful information from them. Deep learning methods have demonstrated exceptional performance in object detection, but that performance relies heavily on large-scale labeled datasets. This study proposes a novel model for video object detection that combines deep learning and transfer learning. The model uses a CNN to learn spatio-temporal features from video frames. To address the scarcity of labeled video data, transfer learning is employed: a pre-trained CNN, such as ResNet50, is fine-tuned on the UCF101, Sports1M and YouTube8M video datasets. Transfer learning enables the model to learn generalizable features from these rich datasets, enhancing its ability to detect objects in unseen videos. Furthermore, the proposed model incorporates temporal information by employing LSTM and 3D convolutional networks to capture motion dynamics across consecutive frames. Fusing spatial and temporal features enhances the robustness and accuracy of object detection. The proposed model is evaluated extensively on the UCF101, Sports1M and YouTube8M datasets. The results show that the model effectively localizes and classifies objects in video sequences, outperforming existing cutting-edge methods. Overall, this research provides a promising approach to video object detection, showcasing the potential of deep learning and transfer learning to tackle the challenge of limited labeled video data and to exploit spatio-temporal context for improved detection performance.

Keywords : Video Object Detection; Deep Learning; Convolutional Neural Networks; Spatial-Temporal Feature; LSTM.
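
The spatial-temporal fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each frame has already been reduced to a feature vector (standing in for ResNet50 output), uses a single LSTM cell with random, untrained weights, omits the 3D convolutional branch, and fuses by simple concatenation, which the abstract does not specify. The names TinyLSTMCell and fuse_spatial_temporal are hypothetical.

```python
import math
import random

GATES = ("input", "forget", "output", "cell")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTMCell:
    """Single LSTM cell with random (untrained) weights; illustration only."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        width = input_size + hidden_size  # each gate sees [x, h] concatenated
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_size)]
            for gate in GATES
        }

    def step(self, x, h, c):
        """One time step: returns the new hidden state and cell state."""
        xh = x + h
        pre = {gate: matvec(self.weights[gate], xh) for gate in GATES}
        i_gate = [sigmoid(v) for v in pre["input"]]
        f_gate = [sigmoid(v) for v in pre["forget"]]
        o_gate = [sigmoid(v) for v in pre["output"]]
        candidate = [math.tanh(v) for v in pre["cell"]]
        c_new = [f * ck + i * g
                 for f, ck, i, g in zip(f_gate, c, i_gate, candidate)]
        h_new = [o * math.tanh(ck) for o, ck in zip(o_gate, c_new)]
        return h_new, c_new


def fuse_spatial_temporal(frame_features, cell):
    """Concatenate mean-pooled per-frame features with the final LSTM state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in frame_features:  # temporal branch: run the LSTM over the frames
        h, c = cell.step(x, h, c)
    n = len(frame_features)
    spatial = [sum(column) / n for column in zip(*frame_features)]
    return spatial + h  # fusion by concatenation (an assumption)


# Hypothetical clip: 5 frames, each an 8-dimensional feature vector.
rng = random.Random(42)
clip = [[rng.random() for _ in range(8)] for _ in range(5)]
cell = TinyLSTMCell(input_size=8, hidden_size=4)
fused = fuse_spatial_temporal(clip, cell)
print(len(fused))  # 12 = 8 spatial + 4 temporal
```

In a real system the fused vector would feed a detection head that predicts boxes and classes. Here it only shows how a per-frame (spatial) summary and a sequence-level (temporal) summary can be combined into one representation.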


