Authors :
Priyanka Panchal; Dr. Dinesh J. Prajapati
Volume/Issue :
Volume 10 - 2025, Issue 1 - January
Google Scholar :
https://tinyurl.com/3chjr8ty
Scribd :
https://tinyurl.com/5d44shxb
DOI :
https://doi.org/10.5281/zenodo.14637077
Abstract :
The rapid growth of video data in various
domains has led to an increased demand for effective and
efficient methods to analyze and extract valuable
information from videos. Deep learning methods have
demonstrated exceptional performance in object
detection, but their performance relies heavily on
large-scale labeled datasets. This study proposes a novel model
for object detection from video that combines deep
learning and transfer learning algorithms. The model uses
convolutional neural networks (CNNs) to learn
spatio-temporal features from video frames. To
address the scarcity of labeled video data, transfer learning is
employed: a pre-trained CNN, such
as ResNet50, is fine-tuned on the UCF101, Sports1M and
YouTube8M video datasets. Transfer learning enables the
model to learn generalizable features from these rich
datasets, enhancing its ability to detect objects in unseen
videos. Furthermore, the proposed model incorporates
temporal information by employing LSTM and 3D
convolutional networks to capture the motion dynamics
across consecutive frames. Fusing spatial and temporal
features enhances the robustness and accuracy of object
detection. The proposed model is evaluated extensively
on the UCF101, Sports1M and YouTube8M datasets. The
results show that the proposed model effectively
localizes and classifies objects in video sequences,
outperforming existing state-of-the-art methods. Overall, this
research provides a promising approach for object
detection in video, showcasing the potential of deep learning
and transfer learning algorithms to tackle the
challenge of limited labeled video data and to exploit
spatio-temporal context for improved object detection
performance.
Keywords :
Video Object Detection; Deep Learning; Convolutional Neural Networks; Spatial-Temporal Feature; LSTM.
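The fusion of spatial and temporal features described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the per-frame feature vectors stand in for CNN outputs, the frame-to-frame difference is a crude substitute for the LSTM and 3D convolutional components, and the function names (`spatial_descriptor`, `temporal_descriptor`, `fuse`) are hypothetical. It only shows how a clip-level appearance summary and a motion summary might be concatenated into one fused descriptor.

```python
from typing import List

# ASSUMPTION: each frame is already reduced to a feature vector
# (in the paper this role is played by a fine-tuned CNN such as ResNet50).
Frame = List[float]


def _check(frames: List[Frame]) -> None:
    """Reject empty clips and clips with inconsistent feature sizes."""
    if not frames:
        raise ValueError("clip must contain at least one frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same feature size")


def spatial_descriptor(frames: List[Frame]) -> List[float]:
    """Appearance summary: average each feature over all frames."""
    _check(frames)
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / n for i in range(dim)]


def temporal_descriptor(frames: List[Frame]) -> List[float]:
    """Motion summary: mean absolute change between consecutive frames.

    ASSUMPTION: a simple stand-in for learned temporal models
    (LSTM / 3D convolutions), used here only for illustration.
    """
    _check(frames)
    dim = len(frames[0])
    steps = len(frames) - 1
    if steps == 0:
        return [0.0] * dim  # a single frame carries no motion
    return [
        sum(abs(frames[t + 1][i] - frames[t][i]) for t in range(steps)) / steps
        for i in range(dim)
    ]


def fuse(frames: List[Frame]) -> List[float]:
    """Late fusion by concatenating spatial and temporal descriptors."""
    return spatial_descriptor(frames) + temporal_descriptor(frames)


if __name__ == "__main__":
    # Feature 0 grows steadily (motion); feature 1 is static.
    clip = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
    print(fuse(clip))
```

In a real pipeline the fused vector would feed a detection head; concatenation is only one possible fusion strategy, chosen here because it keeps the two sources of information separable.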