Authors :
Sai Vivek Reddy Gurram
Volume/Issue :
Volume 9 - 2024, Issue 9 - September
Google Scholar :
https://tinyurl.com/3mhxraz4
Scribd :
https://tinyurl.com/mr452k9v
DOI :
https://doi.org/10.38124/ijisrt/IJISRT24SEP1105
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
An efficient system for text-based video segment retrieval is presented, leveraging transformer-based embeddings and the FAISS library for similarity search. The system enables users to perform real-time, scalable searches over video datasets by converting video segments into combined text and image embeddings. Key components include video segmentation, speech-to-text transcription using Wav2Vec 2.0, frame extraction, embedding generation using Vision Transformers and Sentence Transformers, and efficient similarity search using FAISS. Experimental results demonstrate the system's applicability in media archives, education, and content discovery, even when applied to a small dataset.
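The pipeline summarized in the abstract — fusing per-segment text and frame embeddings, then ranking segments by similarity to a query — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fusion rule (normalized concatenation) is an assumption, toy random vectors stand in for the Wav2Vec 2.0 / ViT / Sentence Transformer outputs, and the exact inner-product search written in NumPy mirrors what a FAISS `IndexFlatIP` would do at scale.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def combine_embeddings(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Fuse per-segment text and frame embeddings by normalized concatenation
    (an illustrative fusion choice, not necessarily the paper's)."""
    fused = np.concatenate([l2_normalize(text_emb), l2_normalize(image_emb)], axis=1)
    return l2_normalize(fused)

def search(index_vectors: np.ndarray, query: np.ndarray, k: int = 3):
    """Exact inner-product search over all segments — the behaviour a FAISS
    IndexFlatIP provides efficiently for large collections."""
    scores = index_vectors @ query
    top = np.argsort(-scores)[:k]
    return top.tolist(), scores[top].tolist()

# Toy corpus: 4 video segments, each with a 5-d text and a 5-d image embedding.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 5))
image_emb = rng.normal(size=(4, 5))
segments = combine_embeddings(text_emb, image_emb)   # shape (4, 10)

# Querying with segment 2's own fused embedding should rank segment 2 first.
ids, scores = search(segments, segments[2], k=2)
```

In practice, the query side would embed the user's text with the same Sentence Transformer used for transcripts, and the NumPy scan would be replaced by a FAISS index so search stays real-time as the corpus grows.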
References :
- C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402, Singapore, 2005.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, MN, 2019.
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
- Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- F. Zulko. MoviePy: Video editing with Python, 2015. Zenodo.
- Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, pages 18–25, Austin, TX, 2015.
- Clark. Pillow (PIL Fork) Documentation, 2015. Python Imaging Library (PIL).
- Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982–3992, Hong Kong, China, 2019.