An Efficient Transformer-Based System for Text-Based Video Segment Retrieval Using FAISS


Author : Sai Vivek Reddy Gurram

Volume/Issue : Volume 9 - 2024, Issue 9 - September


Google Scholar : https://tinyurl.com/3mhxraz4

Scribd : https://tinyurl.com/mr452k9v

DOI : https://doi.org/10.38124/ijisrt/IJISRT24SEP1105



Abstract : An efficient system for text-based video segment retrieval is presented, leveraging transformer-based embeddings and the FAISS library for similarity search. The system enables users to perform real-time, scalable searches over video datasets by converting video segments into combined text and image embeddings. Key components include video segmentation, speech-to-text transcription using Wav2Vec 2.0, frame extraction, embedding generation using Vision Transformers and Sentence Transformers, and efficient similarity search using FAISS. Experimental results demonstrate the system's applicability in media archives, education, and content discovery, even when applied to a small dataset.
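The retrieval step the abstract describes can be illustrated with a minimal sketch: each video segment's text embedding and image embedding are concatenated into one vector, and a query is matched by L2 nearest-neighbour search. The dimensions, function names, and toy data below are hypothetical; in the paper's system the brute-force search shown here would be handled by a FAISS index (e.g. IndexFlatL2) for scalability.

```python
import numpy as np

# Toy dimensions for illustration only; the real system would use e.g.
# 384-d Sentence Transformer vectors and 768-d Vision Transformer vectors.
TEXT_DIM, IMAGE_DIM = 4, 4

def combine(text_emb, image_emb):
    """Concatenate a segment's text and image embeddings into one vector."""
    return np.concatenate([text_emb, image_emb])

# Hypothetical "database" of three video segments with random embeddings.
rng = np.random.default_rng(0)
db = np.stack([
    combine(rng.standard_normal(TEXT_DIM), rng.standard_normal(IMAGE_DIM))
    for _ in range(3)
]).astype("float32")

def search(query, db, k=1):
    """Brute-force L2 nearest neighbour; FAISS IndexFlatL2 performs the
    same search efficiently over large collections."""
    dists = np.linalg.norm(db - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# A query vector close to segment 1 should retrieve segment 1.
query = db[1] + 0.01
idx, dist = search(query, db)
print(idx[0])  # → 1
```

With FAISS, `db` would be added via `index.add(db)` and queried with `index.search(query.reshape(1, -1), k)`, returning the same nearest-neighbour indices.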

References :

  1. C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402, Singapore, 2005.
  2. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, MN, 2019.
  3. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  4. Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  5. Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
  6. F. Zulko. MoviePy: Video editing with Python, 2015. Zenodo.
  7. Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, pages 18–25, Austin, TX, 2015.
  8. Clark. Pillow (PIL Fork) Documentation, 2015. Python Imaging Library (PIL).
  9. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982–3992, Hong Kong, China, 2019.

