Video Quality Assessment (VQA) using Vision Transformers


Authors : Kallam Lalithendar Reddy; Pogaku Sahnaya; Vattikuti Hareen Sai; Gummuluri Venkata Keerthana

Volume/Issue : Volume 9 - 2024, Issue 1 - January

Google Scholar : http://tinyurl.com/3v39ex9u

Scribd : http://tinyurl.com/3pj8uvsv

DOI : https://doi.org/10.5281/zenodo.10526231

Abstract : In this paper, we examine the potential of Vision Transformers in the field of Video Quality Assessment (VQA). Vision Transformers (ViT) bring to computer vision the transformer architecture that has proven effective in Natural Language Processing (NLP) tasks. They model the relationships among the input tokens: in NLP the tokens are words, whereas in computer vision they are image patches, and the model captures the connections between different portions of the image. A ViT-B/16 model pre-trained on ImageNet-1k was used to extract features from the videos, which were then validated against the Mean Opinion Scores (MOS) of the videos. The patch embeddings are combined with positional embeddings and sent to the transformer encoder. The ViT-Base transformer encoder has 12 layers in total; each layer consists of a Layer Norm, Multi-Head Attention, and another Layer Norm followed by a Multi-Layer Perceptron (MLP) block. The classifier head of the transformer was removed to obtain a feature vector, since our aim is not classification. Once the features are obtained, a Support Vector Regressor (SVR) with a Radial Basis Function (RBF) kernel is used to assess the video quality.

Keywords : KoNViD-1k Dataset, Vision Transformer, Support Vector Regressor, Attention, Token Embeddings.
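
The feature-extraction and regression pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact implementation: it uses the torchvision ViT-B/16 weights pre-trained on ImageNet-1k with the classifier head replaced by an identity mapping to obtain 768-dimensional features, and scikit-learn's SVR with an RBF kernel. The frame-sampling strategy and the mean-pooling of per-frame features into a single video-level descriptor are assumptions, since the abstract does not specify how frame features are aggregated.

import numpy as np
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

weights = ViT_B_16_Weights.IMAGENET1K_V1
preprocess = weights.transforms()          # resize, crop, and normalize frames for ViT-B/16

model = vit_b_16(weights=weights)
model.heads = torch.nn.Identity()          # remove the classifier head -> 768-d CLS-token feature
model.eval()

@torch.no_grad()
def video_feature(frames):
    """frames: list of PIL images sampled from one video (sampling rate is an assumption)."""
    batch = torch.stack([preprocess(f) for f in frames])   # (T, 3, 224, 224)
    feats = model(batch)                                    # (T, 768) per-frame features
    return feats.mean(dim=0).numpy()                        # mean-pool over frames (assumption)

# Suppose `videos` is a list of frame lists and `mos` the corresponding MOS labels
# (e.g. from KoNViD-1k). Build the feature matrix and fit the regressor:
# X = np.stack([video_feature(v) for v in videos])
# y = np.asarray(mos)
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# svr = SVR(kernel="rbf", C=1.0, gamma="scale")             # RBF-kernel SVR on ViT features
# svr.fit(X_tr, y_tr)
# predicted_mos = svr.predict(X_te)

The choice of C and gamma for the SVR, like the pooling scheme, is illustrative; in practice these would be tuned by cross-validation against the MOS labels.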

