Authors :
Kallam Lalithendar Reddy; Pogaku Sahnaya; Vattikuti Hareen Sai; Gummuluri Venkata Keerthana
Volume/Issue :
Volume 9 - 2024, Issue 1 - January
Google Scholar :
http://tinyurl.com/3v39ex9u
Scribd :
http://tinyurl.com/3pj8uvsv
DOI :
https://doi.org/10.5281/zenodo.10526231
Abstract :
In this paper, we examine the potential of Vision Transformers for Video Quality Assessment (VQA). Vision Transformers (ViT) bring to computer vision the mechanism that transformers use in Natural Language Processing (NLP) tasks: they model the relationships among the input tokens. In NLP the tokens are words, whereas in computer vision the tokens are image patches, through which the model captures the connections between different portions of an image. A ViT-B/16 model pre-trained on ImageNet-1k was used to extract features from the videos, which were then validated against the videos' MOS scores. The patch embeddings are combined with positional embeddings and passed to the transformer encoder. The ViT-Base transformer encoder has 12 layers in total; each layer consists of a Layer Norm, Multi-Head Attention, and another Layer Norm followed by a Multi-Layer Perceptron (MLP) block. The classifier head of the transformer was removed to obtain a feature vector, since our aim is not classification. Once the features are extracted, a Support Vector Regressor (SVR) with a Radial Basis Function (RBF) kernel is used to assess the video quality.
Keywords :
KoNViD-1k Dataset, Vision Transformer, Support Vector Regressor, Attention, Token Embeddings.
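
The encoder structure described in the abstract (Layer Norm, Multi-Head Attention, then another Layer Norm with an MLP block, stacked 12 times in ViT-Base) can be sketched in PyTorch as below. The dimensions used (768-dim tokens, 12 heads, MLP ratio 4) are the standard ViT-Base configuration and are assumptions for illustration, not values quoted from the paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT-Base encoder block: LayerNorm -> Multi-Head Attention,
    then LayerNorm -> MLP, each wrapped in a residual connection."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        # Pre-norm multi-head self-attention with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Pre-norm MLP block with residual connection
        x = x + self.mlp(self.norm2(x))
        return x

# ViT-Base stacks 12 such blocks over the patch + positional embeddings
encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
tokens = torch.randn(1, 197, 768)   # 196 patch tokens + 1 [CLS] token
features = encoder(tokens)
```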
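A minimal sketch of the feature-extraction step, assuming the torchvision ViT-B/16 weights pre-trained on ImageNet-1k: the classifier head is replaced with an identity module so the forward pass returns the 768-dim feature vector rather than class logits. The frame sampling and the average pooling over frames are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 pre-trained on ImageNet-1k and drop the classifier head
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads = nn.Identity()   # forward now returns 768-dim features
model.eval()

preprocess = weights.transforms()  # resize/crop/normalize expected by the model

@torch.no_grad()
def video_features(frames):
    """frames: list of PIL images sampled from one video.
    Returns a single per-video feature vector by averaging frame features."""
    batch = torch.stack([preprocess(f) for f in frames])
    feats = model(batch)               # (num_frames, 768)
    return feats.mean(dim=0).numpy()   # simple temporal average pooling
```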
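The regression stage can likewise be sketched with scikit-learn's SVR using an RBF kernel, fit on the extracted video features against MOS scores. The placeholder data, the feature scaling step, and the hyperparameters (C, gamma) are assumptions added for illustration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from scipy.stats import spearmanr

# X: per-video ViT features, y: corresponding MOS scores (e.g. from KoNViD-1k)
# Placeholder arrays stand in for the real dataset here.
X_train, y_train = np.random.rand(100, 768), np.random.rand(100) * 4 + 1
X_test, y_test = np.random.rand(20, 768), np.random.rand(20) * 4 + 1

# RBF-kernel Support Vector Regressor on standardized features
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, gamma="scale"))
svr.fit(X_train, y_train)
pred = svr.predict(X_test)

# Rank correlation between predicted and ground-truth quality scores
print("SROCC:", spearmanr(pred, y_test).correlation)
```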