Authors :
Sonia Singh B; Shubhaprada K P
Volume/Issue :
Volume 8 - 2023, Issue 8 - August
Google Scholar :
https://bit.ly/3TmGbDi
Scribd :
https://tinyurl.com/4kehztc6
DOI :
https://doi.org/10.5281/zenodo.8327791
Abstract :
The project proposes an end-to-end deep learning architecture for word-level visual speech recognition that does not require explicit word boundary information. The methodology combines spatiotemporal convolutional layers, Residual Networks (ResNets), and bidirectional Long Short-Term Memory (Bi-LSTM) networks. The system is trained with the Connectionist Temporal Classification (CTC) loss function, and the data are preprocessed with facial landmark extraction, image cropping, resizing, grayscale conversion, and data augmentation to focus on the mouth region. The model is implemented in TensorFlow and trained with an adaptive learning rate schedule. With this approach, the proposed system performs end-to-end lip reading directly from video frames and implicitly identifies keywords in utterances. Analysis of the CTC loss confirms the model's effectiveness. The results suggest potential applications in dictation, hearing aids, and biometric authentication, advancing visual speech recognition beyond traditional methods. In summary, the project presents an innovative deep learning architecture for word-level visual speech recognition that surpasses traditional methods and enables practical applications.
Keywords :
Recurrent Neural Network, Long Short-Term Memory, Graphics Processing Unit, Solid State Drive, Text-to-Speech, Application Programming Interface, Audio-Visual, Lip Reading, Bidirectional Long Short-Term Memory, Graphical User Interface, Red Green Blue, Mean Squared Error, Mean Absolute Error, Adaptive Moment Estimation.
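The preprocessing described in the abstract (facial landmark extraction, mouth cropping, resizing, and grayscale conversion) could be sketched roughly as follows. This is a minimal illustration rather than the paper's implementation: the dlib 68-point landmark model file, the 96x96 crop size, and the 10-pixel margin are assumptions made for the example.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def mouth_crop(frame, size=96, margin=10):
    """Return a grayscale, resized crop of the mouth region, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    landmarks = predictor(gray, faces[0])
    # In the 68-point landmark model, points 48-67 outline the mouth.
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    crop = gray[max(y - margin, 0):y + h + margin, max(x - margin, 0):x + w + margin]
    return cv2.resize(crop, (size, size))

def video_to_mouth_frames(path):
    """Read a video and stack the per-frame mouth crops into a (T, size, size) array."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        crop = mouth_crop(frame)
        if crop is not None:
            frames.append(crop.astype(np.float32) / 255.0)  # simple intensity normalization
    cap.release()
    return np.stack(frames) if frames else np.empty((0, 96, 96), dtype=np.float32)
```

Data augmentation (for example, horizontal flips or small crop jitter) would typically be applied on top of these mouth crops during training.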
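Likewise, the architecture and training setup named in the abstract (a spatiotemporal convolutional front-end, ResNet-style blocks, Bi-LSTM layers, CTC loss, and an adaptive learning rate schedule) might look roughly like the TensorFlow/Keras sketch below. All layer sizes, the vocabulary size, the frame count, and the exponential-decay schedule are illustrative assumptions, not values reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 28       # assumed output vocabulary; one extra class is reserved for the CTC blank
T, H, W = 75, 96, 96   # assumed frames per clip and mouth-crop size

def residual_block(x, filters):
    """ResNet-style 2D block applied independently to every frame."""
    shortcut = x
    x = layers.TimeDistributed(layers.Conv2D(filters, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.Conv2D(filters, 3, padding="same"))(x)
    if shortcut.shape[-1] != filters:
        shortcut = layers.TimeDistributed(layers.Conv2D(filters, 1, padding="same"))(shortcut)
    return layers.Activation("relu")(layers.Add()([x, shortcut]))

inputs = layers.Input(shape=(T, H, W, 1))                       # grayscale mouth crops
x = layers.Conv3D(32, (5, 7, 7), strides=(1, 2, 2),
                  padding="same", activation="relu")(inputs)    # spatiotemporal front-end
x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
x = residual_block(x, 64)
x = residual_block(x, 128)
x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)  # -> (batch, T, features)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
logits = layers.Dense(NUM_CLASSES + 1)(x)                       # last index is the CTC blank
model = Model(inputs, logits)

def ctc_loss(labels, logits):
    """CTC loss over the full output sequence; labels are assumed to be padded with -1."""
    labels = tf.cast(labels, tf.int32)
    batch = tf.shape(logits)[0]
    logit_length = tf.fill([batch], tf.shape(logits)[1])
    label_length = tf.reduce_sum(tf.cast(labels >= 0, tf.int32), axis=-1)
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=tf.maximum(labels, 0), logits=logits,
        label_length=label_length, logit_length=logit_length,
        logits_time_major=False, blank_index=-1))

# Adaptive learning rate: an exponentially decaying schedule fed to Adam (illustrative values).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(1e-4, decay_steps=10000, decay_rate=0.95)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss=ctc_loss)
```

Because CTC marginalizes over alignments, a model like this can be trained on whole utterances without explicit word boundary annotations, which is the property the abstract emphasizes.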