Authors :
Y V Nagesh Meesala; V Sai Surya; P Sai Kiran; R Sree Vardhan
Volume/Issue :
Volume 9 - 2024, Issue 4 - April
Google Scholar :
https://tinyurl.com/47jbk744
Scribd :
https://tinyurl.com/3puss745
DOI :
https://doi.org/10.38124/ijisrt/IJISRT24APR1907
Abstract :
Understanding foreign languages can be challenging for individuals living in India's diverse linguistic landscape. We propose a new application that uses machine translation, together with speech recognition and speech synthesis, to address this issue. It aims to convert online video resources into Indian languages by integrating open-source technologies such as text-to-speech (TTS) and speech-to-text (STT) systems and the FFmpeg library for separating and merging audio and video streams. We used the Whisper model, which accepts audio in up to 60 languages as input and transcribes it into text, segmented into sentences with timestamps. The sentence-level transcription generated by Whisper is then translated into the desired language using Google Cloud translate_v2. Each timestamped segment is then individually converted into audio using the Google Cloud Text-to-Speech service, ensuring that the synthesized speech fits within the duration of its segment. The individual audio segments are then merged to generate the final audio track in the desired language. Finally, this audio is attached to the original video, ensuring audio-video synchronization. The accuracy of the translation was verified by comparing the naturalness of the generated audio with general spoken-language standards. This application benefits visually impaired individuals and those who cannot read text, providing them with a means to acquire knowledge in their native languages.
Keywords :
Open Innovation, Text-to-Speech, Speech-to-Text, Machine Translation.
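A minimal sketch of the pipeline described in the abstract, assuming the openai-whisper, google-cloud-translate (v2), google-cloud-texttospeech, and pydub packages plus an FFmpeg binary on PATH. File names, the target language code, the voice locale, and the speaking-rate heuristic are illustrative assumptions, not the authors' exact implementation; it also assumes Whisper segments are back-to-back, whereas a full implementation would pad inter-segment gaps with silence.

```python
# Sketch: transcribe -> translate per segment -> synthesize per segment -> mux with video.
import subprocess
import whisper
from google.cloud import translate_v2 as translate
from google.cloud import texttospeech
from pydub import AudioSegment

VIDEO = "lecture.mp4"   # illustrative input file
TARGET_LANG = "te"      # e.g. Telugu; any Cloud Translation language code
TTS_LOCALE = "te-IN"    # matching Cloud Text-to-Speech voice locale

# 1. Extract the audio track and transcribe it with Whisper;
#    each segment carries "start"/"end" timestamps and "text".
subprocess.run(["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "audio.wav"], check=True)
segments = whisper.load_model("small").transcribe("audio.wav")["segments"]

translator = translate.Client()
tts = texttospeech.TextToSpeechClient()

dubbed = AudioSegment.silent(duration=0)
for seg in segments:
    # 2. Translate the segment's sentence into the target language.
    text = translator.translate(seg["text"], target_language=TARGET_LANG)["translatedText"]

    # 3. Synthesize the translated sentence; speaking_rate is a simple heuristic
    #    to help the clip fit its timestamp slot, not an exact-fit mechanism.
    response = tts.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code=TTS_LOCALE),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3, speaking_rate=1.1
        ),
    )
    with open("seg.mp3", "wb") as f:
        f.write(response.audio_content)

    # Pad with silence (or trim) so each clip occupies exactly its timestamp slot.
    clip = AudioSegment.from_mp3("seg.mp3")
    slot_ms = int((seg["end"] - seg["start"]) * 1000)
    if len(clip) < slot_ms:
        clip += AudioSegment.silent(duration=slot_ms - len(clip))
    dubbed += clip[:slot_ms]

dubbed.export("dubbed.mp3", format="mp3")

# 4. Attach the dubbed audio to the original video, copying the video stream unchanged.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-i", "dubbed.mp3",
     "-map", "0:v:0", "-map", "1:a:0", "-c:v", "copy", "output.mp4"],
    check=True,
)
```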