Authors :
K. Jeevan Reddy; Dr. Syed Jahangir Badashah; S. Tharun Govind; M. Dinesh; K. Bindusri
Volume/Issue :
Volume 10 - 2025, Issue 7 - July
Google Scholar :
https://tinyurl.com/467k8nuc
DOI :
https://doi.org/10.38124/ijisrt/25jul358
Abstract :
According to recent studies, feed-forward deep neural networks (DNNs) outperform text-to-speech (TTS)
systems based on decision-tree-clustered, context-dependent hidden Markov models (HMMs) [1, 4]. However, the
feed-forward nature of DNN-based models makes it difficult to incorporate long-span contextual effects into
synthesized utterances. A typical strategy in HMM-based TTS for producing a continuous speech trajectory is to
use dynamic features to constrain the generation of speech parameters [2]. In this study, parametric
text-to-speech synthesis is performed with time-aware memory network cells that capture the co-occurrence or
correlation between any two instants of a spoken utterance. Based on our experiments, a hybrid of a DNN and a
bidirectional long short-term memory recurrent neural network (BLSTM-RNN) is the best-performing system: its
upper hidden layers use a bidirectional LSTM-RNN structure, while its lower layers use a simple feed-forward
structure. On both objective and subjective metrics, this hybrid surpasses both the conventional
decision-tree-clustered HMM system and the DNN-based TTS system. Because the BLSTM-RNN TTS produces very
smooth speech trajectories, dynamic constraints become superfluous.
Keywords :
Bidirectional Long Short-Term Memory (BLSTM), Deep Neural Network, Recurrent Neural Network, Statistical Parametric Speech Synthesis, Hidden Markov Model.
References :
- H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039-1064, 2009.
- K. Tokuda, T. Kobayashi, T. Masuko, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, pp. 1315-1318, 2000.
- H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, pp. 7962-7966, 2013.
- Y. Qian, Y.-C. Fan, W.-P. Hu, and F. K. Soong, "On the training aspects of deep neural network (DNN) for parametric TTS synthesis," in Proc. ICASSP, 2014.
- H. Lu, S. King, and O. Watts, "Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis," in Proc. 8th ISCA Workshop on Speech Synthesis, pp. 281-285, 2013.
- G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
- S. Kang, X. Qian, and H. Meng, "Multi-distribution deep belief network for speech synthesis," in Proc. ICASSP, pp. 8012-8016, 2013.
- Z.-H. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis," in Proc. ICASSP, pp. 7825-7829, 2013.
- A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP, pp. 6645-6649, 2013.
- A. Graves, N. Jaitly, and A.-R. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. IEEE ASRU, pp. 273-278, 2013.
- S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
- F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," Journal of Machine Learning Research, vol. 3, pp. 115-143, 2003.
- M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
- A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, Pittsburgh, USA, 2006.
- A. Graves, "Sequence transduction with recurrent neural networks," in ICML Representation Learning Workshop, 2012.
- Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
- K. Shinoda and T. Watanabe, "MDL-based context-dependent sub-word modeling for speech recognition," J. Acoust. Soc. Jpn. (E), vol. 21, no. 2, pp. 79-86, 2000.
- Y.-J. Wu and R.-H. Wang, "Minimum generation error training for HMM-based speech synthesis," in Proc. ICASSP, 2006.
- C. J. Chen, R. A. Gopinath, M. D. Monkowski, M. A. Picheny, and K. Shen, "New methods in continuous Mandarin speech recognition," in Proc. EUROSPEECH, 1997.
- F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. IEEE ASRU, 2011.
- CURRENNT: CUDA-enabled machine learning library for recurrent neural networks. [Online]. Available: http://sourceforge.net/projects/currennt/
- H. Zen, "Deep learning in speech synthesis," Proc. ISCA SSW8, 2013. [Online]. Available: http://research.google.com/pubs/archive/41539.pdf
- Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: An improved HMM-based speech synthesis method," in Proc. Blizzard Challenge Workshop, 2006.