Context-Aware Speech Generation Using BiLSTM-Based Neural Networks


Authors : K. Jeevan Reddy; Dr. Syed Jahangir Badashah; S. Tharun Govind; M. Dinesh; K. Bindusri

Volume/Issue : Volume 10 - 2025, Issue 7 - July


Google Scholar : https://tinyurl.com/467k8nuc

DOI : https://doi.org/10.38124/ijisrt/25jul358



Abstract : Recent studies show that feed-forward deep neural networks (DNNs) outperform text-to-speech (TTS) systems based on decision-tree-clustered, context-dependent hidden Markov models (HMMs) [1, 4]. However, the feed-forward nature of DNN-based models makes it difficult to incorporate long-span contextual effects into spoken utterances. A typical strategy in HMM-based TTS for producing a continuous speech trajectory is to use dynamic features to constrain the generation of speech parameters [2]. In this study, parametric text-to-speech synthesis is performed by capturing the co-occurrence or correlation between any two instants in a spoken utterance with time-aware memory network cells. Our experiments show that a hybrid of a DNN and a BLSTM-RNN is the best-performing system: its upper hidden layers use a bidirectional LSTM-RNN structure, while its lower hidden layers use a simple feed-forward (one-way) structure. On both objective and subjective metrics, it surpasses the conventional decision-tree-clustered HMM system and the DNN-based TTS system. Because the BLSTM-RNN TTS produces very smooth speech trajectories, dynamic constraints are superfluous.
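The hybrid architecture described above — lower feed-forward layers mapping frame-level linguistic features upward, topped by bidirectional LSTM layers and a linear regression layer that emits per-frame acoustic parameters — can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' implementation; the layer widths and the input/output dimensions (355 linguistic features in, 127 acoustic features out) are assumptions chosen for the example.

import torch
import torch.nn as nn

class HybridDNNBLSTM(nn.Module):
    def __init__(self, in_dim=355, ff_dim=512, lstm_dim=256, out_dim=127):
        super().__init__()
        # Lower hidden layers: simple one-way (feed-forward) structure.
        self.ff = nn.Sequential(
            nn.Linear(in_dim, ff_dim), nn.Tanh(),
            nn.Linear(ff_dim, ff_dim), nn.Tanh(),
        )
        # Upper hidden layers: bidirectional LSTM, so each frame sees
        # long-span context from both directions of the utterance.
        self.blstm = nn.LSTM(ff_dim, lstm_dim, num_layers=2,
                             bidirectional=True, batch_first=True)
        # Linear regression layer to acoustic parameters
        # (e.g., spectral, F0, and voicing features per frame).
        self.out = nn.Linear(2 * lstm_dim, out_dim)

    def forward(self, x):
        # x: (batch, frames, in_dim) frame-level linguistic features
        h = self.ff(x)
        h, _ = self.blstm(h)
        return self.out(h)

# Example: one utterance of 300 frames.
model = HybridDNNBLSTM()
linguistic = torch.randn(1, 300, 355)
acoustic = model(linguistic)   # shape: (1, 300, 127)

Because the bidirectional layers condition every output frame on the whole utterance, the predicted trajectories come out smooth on their own, which is why the dynamic-feature constraints used in HMM parameter generation [2] become unnecessary.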

Keywords : Bidirectional Long Short-Term Memory (BLSTM), Deep Neural Network, Recurrent Neural Network, Statistical Parametric Speech Synthesis, Hidden Markov Model.

References :

  1. H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039-1064, 2009.
  2. K. Tokuda, T. Kobayashi, T. Masuko, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, pp. 1315-1318, 2000.
  3. H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, pp. 7962-7966, 2013.
  4. Y. Qian, Y.-C. Fan, W.-P. Hu, and F. K. Soong, "On the training aspects of deep neural network (DNN) for parametric TTS synthesis," in Proc. ICASSP, 2014.
  5. H. Lu, S. King, and O. Watts, "Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis," in Proc. 8th ISCA Workshop on Speech Synthesis, pp. 281-285, 2013.
  6. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
  7. S. Kang, X. Qian, and H. Meng, "Multi-distribution deep belief network for speech synthesis," in Proc. ICASSP, pp. 8012-8016, 2013.
  8. Z.-H. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis," in Proc. ICASSP, pp. 7825-7829, 2013.
  9. A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP, pp. 6645-6649, 2013.
  10. A. Graves, N. Jaitly, and A.-R. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. IEEE ASRU, pp. 273-278, 2013.
  11. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
  12. F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, "Learning precise timing with LSTM recurrent networks," Journal of Machine Learning Research, vol. 3, pp. 115-143, 2002.
  13. M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673-2681, 1997.
  14. A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, Pittsburgh, USA, 2006.
  15. A. Graves, "Sequence transduction with recurrent neural networks," in ICML Representation Learning Workshop, 2012.
  16. Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
  17. K. Shinoda and T. Watanabe, "MDL-based context-dependent sub-word modeling for speech recognition," J. Acoust. Soc. Jpn. (E), vol. 21, no. 2, pp. 79-86, 2000.
  18. Y.-J. Wu and R. H. Wang, "Minimum generation error training for HMM-based speech synthesis," in Proc. ICASSP, 2006.
  19. C. J. Chen, R. A. Gopinath, M. D. Monkowski, M. A. Picheny, and K. Shen, "New methods in continuous Mandarin speech recognition," in Proc. EUROSPEECH, 1997.
  20. F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. IEEE ASRU, 2011.
  21. H. Zen, "Deep learning in speech synthesis," Proc. ISCA SSW8, 2013. [Online]. Available: http://research.google.com/pubs/archive/41539.pdf
  22. Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: An improved HMM-based speech synthesis method," in Proc. Blizzard Challenge Workshop, 2006.
