Authors :
Nicholas Simeon Dienagha; Biralatei Fawei
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/bdddcnw2
Scribd :
https://tinyurl.com/yc2z9t78
DOI :
https://doi.org/10.38124/ijisrt/26mar1942
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
Effective phonetic acquisition remains a significant hurdle for second-language (L2) learners, particularly in
environments where access to expert pedagogical feedback is limited. This study details the design and implementation of
a Computer-Aided Pronunciation Training (CAPT) tool developed to bridge this gap through real-time speech visualization. The
system leverages a Python-based computational framework, using Librosa for robust audio signal extraction, NumPy
for high-performance numerical processing, and Matplotlib for generating visual feedback. The core methodology
transforms complex acoustic data into intuitive visual representations, specifically spectrograms and
simplified line graphs. The system was evaluated against Praat; results indicated that peaks in the 2D line
graph corresponded accurately to the first and second formants ($F_1$ and $F_2$) of vowel sounds generated in Praat.
Preliminary results suggest that this visual-centric approach reduces the cognitive load of phonetic drills and fosters
learner self-correction, offering a scalable solution for language education in resource-constrained contexts. By
integrating multi-modal engagement, the tool promotes autonomous corrective feedback loops and enhances the efficacy
of pronunciation training: learners can overlay their speech patterns against native-speaker models for comparative
analysis, receiving immediate auditory and visual feedback.
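The spectrogram-plus-line-graph pipeline the abstract describes can be sketched in a few lines of Python. This is a minimal illustration only, not the paper's actual implementation: it uses NumPy's FFT directly in place of the Librosa calls so the sketch is self-contained, and the function names (`spectrogram`, `spectral_envelope`) are illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time FFT.

    Returns an array of shape (freq_bins, time_frames), the raw
    material for a visual display such as matplotlib's pcolormesh.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T

def spectral_envelope(spec):
    """Average the spectrogram over time to get a simplified 2D
    line graph of energy vs. frequency; peaks in this curve are
    where formant-like concentrations of energy appear."""
    return spec.mean(axis=1)

# Synthetic "vowel": two sinusoids standing in for F1 and F2.
sr = 8000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spec = spectrogram(sig)
env = spectral_envelope(spec)
freqs = np.fft.rfftfreq(1024, d=1 / sr)
print(f"strongest peak near {freqs[np.argmax(env)]:.0f} Hz")
```

With a real recording, the same envelope computed for learner and native-speaker audio could be overlaid on one set of axes for the comparative display the abstract mentions.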
Keywords :
Computer-Aided Pronunciation Training (CAPT), Speech Visualization, Signal Processing, Python, Phonetic Acquisition, L2 Learning.
References :
- Brière, E. J. (2017). An investigation of phonological interference. In Pronunciation (pp. 61-94). Routledge.
- McKenzie, B., Bull, R., & Gray, C. (2003). The effects of phonological and visual-spatial interference on children’s arithmetical performance. Educational and Child Psychology, 20(3), 93-108.
- Stockwell, G. (2013). Mobile-assisted language learning. Contemporary computer-assisted language learning, 201-216.
- Chapelle, C. A. (2017). Evaluation of technology and language learning. The handbook of technology and second language teaching and learning, 378-392.
- Levy, M. (2009). Technologies in use for second language learning. The Modern Language Journal, 93, 769-782.
- Kern, R., Ware, P., & Warschauer, M. (2016). Computer-mediated communication and language learning. In The Routledge handbook of English language teaching (pp. 542-555). Routledge.
- Dudeney, G., & Hockly, N. (2016). Literacies, technology and language teaching. In The Routledge handbook of language learning and technology (pp. 115-126). Routledge.
- Chen, M. R. A., Hwang, G. J., & Chang, Y. Y. (2019). A reflective thinking‐promoting approach to enhancing graduate students' flipped learning engagement, participation behaviors, reflective thinking and project learning outcomes. British Journal of Educational Technology, 50(5), 2288-2307.
- Bhardwaj, V., Ben Othman, M. T., Kukreja, V., Belkhier, Y., Bajaj, M., Goud, B. S., ... & Hamam, H. (2022). Automatic speech recognition (ASR) systems for children: A systematic literature review. Applied Sciences, 12(9), 4419.
- Ngueajio, M. K., & Washington, G. (2022, June). Hey ASR system! Why aren’t you more inclusive? Automatic speech recognition systems’ bias and proposed bias mitigation techniques. A literature review. In International conference on human-computer interaction (pp. 421-440). Cham: Springer Nature Switzerland.
- Gruberg, E., Dudkin, E., Wang, Y., Marín, G., Salas, C., Sentis, E., ... & Udin, S. (2006). Influencing and interpreting visual input: the role of a visual feedback system. Journal of Neuroscience, 26(41), 10368-10371.
- Rourke, M. J. (2025). A Gamified Mobile App for Learning Linguistics: Applying Software Design and Thinking to Educational Engagement.
- Eskenazi, M. (2013). The basics. Crowdsourcing for speech processing: Applications to data collection, transcription and assessment, 8-36.
- Derwing, T. M., & Munro, M. J. (2022). Pronunciation learning and teaching. In The Routledge handbook of second language acquisition and speaking (pp. 147-159). Routledge.
- Boersma, P., & Van Heuven, V. (2001). Speak and unSpeak with PRAAT. Glot International, 5(9/10), 341-347.
- Fruehwald, J., & Brickhouse, C. (2024). aligned-textgrid: Lightweight access to structured phonetic data. Proceedings of the Society for Computation in Linguistics (SCiL), 329-330.
- Godwin-Jones, R. (2011). Mobile apps for language learning.
- Sailer, M., Hense, J. U., Mayr, S. K., & Mandl, H. (2017). How gamification motivates: An experimental study of the effects of specific game design elements on psychological need satisfaction. Computers in human behavior, 69, 371-380.
- Hadi Mogavi, R., Guo, B., Zhang, Y., Haq, E. U., Hui, P., & Ma, X. (2022, June). When gamification spoils your learning: A qualitative case study of gamification misuse in a language-learning app. In Proceedings of the Ninth ACM Conference on Learning @ Scale (pp. 175-188).
- Howard, D. M. (2005). Human hearing modelling real-time spectrography for visual feedback in singing training. Folia phoniatrica et logopaedica, 57(5-6), 328-341.
- Hillier, A. F., Hillier, C. E., & Hillier, D. A. (2018). A modified spectrogram with possible application as a visual hearing aid for the deaf. The Journal of the Acoustical Society of America, 144(3), 1517-1520.
- Tran, T., & Lundgren, J. (2020). Drill fault diagnosis based on the scalogram and mel spectrogram of sound signals using artificial intelligence. IEEE Access, 8, 203655-203666.
- Hardison, D. M. (2017). Computer-assisted pronunciation training. In The Routledge handbook of contemporary English pronunciation (pp. 478-494). Routledge.
- Ertmer, D. J. (2004). How well can children recognize speech features in spectrograms? Comparisons by age and hearing status. Journal of Speech, Language, and Hearing Research, 47(3), 484-495.
- Celce-Murcia, M., Brinton, D. M., & Goodwin, J. M. (2010). Teaching pronunciation hardback with audio CDs (2): A course book and reference guide. Cambridge University Press.
- Higgins, S. (2015). A recent history of teaching thinking. In The Routledge international handbook of research on teaching thinking (pp. 19-28). Routledge.
- Hincks, R., & Edlund, J. (2009, September). Using speech technology to promote increased pitch variation in oral presentations. In SLaTE (pp. 117-120).
- Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039-1064.
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems (NeurIPS), 33, 12449-12460.
- Lima, L., & Zawadzki, A. (2018). Improving speaker intelligibility: Using sitcoms and engaging activities to develop learners' perception and production of word stress. Pronunciation in Second Language Learning and Teaching.
- Setter, J., & Sebina, B. (2017). English lexical stress, prominence and rhythm. The Routledge Handbook of Contemporary English Pronunciation, 137–153. https://doi.org/10.4324/9781315145006-9
- McLoughlin, I., Pham, L., Song, Y., Miao, X., Phan, H., Cai, P., ... & Soh, D. (2026). Spectrogram Features for Audio and Speech Analysis. Applied Sciences, 16(2), 572.
- Ertmer, D. J., & Maki, J. J. (2000). A comparison of speech training methods with deaf adolescents: Spectrographic versus noninstrumental instruction. Journal of Speech, Language, and Hearing Research.
- Hardison, D. M., & Pennington, M. C. (2021). Multimodal second-language communication: Research findings and pedagogical implications. Relc Journal, 52(1), 62-76.
- Abdul, Z. K., & Al-Talabani, A. K. (2022). Mel frequency cepstral coefficient and its applications: A review. IEEE Access, 10, 122136-122158.
- Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
- Jassim, W. A., Skoglund, J., Chinen, M., & Hines, A. (2022). Speech quality assessment with WARP-Q: From similarity to subsequence dynamic time warp cost. IET Signal Processing, 16(9), 1050–1070. https://doi.org/10.1049/sil2.12151
- Garreau, D., Lajugie, R., Arlot, S., & Bach, F. (2014). Metric learning for temporal sequence alignment. arXiv. https://doi.org/10.48550/arxiv.1409.3136
- Sakoe, H., & Chiba, S. (1990). Dynamic programming algorithm optimization for spoken word recognition. Readings in Speech Recognition, 159–165. https://doi.org/10.1016/b978-0-08-051584-7.50016-4