Authors :
Nicholas Simeon Dienagha; Biralatei Fawei
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/bdddcnw2
Scribd :
https://tinyurl.com/yc2z9t78
DOI :
https://doi.org/10.38124/ijisrt/26mar1942
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
Effective phonetic acquisition remains a significant hurdle for second-language (L2) learners, particularly in
environments where access to expert pedagogical feedback is limited. This study details the design and implementation of
a Computer-Aided Pronunciation Training (CAPT) tool developed to bridge this gap through real-time speech visualization. The
system leverages a Python-based computational framework, using Librosa for robust audio signal extraction, NumPy
for high-performance numerical processing, and Matplotlib for generating visual feedback. The core methodology
transforms complex acoustic data into intuitive visual representations, specifically spectrograms and
simplified line graphs. The system was evaluated against Praat; results indicated that peaks in the 2D line
graph corresponded accurately to the first and second formants ($F_1$ and $F_2$) of vowel sounds generated in Praat.
Preliminary results suggest that this visual-centric approach reduces the cognitive load of phonetic drills and fosters
learner self-correction, offering a scalable solution for language education in resource-constrained contexts. By
integrating multi-modal engagement, the tool promotes autonomous corrective feedback loops and enhances the efficacy
of pronunciation training: learners can overlay their speech patterns against native-speaker models for comparative
analysis, receiving immediate auditory and visual feedback.
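The spectrogram-plus-line-graph pipeline the abstract describes can be sketched in a few lines of Python. This is a minimal illustration only, not the paper's actual implementation: it uses NumPy's FFT directly in place of the Librosa calls so the sketch is self-contained, and the function names (`spectrogram`, `spectral_envelope`) are illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time FFT.

    Returns an array of shape (freq_bins, time_frames), the raw
    material for a visual display such as matplotlib's pcolormesh.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T

def spectral_envelope(spec):
    """Average the spectrogram over time to get a simplified 2D
    line graph of energy vs. frequency; peaks in this curve are
    where formant-like concentrations of energy appear."""
    return spec.mean(axis=1)

# Synthetic "vowel": two sinusoids standing in for F1 and F2.
sr = 8000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spec = spectrogram(sig)
env = spectral_envelope(spec)
freqs = np.fft.rfftfreq(1024, d=1 / sr)
print(f"strongest peak near {freqs[np.argmax(env)]:.0f} Hz")
```

With a real recording, the same envelope computed for learner and native-speaker audio could be overlaid on one set of axes for the comparative display the abstract mentions.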
Keywords :
Computer-Aided Pronunciation Training (CAPT), Speech Visualization, Signal Processing, Python, Phonetic Acquisition, L2 Learning.
References :
- Brière, E. J. (2017). An investigation of phonological interference. In Pronunciation (pp. 61-94). Routledge.
- McKenzie, B., Bull, R., & Gray, C. (2003). The effects of phonological and visual-spatial interference on children’s arithmetical performance. Educational and Child Psychology, 20(3), 93-108.
- Stockwell, G. (2013). Mobile-assisted language learning. Contemporary computer-assisted language learning, 201-216.
- Chapelle, C. A. (2017). Evaluation of technology and language learning. The handbook of technology and second language teaching and learning, 378-392.
- Levy, M. (2009). Technologies in use for second language learning. The Modern Language Journal, 93, 769-782.
- Kern, R., Ware, P., & Warschauer, M. (2016). Computer-mediated communication and language learning. In The Routledge handbook of English language teaching (pp. 542-555). Routledge.
- Dudeney, G., & Hockly, N. (2016). Literacies, technology and language teaching. In The Routledge handbook of language learning and technology (pp. 115-126). Routledge.
- Chen, M. R. A., Hwang, G. J., & Chang, Y. Y. (2019). A reflective thinking‐promoting approach to enhancing graduate students' flipped learning engagement, participation behaviors, reflective thinking and project learning outcomes. British Journal of Educational Technology, 50(5), 2288-2307.
- Bhardwaj, V., Ben Othman, M. T., Kukreja, V., Belkhier, Y., Bajaj, M., Goud, B. S., ... & Hamam, H. (2022). Automatic speech recognition (ASR) systems for children: A systematic literature review. Applied Sciences, 12(9), 4419.
- Ngueajio, M. K., & Washington, G. (2022, June). Hey ASR system! Why aren’t you more inclusive? Automatic speech recognition systems’ bias and proposed bias mitigation techniques. A literature review. In International conference on human-computer interaction (pp. 421-440). Cham: Springer Nature Switzerland.
- Gruberg, E., Dudkin, E., Wang, Y., Marín, G., Salas, C., Sentis, E., ... & Udin, S. (2006). Influencing and interpreting visual input: the role of a visual feedback system. Journal of Neuroscience, 26(41), 10368-10371.
- Rourke, M. J. (2025). A Gamified Mobile App for Learning Linguistics: Applying Software Design and Thinking to Educational Engagement.
- Eskenazi, M. (2013). The basics. Crowdsourcing for speech processing: Applications to data collection, transcription and assessment, 8-36.
- Derwing, T. M., & Munro, M. J. (2022). Pronunciation learning and teaching. In The Routledge handbook of second language acquisition and speaking (pp. 147-159). Routledge.
- Boersma, P., & Van Heuven, V. (2001). Speak and unSpeak with PRAAT. Glot International, 5(9/10), 341-347.
- Fruehwald, J., & Brickhouse, C. (2024). aligned-textgrid: Lightweight access to structured phonetic data. Proceedings of the Society for Computation in Linguistics (SCiL), 329-330.
- Godwin-Jones, R. (2011). Mobile apps for language learning.
- Sailer, M., Hense, J. U., Mayr, S. K., & Mandl, H. (2017). How gamification motivates: An experimental study of the effects of specific game design elements on psychological need satisfaction. Computers in human behavior, 69, 371-380.
- Hadi Mogavi, R., Guo, B., Zhang, Y., Haq, E. U., Hui, P., & Ma, X. (2022, June). When gamification spoils your learning: A qualitative case study of gamification misuse in a language-learning app. In Proceedings of the Ninth ACM Conference on Learning @ Scale (pp. 175-188).
- Howard, D. M. (2005). Human hearing modelling real-time spectrography for visual feedback in singing training. Folia phoniatrica et logopaedica, 57(5-6), 328-341.
- Hillier, A. F., Hillier, C. E., & Hillier, D. A. (2018). A modified spectrogram with possible application as a visual hearing aid for the deaf. The Journal of the Acoustical Society of America, 144(3), 1517-1520.
- Tran, T., & Lundgren, J. (2020). Drill fault diagnosis based on the scalogram and mel spectrogram of sound signals using artificial intelligence. IEEE Access, 8, 203655-203666.
- Hardison, D. M. (2017). Computer-assisted pronunciation training. In The Routledge handbook of contemporary English pronunciation (pp. 478-494). Routledge.
- Ertmer, D. J. (2004). How well can children recognize speech features in spectrograms? Comparisons by age and hearing status. Journal of Speech, Language, and Hearing Research, 47(3), 484-495.
- Celce-Murcia, M., Brinton, D. M., & Goodwin, J. M. (2010). Teaching pronunciation hardback with audio CDs (2): A course book and reference guide. Cambridge University Press.
- Higgins, S. (2015). A recent history of teaching thinking. In The Routledge international handbook of research on teaching thinking (pp. 19-28). Routledge.
- Hincks, R., & Edlund, J. (2009, September). Using speech technology to promote increased pitch variation in oral presentations. In SLaTE (pp. 117-120).
- Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039-1064.
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems (NeurIPS), 33, 12449-12460.
- Lima, L., & Zawadzki, A. (2018). Improving speaker intelligibility: Using sitcoms and engaging activities to develop learners' perception and production of word stress. Pronunciation in Second Language Learning and Teaching.
- Setter, J., & Sebina, B. (2017). English lexical stress, prominence and rhythm. The Routledge Handbook of Contemporary English Pronunciation, 137–153. https://doi.org/10.4324/9781315145006-9
- McLoughlin, I., Pham, L., Song, Y., Miao, X., Phan, H., Cai, P., ... & Soh, D. (2026). Spectrogram Features for Audio and Speech Analysis. Applied Sciences, 16(2), 572.
- Ertmer, D. J., & Maki, J. J. (2000). A comparison of speech training methods with deaf adolescents: Spectrographic versus noninstrumental instruction. Journal of Speech, Language, and Hearing Research.
- Hardison, D. M., & Pennington, M. C. (2021). Multimodal second-language communication: Research findings and pedagogical implications. Relc Journal, 52(1), 62-76.
- Abdul, Z. K., & Al-Talabani, A. K. (2022). Mel frequency cepstral coefficient and its applications: A review. IEEE Access, 10, 122136-122158.
- Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
- Jassim, W. A., Skoglund, J., Chinen, M., & Hines, A. (2022). Speech quality assessment with WARP-Q: From similarity to subsequence dynamic time warp cost. IET Signal Processing, 16(9), 1050–1070. https://doi.org/10.1049/sil2.12151
- Garreau, D., Lajugie, R., Arlot, S., & Bach, F. (2014). Metric learning for temporal sequence alignment. arXiv. https://doi.org/10.48550/arxiv.1409.3136
- Sakoe, H., & Chiba, S. (1990). Dynamic programming algorithm optimization for spoken word recognition. Readings in Speech Recognition, 159–165. https://doi.org/10.1016/b978-0-08-051584-7.50016-4