Design and Implementation of a Computer-Aided Pronunciation Tool for Autonomous Phonetic Acquisition


Authors : Nicholas Simeon Dienagha; Biralatei Fawei

Volume/Issue : Volume 11 - 2026, Issue 3 - March


Google Scholar : https://tinyurl.com/bdddcnw2

Scribd : https://tinyurl.com/yc2z9t78

DOI : https://doi.org/10.38124/ijisrt/26mar1942



Abstract : Effective phonetic acquisition remains a significant hurdle for second-language (L2) learners, particularly in environments where access to expert pedagogical feedback is limited. This study details the design and implementation of a Computer-Aided Pronunciation (CAP) tool developed to bridge this gap through real-time speech visualization. The system leverages a Python-based computational framework, utilizing Librosa for robust audio signal extraction, NumPy for high-performance numerical processing, and Matplotlib for the generation of visual feedback. The core methodology focuses on transforming complex acoustic data into intuitive visual representations, specifically spectrograms and simplified line graphs. The system was evaluated against Praat, and the results indicated that the peaks in the 2D line graph accurately corresponded to the first and second formants ($F_1$ and $F_2$) of vowel sounds generated in Praat. Preliminary results suggest that this visual-centric approach reduces the cognitive load of phonetic drills and fosters learner self-correction, offering a scalable solution for language education in resource-constrained contexts. By integrating multi-modal engagement, the tool promotes autonomous corrective feedback loops and enhances the efficacy of pronunciation training: learners can overlay their speech patterns on native-speaker models for comparative analysis, receiving immediate auditory and visual feedback.

Keywords : Computer-Aided Pronunciation Training (CAPT), Speech Visualization, Signal Processing, Python, Phonetic Acquisition, L2 Learning
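
Illustrative Sketch : The full paper is not reproduced on this page, but the pipeline the abstract describes (Librosa for signal extraction, NumPy for numerical processing, Matplotlib for the spectrogram and simplified line graph) can be sketched as below. This is a minimal offline approximation, not the authors' implementation: the file names, FFT parameters, and peak-picking step are assumptions, and the peaks of a time-averaged spectral envelope are only a rough proxy for the LPC-based formant estimates ($F_1$, $F_2$) that Praat computes.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from scipy.signal import find_peaks

N_FFT, HOP = 2048, 512  # illustrative analysis settings, not the authors' values

def spectral_envelope(path):
    """Return (frequencies, time-averaged magnitude spectrum in dB, samples, rate)."""
    y, sr = librosa.load(path, sr=None)  # keep the native sample rate
    mag = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP))
    env_db = librosa.amplitude_to_db(mag.mean(axis=1), ref=np.max)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)
    return freqs, env_db, y, sr

# Hypothetical recordings of the same vowel by a learner and a native model.
f_l, e_l, y_l, sr_l = spectral_envelope("learner_vowel.wav")
f_n, e_n, _, _ = spectral_envelope("native_vowel.wav")

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Panel 1: spectrogram of the learner's utterance (visual feedback).
S_db = librosa.amplitude_to_db(
    np.abs(librosa.stft(y_l, n_fft=N_FFT, hop_length=HOP)), ref=np.max)
librosa.display.specshow(S_db, sr=sr_l, hop_length=HOP,
                         x_axis="time", y_axis="hz", ax=ax1)
ax1.set(title="Learner spectrogram", ylim=(0, 4000))

# Panel 2: simplified line graphs overlaid for comparative analysis; for a
# steady vowel, the prominent low-frequency peaks of the learner's envelope
# roughly track the F1/F2 regions.
ax2.plot(f_l, e_l, label="learner")
ax2.plot(f_n, e_n, alpha=0.7, label="native model")
peaks, _ = find_peaks(e_l, prominence=3, distance=8)
ax2.plot(f_l[peaks], e_l[peaks], "rx", label="peak candidates")
ax2.set(xlim=(0, 4000), xlabel="Frequency (Hz)", ylabel="Level (dB)")
ax2.legend()
plt.tight_layout()
plt.show()

A real-time CAPT tool would presumably run this analysis per captured frame rather than over a whole file, and would time-align the learner and model utterances (e.g., via dynamic time warping) before overlaying them.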


