Authors :
Isaac, Onoriode, Oshevire; Caleb, Ande; Oluwatosin, Oluwaseun, Babatunde; Grace, Jesutola, Ajayi; Chigozie, David, Eze; Timilehin Ilupeju; Oluwatobi Balogun
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/4nefxj2x
Scribd :
https://tinyurl.com/3wu8dyeb
DOI :
https://doi.org/10.38124/ijisrt/26mar883
Abstract :
Students often find it difficult to take accurate and complete notes during lectures due to fast-paced speech, unfamiliar
accents, background noise, and the pressure of multitasking. These challenges are even more pronounced for students with
learning difficulties or disabilities and for non-native English speakers. Traditional note-taking methods do not
always guarantee clarity or completeness, which affects comprehension and academic performance. With advancements in
artificial intelligence (AI), it is now possible to explore automated tools that can transcribe and summarize lectures to
support more effective learning.
This study addresses the problem of limited access to accurate and real-time lecture notes. Existing speech-to-text
systems are often trained on clean, studio-quality datasets and struggle to perform well in real-world classroom
environments with noise, diverse accents, and technical terms. Most available solutions are not tailored for Nigerian contexts
and fail to meet the academic needs of students. To address this gap, a system integrating advanced AI models was
developed to improve transcription accuracy and to summarize educational content automatically.
The system combines Wav2Vec 2.0 for speech recognition and BERT for extractive summarization. Publicly available
datasets such as LJ Speech and CNN/DailyMail were used for training and testing. The audio was preprocessed using noise
reduction and segmentation, while the text data underwent tokenization and lemmatization. The models were fine-tuned
and integrated into a single application with a graphical interface. The system achieved a Word Error Rate (WER) of 0.2
and a ROUGE-1 score of 0.8, indicating strong performance. The interface allows users to upload or record audio, generate
full transcripts, produce summaries, and export the output in readable formats.
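For readers unfamiliar with the reported metrics: a WER of 0.2 means roughly one word in five is misrecognized, while ROUGE-1 measures unigram overlap with a reference summary. The following is a minimal, self-contained sketch of both metrics, not the authors' evaluation code (which likely relies on standard packages such as jiwer or rouge-score):

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)


def rouge1_f(reference: str, summary: str) -> float:
    """ROUGE-1 F1: unigram overlap between a candidate and a reference summary."""
    ref_counts = Counter(reference.lower().split())
    sum_counts = Counter(summary.lower().split())
    overlap = sum((ref_counts & sum_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sum_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Both scores range over [0, 1]; lower is better for WER, higher is better for ROUGE-1.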
In conclusion, this project demonstrates that combining transformer-based models like Wav2Vec 2.0 and BERT can
provide an efficient and accessible solution for lecture note generation. It enhances learning for all students, particularly
those with special needs, and supports inclusive education through AI-based tools.
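The extractive step described above selects the most informative sentences from the transcript rather than generating new text. The paper scores sentences with BERT representations; the sketch below substitutes a simple word-frequency score purely to illustrate the select-and-reorder mechanics, and is not the authors' implementation:

```python
import re
from collections import Counter


def extractive_summary(text: str, k: int = 2) -> str:
    """Return the k highest-scoring sentences in their original order.

    Sentences are scored by the average corpus frequency of their words,
    a crude stand-in for the BERT-based scoring described in the paper."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentences by score, keep the top k, then restore reading order.
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```

Restoring the original sentence order after ranking is what keeps an extractive summary readable, and the same structure applies regardless of how individual sentences are scored.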
References :
- Ahmadu, A. N., Akinrefon, A. A., Torsen, E., & Yakubu, N. (2024). Survival Analysis of Students’ Dropout in a Nigerian University System. International Journal of Development Mathematics (IJDM), 1(2), 160-168.
- Anidjar, O. H., & Yozevitch, R. (2025). Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets. arXiv preprint arXiv:2501.11065.
- Arafath, K. M. I. Y., & Routray, A. (2025). Detection of breath sounds in speech: A deep learning approach. Engineering Applications of Artificial Intelligence, 141, 109808.
- Avro, S. B. H., Taher, T., & Mamun, N. (2025). EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network. arXiv preprint arXiv:2501.12674.
- Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
- Benzirar, A., Hamidi, M., & Bouami, M. F. (2025). Conception of speech emotion recognition methods: a review. Indonesian Journal of Electrical Engineering and Computer Science, 37(3), 1856-1864.
- Benzirar, M., Elhassouny, A., & Benhlima, L. (2025). A Survey on Speech Recognition: Techniques, Models, and Challenges. International Journal of Artificial Intelligence Research, 13(2), 95–110.
- Chang, J., Lee, M., & Park, K. (2024). Transformer Models in Educational AI: Applications and Challenges. Journal of Educational Computing, 22(1), 65–81.
- Chang, O., Liao, H., Serdyuk, D., Shahy, A., & Siohan, O. (2024, April). Conformer is All You Need for Visual Speech Recognition. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 10136-10140). IEEE.
- Chen, W., Xing, X., Chen, P., & Xu, X. (2024). Vesper: A compact and effective pretrained model for speech emotion recognition. IEEE Transactions on Affective Computing.
- Choi, J., Kim, S., & Lee, D. (2024). Advancements in Deep Neural Architectures for Speech Recognition. Journal of Computational Linguistics and AI, 19(2), 88–104.
- Choi, J., Park, S. J., Kim, M., & Ro, Y. M. (2024). AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 27325-27337).
- Dekel, A., Shechtman, S., Fernandez, R., Haws, D., Kons, Z., & Hoory, R. (2024, April). Speak While You Think: Streaming Speech Synthesis During Text Generation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 11931-11935). IEEE.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT. https://arxiv.org/abs/1810.04805
- Dolaeva, A., Beliaeva, U., Grigoriev, D., Semenov, A., & Rysz, M. (2025). Analyzing and forecasting P/E ratios using investor sentiment in panel data regression and LSTM models. International Review of Economics & Finance, 103840.
- Du, C., Guo, Y., Shen, F., Liu, Z., Liang, Z., Chen, X., ... & Yu, K. (2024, March). UniCATS: A unified context-aware text-to-speech framework with contextual vq-diffusion and vocoding. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 16, pp. 17924-17932).
- Fan, J., Yang, J., Zhang, X., & Yao, Y. (2022). Real-time single-channel speech enhancement based on causal attention mechanism. Applied Acoustics, 201, 109084.
- Fang, Q., Guo, S., Zhou, Y., Ma, Z., Zhang, S., & Feng, Y. (2024). Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666.
- Fang, Y., Wu, J., & Li, X. (2024). The Impact of Big Data and GPU Acceleration on Speech-to-Text Model Performance. International Journal of Speech Processing, 12(1), 55–70.
- Fu, P., Liu, D., & Yang, H. (2022). LAS-transformer: An enhanced transformer based on the local attention mechanism for speech recognition. Information, 13(5), 250.
- Gimeno-Gómez, D., & Martínez-Hinarejos, C. D. (2024). The PRHLT Speech Recognition System for the Albayzín 2024 Bilingual Basque-Spanish Speech to Text Challenge. In Proc. IberSPEECH 2024 (pp. 310-314).
- Hasan, M. M., Das, R. K., Hassan, M., Razia, S., Ani, J. F., Khushbu, S. A., & Islam, M. (2025). Hybrid deep learning: a comparative study on AI algorithms in natural language processing for text classification. Bulletin of Electrical Engineering and Informatics, 14(1), 551-559.
- Jbene, M., Chehri, A., Saadane, R., Tigani, S., & Jeon, G. (2025). Intent detection for task‐oriented conversational agents: A comparative study of recurrent neural networks and transformer models. Expert Systems, 42(2), e13712.
- Karmakar, P., Teng, S. W., & Lu, G. (2024). Thank you for attention: a survey on attention-based artificial neural networks for automatic speech recognition. Intelligent Systems with Applications, 200406.
- Karmakar, S., Dutta, A., & Roy, N. (2024). Speech-to-Text Systems: A Comparative Analysis of Transformer-Based Approaches. International Journal of Computational Linguistics, 18(3), 134–150.
- Kazemi, M. H., & Alvanchi, A. (2025). Application of NLP-based models in automated detection of risky contract statements written in complex script system. Expert Systems with Applications, 259, 125296.
- Khan, F., Abdullahi, R., & Zhang, Y. (2025). Enhancing Contextual Understanding in Low-Resource Languages Using Multilingual BERT. Proceedings of the International Conference on Computational Linguistics (COLING), 112(2), 134–146.
- Khan, L., Qazi, A., Chang, H. T., Alhajlah, M., & Mahmood, A. (2025). Empowering Urdu sentiment analysis: an attention-based stacked CNN-Bi-LSTM DNN with multilingual BERT. Complex & Intelligent Systems, 11(1), 10.
- Khan, M., Gueaieb, W., El Saddik, A., & Kwon, S. (2024). MSER: Multimodal speech emotion recognition using cross-attention with deep fusion. Expert Systems with Applications, 245, 122946.
- Le, M., Vyas, A., Shi, B., Karrer, B., Sari, L., Moritz, R., ... & Hsu, W. N. (2024). Voicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36.
- Liu, Y., Wei, L. F., Qian, X., Zhang, T. H., Chen, S. L., & Yin, X. C. (2024). M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing. Pattern Recognition Letters, 179, 158-164.
- Mamatov, N. S., Niyozmatova, N. A., Yuldoshev, Y. S., Abdullaev, S. S., & Samijonov, A. N. (2022, October). Automatic speech recognition on the neutral network based on attention mechanism. In International Conference on Intelligent Human Computer Interaction (pp. 100-108). Cham: Springer Nature Switzerland.
- Mishra, S. P., Warule, P., & Deb, S. (2025). Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition. Speech Communication, 166, 103148.
- Nazir, O., Malik, A., Singh, S., & Pathan, A. S. K. (2024). Multi speaker text-to-speech synthesis using generalized end-to-end loss function. Multimedia Tools and Applications, 1-18.
- Niyozmatova, N. A., Mamatov, N. S., Samijonov, A. N., & Samijonov, B. N. (2025). Language and acoustic modeling in Uzbek speech recognition. In Artificial Intelligence and Information Technologies (pp. 558-564). CRC Press.
- Orosoo, M., Raash, N., Treve, M., Lahza, H. F. M., Alshammry, N., Ramesh, J. V. N., & Rengarajan, M. (2025). Transforming English language learning: Advanced speech recognition with MLP-LSTM for personalized education. Alexandria Engineering Journal, 111, 21-32.
- Patil, R. N., Rawandale, S. A., Yadav, G. B., & Kadam, P. (2025). Leveraging Machine Learning and Neural Networks for Enhanced Communication in Leadership. In Leadership Paradigms and the Impact of Technology (pp. 247-284). IGI Global Scientific Publishing.
- Poorna, S. S., Menon, V., & Gopalan, S. (2025). Hybrid CNN-BiLSTM architecture with multiple attention mechanisms to enhance speech emotion recognition. Biomedical Signal Processing and Control, 100, 106967.
- Pradhan, A., & Yajnik, A. (2024). Parts-of-speech tagging of Nepali texts with Bidirectional LSTM, Conditional Random Fields and HMM. Multimedia Tools and Applications, 83(4), 9893-9909.
- Quayum, R. A., Bayor, B., Ji, C., & Malik, U. (2025). Arabic Diacritization with Viterbi N-gram Model, Transformers, and Recurrent Neural Networks.
- Shao, S. (2025). Enhancing Sentiment Analysis with a CNN-Stacked LSTM Hybrid Model. In ITM Web of Conferences (Vol. 70, p. 02002). EDP Sciences.
- Sharon, R., Sur, M., & Murthy, H. (2025). Harnessing the Multi-phasal Nature of Speech-EEG for Enhancing Imagined Speech Recognition. IEEE Open Journal of Signal Processing.
- Soydaner, D. (2022). Attention mechanism in neural networks: where it comes and where it goes. Neural Computing and Applications, 34(16), 13371-13385.
- Sujatha, R., Chatterjee, J. M., Pathy, B., & Hu, Y. C. (2025). Automatic emotion recognition using deep neural network. Multimedia Tools and Applications, 1-30.
- Sun, Y., Wang, L., & Li, M. (2021). Modern Applications of BLEU in Text Summarization and Generation. Journal of Natural Language Engineering, 27(4), 567–580.
- Tamayo, A., & Abaurrea, A. R. (2024). Speech-to-text Recognition for the Creation of Subtitles in Basque: An Analysis of ADITU Based on the NER Model. The Journal of Specialised Translation, (41), 48-73.
- Tan, X., Chen, J., Liu, H., Cong, J., Zhang, C., Liu, Y., ... & Liu, T. Y. (2024). Naturalspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Tang, Y., & Liao, J. (2025). Research on digital entertainment technology and gaming methods based on hidden Markov models in English e-learning classroom mode. Entertainment Computing, 52, 100856.
- Valencia-Angulo, E. A., Ramírez-Vanegas, C. A., & Giraldo, O. D. M. (2025). Distance measures for hidden Markov models based on Hilbert space embeddings for time series classification. Statistics, Optimization & Information Computing.
- Valencia-Angulo, L. A., Martínez-González, A., & López-Moreno, J. (2025). A Historical Overview of Speech Recognition Technologies: From Template Matching to Deep Learning. Journal of Speech Technology and Applications, 18(1), 22–39.
- Valencia-Angulo, P., Ishaq, S., & Wang, H. (2025). A Historical Overview of Speech-to-Text Systems: From Templates to Transformers. ACM Transactions on Speech and Language Processing, 18(1), 1–24.
- Vinothkumar, G., & Kumar, M. (2024). Speech Enhancement with Background Noise Suppression in Various Data Corpus Using Bi-LSTM Algorithm. International Journal of Electrical and Electronics Research, 12(1), 322-328.
- Wang, H., Pandey, A., & Wang, D. (2025). A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments. Computer Speech & Language, 89, 101677.
- Wang, K., Li, J., & Sun, Z. (2025). Generative adaptable design based on hidden Markov model. Advanced Engineering Informatics, 64, 103034.
- Wang, S., Du, Y., Guo, X., Pan, B., Qin, Z., & Zhao, L. (2024). Controllable Data Generation by Deep Learning: A Review. ACM Computing Surveys, 56(9), 1-38.
- Wang, T., Ezike, F., & Ogundipe, M. (2025). Improving Speech-to-Text Accuracy in Noisy and Reverberant Environments Using DNN-Based Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 33(4), 402–416.
- Wang, Y., Xu, Z., Zheng, Z., Zheng, Z., & Wu, J. (2024). Review on the Use of Speech Synthesis Technology in Education. New Explorations in Education and Teaching, 2(2).
- Xu, M., Li, X., & Liu, J. (2023). Evaluation Metrics for End-to-End Speech Recognition Systems. Journal of Speech Technology, 18(1), 45–60.
- Zhang, D., Zhang, X., Zhan, J., Li, S., Zhou, Y., & Qiu, X. (2024). SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation. arXiv preprint arXiv:2401.13527.
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. Proceedings of ICLR 2020. https://openreview.net/forum?id=SkeHuCVFDr
- Zhao, H., Lyu, Z., & Mendoza, R. (2025). Speech Recognition for d/Deaf and Hard-of-Hearing Accessibility: A Comparative Analysis. Journal of Assistive Technologies, 19(2), 55–71.
- Zhao, R., Choi, A. S., Koenecke, A., & Rameau, A. (2025). Quantification of Automatic Speech Recognition System Performance on d/Deaf and Hard of Hearing Speech. The Laryngoscope, 135(1), 191-197.
- Zhou, Q., Yang, Y., & Liu, J. (2023). Evaluation Metrics for Summarization Models in Education Technology. Journal of Computational Linguistics and AI, 21(2), 134–149.