A Multimodal CNN Transformer Framework for Early Dyslexia and Dysgraphia Detection Using Handwriting and Speech


Authors : Siddhesh Pote; Gaurav Zagade; Shrikrushna Suryavanshi; Ashwini Shahapurkar

Volume/Issue : Volume 11 - 2026, Issue 4 - April


Google Scholar : https://tinyurl.com/34semrra

Scribd : https://tinyurl.com/kny9psue

DOI : https://doi.org/10.38124/ijisrt/26apr2145

Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.


Abstract : This paper presents a multimodal framework for early screening of learning disabilities—focusing on dyslexia and dysgraphia—by jointly modeling handwriting and speech to capture graphophonological interactions that single-modality systems often miss. Prior studies have shown high within-cohort accuracy using CNNs for handwriting, classical machine learning and deep models for EEG and imaging, and educational analytics; however, most rely on small, homogeneous datasets, use late fusion when combining modalities, and lack cross-site or cross-language validation, limiting generalizability and deployment potential. The proposed system integrates a vision encoder for handwriting (optionally incorporating tablet kinematics) and a speech encoder that fuses acoustic and ASR-derived linguistic features via cross-modal transformers, trained with supervised and contrastive losses for robust alignment. Methodological considerations include multilingual data collection, standardized preprocessing, calibrated uncertainty, and privacy-preserving learning to support equitable classroom deployment. The evaluation plan compares unimodal baselines, late-fusion ensembles, and the proposed intermediate-fusion architecture across within-site, cross-site, and cross-language settings using AUROC, macro-F1, severity kappa, and fairness audits. Expected outcomes include improved out-of-distribution performance and interpretable per-modality rationales to assist educators and clinicians in early intervention.
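As an illustration of the intermediate-fusion idea described in the abstract, the following PyTorch sketch shows one plausible way to wire a handwriting CNN encoder and a speech-feature encoder through bidirectional cross-modal attention, and to combine a supervised classification loss with a contrastive alignment loss. All module sizes, feature dimensions, and the loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an intermediate-fusion CNN + cross-modal attention model.
# Dimensions, layer choices, and loss weights are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Fuses handwriting and speech embeddings with bidirectional cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 3):
        super().__init__()
        # Handwriting branch: a small CNN over grayscale word/line crops
        # (tablet kinematics could be appended as extra channels or features).
        self.hw_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
        # Speech branch: projects precomputed acoustic + ASR-derived features per frame.
        self.sp_encoder = nn.Linear(128, dim)
        # Cross-modal attention in both directions, then a joint classifier.
        self.hw_to_sp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sp_to_hw = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, hw_img, sp_feats):
        hw = self.hw_encoder(hw_img).unsqueeze(1)     # (B, 1, dim)
        sp = self.sp_encoder(sp_feats)                # (B, T, dim)
        hw_ctx, _ = self.hw_to_sp(hw, sp, sp)         # handwriting attends to speech
        sp_ctx, _ = self.sp_to_hw(sp, hw, hw)         # speech attends to handwriting
        hw_emb = hw_ctx.squeeze(1)
        sp_emb = sp_ctx.mean(dim=1)
        logits = self.classifier(torch.cat([hw_emb, sp_emb], dim=-1))
        return logits, hw_emb, sp_emb


def contrastive_alignment(hw_emb, sp_emb, temperature: float = 0.07):
    """InfoNCE-style loss pulling paired handwriting/speech embeddings together."""
    hw = F.normalize(hw_emb, dim=-1)
    sp = F.normalize(sp_emb, dim=-1)
    logits = hw @ sp.t() / temperature
    targets = torch.arange(hw.size(0), device=hw.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Example combined objective: supervised cross-entropy plus contrastive alignment.
model = CrossModalFusion()
hw_img = torch.randn(8, 1, 64, 256)    # grayscale handwriting crops
sp_feats = torch.randn(8, 50, 128)     # 50 frames of acoustic/linguistic features
labels = torch.randint(0, 3, (8,))
logits, hw_emb, sp_emb = model(hw_img, sp_feats)
loss = F.cross_entropy(logits, labels) + 0.5 * contrastive_alignment(hw_emb, sp_emb)
```

Bidirectional cross-attention is one common way to realize intermediate fusion; late-fusion baselines would instead train each branch separately and average their predicted probabilities.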

Keywords : Dyslexia; Dysgraphia; Handwriting Analysis; Speech Processing; Multimodal Learning; Cross-Modal Attention.
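The evaluation plan in the abstract reports AUROC, macro-F1, and a severity kappa. A minimal sketch of how these could be computed with scikit-learn is given below; the class labels, probability matrix, and the choice of quadratic kappa weighting are placeholder assumptions, not results from the paper.

```python
# Illustrative computation of the reported metrics (AUROC, macro-F1, severity kappa).
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, cohen_kappa_score

# Hypothetical predictions for a 3-class screening task
# (0 = typical, 1 = at-risk, 2 = likely disorder).
y_true = np.array([0, 2, 1, 0, 1, 2, 0, 1])
y_prob = np.array([
    [0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2], [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2],
])
y_pred = y_prob.argmax(axis=1)

auroc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Quadratic weighting penalizes large severity disagreements more heavily.
severity_kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")

print(f"AUROC={auroc:.3f}  macro-F1={macro_f1:.3f}  kappa={severity_kappa:.3f}")
```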


