Authors :
Siddhesh Pote; Gaurav Zagade; Shrikrushna Suryavanshi; Ashwini Shahapurkar
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/34semrra
Scribd :
https://tinyurl.com/kny9psue
DOI :
https://doi.org/10.38124/ijisrt/26apr2145
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
This paper presents a multimodal framework for early screening of learning disabilities, focusing on dyslexia and dysgraphia, by jointly modeling handwriting and speech to capture graphophonological interactions that single-modality systems often miss. Prior studies have shown high within-cohort accuracy using CNNs for handwriting, classical machine learning and deep models for EEG and imaging, and educational analytics; however, most rely on small, homogeneous datasets, use late fusion when combining modalities, and lack cross-site or cross-language validation, limiting generalizability and deployment potential. The proposed system integrates a vision encoder for handwriting (optionally incorporating tablet kinematics) and a speech encoder that fuses acoustic and ASR-derived linguistic features via cross-modal transformers, trained with supervised and contrastive losses for robust alignment. Methodological considerations include multilingual data collection, standardized preprocessing, calibrated uncertainty, and privacy-preserving learning to support equitable classroom deployment. The evaluation plan compares unimodal baselines, late-fusion ensembles, and the proposed intermediate-fusion architecture across within-site, cross-site, and cross-language settings using AUROC, macro-F1, severity kappa, and fairness audits. Expected outcomes include improved out-of-distribution performance and interpretable per-modality rationales to assist educators and clinicians in early intervention.
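To make the fusion idea concrete, the sketch below is a minimal, hypothetical PyTorch rendering of intermediate fusion via cross-modal attention plus an InfoNCE-style contrastive alignment loss. All module names, dimensions, and the two-class output are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of intermediate fusion: handwriting tokens attend to speech
# tokens (and vice versa), and a contrastive loss aligns the two modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Bidirectional cross-attention between modalities.
        self.hw_to_sp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sp_to_hw = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 2)  # e.g., typical vs. at-risk

    def forward(self, hw_tokens, sp_tokens):
        # Query one modality with the other, then pool and classify.
        hw_ctx, _ = self.hw_to_sp(hw_tokens, sp_tokens, sp_tokens)
        sp_ctx, _ = self.sp_to_hw(sp_tokens, hw_tokens, hw_tokens)
        hw_emb, sp_emb = hw_ctx.mean(1), sp_ctx.mean(1)
        logits = self.classifier(torch.cat([hw_emb, sp_emb], dim=-1))
        return logits, hw_emb, sp_emb

def contrastive_loss(hw_emb, sp_emb, temperature: float = 0.07):
    # InfoNCE: handwriting/speech pairs from the same child are positives;
    # all other pairings in the batch serve as negatives.
    hw = F.normalize(hw_emb, dim=-1)
    sp = F.normalize(sp_emb, dim=-1)
    logits = hw @ sp.t() / temperature
    targets = torch.arange(hw.size(0))
    return F.cross_entropy(logits, targets)

# Joint objective: supervised classification plus cross-modal alignment.
model = CrossModalFusion()
hw_tokens = torch.randn(8, 64, 256)   # 8 samples, 64 handwriting patches
sp_tokens = torch.randn(8, 100, 256)  # 8 samples, 100 speech frames
logits, hw_emb, sp_emb = model(hw_tokens, sp_tokens)
labels = torch.randint(0, 2, (8,))
loss = F.cross_entropy(logits, labels) + contrastive_loss(hw_emb, sp_emb)
```

Because the attention runs in both directions before pooling, each modality's representation is conditioned on the other, which is what distinguishes this intermediate fusion from the late-fusion ensembles used as baselines.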
Keywords :
Dyslexia; Dysgraphia; Handwriting Analysis; Speech Processing; Multimodal Learning; Cross-Modal Attention.
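For the evaluation metrics named in the abstract, the short scikit-learn sketch below (with placeholder labels and scores) shows how AUROC, macro-F1, and a quadratically weighted severity kappa would be computed; the arrays are illustrative, not reported results.

```python
# Illustrative computation of the abstract's evaluation metrics.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, cohen_kappa_score

y_true = np.array([0, 1, 1, 0, 1, 0])               # screening labels
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4])  # model probabilities
y_pred = (y_score >= 0.5).astype(int)

auroc = roc_auc_score(y_true, y_score)
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Severity kappa: quadratically weighted agreement on ordinal severity
# grades (here 0 = none, 1 = mild, 2 = moderate, 3 = severe).
sev_true = [0, 2, 1, 3, 2, 0]
sev_pred = [0, 2, 2, 3, 1, 0]
severity_kappa = cohen_kappa_score(sev_true, sev_pred, weights="quadratic")

print(f"AUROC={auroc:.3f}  macro-F1={macro_f1:.3f}  kappa={severity_kappa:.3f}")
```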
References :
- H. Patel et al., "An Early Detection of Learning Disabilities Using Machine Learning," 2025.
- C. Wang et al., "AI-Powered Educational Data Analysis," 2024.
- G. Aldehim et al., "Deep Learning for Dyslexia Detection: A Comprehensive CNN Approach with Handwriting Analysis and Benchmark Comparisons," 2024.
- Y. Alkhurayyif and A. R. W. Sait, "Deep Learning-Driven Dyslexia Detection Model Using Multi-Modality Data," 2024.
- S. Weraduwa et al., "Advanced Computational Techniques for Dysgraphia Prediction Through Handwriting Recognition Using Machine Learning and Deep Learning Methods," 2024.
- N. D. Alqahtani et al., "Deep Learning Applications for Dyslexia Prediction," 2023.
- Y. Alkhurayyif and A. R. W. Sait, "Deep Learning-Based Model for Detecting Dyslexia," 2023.
- S. Weraduwa et al., "Early Detection and Severity Assessment of Dysgraphia in Sinhala-Speaking Children Using a Multi-Modal Machine Learning Approach," 2025.
- Y. Alkhurayyif, "Developing an Image-Based Dyslexia Detection Model Using the Deep Learning Technique," 2023.
- M. Pragasthi et al., "AI-Powered Learning Disability Detection and Classification System," 2025.
- P. Yogarajah and B. Bhushan, "Deep Learning Approach to Automated Detection of Dyslexia-Dysgraphia," 2020.