Authors :
Jin Wang; Inam Ullah
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/8z39c9j2
Scribd :
https://tinyurl.com/ydnz44x3
DOI :
https://doi.org/10.38124/ijisrt/26mar1387
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
This study synthesizes existing research on multimodal sentiment analysis in financial settings to explain how textual content, vocal cues, and affective expressions jointly shape investor reactions and market fluctuations, using earnings conference calls as a representative example. It adopts a structured literature review approach, organizing and comparing prior work across theoretical foundations, data and feature construction, model architectures, and fusion strategies, including domain-specific language models and multimodal Transformer frameworks. The review concludes that multimodal methods generally outperform text-only approaches because acoustic signals capture soft information, such as managerial uncertainty, stress, and confidence, thereby improving the modelling of market reactions and return-related outcomes. However, progress remains constrained by scarce and heterogeneous multimodal datasets, imperfect cross-modal temporal alignment, and limited transparency and causal identification, which together hinder reproducibility, generalizability, and real-time deployment in practice.
Keywords :
Multimodal Sentiment Analysis; Earnings Conference Calls; Financial Communication; Vocal Emotion; Explainable Artificial Intelligence; Causal Inference; Behavioral Finance.
References :
- Anastasiou, D., Katsafados, A., Ongena, S., & Tzomakas, C. (2025, June 19). Beyond words: Fed chair voice sentiments and US bank stock price crash risk. VoxEU/CEPR. https://cepr.org/voxeu/columns/beyond-words-fed-chair-voice-sentiments-and-us-bank-stock-price-crash-risk
- Baik, B., Kim, A. G., Kim, D. S., & Yoon, S. (2025). Vocal delivery quality in earnings conference calls. Journal of Accounting and Economics, 80(1), 101763. https://doi.org/10.1016/j.jacceco.2024.101763
- Ball, R., & Brown, P. (1968). An empirical evaluation of accounting income numbers. Journal of Accounting Research, 6(2), 159–178. https://doi.org/10.2307/2490232
- Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
- Bernard, V. L., & Thomas, J. K. (1989). Post-earnings-announcement drift: Delayed price response or risk premium. Journal of Accounting Research, 27(Supplement), 1–36. https://doi.org/10.2307/2491062
- Bernard, V. L., & Thomas, J. K. (1990). Evidence that stock prices do not fully reflect the implications of current earnings for future earnings. Journal of Accounting and Economics, 13(4), 305–340. https://doi.org/10.1016/0165-4101(90)90008-R
- Chen, X., Yu, X., Chang, L., Jing, T., He, J., Wang, Z., Luo, Y., Chen, X., Liang, J., Wang, Y., & Xie, J. (2025). The sound of risk: A multimodal physics-informed acoustic model for forecasting market volatility and enhancing market interpretability. arXiv. https://doi.org/10.48550/arxiv.2508.18653
- Da, Y., Bossa, M. N., Díaz Berenguer, A., & Sahli, H. (2024). Reducing bias in sentiment analysis models through causal mediation analysis and targeted counterfactual training. IEEE Access, 12, 10120–10134. https://doi.org/10.1109/access.2024.3353056
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
- Du, K., Xing, F., Mao, R., & Cambria, E. (2024). Financial sentiment analysis: Techniques and applications. ACM Computing Surveys, 56(9), Article 220, 1–42. https://doi.org/10.1145/3649451
- Ergun, Z. E., & Sefer, E. (2025). FinSentiment: Predicting financial sentiment through transfer learning. Intelligent Systems in Accounting, Finance & Management, 32(3). https://doi.org/10.1002/isaf.70015
- Ewertz, J., Knickrehm, C., Nienhaus, M., & Reichmann, D. (2026). Listen closely: Measuring vocal tone in corporate disclosures. Journal of Accounting Research, 64(1), 229–277. https://doi.org/10.1111/1475-679X.70015
- Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383–417. https://doi.org/10.2307/2325486
- Gill, S. H., Mahar, J. A., Mahar, S. A., Razzaq, M. A., Mehmood, A., Choi, G. S., & Ashraf, I. (2026). Prosodic information extraction and classification based on MFCC features and machine learning models. Measurement and Control, 59(1). https://doi.org/10.1177/00202940251315031
- Gupta, I. (2025). Acoustic features of corporate conference calls and market reactions (2010–2025). SSRN. https://doi.org/10.2139/ssrn.5607250
- Hoekstra, J., & Güler, D. (2022). The mediating effect of trading volume on the relationship between investor sentiment and the return of tech companies. Journal of Behavioral Finance, 25, 356–373. https://doi.org/10.1080/15427560.2022.2138394
- Huang, L. (2023). The impact of China economic policy uncertainty on CSI 300: An analysis of the mediating effect of investor sentiment. Advances in Economics, Management and Political Sciences, 51, 20230608. https://doi.org/10.54254/2754-1169/51/20230608
- Huang, Y., Zhang, J., & Liu, S. (2021). Vocal tone and investor reactions: Evidence from matched earnings calls. Review of Accounting Studies, 26(4), 1456–1492. https://doi.org/10.1007/s11142-021-09640-7
- Larcker, D. F., & Zakolyukina, A. A. (2012). Detecting deceptive discussions in conference calls. Journal of Accounting Research, 50(2), 495–540. https://doi.org/10.1111/j.1475-679X.2012.00450.x
- Li, S., & Tang, H. (2024). Multimodal alignment and fusion: A survey. arXiv. https://doi.org/10.48550/arXiv.2411.17040
- Livnat, J., & Mendenhall, R. R. (2006). Comparing the post–earnings announcement drift for surprises calculated from analyst and time series forecasts. Journal of Accounting Research, 44(1), 177–205. https://doi.org/10.1111/j.1475-679X.2006.00196.x
- Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS 2017, Vol. 30). Curran Associates. https://proceedings.neurips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
- Mai, Z., Zhang, J., Xu, Z., & Xiao, Z. (2024). Financial sentiment analysis meets LLaMA 3: A comprehensive analysis. In Proceedings of the 2024 7th International Conference on Machine Learning and Machine Intelligence (MLMI ’24) (pp. 171–175). Association for Computing Machinery. https://doi.org/10.1145/3696271.3696299
- Matsumoto, D., Pronk, M., & Roelofsen, E. (2011). What makes conference calls useful? The information content of managers’ presentations and analysts’ discussion sessions. The Accounting Review, 86(4), 1383–1414. https://doi.org/10.2308/accr-10034
- Mayew, W. J., & Venkatachalam, M. (2012). The power of voice: Managerial affective states and future firm performance. The Journal of Finance, 67(1), 1–43. https://doi.org/10.1111/j.1540-6261.2011.01705.x
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). Association for Computing Machinery. https://doi.org/10.1145/2939672.2939778
- Ruan, L., & Jiang, H. (2025). Stock price prediction using FinBERT-enhanced sentiment with SHAP explainability and differential privacy. Mathematics, 13(17), 2747. https://doi.org/10.3390/math13172747
- Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714
- Todd, A., Bowden, J., & Moshfeghi, Y. (2024). Text-based sentiment analysis in finance: Synthesising the existing literature and exploring future directions. Intelligent Systems in Accounting, Finance & Management, 31(1), e1549. https://doi.org/10.1002/isaf.1549
- Wang, Z., Li, Y., & Zhang, H. (2023). Counterfactual multimodal modeling in financial communication analysis. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 890–902. https://doi.org/10.18653/v1/2023.emnlp-main.78
- Yamaguchi, R., & Yanai, K. (2025). Exploring cross-attention maps in multi-modal diffusion transformers for training-free semantic segmentation. In M. Cho, I. Laptev, D. Tran, A. Yao, & H. B. Zha (Eds.), Computer vision – ACCV 2024 workshops (Lecture Notes in Computer Science, Vol. 15482). Springer. https://doi.org/10.1007/978-981-96-2641-0_18
- Zhang, L., Wu, Z., & Jin, Y. (2022). Structural causal modeling for vocal sentiment in earnings calls. Journal of Financial Markets, 58, 100745. https://doi.org/10.1016/j.finmar.2021.100745