


A Comprehensive Review of Multimodal Financial Sentiment Analysis


Authors : Jin Wang; Inam Ullah

Volume/Issue : Volume 11 - 2026, Issue 3 - March


Google Scholar : https://tinyurl.com/8z39c9j2

Scribd : https://tinyurl.com/ydnz44x3

DOI : https://doi.org/10.38124/ijisrt/26mar1387



Abstract : This study synthesizes existing research on multimodal sentiment analysis in financial settings, using earnings conference calls as a representative example, to explain how textual content, vocal cues, and affective expressions jointly shape investor reactions and market fluctuations. It adopts a structured literature review approach, organizing and comparing prior work across theoretical foundations, data and feature construction, model architectures, and fusion strategies, including domain-specific language models and multimodal Transformer frameworks. The review concludes that multimodal methods generally outperform text-only approaches because acoustic signals capture soft information, such as managerial uncertainty, stress, and confidence, thereby improving the modelling of market reactions and return-related outcomes. However, progress is constrained by scarce and heterogeneous multimodal datasets, imperfect cross-modal temporal alignment, and limited transparency and causal identification, which together hinder reproducibility, generalizability, and real-time deployment in practice.

Keywords : Multimodal Sentiment Analysis; Earnings Conference Calls; Financial Communication; Vocal Emotion; Explainable Artificial Intelligence; Causal Inference; Behavioral Finance.
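The early- versus late-fusion distinction surveyed above can be illustrated with a minimal sketch. The feature dimensions, the random features, the linear scorer, and the 0.6/0.4 weights below are illustrative assumptions for exposition only, not the methods of any paper in this review.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-call features (dimensions are arbitrary assumptions):
text_emb = rng.normal(size=768)   # stand-in for a FinBERT-style text embedding
acoustic = rng.normal(size=40)    # stand-in for MFCC-style acoustic statistics

# Early fusion: concatenate modality features before any model sees them.
early = np.concatenate([text_emb, acoustic])  # shape (808,)

# Late fusion: score each modality separately, then combine the scores.
def score(features: np.ndarray, weights: np.ndarray) -> float:
    """Toy linear sentiment scorer squashed into [-1, 1]."""
    return float(np.tanh(features @ weights))

text_score = score(text_emb, rng.normal(size=768))
acoustic_score = score(acoustic, rng.normal(size=40))
fused = 0.6 * text_score + 0.4 * acoustic_score  # modality weights are an assumption

print(early.shape, fused)
```

In practice, the surveyed Transformer frameworks replace the toy linear scorer with learned encoders and replace the fixed weights with cross-modal attention, but the structural choice, fusing features versus fusing scores, is the same.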

References :

  1. Anastasiou, D., Katsafados, A., Ongena, S., & Tzomakas, C. (June 19, 2025). Beyond words: Fed chair voice sentiments and US bank stock price crash risk. VoxEU/CEPR. https://cepr.org/voxeu/columns/beyond-words-fed-chair-voice-sentiments-and-us-bank-stock-price-crash-risk
  2. Baik, B., Kim, A. G., Kim, D. S., & Yoon, S. (2025). Vocal delivery quality in earnings conference calls. Journal of Accounting and Economics, 80(1), 101763. https://doi.org/10.1016/j.jacceco.2024.101763
  3. Ball, R., & Brown, P. (1968). An empirical evaluation of accounting income numbers. Journal of Accounting Research, 6(2), 159–178. https://doi.org/10.2307/2490232
  4. Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
  5. Bernard, V. L., & Thomas, J. K. (1989). Post-earnings-announcement drift: Delayed price response or risk premium? Journal of Accounting Research, 27(Supplement), 1–36. https://doi.org/10.2307/2491062
  6. Bernard, V. L., & Thomas, J. K. (1990). Evidence that stock prices do not fully reflect the implications of current earnings for future earnings. Journal of Accounting and Economics, 13(4), 305–340. https://doi.org/10.1016/0165-4101(90)90008-R
  7. Chen, X., Yu, X., Chang, L., Jing, T., He, J., Wang, Z., Luo, Y., Chen, X., Liang, J., Wang, Y., & Xie, J. (2025). The sound of risk: A multimodal physics-informed acoustic model for forecasting market volatility and enhancing market interpretability. arXiv. https://doi.org/10.48550/arxiv.2508.18653
  8. Da, Y., Bossa, M. N., Díaz Berenguer, A., & Sahli, H. (2024). Reducing bias in sentiment analysis models through causal mediation analysis and targeted counterfactual training. IEEE Access, 12, 10120–10134. https://doi.org/10.1109/access.2024.3353056
  9. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
  10. Du, K., Xing, F., Mao, R., & Cambria, E. (2024). Financial sentiment analysis: Techniques and applications. ACM Computing Surveys, 56(9), Article 220, 1–42. https://doi.org/10.1145/3649451
  11. Ergun, Z. E., & Sefer, E. (2025). FinSentiment: Predicting financial sentiment through transfer learning. Intelligent Systems in Accounting, Finance & Management, 32(3). https://doi.org/10.1002/isaf.70015
  12. Ewertz, J., Knickrehm, C., Nienhaus, M., & Reichmann, D. (2026). Listen closely: Measuring vocal tone in corporate disclosures. Journal of Accounting Research, 64(1), 229–277. https://doi.org/10.1111/1475-679X.70015
  13. Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383–417. https://doi.org/10.2307/2325486
  14. Gill, S. H., Mahar, J. A., Mahar, S. A., Razzaq, M. A., Mehmood, A., Choi, G. S., & Ashraf, I. (2026). Prosodic information extraction and classification based on MFCC features and machine learning models. Measurement and Control, 59(1). https://doi.org/10.1177/00202940251315031
  15. Gupta, I. (2025). Acoustic features of corporate conference calls and market reactions (2010–2025). SSRN. https://doi.org/10.2139/ssrn.5607250
  16. Hoekstra, J., & Güler, D. (2022). The mediating effect of trading volume on the relationship between investor sentiment and the return of tech companies. Journal of Behavioral Finance, 25, 356–373. https://doi.org/10.1080/15427560.2022.2138394
  17. Huang, L. (2023). The impact of China economic policy uncertainty on CSI 300: An analysis of the mediating effect of investor sentiment. Advances in Economics, Management and Political Sciences, 51, 20230608. https://doi.org/10.54254/2754-1169/51/20230608
  18. Huang, Y., Zhang, J., & Liu, S. (2021). Vocal tone and investor reactions: Evidence from matched earnings calls. Review of Accounting Studies, 26(4), 1456–1492. https://doi.org/10.1007/s11142-021-09640-7
  19. Larcker, D. F., & Zakolyukina, A. A. (2012). Detecting deceptive discussions in conference calls. Journal of Accounting Research, 50(2), 495–540. https://doi.org/10.1111/j.1475-679X.2012.00450.x
  20. Li, S., & Tang, H. (2024). Multimodal alignment and fusion: A survey. arXiv. https://doi.org/10.48550/arXiv.2411.17040
  21. Livnat, J., & Mendenhall, R. R. (2006). Comparing the post–earnings announcement drift for surprises calculated from analyst and time series forecasts. Journal of Accounting Research, 44(1), 177–205. https://doi.org/10.1111/j.1475-679X.2006.00196.x
  22. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS 2017, Vol. 30). Curran Associates. https://proceedings.neurips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
  23. Mai, Z., Zhang, J., Xu, Z., & Xiao, Z. (2024). Financial sentiment analysis meets LLaMA 3: A comprehensive analysis. In Proceedings of the 2024 7th International Conference on Machine Learning and Machine Intelligence (MLMI ’24) (pp. 171–175). Association for Computing Machinery. https://doi.org/10.1145/3696271.3696299
  24. Matsumoto, D., Pronk, M., & Roelofsen, E. (2011). What makes conference calls useful? The information content of managers’ presentations and analysts’ discussion sessions. The Accounting Review, 86(4), 1383–1414. https://doi.org/10.2308/accr-10034
  25. Mayew, W. J., & Venkatachalam, M. (2012). The power of voice: Managerial affective states and future firm performance. The Journal of Finance, 67(1), 1–43. https://doi.org/10.1111/j.1540-6261.2011.01705.x
  26. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). Association for Computing Machinery. https://doi.org/10.1145/2939672.2939778
  27. Ruan, L., & Jiang, H. (2025). Stock price prediction using FinBERT-enhanced sentiment with SHAP explainability and differential privacy. Mathematics, 13(17), 2747. https://doi.org/10.3390/math13172747
  28. Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714
  29. Todd, A., Bowden, J., & Moshfeghi, Y. (2024). Text-based sentiment analysis in finance: Synthesising the existing literature and exploring future directions. Intelligent Systems in Accounting, Finance & Management, 31(1), e1549. https://doi.org/10.1002/isaf.1549
  30. Wang, Z., Li, Y., & Zhang, H. (2023). Counterfactual multimodal modeling in financial communication analysis. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 890–902. https://doi.org/10.18653/v1/2023.emnlp-main.78
  31. Yamaguchi, R., & Yanai, K. (2025). Exploring cross-attention maps in multi-modal diffusion transformers for training-free semantic segmentation. In M. Cho, I. Laptev, D. Tran, A. Yao, & H. B. Zha (Eds.), Computer vision – ACCV 2024 workshops (Lecture Notes in Computer Science, Vol. 15482). Springer. https://doi.org/10.1007/978-981-96-2641-0_18
  32. Zhang, L., Wu, Z., & Jin, Y. (2022). Structural causal modeling for vocal sentiment in earnings calls. Journal of Financial Markets, 58, 100745. https://doi.org/10.1016/j.finmar.2021.100745

