Authors :
Jin Wang; Inam Ullah
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/8z39c9j2
Scribd :
https://tinyurl.com/ydnz44x3
DOI :
https://doi.org/10.38124/ijisrt/26mar1387
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
This study synthesizes existing research on multimodal sentiment analysis in financial settings to explain how textual content, vocal cues, and affective expressions jointly shape investor reactions and market fluctuations, using earnings conference calls as a representative example. It adopts a structured literature review approach, organizing and comparing prior work across theoretical foundations, data and feature construction, model architectures, and fusion strategies, including domain-specific language models and multimodal Transformer frameworks. The review concludes that multimodal methods generally outperform text-only approaches because acoustic signals capture soft information, such as managerial uncertainty, stress, and confidence, thereby improving the modelling of market reactions and return-related outcomes. However, progress remains constrained by scarce and heterogeneous multimodal datasets, imperfect cross-modal temporal alignment, and limited transparency and causal identification, which together hinder reproducibility, generalizability, and real-time deployment in practice.
Keywords :
Multimodal Sentiment Analysis; Earnings Conference Calls; Financial Communication; Vocal Emotion; Explainable Artificial Intelligence; Causal Inference; Behavioral Finance.
References :
- Anastasiou, D., Katsafados, A., Ongena, S., & Tzomakas, C. (2025, June 19). Beyond words: Fed chair voice sentiments and US bank stock price crash risk. VoxEU/CEPR. https://cepr.org/voxeu/columns/beyond-words-fed-chair-voice-sentiments-and-us-bank-stock-price-crash-risk
- Baik, B., Kim, A. G., Kim, D. S., & Yoon, S. (2025). Vocal delivery quality in earnings conference calls. Journal of Accounting and Economics, 80(1), 101763. https://doi.org/10.1016/j.jacceco.2024.101763
- Ball, R., & Brown, P. (1968). An empirical evaluation of accounting income numbers. Journal of Accounting Research, 6(2), 159–178. https://doi.org/10.2307/2490232
- Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
- Bernard, V. L., & Thomas, J. K. (1989). Post-earnings-announcement drift: Delayed price response or risk premium. Journal of Accounting Research, 27(Supplement), 1–36. https://doi.org/10.2307/2491062
- Bernard, V. L., & Thomas, J. K. (1990). Evidence that stock prices do not fully reflect the implications of current earnings for future earnings. Journal of Accounting and Economics, 13(4), 305–340. https://doi.org/10.1016/0165-4101(90)90008-R
- Chen, X., Yu, X., Chang, L., Jing, T., He, J., Wang, Z., Luo, Y., Chen, X., Liang, J., Wang, Y., & Xie, J. (2025). The sound of risk: A multimodal physics-informed acoustic model for forecasting market volatility and enhancing market interpretability. arXiv. https://doi.org/10.48550/arxiv.2508.18653
- Da, Y., Bossa, M. N., Díaz Berenguer, A., & Sahli, H. (2024). Reducing bias in sentiment analysis models through causal mediation analysis and targeted counterfactual training. IEEE Access, 12, 10120–10134. https://doi.org/10.1109/access.2024.3353056
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics.
- Du, K., Xing, F., Mao, R., & Cambria, E. (2024). Financial sentiment analysis: Techniques and applications. ACM Computing Surveys, 56(9), Article 220, 1–42. https://doi.org/10.1145/3649451
- Ergun, Z. E., & Sefer, E. (2025). FinSentiment: Predicting financial sentiment through transfer learning. Intelligent Systems in Accounting, Finance & Management, 32(3). https://doi.org/10.1002/isaf.70015
- Ewertz, J., Knickrehm, C., Nienhaus, M., & Reichmann, D. (2026). Listen closely: Measuring vocal tone in corporate disclosures. Journal of Accounting Research, 64(1), 229–277. https://doi.org/10.1111/1475-679X.70015
- Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383–417. https://doi.org/10.2307/2325486
- Gill, S. H., Mahar, J. A., Mahar, S. A., Razzaq, M. A., Mehmood, A., Choi, G. S., & Ashraf, I. (2026). Prosodic information extraction and classification based on MFCC features and machine learning models. Measurement and Control, 59(1). https://doi.org/10.1177/00202940251315031
- Gupta, I. (2025). Acoustic features of corporate conference calls and market reactions (2010–2025). SSRN. https://doi.org/10.2139/ssrn.5607250
- Hoekstra, J., & Güler, D. (2022). The mediating effect of trading volume on the relationship between investor sentiment and the return of tech companies. Journal of Behavioral Finance, 25, 356–373. https://doi.org/10.1080/15427560.2022.2138394
- Huang, L. (2023). The impact of China economic policy uncertainty on CSI 300: An analysis of the mediating effect of investor sentiment. Advances in Economics, Management and Political Sciences, 51, 20230608. https://doi.org/10.54254/2754-1169/51/20230608
- Huang, Y., Zhang, J., & Liu, S. (2021). Vocal tone and investor reactions: Evidence from matched earnings calls. Review of Accounting Studies, 26(4), 1456–1492. https://doi.org/10.1007/s11142-021-09640-7
- Larcker, D. F., & Zakolyukina, A. A. (2012). Detecting deceptive discussions in conference calls. Journal of Accounting Research, 50(2), 495–540. https://doi.org/10.1111/j.1475-679X.2012.00450.x
- Li, S., & Tang, H. (2024). Multimodal alignment and fusion: A survey. arXiv. https://doi.org/10.48550/arXiv.2411.17040
- Livnat, J., & Mendenhall, R. R. (2006). Comparing the post–earnings announcement drift for surprises calculated from analyst and time series forecasts. Journal of Accounting Research, 44(1), 177–205. https://doi.org/10.1111/j.1475-679X.2006.00196.x
- Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS 2017, Vol. 30). Curran Associates. https://proceedings.neurips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
- Mai, Z., Zhang, J., Xu, Z., & Xiao, Z. (2024). Financial sentiment analysis meets LLaMA 3: A comprehensive analysis. In Proceedings of the 2024 7th International Conference on Machine Learning and Machine Intelligence (MLMI ’24) (pp. 171–175). Association for Computing Machinery. https://doi.org/10.1145/3696271.3696299
- Matsumoto, D., Pronk, M., & Roelofsen, E. (2011). What makes conference calls useful? The information content of managers’ presentations and analysts’ discussion sessions. The Accounting Review, 86(4), 1383–1414. https://doi.org/10.2308/accr-10034
- Mayew, W. J., & Venkatachalam, M. (2012). The power of voice: Managerial affective states and future firm performance. The Journal of Finance, 67(1), 1–43. https://doi.org/10.1111/j.1540-6261.2011.01705.x
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). Association for Computing Machinery. https://doi.org/10.1145/2939672.2939778
- Ruan, L., & Jiang, H. (2025). Stock price prediction using FinBERT-enhanced sentiment with SHAP explainability and differential privacy. Mathematics, 13(17), 2747. https://doi.org/10.3390/math13172747
- Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714
- Todd, A., Bowden, J., & Moshfeghi, Y. (2024). Text-based sentiment analysis in finance: Synthesising the existing literature and exploring future directions. Intelligent Systems in Accounting, Finance & Management, 31(1), e1549. https://doi.org/10.1002/isaf.1549
- Wang, Z., Li, Y., & Zhang, H. (2023). Counterfactual multimodal modeling in financial communication analysis. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 890–902. https://doi.org/10.18653/v1/2023.emnlp-main.78
- Yamaguchi, R., & Yanai, K. (2025). Exploring cross-attention maps in multi-modal diffusion transformers for training-free semantic segmentation. In M. Cho, I. Laptev, D. Tran, A. Yao, & H. B. Zha (Eds.), Computer vision – ACCV 2024 workshops (Lecture Notes in Computer Science, Vol. 15482). Springer. https://doi.org/10.1007/978-981-96-2641-0_18
- Zhang, L., Wu, Z., & Jin, Y. (2022). Structural causal modeling for vocal sentiment in earnings calls. Journal of Financial Markets, 58, 100745. https://doi.org/10.1016/j.finmar.2021.100745