Deep Audio-Visual Fusion with Attention Mechanisms for Multimodal Perception: A Systematic Review
DOI:
https://doi.org/10.64845/jistech.v2i1.315
Keywords:
Multimodal Fusion, Audio-Visual Deep Learning, Attention Mechanisms
Abstract
Advances in multimodal deep learning have driven growing interest in attention mechanisms that enhance audio and visual integration for tasks such as emotion recognition, event localization, and human-computer interaction. This survey synthesizes recent progress in attention-based fusion methods and traces the evolution from early fusion strategies to more advanced architectures, including self-attention, cross-modal attention, co-attention, and hierarchical attention. Transformer-based models, in particular, now play a central role in state-of-the-art audio-visual systems because they capture long-range temporal and semantic relationships across modalities. The survey examines how these mechanisms improve contextual understanding and task performance, and it identifies persistent challenges related to interpretability, robustness to noisy or missing modalities, modality imbalance, and computational efficiency. Limitations associated with dataset bias and the lack of standardized evaluation metrics are also discussed. Finally, the survey presents future research directions, including the development of cross-modal transformer architectures, hierarchical attention models, and comprehensive attention diagnostics frameworks to support trustworthy and effective multimodal artificial intelligence systems.
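As a rough illustration of the cross-modal attention named in the abstract, the sketch below lets audio frames act as queries over video frames using scaled dot-product attention. It is not taken from the survey: the function names, the toy feature vectors, and the omission of learned query/key/value projections are all simplifying assumptions made here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: feature vectors of the attending modality (e.g. audio frames).
    keys, values: feature vectors of the attended modality (e.g. video frames).
    Learned projection matrices are omitted (assumption, for brevity).
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query frame to every frame of the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 audio frames, 3 video frames, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = cross_modal_attention(audio, video, video)
```

Each row of `attn` is a probability distribution over video frames for one audio frame. Inspecting such weights is one simple starting point for the attention diagnostics the abstract calls for, although weights alone are generally not a complete explanation of model behavior.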
License
Copyright (c) 2026 Dwi Fatmasari (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.