Deep Audio-Visual Fusion with Attention Mechanisms for Multimodal Perception: A Systematic Review

Dwi Fatmasari

doi:10.64845/jistech.v2i1.315

Authors

Dwi Fatmasari, Universitas Negeri Surabaya

Keywords:

Multimodal Fusion, Audio-Visual Deep Learning, Attention Mechanisms

Abstract

Advances in multimodal deep learning have driven growing interest in attention mechanisms that improve the integration of audio and visual information for tasks such as emotion recognition, event localization, and human-computer interaction. This survey synthesizes recent progress in attention-based fusion methods and traces the evolution from early fusion strategies to more advanced architectures, including self-attention, cross-modal attention, co-attention, and hierarchical attention. Transformer-based models in particular now play a central role in state-of-the-art audio-visual systems because they capture long-range temporal and semantic relationships across modalities. The survey examines how these mechanisms improve contextual understanding and task performance, and it identifies persistent challenges: interpretability, robustness to noisy or missing modalities, modality imbalance, and computational efficiency. Limitations arising from dataset bias and the lack of standardized evaluation metrics are also discussed. Finally, the survey outlines future research directions, including cross-modal transformer architectures, hierarchical attention models, and comprehensive attention-diagnostics frameworks to support trustworthy and effective multimodal artificial intelligence systems.
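To make the idea of cross-modal attention concrete, the following is a minimal sketch and is not taken from the surveyed papers. It assumes plain scaled dot-product attention with no learned projections, in which audio frames act as queries over video frames; the function names, feature sizes, and random toy features are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    Queries come from one modality (here: audio frames); keys and values
    come from the other (here: video frames). Learned projection matrices
    are omitted for brevity, which is an assumption of this sketch.
    Returns (attended, weights), with one row per query.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the other modality's value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return attended, weights


# Hypothetical toy features: 3 audio frames and 5 video frames, 4 dims each.
rng = random.Random(0)
audio = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
video = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

audio_attends_video, attn = cross_modal_attention(audio, video, video)
```

Each row of `attn` is a probability distribution over the video frames for one audio frame, so a long-range dependency shows up as a large weight on a temporally distant frame. Swapping the roles of `audio` and `video` gives the opposite attention direction.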

