Enhancing the Robustness of Computer Vision Models to Adversarial Perturbations Using Multi-Scale Attention Mechanisms


Authors : Darren Kevin T. Nguemdjom; Alidor M. Mbayandjambe; Grevi B. Nkwimi; Fiston Oshasha; Célestin Muluba; Héritier I. Mbengandji; Ibsen G. BAZIE; Raphael Kpoghomou; Alain M. Kuyunsa

Volume/Issue : Volume 10 - 2025, Issue 4 - April


Google Scholar : https://tinyurl.com/5da2x39a

Scribd : https://tinyurl.com/3bxf42yd

DOI : https://doi.org/10.38124/ijisrt/25apr2118




Abstract : This study evaluates the effectiveness of integrating multi-scale attention mechanisms, specifically the Bottleneck Attention Module (BAM), into deep learning architectures such as ResNet18 and SqueezeNet, using the CIFAR-10 dataset. BAM combines spatial and channel attention, enabling the simultaneous capture of local and global dependencies and thereby enhancing the models’ ability to handle visual disruptions and adversarial attacks. A comparison with existing mechanisms such as ECA-Net and CBAM shows that BAM outperforms them through its parallel design, which refines the spatial and channel dimensions jointly while maintaining computational efficiency. Potential applications include critical domains such as medical imaging and surveillance, where precision and robustness are essential, particularly in dynamic environments or under adversarial constraints. The study also highlights avenues for integrating BAM with emerging architectures like Transformers to combine the advantages of long-range relationships and multi-scale dependencies. Experimental results confirm BAM’s effectiveness: on clean data, ResNet18’s accuracy improves from 74.83% to 90.58%, and SqueezeNet’s from 75.50% to 86.70%. Under adversarial conditions, BAM raises ResNet18’s robustness from 59.2% to 70.4% under PGD attacks, while the hybrid model achieves a maximum accuracy of 75.8%. Activation analysis reveals that BAM strengthens model interpretability by focusing attention on regions of interest, reducing spurious activations and improving overall reliability. These findings position BAM as a strong candidate for modern embedded vision systems that require an optimal balance between performance, robustness, and efficiency.

Keywords : Robustness, Adversarial Perturbations, Multi-Scale Attention, BAM, ResNet18, SqueezeNet.

References :

  1. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. https://doi.org/10.1038/nature14539
  2. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://arxiv.org/abs/1409.1556
  3. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1-9). https://arxiv.org/abs/1409.4842
  4. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770-778). https://arxiv.org/abs/1512.03385
  5. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 25, pp. 1097-1105). https://dl.acm.org/doi/10.1145/3065386
  6. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770-778). https://arxiv.org/abs/1512.03385
  7. Wightman, R., Touvron, H., & Jégou, S. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476. https://arxiv.org/abs/2110.00476
  8. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4700-4708). https://arxiv.org/abs/1608.06993
  9. Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 6105-6114). https://arxiv.org/abs/1905.11946
  10. Wightman, R., Touvron, H., & Jégou, S. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476. https://arxiv.org/abs/2110.00476
  11. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. https://arxiv.org/abs/1602.07360
  12. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. https://arxiv.org/abs/1704.04861
  13. Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems (NeurIPS) (pp. 1135-1143). https://arxiv.org/abs/1506.02626
  14. Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1251-1258). https://arxiv.org/abs/1610.02357
  15. Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6848-6856). https://arxiv.org/abs/1707.01083
  16. Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19). https://arxiv.org/abs/1807.06521
  17. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7132-7141). https://arxiv.org/abs/1709.01507
  18. Wightman, R., Touvron, H., & Jégou, S. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476. https://arxiv.org/abs/2110.00476
  19. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., & Hu, Q. (2020). ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11534-11542). https://arxiv.org/abs/1910.03151
  20. Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 510-519). https://arxiv.org/abs/1903.06586
  21. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. https://arxiv.org/abs/1312.6199
  22. Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. https://arxiv.org/abs/1412.6572
  23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958. https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
  24. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. https://arxiv.org/abs/1502.03167
  25. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. https://arxiv.org/abs/1706.06083
  26. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). Mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. https://arxiv.org/abs/1710.09412
  27. Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). AutoAugment: Learning augmentation strategies from data. arXiv preprint arXiv:1805.09501. https://arxiv.org/abs/1805.09501
  28. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1-48. https://link.springer.com/article/10.1186/s40537-019-0192-0
  29. Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19). https://arxiv.org/abs/1807.06521
  30. Zhang, H., Dai, Y., & Wang, L. (2021). Multi-scale attention networks for robust feature extraction. arXiv preprint arXiv:2108.12250. https://arxiv.org/abs/2108.12250
  31. Li, Q., Jiang, H., & Zhao, X. (2022). Enhanced feature learning with attention mechanisms in deep networks. arXiv preprint arXiv:2202.05296. https://arxiv.org/abs/2202.05296
  32. Chen, Y., Liu, Z., & Xiao, H. (2020). Improving robustness with bottleneck attention modules. arXiv preprint arXiv:2001.06487. https://arxiv.org/abs/2001.06487
  33. Guo, X., Zhou, Y., & Zhao, H. (2021). BAM-based networks for semantic segmentation in noisy environments. arXiv preprint arXiv:2112.12183. https://arxiv.org/abs/2112.12183
  34. Jiang, R., Wang, F., & Zhao, X. (2022). Adversarially robust networks with enhanced attention. arXiv preprint arXiv:2201.11089. https://arxiv.org/abs/2201.11089
  35. Zhao, L., Huang, W., & Lin, F. (2023). Attention-driven architectures for reliable computer vision. arXiv preprint arXiv:2303.12827. https://arxiv.org/abs/2303.12827
  36. Park, J., Woo, S., Lee, J.-Y., & Kweon, I. S. (2018). BAM: Bottleneck Attention Module. arXiv preprint arXiv:1807.06514. https://arxiv.org/abs/1807.06514
  37. Zhang, Y., Li, P., & Guo, L. (2021). SCORN: Sinter Composition Optimization with Regressive Convolutional Neural Network. IEEE Transactions on Industrial Informatics. https://youshanzhang.github.io/publications/
  38. Li, X., Li, X., Zhang, L., Cheng, G., Shi, J., Lin, Z., Tan, S., & Tong, Y. (2020). Improving Semantic Segmentation via Decoupled Body and Edge Supervision. In European Conference on Computer Vision (ECCV). https://scholar.google.com/citations?hl=en&user=-wOTCE8AAAAJ
  39. Guo, Y., Zhang, L., & Tao, D. (2021). Attention Distillation: Self-supervised Vision Transformer Students Need More Guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). https://arxiv.org/pdf/2210.00944
  40. Hu, J., Shen, L., & Sun, G. (2019). Squeeze-and-Excitation Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8), 2011–2023. DOI: https://doi.org/10.1109/TPAMI.2019.2913372
  41. Zhao, Y., Zhang, L., & Tao, D. (2022). Attention Distillation: Self-supervised Vision Transformer Students Need More Guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). https://arxiv.org/pdf/2210.00944
  42. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030. https://arxiv.org/abs/2103.14030
  43. Chen, Z., Li, Z., Song, L., Chen, L., & Yu, J. (2021). NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields. IEEE Transactions on Visualization and Computer Graphics. https://scholar.google.com/citations?hl=en&user=4MIbSrAAAAA
  44. Jordan, K. (2024). 94% on CIFAR-10 in 3.29 Seconds on a Single GPU. arXiv preprint arXiv:2404.00498. https://arxiv.org/abs/2404.00498
  45. Li, X., Zhou, Y., & Wang, X. (2023). A2-Aug: Adaptive Automated Data Augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 123-132).
  46. Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 702-703). https://arxiv.org/abs/1909.13719
  47. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., & Hu, Q. (2020). ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11534-11542). https://arxiv.org/abs/1910.03151
  48. Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 510-519). https://arxiv.org/abs/1903.06586
  49. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929. https://arxiv.org/abs/2010.11929
  50. Bykovets, E., Metz, Y., El-Assady, M., Keim, D. A., & Buhmann, J. M. (2022). BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning. arXiv preprint arXiv:2208.10481. https://arxiv.org/abs/2208.10481
