Enhancing the Robustness of Computer Vision Models to Adversarial Perturbations Using Multi-Scale Attention Mechanisms


Authors : Darren Kevin T. Nguemdjom; Alidor M. Mbayandjambe; Grevi B. Nkwimi; Fiston Oshasha; Célestin Muluba; Héritier I. Mbengandji; Ibsen G. BAZIE; Raphael Kpoghomou; Alain M. Kuyunsa

Volume/Issue : Volume 10 - 2025, Issue 4 - April


Google Scholar : https://tinyurl.com/5da2x39a

Scribd : https://tinyurl.com/3bxf42yd

DOI : https://doi.org/10.38124/ijisrt/25apr2118




Abstract : This study evaluates the effectiveness of integrating multi-scale attention mechanisms, specifically the Bottleneck Attention Module (BAM), into deep learning architectures such as ResNet18 and SqueezeNet, using the CIFAR-10 dataset. BAM combines spatial and channel attention, enabling the simultaneous capture of local and global dependencies and thereby enhancing the models’ ability to handle visual disruptions and adversarial attacks. A comparison with existing mechanisms such as ECA-Net and CBAM shows that BAM outperforms them through its parallel design, which refines the spatial and channel dimensions jointly while maintaining computational efficiency. Potential applications include critical domains such as medical imaging and surveillance, where precision and robustness are essential, particularly in dynamic environments or under adversarial constraints. The study also highlights avenues for integrating BAM with emerging architectures like Transformers to combine the advantages of long-range relationships and multi-scale dependencies. Experimental results confirm BAM’s effectiveness: on clean data, ResNet18’s accuracy improves from 74.83% to 90.58%, and SqueezeNet’s from 75.50% to 86.70%. Under adversarial conditions, BAM raises ResNet18’s robustness from 59.2% to 70.4% under PGD attacks, while the hybrid model achieves a maximum accuracy of 75.8%. Activation analysis reveals that BAM strengthens model interpretability by focusing attention on regions of interest, reducing spurious activations and improving overall reliability. These findings position BAM as a strong candidate for modern embedded vision systems that require an optimal balance between performance, robustness, and efficiency.

Keywords : Robustness, Adversarial Perturbations, Multi-Scale Attention, BAM, ResNet18, SqueezeNet.

References :

  1. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. https://doi.org/10.1038/nature14539
  2. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. https://arxiv.org/abs/1409.1556
  3. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1-9). https://arxiv.org/abs/1409.4842
  4. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770-778). https://arxiv.org/abs/1512.03385
  5. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 25, pp. 1097-1105). https://dl.acm.org/doi/10.1145/3065386
  6. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770-778). https://arxiv.org/abs/1512.03385
  7. Wightman, R., Touvron, H., & Jégou, S. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476. https://arxiv.org/abs/2110.00476
  8. Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4700-4708). https://arxiv.org/abs/1608.06993
  9. Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 6105-6114). https://arxiv.org/abs/1905.11946
  10. Wightman, R., Touvron, H., & Jégou, S. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476. https://arxiv.org/abs/2110.00476
  11. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. https://arxiv.org/abs/1602.07360
  12. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. https://arxiv.org/abs/1704.04861
  13. Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems (NeurIPS) (pp. 1135-1143). https://arxiv.org/abs/1506.02626
  14. Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1251-1258). https://arxiv.org/abs/1610.02357
  15. Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6848-6856). https://arxiv.org/abs/1707.01083
  16. Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19). https://arxiv.org/abs/1807.06521
  17. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7132-7141). https://arxiv.org/abs/1709.01507
  18. Wightman, R., Touvron, H., & Jégou, S. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476. https://arxiv.org/abs/2110.00476
  19. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., & Hu, Q. (2020). ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11534-11542). https://arxiv.org/abs/1910.03151
  20. Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 510-519). https://arxiv.org/abs/1903.06586
  21. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. https://arxiv.org/abs/1312.6199
  22. Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. https://arxiv.org/abs/1412.6572
  23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958. https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
  24. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. https://arxiv.org/abs/1502.03167
  25. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. https://arxiv.org/abs/1706.06083
  26. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). Mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. https://arxiv.org/abs/1710.09412
  27. Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). AutoAugment: Learning augmentation strategies from data. arXiv preprint arXiv:1805.09501. https://arxiv.org/abs/1805.09501
  28. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 1-48. https://link.springer.com/article/10.1186/s40537-019-0192-0
  29. Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19). https://arxiv.org/abs/1807.06521
  30. Zhang, H., Dai, Y., & Wang, L. (2021). Multi-scale attention networks for robust feature extraction. arXiv preprint arXiv:2108.12250. https://arxiv.org/abs/2108.12250
  31. Li, Q., Jiang, H., & Zhao, X. (2022). Enhanced feature learning with attention mechanisms in deep networks. arXiv preprint arXiv:2202.05296. https://arxiv.org/abs/2202.05296
  32. Chen, Y., Liu, Z., & Xiao, H. (2020). Improving robustness with bottleneck attention modules. arXiv preprint arXiv:2001.06487. https://arxiv.org/abs/2001.06487
  33. Guo, X., Zhou, Y., & Zhao, H. (2021). BAM-based networks for semantic segmentation in noisy environments. arXiv preprint arXiv:2112.12183. https://arxiv.org/abs/2112.12183
  34. Jiang, R., Wang, F., & Zhao, X. (2022). Adversarially robust networks with enhanced attention. arXiv preprint arXiv:2201.11089. https://arxiv.org/abs/2201.11089
  35. Zhao, L., Huang, W., & Lin, F. (2023). Attention-driven architectures for reliable computer vision. arXiv preprint arXiv:2303.12827. https://arxiv.org/abs/2303.12827
  36. Park, J., Woo, S., Lee, J.-Y., & Kweon, I. S. (2018). BAM: Bottleneck Attention Module. arXiv preprint arXiv:1807.06514. https://arxiv.org/abs/1807.06514
  37. Zhang, Y., Li, P., & Guo, L. (2021). SCORN: Sinter Composition Optimization with Regressive Convolutional Neural Network. IEEE Transactions on Industrial Informatics. https://youshanzhang.github.io/publications/
  38. Li, X., Li, X., Zhang, L., Cheng, G., Shi, J., Lin, Z., Tan, S., & Tong, Y. (2020). Improving Semantic Segmentation via Decoupled Body and Edge Supervision. In European Conference on Computer Vision (ECCV). https://scholar.google.com/citations?hl=en&user=-wOTCE8AAAAJ
  39. Guo, Y., Zhang, L., & Tao, D. (2021). Attention Distillation: Self-supervised Vision Transformer Students Need More Guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). https://arxiv.org/pdf/2210.00944
  40. Hu, J., Shen, L., & Sun, G. (2019). Squeeze-and-Excitation Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8), 2011–2023. DOI: https://doi.org/10.1109/TPAMI.2019.2913372
  41. Zhao, Y., Zhang, L., & Tao, D. (2022). Attention Distillation: Self-supervised Vision Transformer Students Need More Guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). https://arxiv.org/pdf/2210.00944
  42. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030. https://arxiv.org/abs/2103.14030
  43. Chen, Z., Li, Z., Song, L., Chen, L., & Yu, J. (2021). NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields. IEEE Transactions on Visualization and Computer Graphics. https://scholar.google.com/citations?hl=en&user=4MIbSrAAAAA
  44. Jordan, K. (2024). 94% on CIFAR-10 in 3.29 Seconds on a Single GPU. arXiv preprint arXiv:2404.00498. https://arxiv.org/abs/2404.00498
  45. Li, X., Zhou, Y., & Wang, X. (2023). A2-Aug: Adaptive Automated Data Augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 123-132).
  46. Cubuk, E. D., Zoph, B., Shlens, J., & Le, Q. V. (2020). RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 702-703). https://arxiv.org/abs/1909.13719
  47. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., & Hu, Q. (2020). ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 11534-11542). https://arxiv.org/abs/1910.03151
  48. Li, X., Wang, W., Hu, X., & Yang, J. (2019). Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 510-519). https://arxiv.org/abs/1903.06586
  49. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929. https://arxiv.org/abs/2010.11929
  50. Bykovets, E., Metz, Y., El-Assady, M., Keim, D. A., & Buhmann, J. M. (2022). BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning. arXiv preprint arXiv:2208.10481. https://arxiv.org/abs/2208.10481
