Authors :
Ramneet Singh Chadha; Jugesh; Jasmehar Singh
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/ycxc692w
Scribd :
https://tinyurl.com/yc2cy4ce
DOI :
https://doi.org/10.38124/ijisrt/26mar1065
Abstract :
Face recognition is used extensively in contemporary biometric systems because it is ubiquitous and non-intrusive. Recent advances in deep learning, particularly Vision Transformers (ViTs), have substantially improved recognition performance on standard benchmarks. Real-world face recognition, however, must also be computationally efficient and robust to large pose variations (e.g., frontal vs. profile views). This study evaluates the state-of-the-art LVFace-B model, a ViT-based model trained with Progressive Cluster Optimization, on the challenging pose-variant Celebrities in Frontal-Profile in the Wild dataset. End-to-end performance is measured using MediaPipe's BlazeFace, a lightweight face detector that runs at over 275 frames per second on a mobile CPU. Furthermore, a Fusion Embedding strategy is presented, wherein multiple embeddings of the same identity are averaged into a single representative vector. Three identification scenarios are analyzed: a single embedding per identity, multiple embeddings per identity, and a fused mean embedding per identity. Extensive experiments demonstrate that fusion embedding attains the highest accuracy (Rank-1 = 96.98%) while significantly reducing computational cost. The results show that averaging embeddings improves robustness to pose changes and offers a practical trade-off for large-scale 1:N search, making the proposed method well suited to real-time deployment.
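The fusion step and the 1:N search described above can be sketched as follows. This is a minimal illustration with synthetic unit vectors, not the paper's implementation: the embedding dimension (512), the gallery size, and all names (`fuse_embeddings`, `id_0`, etc.) are assumptions made for the example. The fused vector is the L2-normalized mean of an identity's embeddings, and identification ranks gallery identities by cosine similarity (a dot product of unit vectors) against the probe.

```python
import numpy as np

def l2_normalize(v):
    # Scale vector(s) to unit length so the dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fuse_embeddings(embs):
    # Fusion embedding: average several embeddings of one identity,
    # then re-normalize the mean back onto the unit sphere
    return l2_normalize(np.mean(embs, axis=0))

rng = np.random.default_rng(0)

# Hypothetical gallery: 3 enrolled identities, 4 embeddings (512-d) each
gallery = {f"id_{i}": l2_normalize(rng.normal(size=(4, 512))) for i in range(3)}

# One fused vector per identity -> a single comparison per identity at search time
fused = {name: fuse_embeddings(e) for name, e in gallery.items()}

# Probe: a noisy copy of one of id_1's enrolled embeddings
probe = l2_normalize(gallery["id_1"][0] + 0.1 * rng.normal(size=512))

# 1:N identification: cosine similarity against each fused vector, take Rank-1
scores = {name: float(vec @ probe) for name, vec in fused.items()}
rank1 = max(scores, key=scores.get)
print(rank1)  # the gallery identity with the highest cosine similarity
```

Compared with keeping all per-identity embeddings, this reduces a 1:N search from one comparison per enrolled image to one comparison per identity, which is the computational saving the abstract refers to.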
Keywords :
Computer Vision, Vision Transformer, Face Recognition, LVFace, Fusion Embedding.
References :
- J. You et al., “LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition,” arXiv preprint arXiv:2501.13420, Jan. 2025, doi: 10.48550/arXiv.2501.13420.
- S. Sengupta, J. -C. Chen, C. Castillo, V. M. Patel, R. Chellappa and D. W. Jacobs, "Frontal to profile face verification in the wild," 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 2016, pp. 1-9, doi: 10.1109/WACV.2016.7477558.
- V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann, “BlazeFace: Sub-millisecond neural face detection on mobile GPUs,” arXiv preprint arXiv:1907.05047, Jul. 2019, doi: 10.48550/arXiv.1907.05047.
- C. Lugaresi et al., “MediaPipe: A framework for building perception pipelines,” arXiv preprint arXiv:1906.08172, Jun. 2019, doi: 10.48550/arXiv.1906.08172.
- Md. I. Hossain, Sama-E-Shan, and H. Kabir, “An efficient way to recognize faces using mean embeddings,” 2021 International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), vol. 10, pp. 1–10, Feb. 2021, doi: 10.1109/icaect49130.2021.9392401.
- “Cosine similarity,” Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Cosine_similarity (accessed Sep. 17, 2025).
- T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, Jun. 2006, doi: 10.1016/j.patrec.2005.10.010.
- A. K. Jain, A. A. Ross, and K. Nandakumar, Introduction to Biometrics. 2011. doi: 10.1007/978-0-387-77326-1.
- B. DeCann and A. Ross, “Relating ROC and CMC curves via the biometric menagerie,” IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington, VA, USA, 2013, pp. 1–8, Sep. 2013, doi: 10.1109/btas.2013.6712705.
- N. Damer, A. Opel, and A. Nouak, “CMC curve properties and biometric source weighting in multi-biometric score-level fusion,” 17th International Conference on Information Fusion (FUSION), Salamanca, Spain, 2014, pp. 1–6, Jul. 2014, [Online]. Available: https://publica.fraunhofer.de/handle/publica/387491
- “Multiclass Receiver Operating Characteristic (ROC),” Scikit-learn. https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
- A. Nemavhola, C. Chibaya, and S. Viriri, “A systematic review of CNN architectures, databases, performance metrics, and applications in face recognition,” Information, vol. 16, no. 2, p. 107, Feb. 2025, doi: 10.3390/info16020107.
- J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, and S. Zafeiriou, “ArcFace: Additive Angular Margin Loss for Deep Face Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 5962–5979, 2022.
- M. Kim, A. Jain, and X. Liu, “50 years of automated face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, pp. 1–20, Jan. 2026, doi: 10.1109/tpami.2026.3664269.
- J. Dan et al., "TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective," 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 20585-20596, doi: 10.1109/ICCV51070.2023.01887.