Authors :
Pragati Narote; Shrayanshi; Priyanka S Chauhan; Vaddempudi Charan Teja; Ponnaganti Karthik
Volume/Issue :
Volume 9 - 2024, Issue 3 - March
Google Scholar :
https://tinyurl.com/3v75ey3y
Scribd :
https://tinyurl.com/43kvfvj7
DOI :
https://doi.org/10.38124/ijisrt/IJISRT24MAR1362
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Abstract :
Action recognition has seen significant advancements with the integration of spatio-temporal representations, particularly leveraging skeleton-based models and cross-modal data fusion techniques. However, existing approaches face challenges in capturing long-range dependencies within the human body skeleton and in effectively balancing features from diverse modalities. To address these limitations, a novel framework, the Dynamic Spatio-Temporal Graph Attention Transformer (D-STGAT), is proposed, which integrates the strengths of dynamic graph attention mechanisms and transformer architectures for enhanced action recognition. The framework builds upon recent innovations in graph attention networks (GATs) and transformer models. First, the Spatial-Temporal Dynamic Graph Attention Network (ST-DGAT) is introduced, extending the traditional GAT by incorporating a dynamic attention mechanism to capture spatial-temporal patterns within skeleton sequences. By reordering the weighted vector operations in GAT, the approach achieves a global approximate attention function, significantly enhancing its expressivity and capturing long-distance dependencies more effectively than static attention mechanisms. Furthermore, to address the challenges of cross-modal feature representation and fusion, the Spatio-Temporal Cross-Attention Transformer (ST-CAT) is introduced. This model efficiently integrates spatio-temporal information from both video frames and skeleton sequences by employing a combination of full spatio-temporal attention (FAttn), zigzag spatio-temporal attention (ZAttn), and binary spatio-temporal attention (BAttn) modules. Through the proper arrangement of these modules within the transformer encoder and decoder, ST-CAT learns a multi-feature representation that effectively captures the intricate spatio-temporal dynamics inherent in action recognition tasks. Experimental results on the Penn Action, NTU RGB+D 60, and NTU RGB+D 120 datasets showcase the efficacy of the approach, yielding promising performance improvements over previous state-of-the-art methods. In summary, the proposed D-STGAT and ST-CAT frameworks offer novel solutions for action recognition by leveraging dynamic graph attention mechanisms and transformer architectures to capture and fuse spatio-temporal features from diverse modalities, leading to superior performance compared to existing approaches.
Keywords :
Graph Attention Network, Skeleton-Based Models, Dynamic Attention Mechanism, Multi-Feature Learning, Spatial-Temporal Patterns.
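
To make the dynamic attention mechanism described in the abstract concrete, below is a minimal sketch assuming the "reordered weighted vector operations" follow the GATv2-style scheme (Brody et al., 2022), in which the learned attention vector is applied after the nonlinearity rather than before it, making neighbor rankings depend on the query node. All class names, layer sizes, and tensor shapes here are illustrative assumptions, not details taken from the paper, and only the spatial step over one skeleton graph is shown; ST-DGAT additionally models temporal edges across frames.

```python
# Minimal sketch of GATv2-style dynamic graph attention over skeleton joints.
# Assumed/hypothetical: class name, dimensions, and the exact ST-DGAT wiring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphAttention(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim, bias=False)  # joint transform of node pairs
        self.a = nn.Linear(out_dim, 1, bias=False)           # attention vector, applied AFTER the nonlinearity
        self.V = nn.Linear(in_dim, out_dim, bias=False)      # value transform for aggregation

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (N, in_dim) node features, one node per skeleton joint
        # adj: (N, N) adjacency mask (nonzero where an edge exists; include self-loops)
        N = h.size(0)
        # All pairwise concatenations [h_i || h_j]: (N, N, 2*in_dim)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(N, N, -1), h.unsqueeze(0).expand(N, N, -1)], dim=-1
        )
        # Dynamic attention: score(i, j) = a^T LeakyReLU(W [h_i || h_j]).
        # Static GAT instead computes LeakyReLU(a^T [W h_i || W h_j]), whose
        # neighbor ranking is the same for every query node; moving the
        # nonlinearity between W and a removes that limitation.
        scores = self.a(F.leaky_relu(self.W(pairs))).squeeze(-1)  # (N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)                     # attention over neighbors
        return alpha @ self.V(h)                                  # (N, out_dim)

# Tiny usage example: 4 joints on a chain graph with self-loops.
h = torch.randn(4, 16)
adj = torch.eye(4) + torch.diag(torch.ones(3), 1) + torch.diag(torch.ones(3), -1)
out = DynamicGraphAttention(16, 32)(h, adj)  # -> shape (4, 32)
```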
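The abstract describes ST-CAT only at the level of its three attention patterns (FAttn, ZAttn, BAttn) arranged in a transformer encoder-decoder, so the sketch below shows just one plausible cross-modal step in the spirit of full spatio-temporal attention (FAttn): skeleton tokens attending to every space-time video token through a standard multi-head cross-attention block. The module name, token layouts, and dimensions are assumed for illustration; the zigzag and binary variants are not reproduced because their internals are not specified in the abstract.

```python
# Illustrative cross-modal fusion step, loosely modeled on FAttn.
# Assumed/hypothetical: class name, token shapes, and all hyperparameters.
import torch
import torch.nn as nn

class FullSpatioTemporalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skel_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # skel_tokens:  (B, T*J, dim) - one token per frame-joint pair
        # video_tokens: (B, T*P, dim) - one token per frame-patch pair
        # "Full" attention: every skeleton token may attend to every
        # spatio-temporal video token, fusing the two modalities.
        fused, _ = self.attn(query=skel_tokens, key=video_tokens, value=video_tokens)
        return self.norm(skel_tokens + fused)  # residual + norm, transformer-style

# Usage: fuse 32 frames x 25 joints of skeleton tokens with 32 frames x 49 patches.
fuser = FullSpatioTemporalCrossAttention(dim=256)
out = fuser(torch.randn(2, 32 * 25, 256), torch.randn(2, 32 * 49, 256))  # (2, 800, 256)
```

In a full encoder-decoder arrangement such as the one the abstract describes, blocks like this would alternate with self-attention and feed-forward layers, with the choice of attention pattern (full, zigzag, or binary) controlling which space-time positions each query is allowed to see.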