Abstract:
Sign language recognition (SLR) remains challenging because of complex spatio-temporal dependencies, high inter-signer variability, and background clutter in real-world environments. To address these challenges, we propose a hybrid deep learning framework that combines YOLOv11 for spatial feature extraction with a Long Short-Term Memory (LSTM) network for temporal sequence modeling. YOLOv11 introduces architectural innovations such as C3k2 modules, Spatial Pyramid Pooling-Fast (SPPF), and C2PSA (Cross-Stage Partial with Spatial Attention), which improve multi-scale feature aggregation while reducing parameter count. These improvements yield better localization accuracy and efficiency than YOLOv8 and YOLOv10, making YOLOv11 well suited to real-time hand gesture detection. In our framework, YOLOv11 detects and tracks hand regions and key gestures at speeds exceeding 35 fps, achieving 97% mAP@0.5 on custom gesture datasets. The extracted embeddings are fed sequentially into a stacked LSTM, which models the temporal dynamics of hand motion and captures the long-range dependencies essential for recognizing continuous signing. Evaluations on the RWTH-PHOENIX-Weather 2014T dataset and a custom isolated sign dataset show that our YOLOv11+LSTM model achieves a 20% relative improvement in sequence-level accuracy and reduces Word Error Rate (WER) by 12% compared with conventional CNN-LSTM baselines. Ablation studies further confirm that YOLOv11's spatial attention mechanisms play a critical role in suppressing background noise and enhancing signer-independent recognition. This research demonstrates that integrating YOLOv11's efficient, attention-driven detection with the LSTM's temporal learning yields a robust, real-time SLR system. The framework is scalable to diverse sign languages, supporting practical deployment in assistive technologies and human-computer interaction applications.
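For readers unfamiliar with the WER metric cited above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance between a reference and a hypothesis, divided by the reference length. The function names and the example sentences are illustrative assumptions and do not come from the paper. The abstract does not state whether the 12% WER reduction is relative or absolute, so `relative_reduction` shows only the relative reading, with hypothetical numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical sentences, not taken from RWTH-PHOENIX-Weather 2014T.
    print(word_error_rate("rain in the north tomorrow", "rain in north today"))
    # Hypothetical WER values illustrating a 12% relative reduction.
    print(relative_reduction(0.50, 0.44))
```

In the first example, one word is deleted and one is substituted against a five-word reference, so the WER is 2/5.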
Page(s):
135-135
DOI:
DOI not available
Published:
Journal: 4th International Conference of Sciences “Revamped Scientific Outlook of 21st Century, 2025”, November 12, 2025, Volume: 1, Issue: 1, Year: 2025
Keywords:
Deep learning, LSTM, Human-computer interaction, Sequence modeling, Real-time sign language recognition, YOLOv11