Pakistan Science Abstracts
Article details & metrics
IT-2017: Spatio-Temporal Modeling of Sign Language Using YOLOv11 Detection and LSTM Sequences
Author(s):
1. Tabassum Kanwal: Rawalpindi Women University, Rawalpindi, 46000, Pakistan
2. Rehan Mehmood Yousaf: PMAS-University of Arid Agriculture, Rawalpindi, 46000, Pakistan
3. Saud Altaf: National Skills University, Islamabad, 44000, Pakistan
4. Kanza Gulzar: PMAS-University of Arid Agriculture, Rawalpindi, 46000, Pakistan
Abstract:
Sign language recognition (SLR) remains a challenging task due to complex spatio-temporal dependencies, high inter-signer variability, and background clutter in real-world environments. To address these challenges, we propose a hybrid deep learning framework that combines YOLOv11 for spatial feature extraction with a Long Short-Term Memory (LSTM) network for temporal sequence modeling. YOLOv11 introduces architectural innovations such as C3k2 modules, Spatial Pyramid Pooling-Fast (SPPF), and C2PSA (Cross-Stage Partial with Spatial Attention), which improve multi-scale feature aggregation while reducing parameter count. These improvements yield superior localization accuracy and efficiency compared to YOLOv8 and YOLOv10, making YOLOv11 particularly well suited for real-time hand gesture detection. In our framework, YOLOv11 detects and tracks hand regions and key gestures at speeds exceeding 35 fps, achieving 97% mAP@0.5 on custom gesture datasets. The extracted embeddings are fed sequentially into a stacked LSTM, which models the temporal dynamics of hand motion and captures the long-range dependencies essential for recognizing continuous signing. Evaluations on the RWTH-PHOENIX-Weather 2014T dataset and a custom isolated-sign dataset demonstrate that our YOLOv11+LSTM model achieves a 20% relative improvement in sequence-level accuracy and reduces Word Error Rate (WER) by 12% compared to conventional CNN-LSTM baselines. Ablation studies further confirm that YOLOv11's spatial attention mechanisms play a critical role in suppressing background noise and enhancing signer-independent recognition. This research demonstrates that integrating YOLOv11's efficient, attention-driven detection with LSTM's temporal learning yields a robust, real-time SLR system. The framework is scalable to diverse sign languages, supporting practical deployment in assistive technologies and human-computer interaction applications.
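The abstract does not include an implementation, but the second stage of the pipeline it describes (per-frame detector embeddings fed to an LSTM for sequence-level classification) can be sketched in NumPy. Everything below is illustrative: the random `frames` array stands in for YOLOv11 hand-region features, and the function names, dimensions, and single-layer (rather than stacked) LSTM are assumptions, not the authors' code.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gate order in the stacked weights is
    [input, forget, candidate, output]."""
    H = h.shape[0]
    z = W @ x + U @ h + b                  # (4H,) pre-activations
    i = 1 / (1 + np.exp(-z[:H]))           # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))        # forget gate
    g = np.tanh(z[2*H:3*H])                # candidate cell state
    o = 1 / (1 + np.exp(-z[3*H:]))         # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def classify_sequence(frame_embeddings, W, U, b, W_out):
    """Run the LSTM over per-frame embeddings and classify the sign
    from the final hidden state (softmax over class logits)."""
    H = b.shape[0] // 4
    h, c = np.zeros(H), np.zeros(H)
    for x in frame_embeddings:             # one embedding per video frame
        h, c = lstm_step(x, h, c, W, U, b)
    logits = W_out @ h
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

# Toy usage: 30 frames of 64-dim detector embeddings, 10 sign classes.
rng = np.random.default_rng(0)
D, H, C = 64, 32, 10
frames = rng.normal(size=(30, D))          # stand-in for YOLOv11 features
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
W_out = rng.normal(scale=0.1, size=(C, H))
probs = classify_sequence(frames, W, U, b, W_out)
```

The final hidden state summarizes the whole gesture; for continuous signing (as on RWTH-PHOENIX-Weather 2014T) one would instead emit a prediction per time step and decode with a sequence criterion such as CTC.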
Page(s): 135-135
DOI: Not available
Published: 4th International Conference of Sciences "Revamped Scientific Outlook of 21st Century, 2025", November 12, 2025, Volume 1, Issue 1, Year 2025
Keywords:
deep learning, LSTM, Human-computer interaction, Sequence modeling, Real-time sign language recognition, YOLOv11
Citations: 0
Downloads: 0
Views: 5