Pakistan Science Abstracts
Article details & metrics
IT-2017: Spatio-Temporal Modeling of Sign Language Using YOLOv11 Detection and LSTM Sequences
Author(s):
1. Tabassum Kanwal: Rawalpindi Women University, Rawalpindi, 46000, Pakistan
2. Rehan Mehmood Yousaf: PMAS-University of Arid Agriculture, Rawalpindi, 46000, Pakistan
3. Saud Altaf: National Skills University, Islamabad, 44000, Pakistan
4. Kanza Gulzar: PMAS-University of Arid Agriculture, Rawalpindi, 46000, Pakistan
Abstract:
Sign language recognition (SLR) remains a challenging task due to complex spatio-temporal dependencies, high inter-signer variability, and background clutter in real-world environments. To address these challenges, we propose a hybrid deep learning framework that combines YOLOv11 for spatial feature extraction with a Long Short-Term Memory (LSTM) network for temporal sequence modeling. YOLOv11 introduces architectural innovations such as C3k2 modules, Spatial Pyramid Pooling-Fast (SPPF), and C2PSA (Cross-Stage Partial with Spatial Attention), which improve multi-scale feature aggregation while reducing parameter count. These improvements yield superior localization accuracy and efficiency compared to YOLOv8 and YOLOv10, making YOLOv11 particularly well suited for real-time hand gesture detection. In our framework, YOLOv11 detects and tracks hand regions and key gestures at speeds exceeding 35 fps, achieving 97% mAP@0.5 on custom gesture datasets. The extracted embeddings are fed sequentially into a stacked LSTM, which models the temporal dynamics of hand motion and captures the long-range dependencies essential for recognizing continuous signing. Evaluations on the RWTH-PHOENIX-Weather 2014T dataset and a custom isolated-sign dataset demonstrate that our YOLOv11+LSTM model achieves a 20% relative improvement in sequence-level accuracy and reduces Word Error Rate (WER) by 12% compared to conventional CNN-LSTM baselines. Ablation studies further confirm that YOLOv11's spatial attention mechanisms play a critical role in suppressing background noise and enhancing signer-independent recognition. This research demonstrates that integrating YOLOv11's efficient, attention-driven detection with LSTM's temporal learning yields a robust, real-time SLR system. The framework is scalable to diverse sign languages, supporting practical deployment in assistive technologies and human-computer interaction applications.
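The abstract does not include an implementation, but the second stage of the pipeline it describes (per-frame detector embeddings fed to an LSTM for sequence-level classification) can be sketched in NumPy. Everything below is illustrative: the random `frames` array stands in for YOLOv11 hand-region features, and the function names, dimensions, and single-layer (rather than stacked) LSTM are assumptions, not the authors' code.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gate order in the stacked weights is
    [input, forget, candidate, output]."""
    H = h.shape[0]
    z = W @ x + U @ h + b                  # (4H,) pre-activations
    i = 1 / (1 + np.exp(-z[:H]))           # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))        # forget gate
    g = np.tanh(z[2*H:3*H])                # candidate cell state
    o = 1 / (1 + np.exp(-z[3*H:]))         # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def classify_sequence(frame_embeddings, W, U, b, W_out):
    """Run the LSTM over per-frame embeddings and classify the sign
    from the final hidden state (softmax over class logits)."""
    H = b.shape[0] // 4
    h, c = np.zeros(H), np.zeros(H)
    for x in frame_embeddings:             # one embedding per video frame
        h, c = lstm_step(x, h, c, W, U, b)
    logits = W_out @ h
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

# Toy usage: 30 frames of 64-dim detector embeddings, 10 sign classes.
rng = np.random.default_rng(0)
D, H, C = 64, 32, 10
frames = rng.normal(size=(30, D))          # stand-in for YOLOv11 features
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
W_out = rng.normal(scale=0.1, size=(C, H))
probs = classify_sequence(frames, W, U, b, W_out)
```

The final hidden state summarizes the whole gesture; for continuous signing (as on RWTH-PHOENIX-Weather 2014T) one would instead emit a prediction per time step and decode with a sequence criterion such as CTC.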
Page(s): 135-135
DOI: Not available
Published: 4th International Conference of Sciences "Revamped Scientific Outlook of 21st Century, 2025", November 12, 2025, Volume 1, Issue 1, Year 2025
Keywords:
deep learning, LSTM, Human-computer interaction, Sequence modeling, Real-time sign language recognition, YOLOv11
Citations: 0
Downloads: 0
Views: 5