Abstract:
This paper describes a novel dataset for Urdu text detection and recognition in natural scene images. Many standard benchmarks for Latin text have been published, and effective classification and recognition techniques for extracting text from natural scenes have been proposed on them. Recently, a dataset for multilingual text in natural scene images was published by the International Conference on Document Analysis and Recognition (ICDAR). It contains natural scene images in six languages, including Arabic, Korean and Chinese. Currently, no dataset is available for Urdu text in natural scene images. The main objective of this paper is therefore to create a novel dataset of Urdu text in natural scene images and to make it available to the research community for developing and evaluating state-of-the-art algorithms for text localization and recognition. The dataset consists of cropped word images and segmented character images taken from natural scenes. All characters are manually segmented from the captured images. The images were captured under varying lighting conditions and include low resolution, occlusion and perspective distortion. The dataset comprises 8000 cropped Urdu word images and 16000 segmented Urdu character images in different forms (isolated, initial, medial and final). It is further enlarged by synthetically generating Urdu characters and placing them on real background images. The dataset is compared with recently published Arabic natural scene datasets and Latin text datasets, including ARASTI, ICDAR03 and Chars74k. The proposed dataset contains more natural scene images, as well as segmented characters and cropped words, which suggests that it can serve as a benchmark for recognizing Urdu text in natural scene images.
Keywords:
Urdu Text Scene Character Recognition, Urdu Scene Dataset, Urdu Scene Text, Synthetic Urdu Scene Text