A model for Sindhi text segmentation into word tokens. | [Sindh University Research Journal • 2012]

Author(s):

1. J. A. Mahar: Faculty of Engineering, Science and Technology, Hamdard University, Karachi, Pakistan

2. H. Shaikh: Department of Computer Science, Shah Abdul Latif University, Khairpur, Pakistan

3. G. Q. Memon: Faculty of Engineering, Science and Technology, Hamdard University, Karachi, Pakistan

Abstract:

A corpus is a prerequisite for conducting experiments in computational linguistics on any language. Corpora are generally downloaded from the Internet in various formats, and downloaded corpora usually contain word ambiguities that hinder computational processing. In the Sindhi language, two types of ambiguity are commonly found: compound words typed without an embedded space, and typographical errors. Without correct segmentation of text into word tokens, it is difficult for linguistic applications to achieve good results; tokenization is therefore an essential component of natural language and speech processing applications. This paper presents a new model for correctly segmenting Sindhi text into words. The model consists of three layers: layer 1 takes the input text and segments it on white space, layer 2 segments simple and compound words, and layer 3 segments complex words. The tokenizer was tested on 2792 Sindhi words and achieved an accuracy of 91.76%.
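The three-layer design described in the abstract can be illustrated with a minimal sketch. The abstract does not give the rules used inside each layer, so everything below is an assumption made for illustration: the function names, the lexicon lookup, the greedy longest-match strategy in layer 2 and the backtracking search in layer 3 are hypothetical and are not the authors' method. A toy Latin-script lexicon stands in for a Sindhi word list.

```python
from functools import lru_cache
from typing import List, Optional, Set


def layer1_whitespace(text: str) -> List[str]:
    """Layer 1 (assumed): split the input text on white space."""
    return text.split()


def layer2_greedy(chunk: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Layer 2 (assumed): accept a known word as is, or split a chunk typed
    without spaces by greedy longest match. Returns None if greedy matching fails."""
    tokens: List[str] = []
    i = 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):  # try the longest candidate first
            if chunk[i:j] in lexicon:
                tokens.append(chunk[i:j])
                i = j
                break
        else:
            return None  # no lexicon word starts at position i
    return tokens


def layer3_backtrack(chunk: str, lexicon: Set[str]) -> List[str]:
    """Layer 3 (assumed): search all split points with backtracking.
    If no full segmentation exists, keep the chunk whole (e.g. a possible typo)."""

    @lru_cache(maxsize=None)
    def solve(i: int):
        if i == len(chunk):
            return ()
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in lexicon:
                rest = solve(j)
                if rest is not None:
                    return (chunk[i:j],) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else [chunk]


def tokenize(text: str, lexicon: Set[str]) -> List[str]:
    """Run the three layers in order over the input text."""
    tokens: List[str] = []
    for chunk in layer1_whitespace(text):
        split = layer2_greedy(chunk, lexicon)
        if split is None:
            split = layer3_backtrack(chunk, lexicon)
        tokens.extend(split)
    return tokens


# Toy lexicon (hypothetical placeholder for a Sindhi word list).
LEXICON = {"into", "tokens", "word", "the", "them", "men"}
print(tokenize("word intotokens themen xyz", LEXICON))
```

In this sketch the chunk "themen" shows why a later layer is useful: greedy matching takes "them" and then gets stuck, while the backtracking search recovers "the" and "men". A chunk with no valid segmentation, such as "xyz", is passed through unchanged rather than being guessed at.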

Page(s): 43-48

DOI: not available

Published: Journal: Sindh University Research Journal, Volume: 44, Issue: 1, Year: 2012

Keywords:

Keywords are not available for this article.
