Pakistan Science Abstracts
A model for Sindhi text segmentation into word tokens.
Author(s):
1. J. A. Mahar: Faculty of Engineering, Science and Technology, Hamdard University, Karachi, Pakistan
2. H. Shaikh: Department of Computer Science, Shah Abdul Latif University, Khairpur, Pakistan
3. G. Q. Memon: Faculty of Engineering, Science and Technology, Hamdard University, Karachi, Pakistan
Abstract:
A corpus is a prerequisite for experiments in computational linguistics on any language. Corpora are generally downloaded from the Internet in different formats, and the downloaded text usually contains word ambiguities that complicate computational processing. In Sindhi, two kinds of ambiguity are commonly found: compound words typed without an embedded space, and typographical errors. Without correct segmentation of text into word tokens, it is difficult to obtain good results from linguistic applications, so tokenization is an essential component of natural language and speech processing applications. This paper presents a new model for correctly segmenting Sindhi text into words. The model consists of three layers: layer 1 takes the input text and segments it on white space, layer 2 segments simple and compound words, and layer 3 segments complex words. The tokenizer was tested on 2792 Sindhi words and achieved an accuracy of 91.76%.
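The abstract does not say how each layer works internally, so the following is only a minimal sketch of a three-layer tokenizer under stated assumptions, not the authors' method. It assumes layer 2 is a lookup in a lexicon of known simple and compound words, and layer 3 is a greedy longest-match split of tokens typed without spaces. The function names and the toy Latin-script lexicon are hypothetical stand-ins for a real Sindhi word list.

```python
from typing import List, Set


def layer1_whitespace(text: str) -> List[str]:
    """Layer 1: split the input text on white space."""
    return text.split()


def layer2_is_known(token: str, lexicon: Set[str]) -> bool:
    """Layer 2 (assumed): accept tokens that are known simple or compound words."""
    return token in lexicon


def layer3_split(token: str, lexicon: Set[str]) -> List[str]:
    """Layer 3 (assumed): greedy longest-match split of a run-together token.

    If some part of the token cannot be matched, the token is returned unchanged.
    """
    parts: List[str] = []
    start = 0
    while start < len(token):
        # Try the longest candidate first, then shrink it.
        for end in range(len(token), start, -1):
            if token[start:end] in lexicon:
                parts.append(token[start:end])
                start = end
                break
        else:
            return [token]  # no lexicon word starts here; give up on splitting
    return parts


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Run the three layers in order and return the word tokens."""
    tokens: List[str] = []
    for token in layer1_whitespace(text):
        if layer2_is_known(token, lexicon):
            tokens.append(token)
        else:
            tokens.extend(layer3_split(token, lexicon))
    return tokens


# Hypothetical toy lexicon, used only to illustrate the control flow.
LEXICON = {"the", "black", "board", "blackboard", "is", "clean"}
print(segment("theblackboard is clean", LEXICON))
```

Greedy longest match is a simplification: it can choose a wrong split when a long lexicon entry overlaps the correct boundary, and it does nothing about the typographical errors the abstract mentions.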
Page(s): 43-48
Published: Journal: Sindh University Research Journal, Volume: 44, Issue: 1, Year: 2012