An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods. | [Mehran University Research Journal of Engineering and Technology • 2019]

Author(s):

1. Mazhar Ali: Benazir Bhutto Shaheed University, Lyari, Karachi, Pakistan

2. Asim Imdad Wagan: Mohammad Ali Jinnah University, Karachi, Pakistan

Abstract:

A linguistic corpus of the Sindhi language is important for computational linguistics, machine learning, identification and analysis of language features, semantic and sentiment analysis, information retrieval, and related tasks. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu, and some other languages are fully resourced computationally, and their grammar and morphemes have been analyzed with a variety of machine learning methods. Development and research work on computational linguistics for Sindhi is currently in progress. This study develops a Sindhi annotated corpus using the universal POS (Part of Speech) tag set and a Sindhi POS tag set, for the purpose of analyzing language features and variation. Features are extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and its grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. Cross-validation with 10 folds is used to evaluate and validate the model. The results show that the model performs well and confirm that the Sindhi corpus is properly annotated. The study identifies a number of research gaps for further work on topic modeling, language variation, and sentiment and semantic analysis of the Sindhi language.
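The TF-IDF weighting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `tf_idf`, the plain `tf * log(N / df)` formulation, and the toy token/tag strings are all assumptions made for illustration, and the paper may use a different variant or library.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation: tf = count / document length,
    idf = log(number of documents / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy input (placeholder tokens with POS-style tags),
# not data from the Sindhi corpus.
docs = [
    ["w1/NOUN", "w2/VERB", "w1/NOUN"],
    ["w3/ADJ", "w1/NOUN"],
    ["w2/VERB", "w4/ADV"],
]
vectors = tf_idf(docs)
```

Under this formulation a term that occurs in every document receives weight zero, so the terms that distinguish one annotated sentence from another dominate the feature vector.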

Page(s): 185-196

DOI: 10.22581/muet1982.1901.15

Published: Journal: Mehran University Research Journal of Engineering and Technology, Volume: 38, Issue: 1, Year: 2019


Conclusion:

The presented study has developed a novel Sindhi corpus. The study performed supervised classification on the Sindhi annotated corpus to assess the accuracy of traditional machine learning approaches for solving NLP problems of the Sindhi language. The supervised machine learning methods are evaluated with 10-fold cross-validation. The Sindhi annotated corpus is split into an 80% training dataset and a 20% test dataset, and the models are trained on the training dataset. Each fold of the cross-validation partitions the corpus into subsets, which are used to fit the model and to validate it. The cross-validation partitions are drawn randomly. On the basis of the obtained results, the RF (Random Forest) method performs better than the non-linear SVM (Support Vector Machine).
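The 80/20 split and randomized 10-fold cross-validation described here can be sketched in plain Python. This is an illustration under assumptions, not the study's code: the helper names `train_test_split` and `k_fold_indices`, the fixed seeds, the placeholder sentences, and the choice to run the folds over the training portion are all hypothetical.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle items and return (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k random folds."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Hypothetical placeholders standing in for annotated sentences.
sentences = [f"sentence_{i}" for i in range(50)]
train, test = train_test_split(sentences)
splits = list(k_fold_indices(len(train), k=10))
```

Each item serves as validation data exactly once across the ten folds, so a classifier such as RF or SVM can be fit ten times and its scores averaged before the held-out 20% is used for the final test.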

