Pakistan Science Abstracts
An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods.
Author(s):
1. Mazhar Ali: Benazir Bhutto Shaheed University, Lyari, Karachi, Pakistan
2. Asim Imdad Wagan: Mohammad Ali Jinnah University, Karachi, Pakistan
Abstract:
A linguistic corpus of the Sindhi language is important for computational linguistics, machine learning, identification and analysis of language features, semantic and sentiment analysis, information retrieval, and related tasks. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu, and some other languages are well resourced computationally. The grammar and morphemes of these languages have been analyzed using different machine learning methods. Research and development in computational linguistics for the Sindhi language is currently in progress. This study develops a Sindhi annotated corpus using the universal POS (Part of Speech) tag set and a Sindhi POS tag set for the analysis of language features and variation. Features are extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and its grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. 10-fold cross-validation is used to evaluate and validate the model. The results show good model performance and confirm that the Sindhi corpus is properly annotated. The study also identifies a number of research gaps for further work on topic modeling, language variation, and sentiment and semantic analysis of the Sindhi language.
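The abstract names three steps (TF-IDF feature extraction, an 80%/20% train/test split, and 10-fold cross-validation) without giving implementation details. The following is a minimal sketch of those steps using only the Python standard library. The function names and the toy token lists are hypothetical and are not taken from the paper, and the unsmoothed weighting tf × log(N/df) is one common TF-IDF variant that may differ from the one the authors used. No classifier is included.

```python
import math
import random
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: weight = (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(round(len(shuffled) * test_ratio))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k random folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, validation


# Hypothetical toy data: placeholder tokens, not entries from the Sindhi corpus.
toy_docs = [["a", "b"], ["a", "c"]]
toy_weights = tf_idf(toy_docs)

train_items, test_items = train_test_split(range(10))
folds = list(k_fold_indices(20, k=10))
```

In this sketch a term that occurs in every document receives a weight of zero, which is the usual effect of the inverse document frequency factor. Each of the ten folds serves once as the validation subset while the other nine are used for training.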
Page(s): 185-196
Published: Journal: Mehran University Research Journal of Engineering and Technology, Volume: 38, Issue: 1, Year: 2019
References:
[1] The presented research study presents novel work on a Sindhi corpus. The study performed supervised classification on the Sindhi annotated corpus to assess the accuracy of traditional machine learning approaches in solving NLP problems of the Sindhi language. The supervised machine learning methods are evaluated and assessed with 10-fold cross-validation. The Sindhi annotated corpus is split into an 80% training dataset and a 20% test dataset, and the model is trained on the 80% training dataset. Each fold of cross-validation partitions the corpus into subsets to analyze the training set and validate the test set. All cross-validation partitions are made randomly. The study observes that the RF machine learning method performs better than the non-linear SVM on the basis of the obtained results.
[2] “Language Technology Tools and Resources for a Resource-Poor Language: Sindhi”, pp. 51-58, 2016.
[3] Mahar, J.A., G.Q., 2012, Science Series) 1, 43-47
[4] Mahar, J.A., and Memon, G.Q., “Rule Based Part of Speech Tagging of Sindhi Language”, IEEE International Conference on Signal Acquisition and Processing, pp. 101-106, 2010.
[5] Mahar, J.A., Shaikh, H., and Memon, G.Q., “A Model for Sindhi Text Segmentation into Word Tokens”, Sindh University Research Journal (Science Series), Volume 44, No. 1, pp. 43-47, Jamshoro, Pakistan
[6] Mahar, J.A., and Memon, G.Q., “Sindhi Part of Speech Tagging System using WordNet”, International Journal of Computer Theory and Engineering, Volume 2, No. 4, pp. 538, 2010
[7] Dootio, M.A., and Wagan, A.I., “Syntactic Parsing and Supervised Analysis of Sindhi Text”, Journal of King Saud University – Computer and Information Sciences, [DOI:10.1016/j.jksuci.2017.10.004],
[8] Motlani, R., Lalwani, H., Shrivastava, M., and Sharma, D.M., “Developing Part-of-Speech Tagger for a Resource Poor Language: Sindhi”, Proceedings of 7th Conference on Language and Technology, Poznan, Poland
[9] Motlani, R., Tyers, F.M., and Sharma, D.M., “A Finite-State Morphological Analyzer for Sindhi”, Proceedings of 10th International Conference on Language Resources and Evaluation, 2016
[10] Siraj, “Sindhi Boli”, 2nd Edition, Sindhi Language Authority, Hyderabad, Sindh, Pakistan,
[11] Bag, M.K., “Sindhi Vyakaran”, Sindhi Adabi Board, Jamshoro, Sindh, Pakistan, 2015.
[13] Taylor, A., Marcus, M., and Santorini, B., “The Penn Treebank: An Overview”, Treebanks, pp. 5-22. Springer, Dordrecht, 2003.
[14] Sarker, A., and Gonzalez, G., “Portable Automatic Text Classification for Adverse Drug Reaction Detection via Multi-Corpus Training”, Journal of Biomedical Informatics, Volume 53, pp. 196-207,
[15] Onan, A., Korukoğlu, S., and Bulut, H., “Ensemble of Keyword Extraction Methods and Classifiers in Text Classification”, Expert Systems with Applications, Volume 57, pp. 232-247, 2016