A comparison study of document clustering using Doc2Vec versus TFIDF combined with LSA for small corpora | [Journal of Theoretical and Applied Information Technology • 2020]

Author(s):

1. AMALIA AMALIA: Department of Computer Science, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Medan, Indonesia

2. OPIM SALIM SITOMPUL: Department of Information Technology, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Medan, Indonesia

3. ERNA BUDHIARTI NABABAN: Department of Information Technology, Faculty of Computer Science and Information Technology, Universitas Sumatera Utara, Medan, Indonesia

4. TEDDY MANTORO: Department of Computer Science, Universitas Sampoerna, Jakarta, Indonesia

Abstract:

Selecting a suitable word vector representation is essential in document clustering because it affects clustering performance. A good word vector representation yields good clustering results even with a simple clustering algorithm such as K-Means. Doc2Vec, one such representation, has been extensively studied on large text datasets and has been shown to outperform traditional word vector representations in document categorization. However, only a few studies analyze word vector representations for small corpora. Studying small corpora is also important because a large corpus is not always available, particularly for low-resource languages such as Bahasa Indonesia. Moreover, clustering small datasets plays an essential role in pattern recognition and can be an initial step toward applying the analysis to a larger corpus. This experimental study compares, in depth, document clustering using Doc2Vec versus TFIDF-LSA for small corpora in Bahasa Indonesia. The quality of each word vector representation is measured by cluster performance using intrinsic and extrinsic measurements. The study also evaluates each representation by its time and memory consumption, and seeks an optimal word vector representation by tuning the appropriate hyper-parameters. The word vector representations were tested on small corpora of various sizes using the K-Means algorithm. The results show that TFIDF-LSA achieves better cluster performance, while the Doc2Vec model is more efficient in time and memory usage.

Page(s): 3644-3657

DOI: DOI not available

Published: Journal: Journal of Theoretical and Applied Information Technology, Volume: 98, Issue: 17, Year: 2020

Keywords:

Clustering, Word Embedding, Small Corpora, Clustering Comparison, Word Vector Representation
