Exploring bigram character features for Arabic text clustering

AbuZeina, Dia

dc.contributor.advisor	AbuZeina, Dia
dc.contributor.author	AbuZeina, Dia
dc.date.accessioned	2019-10-09T10:29:38Z
dc.date.accessioned	2022-05-22T08:53:03Z
dc.date.available	2019-10-09T10:29:38Z
dc.date.available	2022-05-22T08:53:03Z
dc.date.issued	2019-07-26
dc.identifier.citation	AbuZeina, Dia Eddin. "Exploring bigram character features for Arabic text clustering." Turkish Journal of Electrical Engineering & Computer Sciences 27.4 (2019): 3165-3179.	en_US
dc.identifier.issn	10.3906/elk-1808-103
dc.identifier.uri	http://localhost:8080/xmlui/handle/123456789/8132
dc.description.abstract	The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e. words without prefixes, infixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this dimensionality problem. Recent literature shows that a more basic unit can also serve as a textual feature: the form of two neighboring characters, which we call a microword. A microword is two consecutive characters, also known as a character bigram. To evaluate this feature type, we measure the accuracy of Arabic text clustering using two feature types: the complete word form and the microword form. In the experiment, principal component analysis (PCA) is used to reduce the feature vector dimensions, while the k-means algorithm is used for clustering. The testing set includes 250 documents in five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form achieves an accuracy of 97.2%, while the two-character form achieves 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as fewer PCA subspaces. These experiments suggest that the bigram character feature merits consideration in future text processing and natural language processing applications.	en_US
dc.description.sponsorship	Palestine Polytechnic University	en_US
dc.language.iso	en	en_US
dc.publisher	TÜBİTAK	en_US
dc.subject	Arabic, text, clustering, features, dimensionality reduction, k-means, principal component analysis, vector space model	en_US
dc.title	Exploring bigram character features for Arabic text clustering	en_US
dc.title.alternative	Exploring bigram character features for Arabic text clustering	en_US
dc.type	Article	en_US
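
The microword (character bigram) feature described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names (`char_bigrams`, `bigram_counts`, `build_vocabulary`, `to_vector`), the whitespace tokenization, the choice of overlapping bigrams that do not cross word boundaries, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def char_bigrams(word):
    """Return the overlapping two-character substrings (microwords) of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def bigram_counts(text):
    """Count microwords per whitespace-separated token.

    Assumption: bigrams are taken inside each token only, never across
    the space between two tokens.
    """
    counts = Counter()
    for token in text.split():
        counts.update(char_bigrams(token))
    return counts


def build_vocabulary(documents):
    """Collect the sorted set of microwords seen across all documents."""
    vocab = set()
    for doc in documents:
        vocab.update(bigram_counts(doc))
    return sorted(vocab)


def to_vector(text, vocabulary):
    """Map a document to a count vector over the microword vocabulary."""
    counts = bigram_counts(text)
    # Counter returns 0 for microwords that do not occur in this document.
    return [counts[term] for term in vocabulary]


# Toy documents for illustration only; they are not from the paper's corpus.
docs = ["كتاب كاتب", "مكتبة كتاب"]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Count vectors of this kind could then be reduced with PCA and clustered with k-means, for example through scikit-learn's `PCA` and `KMeans` classes, although the record does not state which tools were used in the study. Because an alphabet admits only a bounded number of two-character combinations, a microword vocabulary has a fixed upper limit on its size, which is consistent with the smaller vocabulary reported in the abstract.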