DSpace Repository

Exploring bigram character features for Arabic text clustering

dc.contributor.advisor AbuZeina, Dia
dc.contributor.author AbuZeina, Dia
dc.date.accessioned 2019-10-09T10:29:38Z
dc.date.accessioned 2022-05-22T08:53:03Z
dc.date.available 2019-10-09T10:29:38Z
dc.date.available 2022-05-22T08:53:03Z
dc.date.issued 2019-07-26
dc.identifier.citation ABUZEINA, DIA EDDIN. "Exploring bigram character features for Arabic text clustering." Turkish Journal of Electrical Engineering & Computer Sciences 27.4 (2019): 3165-3179. en_US
dc.identifier.issn 10.3906/elk-1808-103
dc.identifier.uri http://localhost:8080/xmlui/handle/123456789/8132
dc.description.abstract The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e. words without infixes, prefixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this dimensionality problem. Recent literature shows that a more basic unit can also serve as a textual feature: the two-neighboring-character form, which we call a microword. The microword is two consecutive characters, also known as the bigram character feature. To evaluate this feature type, we measure the accuracy of Arabic text clustering using two feature types: the complete word form and the microword form. In the experiment, principal component analysis (PCA) is used to reduce the feature vector dimensions, while the k-means algorithm is used for clustering. The testing set includes 250 documents in five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form achieves an accuracy of 97.2%, while the two-character form achieves 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as fewer PCA subspaces. These experiments may indicate that the bigram character feature deserves consideration in future text processing and natural language processing applications. en_US
dc.description.sponsorship Palestine Polytechnic University en_US
dc.language.iso en en_US
dc.publisher TÜBİTAK en_US
dc.subject Arabic, text, clustering, features, dimensionality reduction, k-means, principal component analysis, vector space model en_US
dc.title Exploring bigram character features for Arabic text clustering en_US
dc.title.alternative Exploring bigram character features for Arabic text clustering en_US
dc.type Article en_US
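
The two feature types compared in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the whitespace tokenization, and the choice to take bigrams inside each word (rather than across word boundaries) are assumptions made for illustration only. The sketch builds raw term-count vectors for a vector space model; the PCA and k-means steps described in the abstract would then be applied to these vectors and are not shown here.

```python
from collections import Counter


def word_features(text):
    """Complete word forms: whitespace-separated tokens (assumed tokenization)."""
    return text.split()


def bigram_features(text):
    """Character bigrams ("microwords") taken inside each word.

    Assumption: bigrams do not span word boundaries, and
    single-character words contribute no features.
    """
    feats = []
    for word in text.split():
        feats.extend(word[i:i + 2] for i in range(len(word) - 1))
    return feats


def build_vsm(docs, extract):
    """Return (sorted vocabulary, list of term-count vectors), one vector per document."""
    counts = [Counter(extract(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for terms that do not occur in a document.
    vectors = [[c[t] for t in vocab] for c in counts]
    return vocab, vectors


if __name__ == "__main__":
    # Hypothetical toy documents, not taken from the paper's corpus.
    docs = ["كتب الولد", "كتاب الولد", "ملعب كرة"]
    word_vocab, _ = build_vsm(docs, word_features)
    bigram_vocab, _ = build_vsm(docs, bigram_features)
    print(len(word_vocab), len(bigram_vocab))
```

On a corpus this small the bigram vocabulary is not smaller than the word vocabulary. The reduction reported in the abstract would be expected on a large corpus, because the number of distinct character pairs is bounded by the square of the alphabet size while the number of distinct words keeps growing.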

