DSpace Repository

Toward an enhanced Arabic text classification using cosine similarity and Latent Semantic Indexing

dc.contributor.author Al-Anzi, Fawaz
dc.contributor.author AbuZeina, Dia
dc.date.accessioned 2021-05-09T08:07:57Z
dc.date.accessioned 2022-05-22T08:54:11Z
dc.date.available 2021-05-09T08:07:57Z
dc.date.available 2022-05-22T08:54:11Z
dc.date.issued 2017
dc.identifier.uri http://localhost:8080/xmlui/handle/123456789/8223
dc.description.abstract Cosine similarity is one of the most popular distance measures in text classification problems. In this paper, we used this measure to investigate the performance of Arabic language text classification. For textual features, the vector space model (VSM) is generally used to represent textual information as numerical vectors. However, Latent Semantic Indexing (LSI) is a better textual representation technique because it maintains semantic information between the words. Hence, we used the singular value decomposition (SVD) method to extract textual features based on LSI. In our experiments, we compared several well-known classification methods: Naïve Bayes, k-Nearest Neighbors, Neural Network, Random Forest, Support Vector Machine, and classification tree. We used a corpus that contains 4,000 documents on ten topics (400 documents for each topic). The corpus contains 2,127,197 words with about 139,168 unique words. The testing set contains 400 documents, 40 documents for each topic. As a weighting scheme, we used Term Frequency-Inverse Document Frequency (TF.IDF). This study reveals that the classification methods that use LSI features significantly outperform the TF.IDF-based methods. It also reveals that k-Nearest Neighbors (based on the cosine measure) and Support Vector Machine are the best performing classifiers. © 2016 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). en_US
dc.language.iso en_US en_US
dc.publisher Elsevier en_US
dc.subject Arabic text; Classification; Supervised learning; Cosine similarity; Latent Semantic Indexing en_US
dc.title Toward an enhanced Arabic text classification using cosine similarity and Latent Semantic Indexing en_US
dc.type Article en_US
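
The pipeline summarized in the abstract (TF.IDF weighting, SVD-based LSI features, then k-Nearest Neighbors with the cosine measure) can be illustrated with a small sketch. This is not the authors' implementation: the toy English corpus, the smoothed IDF formula, the latent dimension `k=2`, and the helper names `tfidf_matrix`, `fit_lsi`, `cosine_knn`, and `predict` are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs, vocab=None, idf=None):
    """Build a document-by-term TF.IDF matrix from whitespace-tokenized text."""
    tokenized = [d.split() for d in docs]
    if vocab is None:
        vocab = sorted({t for doc in tokenized for t in doc})
        n = len(tokenized)
        df = Counter(t for doc in tokenized for t in set(doc))
        # Smoothed IDF: an assumption, the abstract does not give the formula.
        idf = np.array([math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab])
    index = {t: j for j, t in enumerate(vocab)}
    X = np.zeros((len(tokenized), len(vocab)))
    for i, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            if term in index:  # terms unseen in training are ignored
                X[i, index[term]] = count
    return X * idf, vocab, idf


def fit_lsi(X, k):
    """Truncated SVD, X ~ U_k S_k V_k^T.

    Returns the training documents in the k-dimensional latent space
    (U_k S_k) and the projection matrix V_k for new documents.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k].T


def cosine_knn(train_vecs, train_labels, query, n_neighbors=3):
    """Majority vote among the n_neighbors most cosine-similar training rows."""
    norms = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query)
    sims = train_vecs @ query / np.where(norms == 0, 1.0, norms)
    top = np.argsort(-sims)[:n_neighbors]
    return Counter(train_labels[i] for i in top).most_common(1)[0][0]


# Hypothetical toy corpus (the paper uses 4,000 Arabic documents, ten topics).
train_docs = [
    "match goal team player",
    "team league goal coach",
    "player coach match league",
    "market bank stock price",
    "bank price economy market",
    "stock economy bank trade",
]
train_labels = ["sports"] * 3 + ["economy"] * 3

X_train, vocab, idf = tfidf_matrix(train_docs)
train_latent, V_k = fit_lsi(X_train, k=2)


def predict(text):
    """Weight a new document with the training IDF, project it, classify it."""
    x, _, _ = tfidf_matrix([text], vocab=vocab, idf=idf)
    return cosine_knn(train_latent, train_labels, x[0] @ V_k)


print(predict("goal coach league"))
print(predict("price trade market"))
```

New documents are projected with `V_k` so that they land in the same latent space as the training rows (`X V_k = U_k S_k`). On a corpus of realistic size, a sparse truncated SVD would normally replace the dense `np.linalg.svd` call used here.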
