Abstract:
The vector space model (VSM) is a text representation method that is widely used in document
classification. However, it remains demanding in terms of storage space. One attempt to
alleviate the space problem is to use dimensionality reduction techniques; however, such
techniques have deficiencies, such as losing some important information. In this paper, we propose
a novel text classification method that neither uses VSM nor dimensionality reduction
techniques. The proposed method is space-efficient and uses a first-order Markov
model for hierarchical Arabic text classification. For each category and sub-category, a Markov
chain model is built from sequences of neighboring characters. The prepared models are
then used to score documents for classification. For evaluation, we used a hierarchical
Arabic text collection of 11,191 documents belonging to eight topics
distributed across three levels. The experimental results show that the Markov chain-based method
significantly outperforms the baseline system that employs the latent semantic indexing (LSI)
method. Specifically, the proposed method improves the F1-measure by 3.47%. The novelty of this
work lies in decomposing words into sequences of characters, which was found to be a
promising approach in terms of space and accuracy. To the best of our knowledge, this is the first
attempt to study hierarchical Arabic text classification with such a relatively large
data collection.
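
As an illustration of the scoring idea described above, the following minimal Python sketch builds one first-order character Markov chain per category and assigns a document to the category whose chain gives it the highest log-likelihood. It is a sketch under stated assumptions, not the implementation evaluated in this paper: the function names (train_chain, log_score, classify), the add-one smoothing, and the use of summed log-probabilities are assumptions introduced here for illustration, and the example handles a single flat level only. A hierarchical variant could, for example, apply the same step again among the sub-categories of the chosen category.

```python
import math
from collections import defaultdict


def train_chain(documents):
    """Count character-to-next-character transitions over a category's documents."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in documents:
        for prev, curr in zip(text, text[1:]):
            counts[prev][curr] += 1
    return counts


def log_score(chain, text, alphabet_size):
    """Sum of log transition probabilities of `text` under `chain`.

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    transitions from producing log(0).
    """
    total = 0.0
    for prev, curr in zip(text, text[1:]):
        row = chain.get(prev, {})
        row_total = sum(row.values())
        total += math.log((row.get(curr, 0) + 1) / (row_total + alphabet_size))
    return total


def classify(chains, text, alphabet_size):
    """Return the label whose chain gives `text` the highest score."""
    return max(chains, key=lambda label: log_score(chains[label], text, alphabet_size))


# Toy usage with hypothetical two-document categories.
training = {
    "sports": ["كرة القدم", "مباراة كرة"],
    "economy": ["سوق المال", "سعر السوق"],
}
alphabet = {ch for docs in training.values() for doc in docs for ch in doc}
chains = {label: train_chain(docs) for label, docs in training.items()}
predicted = classify(chains, "كرة", len(alphabet))
```

Each model in this sketch stores only character-pair counts, so its size is bounded by the square of the alphabet size rather than by the vocabulary size.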