Abstract:
The aim of this paper is to evaluate whether introducing clustering as a preprocessing step improves the classification of protein sequence data. Cluster analysis is used to discover possible sub-clusters that have different patterns within the same protein class. A classification learning algorithm is then applied to each cluster to enhance classification accuracy. Two standard benchmark datasets were used to examine the proposed approach: caspase 3 human substrates, which include cleaved and non-cleaved peptides, and the membrane proteins, inner and α-helical proteins. Different descriptors based on the physicochemical properties of amino acids were extracted from the protein sequence data, and two encoding methods were used to represent the protein sequences with these descriptors. The results show that applying a clustering process prior to classification gives higher prediction accuracy than using classification alone. In addition, the timing results show that the proposed approach significantly reduces the training time of the classification process while maintaining prediction accuracy.
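
The cluster-then-classify idea described above can be illustrated with a minimal sketch. The abstract does not name the clustering or classification algorithms here, so the code assumes k-means (with a deterministic farthest-first initialisation) as the clustering step and a nearest-class-centroid rule as a stand-in classifier inside each cluster. The function names (`kmeans`, `fit`, `predict`) and the 2-D toy vectors are hypothetical; they stand in for descriptor-encoded protein sequences and are not the paper's data or implementation.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def nearest(centers, p):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda i: dist(centers[i], p))


def kmeans(points, k, iters=20):
    """Plain k-means; assumed here as the clustering step."""
    # Farthest-first initialisation keeps the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(centers, p)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(groups[i]) if groups[i] else centers[i] for i in range(k)]
    return centers


def fit(X, y, k):
    """Cluster the training set, then build one local classifier per cluster."""
    centers = kmeans(X, k)
    by_cluster = defaultdict(lambda: defaultdict(list))
    for p, label in zip(X, y):
        by_cluster[nearest(centers, p)][label].append(p)
    # Local model (stand-in classifier): one centroid per class within the cluster.
    models = {c: {lab: mean(pts) for lab, pts in labs.items()}
              for c, labs in by_cluster.items()}
    return centers, models


def predict(model, p):
    """Route p to its nearest non-empty cluster, then apply that cluster's classifier."""
    centers, models = model
    c = min(models, key=lambda i: dist(centers[i], p))
    local = models[c]
    return min(local, key=lambda lab: dist(local[lab], p))


# Toy data: each class is split into two sub-clusters with different patterns.
X = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
     [10.0, 10.0], [9.6, 10.3], [10.2, 9.5],
     [0.0, 10.0], [0.4, 9.7], [0.2, 10.5],
     [10.0, 0.0], [9.5, 0.3], [10.3, 0.4]]
y = ["cleaved"] * 6 + ["non-cleaved"] * 6

model = fit(X, y, k=4)
print(predict(model, [9.8, 0.1]))  # non-cleaved
print(predict(model, [0.2, 0.3]))  # cleaved
```

In this toy layout the two class centroids computed over the whole dataset nearly coincide, so a single global centroid classifier cannot separate the classes, whereas the per-cluster models can. Each local model is also fitted on only a subset of the training data, which is one plausible reason such an approach could shorten training time.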