Using clustering to enhance protein sequence classification

Publisher

Palestine Polytechnic University - Informatics

Abstract

We introduce a new approach for improving the prediction of biological attributes from protein sequences by combining classification algorithms with clustering analysis. Before classification, clustering analysis is used to find clusters of similar proteins, and a classification algorithm is then applied to each cluster. The proposed approach is suitable for large datasets when high classification accuracy and fast convergence are required. Different descriptors based on the physicochemical properties of amino acids are used; some are native properties and the others are derived properties. Two encoding methods are used to represent the protein sequences with these descriptors. The descriptors and encoding methods are analyzed to enhance the performance of the proposed approach. Three standard benchmark datasets, Caspase, Major Histocompatibility Complex class II (MHC-II), and membrane proteins, are used to evaluate the proposed approach. Many experiments with different parameters are performed, and the results are cross-validated. The results show that applying clustering prior to classification gives higher prediction accuracy than classification without clustering, especially on the membrane proteins dataset and the Caspase dataset. In addition, the result of time performance, especially when using the MHC-II
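The cluster-then-classify idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the two-feature encoding (mean Kyte-Doolittle hydropathy and fraction of hydrophobic residues), the plain k-means clustering, the 1-nearest-neighbour classifier used inside each cluster, the function names, and the toy sequences and labels. The thesis's actual descriptors, encoding methods, and algorithms are not specified in this record.

```python
# Kyte-Doolittle hydropathy index, used here as one example of a
# physicochemical descriptor (an illustrative choice, not the thesis's).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Hypothetical encoding: (mean hydropathy, fraction of hydrophobic residues).

    Assumes a non-empty sequence made of the 20 standard amino acids.
    """
    values = [HYDROPATHY[aa] for aa in seq]
    mean = sum(values) / len(values)
    hydrophobic_fraction = sum(v > 0 for v in values) / len(values)
    return (mean, hydrophobic_fraction)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic initialisation (the first k points)."""
    centroids = list(points[:k])
    for _ in range(iters):
        assignment = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]


def fit(seqs, labels, k):
    """Cluster the training set, then keep the labelled members of each cluster."""
    features = [encode(s) for s in seqs]
    centroids, assignment = kmeans(features, k)
    clusters = {j: [] for j in range(k)}
    for x, y, j in zip(features, labels, assignment):
        clusters[j].append((x, y))
    return centroids, clusters


def predict(model, seq):
    """Route the sequence to its nearest non-empty cluster, then classify
    with 1-nearest-neighbour among that cluster's members only."""
    centroids, clusters = model
    x = encode(seq)
    j = min((c for c in clusters if clusters[c]),
            key=lambda c: sq_dist(x, centroids[c]))
    _, label = min(clusters[j], key=lambda member: sq_dist(x, member[0]))
    return label


# Toy data invented for this example; sequences and labels are hypothetical.
train_seqs = ["LLIVAVLLIV", "DEKRDEKRNQ", "AILVFLIVAL", "KKDDEERRQN"]
train_labels = ["membrane", "soluble", "membrane", "soluble"]
model = fit(train_seqs, train_labels, k=2)

print(predict(model, "VVLLIIAAFF"))
print(predict(model, "EEKKDDRRNN"))
```

Because each query is compared only against the members of one cluster rather than the whole training set, the per-query work shrinks as the number of clusters grows, which is one plausible reason such a scheme could help on large datasets.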

Description

Number of pages: 109; 26547; Informatics 2/2013; in the store
