Abstract:
We introduce a new approach for improving the prediction of biological attributes from protein sequences by combining
classification algorithms with clustering analysis. Before classification, clustering analysis is used to group similar proteins;
a classification algorithm is then applied to each cluster separately. The proposed approach is suited to large datasets where both high classification accuracy and
fast convergence are required.
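The cluster-then-classify idea described above can be sketched as follows. This is a minimal illustration only: the abstract does not name the clustering or classification algorithms, so k-means and a nearest-centroid classifier are used here as stand-ins, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, iterations=20):
    """Plain k-means (a stand-in clustering method, not the thesis's)."""
    # Deterministic start from the first k points, for illustration only.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(groups[i]) if groups[i] else centers[i] for i in range(k)]
    return centers

def fit(points, labels, k):
    """Cluster first, then train one nearest-centroid model per cluster."""
    centers = kmeans(points, k)
    per_cluster = defaultdict(lambda: defaultdict(list))
    for p, y in zip(points, labels):
        idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
        per_cluster[idx][y].append(p)
    # models[cluster][label] is the centroid of that label inside the cluster.
    models = {c: {y: mean(ps) for y, ps in by_label.items()}
              for c, by_label in per_cluster.items()}
    return centers, models

def predict(point, centers, models):
    """Route the point to its nearest trained cluster, then classify there."""
    cluster = min(models, key=lambda c: sq_dist(point, centers[c]))
    model = models[cluster]
    return min(model, key=lambda y: sq_dist(point, model[y]))

# Toy feature vectors (hypothetical): two well-separated groups,
# each containing examples of both classes.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0),
          (2.0, 0.0), (12.0, 10.0), (2.0, 1.0), (12.0, 11.0)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

centers, models = fit(points, labels, k=2)
print(predict((0.2, 0.4), centers, models))
print(predict((11.8, 10.2), centers, models))
```

Each per-cluster model only has to separate the proteins that fall into its own cluster, which is one plausible reason such a scheme could train faster on large datasets.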
Several descriptors based on the physicochemical properties of amino
acids are used; some are native properties and others are derived properties. Two encoding methods are used to represent the protein
sequences with these descriptors. Both the descriptors and the encoding methods
are analyzed to improve the performance of the proposed approach.
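As an illustration of what encoding a sequence with a physicochemical descriptor can look like, the sketch below uses the Kyte-Doolittle hydropathy scale. The abstract does not say which descriptors or which two encoding methods were used, so the scale, the two encodings (per-residue substitution and a sequence average), the function names, and the example peptide are all assumptions made for this example.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
# Chosen only as an example descriptor; the thesis's descriptors are not
# listed in the abstract.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode_per_residue(sequence):
    """Hypothetical encoding 1: replace each residue by its descriptor value.

    The output length equals the sequence length, so this suits
    fixed-length peptides.
    """
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]

def encode_average(sequence):
    """Hypothetical encoding 2: summarize the sequence by its mean value.

    The output is a single number regardless of sequence length.
    """
    values = encode_per_residue(sequence)
    return sum(values) / len(values)

peptide = "AILK"  # made-up example peptide
per_residue = encode_per_residue(peptide)
average = encode_average(peptide)
print(per_residue)
print(round(average, 2))
```

A per-residue encoding keeps positional information, while an averaged encoding gives a fixed-size vector for sequences of any length; comparing such trade-offs is the kind of analysis the paragraph above refers to.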
Three standard benchmark datasets (Caspase, Major Histocompatibility
Complex class II (MHC-II), and membrane proteins) are used to evaluate
the proposed approach. Many experiments with different parameter settings are
performed, and the results are cross-validated.
The results show that applying clustering before classification gives
higher prediction accuracy than classification without clustering, especially on the membrane proteins and Caspase datasets.
In addition, the result of time performance, especially when using the MHC-II
Description:
No. of pages 109, 26547, Informatics 2/2013, in the store