Abstract:
Proteins’ structures and functions play key roles in organisms’ life. However, it is time-consuming and costly to determine the attributes of protein sequences experimentally.
The amount of available protein data is growing tremendously, and interpreting the information contained in protein sequences is a complex process.
Hence, it is important to extract features from proteins by descriptors and classify them by computational methods.
The conventional discrete descriptor for a protein is defined as the occurrence frequency of each amino acid type in the protein sequence. This method works well when amino acid composition similarities are strong, but fails when sequence order across the protein is a strong determinant of the attribute. In contrast, conventional sequential descriptors work well when sequence order across the protein is a strong determinant of the attribute, but are not applicable to proteins of varying lengths.
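As a minimal illustration of the conventional discrete descriptor described above (occurrence frequency of each amino acid type), the following Python sketch may help. It is not the new mathematical definition proposed in the thesis; the function name `amino_acid_composition` and the choice to ignore non-standard residues are assumptions made only for this example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return a fixed-length (20) vector of occurrence frequencies."""
    counts = Counter(sequence.upper())
    # Assumption: only standard residues are counted, so frequencies sum to 1.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two sequences with the same residues in a different order
# receive identical descriptors, so sequence order is lost.
a = amino_acid_composition("MKVLA")
b = amino_acid_composition("ALVKM")
```

The vector always has 20 entries regardless of sequence length, which is why this descriptor applies to proteins of any length, and the identical result for reordered sequences shows why it cannot capture sequence order.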
These cases necessitate fixed-length descriptors that can capture some sequence order effect in protein sequences of various lengths. This thesis proposes a new mathematical definition for the conventional discrete descriptor, new fixed-length descriptors with partial sequence order effect, and new discrete descriptors that consider some similarity among amino acids according to their descriptors.
We compared six standard protein descriptors with fourteen novel descriptors on three classification problems (subcellular localization, caspase peptides and DNA binding) using two classification models, Random Forest and Support Vector Machines. The experimental results demonstrate the effectiveness of the new descriptors. Some of the novel descriptors outperform Pseudo Amino Acid Composition in terms of the Area Under Curve and the execution time measurements.
Performance differences exist between the descriptors, underlining that choosing an appropriate protein descriptor is of paramount importance in protein classification modelling.
Description:
CD, no of pages 63, 30114, informatics 5/2015