DSpace Repository

Protein Sequence Data: Laying the Mathematical Foundations to Derive Novel Descriptors

dc.contributor.advisor Abu Sbeih, Murad
dc.contributor.author Kurar, Berat
dc.date.accessioned 2018-10-29T08:27:34Z
dc.date.accessioned 2022-05-11T05:33:27Z
dc.date.available 2018-10-29T08:27:34Z
dc.date.available 2022-05-11T05:33:27Z
dc.date.issued 12/1/2015
dc.identifier.uri http://test.ppu.edu/handle/123456789/939
dc.description CD, no of pages 63, 30114, informatics 5/2015
dc.description.abstract Protein structures and functions play key roles in the life of organisms. However, determining the attributes of protein sequences experimentally is time-consuming and costly. The amount of available protein data is growing tremendously, and interpreting the information contained in protein sequences is a complex process. Hence it is important to extract features from proteins by means of descriptors and to classify them with computational methods. The conventional discrete descriptor for proteins is defined as the occurrence frequency of each amino acid type in a protein sequence. This method works well when amino acid composition similarities are strong but fails when sequence order across the protein is a strong determinant of the attribute. Conventional sequential descriptors, in contrast, work well when sequence order across the protein is a strong determinant of the attribute but are not applicable to proteins of varying lengths. These cases necessitate fixed-length descriptors that are able to capture some sequence order effect in protein sequences of various lengths. This thesis proposes a new mathematical definition for the conventional discrete descriptor, new fixed-length descriptors with partial sequence order effect, and new discrete descriptors that consider some similarity among amino acids according to their descriptors. We compared six standard protein descriptors and fourteen novel descriptors on three classification problems (subcellular localization, caspase peptides and DNA binding) using two classification models, Random Forest and Support Vector Machines. The experimental results demonstrate the effectiveness of the new descriptors. Some of the novel descriptors outperform Pseudo Amino Acid Composition in terms of Area Under the Curve and execution time. Performance differences exist between the descriptors, underlining that choosing an appropriate protein descriptor is of paramount importance in protein classification modelling. en_US
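The conventional discrete descriptor mentioned in the abstract (the occurrence frequency of each amino acid type) can be illustrated with a minimal Python sketch. This is not the thesis's own definition or code; the function name `amino_acid_composition`, the alphabet ordering in `AMINO_ACIDS`, and the choice to skip non-standard residue letters are assumptions made for illustration only.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a fixed-length (20-value) vector of amino acid frequencies.

    Illustrative sketch only: letters outside the 20 standard codes are
    ignored, which is an assumption rather than the thesis's convention.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid letters")
    # One frequency per amino acid type, in the fixed AMINO_ACIDS order.
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Under these assumptions, the vector has the same length for any input, and `amino_acid_composition("ACA")` equals `amino_acid_composition("CAA")`: the descriptor is blind to sequence order, which is the limitation the abstract points out.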
dc.language.iso en en_US
dc.publisher Palestine Polytechnic University - Informatics en_US
dc.subject Proteins, Mathematical Foundations en_US
dc.title Protein Sequence Data: Laying the Mathematical Foundations to Derive Novel Descriptors en_US
dc.type Other en_US

