Protein Sequence Data: Laying the Mathematical Foundations to Derive Novel Descriptors

Kurar, Berat

dc.contributor.advisor	Abu Sbeih, Murad
dc.contributor.author	Kurar, Berat
dc.date.accessioned	2018-10-29T08:27:34Z
dc.date.accessioned	2022-05-11T05:33:27Z
dc.date.available	2018-10-29T08:27:34Z
dc.date.available	2022-05-11T05:33:27Z
dc.date.issued	12/1/2015
dc.identifier.uri	http://test.ppu.edu/handle/123456789/939
dc.description	CD, no of pages 63, 30114, informatics 5/2015
dc.description.abstract	Proteins’ structures and functions play key roles in organisms’ lives. However, determining the attributes of protein sequences experimentally is time-consuming and costly. The amount of available protein data is growing tremendously, and interpreting the information contained in protein sequences is a complex process. Hence it is important to extract features from proteins with descriptors and to classify them with computational methods. The conventional discrete descriptor for proteins is defined as the occurrence frequency of each amino acid type in a protein sequence. This method works well when amino acid composition similarities are strong but fails when sequence order across the protein is a strong determinant of the attribute. Conventional sequential descriptors, in contrast, work well when sequence order across the protein is a strong determinant of the attribute but are not applicable to proteins of varying lengths. These cases necessitate fixed-length descriptors that are able to capture some sequence order effect in protein sequences of various lengths. This thesis proposes a new mathematical definition for the conventional discrete descriptor and new fixed-length descriptors with partial sequence order effect, as well as new discrete descriptors that consider some similarity among amino acids according to their descriptors. We compared six standard protein descriptors and fourteen novel descriptors on three classification problems (subcellular localization, caspase peptides, and DNA binding) using two classification models, Random Forest and Support Vector Machines. The experimental results demonstrate the effectiveness of the new descriptors. Some of the novel descriptors outperform Pseudo Amino Acid Composition in terms of the Area Under the Curve and execution time measurements. 
Performance differences exist between the descriptors, underlining that choosing an appropriate protein descriptor is of paramount importance in protein classification modelling.	en_US
dc.language.iso	en	en_US
dc.publisher	Palestine Polytechnic University - Informatics	en_US
dc.subject	Proteins, Mathematical Foundations	en_US
dc.title	Protein Sequence Data: Laying the Mathematical Foundations to Derive Novel Descriptors	en_US
dc.type	Other	en_US
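
The abstract defines the conventional discrete descriptor as the occurrence frequency of each amino acid type in a protein sequence. The following is a minimal sketch of that definition only, not the thesis's own implementation or its new mathematical formulation. The function name `amino_acid_composition`, the constant `AMINO_ACIDS`, and the choice to ignore non-standard residue codes are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a fixed-length (20-element) list of occurrence frequencies.

    Hypothetical helper: characters outside the 20 standard codes are
    ignored, which is an assumption and not something the record states.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One frequency per amino acid type, always in the same order.
    return [counts[a] / total for a in AMINO_ACIDS]


# Two sequences with the same residues in a different order.
first = amino_acid_composition("ACDA")
second = amino_acid_composition("AACD")
print(first == second)
```

Because the vector always has 20 entries, it applies to sequences of any length, but the two example sequences map to the same vector. This is the loss of sequence order information that the abstract gives as the motivation for fixed-length descriptors with a partial sequence order effect.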