Protein Sequence Data: Laying the Mathematical Foundations to Derive Novel Descriptors

dc.contributor.advisorAbu Sbeih, Murad
dc.contributor.authorKurar, Berat
dc.date.accessioned2018-10-29T08:27:34Z
dc.date.accessioned2022-05-11T05:33:27Z
dc.date.available2018-10-29T08:27:34Z
dc.date.available2022-05-11T05:33:27Z
dc.date.issued12/1/2015
dc.descriptionCD, no of pages 63, 30114, informatics 5/2015
dc.description.abstractThe structures and functions of proteins play key roles in the life of organisms. However, determining the attributes of protein sequences experimentally is time-consuming and costly. The amount of available protein data is growing tremendously, and interpreting the information contained in protein sequences is a complex process. Hence it is important to extract features from proteins with descriptors and to classify them with computational methods. The conventional discrete descriptor for proteins is defined as the occurrence frequency of each amino acid type in a protein sequence. This method works well when proteins share strong amino acid composition similarities but fails when sequence order across the protein is a strong determinant of the attribute. Conventional sequential descriptors, in contrast, work well when sequence order across the protein is a strong determinant of the attribute but are not applicable to proteins of varying lengths. These cases necessitate fixed-length descriptors that can capture some sequence order effect in protein sequences of various lengths. This thesis proposes a new mathematical definition for the conventional discrete descriptor, new fixed-length descriptors with a partial sequence order effect, and new discrete descriptors that consider some similarity among amino acids according to their descriptors. We compared six standard protein descriptors and fourteen novel descriptors on three classification problems (subcellular localization, caspase peptides and DNA binding) using two classification models, Random Forest and Support Vector Machines. The experimental results demonstrate the effectiveness of the new descriptors. Some of the novel descriptors outperform Pseudo Amino Acid Composition in terms of the Area Under Curve and execution time.
Performance differences exist between the descriptors, underlining that choosing an appropriate protein descriptor is of paramount importance in protein classification modelling.en_US
dc.identifier.urihttp://test.ppu.edu/handle/123456789/939
dc.language.isoenen_US
dc.publisherPalestine Polytechnic University - Informaticsen_US
dc.subjectProteins, Mathematical Foundationsen_US
dc.titleProtein Sequence Data: Laying the Mathematical Foundations to Derive Novel Descriptorsen_US
dc.typeOtheren_US
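
The abstract defines the conventional discrete descriptor as the occurrence frequency of each amino acid type in a sequence. The following is a minimal Python sketch of that definition only, not the thesis's own formulation or any of its novel descriptors. The function name `amino_acid_composition` and the alphabetical ordering of the 20 standard one-letter codes are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes.
# The alphabetical ordering is an assumption made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of occurrence frequencies, one per amino acid type.

    Hypothetical helper: non-standard symbols are ignored, and the
    frequencies are normalised over the standard residues only.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two sequences with the same residues in a different order.
first = amino_acid_composition("ACDA")
second = amino_acid_composition("AADC")

# The descriptor has a fixed length of 20 whatever the sequence length,
# and it is identical for both sequences because order is discarded.
print(len(first), first == second)
```

The sketch illustrates both properties the abstract mentions: the descriptor has the same length for proteins of any length, and it cannot distinguish sequences that differ only in residue order.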

Files

Original bundle

Name:
Berat-thesis.pdf
Size:
894.81 KB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Plain Text