dc.description.abstract |
The main problem in searching Genome DNA sequences is the large size of the sequences and their
long, highly variable lengths. Several methods are used to speed up sequence searching, such as
using database indexing methods instead of direct access to sequence files.
Our main idea is to provide an access methodology for Genome DNA sequences that is efficient in
time and space for searching and comparing, while taking into account the size of both the data and
the index. A Genome database search system needs to provide compact data representation and
compression, accurate output, and practical usability, and to minimize the number of I/O operations.
I/O operations are mainly needed at the last step, to eliminate false positives (sequences that appear
to be related to the searched query but are not). Pruning reduces the number of candidate sequences
that must be checked by database I/O referencing, so there is no need to search the whole
database.
In this thesis, we propose an approach for building a complete index structure that is suitable for
searching large databases with efficient storage space and search time. We represent Genome DNA
sequences using an n-gram Haar wavelet transformation and integer conversion of the coefficients.
An index structure, which is built upon a modified BTree index, is used to hold the integer
representation after transformation. We also introduce enhancements that
increase system efficiency by decreasing the index storage size. Our structure is
called the Modified Wavelet Transformation and BTree (M-WTBT). The M-WTBT structure allows
a set of parameters to be tuned so that the index structure suits the available resources. We
implemented the structure and evaluated it on a dataset used previously by other researchers, to
verify its features and to show the advantages of the M-WTBT structure. The M-WTBT structure is
also shown to be effective when compared with a set of previous studies.
Keywords: Sequence transformation, sequence compression, large database indexing, Haar Wavelet
Transformation, Genome DNA Sequence searching and indexing |
en_US |