Show simple item record

dc.contributor.advisorMyers, Eugene W.en_US
dc.contributor.authorKececioglu, John Dimitri.
dc.creatorKececioglu, John Dimitri.en_US
dc.date.accessioned2011-10-31T17:44:56Z
dc.date.available2011-10-31T17:44:56Z
dc.date.issued1991en_US
dc.identifier.urihttp://hdl.handle.net/10150/185673
dc.description.abstractThe DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 ≤ ε < 1, find a shortest sequence S such that every fragment F ∈ F, or its reverse complement, matches a substring of S with at most ε|F| errors. Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2ε/(1-ε) and at most a factor 1/1-ε longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1- ε)α under the overlap measure. We construct the overlap graph for (1) in O(εN²) time given fragments of total length N at error rate ε. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K ≤ 2ᴱ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V)log V) time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.
dc.language.isoenen_US
dc.publisherThe University of Arizona.en_US
dc.rightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.en_US
dc.subjectDNA -- Computer programs.en_US
dc.titleExact and approximation algorithms for DNA sequence reconstruction.en_US
dc.typetexten_US
dc.typeDissertation-Reproduction (electronic)en_US
dc.identifier.oclc704395410en_US
thesis.degree.grantorUniversity of Arizonaen_US
thesis.degree.leveldoctoralen_US
dc.contributor.committeememberDowney, Peteren_US
dc.contributor.committeememberManber, Udien_US
dc.identifier.proquest9210281en_US
thesis.degree.disciplineComputer Scienceen_US
thesis.degree.disciplineGraduate Collegeen_US
thesis.degree.namePh.D.en_US
refterms.dateFOA2018-06-12T10:57:13Z
html.description.abstractThe DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 ≤ ε < 1, find a shortest sequence S such that every fragment F ∈ F, or its reverse complement, matches a substring of S with at most ε|F| errors. Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2ε/(1-ε) and at most a factor 1/1-ε longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1- ε)α under the overlap measure. We construct the overlap graph for (1) in O(εN²) time given fragments of total length N at error rate ε. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K ≤ 2ᴱ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V)log V) time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.


Files in this item

Thumbnail
Name:
azu_td_9210281_sip1_m.pdf
Size:
6.082Mb
Format:
PDF
Description:
azu_td_9210281_sip1_m.pdf

This item appears in the following Collection(s)

Show simple item record