## Exact and approximation algorithms for DNA sequence reconstruction.

dc.contributor.advisor | Myers, Eugene W. | en_US |

dc.contributor.author | Kececioglu, John Dimitri. | |

dc.creator | Kececioglu, John Dimitri. | en_US |

dc.date.accessioned | 2011-10-31T17:44:56Z | |

dc.date.available | 2011-10-31T17:44:56Z | |

dc.date.issued | 1991 | en_US |

dc.identifier.uri | http://hdl.handle.net/10150/185673 | |

dc.description.abstract | The DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 ≤ ε < 1, find a shortest sequence S such that every fragment F ∈ F, or its reverse complement, matches a substring of S with at most ε|F| errors. Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2ε/(1-ε) and at most a factor 1/1-ε longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1- ε)α under the overlap measure. We construct the overlap graph for (1) in O(εN²) time given fragments of total length N at error rate ε. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K ≤ 2ᴱ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V)log V) time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity. | |

dc.language.iso | en | en_US |

dc.publisher | The University of Arizona. | en_US |

dc.rights | Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. | en_US |

dc.subject | DNA -- Computer programs. | en_US |

dc.title | Exact and approximation algorithms for DNA sequence reconstruction. | en_US |

dc.type | text | en_US |

dc.type | Dissertation-Reproduction (electronic) | en_US |

dc.identifier.oclc | 704395410 | en_US |

thesis.degree.grantor | University of Arizona | en_US |

thesis.degree.level | doctoral | en_US |

dc.contributor.committeemember | Downey, Peter | en_US |

dc.contributor.committeemember | Manber, Udi | en_US |

dc.identifier.proquest | 9210281 | en_US |

thesis.degree.discipline | Computer Science | en_US |

thesis.degree.discipline | Graduate College | en_US |

thesis.degree.name | Ph.D. | en_US |

refterms.dateFOA | 2018-06-12T10:57:13Z | |

html.description.abstract | The DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 ≤ ε < 1, find a shortest sequence S such that every fragment F ∈ F, or its reverse complement, matches a substring of S with at most ε|F| errors. Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2ε/(1-ε) and at most a factor 1/1-ε longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1- ε)α under the overlap measure. We construct the overlap graph for (1) in O(εN²) time given fragments of total length N at error rate ε. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K ≤ 2ᴱ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V)log V) time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity. |

### Files in this item

**Name:**

**Size:**

**Format:**

**Description:**