RNA sequence assembly using base pair bond structure
University of New Brunswick
Nucleic acid (DNA and RNA) sequencing is constrained by technology that limits the length of sequence that can be read at once. Longer sequences are therefore read in pieces, and an important part of the process is re-assembling these pieces using the overlaps between pieces drawn from multiple copies of the sequence. Sequence assembly is a complex computational problem, particularly when the target sequence is previously unknown and no reference is available to guide the placement of the pieces. The advent of a new generation of high-throughput technology that quickly reads shorter sequences makes assembly an even more critical step, because shorter reads increase the likelihood of missing regions in the final re-assembled sequence. Current sequence assembly algorithms treat the problem purely as one of character strings and seek a best result in the form of a shortest possible ‘superstring’. However, DNA, RNA and other biological sequences are more than just strings, and the shortest possible string does not necessarily reflect how the pieces were originally ordered, which is ultimately the goal of sequence assembly. This work proposes an alternative approach that uses known structural properties, and the conditions that govern how biological sequences are built, to help re-assemble RNA sequences. It includes tests whose results show a noticeable increase in the accuracy of the assembly compared with past approaches, and show that using structure also enables some shorter sequences to be assembled from a single copy.
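The gap between a shortest superstring and the original ordering can be seen in a small sketch. The code below is not the method proposed in this work, nor any particular assembler; it is a toy greedy overlap-merge heuristic, a common textbook approximation to the shortest superstring problem. The function names (`overlap`, `greedy_superstring`) and the 16-base RNA sequence are invented for the illustration. The sequence contains three copies of the repeat `GAUC`, and the reads are 6-base windows taken every 2 bases.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Toy greedy heuristic: repeatedly merge the pair with the largest overlap."""
    # The pure string view ignores coverage, so duplicate reads and reads
    # contained in another read carry no extra information and are dropped.
    frags = list(dict.fromkeys(reads))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0] if frags else ""


# Hypothetical target: AA + three copies of GAUC + UU (16 bases).
original = "AAGAUCGAUCGAUCUU"
# Overlapping 6-base reads taken every 2 bases along the target.
reads = [original[i:i + 6] for i in range(0, len(original) - 5, 2)]

assembled = greedy_superstring(reads)
print(assembled)                      # AAGAUCGAUCUU
print(len(original), len(assembled))  # 16 12
```

Every read is a substring of the 12-base result, so it is a valid superstring, yet one copy of the repeat has been lost and the result is not the sequence the reads came from. String length alone cannot separate the two candidates, which is the kind of ambiguity that additional information, such as the structural constraints considered in this work, is meant to help resolve.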