De Novo Short Read Genome Sequencing
Ariella Sasson, Rutgers University
One of the most significant advances in biology has been the ability to sequence the DNA of organisms. Even after the completion of the human genome, intractable regions of the genome remain incomplete. New whole genome sequencing technologies are needed to reach the goal of sequencing a full human genome for $1,000. Next-generation high-throughput short read sequencing technologies are now available and can generate millions of short reads. Although greater coverage depths are possible, de novo sequence assembly with these shorter sequences is significantly more complex than resequencing; handling them presents new computational problems and opportunities. Identifying repetitive regions during the assembly, coping with errors in the short read sequences, and manipulating millions of reads simultaneously are some of the difficulties that must be overcome through better algorithms or greater computational power. These complexities have made researchers question whether an optimal genome assembly can be completed at acceptable computational cost. One major consideration is the computational resources the algorithms require to assemble the desired genome. Both memory costs and run time are significant issues when dealing with millions of sequence reads. Another consideration is that the short read length (<100 base pairs) means the assembler must be able to deal with numerous ambiguous overlaps, the correction of sequence errors, and the assembly of reads containing errors. A few assemblers have been developed or modified to assemble short read sequences. Each has its limitations, and while some have shown success on smaller bacterial artificial chromosomes (BACs) and small to mid-sized genomes (10-35 Mb), none has succeeded in “assembling” large genomes because of their computational expense, including RAM limitations and/or run-time length.
To address the problem of producing an optimal de novo assembly from short read sequences, new computational methods must take into account the proper use of clusters, algorithms, and memory. With such continuous advances, assembling the genomes of a myriad of unsequenced organisms, from bacteria, fungi, and plants in the short term to animals in the longer term, will soon become a reality.
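The ambiguous overlaps mentioned above can be illustrated with a minimal sketch. It is not taken from any of the assemblers discussed; the function names, the toy reads, and the minimum overlap of 4 bases are all assumptions chosen for illustration. The sketch flags reads whose suffix matches the prefix of more than one other read, which is the situation a repeat creates when reads are short.

```python
def overlap_length(a, b, min_overlap):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap reaches `min_overlap` bases."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def successors(reads, min_overlap):
    """Map each read to the list of other reads it could be extended by."""
    return {
        a: [b for b in reads
            if b != a and overlap_length(a, b, min_overlap) > 0]
        for a in reads
    }


def ambiguous_reads(reads, min_overlap):
    """Reads with more than one possible extension (ambiguous overlaps)."""
    return [read for read, nxt in successors(reads, min_overlap).items()
            if len(nxt) > 1]


# Hypothetical toy reads: the 4-base repeat "ACGT" occurs in two contexts,
# so any read ending in the repeat can be extended in two different ways.
toy_reads = ["TTACGT", "ACGTCC", "ACGTGG", "GGACGT"]
ambiguous = ambiguous_reads(toy_reads, min_overlap=4)
```

This all-against-all comparison is quadratic in the number of reads, which hints at why memory and run time become limiting once millions of reads are involved; practical assemblers rely on indexing or graph-based methods instead.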
Abstract Author(s): Ariella Sasson, Anirvan Sengupta, & Todd Michael