基因组装算法:调研
Genome Assembly Algorithms: A Survey

作者: 连帅彬 , 戴宪华 :中山大学信息科学与技术学院;

关键词: 基因组装De Bruijn图下一代测序技术(NGS)Genome Assembly De Bruijn Graph Next Generation Sequencing (NGS)

摘要: 基因测序技术在过去的二十几年里取得了突飞猛进的发展,随着以高通量,短读取,低成本为特点的新一代基因测序技术的问世,测序一个物种全基因的时间和成本大大降低。基于下一代测序技术的全基因组装算法和软件相继开发出来,目前比较成熟的基因组装算法大约有二十种左右。由于基因组装问题本身的复杂性,目前还没有针对不同组装算法的具体设计步骤,操作环境,应用范围等方面的调研。基于此本文简要调研了现有的十二种具有代表性的基因组装算法,系统的分析了每种算法的设计步骤,算法原理,操作环境以及应用。这篇调研对于如何设计基因组装算法,对于不同的基因数据如何选择更加合适的基因组装算法和软件提供了一定的指导。

Abstract: During the last twenty years, genome sequencing technology has made a great development. With the appearance of the Next Generation Sequencing Technology characterized by higher throughput, shorter read and even lower cost, the time and cost of sequencing the whole genomes of a species are sharply decreased. The genome assembly algorithms and software are developed in succession, currently there are almost twenty ones. Due to the complexity of genome assembly itself, there is no survey aiming at the special steps, operational environment, applications of the different genome algorithms. In this perspective, we survey these typical twelve genome assembly algorithms briefly and then analyze the special steps, principles, operational environments and applications. This survey can provide a guidance on how to design or develop a genome assembly algorithm or software, and which genome assembly algorithm of software is more suitable according to different genome sequencing data.

文章引用: 连帅彬 , 戴宪华 (2013) 基因组装算法:调研。 计算生物学, 3, 7-14. doi: 10.12677/hjcb.2013.32002

参考文献

[1] M. L. Metzker. Sequencing technologies—The next generation. Nature Review Genetic, 2010, 11(1): 31-46.

[2] E. R. Mardis. The impact of next-generation sequencing tech- nology on genetics. Trends in Genetics, 2008, 24(3): 133-141.

[3] O. Morozova, M. A. Marra. Applications of next-generation se- quencing technologies in functional genomics. Genomics, 2008, 92(5): 255-264.

[4] R. L. Strausberg, S. Levy and Y. H. Rogers. Emerging DNA sequencing technologies for human genomic medicine. Drug Discovery Today, 2008, 13(13-14): 569-577.

[5] E. Pettersson, J. Lundeberg and A. Ahmadian. Generations of sequencing technologies. Genomics, 2009, 93(2): 105-111.

[6] C. A. Hutchison III. DNA sequencing: Bench to bedside and beyond. Nucleic Acids Research, 2007, 35(18): 6227-6237.

[7] M. Pop. Genome assembly reborn: Recent computational chal- lenges. Briefings in Bioinformatics, 2009, 10(4): 354-366.

[8] R. Staden. A strategy of DNA sequencing employing computer programs. Nucleic Acids Research, 1979, 6(7): 2601-2610.

[9] N. J. Loman, R. V. Misra and T. J. Dallman. Performance com- parison of benchtop high-throughput sequencing platforms. Na- ture Biotechnology, 2012, 30(5): 434-439.

[10] C. Alkan, S. Sajjadian and E. E. Eichler. Limitations of next- generation genome sequence assembly. Nature Methods, 2011, 8(1): 61-65.

[11] P. Green. Whole-genome disassembly. Proceedings of the National Academy of Sciences of the USA, 2002, 99(7): 4143- 4144.

[12] T. J. Treangen and S. L. Salzberg. Repetitive DNA and next- generation sequencing: Computational challenges and solutions. Nature Reviews: Genetics, 2012, 13(1): 36-46.

[13] D. A. Wheeler, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature, 2008, 452: 872- 876.

[14] D. R. Bentley, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 2008, 456(7218): 53-59.

[15] Y. Benjamini, T. P. Speed. Summarizing and correcting the GC content bias in high throughput sequencing. Nucleic Acids Re- search, 2012, 40(10): e72.

[16] M. Pop. Genome assembly reborn: Recent computational chal- lenges. Briefings in bioinformatics, 2009, 10(4): 354-366.

[17] X. Huang, S. P. Yang. Generating a genome assembly with PCAP. Current Protocols in Bioinformatics, 2005, 11: 3.

[18] D. Hernandez, P. Francois, L. Farinelli, M. Steras and J. Sch- renzel. De novo bacterial genome sequencing: Millions of very short read assembled on a desktop computer. Genome Research, 2008, 18(5): 802-809.

[19] F. Paul, B. Ewan. Sense from sequence reads: Methods for align- ment and assembly. Nature Methods, 2009, 6(11): S6-S12.

[20] S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. P. Mesirov and E. S. Lander. ARACHNE: A whole-genome shotgun assembler. Genome Research, 2002, 12(1): 177-189.

[21] R. L. Warren, G. G. Sutton, S. J. M. Jones and R. A. Holt. As- sembling millions of short DNA sequences using SSAKE. Bio- informatics. 2007, 23(4): 500-501.

[22] J. C. Dohm, C. Lottaz, T. Borodina and H. Himmelbauer. SHARCGS, a fast and highly accurate short-read assembly algo- rithm for denovo genomic sequencing. Genome Research, 2007, 17(11): 1697-1706.

[23] W. R. Jeck, J. A. Reinhardt, D. A. Baltrus, M. T. Hickenbotham, V. Magrini, et al. Extending assembly of short DNA sequences to handle error. Bioinformatics, 2007, 23(21): 2942-2944.

[24] D. W. Bryant Jr., W. K. Wong and T. C. Mockler. QSRA: A quality-value guided de novo short read assembler. BMC Bioin- formatics, 2009, 10: 69.

[25] G. G. Sutton, O. White, M. D. Adams and A. R. Kerlavage. TIGR assembler: A new tool for assembling large shotgun se- quencing projects. Genome Science and Technology, 1995, 1(1): 9-19.

[26] M. J. P. Chaisson, P. A. Pevzner. Short read fragment assembly of bacterial genomes. Genome Research, 2007, 18(2): 324-330.

[27] J. Butler, I. MacCallum, M. Kleber, I. Shlyakhter, M. K. Bel- monte, et al. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research, 2008, 18(5): 810-820.

[28] D. R. Zerbino, E. Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 2008, 18(5): 821-829.

[29] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, et al. ABySS: A parallel assembler for short read sequence data [J]. Genome Research, 2009, 19(6): 1117-1123.

[30] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, H. Yang and J. Wang. De novo assembly of human genomes with massively parallel short read sequenc- ing, Genome Research, 2009, 20(2): 265-272.

[31] S. Steven, L. Phillippy, M. Adam and Z. Aleksey. GAGE: A critical evaluation of genome assemblies and assembly. Genome Research, 2012, 22(3): 557-567.

[32] S. Batzoglou. Algorithmic challenges in mammalian genome sequence assembly. In: M. Dunn, L. Jorde, P. Little and S. Subramaniam, Eds., Encyclopedia of genomics, proteomics and bioinformatics. Hoboken: John Wiley and Sons, 2005.

[33] M. Pop. McGraw-Hill 2006 yearbook of science and technology. New York: McGraw-Hill, 2005.

[34] G. Sutton, I. Dew. Shotgun fragment assembly. In: I. Rigoutsos and G. Stephanopoulos, Eds., Systems biology: Genomics. New York: Oxford University Press, 2007: 79-117.

[35] B. Ewing, L. Hillier, M. C. Wendl and P. Green. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research, 1998, 8(3): 175-185.

[36] E. S. Lander, M. S. Waterman. Genomic mapping by finger- printing random clones: A mathematical analysis. Genomics, 1988, 2(3): 231-239.

[37] M. C. Wendl, R. H. Waterston. Generalized gap model for bacte- rial artificial chromosome clone fingerprint mapping and shot- gun sequencing. Genome Research, 2002, 12(12): 1943-1949.

[38] M. Pop. Genome assembly reborn: Recent computational chal- lenges. Briefings in Bioinformatics, 2009, 10(4): 354-366.

[39] M. Pop, S. L. Salzberg. Bioinformatics challenges of new se- quencing technology. Trends in Genetics, 2008, 24(3): 142-149.

[40] M. Sahli, T. Shibuya. Arapan-S: A fast and highly accurate whole genome assembly software for viruses and small genomes. BMC Research Notes, 2012, 5: 243.

[41] P. E. C. Compeau, P. A. Pevzner and G. Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 2011, 29: 987-991.

[42] H. Fleischner. Eulerian graphs and related topics. London: El- sevier Science, 1990.

[43] P. A. Pevzner, H. X. Tang and M. S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the Na- tional Academy of Sciences of the USA, 2001, 98(17): 9748- 9753.

[44] P. A. Pevzner, H. Tang. Fragment assembly with double-barreled data. Bioinformatics, 2001, 17(1): S225-S233.

[45] D. Earl, K. Bradnam, J. St. John, et al. Assemblathon 1: A com- petitive assessment of de novo short read assembly. Genome Research, 2011, 21(12): 2224-2241.

分享
Top