GATB: a software toolbox for genome assembly and analysis

Abstract : The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix array, Burrows-Wheeler transform, Bloom filter, etc.). Read mappers, genome assemblers, SNP callers, and similar tools make intensive use of these data structures to keep their memory footprint as low as possible.

The overall efficiency of NGS software comes from a smart combination of how data are represented in computer memory and how they are processed by the available processing units of a processor. Developing such software is thus a real challenge, as it requires a broad range of skills, from high-level data structure and algorithm concepts to fine implementation details.

GATB toolbox

The GATB software toolbox aims to ease the design of NGS algorithms. It offers a set of high-level optimized building blocks that speed up the development of NGS tools for genome assembly and/or genome analysis. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computer, small server) with a few GB of memory. Through a high-level C++ API, NGS software designers can rapidly build their own software based on state-of-the-art algorithms and data structures of the domain.

The GATB library is written in C++ and is available at http://gatb.inria.fr under the GNU Affero GPL license.

Genomic Software

Various software tools targeting specific genomic treatments have been built on the GATB toolbox. Below is a short list of the tools currently available. Many other tools are under development.

Minia is a short-read assembler capable of assembling large and complex genomes into contigs on a desktop computer. The assembler produces contigs of similar length and accuracy compared to other assemblers. As an example, a Boa constrictor constrictor (1.6 Gbp) dataset (Illumina 2x120 bp reads, 125x coverage) from Assemblathon 2 can be processed in approximately 45 hours with 3 GB of memory on a standard computer (3.4 GHz 8-core processor) using a single core, yielding a contig N50 of 3.6 Kbp (prior to scaffolding and gap-filling).

Bloocoo is a k-mer spectrum-based read error corrector, designed to correct large datasets with a very low memory footprint. The correction procedure is similar to the Musket multistage approach. Bloocoo yields similar results while requiring far less memory: as an example, it can correct whole human genome re-sequencing reads at 70x coverage with less than 4 GB of memory.

DiscoSNP aims to discover Single Nucleotide Polymorphisms (SNPs) from non-assembled reads. Applied to a mouse dataset (2.88 Gbp, 100 bp Illumina reads), DiscoSNP takes 34 hours and at most 4.5 GB of RAM. In the same spirit, the TakeABreak software discovers inversions from non-assembled reads. It directly finds particular patterns in the de Bruijn graph and delivers execution performance similar to DiscoSNP.
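
As an illustration of the two ideas the abstract relies on (a k-mer spectrum and a de Bruijn graph), here is a minimal C++ sketch. It is not the GATB API: the names `KmerCounts`, `count_kmers`, `is_solid` and `successors` are hypothetical, k-mers are stored as plain strings in a hash map rather than in a compact structure such as a Bloom filter, and reverse complements are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names for illustration only; this is not the GATB API.
using KmerCounts = std::unordered_map<std::string, unsigned>;

// Build the k-mer spectrum: count every substring of length k in the reads.
KmerCounts count_kmers(const std::vector<std::string>& reads, std::size_t k) {
    assert(k > 0);
    KmerCounts counts;
    for (const std::string& read : reads) {
        // Reads shorter than k contain no k-mer; the loop body never runs.
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            ++counts[read.substr(i, k)];
        }
    }
    return counts;
}

// A k-mer is called "solid" when it occurs at least min_abundance times.
// Rare k-mers are more likely to contain sequencing errors.
bool is_solid(const KmerCounts& counts, const std::string& kmer,
              unsigned min_abundance) {
    auto it = counts.find(kmer);
    return it != counts.end() && it->second >= min_abundance;
}

// Implicit de Bruijn graph: nodes are solid k-mers, and an edge goes from
// one k-mer to another when the last k-1 characters of the first equal the
// first k-1 characters of the second. Assumes kmer is non-empty.
std::vector<std::string> successors(const KmerCounts& counts,
                                    const std::string& kmer,
                                    unsigned min_abundance) {
    std::vector<std::string> next;
    const std::string suffix = kmer.substr(1);
    const std::string bases = "ACGT";
    for (char base : bases) {
        std::string candidate = suffix + base;
        if (is_solid(counts, candidate, min_abundance)) {
            next.push_back(candidate);
        }
    }
    return next;
}
```

With the reads `ACGTACGA` and `ACGTT` and k = 3, the k-mer `ACG` is counted three times and `CGT` twice, so with an abundance threshold of 2 the only successor of `ACG` is `CGT`. The graph is never stored explicitly: neighbours are found by querying the set of solid k-mers, which is why a memory-efficient membership structure matters at genome scale.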

https://hal.archives-ouvertes.fr/hal-01088641
Contributor : Dominique Lavenier
Submitted on : Tuesday, December 9, 2014 - 10:16:35 PM
Last modification on : Thursday, August 22, 2019 - 12:04:02 PM
Long-term archiving on : Tuesday, March 10, 2015 - 10:10:30 AM

File

BioIT_poster_v1.1.1.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01088641, version 1

Citation

Erwan Drezen, Guillaume Rizk, Rayan Chikhi, Charles Deltel, Claire Lemaitre, et al.. GATB: a software toolbox for genome assembly and analysis. Bio-IT World Conference, Apr 2014, Boston, United States. ⟨hal-01088641⟩
