Exploration of de Bruijn Graph Filtering for de novo Assembly Using GraphLab
Abstract
The emergence of next-generation DNA sequencers has raised interest in short-read de novo assembly of whole genomes. Though numerous frameworks have been developed in the field, the presence of errors in reads and the increasing size of datasets call for scalable preprocessing methods for noise filtering. In this paper we present a filtering algorithm that aims to determine the valid k-mers in a de Bruijn graph built from short reads. By removing erroneous k-mers from the datasets at an early stage, such preprocessing helps increase accuracy and reduce the memory footprint of subsequent assembly procedures. The algorithm leverages GraphLab, a scalable graph processing framework not previously used in traditional assembly toolchains. The accuracy of the algorithm was evaluated on synthetic datasets with various error rates, and it was shown to recover large parts of the de Bruijn graph even on datasets with error levels higher than those of real-life datasets. The implementation runs on a distributed cluster; a study of its scalability and operating performance shows interesting scaling properties, demonstrating the relevance of GraphLab in this context.
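For readers unfamiliar with the terms, the following minimal Python sketch illustrates what k-mers, a de Bruijn graph, and k-mer filtering look like on a toy input. It is not the algorithm of this paper and does not use GraphLab: the abundance threshold `min_count`, the function names, and the sample reads are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times.

    Assumption: a plain abundance threshold stands in for the
    validity criterion, which the abstract does not specify.
    """
    return {kmer for kmer, n in counts.items() if n >= min_count}

def build_de_bruijn(kmers):
    """Link each k-mer to the k-mers that overlap it by k-1 characters."""
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    # An edge u -> v exists when the (k-1)-suffix of u is the (k-1)-prefix of v.
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmers}

# Hypothetical toy reads; the last one carries a single substitution error.
reads = ["ACGTAC", "ACGTAC", "CGTACG", "ACGTTC"]
counts = count_kmers(reads, k=4)
valid = filter_kmers(counts, min_count=2)
graph = build_de_bruijn(valid)
```

In this toy input the erroneous k-mers `CGTT` and `GTTC` are dropped, but so is the genuine k-mer `TACG`, which appears only once. This shows why a naive threshold is a crude criterion at low coverage.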