Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees

Camille Marchet; Antoine Limasset

Conference paper, Year: 2022


Camille Marchet

Role: Author
PersonId: 170261
IdHAL: camille-marchet
ORCID: 0000-0002-7235-7346

Antoine Limasset

Role: Author
PersonId: 180632
IdHAL: antoine-limasset
ORCID: 0000-0002-0669-4141
IdRef: 223503908

Centre National de la Recherche Scientifique

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Abstract

A public database such as the SRA (Sequence Read Archive) has reached 30 petabases of raw sequences and doubles its nucleotide content every two years. While BLAST-like methods can routinely search for a sequence in a single genome or a small collection of genomes, making such immense resources accessible is out of reach for alignment-based strategies. In recent years, an abundant literature has tackled the fundamental task of locating a sequence in an extensive dataset collection by converting the query and the datasets to k-mer sets and computing their intersections. Among those methods, approximate membership query data structures combine the ability to query small signatures or variants with scalability to collections of thousands of sequencing experiments. However, at present, collections of more than 10,000 eukaryotic samples cannot be analyzed within reasonable time and space budgets. Here we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC presents several advantages over the state of the art, enabling users to scale to the next order of magnitude. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. Using inverted indexes and a novel data structure dubbed aggregative Bloom filters, a PAC query can require a single random access and be performed in constant time in favorable instances. Thanks to efficient partitioning techniques, index construction and queries can be performed in parallel with a low memory footprint. To assess PAC's scalability, we built, using limited computational resources and in five days, a PAC index that included more than 30,000 human RNA-seq samples (corresponding to more than 40 billion distinct k-mers). We also showed PAC's ability to query 500,000 transcript sequences in less than an hour.
With index size as the only bottleneck, PAC should scale to hundreds of thousands of datasets on a regular cluster. PAC is open-source software available at https://github.com/Malfoy/PAC.
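
As a rough illustration of the general k-mer approach the abstract refers to (converting datasets and queries to k-mer sets and testing membership approximately), the sketch below builds one plain Bloom filter per dataset and reports the fraction of query k-mers found in each. It is not PAC's method: it uses neither aggregative Bloom filters, nor inverted indexes, nor partitioning. All names (`kmers`, `BloomFilter`, `build_index`, `query`) and the parameter values (`n_bits`, `n_hashes`, `threshold`) are assumptions chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BloomFilter:
    """Plain Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_index(datasets, k, n_bits=1 << 16, n_hashes=3):
    """Build one Bloom filter per dataset (hypothetical layout, not PAC's)."""
    index = {}
    for name, sequences in datasets.items():
        bf = BloomFilter(n_bits, n_hashes)
        for seq in sequences:
            for kmer in kmers(seq, k):
                bf.add(kmer)
        index[name] = bf
    return index


def query(index, seq, k, threshold=0.8):
    """Return {dataset: fraction of query k-mers found} for fractions >= threshold."""
    query_kmers = kmers(seq, k)
    if not query_kmers:
        return {}  # query shorter than k: nothing to look up
    hits = {}
    for name, bf in index.items():
        found = sum(1 for kmer in query_kmers if kmer in bf)
        fraction = found / len(query_kmers)
        if fraction >= threshold:
            hits[name] = fraction
    return hits


if __name__ == "__main__":
    toy = {"sample_a": ["ACGTACGTTGCA"], "sample_b": ["TTTTTTTTTTTT"]}
    idx = build_index(toy, k=4)
    print(query(idx, "ACGTACGT", k=4))
```

In this naive layout, query cost grows linearly with the number of datasets, since every filter is probed for every k-mer. That per-dataset cost is the kind of overhead that structures such as the one described in the abstract aim to reduce.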

Domains

Bioinformatics [q-bio.QM]

Main file

2022.02.11.480089.full.pdf (450.15 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03832918

Submitted on: Friday, 28 October 2022, 09:39:40

Last modified on: Wednesday, 24 January 2024, 09:54:23

Long-term archiving on: Sunday, 29 January 2023, 18:13:19

Dates and versions

hal-03832918, version 1 (28-10-2022)

Identifiers

HAL Id: hal-03832918, version 1

Cite

Camille Marchet, Antoine Limasset. Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees. RECOMB 2022 - 26th Annual International Conference on Research in Computational Molecular Biology, May 2022, La Jolla, United States. ⟨hal-03832918⟩

Collections

CNRS CRISTAL CRISTAL-BONSAI UNIV-LILLE
