Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees - Archive ouverte HAL Accéder directement au contenu
Communication Dans Un Congrès Année : 2022

Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees

Résumé

A public database such as the SRA (Sequence Read Archive) has reached 30 peta-bases of raw sequences and doubles its nucleotide content every two years. While BLAST-like methods can routinely search a sequence in a single genome or a small collection of genomes, making accessible such immense resources is out of reach for alignment-based strategies. In the last years, an abundant literature tackled the fundamental task of locating a sequence in an extensive dataset collection by converting the query and datasets to k-mers sets and computing their intersections. Among those methods, approximate membership query data structures conjugate the ability to query small signatures or variants while being scalable to collections of thousands of sequencing experiments. However, at present, more than 10,000 eukaryotic samples are not analyzable in reasonable time and space frames. Here we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC presents several advantages over the state-of-the-art, enabling users to scale to the next order of magnitude. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3 to 6 fold improvement in construction time compared to other compressed methods for comparable index size. Using inverted indexes and a novel data structure dubbed aggregative Bloom filters, a PAC query can need single random access and be performed in constant time in favorable instances. Thanks to efficient partitioning techniques, index construction and queries can be performed in parallel with a low memory footprint. Using limited computation resources and in five days, we built a PAC index that included more than 30,000 human RNA-seq samples (corresponding to more than 40 billion distinct k-mers) to assess PAC's scalability. We also showed that PAC's ability to query 500,000 transcript sequences in less than an hour. The only bottleneck being the index size, PAC should scale on hundred of thousand datasets on a regular cluster. PAC is open-source software available at https://github.com/Malfoy/PAC.
Fichier principal
Vignette du fichier
2022.02.11.480089.full.pdf (450.15 Ko) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-03832918 , version 1 (28-10-2022)

Identifiants

  • HAL Id : hal-03832918 , version 1

Citer

Camille Marchet, Antoine Limasset. Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees. Recomb 2022- 26th Annual International Conference on Research in Computational Molecular Biology, May 2022, La jolla, United States. ⟨hal-03832918⟩
26 Consultations
37 Téléchargements

Partager

Gmail Facebook X LinkedIn More