Minimally-overlapping words for sequence similarity search

Martin Frith; Laurent Noé; Gregory Kucherov

doi:10.1093/bioinformatics/btaa1054

Journal article, Bioinformatics, Year: 2020


Martin Frith

Role: Author

Artificial Intelligence Research Center [Tokyo]

Laurent Noé

Role: Author
PersonId : 85
IdHAL : noe
ORCID : 0000-0002-1170-8376
IdRef : 093601948

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Gregory Kucherov

Role: Author
PersonId : 14903
IdHAL : gregory-kucherov
ORCID : 0000-0001-5899-5424
IdRef : 093602189

Laboratoire d'Informatique Gaspard-Monge

Abstract

Motivation: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results: Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.

Supplementary information: Supplementary data are available at Bioinformatics online.
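The word-based sparse seeding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `word_positions` and `can_overlap` and the test sequence are hypothetical, and the overlap check assumes all words have the same length. The sketch picks seed positions where one of the example words (ac, at, gc, gt) begins, and checks whether any word in a set can overlap a word of the set (itself included), meaning a proper suffix of one equals a proper prefix of another.

```python
def word_positions(seq, words):
    """Return start positions in seq where any word in `words` begins."""
    seq = seq.lower()
    lengths = {len(w) for w in words}
    return [i for i in range(len(seq))
            if any(seq[i:i + n] in words for n in lengths)]


def can_overlap(words):
    """True if a proper suffix of some word equals a proper prefix of
    some word in the set (possibly the same word).
    Assumption: all words have equal length."""
    for u in words:
        for v in words:
            for k in range(1, min(len(u), len(v))):
                if u[-k:] == v[:k]:
                    return True
    return False


# Example word set taken from the abstract; the sequence is made up.
words = {"ac", "at", "gc", "gt"}
seq = "ggacgtatta"

positions = word_positions(seq, words)
print(positions)            # [2, 4, 6]
print(can_overlap(words))   # False: no word ends with a letter that starts a word
print(can_overlap({"aa"}))  # True: "aa" overlaps itself
```

In this example set every word starts with a or g and ends with c or t, so two occurrences can never overlap and the selected positions are always at least one word length apart. A self-overlapping word such as aa can instead occur at adjacent positions, which is the clumping behaviour the abstract contrasts against.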

Subject areas

Data Structures and Algorithms [cs.DS]; Bioinformatics [q-bio.QM]

Main file

resubmission-bioinfo.pdf (314.1 KB)

Origin: Publisher files allowed on an open archive


https://hal.science/hal-03087470

Submitted on: Wednesday, December 23, 2020, 20:59:02

Last modified on: Wednesday, January 24, 2024, 09:54:23

Dates and versions

hal-03087470 , version 1 (23-12-2020)

Identifiers

HAL Id : hal-03087470 , version 1
DOI : 10.1093/bioinformatics/btaa1054

Cite

Martin Frith, Laurent Noé, Gregory Kucherov. Minimally-overlapping words for sequence similarity search. Bioinformatics, 2020, ⟨10.1093/bioinformatics/btaa1054⟩. ⟨hal-03087470⟩


Collections

ENPC CNRS PARISTECH LIGM LIGM_MOA CRISTAL CRISTAL-BONSAI UNIV-LILLE ANR UNIV-EIFFEL LIGM_ADA

