Conference paper, Discrete Mathematics and Theoretical Computer Science, Year: 2003

q-gram analysis and urn models

Abstract

Words of fixed size $q$ are commonly referred to as $q$-grams. We consider the problem of $q$-gram filtration, a method commonly used to speed up sequence comparison. We are interested in the statistics of the number of $q$-grams common to two random texts (where multiplicities are not counted) in the non-uniform Bernoulli model. In the exact and dependent model, when omitting border effects, a $q$-gram in a random sequence depends on the $q-1$ preceding $q$-grams. In an approximate and independent model, we randomly draw a $q$-gram at each position, independently of the other positions. Using ball and urn models, we analyze the independent model. Numerical simulations show that this model is an excellent first-order approximation to the dependent model. We provide an algorithm to compute the moments.
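The two models described in the abstract can be compared with a small simulation. The sketch below is illustrative only and is not the algorithm from the paper: the function names, the four-letter alphabet, and the letter probabilities are assumptions made for the example. The dependent model reads overlapping $q$-grams from a random text; the independent model draws one $q$-gram per position, as if throwing balls into urns labelled by $q$-grams. The closed-form first moment for the independent model follows from linearity of expectation, assuming the two texts are independent and each has $n-q+1$ positions.

```python
import itertools
import random


def qgram_probs(letter_probs, q):
    """Probability of each q-gram when letters are i.i.d. (Bernoulli model)."""
    probs = {}
    for gram in itertools.product(letter_probs, repeat=q):
        p = 1.0
        for letter in gram:
            p *= letter_probs[letter]
        probs["".join(gram)] = p
    return probs


def common_dependent(letter_probs, n, q, rng):
    """Distinct q-grams shared by two random texts of length n.

    q-grams are read from overlapping windows, so neighbours are dependent.
    """
    letters = list(letter_probs)
    weights = list(letter_probs.values())

    def distinct_qgrams():
        text = "".join(rng.choices(letters, weights=weights, k=n))
        return {text[i:i + q] for i in range(n - q + 1)}

    return len(distinct_qgrams() & distinct_qgrams())


def common_independent(letter_probs, n, q, rng):
    """Same statistic when each position draws a q-gram independently."""
    probs = qgram_probs(letter_probs, q)
    grams = list(probs)
    weights = list(probs.values())

    def distinct_qgrams():
        return set(rng.choices(grams, weights=weights, k=n - q + 1))

    return len(distinct_qgrams() & distinct_qgrams())


def expected_common_independent(letter_probs, n, q):
    """First moment in the independent model.

    A q-gram of probability p occurs in one text with probability
    1 - (1 - p)**m, where m = n - q + 1; the two texts are independent.
    """
    m = n - q + 1
    return sum((1 - (1 - p) ** m) ** 2
               for p in qgram_probs(letter_probs, q).values())


if __name__ == "__main__":
    # Hypothetical non-uniform letter distribution, chosen for illustration.
    letter_probs = {"a": 0.4, "c": 0.3, "g": 0.2, "t": 0.1}
    n, q, trials = 200, 4, 500
    rng = random.Random(1)
    dep = sum(common_dependent(letter_probs, n, q, rng) for _ in range(trials)) / trials
    ind = sum(common_independent(letter_probs, n, q, rng) for _ in range(trials)) / trials
    print("dependent model, empirical mean:  ", dep)
    print("independent model, empirical mean:", ind)
    print("independent model, exact mean:    ", expected_common_independent(letter_probs, n, q))
```

Running the script prints the two empirical means next to the exact first moment of the independent model, which gives a rough sense of how close the two models are for a given choice of `n` and `q`. Higher moments need the joint occupancy probabilities of several urns and are not covered by this sketch.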
Main file
dmAC0124.pdf (210.2 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-01183917 , version 1 (12-08-2015)

Identifiers

HAL Id: hal-01183917
DOI: 10.46298/dmtcs.3322

Cite

Pierre Nicodème. q-gram analysis and urn models. Discrete Random Walks, DRW'03, 2003, Paris, France. pp.243-258, ⟨10.46298/dmtcs.3322⟩. ⟨hal-01183917⟩
134 views
651 downloads
