Journal articles

AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures

Camelia Constantin 1, Cedric Du Mouza 2, Witold Litwin 3, Philippe Rigaux 4, Thomas Schwarz
1 BD - Bases de Données
LIP6 - Laboratoire d'Informatique de Paris 6
2 CEDRIC - ISID - CEDRIC. Ingénierie des Systèmes d'Information et de Décision
CEDRIC - Centre d'études et de recherche en informatique et communications
4 CEDRIC - VERTIGO - CEDRIC. Bases de données avancées
CEDRIC - Centre d'études et de recherche en informatique et communications
Abstract: We present the AS-Index, a new index structure for exact string search in disk-resident databases. The AS-Index relies on a classical inverted file structure; its main innovation is a probabilistic search based on the properties of algebraic signatures, which are used both for n-gram hashing and for pattern search. Specifically, the properties of our signatures allow a search to be carried out by inspecting only two of the posting lists. The algorithm thus enjoys the unique feature of requiring a constant number of disk accesses, independent of both the pattern size and the database size. We conduct extensive experiments on large datasets to evaluate the behavior of our index. They confirm that it steadily provides a search performance proportional to the two disk accesses necessary to obtain the posting lists. This makes our structure a choice of interest for the class of applications that require very fast lookups in large textual databases.

We describe the index structure, our use of algebraic signatures, and the search algorithm. We discuss the operational trade-offs based on the parameters that affect the behavior of our structure, and present the theoretical and experimental performance analysis. We then compare the AS-Index to the state-of-the-art alternatives and show that (i) the construction time matches that of the competitors, due to the similarity of structures, and (ii) the search time consistently outperforms the standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of our search method.
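
The two-posting-list search mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not spell out the index layout. The sketch therefore makes three assumptions: each posting stores its offset together with a cumulative signature of the text preceding it; a polynomial hash modulo a prime stands in for algebraic signatures over a Galois field; and posting lists are keyed by the n-gram itself rather than by its signature. The names `signature`, `build_index`, and `search`, and the constants `P` and `B`, are hypothetical.

```python
from collections import defaultdict

P = (1 << 61) - 1  # prime modulus (assumed stand-in for a Galois field)
B = 257            # base (assumed stand-in for a primitive element)


def signature(s: bytes) -> int:
    """Return sum(s[i] * B**i) mod P, a simple polynomial signature."""
    sig, power = 0, 1
    for byte in s:
        sig = (sig + byte * power) % P
        power = (power * B) % P
    return sig


def build_index(text: bytes, n: int) -> dict:
    """Map each n-gram to {offset: cumulative signature of text[:offset]}."""
    cumulative = [0]
    power = 1
    for byte in text:
        cumulative.append((cumulative[-1] + byte * power) % P)
        power = (power * B) % P
    index = defaultdict(dict)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]][i] = cumulative[i]
    return index


def search(index: dict, pattern: bytes, n: int) -> list:
    """Return candidate offsets using only two posting lists."""
    m = len(pattern)
    if m < n:
        raise ValueError("pattern must be at least n bytes long")
    first = index.get(pattern[:n], {})      # postings of the first n-gram
    last = index.get(pattern[m - n:], {})   # postings of the last n-gram
    target = signature(pattern[:m - n])     # signature of everything before the last n-gram
    hits = []
    for pos1, cas1 in first.items():
        cas2 = last.get(pos1 + m - n)
        if cas2 is None:
            continue
        # Signature of text[pos1 : pos1 + m - n], derived from the two
        # cumulative signatures without reading the text itself.
        between = (cas2 - cas1) * pow(B, -pos1, P) % P
        if between == target:
            hits.append(pos1)
    return sorted(hits)


text = b"abracadabra abracadabra"
index = build_index(text, 3)
print(search(index, b"cadab", 3))
```

Because the comparison is made on signatures rather than on the text, a reported offset is a match only with high probability; in this sketch the collision rate depends on the choice of `P` and `B`, not on anything stated in the abstract. Note that `pow` with a negative exponent and a modulus requires Python 3.8 or later.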

https://hal.archives-ouvertes.fr/hal-01126550
Contributor : Laboratoire Cedric
Submitted on : Friday, March 6, 2015 - 12:00:55 PM
Last modification on : Wednesday, March 4, 2020 - 11:01:44 AM

Citation

Camelia Constantin, Cedric Du Mouza, Witold Litwin, Philippe Rigaux, Thomas Schwarz. AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures. Journal of Computer Science and Technology, Springer Verlag, 2016, 31 (1), pp.147-166. ⟨10.1007/s11390-016-1618-6⟩. ⟨hal-01126550⟩
