

Abstract: The pattern discovery literature has long struggled with two major problems. First, when the minimum interest threshold is low, the relevant patterns cannot be used directly because there are far too many of them; conversely, when the threshold is too high, some instances are described poorly or not at all. Second, the full set of patterns that meet the minimum interest threshold may contain many redundancies. Output sampling is a non-exhaustive method for the instant discovery of relevant patterns; it ensures good interactivity while providing strong statistical guarantees thanks to its random nature. Curiously, although this approach has been studied for several types of patterns, including itemsets and subgraphs, it has not yet been applied to sequential patterns or to distributed databases. In this thesis, we propose several methods dedicated to sequential pattern sampling, pattern sampling in distributed databases, and trie-based pattern sampling. Beyond addressing these complex tasks, the originality of our approaches lies in introducing a class of interestingness measures that rely on the norm of the pattern, called norm-based interestingness measures. In particular, this class makes it possible to add constraints on the norm of sampled patterns, which controls the length of the drawn patterns and avoids the pitfall of the "long tail", where the rarest patterns flood the user. In this context, we first propose two algorithms: NUSSampling for sequential databases and DDSampling for distributed databases. Based on two-step random procedures that incorporate this class of interestingness measures, they draw patterns at random with probability proportional to their frequency weighted by a norm-based utility. Second, we propose TPSampling, a sampling algorithm for itemsets based on the trie structure. It consumes less memory and likewise draws patterns at random with probability proportional to their frequency weighted by a norm-based utility.
We show that all of our methods perform exact sampling according to the underlying measure. At the application level, we focus on the benefit of norm constraints and exponential decay, which help to draw general patterns from the head of the long tail. We also illustrate how these sampled patterns can be used to build classifiers dedicated to sequences and itemsets. This classification approach rivals state-of-the-art proposals, which shows the value of sequential pattern sampling with norm-based utility. Finally, we illustrate the usefulness of the sampled patterns on the distributed data of the Semantic Web by detecting outlier entities in DBpedia and Wikidata.
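To make the idea of a two-step random procedure concrete, here is a minimal Python sketch for plain itemsets. It is not the thesis's NUSSampling, DDSampling or TPSampling; it only illustrates the general principle of drawing a pattern with probability proportional to its frequency times a norm-based utility, under norm constraints. The function names, the decay value 0.5 and the default norm bounds are assumptions made for illustration.

```python
import math
import random


def exp_decay_utility(length, decay=0.5):
    """Hypothetical norm-based utility: exponential decay in the pattern length."""
    return decay ** length


def length_weights(size, utility, min_norm, max_norm):
    """Weight of each admissible pattern length inside a transaction of `size` items.

    There are C(size, l) sub-itemsets of length l, each carrying utility(l).
    """
    longest = min(size, max_norm)
    return {l: math.comb(size, l) * utility(l) for l in range(min_norm, longest + 1)}


def sample_pattern(transactions, utility, min_norm=1, max_norm=3, rng=random):
    """Draw one itemset X with probability proportional to freq(X) * utility(|X|).

    `transactions` is a list of sets; at least one must contain min_norm items.
    """
    per_tx = [length_weights(len(t), utility, min_norm, max_norm) for t in transactions]
    tx_weights = [sum(w.values()) for w in per_tx]

    # Step 1: draw a transaction proportionally to the total weight of the
    # admissible patterns it contains.
    i = rng.choices(range(len(transactions)), weights=tx_weights, k=1)[0]

    # Step 2: draw a length proportionally to its weight in that transaction,
    # then a uniform sub-itemset of that length.
    lengths = list(per_tx[i])
    l = rng.choices(lengths, weights=[per_tx[i][x] for x in lengths], k=1)[0]
    return frozenset(rng.sample(sorted(transactions[i]), l))


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
    print(sample_pattern(db, exp_decay_utility, min_norm=1, max_norm=2))
```

In this sketch, a pattern X of length l is reached through each transaction that contains it with probability utility(l) divided by the total weight, so summing over those transactions gives a probability proportional to freq(X) * utility(l). The maximum norm keeps long, rare patterns out of the sample, and the decaying utility favours shorter, more general ones.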

Cited literature [98 references]
Contributor : Lamine DIOP
Submitted on : Monday, November 9, 2020 - 6:35:40 PM
Last modification on : Tuesday, November 17, 2020 - 2:57:36 PM


Files produced by the author(s)


  • HAL Id : tel-02948509, version 2


Lamine Diop. ECHANTILLONNAGE SOUS CONTRAINTES DE MOTIFS STRUCTURES. Information Retrieval [cs.IR]. Université Gaston Berger de Saint-Louis (Sénégal), 2020. French. ⟨tel-02948509⟩


