Journal articles

Statistically valid links and anti-links between words and between documents: applying TourneBool randomization test to a Reuters collection.

Alain Lelu 1, 2, * Martine Cadot 3
* Corresponding author
1 KIWI - Knowledge Information and Web Intelligence
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
3 ABC - Machine Learning and Computational Biology
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects $\times$ attributes binary table in order to establish which inter-attribute relations are fortuitous and which ones are meaningful, without requiring any pre-defined statistical model, while taking into account the empirical distributions. The result is a robust and statistically validated graph. We present a full-scale experiment on one of the publicly available Reuters test corpora. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative ``counter-relations'' between words, i.e. words which ``steer clear'' of one another. We characterize the counter-relation graph in the same way. Finally, we generate the pair of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories.
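
To make the idea of a randomization test on a binary table concrete, here is a minimal Python sketch. It is not the authors' TourneBool implementation: it assumes that "taking into account the empirical distributions" can be approximated by keeping every row and column sum fixed while shuffling the table with random swaps. The function names (`swap_randomize`, `cooccurrences`, `classify_pairs`), the parameters `n_rand`, `n_swaps` and `alpha`, and the toy data are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def swap_randomize(rows, n_swaps, rng):
    """Return a shuffled copy of a binary table (one set of attribute
    indices per object) that keeps every row sum and column sum."""
    table = [set(r) for r in rows]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(table)), 2)
        only_i = sorted(table[i] - table[j])  # attributes present in row i only
        only_j = sorted(table[j] - table[i])  # attributes present in row j only
        if only_i and only_j:
            a, b = rng.choice(only_i), rng.choice(only_j)
            # Exchange a and b between the two rows: margins are unchanged.
            table[i].remove(a)
            table[i].add(b)
            table[j].remove(b)
            table[j].add(a)
    return table


def cooccurrences(table, n_attrs):
    """Count, for every attribute pair, the objects holding both attributes."""
    counts = {pair: 0 for pair in combinations(range(n_attrs), 2)}
    for row in table:
        for pair in combinations(sorted(row), 2):
            counts[pair] += 1
    return counts


def classify_pairs(rows, n_attrs, n_rand=100, n_swaps=1000, alpha=0.05, seed=0):
    """Label each attribute pair 'link', 'anti-link' or 'none' by comparing
    its observed co-occurrence count with counts from randomized tables."""
    rng = random.Random(seed)
    observed = cooccurrences(rows, n_attrs)
    at_least = dict.fromkeys(observed, 0)  # simulations with count >= observed
    at_most = dict.fromkeys(observed, 0)   # simulations with count <= observed
    for _ in range(n_rand):
        simulated = cooccurrences(swap_randomize(rows, n_swaps, rng), n_attrs)
        for pair, obs in observed.items():
            at_least[pair] += simulated[pair] >= obs
            at_most[pair] += simulated[pair] <= obs
    labels = {}
    for pair in observed:
        if at_least[pair] / n_rand <= alpha:
            labels[pair] = "link"        # co-occurs more often than chance
        elif at_most[pair] / n_rand <= alpha:
            labels[pair] = "anti-link"   # co-occurs less often than chance
        else:
            labels[pair] = "none"        # compatible with chance
    return labels


# Toy table: attributes 0 and 1 always appear together, as do 2 and 3.
toy = [{0, 1}] * 20 + [{2, 3}] * 20
labels = classify_pairs(toy, n_attrs=4)
```

On this toy table, the pair (0, 1) is expected to be labeled a link and the pair (0, 2) an anti-link, since the former always co-occur and the latter never do. Pairs labeled "link" would form the edges of a validated graph, and pairs labeled "anti-link" the edges of the counter-relation graph.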
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-00429434
Contributor : Martine Cadot
Submitted on : Monday, November 2, 2009 - 7:01:58 PM
Last modification on : Friday, April 2, 2021 - 3:38:25 AM

Citation

Alain Lelu, Martine Cadot. Statistically valid links and anti-links between words and between documents: applying TourneBool randomization test to a Reuters collection. Advances in Knowledge Discovery and Management (AKDM), 2010, 292, pp.307-324. ⟨10.1007/978-3-642-00580-0⟩. ⟨hal-00429434⟩
