# Statistically valid links and anti-links between words and between documents: applying TourneBool randomization test to a Reuters collection.

1 KIWI - Knowledge Information and Web Intelligence
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
3 ABC - Machine Learning and Computational Biology
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects $\times$ attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without requiring any predefined statistical model, while taking the empirical distributions into account. The result is a robust and statistically validated graph. We present a full-scale experiment on one of the publicly accessible Reuters test corpora. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative "counter-relations" between words, i.e. words that "steer clear" of one another. We characterize the counter-relation graph in the same way. Finally, we generate the pair of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories.
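The general idea of a randomization test on a binary table can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "taking the empirical distributions into account" means keeping every row sum and column sum of the table fixed while shuffling, which is done here with 2x2 checkerboard swaps. The function names (`swap_randomize`, `cooccurrence`, `classify_pair`) and the parameters `n_sim`, `n_swaps` and `alpha` are hypothetical and chosen for illustration only.

```python
import random


def swap_randomize(table, n_swaps, rng):
    """Return a shuffled copy of a 0/1 table (list of rows).

    Assumption: shuffling is done by 2x2 checkerboard swaps, each of
    which leaves all row sums and column sums unchanged.
    """
    t = [row[:] for row in table]
    n_rows, n_cols = len(t), len(t[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # Swap only if the 2x2 sub-table is a checkerboard (10/01 or 01/10).
        if (t[r1][c1] == t[r2][c2] and t[r1][c2] == t[r2][c1]
                and t[r1][c1] != t[r1][c2]):
            t[r1][c1], t[r1][c2] = t[r1][c2], t[r1][c1]
            t[r2][c1], t[r2][c2] = t[r2][c2], t[r2][c1]
    return t


def cooccurrence(table, a, b):
    """Count the objects (rows) in which attributes a and b are both 1."""
    return sum(row[a] & row[b] for row in table)


def classify_pair(table, a, b, n_sim=200, n_swaps=500, alpha=0.05, seed=0):
    """Label the pair (a, b) as 'link', 'anti-link' or 'fortuitous'.

    The observed co-occurrence count is compared with the counts found
    in n_sim randomized tables that share the original margins.
    """
    rng = random.Random(seed)
    observed = cooccurrence(table, a, b)
    sims = [cooccurrence(swap_randomize(table, n_swaps, rng), a, b)
            for _ in range(n_sim)]
    p_high = sum(s >= observed for s in sims) / n_sim
    p_low = sum(s <= observed for s in sims) / n_sim
    if p_high <= alpha:
        return "link"        # co-occur more often than chance suggests
    if p_low <= alpha:
        return "anti-link"   # co-occur less often than chance suggests
    return "fortuitous"
```

In this sketch a pair of attributes is reported as a link when its observed co-occurrence is unusually high among the randomized tables, as an anti-link when it is unusually low, and as fortuitous otherwise; the two one-sided thresholds are a simplification made for readability.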
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-00429434
Contributor : Martine Cadot <>
Submitted on : Monday, November 2, 2009 - 7:01:58 PM
Last modification on : Friday, April 2, 2021 - 3:38:25 AM

### Citation

Alain Lelu, Martine Cadot. Statistically valid links and anti-links between words and between documents: applying TourneBool randomization test to a Reuters collection. Advances in Knowledge Discovery and Management (AKDM), 2010, 292, pp.307-324. ⟨10.1007/978-3-642-00580-0⟩. ⟨hal-00429434⟩
