A Distributed Information Divergence Estimation over Data Streams

Emmanuelle Anceaume 1, 2 Yann Busnel 3
1 CIDER
IRISA-D1 - SYSTÈMES LARGE ÉCHELLE
2 CIDRE - Confidentialité, Intégrité, Disponibilité et Répartition
IRISA-D1 - SYSTÈMES LARGE ÉCHELLE, Inria Rennes – Bretagne Atlantique , CentraleSupélec
3 GDD - Gestion de Données Distribuées [Nantes]
LINA - Laboratoire d'Informatique de Nantes Atlantique
Abstract : In this paper, we consider the setting of large-scale distributed systems, in which each node needs to quickly process a huge amount of data received in the form of a stream that may have been tampered with by an adversary. In this situation, a fundamental problem is how to detect and quantify the amount of work performed by the adversary. To address this issue, we propose a novel algorithm, AnKLe, for estimating the Kullback-Leibler divergence of an observed stream compared with the expected one. AnKLe combines sampling techniques and information-theoretic methods. It is very efficient in both space and time complexity, and requires only a single pass over the data stream. We show that AnKLe is an (ε, δ)-approximation algorithm with a space complexity of Õ(1/ε + 1/ε^2) bits in "most" cases, and Õ(1/ε + (n − ε^(−1))/ε^2) otherwise, where n is the number of distinct data items in a stream. Moreover, we propose a distributed version of AnKLe that requires at most O(rl(log n + 1)) bits of communication between the l participating nodes, where r is the number of rounds of the algorithm. Experimental results show that the estimation provided by AnKLe remains accurate even in adversarial settings for which the quality of other methods decreases dramatically.
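For readers unfamiliar with the quantity being estimated, the following minimal sketch computes the Kullback-Leibler divergence exactly from a finite stream. It is not AnKLe: it keeps one counter per distinct item, so it uses space linear in n, which is precisely the cost the paper's sampling-based estimator is designed to avoid. The function name `kl_divergence_from_stream`, the direction of the divergence (observed relative to expected), and the use of base-2 logarithms are assumptions made for illustration.

```python
import math
from collections import Counter


def kl_divergence_from_stream(stream, expected):
    """Exact D(p || q) in bits, where p is the empirical distribution of
    `stream` and q is the `expected` distribution (dict: item -> probability).

    Illustrative baseline only: stores one counter per distinct item.
    """
    counts = Counter(stream)          # single pass over the stream
    m = sum(counts.values())          # stream length
    if m == 0:
        raise ValueError("stream is empty")

    total = 0.0
    for item, c in counts.items():
        q = expected.get(item, 0.0)
        if q <= 0.0:
            # An observed item that the expected distribution rules out
            # makes the divergence infinite.
            return math.inf
        p = c / m
        total += p * math.log2(p / q)
    return total


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}
    # A stream matching the expected distribution has zero divergence.
    print(kl_divergence_from_stream(["a", "b", "a", "b"], uniform))
    # A biased stream yields a strictly positive divergence.
    print(kl_divergence_from_stream(["a", "a", "a", "b"], uniform))
```

A divergence of zero means the observed frequencies match the expected ones exactly; larger values indicate a stream that deviates more from what was expected, which is the signal the abstract proposes to use for quantifying adversarial tampering.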

https://hal.archives-ouvertes.fr/hal-00998708
Contributor : Yann Busnel
Submitted on : Monday, June 2, 2014 - 3:24:57 PM
Last modification on : Friday, November 16, 2018 - 1:40:43 AM
Long-term archiving on : Tuesday, September 2, 2014 - 12:30:22 PM

File

ankle-tpds2013.pdf
Files produced by the author(s)

Citation

Emmanuelle Anceaume, Yann Busnel. A Distributed Information Divergence Estimation over Data Streams. IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers, 2014, 25 (2), pp.478-487. ⟨10.1109/TPDS.2013.101⟩. ⟨hal-00998708⟩
