# Efficiently Summarizing Distributed Data Streams over Sliding Windows

1 GDD - Gestion de Données Distribuées [Nantes]
LINA - Laboratoire d'Informatique de Nantes Atlantique
2 DIONYSOS - Dependability Interoperability and perfOrmance aNalYsiS Of networkS
Inria Rennes – Bretagne Atlantique, IRISA-D2 - RÉSEAUX, TÉLÉCOMMUNICATION ET SERVICES
Abstract : Estimating the frequency of any piece of information in large-scale distributed data streams has become of utmost importance in the last decade (e.g., in the context of network monitoring and big data). Although some elegant solutions have been proposed recently, their approximation is computed from the inception of the stream. In a runtime distributed context, one would prefer to gather information only about the recent past, either to save resources or because recent information is more relevant. In this paper, we consider the sliding window model and propose two different (on-line) algorithms that approximate the frequency of items in the active window. More precisely, we determine an $(\varepsilon,\delta)$-approximation, meaning that the error is greater than $\varepsilon$ with probability at most $\delta$. These solutions use a very small amount of memory with respect to the size $N$ of the window and the number $n$ of distinct items of the stream, namely $O(\frac{1}{\varepsilon} \log \frac{1}{\delta} (\log N + \log n))$ and $O(\frac{1}{\tau\varepsilon} \log \frac{1}{\delta} (\log N + \log n))$ bits of space, where $\tau$ is a parameter limiting memory usage. We also provide their distributed variants, i.e., in the sliding window functional monitoring model, with a communication cost of $O(\frac{k}{\varepsilon^2} \log \frac{1}{\delta} \log N)$ bits per window (where $k$ is the number of nodes). We compared the proposed algorithms to each other and to the state of the art through extensive experiments on synthetic traces and real data sets, which validate the robustness and accuracy of our algorithms.
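To make the notion of an $(\varepsilon,\delta)$-approximation over a window concrete, here is a minimal Python sketch. It is not the report's algorithm: it uses a standard Count-Min sketch and, as a simplifying assumption, a naive baseline that keeps the last $N$ items in a queue so that expired items can be subtracted exactly. That queue costs memory linear in $N$, which is precisely what the algorithms summarized above are designed to avoid. The class name `WindowedCountMin` and its parameters are hypothetical.

```python
import math
import random
from collections import deque


class WindowedCountMin:
    """Count-Min sketch over the last `window` items (naive baseline).

    Assumption: expired items are evicted exactly by storing the window
    in a deque, so memory is O(window), unlike the report's algorithms.
    """

    PRIME = (1 << 61) - 1  # modulus for the 2-universal hash family

    def __init__(self, epsilon, delta, window, seed=0):
        # Standard Count-Min dimensions for an (epsilon, delta) guarantee.
        self.width = math.ceil(math.e / epsilon)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.window = window
        rng = random.Random(seed)
        # One hash function h(x) = ((a*x + b) mod p) mod width per row.
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.depth)
        ]
        self.rows = [[0] * self.width for _ in range(self.depth)]
        self.recent = deque()  # items currently inside the window

    def _cells(self, item):
        x = hash(item) % self.PRIME
        for row, (a, b) in enumerate(self.coeffs):
            yield row, ((a * x + b) % self.PRIME) % self.width

    def _add(self, item, amount):
        for row, col in self._cells(item):
            self.rows[row][col] += amount

    def update(self, item):
        """Insert the newest item and evict the one leaving the window."""
        self.recent.append(item)
        self._add(item, 1)
        if len(self.recent) > self.window:
            self._add(self.recent.popleft(), -1)

    def estimate(self, item):
        """Return an estimate that never undercounts the windowed frequency."""
        return min(self.rows[row][col] for row, col in self._cells(item))
```

With this construction the estimate is never below the true frequency in the window, and it exceeds it by more than $\varepsilon N$ with probability at most $\delta$, which is the standard Count-Min guarantee applied to the $N$ items of the active window.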
Document type : Reports


https://hal.archives-ouvertes.fr/hal-01073877
Contributor : Yann Busnel
Submitted on : Tuesday, June 30, 2015 - 3:27:48 PM
Last modification on : Friday, November 16, 2018 - 1:39:07 AM
Long-term archiving on : Tuesday, April 25, 2017 - 8:23:52 PM

### File

nca15-rr.pdf
Files produced by the author(s)

### Identifiers

• HAL Id : hal-01073877, version 3

### Citation

Nicolò Rivetti, Yann Busnel, Achour Mostefaoui. Efficiently Summarizing Distributed Data Streams over Sliding Windows. [Research Report] LINA-University of Nantes; Centre de Recherche en Économie et Statistique; Inria Rennes Bretagne Atlantique. 2015. ⟨hal-01073877v3⟩
