Random forests and big data

Abstract : Big Data is one of the major challenges of statistical science, with numerous algorithmic and theoretical consequences. Big Data always involves massive data, but it often also includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Random forests, introduced by Breiman in 2001, are based on decision trees combined with aggregation and bootstrap ideas. They are a powerful nonparametric statistical method that handles regression problems as well as two-class and multi-class classification problems in a single, versatile framework. This paper reviews available proposals for random forests in parallel environments and for online random forests. We then formulate various remarks and sketch some alternative directions for random forests in the Big Data context.
Document type :
Conference papers

Cited literature [17 references]

https://hal.archives-ouvertes.fr/hal-01160643
Contributor : Nathalie Vialaneix
Submitted on : Monday, June 8, 2015 - 10:11:30 AM
Last modification on : Tuesday, September 18, 2018 - 4:24:01 PM
Document(s) archived on : Tuesday, September 15, 2015 - 11:57:11 AM

Files

genuer_etal_jds2015.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01160643, version 1

Citation

Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, Nathalie Villa-Vialaneix. Random forests and big data. 47ème Journées de Statistique de la SFdS, Société Française de Statistique, Jun 2015, Lille, France. ⟨hal-01160643⟩
