Collecting and Characterizing Distributed Machine Learning Workloads

Yasmine Djebrouni; Sara Bouchenak; Khalid Benabdeslem

Preprint, Working Paper. Year: 2020

Yasmine Djebrouni

Role: Author

Institut National des Sciences Appliquées de Lyon

École Nationale Supérieure d'Informatique [Alger]

Sara Bouchenak

Role: Author

Institut National des Sciences Appliquées de Lyon

Khalid Benabdeslem

Role: Author

Institut National des Sciences Appliquées de Lyon

Abstract

Machine learning is key to transforming data into actionable knowledge. The rapid growth in the amount of data to analyze has forced a switch to distributed ML platforms. However, the complexity of such platforms can overwhelm uninitiated users, who may not understand the trade-offs and challenges of parameterizing such systems to achieve good performance. To better analyze and understand ML workloads running on distributed ML platforms, we conducted extensive experiments with various ML methods and real-world datasets and collected the execution traces of these distributed ML workloads, which represent a total of 12 GB of traces and tens of millions of data records. We then provide a statistical analysis of the collected traces and illustrate through a use case how different ML workloads are characterized and their needs identified.

Keywords

Distributed Machine Learning; ML Workload Characterization; Trace Collection

Domains

Artificial Intelligence [cs.AI]; Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

Collecting_and_Characterizing_Distributed_Machine_Learning_Workloads (3).pdf (320.7 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-03343275

Submitted on: Wednesday, 15 September 2021, 10:33:00

Last modified on: Thursday, 5 May 2022, 10:18:25

Long-term archiving on: Thursday, 16 December 2021, 18:03:14

Dates and versions

hal-03343275, version 1 (15-09-2021)

Identifiers

HAL Id: hal-03343275, version 1

Cite

Yasmine Djebrouni, Sara Bouchenak, Khalid Benabdeslem. Collecting and Characterizing Distributed Machine Learning Workloads. 2021. ⟨hal-03343275⟩

Collections

INSA-LYON GRID5000 INSA-GROUPE UDL SILECS
