DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Bogdan Nicolae; Justin M Wozniak; Matthieu Dorier; Franck Cappello

Conference paper, Year: 2020

Bogdan Nicolae

Role: Author
PersonId: 21945
IdHAL: bnicolae
ORCID: 0000-0002-0661-7509

Argonne National Laboratory [Lemont]

Justin M Wozniak

Role: Author

Argonne National Laboratory [Lemont]

Matthieu Dorier

Role: Author
PersonId: 972651

Argonne National Laboratory [Lemont]

Franck Cappello

Role: Author
PersonId: 828491

Argonne National Laboratory [Lemont]

Abstract

Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e., to "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in a quest to improve the training throughput, a mix of data-parallel, model-parallel, pipeline-parallel, and layer-wise parallel approaches is making the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders-of-magnitude improvement for several classes of DNN models.
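The idea of sharding tensors across data-parallel replicas and reconstructing them asynchronously can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`shard_indices`, `send_shard`, `clone_model`), the round-robin assignment, and the use of threads as a stand-in for network transfers are all assumptions made for illustration, and tensors are modeled as plain lists of floats. The sketch relies only on the property that data-parallel replicas hold identical model state, so each one needs to contribute only a fraction of the tensors.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_indices(num_tensors, num_replicas):
    """Assign tensor i to replica i % num_replicas (round-robin is an assumption)."""
    return [list(range(r, num_tensors, num_replicas)) for r in range(num_replicas)]


def send_shard(replica_state, indices):
    """Hypothetical stand-in for a network transfer: copy only the owned tensors."""
    return {i: list(replica_state[i]) for i in indices}


def clone_model(replicas):
    """Rebuild a full copy of the model state from per-replica shards.

    `replicas` is a list of identical model states, one per data-parallel
    worker; each state is a list of tensors (here, lists of floats).
    """
    num_tensors = len(replicas[0])
    plan = shard_indices(num_tensors, len(replicas))
    clone = [None] * num_tensors
    # Shard transfers are submitted concurrently, then reassembled by index.
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(send_shard, state, idx)
                   for state, idx in zip(replicas, plan)]
        for fut in futures:
            for i, tensor in fut.result().items():
                clone[i] = tensor
    return clone


# Toy model with five tensors, replicated on three in-sync workers.
model = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6], [0.7], [0.8, 0.9]]
replicas = [[list(t) for t in model] for _ in range(3)]
clone = clone_model(replicas)
```

In this toy setup each replica transfers at most two of the five tensors, which is the intuition behind spreading the cloning cost over all workers instead of copying the whole model from a single one.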

Keywords

deep learning; data-parallel training; layer-wise parallelism; state cloning and replication; large-scale AI

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]; Artificial Intelligence [cs.AI]

Main file

paper.pdf (697.58 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-02914545

Submitted on: Tuesday, August 11, 2020, 22:51:20

Last modified on: Friday, August 14, 2020, 14:21:06

Long-term archiving on: Monday, November 30, 2020, 18:33:51

Dates and versions

hal-02914545, version 1 (11-08-2020)

Identifiers

HAL Id: hal-02914545, version 1

Cite

Bogdan Nicolae, Justin M Wozniak, Matthieu Dorier, Franck Cappello. DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training. CLUSTER'20: The 2020 IEEE International Conference on Cluster Computing, Sep 2020, Kobe, Japan. ⟨hal-02914545⟩

114 views

320 downloads
