DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training

Abstract: Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e. to "fork" the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in the quest to improve training throughput, the mix of data-parallel, model-parallel, pipeline-parallel, and layer-wise parallel approaches makes the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders-of-magnitude improvement for several classes of DNN models.
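To make the cloning idea concrete, here is a minimal sketch of the general technique the abstract describes: the model's tensors are split into shards, copied in background threads (standing in for asynchronous device-to-host transfers), and then reassembled into a clone of the full state. All names (`shard_tensors`, `clone_state`) and the round-robin sharding scheme are illustrative assumptions, not the paper's actual API or algorithm.

```python
import threading

def shard_tensors(tensors, num_workers):
    """Round-robin assignment of tensors to workers (one simple sharding choice)."""
    shards = [[] for _ in range(num_workers)]
    for i, t in enumerate(tensors):
        shards[i % num_workers].append((i, t))
    return shards

def clone_state(tensors, num_workers=4):
    """Snapshot tensors shard-by-shard in background threads, then
    reconstruct the full state in the original order."""
    shards = shard_tensors(tensors, num_workers)
    results = {}
    lock = threading.Lock()

    def copy_shard(shard):
        for idx, t in shard:
            copied = list(t)  # stand-in for a real tensor copy off the device
            with lock:
                results[idx] = copied

    threads = [threading.Thread(target=copy_shard, args=(s,)) for s in shards]
    for th in threads:
        th.start()
    for th in threads:
        th.join()  # the clone becomes "ready" once every shard has landed
    return [results[i] for i in range(len(tensors))]

# Toy "model state": a few small parameter tensors represented as lists.
model_state = [[float(i + j) for j in range(3)] for i in range(5)]
clone = clone_state(model_state)
```

In the actual system, the copies would overlap with ongoing training so the source instance pauses only briefly; this toy version joins all threads before returning, which is where the "standby duration" and "readiness duration" trade-offs mentioned in the abstract would come into play.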

Cited literature: 39 references
Contributor: Bogdan Nicolae
Submitted on: Tuesday, August 11, 2020 - 10:51:20 PM
Last modification on: Friday, August 14, 2020 - 2:21:06 PM
Long-term archiving on: Monday, November 30, 2020 - 6:33:51 PM




  • HAL Id: hal-02914545, version 1


Bogdan Nicolae, Justin M Wozniak, Matthieu Dorier, Franck Cappello. DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training. CLUSTER'20: The 2020 IEEE International Conference on Cluster Computing, Sep 2020, Kobe, Japan. ⟨hal-02914545⟩


