Journal article, International Journal of Parallel Programming, Year: 2022

SMSG: Profiling-Free Parallelism Modeling for Distributed Training of DNN

Abstract

The increasing size of deep neural networks (DNNs) creates a high demand for distributed training. An expert can find good hybrid parallelism strategies, but designing suitable strategies is time-consuming and labor-intensive. Automating parallelism strategy generation is therefore crucial and desirable for DNN designers. Several automatic search approaches have recently been studied to free experts from the heavy work of designing parallel strategies. However, these approaches all rely on a numerical cost model, which requires extensive profiling results that lack portability. Because the profiled values cannot be reused, these profiling-based approaches do not actually lighten the strategy generation work. Our intuition is that there is no need to estimate the actual execution time of distributed training; it is enough to compare the relative costs of different strategies. We propose SMSG (Symbolic Modeling for Strategy Generation), which analyses cost based on communication and computation semantics. With SMSG, the parallel cost analyses are decoupled from hardware characteristics. SMSG defines cost functions for each kind of operator to quantitatively evaluate the amount of data involved in computation and communication, which eliminates the heavy profiling tasks. In addition, SMSG shows how to apply functional transformations based on the Third Homomorphism Theorem to control the high search complexity. Our experiments show that SMSG finds good hybrid parallelism strategies whose training performance is similar to the state of the art. Moreover, SMSG covers a wide variety of DNN models with good scalability. SMSG also remains portable when training configurations change, which a profiling-based approach cannot offer.
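The idea of comparing relative costs instead of predicting execution time can be illustrated with a minimal sketch. It is not taken from the paper: the `DenseLayer` type, the two strategy names, and both cost functions are hypothetical simplifications, and they only count the elements that one dense layer Y = XW would have to all-reduce under each split. Because both strategies use the same collective, the comparison needs no hardware measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseLayer:
    """Shape of a dense layer Y = X @ W (hypothetical helper type)."""
    batch: int    # rows of the input activation X
    in_dim: int   # columns of X, rows of the weight W
    out_dim: int  # columns of W


def data_parallel_comm(layer: DenseLayer) -> int:
    # Assumption: the batch is split and every device keeps a full copy of W,
    # so the weight gradient (in_dim x out_dim) is all-reduced.
    return layer.in_dim * layer.out_dim


def model_parallel_comm(layer: DenseLayer) -> int:
    # Assumption: W is split along in_dim (the reduction dimension), so each
    # device holds a partial sum of Y (batch x out_dim) that is all-reduced.
    return layer.batch * layer.out_dim


def cheaper_strategy(layer: DenseLayer) -> str:
    # Only the relative order of the two symbolic costs matters here;
    # on a tie, the first entry (data_parallel) is returned.
    costs = {
        "data_parallel": data_parallel_comm(layer),
        "model_parallel": model_parallel_comm(layer),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    wide = DenseLayer(batch=32, in_dim=4096, out_dim=4096)
    narrow = DenseLayer(batch=8192, in_dim=256, out_dim=256)
    print(cheaper_strategy(wide))
    print(cheaper_strategy(narrow))
```

In this simplified count the decision reduces to comparing `batch` with `in_dim`, which is the kind of hardware-independent comparison the abstract describes; the paper's actual cost functions cover many operator kinds and are not reproduced here.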

Dates and versions

hal-04551101 , version 1 (18-04-2024)

Identifiers

Cite

Haoran Wang, Thibaut Tachon, Chong Li, Sophie Robert, Sébastien Limet. SMSG: Profiling-Free Parallelism Modeling for Distributed Training of DNN. International Journal of Parallel Programming, 2022, 51 (2-3), pp.109-127. ⟨10.1007/S10766-022-00741-6⟩. ⟨hal-04551101⟩