Conference papers

Online Scheduling with Redirection for Parallel Jobs

Abstract: An important component of High Performance Computing (HPC) clusters is the job scheduling algorithm, which decides how jobs are allocated and scheduled in the system. Such scheduling algorithms need to scale with the growth in both size and complexity of modern clusters. In this paper, we propose a new algorithm for scheduling parallel jobs with redirection. Specifically, our algorithm redirects jobs whose execution significantly affects a large number of other jobs. A redirected job is stopped and restarted from the beginning in a dedicated part of the cluster. We show the effectiveness of our method through an extensive experimental campaign of simulations based on production cluster log traces.
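The redirection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how a job's effect on other jobs is measured, so the blocking rule in `should_redirect`, the `min_blocked` threshold, and the `Job` and `Cluster` types are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    cores: int
    restarts: int = 0  # how many times the job was stopped and restarted


@dataclass
class Cluster:
    main_running: list = field(default_factory=list)     # jobs running in the main part
    dedicated_queue: list = field(default_factory=list)  # jobs redirected to the dedicated part


def should_redirect(running, waiting, free_cores, min_blocked):
    """Assumed rule (not from the paper): redirect a running job when at
    least `min_blocked` waiting jobs cannot start now but could start if
    the running job's cores were released."""
    blocked = [
        w for w in waiting
        if free_cores < w.cores <= free_cores + running.cores
    ]
    return len(blocked) >= min_blocked


def redirect(job, cluster):
    """Stop the job in the main part and queue it for a restart from the
    beginning in the dedicated part, as the abstract describes."""
    cluster.main_running.remove(job)
    job.restarts += 1  # all progress is lost: the job restarts from scratch
    cluster.dedicated_queue.append(job)


big = Job(job_id=1, cores=8)
cluster = Cluster(main_running=[big])
waiting = [Job(2, cores=4), Job(3, cores=6), Job(4, cores=16)]

if should_redirect(big, waiting, free_cores=2, min_blocked=2):
    redirect(big, cluster)
```

In this toy instance, jobs 2 and 3 are blocked only by job 1, so the assumed rule fires and job 1 moves to the dedicated queue. A real scheduler would also have to weigh the work lost by restarting the job, which this sketch ignores.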

Cited literature [7 references]

https://hal.archives-ouvertes.fr/hal-02944032
Contributor: Adrien Faure
Submitted on: Monday, September 21, 2020 - 10:27:48 AM
Last modification on: Wednesday, March 24, 2021 - 3:32:05 AM
Long-term archiving on: Thursday, December 3, 2020 - 2:37:52 PM

File

HIPS2020.pdf
Files produced by the author(s)


Citation

Adrien Faure, Giorgio Lucarelli, Olivier Richard, Denis Trystram. Online Scheduling with Redirection for Parallel Jobs. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2020, New Orleans, United States. pp.1-4, ⟨10.1109/IPDPSW50202.2020.00066⟩. ⟨hal-02944032⟩

