Optimization of checkpointing and execution model for an implementation of OpenMP on distributed memory architectures

Abstract : OpenMP and MPI have become the standard tools to develop parallel programs on shared-memory and distributed-memory architectures respectively. As compared to MPI, OpenMP is easier to use. This is due to the ability of OpenMP to automatically execute code in parallel and synchronize results using its directives, clauses, and runtime functions while MPI requires programmers do all this manually. Therefore, some efforts have been made to port OpenMP on distributed-memory architectures. However, excluding CAPE, no solution has successfully met both requirements: 1) to be fully compliant with the OpenMP standard and 2) high performance. CAPE stands for Checkpointing-Aided Parallel Execution. It is a framework that automatically translates and provides runtime functions to execute OpenMP program on distributed-memory architectures based on checkpointing techniques. In order to execute an OpenMP program on distributed-memory system, CAPE uses a set of templates to translate OpenMP source code to CAPE source code, and then, the CAPE source code is compiled by a C/C++ compiler. This code can be executed on distributed-memory systems under the support of the CAPE framework. Basically, the idea of CAPE is the following: the program first run on a set of nodes on the system, each node being executed as a process. Whenever the program meets a parallel section, the master distributes the jobs to the slave processes by using a Discontinuous Incremental Checkpoint (DICKPT). After sending the checkpoints, the master waits for the returned results from the slaves. The next step on the master is the reception and merging of the resulting checkpoints before injecting them into the memory. For slave nodes, they receive different checkpoints, and then, they inject it into their memory to compute the divided job. The result is sent back to the master using DICKPTs. At the end of the parallel region, the master sends the result of the checkpoint to every slaves to synchronize the memory space of the program as a whole. In some experiments, CAPE has shown very high-performance on distributed-memory systems and is a viable and fully compatible with OpenMP solution. However, CAPE is in the development stage. Its checkpoint mechanism and execution model need to be optimized in order to improve the performance, ability, and reliability. This thesis aims at presenting the approaches that were proposed to optimize and improve checkpoints, design and implement a new execution model, and improve the ability for CAPE. First, we proposed arithmetics on checkpoints, which aims at modeling checkpoint’s data structure and its operations. This modeling contributes to optimize checkpoint size and reduces the time when merging, as well as improve checkpoints capability. Second, we developed TICKPT which stands for Time-stamp Incremental Checkpointing as an instance of arithmetics on checkpoints. TICKPT is an improvement of DICKPT. It adds a timestamp to checkpoints to identify the checkpoints order. The analysis and experiments to compare it to DICKPT show that TICKPT do not only provide smaller in checkpoint size, but also has less impact on the performance of the program using checkpointing. Third, we designed and implemented a new execution model and new prototypes for CAPE based on TICKPT. The new execution model allows CAPE to use resources efficiently, avoid the risk of bottlenecks, overcome the requirement of matching the Bernstein’s conditions. As a result, these approaches make CAPE improving the performance, ability as well as reliability. Four, Open Data-sharing attributes are implemented on CAPE based on arithmetics on checkpoints and TICKPT. This also demonstrates the right direction that we took, and makes CAPE more complete
Liste complète des métadonnées

https://tel.archives-ouvertes.fr/tel-02014711
Contributor : Abes Star <>
Submitted on : Monday, February 11, 2019 - 4:55:06 PM
Last modification on : Wednesday, February 13, 2019 - 1:17:23 AM

File

longtran_thesis.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02014711, version 1

Citation

Van Long Tran. Optimization of checkpointing and execution model for an implementation of OpenMP on distributed memory architectures. Distributed, Parallel, and Cluster Computing [cs.DC]. Institut National des Télécommunications, 2018. English. 〈NNT : 2018TELE0017〉. 〈tel-02014711〉

Share

Metrics

Record views

144

Files downloads

23