CPU overheating characterization in HPC systems: a case study

Abstract : As supercomputers grow in size, the number of abnormal events also increases. Some of these events may lead to an application failure; others simply reduce system efficiency. CPU overheating is one such event: when a CPU overheats, it reduces its frequency, which decreases system efficiency. This paper studies the problem of CPU overheating in supercomputers. In the first part, we analyze data collected over one year on a supercomputer from the TOP500 list to understand the conditions under which CPU overheating occurs. Our analysis shows that overheating events are caused by some specific applications. In the second part, we evaluate the impact of such overheating events on the performance of MPI applications. Using six representative HPC benchmarks, we show that, for a majority of the applications, a frequency drop on one CPU affects the execution time of distributed runs in proportion to the duration and the extent of the frequency drop.
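
The proportionality stated at the end of the abstract can be illustrated with a minimal sketch. This is not the model or the measurement method used in the paper: it is a toy estimate that assumes a tightly synchronized, fully compute-bound application whose overall progress is limited by the slowest CPU. The function name `estimated_runtime` and all of its parameters are hypothetical.

```python
def estimated_runtime(nominal_runtime_s, drop_duration_s,
                      nominal_freq_ghz, reduced_freq_ghz):
    """Toy estimate of the run time when one CPU is throttled for a while.

    Assumptions (illustrative only, not taken from the paper):
    - every rank waits for the slowest one, so the whole run progresses
      at the speed of the throttled CPU while the drop lasts;
    - progress scales linearly with CPU frequency.
    """
    if not 0 < reduced_freq_ghz <= nominal_freq_ghz:
        raise ValueError("reduced frequency must be in (0, nominal]")
    if nominal_runtime_s < 0 or drop_duration_s < 0:
        raise ValueError("durations must be non-negative")

    # Extent of the drop: fraction of nominal speed that is lost.
    extent = 1.0 - reduced_freq_ghz / nominal_freq_ghz

    # Work lost during the drop has to be made up afterwards,
    # which adds (duration * extent) seconds to the run.
    slowed = nominal_runtime_s + drop_duration_s * extent

    # Upper bound: the drop covers the entire run.
    fully_throttled = nominal_runtime_s * nominal_freq_ghz / reduced_freq_ghz

    return min(slowed, fully_throttled)


if __name__ == "__main__":
    # 100 s run, one CPU at half frequency for 20 s.
    print(estimated_runtime(100.0, 20.0, 2.0, 1.0))
```

Under these assumptions the added time is the product of the drop duration and the relative frequency loss. Applications that are memory-bound or loosely synchronized would be expected to deviate from this estimate.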

https://hal.archives-ouvertes.fr/hal-01949708
Contributor : Marc Platini
Submitted on : Monday, December 10, 2018 - 12:49:15 PM
Last modification on : Wednesday, February 13, 2019 - 1:56:11 PM
Long-term archiving on : Monday, March 11, 2019 - 2:04:24 PM

Identifiers

  • HAL Id : hal-01949708, version 1

Citation

Marc Platini, Thomas Ropars, Benoit Pelletier, Noël de Palma. CPU overheating characterization in HPC systems: a case study. Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop, Nov 2018, Dallas, United States. ⟨hal-01949708⟩
