Monitoring high performance networks in large-scale clusters

Abstract : The number of large-scale clusters is rising. They are included into Grids or become key components of large structures. As more users and projects rely on HPC clusters, high availability and security are requirements for a fast growing adoption and use. In this paper, we focus on high performance networks. All HPC clusters are built on top of them. We demonstrate that classical instrumentation is inefficient in HPC environments: it does not scale or causes a significant loss of performance. Based on this fact, we highlight cluster properties: nodes have assigned roles and are coupled at various levels. Moreover, we study the main characteristics of resource usage for each type of node and propose an instrumentation that can be effectively deployed. It results in fine-grained mechanisms adapted to system architecture and performance constraints. Relevant information is collected over time. Two properties are verified online and dynamically: coherency and containment. Each induces a type of verification and both aim at reducing recovery time from failure and security risk of a whole cluster. We illustrate our methodology on the QsNet network and provide a way to increase the safety of high performance networks and clusters.
Document type :
Conference papers
Complete list of metadata

https://hal.archives-ouvertes.fr/hal-00461160
Contributor : Christian Toinard
Submitted on : Wednesday, March 3, 2010 - 4:18:09 PM
Last modification on : Thursday, January 17, 2019 - 3:06:04 PM

Citation

Fabrice Gadaud. Monitoring high performance networks in large-scale clusters. CCGRIDW'06, May 2006, Singapore. pp.32, ⟨10.1109/CCGRID.2006.155⟩. ⟨hal-00461160⟩
