Measurement and interpretation of micro-benchmark and application energy use on the cray xc30, Energy Efficient Supercomputing Workshop, pp.51-59, 2014. ,
Unprotected computing: a large-scale study of dram raw error rate on a supercomputer, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.645-655, 2016. ,
, Update. Supercomputing Frontiers and Innovations, vol.1, p.24, 2014.
Reading between the lines of failure logs: Understanding how hpc systems fail, 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp.1-12, 2013. ,
Fault prediction under the microscope: A closer look into HPC systems, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p.77, 2012. ,
Effects of dynamic voltage and frequency scaling on a k20 gpu, 42nd International Conference on Parallel Processing (ICPP), pp.826-833, 2013. ,
Failures in large scale systems: long-term measurement, analysis, and implications, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, p.44, 2017. ,
Investigating power efficiency and co-location effects on heterogeneous hpc architectures, 2013. ,
A run-time system for powerconstrained hpc applications, International conference on high performance computing, pp.394-408, 2015. ,
Characterizing temperature, power, and soft-error behaviors in data center systems: Insights, challenges, and opportunities, IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS, pp.22-31, 2017. ,
Failure trends in a large disk drive population, FAST, vol.7, pp.17-23, 2007. ,
Beyond dvfs: A first look at performance under a hardware-enforced power bound, Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp.947-953, 2012. ,
Adagio: making dvs practical for complex hpc applications, Proceedings of the 23rd international conference on Supercomputing, pp.460-469, 2009. ,
Disk failures in the real world: What does an mttf of 1, 000, 000 hours mean to you? In FAST, vol.7, pp.1-16, 2007. ,
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, p.12022, 2007. ,
Identification and categorisation of applications and initial benchmarks suite, 2008. ,
Addressing failures in exascale computing, The International Journal of High Performance Computing Applications, vol.28, pp.129-173, 2014. ,
Memory errors in modern systems: The good, the bad, and the ugly, In ACM SIGPLAN Notices, vol.50, pp.297-310, 2015. ,
What can we learn from four years of data center hardware failures, 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN, pp.25-36, 2017. ,
The k computer operations: experiences and statistics, Procedia Computer Science, vol.29, pp.576-585, 2014. ,