R. Intel, , 2017.

J. Dongarra, K. London, S. Moore, P. Mucci, D. Terpstra et al., Experiences and lessons learned with a portable interface to hardware performance counters, Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS'03, pp.289-291, 2003.

OProfile: a system profiler for Linux

S. Eranian, Perfmon2: a flexible performance monitoring interface for Linux, Proceedings of the 2006 Ottawa Linux Symposium, pp.269-288, 2006.

R. Lachaize, B. Lepers, and V. Quéma, MemProf: A memory profiler for NUMA multicore systems, Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00945731

A. Jaleel, R. S. Cohn, C. Luk, and B. Jacob, CMP$im: A Pin-based on-the-fly multi-core cache simulator, Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), pp.28-36, 2008.

S. M. Günther and J. Weidendorfer, Assessing cache false sharing effects by dynamic binary instrumentation, Proceedings of the Workshop on Binary Instrumentation and Applications, pp.26-33, 2009.

Q. Zhao, D. Koh, S. Raza, D. Bruening, W. Wong et al., Dynamic cache contention detection in multi-threaded applications, Proceedings of the international conference on Virtual Execution Environments, pp.27-38, 2011.

M. Hobbel, T. Rauber, and C. Scholtes, Trace-based automatic padding for locality improvement with correlative data visualization interface, Proceedings of the International Conference on Parallel Architectures and Compilation, PACT'07, 2007.

T. Liu, C. Tian, Z. Hu, and E. D. Berger, PREDATOR: Predictive false sharing detection, Proceedings of the symposium on Principles and Practices of Parallel Programming, PPoPP'14, pp.3-14, 2014.

T. Liu and X. Liu, Cheetah: detecting false sharing efficiently and effectively, Proceedings of the international symposium on Code Generation and Optimization, CGO'16, pp.1-11, 2016.

N. R. Tallent, J. M. Mellor-Crummey, and A. Porterfield, Analyzing lock contention in multithreaded applications, Proceedings of the symposium on Principles and Practices of Parallel Programming, PPoPP'10, pp.269-280, 2010.

X. Yu, S. Han, D. Zhang, and T. Xie, Comprehending performance from real-world execution traces: A device-driver case, Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'14, pp.193-206, 2014.

F. David, G. Thomas, J. Lawall, and G. Muller, Continuously measuring critical section pressure with the free-lunch profiler, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'14, pp.291-307, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01080277

E. Altman, M. Arnold, S. Fink, and N. Mitchell, Performance analysis of idle programs, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'10, pp.739-753, 2010.

A. Bhatele, K. Mohror, S. H. Langer, and K. E. Isaacs, There goes the neighborhood: performance degradation due to nearby jobs, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.1-12, 2013.

M. Casas and G. Bronevetsky, Active measurement of the impact of network switch utilization on application performance, Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS'14, pp.165-174, 2014.

L. Song and S. Lu, Statistical debugging for real-world performance problems, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'14, pp.561-578, 2014.

X. Zhao, K. Rodrigues, Y. Luo, D. Yuan, and M. Stumm, Nonintrusive performance profiling for entire software stacks based on the flow reconstruction principle, Proceedings of the conference on Operating Systems Design and Implementation, OSDI'16, pp.603-618, 2016.

J. Huang, B. Mozafari, and T. F. Wenisch, Statistical analysis of latency through semantic profiling, Proceedings of the EuroSys European Conference on Computer Systems, EuroSys'17, pp.64-79, 2017.

N. Joukov, A. Traeger, R. Iyer, C. P. Wright, and E. Zadok, Operating system profiling via latency analysis, Proceedings of the conference on Operating Systems Design and Implementation, OSDI'06, pp.89-102, 2006.

C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko, Scalability analysis of SPMD codes using expectations, Proceedings of the International conference on Supercomputing, ICS'07, pp.13-22, 2007.

F. Trahay, Y. Ishikawa, F. Rue, R. Namyst, M. Faverge et al., EZTrace: a generic framework for performance analysis, Proceedings of the International Symposium on Cluster, Cloud and Grid Computing, CCGRID'11, pp.618-619, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00587216

C. Aulagnon, D. Martin-Guillerez, F. Rue, and F. Trahay, Runtime function instrumentation with EZTrace, Proceedings of PROPER 2012 - Workshop on Productivity and Performance, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00863037

F. Trahay, E. Brunet, M. M. Bouksiaa, and L. Jianwei, Selecting points of interest in traces using patterns of events, Proceedings of the International Conference on Parallel, Distributed, and Network-Based Processing, PDP'15, pp.70-77, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01257904

J. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications, Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, pp.65-76, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00779908

M. Scott and W. Bolosky, False sharing and its effect on shared memory performance, Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS), p.57, 1993.

T. Liu and E. D. Berger, SHERIFF: Precise detection and automatic mitigation of false sharing, Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'11, pp.3-18, 2011.

J. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller, Fast and portable locking for multicore architectures, ACM Transactions on Computer Systems (TOCS), vol.33, issue.4, p.62, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01252167

G. Southern and J. Renau, Analysis of PARSEC workload scalability, Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS'16, pp.133-142, 2016.

M. Roth, M. J. Best, C. Mustard, and A. Fedorova, Deconstructing the overhead in parallel applications, Proceedings of the International Symposium on Workload Characterization, IISWC'12, pp.59-68, 2012.

C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, Evaluating MapReduce for multi-core and multiprocessor systems, Proceedings of the symposium on High Performance Computer Architecture, HPCA'07, pp.13-24, 2007.

S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 programs: Characterization and methodological considerations, Proceedings of the International Symposium on Computer Architecture, ISCA'95, pp.24-36, 1995.

C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC benchmark suite: Characterization and architectural implications, Proceedings of the International Conference on Parallel Architectures and Compilation, PACT'08, pp.72-81, 2008.

D. H. Bailey, NAS parallel benchmarks, Encyclopedia of Parallel Computing, pp.1254-1259, 2011.

M. A. Frumkin and L. V. Shabanov, Benchmarking memory performance with the data cube operator, NASA, Tech. Rep, 2004.

B. Fitzpatrick, Distributed caching with memcached, Linux Journal, vol.2004, issue.124, p.5, 2004.

M. Zhuang and B. Aker, memaslap: Load testing and benchmarking a server

S. Ghemawat and J. Dean, LevelDB, 2011.

C. Curtsinger and E. D. Berger, Coz: Finding code that counts with causal profiling, Proceedings of the Symposium on Operating Systems Principles, SOSP'15, pp.184-197, 2015.

Facebook, RocksDB

T. Ohmann, K. Thai, I. Beschastnikh, and Y. Brun, Mining precise performance-aware behavioral models from existing instrumentation, Proceedings of the International Conference on Software Engineering, ICSE'14, pp.484-487, 2014.

B. Teabe, A. Tchana, and D. Hagimont, Application-specific quantum for multi-core platform scheduler, Proceedings of the EuroSys European Conference on Computer Systems, EuroSys'16, vol.3, pp.1-3, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01782587

J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa, Contention aware execution: Online contention detection and response, Proceedings of the international symposium on Code Generation and Optimization, CGO'10, pp.257-265, 2010.

I. Chung, G. Cong, D. Klepacki, S. Sbaraglia, S. Seelam et al., A framework for automated performance bottleneck detection, Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS'08, pp.1-7, 2008.

C. Xu, X. Chen, R. Dick, and Z. M. Mao, Cache contention and application performance prediction for multi-core systems, Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS'10, pp.76-86, 2010.

M. Liu and T. Li, Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads, Proceedings of the International Symposium on Computer Architecture, ISCA'14, pp.325-336, 2014.

S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. Silva et al., Detection of false sharing using machine learning, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp.1-9, 2013.

J. Tao and W. Karl, CacheIn: a toolset for comprehensive cache inspection, Proceedings of the International Conference on Computational Science, ICCS'05, pp.174-181, 2005.

M. Nanavati, M. Spear, N. Taylor, S. Rajagopalan, D. T. Meyer et al., Whose cache line is it anyway?: Operating system support for live detection and repair of false sharing, Proceedings of the EuroSys European Conference on Computer Systems, EuroSys'13, pp.141-154, 2013.

X. Liu and B. Wu, ScaAnalyzer: A tool to identify memory scalability bottlenecks in parallel programs, Proceedings of the Conference for High Performance Computing, Networking, Storage and Analysis, SC'15, p.47, 2015.

A. Pesterev, N. Zeldovich, and R. T. Morris, Locating cache performance bottlenecks using data profiling, Proceedings of the EuroSys European Conference on Computer Systems, EuroSys'10, pp.335-348, 2010.

M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize et al., Traffic management: A holistic approach to memory placement on NUMA systems, Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'13, pp.381-394, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00945758

F. Trahay, M. Selva, L. Morel, and K. Marquet, NumaMMA: NUMA MeMory Analyzer, Proceedings of the International Conference on Parallel Processing, ICPP'18, 2018.
URL : https://hal.archives-ouvertes.fr/cea-01854072

K. Du Bois, S. Eyerman, J. B. Sartor, and L. Eeckhout, Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior, Proceedings of the International Symposium on Computer Architecture, ISCA'13, pp.511-522, 2013.

S. Eyerman, K. Du Bois, and L. Eeckhout, Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications, Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS'12, pp.145-155, 2012.