P. Siegl, R. Buchty, and M. Berekovic, Data-centric computing frontiers: A survey on processing-in-memory, Int'l Symposium on Memory Systems (MEMSYS), pp.295-308, 2016.

R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy et al., Near-data processing: Insights from a MICRO-46 workshop, IEEE Micro, vol.34, issue.4, pp.36-42, 2014.

M. Horowitz, Computing's energy problem (and what we can do about it)," in Int'l Solid-State Circuits Conference (ISSCC), pp.10-14, 2014.

S. Borkar, Exascale computing -a fact or a fiction? (keynote), Int'l Parallel & Distributed Processing Symposium (IPDPS), 2013.

S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, GPUs and the future of parallel computing, IEEE Micro, vol.31, issue.5, pp.7-17, 2011.

J. D. Mccalpin, STREAM: Sustainable memory bandwidth in high performance computers, 2016.

P. R. Kinget, Scaling analog circuits into deep nanoscale CMOS: Obstacles and ways to overcome them, IEEE Custom Integrated Circuits Conference (CICC), 2015.

B. Feinberg, U. K. Venalam, N. Whitehair, S. Wang, and E. Ipek, Enabling scientific computing on memristive accelerators, Int'l Symposium on Computer Architecture (ISCA), pp.367-382, 2018.

E. Azarkhish, D. Rossi, I. Loi, and L. Benini, Design and evaluation of a processing-in-memory architecture for the smart memory cube, Int'l Conference on Architecture of Computing Systems (ARCS), pp.19-31, 2016.

J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, A scalable processing-in-memory accelerator for parallel graph processing, Int'l Symposium on Computer Architecture (ISCA), pp.105-117, 2015.

M. Gao and C. Kozyrakis, HRL: Efficient and flexible reconfigurable logic for near-data processing, Int'l Symposium on High Performance Computer Architecture (HPCA), pp.126-137, 2016.

R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto et al., Active memory cube: A processing-in-memory architecture for exascale systems, IBM Journal of Research and Development, vol.59, issue.2/3, 2015.

M. B. Taylor, Is dark silicon useful?" in Design Automation Conference (DAC), pp.1131-1136, 2012.

B. Gervasi and J. Hinkle, Overcoming system memory challenges with persistent memory and NVDIMM-P, JEDEC Server Forum, 2017.

D. Proposed and . Protocol, JEDEC Solid State Technology Association, committee JC-45, vol.6, p.2261, 2017.

, Core Specification, 2018.

, An Introduction to CCIX, 2018.

, ) BLAS (basic linear algebra subprograms), 2017.

E. Vermij, C. Hagleitner, L. Fiorin, R. Jongerius, J. Van-lunteren et al., An architecture for near-data processing systems, ACM Int'l Conf. on Computing Frontiers (CF), pp.357-360, 2016.

A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia et al., LazyPIM: An efficient cache coherence mechanism for processing-in-memory, IEEE Computer Architecture Letters, vol.16, issue.1, pp.46-50, 2017.

A. Farmahini-farahani, J. H. Ahn, K. Morrow, and N. S. Kim, NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules, Int'l Symposium on High Performance Computer Architecture (HPCA), pp.283-295, 2015.

J. Cong, Z. Fang, F. Javadi, and G. Reinman, AIM: Accelerating computational genomics through scalable and noninvasive accelerator-interposed memory, Int'l Symposium on Memory Systems (MEMSYS), pp.3-14, 2017.

M. Alian, S. W. Min, H. Asgharimoghaddam, A. Dhar, D. K. Wang et al., Application-transparent nearmemory processing architecture with memory channel network, IEEE/ACM Int'l Symposium on Microarchitecture (MICRO), pp.803-815, 2018.

H. Asghari-moghaddam, Y. H. Son, J. H. Ahn, and N. S. Kim, Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems, IEEE/ACM Int'l Symposium on Microarchitecture (MICRO), 2016.

M. Drumond, A. Daglis, N. Mirzadeh, and D. Ustiugov, The Mondrian data engine, Int'l Symposium on Computer Architecture (ISCA), pp.639-651, 2017.

T. Dysart, P. Kogge, M. Deneroff, E. Bovell, P. Briggs et al., Highly scalable near memory processing with migrating threads on the Emu system architecture, Workshop on Irregular Applications: Architecture and Algorithms (IA3), pp.2-9, 2016.

S. L. Xi, O. Babarinsa, M. Athanassoulis, and S. Idreos, Beyond the wall: Near-data processing for databases, Int'l Workshop on Data Management on New Hardware (DaMoN), 2015.

M. Gokhale, S. Lloyd, and C. Hajas, Near memory data structure rearrangement, Int'l Symposium on Memory Systems (MEMSYS)

B. Akin, F. Franchetti, and J. C. Hoe, Data reorganization in memory using 3D-stacked DRAM, Int'l Symposium on Computer Architecture (ISCA), pp.131-143, 2015.

M. Scrbak, M. Islam, K. M. Kavi, M. Ignatowski, and N. Jayasena, Exploring the processing-in-memory design space, Elsevier Journal of Systems Architecture, vol.75, pp.59-67, 2017.

S. Lee, B. Jeon, K. Kang, D. Ka, N. Kim et al., A 512GB 1.1v managed DRAM solution with 16GB ODP and media controller, Int'l Solid-State Circuits Conference (ISSCC), pp.384-385, 2019.

, JEDEC Solid State Technology Association, pp.82-112, 2014.

, JEDEC Solid State Technology Association, pp.82-102, 2009.

, ARM Architecture Reference Manual -ARMv8, 2018.

, ARM Architecture Reference Manual Supplement, The Scalable Vector Extension (SVE), for ARMv8-A, ARM DDI 0584a.d (ID122117) ed., ARM Ltd, 2017.

A. Nobile and G. Von-boehn, World Intellectual Property Organization (WIPO), Int'l Publication Number WO, 2018.

M. Gao, G. Ayers, and C. Kozyrakis, Practical near-data processing for in-memory analytics frameworks, Int'l Conference on Parallel Architecture and Compilation (PACT), pp.113-124, 2015.

G. H. Khaksari, R. K. Karne, and A. L. Wijesinha, A bare machine application development methodology, FCS Int'l Journal of Computers and Their Applications (IJCA), vol.19, issue.1, pp.10-25, 2012.

L. Gwennap, Kirin 950 takes performance lead, Mobile Chip Report, 2015.

T. Vogelsang, Understanding the energy consumption of dynamic random access memories, IEEE/ACM Int'l Symposium on Microarchitecture (MICRO)

, Calculating Memory Power for DDR4 SDRAM, 2017.

C. Gonzalez, E. Fluhr, D. Dreps, D. Hogenmiller, R. Rao et al., POWER9: A processor family optimized for cognitive computing with 25Gb/s accelerator links and 16Gb/s PCIe Gen4, Int'l Solid-State Circuits Conference (ISSCC), pp.50-51, 2017.

B. Bowhill, B. Stackhouse, N. Nassif, Z. Yang, A. Raghavan et al., The Xeon processor E5-2600 v3: A 22nm 18-core product family, Int'l Solid-State Circuits Conference (ISSCC), pp.1-3, 2015.

I. Cutress, Analyzing the silicon: Die size estimates and arrangements: The Intel Skylake-X review, 2017.

S. Wu, C. Y. Lin, M. C. Chiang, J. J. Liaw, J. Y. Cheng et al., A 16nm FinFET CMOS technology for mobile SoC and computing applications, Int'l Electron Devices Meeting (IEDM), 2013.

M. Stantic, O. Palomar, T. Hayes, I. Ratkovic, A. Cristal et al., An integrated vector-scalar design on an in-order ARM core, ACM Transactions on Architecture and Code Optimization (TACO), vol.14, issue.2, 2017.

Y. Ge, M. Tomono, M. Ito, and Y. Hirose, High-performance and low-power consumption vector processor for LTE baseband LSI, Fujitsu Scientific and Technical Journal (FSTJ), vol.50, issue.1, pp.132-137, 2014.

Y. Lee, C. Schmidt, S. Karandikar, D. Dabbelt, A. Ou et al., Hwacha preliminary evaluation results, v3.8.1, 2015.

N. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., In-datacenter performance analysis of a tensor processing unit, Int'l Symposium on Computer Architecture (ISCA), pp.1-12, 2017.

A. N. Sarma and V. D. Ambali, Cooling solution for computing and storage server, IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), pp.840-849, 2017.

, DDR4 Registering Clock Driver (DDR4RCD01), JEDEC JESD82-31, 2016.

, DDR4 Data Buffer Definition (DDR4DB01), JEDEC JESD82-32, 2016.

, Intel C112/C114 Scalable Memory Buffer (SMB) data sheet, pp.332444-332445, 2015.

, POWER8 Memory Buffer Datasheet for DDR3 Applications, 2016.

. Ncab-group, Cost drivers in PCB production, NCAB Group Seminars, 2015.

, Samsung Galaxy S8, vol.10, 2017.

M. Alfano, B. Black, J. Rearick, J. Siegel, M. Su et al., Unleashing fury: A new paradigm for 3-D design and test, IEEE Design & Test, vol.34, issue.1, pp.8-15, 2017.

S. Burke, The cost of HBM2 vs. GDDR5 & why AMD had to use it, 2017.

L. Gwennap, Cortex-A76 rev amps core design, Microprocessor Report, 2018.

C. Staelin and L. Mcvoy, Lmbench -system benchmarks, 2007.

K. Inc, The state of data science & machine learning, 2017.

R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research, vol.9, pp.1871-1874, 2008.

C. Chang and C. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, vol.2, issue.3, pp.1-27, 2011.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed et al., SSD: Single shot multibox detector, European Conference on Computer Vision (ECCV), pp.21-37, 2016.

T. Chen and C. Guestrin, XGBoost: A scalable tree boosting system, 22nd ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pp.785-794, 2016.

K. Czechowski, V. W. Lee, E. Grochowski, R. Ronen, R. Singhal et al., Improving the energy efficiency of big cores, Int'l Symposium on Computer Architecture (ISCA), pp.493-504, 2014.

S. M. Hassan, S. Yalamanchili, and S. Mukhopadhyay, Near data processing: Impact and optimization of 3D memory system architecture on the uncore, Int'l Symposium on Memory Systems (MEMSYS), pp.11-21, 2015.

N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi et al., The gem5 simulator, ACM SIGARCH Computer Architecture News, vol.39, issue.2, pp.1-7, 2011.

R. Clapp, M. Dimitrov, K. Kumar, V. Viswanathan, and T. Willhalm, Quantifying the performance impact of memory latency and bandwidth for big data workloads, Int'l Symposium on Workload Characterization (IISWC), pp.213-224, 2015.

. Intel-labs and . Barcelona, Overview -Dreams -AWB/Leap projects, 2013.

K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer et al., A view of the parallel computing landscape, Communications of the ACM, vol.52, issue.10, pp.56-67, 2009.

A. Barbalace, A. Iliopoulos, H. Rauchfuss, and G. Brasche, It's time to think about an operating system for near data processing architectures, 16th Workshop on Hot Topics in Operating Systems (HotOS), pp.56-61, 2017.