A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex

Abstract: This paper reports on activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each of ∼100 bytes, all equally likely to be searched or counted. Data formats are an important area for optimizing the performance and storage footprint of Hadoop-based applications. This work reports on production usage and on tests of several data formats, including Map Files, Apache Parquet, and Avro, as well as various compression algorithms. The query engine also plays a critical role in the architecture. We also report on the use of HBase for the EventIndex, focussing on the optimizations performed in production and on scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports.
Document type: Conference paper
22nd International Conference on Computing in High Energy and Nuclear Physics, Oct 2016, San Francisco, United States. J.Phys.Conf.Ser., 898 (6), pp.062020, 2017, 〈10.1088/1742-6596/898/6/062020〉

https://hal.archives-ouvertes.fr/hal-01669628
Contributor: Inspire Hep
Submitted on: Wednesday, 20 December 2017 - 23:47:33
Last modified on: Thursday, 11 January 2018 - 06:14:23

Citation

Zbigniew Baranowski, Luca Canali, Rainer Toebbicke, Julius Hrivnac, Dario Barberis. A study of data representation in Hadoop to optimize data storage and search performance for the ATLAS EventIndex. 22nd International Conference on Computing in High Energy and Nuclear Physics, Oct 2016, San Francisco, United States. J.Phys.Conf.Ser., 898 (6), pp.062020, 2017, 〈10.1088/1742-6596/898/6/062020〉. 〈hal-01669628〉
