Grid Operations: evolution of operational model over the first year
Abstract
This paper reports on the evolution of the operational model set up in the Enabling Grids for E-sciencE (EGEE) project, and on the implications of Grid Operations in the LHC Computing Grid (LCG). The primary tasks of Grid Operations are monitoring resources and services, notifying the relevant contacts of failures, and tracking problems through a ticketing system. In addition, an escalation procedure is enforced to urge the responsible bodies to address and solve the problems. An extensive body of knowledge has been collected, documented and published in a way that facilitates rapid resolution of common problems. The number of sites in production grew from 60 to 170 in less than a year. Over the same period, the operations model evolved from a single person at CERN to a distributed model involving a growing number of geographically scattered teams. The evolution of both procedures and workflow requires steady refinement of the associated tools, such as the ticketing system, the knowledge database and the integration platform. Since the EGEE/LCG production infrastructure relies on the availability of robust operations mechanisms, it is essential to gradually improve the operational procedures and to track the progress of the tools' ongoing development.