Category:GridPP Operations

From GridPP Wiki

The core task areas are:

§ Staged rollout coordination: The Grid is based on well-defined middleware components that are periodically updated with new functionality, security patches and bug fixes. Although releases are verified at the developer level, the UK needs to contribute to the worldwide rollout process, which requires site testing and feedback. The GridPP operations team (Ops-Team) must coordinate the certification of all releases in the UK context before they are rolled out to all UK sites.


§ On-duty coordination: To ensure the high quality of UK resources, the Grid is instrumented with extensive monitoring and alarms. An on-duty team responds to problems by raising trouble-tickets that inform sites of issues and track their resolution. The Ops-Team coordinates this activity.


§ Ticket follow-up: Tickets may be raised by users via their VOs, by the on-duty team as described above, or by site administrators experiencing problems with the middleware or infrastructure. Tickets may need to be forwarded through several interfaced ticketing systems to reach the appropriate expert and, once there, additional information is often required. The Ops-Team ensures that tickets are followed up by facilitating communication, providing advice on the next step, and ensuring that tickets do not stagnate in the system.


§ Regional tools: Tools such as Site Availability Monitoring (SAM) servers coupled with Nagios probes, a regional dashboard, and instances of the APEL accounting and Grid Operations databases have been devolved to the regions in the move towards the EGI/NGI model. These tools provide monitoring, accounting and coordination at a national level. The Ops-Team ensures that these tools are deployed; individual members act as experts who track technical developments and represent the UK in international discussions.


§ Documentation: Good documentation improves the efficiency with which day-to-day tasks are undertaken and is essential for long-term project stability. The Ops-Team ensures that installation and deployment instructions applicable to the UK are maintained.


§ Security (working with the security officer): To support, and provide cover for, the GridPP operational security officer, the Ops-Team provides deputy security coordinators (currently two). This helps ensure a well-coordinated and rapid response to security incidents, minimising exposure.


§ Monitoring: The Ops-Team oversees the deployment of the growing suite of Grid monitoring tools and helps system administrators find, customise and interpret information relevant to their own sites.


§ Accounting: The Ops-Team helps the UK sites to publish the correct accounting information by leading benchmarking activities and performing checks on the published data.


§ Core Grid services: The Ops-Team is responsible for the deployment and operation of core services such as the Workload Management System (WMS) and the top-level BDII (Berkeley Database Information Index). These are essential for the Grid to operate and, for improved resilience, they are hosted at several sites in addition to the Tier-1.


§ Wider VO issues: GridPP's infrastructure is used by many smaller, non-LHC VOs. The Ops-Team is responsible for tracking usage requests and ensuring that enablement is as smooth as possible. Where problems are encountered by a VO or its users, the Ops-Team will provide support and guidance.


§ Grid interoperation: Otherwise known as working with other grids. The Ops-Team needs to remain aware of, and contribute to, developments in the NGI/NGS, in EGI and elsewhere. This task covers how we work with those other grids and, for example, understanding what is required in the wider context.