Using the Grid

From GridPP Wiki
Revision as of 12:27, 3 February 2020 by Andrew Mcnab 40269d547a (Talk | contribs)

This document is a skeleton, and feedback is requested on further points to cover. Please either improve the document yourself on this wiki or send feedback to andrew.mcnab AT cern.ch

This document sets out some good practices for projects which want to use the GridPP grid/cloud infrastructure.

Projects are referred to as Virtual Organisations. To get started, please contact the GridPP management. You will then be directed to appropriate people in the operations team to discuss technical requirements.

GridPP DIRAC Service

The GridPP DIRAC service is the most common choice for running jobs and managing data. It can store one or more replicas of each data file at different sites, and direct jobs to a suitable site depending on which data file(s) they need.
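To illustrate the idea of directing jobs by data location, here is a minimal sketch in Python. It is a toy model and not DIRAC's own API or matching algorithm: the replica catalogue, the file names, the site names and the candidate_sites function are all hypothetical and exist only for this example. It assumes a job may run at any site that holds a replica of every input file it needs.

```python
from typing import Dict, List, Set

# Hypothetical replica catalogue: logical file name -> sites holding a copy.
# File and site names are invented for illustration.
REPLICAS: Dict[str, Set[str]] = {
    "/myvo/data/run1.dat": {"SITE-A", "SITE-B"},
    "/myvo/data/run2.dat": {"SITE-B", "SITE-C"},
}


def candidate_sites(input_files: List[str],
                    replicas: Dict[str, Set[str]]) -> Set[str]:
    """Return the sites that hold a replica of every requested input file."""
    if not input_files:
        raise ValueError("this toy model needs at least one input file")
    site_sets = []
    for lfn in input_files:
        if lfn not in replicas:
            raise KeyError(f"no replica registered for {lfn}")
        site_sets.append(replicas[lfn])
    # A site qualifies only if it appears in the replica set of every file.
    return set.intersection(*site_sets)


if __name__ == "__main__":
    wanted = ["/myvo/data/run1.dat", "/myvo/data/run2.dat"]
    print(sorted(candidate_sites(wanted, REPLICAS)))
```

In this model, keeping more than one replica of a file widens the set of sites at which a job that needs it can run, which is one reason to place copies at several sites.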

Operations

General communication with the operations team takes place at the weekly GridPP Operations meetings, held at 11am on Tuesdays, and on the TB-Support mailing list. It is useful to identify one or more expert users who will attend these meetings and report back to the rest of your project. Expert users will probably also need to manage "production jobs" that produce data products used by non-expert users for further analysis, and this will be an ongoing operational task.

Expert users

Expert users should become familiar with the dynamics of the system. For example, they should check the GOCDB sites database to see whether sites are unavailable due to a planned downtime for an upgrade or reorganisation. As the number of non-expert users grows, experts should expect to field questions from other users and, where necessary, pass them on to the GridPP operations team or submit a ticket to the site via the GGUS ticketing system. Sites normally expect to receive tickets rather than emails so they can keep track of outstanding issues properly.

In DIRAC and other systems, users have the ability to "ban" sites temporarily if they have problems. It is important that these problems are passed along the chain to experts within the project: otherwise, issues don't get fixed, because sites are unaware of problems which only affect one virtual organisation. It is also important to remove temporary bans once the problem is resolved: otherwise a user can find that all the viable targets for their jobs are being excluded unnecessarily.
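One way a project might keep temporary bans from lingering is to record each ban with a reason and an expiry time. The Python sketch below shows this under stated assumptions: the BanList class, the usable_sites function, the one-day default duration and the site names are all hypothetical, and none of this is how DIRAC itself stores or applies bans.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Set, Tuple


class BanList:
    """Toy record of temporarily banned sites, each with a reason and expiry."""

    def __init__(self) -> None:
        self._bans: Dict[str, Tuple[str, datetime]] = {}

    def ban(self, site: str, reason: str, now: datetime,
            duration: timedelta = timedelta(days=1)) -> None:
        """Ban a site until now + duration, keeping the reason for reporting."""
        self._bans[site] = (reason, now + duration)

    def active(self, now: datetime) -> Dict[str, str]:
        """Drop expired bans and return {site: reason} for those still in force."""
        self._bans = {s: (r, e) for s, (r, e) in self._bans.items() if e > now}
        return {s: r for s, (r, _) in self._bans.items()}


def usable_sites(candidates: Iterable[str], bans: BanList,
                 now: datetime) -> Set[str]:
    """Return the candidate sites that are not currently banned."""
    return set(candidates) - set(bans.active(now))


if __name__ == "__main__":
    t0 = datetime(2020, 2, 3, 12, 0)
    bans = BanList()
    bans.ban("SITE-A", "jobs fail at startup", t0)
    print(sorted(usable_sites({"SITE-A", "SITE-B"}, bans, t0 + timedelta(hours=1))))
    print(sorted(usable_sites({"SITE-A", "SITE-B"}, bans, t0 + timedelta(days=2))))
```

Keeping the reason alongside each ban gives the project's experts something concrete to put in a ticket to the site, and the expiry means a forgotten ban stops excluding the site after a fixed period.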