Scheduled Downtimes

From GridPP Wiki

Use of Downtimes

DRAFT

Initial scope: only contains proposals for CE downtimes.

Process for CE downtimes

For a “Scheduled Downtime”, the downtime must be entered at least 24 hours before it starts. The [GOCDB] is the main tool for declaring downtimes, but extra measures, discussed below, are needed to stop all unwanted jobs and to allow tests of the (new) CE. Below are the pros and cons of the three most obvious approaches, presented in order from “least control” to “most control”.

Just put the downtime in the GOCDB

  • Procedure: Go to the GOCDB portal (https://goc.egi.eu/portal/). Search for your site and select it. At the bottom of the page, select “Add Downtime”. Then fill in and submit the form, selecting the services you wish to maintain. You can remove the downtime as soon as the maintenance is complete.
  • Pros: Very easy to do.
  • Cons: The monitoring system and some submission frameworks heed the GOCDB downtimes, but the WMSs pay no heed to them. Jobs therefore continue to be sent to non-operational CEs, with chaotic results. There is also no way to let your own test jobs in while keeping all other jobs out.

Put downtime in GOCDB and disable CE

  • Procedure: Use the procedure above to enter the downtime in the GOCDB. Then, when the downtime starts, use glite-ce-disable-submission to disable submissions to the CE and set it to draining. This blocks new submissions and advertises in the BDII, via GlueCEStateStatus, that the service is “Draining”.
glite-ce-disable-submission hepgrid5.ph.liv.ac.uk

You can remove the downtime as soon as the maintenance is complete, and use glite-ce-enable-submission to re-enable the service.

  • Pros: The monitoring system and all submission methods heed the GOCDB downtime and/or the glite-ce-disable-submission command.

  • Cons: When the CE comes back up after a build, there could be a race condition unless special measures are taken to make it come up with submissions already disabled. The race condition could cause the CE to toggle on and off as testing proceeds, with chaotic results. There is still no way to handle your test jobs, i.e. to allow test jobs in while rejecting all others.
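The drain and re-enable steps above could be wrapped in a small script so that the same commands are used at the start and end of every downtime. The following is only a sketch: the `run` helper, the `DRY_RUN` switch and the `CE_HOST` variable are illustrative names that do not come from this page, the command spelling follows the example line above, and it is assumed that glite-ce-enable-submission takes the same argument as glite-ce-disable-submission.

```shell
#!/bin/sh
# Sketch only: drain a CE at the start of a downtime, re-enable it at the end.
# CE_HOST, DRY_RUN and run() are illustrative names, not part of any gLite tool.

CE_HOST="hepgrid5.ph.liv.ac.uk"   # the CE used in the example above
DRY_RUN=1                         # 1 = only print the commands; 0 = execute them

# Print the command in dry-run mode, otherwise execute it unchanged.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Start of downtime: block new submissions so the CE starts draining.
drain_ce() {
    run glite-ce-disable-submission "$CE_HOST"
}

# End of downtime: accept submissions again
# (assumes the enable command takes the same argument).
resume_ce() {
    run glite-ce-enable-submission "$CE_HOST"
}
```

Calling `drain_ce` when the downtime starts and `resume_ce` once the maintenance has been verified keeps the two steps symmetrical. Leaving `DRY_RUN=1` lets the commands be reviewed before anything is changed on the CE.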

Put downtime in GOCDB and take the CE out of the site BDII transmissions

  • Procedure: Use the procedure above to enter the downtime in the GOCDB. Then, when the downtime starts, on the site BDII system, back up and edit the site-urls.conf file, removing the CE in question.
# cp /etc/bdii/gip/site-urls.conf /etc/bdii/gip/site-urls.conf.thedate
# vi /etc/bdii/gip/site-urls.conf

Alternatively, for more control, use glite-ce-disable-submission to set the CE to draining at the start of the downtime, and remove the CE from the BDII only once the CE has drained and maintenance starts, rather than for the whole of the downtime. This maximises the time that the site advertises some information about the CE.

You can remove the downtime as soon as the maintenance is complete, and restore site-urls.conf to its former state. Use glite-ce-enable-submission to re-enable the service if necessary.
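The backup, edit and restore steps can also be scripted instead of done by hand in an editor. The sketch below assumes that site-urls.conf holds one entry per line and that every line belonging to the CE contains its hostname; the function names and the date-stamped backup suffix are illustrative choices, not an established convention.

```shell
#!/bin/sh
# Sketch only: back up site-urls.conf and drop every line that mentions one CE.
# Assumes one entry per line, with the CE's hostname appearing on its line(s).

# Usage: remove_ce_from_site_urls /etc/bdii/gip/site-urls.conf hepgrid5.ph.liv.ac.uk
# Prints the path of the backup copy so that it can be restored later.
remove_ce_from_site_urls() {
    conf="$1"
    ce_host="$2"
    backup="$conf.$(date +%Y%m%d)"

    # Keep a copy of the original file, preserving mode and timestamps.
    cp -p "$conf" "$backup" || return 1

    # -F matches the hostname as a fixed string (dots are not wildcards);
    # -v keeps only the lines that do NOT mention the CE.
    grep -vF "$ce_host" "$backup" > "$conf"
    status=$?

    # grep exits with 1 when no lines are left, which is not an error here.
    [ "$status" -le 1 ] || return 1
    echo "$backup"
}

# Usage: restore_site_urls /etc/bdii/gip/site-urls.conf <backup path>
restore_site_urls() {
    cp -p "$2" "$1"
}
```

Because the function prints the backup path, the caller can store it and pass it to `restore_site_urls` once the maintenance is complete.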

  • Pros: Full control. The monitoring system and all submission methods heed the downtime and/or the removal of the BDII data. For general job submission, it doesn't matter whether the glite-ce-enable-submission or glite-ce-disable-submission command has been issued if the entire CE is removed from the BDII anyway. Tests can be conducted while the CE is out of the BDII, because the WMS can be told to send test jobs to your specific CE, side-stepping the site's BDII information, e.g.

# glite-wms-job-submit -e $somewms -a -r hepgrid5.ph.liv.ac.uk:8443/cream-pbs-long testJob.jdl

  • Cons: Requires manually editing a configuration file (with vi or similar) on the site BDII system.
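The test submission above refers to a testJob.jdl file whose contents are not given on this page. As an illustration only, a trivial job that reports the worker node's hostname could be written out as follows; the choice of executable and output file names is an assumption, and a real test job would normally exercise whatever the maintenance changed.

```shell
#!/bin/sh
# Sketch only: write a minimal testJob.jdl for a direct test submission.
# The executable and file names are illustrative, not taken from this page.

cat > testJob.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF
```

The quoted heredoc delimiter ('EOF') stops the shell from expanding anything inside the JDL text, so the file is written exactly as shown.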

Process for other downtimes

<TODO>