HEPiX shutdown command

From GridPP Wiki

In the HEPiX Virtualization working group, Tony Cass proposed a protocol by which sites provide virtual machines with a command they can use to shut down the VM from the inside. This is listed in the summary of the HEPiX protocol for machinefeatures and jobfeatures by the WLCG WMTEG.

The full path of the shutdown command is the contents of the file $MACHINEFEATURES/shutdown_command, where $MACHINEFEATURES is exported into the environment of the relevant processes in the VM. In the standard CERNVM images, this variable is set in /var/lib/hepix/context/context.sh. If $MACHINEFEATURES is not set, then /etc/machinefeatures can be used as the default location, as is the case on lxplus/lxbatch.
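As an illustration of the lookup described above, the following minimal Python sketch reads the command path from $MACHINEFEATURES/shutdown_command and falls back to /etc/machinefeatures when the variable is unset. The function name find_shutdown_command is hypothetical, and treating /etc/machinefeatures as a directory containing a shutdown_command file is an assumption based on the text, not something the protocol summary is quoted on here.

```python
import os
from pathlib import Path

# Fallback location named in the text for when $MACHINEFEATURES is not set.
# Assumption: it is a directory laid out like $MACHINEFEATURES.
DEFAULT_MACHINEFEATURES = "/etc/machinefeatures"


def find_shutdown_command(environ=None):
    """Return the full path of the shutdown command, or None if not provided.

    Hypothetical helper: `environ` defaults to the process environment and
    can be replaced by a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    features_dir = Path(env.get("MACHINEFEATURES") or DEFAULT_MACHINEFEATURES)
    try:
        # The file's contents are the full path of the command.
        command = (features_dir / "shutdown_command").read_text().strip()
    except OSError:
        # Missing directory or file: the site has not provided a command.
        return None
    return command or None
```

A caller inside the VM would check for None before attempting to run anything, since a site is not guaranteed to provide the file.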

In March 2013, Andrew McNab proposed an extension to this protocol, under which VMs can optionally pass arguments to the shutdown command to give information to the hypervisor and site.

The motivation is that sites are likely to find it useful to have some basic indication of why a VM has shut itself down, and the component inside the VM that decides to use that command is in a good position to provide that information. This is especially useful if the VM is running job agents that fetch the real jobs from the experiment's central task queue, as we currently do with pilot jobs.

In more detail, the proposal is that VMs may use the command indicated by shutdown_command in the above document, and have the option of providing command line arguments, which are concatenated into a single string. That string starts with a three-digit number, followed by a space, followed by a human-readable message, like HTTP responses. As with HTTP status codes, the values of that number are grouped into broad categories, with scope for adding finer-grained status reporting in the future. Much of this is only relevant to the job agent model, but it is quite an important case for us.

100 Shutdown as requested by the VM's host/hypervisor

200 Intended work completed ok

300 No more work available from task queue

400 Site/host/VM is currently banned/disabled from receiving more work

500 Problem detected with environment/VM provided by the site

600 Grid-wide problem with job agent or application within VM

700 Transient problem with job agent or application within VM
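The message format and the categories listed above can be sketched in Python as follows. This is only an illustration: the names CATEGORIES, format_shutdown_message, category_of and shutdown_argv are hypothetical, and passing the whole message as a single argument is one possible choice, since the proposal says the arguments are concatenated into a single string anyway.

```python
# Broad categories from the proposal, keyed by the first code of each group.
CATEGORIES = {
    100: "Shutdown as requested by the VM's host/hypervisor",
    200: "Intended work completed ok",
    300: "No more work available from task queue",
    400: "Site/host/VM is currently banned/disabled from receiving more work",
    500: "Problem detected with environment/VM provided by the site",
    600: "Grid-wide problem with job agent or application within VM",
    700: "Transient problem with job agent or application within VM",
}


def format_shutdown_message(code, text):
    """Build '<three-digit code> <human-readable text>'."""
    if not 100 <= code <= 799:
        raise ValueError(f"code {code} is outside the categories defined so far")
    return f"{code:03d} {text}"


def category_of(message):
    """Return the category (100, 200, ...) that a shutdown message belongs to."""
    code_text, separator, _ = message.partition(" ")
    is_code = len(code_text) == 3 and code_text.isascii() and code_text.isdigit()
    if not is_code or not separator:
        raise ValueError(f"not a valid shutdown message: {message!r}")
    # Sub-cases such as 501 fall into their hundred's category, as with HTTP.
    return (int(code_text) // 100) * 100


def shutdown_argv(command, code, text):
    """Argument vector for invoking the site's shutdown command (not run here)."""
    return [command, format_shutdown_message(code, text)]
```

A job agent that finds the task queue empty could then build its call with shutdown_argv(command, 300, "No more work available from task queue") and hand the result to subprocess.run, where command is the path read from shutdown_command.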

The 200 codes would be used when a job is run directly inside the VM (i.e. no job agent and central task queue), or when job agents only run a set number of times and have finished.

The 300 codes (saying this experiment currently has no more work) may be useful in temporarily rebalancing the site to accept more jobs from other experiments/VOs, depending on how target shares are implemented at the site.

The 400 and 500 codes might be very useful to flag up in local monitoring, to prompt investigation of the problems that are leading to these outcomes.

The 600 and 700 codes allow the site to receive higher level errors which are not its responsibility. This is likely to be helpful in deciding where responsibility lies, and as a basis for contacting the VO to ensure the site's resources are being used effectively. VOs can use the 600 codes to indicate to the site that any instances of the VM created in the near future may also fail.

In the Vac implementation, codes 300-699 trigger Vac's backoff procedure to throttle the rate of creating VMs that are likely to fail or find no work.
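The range check implied by this rule can be written as a one-line predicate. The sketch below is not taken from Vac's source code; triggers_backoff is a hypothetical name, and only the 300-699 range comes from the text.

```python
def triggers_backoff(code):
    """True if a shutdown code falls in the 300-699 range that the text says
    triggers Vac's backoff procedure. Illustrative only, not Vac's own code."""
    return 300 <= code <= 699


# Codes 100-299 (host-requested shutdown, work completed) and 700-799
# (transient problems) fall outside the range.
for code in (100, 200, 300, 501, 699, 700):
    print(code, triggers_backoff(code))
```

Transient problems (700) are left out of the range, which fits the idea that a fresh VM may well succeed where the previous one hit a temporary fault.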

As with HTTP, this numbering scheme allows sub-cases to be added in the future (e.g. sanity checking before starting job agents could lead to "501 VM has less disk space than needed").