Glasgow full shut down procedure

From GridPP Wiki
Revision as of 12:32, 21 December 2007 by Andrew elwell (Talk | contribs)

Controlled shutdown of some nodes (caused by environmental limitations, scheduled upgrades, etc.)

Log in to svr016 and mark the worker nodes concerned as offline.

This can be simple (for one or two hosts)

svr016 #> pbsnodes -o node008 node140

Or slightly more complex

svr016 #> for i in 01 03 05 07 09 11 13 15 17 19 21 23 25 27 29 31 33 35 ; do echo -n " node0$i" ; done | xargs pbsnodes -o
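
The loop above builds the odd-numbered node names node001 to node035. A minimal sketch of the same idea as a reusable function, assuming a POSIX shell. The name odd_nodes is hypothetical and not part of the cluster tooling:

```shell
# odd_nodes FIRST LAST
# Print every second node name from FIRST to LAST, zero-padded to three
# digits and separated by spaces. Pass plain integers (1, not 01), since
# shell arithmetic treats a leading zero as octal.
odd_nodes() {
    i=$1
    sep=""
    while [ "$i" -le "$2" ]; do
        printf '%snode%03d' "$sep" "$i"
        sep=" "
        i=$((i + 2))
    done
    printf '\n'
}

# Possible use on svr016 (not run here):
#   pbsnodes -o $(odd_nodes 1 35)
```

Running odd_nodes on its own first lets you check the list before anything is marked offline.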

As we run long jobs (up to 7 days), the worker nodes may take a long time to drain. Follow the Ganglia plots to see when the nodes go idle. Once that's done you can shut down the nodes cleanly with the poweroff command on the nodes themselves.

pdsh -w node008,node140 poweroff
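
As a cross-check alongside the Ganglia plots, the batch system can also be asked which offline nodes have no jobs left. This is only a sketch: it assumes Torque-style pbsnodes output, where each node record starts with the unindented node name followed by indented "key = value" lines, and a "jobs = ..." line is present only while jobs are still running. The function name idle_offline_nodes is hypothetical.

```shell
# idle_offline_nodes
# Read `pbsnodes -a` style output on stdin and print the names of nodes
# whose state includes "offline" and which have no "jobs = ..." line.
idle_offline_nodes() {
    awk '
        # Print the record collected so far if it is offline and idle.
        function report() { if (name != "" && offline && !busy) print name }
        # An unindented, non-empty line starts a new node record.
        /^[^ \t]/ { report(); name = $1; offline = 0; busy = 0; next }
        $1 == "state" && $3 ~ /offline/ { offline = 1 }
        $1 == "jobs" && NF > 2 { busy = 1 }
        END { report() }
    '
}

# Possible use on svr016 (not run here):
#   pbsnodes -a | idle_offline_nodes
```

Nodes printed by this are candidates for poweroff; check the output against Ganglia before acting on it.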

Please note that to power a node back on from this state you'll need either to press the power button on the front of the box itself or to cycle the power. The nodes should be set to come on automatically after power loss, which means we can control them via the APC masterswitches.

Urgent clean shutdown (minor panic)

Log in to svr031 and power off all nodes (clean shutdown):

pdsh -w node[001-140] poweroff

Once they're down you can shut off the power to them:

powernode --host=node001-140 --off
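
Before cutting the power it may be worth confirming that the nodes have stopped responding. A rough sketch, assuming the Linux ping options -c (packet count) and -W (reply timeout in seconds). The function name still_up is hypothetical, and a node that no longer answers ping has not necessarily finished halting, so treat this as a hint rather than proof.

```shell
# still_up HOST...
# Print each host that still answers a single ping within one second.
# Empty output means none of the listed hosts responded.
still_up() {
    for host in "$@"; do
        if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}

# Possible use on svr031 (not run here):
#   still_up node001 node002 node003
```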

Then take down the servers / DPMdisks / NFSdisk / headnode FIXME - Any preferred order?

pdsh -w svr0[16-30] poweroff
pdsh -w disk0[32-41] -x disk037 poweroff 
pdsh -w disk037 poweroff
powernode --host=...     --off 

Finally, once you've checked that all the lights are off on the machines, shut down svr031 itself.

FIXME - What about the nortel switch?
poweroff
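
Since every command in this section is destructive, it can help to rehearse the sequence first. A minimal sketch of a dry-run wrapper, assuming a POSIX shell. The names run and DRY_RUN are hypothetical and not part of the existing tooling:

```shell
# Default to dry-run mode; set DRY_RUN=0 to execute for real.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]
# In dry-run mode, print the command instead of executing it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Possible use on svr031 (not run here):
#   run pdsh -w 'node[001-140]' poweroff
#   run pdsh -w 'svr0[16-30]' poweroff
```

With DRY_RUN left at its default, each step only prints what it would do, so the order can be checked without touching any machine.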

It should all now be nice and quiet.

Very urgent shutdown

The big red button is located by the aircon unit at the back of the room.