Glasgow full shut down procedure
Latest revision as of 12:32, 21 December 2007
Controlled shutdown of some nodes (caused by environmental limitations, scheduled upgrades, etc.)
Log in to svr016 and mark the worker nodes concerned as offline.
This can be simple (for one or two hosts):
svr016 #> pbsnodes -o node008 node140
Or slightly more complex (here, the odd-numbered nodes from node001 to node035):
svr016 #> for i in 01 03 05 07 09 11 13 15 17 19 21 23 25 27 29 31 33 35 ; do echo -n " node0$i" ; done | xargs pbsnodes -o
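The hand-typed list above can also be generated. The following is a minimal sketch, assuming the nodes follow the node001 to node140 naming used on this page; `odd_nodes` is a hypothetical helper, not an existing site tool. It only prints the `pbsnodes` command as a dry run, so nothing is marked offline until the `echo` is removed.

```shell
# Hypothetical helper: print every second node name from FIRST to LAST,
# zero-padded to three digits (node001, node003, ...).
odd_nodes() {
    first=$1
    last=$2
    i=$first
    while [ "$i" -le "$last" ]; do
        printf 'node%03d\n' "$i"
        i=$((i + 2))
    done
}

# Dry run: show the command that would be executed on svr016.
# Remove the leading "echo" to actually mark the nodes offline.
echo pbsnodes -o $(odd_nodes 1 35)
```

Printing the command first makes it easy to check the node list before any node is taken out of service.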
As we run long jobs (up to 7 days), the worker nodes may take a long time to drain. Follow the Ganglia plots to see when the nodes go idle. Once that's done you can shut down the nodes cleanly with the poweroff command on the nodes themselves.
pdsh -w node008,node140 poweroff
Please note that to power a node back on from this state you'll need either to press the power button on the front of the box itself or to cycle the power. The nodes should be set to come on automatically after power loss, which means we can control them via the APC MasterSwitches.
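As a cross-check on the Ganglia plots before running poweroff, the batch system can be asked which offline nodes have no jobs left. This is a rough sketch only: `idle_offline_nodes` is a hypothetical helper, and it assumes the usual `pbsnodes -a` layout of an unindented node name followed by indented `state = ...` and `jobs = ...` lines. Check it against real output on svr016 before relying on it.

```shell
# Hypothetical helper: read "pbsnodes -a" style output on stdin and print
# the nodes that are marked offline and have no "jobs = ..." entry.
idle_offline_nodes() {
    awk '
        function emit() { if (name != "" && offline && !busy) print name }
        # An unindented line starts a new node record.
        /^[^ \t]/ { emit(); name = $1; offline = 0; busy = 0; next }
        $1 == "state" && $3 ~ /offline/ { offline = 1 }
        $1 == "jobs" && $3 != "" { busy = 1 }
        END { emit() }
    '
}

# Intended usage on svr016 (not run here):
#   pbsnodes -a | idle_offline_nodes
```

Nodes printed by the helper should be the ones that have finished draining; anything still listed with jobs is left alone.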
Urgent clean shutdown (minor panic)
Log in to svr031 and power off all nodes (clean shutdown):
pdsh -w node[001-140] poweroff
Once they're down you can shut off the power to them:
powernode --host=node001-140 --off
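To avoid cutting power to a node that is still shutting down, it can help to wait until the node stops answering ping before the powernode step. A minimal sketch, assuming ICMP is allowed from svr031 to the nodes; `wait_until_down`, `PROBE` and `INTERVAL` are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: return 0 once HOST stops answering the probe,
# or 1 if it is still answering after TRIES attempts.
wait_until_down() {
    host=$1
    tries=${2:-30}
    probe=${PROBE:-ping -c 1 -W 2}   # one ping with a 2 second timeout
    while [ "$tries" -gt 0 ]; do
        if ! $probe "$host" >/dev/null 2>&1; then
            return 0                 # no reply: treat the node as down
        fi
        tries=$((tries - 1))
        sleep "${INTERVAL:-10}"
    done
    return 1                         # still up after all attempts
}

# Intended usage on svr031 (not run here):
#   wait_until_down node001 && echo "node001 is down"
```

A node that no longer answers ping is not proof of a clean halt, so the lights on the machines remain the final check.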
Then take down the servers / DPM disks / NFS disk / headnode. FIXME - Any preferred order?
pdsh -w svr0[16-30] poweroff
pdsh -w disk0[32-41] -x disk037 poweroff
pdsh -w disk037 poweroff
powernode --host=... --off
Finally kill svr031 itself once you've checked that all the lights are off on the machines.
FIXME - What about the Nortel switch?
poweroff
It should all now be nice and quiet.
Very urgent shutdown
The big red button is located by the aircon unit at the back of the room.