Glasgow New Cluster Installer


Overview

The installer is based on kickstart, which installs the base RPM set. After this install is done, a postbootinstaller forces ssh keys (and any other secrets) onto the host.

At this point cfengine is started (from rc.local) and configures the node properly.

After the node is installed, cfengine continues to run (once an hour), so that any updates to the system are applied quickly.
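For illustration, the boot-time start from rc.local and the hourly run might be wired up roughly as below. This is a hedged sketch only: the cfagent path, log file and cron entry are assumptions, and the real Glasgow setup may use cfexecd or different options.

# Appended to /etc/rc.d/rc.local by the installer (sketch, not the actual site script)
/usr/sbin/cfagent -q >> /var/log/cfagent-firstboot.log 2>&1

# /etc/cron.d/cfengine -- hourly run so configuration changes are picked up quickly (sketch)
0 * * * * root /usr/sbin/cfagent -q > /dev/null 2>&1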

Summary

Prolog

Make sure the node is known to YAIM People's Front by adding its MAC/IP addresses, etc.

These steps only need to be done once.

Main Sequence

  1. Use the setboot utility to set the correct PXE boot image for the host, e.g., sl-4x-x86_64-eth0-ks
    1. Note the coded boot image here: Scientific Linux (sl), version 4x, architecture x86_64, install over eth0, with kickstart (ks).
    2. For a list of boot images see the /usr/local/ypf/tftp/pxelinux.cfg directory.
  2. Check the contents of /usr/local/ypf/www/classes.conf, which determine which kickstart file is used.
    1. Currently, for worker nodes this should be nodeXXX: sl-4x-x86_64 eth0 yumsl4x compat32 cfengine wn
    2. Note the correspondence with the boot image naming scheme above.
    3. Unless a node is completely new, this file should already be correct.
  3. Allow the node to recover secrets
    1. Use allowsecret --host=nodeXXX-nodeYYY
  4. Reboot the node(s), either by hand or with nodepower --reboot --host=....
  5. Nodes will PXE boot and kickstart themselves
    1. As part of the kickstart process the nodes do a yum update, so they will reboot fully patched.
  6. Upon reboot, a node will signal its first boot by writing a file into /usr/local/ypf/var/firstboot/NODENAME
  7. The firstbootwatcher script looks for these hostnames and, if they are authorised, pushes the ssh keys, cfengine keys and grid certificates to the host (a sketch of this loop is given after this list).
  8. cfengine starts and configures the node for use.
    1. cfengine will not start if the node doesn't have its correct ssh host keys (this is a proxy for having been granted secrets).
    2. If everything looks ok after YAIM has been run, then pbs_mom is started to join the batch system.
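To make steps 6-8 concrete, here is a minimal sketch of the kind of polling loop firstbootwatcher could run. Only the firstboot path comes from this page; the allowed-secrets directory, the secrets layout and the scp push step are placeholders for the real tooling, not the actual implementation.

#!/bin/bash
# Sketch of a firstbootwatcher-style loop (placeholder paths marked as such).
FIRSTBOOT_DIR=/usr/local/ypf/var/firstboot       # marker files written by booting nodes
ALLOWED_DIR=/usr/local/ypf/var/allowedsecrets    # hypothetical: written by allowsecret
SECRETS_DIR=/usr/local/ypf/secrets               # hypothetical: ssh keys, cfengine keys, grid certs
while true; do
    for marker in "$FIRSTBOOT_DIR"/*; do
        [ -e "$marker" ] || continue
        node=$(basename "$marker")
        # Only push secrets to hosts that have been authorised with allowsecret
        if [ -e "$ALLOWED_DIR/$node" ]; then
            scp -qr "$SECRETS_DIR/$node/." "root@$node:/"    # placeholder push step
            rm -f "$marker" "$ALLOWED_DIR/$node"
        fi
    done
    sleep 30
done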

Note that if WNs are taken out of the batch system then they are probably marked as offline in torque, so use pbsnodes -c NODE to clear the offline status.
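For example, to see which nodes are flagged and then clear one (standard torque commands, run on the torque server):

svr016# pbsnodes -l                  # list nodes that are down, offline or unknown
svr016# pbsnodes -c node013          # clear the offline flag so jobs can run there again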

Example

Two worker nodes, node013 and node088, have been repaired and need to be rebuilt and brought back into service:

svr031# setboot --verbose --image=sl-30x-i386-eth1-ks --host=node013,node088
svr031# allowsecret --verbose --host=node013,node088
svr031# ppoweroff -n node013,node088
svr031# ppoweron -n node013,node088
[Drink coffee ~10 min]

If this works, the nodes should now appear in ganglia and in torque. If pbs_mom was started on the nodes, then YAIM ran correctly and all is well. If the nodes were marked offline, then on svr016 run pbsnodes -c node013 node088.
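A quick way to verify that the rebuild really completed (hedged; the hostnames are just those from the example above):

svr016# pbsnodes node013 node088          # state should be 'free', not 'offline' or 'down'
svr016# ssh node013 pgrep -l pbs_mom      # confirm pbs_mom is running on the node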