Glasgow Cluster YPF Install
YPF generates and maintains the infrastructure to install a machine via PXE boot, which then starts a Scientific Linux kickstart install.
The kickstart install does only very basic machine configuration, installing a base RPM set with updates. When the machine reboots it signals this fact to the master server, requesting its secrets (ssh keys, etc.). Then cfengine starts and finalises the machine's configuration.
(The advantage of concentrating the install complexity in cfengine is that by this point ssh is running, so you can log into the machine to sort out any problems. A stalled kickstart is much more difficult to recover from.)
If this is a new node in the cluster then it needs to be added to the YPF database and the various cluster configuration files regenerated; see Glasgow Cluster YPF Adding A New Host.
You will also need to generate the IP configuration that drops into /etc/sysconfig/network-scripts/ifcfg-ethN using the mkifcfg and mkskel pages.
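For reference, a file of the kind mkifcfg drops into /etc/sysconfig/network-scripts/ looks roughly like this. All values below are illustrative placeholders, not a real cluster node; the real contents come from the YPF database via mkifcfg.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0
# Illustrative values only -- the real file is generated by mkifcfg
# from the YPF database for each host.
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.141.0.13
NETMASK=255.255.0.0
HWADDR=00:11:22:33:44:55
```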
- Use the setboot utility to set the correct PXE boot for the host, e.g., sl-4x-x86_64-eth0-ks
- Note the coded boot image here: Scientific Linux (sl), version 4x, architecture x86_64, install over eth0, with kickstart (ks).
- For a list of boot images see the /usr/local/ypf/tftp/pxelinux.cfg directory.
- Note that the tftp server does follow soft links, so the kernel and initrd images in /usr/local/ypf/tftp are links into the /mirror area. (see Glasgow Mirrors)
If there is no suitable PXE config, it is easiest to copy and edit an existing one, e.g., sl-5r-i386-eth0-ks:
# PXE boot into Scientific Linux 5
DEFAULT sl-5r-i386-eth0-ks
LABEL sl-5r-i386-eth0-ks
  KERNEL vmlinuz-sl-5r-i386
  APPEND initrd=initrd-sl-5r-i386 keymap=uk lang=en_GB ks=http://master.beowulf.cluster/alt/autokick.php ksdevice=eth0 headless
- Check the contents of /usr/local/ypf/www/classes.conf which determine the kickstart file. (N.B. this configuration will move to the YPF database.)
- Currently for workernodes this should be nodeXXX: sl-4x-x86_64 eth0 yumsl4x compat32 cfengine wn
- Note the correspondence with the boot image naming scheme above.
- Unless a node is completely new, this file should already be correct.
- Allow the node to recover secrets
- Use allowsecret --host=nodeXXX,nodeYYY
- Reboot the node(s), either manually or with powernode --reboot --host=nodeXXX,nodeYYY.
- Nodes will PXE boot and kickstart themselves.
- As part of the kickstart process the nodes do a yum update, so they will reboot fully patched.
- Upon reboot, a node will signal its first boot by writing a file into /usr/local/ypf/var/firstboot/NODENAME
- The firstbootwatcher script looks for hostnames in this directory, and if they are authorised, pushes the ssh keys, cfengine keys and grid certificates to the host.
- cfengine starts and configures the node for use.
- cfengine will not start if the node doesn't have its correct ssh host keys (this is a proxy for having been granted secrets).
- If everything looks ok after YAIM has run, pbs_mom is started on WNs so they join the batch system.
Note that if WNs have been taken out of the batch system they are probably marked offline in torque, so use pbsnodes -c NODE to clear the offline status.
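The firstboot handshake described above can be sketched roughly as follows. This is an illustrative mock, not the real firstbootwatcher script: the watch directory layout is from this page, but the authorisation file and the "push" step are hypothetical stand-ins (in reality allowsecret records the authorisation and the watcher pushes ssh keys, cfengine keys and grid certificates).

```shell
# Illustrative mock of the firstboot handshake (NOT the real firstbootwatcher).
# Runs in a sandbox directory; on the master the watch directory is
# /usr/local/ypf/var/firstboot.
WATCHDIR=${WATCHDIR:-/tmp/firstboot-demo}
AUTHFILE="$WATCHDIR/allowed"   # hypothetical: allowsecret would populate this

mkdir -p "$WATCHDIR"
# Simulate: allowsecret authorised node013, and node013 has just rebooted
# and signalled its first boot by writing a file named after itself.
echo node013 > "$AUTHFILE"
touch "$WATCHDIR/node013"

for f in "$WATCHDIR"/*; do
    node=$(basename "$f")
    [ "$node" = "$(basename "$AUTHFILE")" ] && continue  # skip the auth list itself
    if grep -qx "$node" "$AUTHFILE"; then
        # The real script would push ssh keys, cfengine keys and grid certs here.
        echo "pushing secrets to $node"
        rm -f "$f"   # consume the firstboot signal
    else
        echo "ignoring unauthorised $node"
    fi
done
```

Running the sketch prints "pushing secrets to node013" and removes the signal file, mirroring the one-shot nature of the real handshake.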
Two worker nodes, node013 and node088, have been repaired and need to be rebuilt and brought back into service:
svr031# setboot --verbose --image=sl-4x-x86_64-eth0-ks --host=node013,node088
svr031# allowsecret --verbose --host=node013,node088
svr031# powernode --reboot --host=node013,node088
[Drink coffee ~10 min]
If this works, the nodes should now appear in ganglia and in torque. If pbs_mom was started on the nodes then YAIM ran correctly, so all is well. If the nodes are marked offline, run pbsnodes -c node013 node088 on svr016.