==Overview==

Installing machines is done as part of the [[:Category:YPF | YAIM People's Front]] system. See [[YPF Overview]] for more general details.

YPF generates and maintains the infrastructure to install a machine via PXE boot, which then starts a Scientific Linux kickstart install.

The kickstart install does only very basic machine configuration, installing a basic RPM set with updates. When the machine reboots it signals this fact to the master server, requesting its secrets (ssh keys, etc.). Then [[cfengine]] starts and finalises the machine's configuration.
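As an illustration of this first boot handshake, the node-side signal could be as simple as the sketch below. This is an assumption about the mechanism, not the actual YPF script, and the <tt>/master/ypf</tt> mount point is invented for the example:
<pre>
#!/bin/sh
# Hypothetical first boot signal (illustrative only): create a flag file named
# after this host on the master, then wait for the master to push the secrets.
touch /master/ypf/var/firstboot/$(hostname -s)
</pre>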

(The advantage of concentrating the install complexity in [[cfengine]] is that ssh has started by then, so the machine can be logged into to sort out any problems. A stalled kickstart is much more difficult to recover from.)

===Summary===

====Prolog====

If the node is new to the cluster then it needs to be added to the YPF database and various cluster configuration files regenerated; see [[Glasgow Cluster YPF Adding A New Host]].

You will also need to generate the IP configuration that drops into ''/etc/sysconfig/network-scripts/ifcfg-ethN'' using the '''mkifcfg''' and '''mkskel''' pages.
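For reference, the generated file looks something like the sketch below. The addresses shown are placeholders, not real cluster values; the actual values come from the YPF database via '''mkifcfg''':
<pre>
# Illustrative ifcfg-eth0 only -- real values are generated by mkifcfg
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.141.0.13
NETMASK=255.255.0.0
GATEWAY=10.141.255.254
</pre>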

====Main Sequence====
# Use the <tt>setboot</tt> utility to set the correct PXE boot image for the host, e.g., <tt>sl-4x-x86_64-eth0-ks</tt>
## Note the coded boot image name: Scientific Linux (sl), version 4x, architecture x86_64, install over eth0, with kickstart (ks).
## For a list of boot images see the <tt>/usr/local/ypf/tftp/pxelinux.cfg</tt> directory.
### Note that the tftp server does follow soft links, so the kernel and initrd images in <tt>/usr/local/ypf/tftp</tt> are links to the <tt>/mirror</tt> area (see [[Glasgow Mirrors]]).
## If there is no suitable PXE config then it is easiest to copy and edit an existing one, e.g. <tt>sl-5r-i386-eth0-ks</tt>, to PXE boot into Scientific Linux 5:
<pre>
DEFAULT sl-5r-i386-eth0-ks

LABEL sl-5r-i386-eth0-ks
KERNEL vmlinuz-sl-5r-i386
APPEND initrd=initrd-sl-5r-i386 keymap=uk lang=en_GB ks=http://master.beowulf.cluster/alt/autokick.php ksdevice=eth0 headless
</pre>

# Check the contents of <tt>/usr/local/ypf/www/classes.conf</tt>, which determines the kickstart file. (''N.B. this configuration will move to the YPF database.'')
## Currently for workernodes this should be <tt>nodeXXX: sl-4x-x86_64 eth0 yumsl4x compat32 cfengine wn</tt>
## Note the correspondence with the boot image naming scheme above.
## Unless a node is completely new, this file should already be correct.
# Allow the node to recover its secrets
## Use <tt>allowsecret --host=nodeXXX,nodeYYY</tt>
# Reboot the node(s), either manually or with <tt>powernode --host=nodeXXX,nodeYYY --reboot</tt>.
# Nodes will PXE boot and kickstart themselves
## As part of the kickstart process the nodes do a <tt>yum update</tt>, so they will reboot fully patched.
# Upon reboot, a node will signal its first boot by writing a file into <tt>/usr/local/ypf/var/firstboot/NODENAME</tt>
# The <tt>firstbootwatcher</tt> script looks for hostnames in this directory and, if they are authorised, pushes the ssh keys, cfengine keys and grid certificates to the host (a sketch of this loop is shown after this list)
# [[cfengine]] starts and configures the node for use.
## cfengine will not start if the node doesn't have its correct ssh host keys (this is a proxy for having been granted secrets).
## If everything looks ok after YAIM has been run, then <tt>pbs_mom</tt> is started on WNs to join them to the batch system.
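
For illustration only, the core of <tt>firstbootwatcher</tt> might look like the sketch below. The real script is not reproduced here, so the authorisation file and the exact push commands are assumptions:
<pre>
#!/bin/sh
# Hypothetical firstbootwatcher loop (illustrative only, not the real script)
FIRSTBOOT=/usr/local/ypf/var/firstboot
while true; do
    for node in $(ls $FIRSTBOOT 2>/dev/null); do
        # Only push secrets to nodes previously authorised with allowsecret
        # (the allowedhosts file is an assumption for this sketch)
        if grep -q "^$node$" /usr/local/ypf/var/allowedhosts; then
            # Push ssh host keys, cfengine keys and grid certificates
            scp -rq /usr/local/ypf/secrets/$node/ $node:/etc/ && rm -f $FIRSTBOOT/$node
        fi
    done
    sleep 30
done
</pre>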

Note that if WNs are taken out of the batch system then they are probably marked as <tt>offline</tt> in torque, so use <tt>pbsnodes -c NODE</tt> to clear the offline status.
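For example, assuming the standard torque client tools on <tt>svr016</tt>:
<pre>
svr016# pbsnodes -l              # list nodes that are down or offline
svr016# pbsnodes -c node013      # clear the offline flag so jobs can run again
</pre>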

===Example===

Two worker nodes, <tt>node013</tt> and <tt>node088</tt>, have been repaired and need to be rebuilt and brought back into service:

<pre>
svr031# setboot --verbose --image=sl-4x-x86_64-eth0-ks --host=node013,node088
svr031# allowsecret --verbose --host=node013,node088
svr031# powernode --reboot --host=node013,node088
[Drink coffee ~10 min]
</pre>

If this works the nodes should now appear in ganglia and in torque. If <tt>pbs_mom</tt> was started on the nodes then YAIM ran correctly, so all is well. If the nodes were marked offline, then on <tt>svr016</tt> do <tt>pbsnodes -c node013 node088</tt>.

[[Category: ScotGrid]] [[Category: YPF]] [[Category: Glasgow]]