High Availability Torque


Motivation

When using Torque and Maui as a batch system, a single point of failure is introduced at the batch scheduler. When the pbs_server fails due to a crash or hardware failure:

  • No more jobs can be submitted with qsub.
  • Accounting records for jobs that complete during the outage are not recorded.

Status

This work is currently stuck; I have reached a dead end trying to get it to work at the moment.

--Steve traylen 12:57, 26 Oct 2005 (BST)

Plan

Our aim is to use Linux high availability software to provide a failover of the pbs_server onto another standby node. Our setup will have two physical nodes, gpp002.gridpp.rl.ac.uk and gpp003.gridpp.rl.ac.uk, that are capable of running the pbs_server. A third hostname and IP address, gpp001.gridpp.rl.ac.uk, will provide the pbs_server binding. When the service hostname gpp001.gridpp.rl.ac.uk is migrated between boxes it is also important that the pbs_server's state information is available and consistent on the new node. In order to achieve this consistency, DRBD block-level replication of the pbs_server's state data is put in place.

File:HighAvailabiltyTorque.png

As well as the pbs_server service, the maui service along with its state information will also be transferred.

In the context of LCG it is often the case that the GateKeeper and Torque are co-located on the same host. These ideas do not extend to that rather more complicated failover situation.

It is also our aim to make efficient use of the redundant standby node by allowing it to run batch jobs. In the case of failover any jobs running on the standby node will be terminated or suspended if possible.

Howto

The following are the details of how this has been achieved at RAL, though hopefully they should be applicable anywhere. It is assumed that the reader understands how to configure both Maui and Torque before starting.

It is our aim that gpp002 and gpp003 be set up as near identically as possible so as to ease deployment. While the service can migrate between hosts there is a preferred host, which in our case is gpp002. It is this preference for one host that creates the only difference between gpp002's and gpp003's configuration.

Disk Configuration

It is important that separate partitions exist for the directories that we wish to replicate between gpp002 and gpp003. For this our kickstart files effectively contain:

  part /var/spool/pbs  --size  2028
  part /var/spool/maui --size  2028


Conveniently, all of Torque's and Maui's state data is contained in just two directories. Once the system is installed, however, these entries must be commented out of /etc/fstab. These partitions are both going to be reformatted, and any mount requests for them will be handled by the HA software.

  # /etc/fstab file on gpp002 and gpp003
  #LABEL=/var/spool/maui   /var/spool/maui         ext3    defaults        1 2
  #LABEL=/var/spool/pbs    /var/spool/pbs          ext3    defaults        1 2

You should unmount both of these partitions before proceeding.

  # umount /var/spool/pbs /var/spool/maui

Install PBS_MOM

If you plan to use either gpp002 or gpp003 as batch workers when they are in standby mode and not hosting the pbs_server then you should install the Torque packages torque, torque-clients and torque-resmom.
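For example, assuming these packages are available from a configured yum repository, the installation on both hosts is simply:

 # yum install torque torque-clients torque-resmom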

It is worth noting at this point that files contained within the packages, such as /var/spool/pbs/spool and /var/spool/pbs/server_name, are in the / partition and not in the special, now unmounted, partitions we created during the kickstart.

Our plan is to overlay the root filesystem with our special replicated /var/spool/pbs when either of the physical hosts is hosting the pbs_server.

Configure PBS_MOM

Again, if you plan to use either gpp002 or gpp003 as batch workers, the pbs_moms should be configured in a similar way to the other pbs_moms within your cluster. In particular, /var/spool/pbs/mom_priv/config should be set up for the pbs_moms, and /var/spool/pbs/server_name for the PBS clients:

 # /var/spool/pbs/server_name configuration file for pbs-clients such
 # as qstat, qsub.
 # This should point to the pbs_server hostname that is going to migrate
 # between the two physical servers on failover.
 gpp001.gridpp.rl.ac.uk
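The /var/spool/pbs/mom_priv/config mentioned above is site-specific; as a minimal sketch (assuming the moms should contact the pbs_server through the virtual hostname), it might contain:

 # /var/spool/pbs/mom_priv/config  (minimal sketch, site-specific values will vary)
 # Point the mom at the virtual pbs_server hostname.
 $pbsserver gpp001.gridpp.rl.ac.uk
 # Log all mom events; adjust to taste.
 $logevent 255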

Install DRBD

It is a good idea to read about DRBD first. In particular, what follows is more or less a rewrite of what is in the DRBD INSTALL file.

DRBD requires a kernel module matching the running kernel to be installed on the system, so you must repeat this process for every kernel upgrade that you do.

  • Install and run the latest kernel and kernel-source or one of your choice.
 # yum install kernel-source kernel

reboot the machine to have the correct kernel running.

  • The kernel headers must be correct for the kernel that you are actually planning to build drbd against. I am very open to suggestions to improve the last stage where the Makefile is edited.
 # cd /usr/src/linux-2.4
 # make mrproper
 # cp configs/kernel-2.4.21-i686-smp.config .config  (or the relevant config)
 # make oldconfig && make oldconfig
 # vi Makefile
     remove the custom part in EXTRAVERSION and possibly add smp to EXTRAVERSION. 
     e.g. EXTRAVERSION = -37.ELsmp
 # make dep
 # wget http://oss.linbit.com/drbd/0.7/drbd-0.7.14.tar.gz
 # tar zxvf drbd-0.7.14.tar.gz
 # cd drbd-0.7.14
 # make rpm
  • Finally install the resulting drbd packages.
 # rpm -Uvh dist/RPMS/i386/drbd-0.7.14-1.i386.rpm
 # rpm -Uvh dist/RPMS/i386/drbd-km-2.4.21_37.EL-0.7.14-1.i386.rpm
 # depmod -a

It is possible to install multiple drbd-km packages at the same time to allow switching to different kernels easily.
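To check which drbd-km builds are currently installed, a query such as the following can be used (output shown for the package installed above):

 # rpm -qa 'drbd-km*'
 drbd-km-2.4.21_37.EL-0.7.14-1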

Configure DRBD

Earlier on we created two filesystems e2label'ed as /var/spool/pbs and /var/spool/maui. In our case these were hda5 and hda4 respectively, which can be queried with:

 # e2label /dev/hda4
 /var/spool/maui
 # e2label /dev/hda5 
 /var/spool/pbs

As specified earlier, both of these filesystems must be unmounted before proceeding. You should change the device names to whichever devices kickstart decided to put your partitions on.

Still working on both hosts, gpp002 and gpp003, create an /etc/drbd.conf file. Our aim is to create two DRBD resources called PBS and MAUI. This will create two block devices, drbd0 and drbd1, which will be our network-replicated devices. The configuration file defines the mapping to the physical disk devices hda5 and hda4. It is these network block devices that we will eventually mount as /var/spool/pbs and /var/spool/maui when one of the physical hosts is hosting the pbs_server. In this example the .2 and .3 IP addresses are gpp002 and gpp003 respectively.

   
global {
    # Number of hosts in our game.
    minor-count  2;
}

resource PBS {
   protocol  C;
   incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

   startup {
     degr-wfc-timeout 120;    # 2 minutes.
   }
   disk {
     on-io-error   detach;
   }
   net {

   }
   syncer {
     rate 10M;
     group 1;
     al-extents 257;
   }
   on gpp002.gridpp.rl.ac.uk {
      device /dev/drbd0;
      disk   /dev/hda5;
      address    130.246.187.2:7788;
      meta-disk  internal;
   }

   on gpp003.gridpp.rl.ac.uk {
      device /dev/drbd0;
      disk   /dev/hda5;
      address    130.246.187.3:7788;
      meta-disk  internal;
   }

}

resource MAUI {
   protocol C;
   incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

   startup {
     degr-wfc-timeout 120;    # 2 minutes.
   }
   disk {
     on-io-error   detach;
   }
   net {

   }
   syncer {
     rate 10M;
     group 2;
     al-extents 257;
   }
   on gpp002.gridpp.rl.ac.uk {
      device /dev/drbd1;
      disk   /dev/hda4;
      address     130.246.187.2:7789;
      meta-disk  internal;
   }
   on gpp003.gridpp.rl.ac.uk {
      device /dev/drbd1;
      disk   /dev/hda4;
      address    130.246.187.3:7789;
      meta-disk  internal;
   }

}

Many more options are available within this file; please consult the DRBD documentation.

Now start the drbd service on both gpp002 and gpp003, our two failover nodes.
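Assuming the drbd package installed the usual init script, this is done on both hosts with:

 # /sbin/service drbd start
 # /sbin/chkconfig drbd on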

We can query the state of the devices with

 # /sbin/drbdadm state PBS
 Secondary/Secondary

telling us that neither node is currently the primary. Up until now everything has been identical for both gpp002 and gpp003. We must now declare one to be the primary, which we will define as gpp002.

gpp002# /sbin/drbdadm -- --do-what-I-say primary MAUI
gpp002# /sbin/drbdadm -- --do-what-I-say primary PBS

Next we format the primary partition.

gpp002# /sbin/mkfs -t ext3 /dev/drbd0
gpp002# /sbin/mkfs -t ext3 /dev/drbd1

Finally we test a failover from gpp002 to gpp003.

gpp002# mkdir /tmp/mnt
gpp002# mount /dev/drbd0 /tmp/mnt
gpp002# touch /tmp/mnt/testFileCreatedOnGpp002
gpp002# umount /tmp/mnt
gpp002# /sbin/drbdadm secondary PBS
gpp003# mkdir /tmp/mnt
gpp003# /sbin/drbdadm primary PBS
gpp003# mount /dev/drbd0 /tmp/mnt
gpp003# ls /tmp/mnt
testFileCreatedOnGpp002
gpp003# touch /tmp/mnt/anotherTestFileCreatedonGpp003

And the failover from gpp003 to gpp002 can now be done in reverse.
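For completeness, a sketch of the reverse test; note that the device must be unmounted on a node before it is demoted to secondary:

 gpp003# umount /tmp/mnt
 gpp003# /sbin/drbdadm secondary PBS
 gpp002# /sbin/drbdadm primary PBS
 gpp002# mount /dev/drbd0 /tmp/mnt
 gpp002# ls /tmp/mnt
 anotherTestFileCreatedonGpp003  testFileCreatedOnGpp002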

Install PBS_SERVER and Maui

You now have to install the pbs_server and the Maui server on both gpp002 and gpp003. These services, which will migrate between gpp002 and gpp003, must be installed on both boxes with their state-data directories on the replicated partitions. In other words, assuming gpp002 is currently your DRBD primary for both partitions:

 gpp002# /sbin/drbdadm state PBS
  Primary/Secondary
 gpp002# /sbin/drbdadm state MAUI
  Primary/Secondary

You can proceed with the following.

 gpp002# mount /dev/drbd0 /var/spool/pbs
 gpp002# mount /dev/drbd1 /var/spool/maui
 gpp002# yum install pbs_server maui maui-clients maui-server
 gpp002# umount /var/spool/pbs /var/spool/maui
 gpp002# /sbin/drbdadm secondary MAUI
 gpp002# /sbin/drbdadm secondary PBS
 
 gpp003# /sbin/drbdadm primary MAUI
 gpp003# /sbin/drbdadm primary PBS
 gpp003# mount /dev/drbd0 /var/spool/pbs
 gpp003# mount /dev/drbd1 /var/spool/maui
 gpp003# yum install pbs_server maui maui-clients maui-server

You may get some errors about rpm creating new configuration files, since you have already installed the packages once on the other host. These can be ignored.

It is important that pbs_server and maui are started and stopped only by the heartbeat software and not by the system init startup.

 # /sbin/service maui stop
 # /sbin/chkconfig maui off
 # /sbin/service pbs_server stop
 # /sbin/chkconfig pbs_server off

You should never start pbs_server or maui with these standard methods.
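You can verify that init will not start them in any runlevel with, for example:

 # /sbin/chkconfig --list pbs_server
 pbs_server      0:off   1:off   2:off   3:off   4:off   5:off   6:off
 # /sbin/chkconfig --list maui
 maui            0:off   1:off   2:off   3:off   4:off   5:off   6:off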

Configure PBS_SERVER

With the PBS block device mounted on one of the nodes you can now configure the pbs_server, assuming you have a qmgr.conf containing the queue and server parameters you want. Set up two configuration files, torque.cfg and server_name.

 # /var/spool/pbs/torque.cfg
 # Set SERVERHOST to the virtual hostname that the pbs_server
 # service will run on.
 SERVERHOST gpp001.gridpp.rl.ac.uk


 # /var/spool/pbs/server_name
 # Set this again to the hostname we are going to run the pbs_server on.
 # This is needed so that qmgr, qstat, pbsnodes -a all work correctly.
 gpp001.gridpp.rl.ac.uk
 

Finally we configure the pbs_server:

 # /usr/sbin/pbs_server -t create
 # qmgr < qmgr.conf
 # /sbin/service pbs_server stop
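The qmgr.conf itself is site-specific; the following is an illustrative sketch only (the queue name and limits are invented for this example, not the RAL values):

 # qmgr.conf -- illustrative sketch; queue names and limits are site-specific.
 create queue long
 set queue long queue_type = Execution
 set queue long resources_max.walltime = 48:00:00
 set queue long enabled = True
 set queue long started = True
 set server default_queue = long
 set server scheduling = True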

Configure MAUI

The MAUI block device must be mounted on one host so that you can configure Maui. Maui is configured with the file /var/spool/maui/maui.cfg. The hostname of the pbs_server must be specified in this file. Unfortunately Maui does not like you using anything other than the physical hostname of the pbs_server, so the maui.cfg must at least contain:

 # /var/spool/maui/maui.cfg
 SERVERHOST              gpp002.gridpp.rl.ac.uk
 ADMINHOST               gpp002.gridpp.rl.ac.uk

The result of this is that we must change the configuration file on failover, which is easily done and will be explained later.
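One possible way to do this (a sketch only, not necessarily the script used at RAL) is to rewrite these two lines from whatever script heartbeat uses to start the Maui resource:

 # Sketch: point Maui at the physical host that is taking over the service.
 HOST=`hostname -f`
 sed -i "s/^SERVERHOST.*/SERVERHOST              $HOST/" /var/spool/maui/maui.cfg
 sed -i "s/^ADMINHOST.*/ADMINHOST               $HOST/"  /var/spool/maui/maui.cfg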

Install Heartbeat

Heartbeat is available as packages from their download site. Currently we are using version 2.0.2.

Unfortunately the packages there are built for SuSE Linux, so it is a good idea to rebuild them.

 # yum install libtool libtool-libs net-snmp-devel e2fsprogs-devel libxml2-devel
 # wget http://linux-ha.org/download/heartbeat-2.0.2.tar.gz
 # tar zxvf heartbeat-2.0.2.tar.gz
 # cd heartbeat-2.0.2
 # ./ConfigureMe package
 # rpm -Uvh heartbeat-2.0.2-1.i386.rpm \
      heartbeat-pils-2.0.2-1.i386.rpm heartbeat-stonith-2.0.2-1.i386.rpm


However, I had problems installing the packages built above, which I have reported. There are, however, prebuilt packages available for RHEL3 that appear to work okay.

Configure Heartbeat

You should definitely read the guides on the HA website before proceeding; there are many options and the following just represents what we are doing.
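As an illustration only, and not necessarily the RAL configuration, a heartbeat version 1 style haresources entry could group the virtual IP address, the DRBD devices, the filesystems and the two services (the IP address for gpp001 is assumed here):

 # /etc/ha.d/haresources -- illustrative sketch only.
 # gpp002 is the preferred node; IPaddr brings up the gpp001 service address
 # (assumed here to be 130.246.187.1), drbddisk promotes the DRBD resources,
 # Filesystem mounts them, and finally the services are started.
 gpp002.gridpp.rl.ac.uk IPaddr::130.246.187.1 \
     drbddisk::PBS Filesystem::/dev/drbd0::/var/spool/pbs::ext3 \
     drbddisk::MAUI Filesystem::/dev/drbd1::/var/spool/maui::ext3 \
     pbs_server maui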

Other Resources