RALPP CE Info System Timeout Solving

From GridPP Wiki

We've been having problems with our site dropping out of the BDII since we added 200 more job slots to the system.

We think it's related to the extra load on the CE caused by the increased number of jobs and have tried a number of things to fix it:

Move the site BDII to a different node

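# Install the BDII package on the new node
yum install bdii
# Run only the BDII configuration step from YAIM
/opt/glite/yaim/scripts/run_function /root/yaim-conf/site-info.def config_bdii
# Copy the existing BDII configuration over from the old host (heplnx201)
cd /opt/bdii/etc
scp heplnx201:/opt/bdii/etc/* .
service bdii restart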

We then opened port 2170 in the firewall and had the site-bdii.pp.rl.ac.uk alias moved to 130.246.47.202.
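
Once the alias has moved, it is worth confirming that the new node actually answers LDAP queries on port 2170. The following is a minimal sketch of such a check, not part of the original procedure: the function name count_bdii_entries is made up for illustration, the base DN o=grid is an assumption about what the site BDII publishes, and the commented iptables rule assumes the node's firewall is plain iptables.

```shell
#!/bin/sh
# Count the entries an LDAP information endpoint returns.
# Usage: count_bdii_entries HOST PORT BASE_DN
# (hypothetical helper, not part of the bdii package)
count_bdii_entries() {
    host=$1 port=$2 base=$3
    # -x: simple (anonymous) bind; -LLL: plain LDIF without comments.
    # Requesting only "dn" keeps the reply small.
    ldapsearch -x -LLL -H "ldap://${host}:${port}" -b "$base" dn |
        grep -c '^dn:'
}

# Assumed firewall rule, if the node uses iptables:
#   iptables -I INPUT -p tcp --dport 2170 -j ACCEPT
# Example check (base DN is an assumption):
#   count_bdii_entries site-bdii.pp.rl.ac.uk 2170 o=grid
```

A count of 0 (or a connection error from ldapsearch) suggests the port is still blocked or the BDII has not populated yet.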

Configure rgma-gin to use the Globus MDS for its info rather than running lcg-info-wrapper (which is expensive)

Edit /opt/glite/etc/rgma-gin/gin.conf to change the line:

  run /opt/lcg/libexec/lcg-info-wrapper

to

  run /usr/sbin/rgma-gin-ldap
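
The edit can also be scripted so that it is repeatable and easy to revert. This is only a sketch: switch_gin_provider is a hypothetical helper, and it assumes the provider appears in gin.conf on a line of the form "run <command>" exactly as quoted above.

```shell
#!/bin/sh
# Point rgma-gin at a different info provider command.
# Usage: switch_gin_provider CONF_FILE NEW_COMMAND
# (hypothetical helper; assumes a "run <command>" line as quoted above)
switch_gin_provider() {
    conf=$1 new_cmd=$2
    old_cmd=/opt/lcg/libexec/lcg-info-wrapper
    # Keep a copy of the original so the change is easy to revert.
    cp -p "$conf" "$conf.orig" || return 1
    # "|" is the sed delimiter because the paths contain "/".
    # Any whitespace before and after "run" is preserved.
    sed "s|^\([[:space:]]*run[[:space:]]\{1,\}\)${old_cmd}|\1${new_cmd}|" \
        "$conf.orig" > "$conf"
}

# Example:
#   switch_gin_provider /opt/glite/etc/rgma-gin/gin.conf /usr/sbin/rgma-gin-ldap
```

Copying gin.conf.orig back over gin.conf undoes the change.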

The replacement script, /usr/sbin/rgma-gin-ldap, contains:

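#!/bin/sh
# Query the CE's Globus MDS and unfold LDIF continuation lines.
# The sed loop also joins values folded across more than two lines.
ldapsearch -x -H ldap://heplnx201.pp.rl.ac.uk:2135 -b mds-vo-name=local,o=grid | sed -e :a -e '$!N;s/\n //;ta' -e 'P;D'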

In theory you could do this on the site BDII against the full site information and not run rgma-gin on any other node.
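
The sed filter in the script exists because ldapsearch folds long LDIF lines: a continuation line starts with a single space. A filter that only joins lines in pairs can miss a fold, so the sketch below uses the common sed loop idiom instead. The function name unfold_ldif and the sample entry are made up for illustration.

```shell
#!/bin/sh
# Join LDIF continuation lines (a newline followed by one space) back
# onto the line they continue.
unfold_ldif() {
    # :a       label for the loop
    # $!N      append the next input line unless this is the last one
    # s/\n //  remove one newline+space pair, i.e. undo one fold
    # ta       if a fold was removed, loop and look for another
    # P;D      otherwise print the first line and keep the remainder
    sed -e ':a' -e '$!N;s/\n //;ta' -e 'P;D'
}

# Sample input: a DN folded onto a second line (made-up entry).
printf 'dn: GlueCEUniqueID=example,mds-vo-name=local,\n o=grid\nobjectClass: GlueCE\n' |
    unfold_ldif
```

Run as written, this prints the DN on one line followed by the objectClass line.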

Install the caching PBS commands from NIKHEF

Download them from NIKHEF. Installation instructions are included in the tarball, but basically you copy pbsnodes, qstat and qsub to pbsnodes.org, qstat.org and qsub.org and replace the originals with the supplied scripts, which cache the information.
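
To illustrate the idea behind such a wrapper, here is a minimal time-based cache. It is not the NIKHEF script, only a sketch of the technique: cached_run and the cache path in the example are hypothetical, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Run a command through a simple time-based cache.
# Usage: cached_run CACHE_FILE MAX_AGE_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not the NIKHEF implementation)
cached_run() {
    cache=$1 max_age=$2
    shift 2
    now=$(date +%s)
    # stat -c %Y prints the file's mtime in epoch seconds (GNU coreutils).
    if [ -f "$cache" ] && mtime=$(stat -c %Y "$cache") &&
       [ $((now - mtime)) -lt "$max_age" ]; then
        cat "$cache"
        return 0
    fi
    # Cache missing or stale: run the real command into a temp file and
    # rename it over the cache, so readers never see a partial file.
    tmp="$cache.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$cache" && cat "$cache"
    else
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}

# Example (cache path is an assumption):
#   cached_run /var/tmp/qstat-f.cache 60 qstat.org -f
```

A real wrapper would also need one cache file per distinct argument list, since different qstat options produce different output.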

Increase the BDII and globus-mds timeouts and decrease their frequency

The frequency and timeouts for globus-mds are generated from /opt/globus/libexec/edg.info. I increased the time between runs to 5 minutes and the command timeout to 60 seconds.

The BDII query frequency and timeout are defined in /opt/bdii/etc/bdii.conf. I increased the timeout to 60 seconds and the interval between queries to 2 minutes.
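
Before settling on a timeout value, it can help to check whether the information command really completes inside it on a loaded CE. The sketch below assumes the coreutils timeout utility is installed; runs_within is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a command finishes within a time limit.
# Usage: runs_within SECONDS COMMAND [ARGS...]
# (hypothetical helper; relies on coreutils "timeout")
runs_within() {
    limit=$1
    shift
    # timeout exits with status 124 when it had to kill the command.
    timeout "$limit" "$@" > /dev/null 2>&1
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s"
        return 1
    fi
    echo "finished within ${limit}s (exit status $status)"
    return 0
}

# Example: does the info wrapper finish inside the 60 second limit?
#   runs_within 60 /opt/lcg/libexec/lcg-info-wrapper
```

Running the check a few times while the batch system is busy gives a better picture than a single run on an idle node.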

Next steps are:

  1. Upgrade torque to v2
  2. Move the CE function off the torque box

As a temporary measure we may reduce the number of worker nodes.

Chris brew 17:26, 26 Oct 2006 (BST)