RALPP CE Info System Timeout Solving
Chris brew
Latest revision as of 16:26, 26 October 2006
We've been having problems with our site falling out of the BDII since we added 200 more job slots to the system.
We think it's related to the extra load on the CE caused by the increased number of jobs and have tried a number of things to fix it:
Move the site BDII to a different node
yum install bdii
/opt/glite/yaim/scripts/run_function /root/yaim-conf/site-info.def config_bdii
cd /opt/bdii/etc
scp heplnx201:/opt/bdii/etc/* .
service bdii restart
Then we opened port 2170 in the firewall and had the site-bdii.pp.rl.ac.uk alias moved to 130.246.47.202.
Configure rgma-gin to use the Globus MDS for its info rather than running lcg-info-wrapper (which is expensive)
Edit /opt/glite/etc/rgma-gin/gin.conf to change the line:
run /opt/lcg/libexec/lcg-info-wrapper
to
run /usr/sbin/rgma-gin-ldap
where /usr/sbin/rgma-gin-ldap is:
#!/bin/sh
ldapsearch -x -H ldap://heplnx201.pp.rl.ac.uk:2135 -b mds-vo-name=local,o=grid | sed -e 'N;s/\n //'
In theory you could do this on the site BDII against the full site info and not run rgma-gin on any other nodes.
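The sed stage in the wrapper is there to rejoin LDIF lines that ldapsearch has wrapped (a continuation line starts with a single space). As far as we can tell, the 'N;s/\n //' form reads lines two at a time, so a value wrapped over three or more lines, or a continuation that lands at the start of a pair, can be left unjoined. A minimal sketch of a more thorough unfold is below; the function name unfold_ldif is just for illustration and is not part of rgma-gin or the original wrapper.

```shell
#!/bin/sh
# Hypothetical helper: rejoin wrapped LDIF lines read from stdin.
# A line beginning with a single space is glued onto the previous line.
unfold_ldif() {
    # :a            label to loop back to
    # $!N           append the next input line, unless already on the last one
    # s/\n //;ta    if the appended line is a continuation, join it and loop
    # P;D           print the finished first line, keep the rest for the next pass
    sed -e ':a' -e '$!N;s/\n //;ta' -e 'P;D'
}
```

It would slot into the wrapper in place of the existing sed call, for example:

ldapsearch -x -H ldap://heplnx201.pp.rl.ac.uk:2135 -b mds-vo-name=local,o=grid | unfold_ldif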
Install the caching PBS commands from NIKHEF
Download them from NIKHEF. Installation instructions are included in the tarball, but basically: copy pbsnodes, qstat and qsub to pbsnodes.org, qstat.org and qsub.org, then replace the originals with the supplied scripts, which cache the information.
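To illustrate the idea (this is not the NIKHEF code, just a rough sketch of how such a caching wrapper could work), the function below keeps the last output of a command in a file and replays it until it is older than a time-to-live. The names cached_run, CACHE_DIR and CACHE_TTL, and the default values, are all made up for this example.

```shell
#!/bin/sh
# Hypothetical caching wrapper; NOT the NIKHEF implementation.
CACHE_DIR="${CACHE_DIR:-/tmp/pbs-cache}"   # assumed location for cache files
CACHE_TTL="${CACHE_TTL:-60}"               # assumed lifetime in seconds

# cached_run NAME COMMAND [ARGS...]
# Replays the cached output for NAME if it is younger than CACHE_TTL,
# otherwise runs COMMAND, stores its output, and prints it.
cached_run() {
    name="$1"; shift
    cache="$CACHE_DIR/$name"
    mkdir -p "$CACHE_DIR"
    now=$(date +%s)
    if [ -f "$cache" ] && [ -f "$cache.stamp" ]; then
        stamp=$(cat "$cache.stamp")
        if [ $((now - ${stamp:-0})) -lt "$CACHE_TTL" ]; then
            cat "$cache"        # still fresh: no call to the real command
            return 0
        fi
    fi
    # Write to a temporary file first so readers never see partial output.
    if "$@" > "$cache.$$"; then
        mv "$cache.$$" "$cache"
        echo "$now" > "$cache.stamp"
        cat "$cache"
    else
        rm -f "$cache.$$"
        return 1
    fi
}
```

A replacement qstat script built this way would then contain little more than a call such as: cached_run qstat /usr/bin/qstat.org "$@" (the /usr/bin path is an assumption, and a real wrapper would also need to key the cache on the arguments).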
Increase the BDII and globus-mds timeouts and decrease their frequency
The frequency and timeouts for the globus-mds are generated from /opt/globus/libexec/edg.info. I increased the time between runs to 5 minutes and the timeout on the command to 60 seconds.
The BDII querying frequency and timeout are defined in /opt/bdii/etc/bdii.conf. I increased the timeout to 60 seconds and set the interval between queries to 2 minutes.
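When picking a timeout it helps to know how long the query really takes under load. The sketch below is one rough way to check this; the function name fits_timeout is invented for this example, and it only has one-second resolution.

```shell
#!/bin/sh
# Hypothetical helper: run a command, print the elapsed seconds,
# and succeed only if it finished within the given limit.
# fits_timeout LIMIT_SECONDS COMMAND [ARGS...]
fits_timeout() {
    limit="$1"; shift
    start=$(date +%s)
    "$@" > /dev/null 2>&1     # output is discarded; only the duration matters here
    end=$(date +%s)
    elapsed=$((end - start))
    echo "$elapsed"
    [ "$elapsed" -le "$limit" ]
}
```

For example, to see whether the MDS query used earlier fits inside the 60 second timeout:

fits_timeout 60 ldapsearch -x -H ldap://heplnx201.pp.rl.ac.uk:2135 -b mds-vo-name=local,o=grid

Running it a few times while the CE is busy gives a better picture than a single measurement.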
Next steps are:
- Upgrade torque to v2
- Move the CE function off the torque box
As a temporary measure we may reduce the number of worker nodes.
Chris brew 17:26, 26 Oct 2006 (BST)