Work to use Bristol HPC Cluster with TAR WN


Need-To-Know

Some configuration must be agreed before anything can be started. These values will be embedded in the site-info.def and users.conf files (an illustrative fragment follows the list):

the absolute path to LCG WN software area
the absolute path to VO experiment software install area
UID+GID of pool accounts (408 accounts initially: 4 x 100 for CMS, LHCb, DTeam, Ops, plus 2 sgm/prd accounts each).
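For example, a hedged illustration of the software-area setting: VO_CMS_SW_DIR is the standard YAIM variable for the per-VO experiment software area (corresponding VO_<VO>_SW_DIR lines are needed for the other VOs); the path shown is a placeholder, not the real Bristol value.

# site-info.def fragment (example value only)
VO_CMS_SW_DIR=/home/shared/vo-sw/cms   # absolute path to the CMS experiment software install area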

HPC Admin Creates LCG Accounts

This must be done initially by the HPC admins: create the LCG "service" accounts (edginfo & rgma, needed for Maui monitoring) and an lcg account to own the WN tarball software + config. The lcg account should also have access to the VO experiment software area & may have sudo access to the VO pool account home dirs, for job debugging.

Create LCG account that maintains the LCG WN software & config

groupadd lcg  # the GID can be anything
useradd -n -g lcg lcga  # the UID can be anything
passwd lcga             # set the lcga password to whatever is convenient

Assume for example that /home/shared is where the LCG software will be installed. This directory must be visible on every WN.

LCGDIR=/home/shared/lcg
mkdir $LCGDIR
chown lcga:lcg $LCGDIR
chmod g+ws  $LCGDIR

HPC Admin creates other LCG Service Accounts

The UID+GID of the pool accounts are embedded in an LCG config file (users.conf), so the pool account UID+GID must be created/known now, before users.conf can be written.

The following groups are needed, for diagnose -g on the torque server:

edginfo:x:102:
rgma:x:103:
infosys:x:7001:

Accounts needed on torque server only:

edginfo:x:102:103:EDG Info user:/home/edginfo:/bin/bash
rgma:x:103:104:RGMA user:/opt/glite/etc/rgma:/bin/bash

They can have no homedir & no shell. edginfo & rgma must also be in group infosys.


HPC Admin creates LCG Pool Accounts needed on torque server & all WN

If the UID+GID cannot be as shown, then the entire CE2 needs reconfiguring with the UID+GID that will be used. The CE2 pool account UID+GID must be identical to those defined on the HPC WNs and torque server.

Groups needed on the torque server & WNs (a groupadd sketch follows the list):

cms:x:3500:
dteam:x:4000:
lhcb:x:5500:
ops:x:7000:
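For example, a sketch using the GIDs listed above:

groupadd -g 3500 cms
groupadd -g 4000 dteam
groupadd -g 5500 lhcb
groupadd -g 7000 ops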

Pool accounts to create:

cms001..cms100 with UIDs 35001..35100 (and similarly for dteam, lhcb and ops).
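A minimal sketch for the cms accounts; useradd options such as home directory location and shell should follow local policy:

for i in $(seq -w 1 100); do
  useradd -u 35$i -g cms -m cms$i   # cms001..cms100 with UIDs 35001..35100
done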

The accounts may require a special .bashrc sourcing the LCG environment scripts (still being debugged). The path will have to be named in their .bashrc, so again that needs to be decided up front.

Also needed are experiment software management & production management accounts:

cmssgm:x:39000:
cmsprd:x:39001:

Similar for lhcb, dteam, ops.
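A minimal sketch for the CMS pair; this treats 39000/39001 as account UIDs with primary group cms, since the listing above does not make clear whether they are UIDs or dedicated GIDs, so confirm locally:

useradd -u 39000 -g cms -m cmssgm   # experiment software manager account
useradd -u 39001 -g cms -m cmsprd   # production manager account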

Ensure the lcga account is in all the VO groups:

usermod -G cms,dteam,lhcb,ops lcga

LCG Admin installs LCG software

The following is done as the lcga account.

Create working dirs inside $LCGDIR

LCGDIR=/home/shared/lcg # for example
cd $LCGDIR
mkdir java ui wn

It seems more useful to keep yaim-conf inside wn, since the UI install has its own (unmerged) yaim-conf and its own site-info.def.

It is still an open question whether the UI tarball install can be merged with the WN tarball install; there are concerns that an update to one could break the other.


It is useful to keep a history of yaim runs (logfiles) outside the WN software tree, since a new WN install starts a brand-new yaimlog.

So the directory structure inside $LCGDIR/wn is {src,yaim-conf,logs,3.1.X} with a symlink current -> 3.1.X.
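For example, the fixed subdirectories can be created now; the versioned directory and the current symlink are created during the install step below:

cd $LCGDIR/wn
mkdir src yaim-conf logs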

Install java by hand & define it in site-info.def

For now, copy j2sdk-1_4_2_15-linux-i586.bin from elsewhere (it may not be possible to wget it), then run it:

cd $LCGDIR/java
sh j2sdk-1_4_2_15-linux-i586.bin
ln -s j2sdk1.4.2_15 latest
ln -s j2sdk1.4.2_15 default

That way, in site-info.def,

 JAVA_LOCATION="${CONFIG_ROOT}/java/default"

is not tied to a version.

Set up yaim-conf files in wn/yaim-conf

site-info.def <- same for CE & WN
wn-list.conf  <-
groups.conf   <- should need no changing
users.conf    <- this has the UID & GID for each account embedded

How critical is users.conf for WN? Is it only for CE/SE? Ask.
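For reference, a users.conf line normally has the form UID:LOGIN:GID:GROUP:VO:FLAG: (check against the template shipped with YAIM). An illustrative pool-account entry using the IDs above would be:

35001:cms001:3500:cms:cms::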


WN Software Install + Configure

cd $LCGDIR/wn/src
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-WN/slc4/x86/glite-WN-3.1.0-4.tar.gz
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-WN/slc4/x86/glite-WN-3.1.0-4-external.tar.gz
cd ..
# the current install is 3.1.0-4
mkdir 3.1.0-4
ln -s 3.1.0-4 current
cd current
tar zxf ../src/glite-WN-3.1.0-4.tar.gz
tar zxf ../src/glite-WN-3.1.0-4-external.tar.gz

Last chance: check over site-info.def for correct paths, etc.

To configure:

cd $LCGDIR/wn
script logs/070905.initial.yaim
./current/glite/yaim/bin/yaim -c -s yaim-conf/site-info.def -n WN_TAR
exit # finish script

yaim writes log output to $LCGDIR/wn/current/glite/yaim/log/yaimlog

Additional Install Tasks : Certs & X509 script

To download and install the certificates into $GLITE_EXTERNAL_ROOT/etc/grid-security/certificates, ensure $X509_CERT_DIR is *not* defined in the environment nor in site-info.def, then do

cd $LCGDIR/wn/current 
./glite/yaim/bin/yaim -r -s ../yaim-conf/site-info.def -n WN_TAR -f config_certs_userland

To set up a crontab to update the certificate revocation lists,

./glite/yaim/bin/yaim -r -s ../yaim-conf/site-info.def -n WN_TAR -f config_crl

However, that won't work on GPFS due to a conflict with SELinux. A fix is in the works.

The LCG WN-TAR documentation is also missing this: a script setting pointers to the certificate directories must be created.

Create script $LCGDIR/wn/current/external/etc/profile.d/x509.sh containing

. /home/shared/lcg/wn/current/external/etc/profile.d/grid-env-funcs.sh
gridenv_set "X509_CERT_DIR" "/home/shared/lcg/wn/current/external/etc/grid-security/certificates"
gridenv_set "X509_VOMS_DIR" "/home/shared/lcg/wn/current/external/etc/grid-security/vomsdir"

WN modifications

To properly source the gLite environment, every WN must have a script in /etc/profile.d (we are currently trying to find a workaround for this requirement). Scripts in /etc/profile.d are executed by every job on the WN, so to ensure that gLite environment variables do not affect non-PP jobs, the script only sets up the gLite environment if the job is detected as having been submitted by the PP CE2 (currently called testce02).

#!/bin/sh
if [  "$PBS_O_HOST" = "testce02.phy.bris.ac.uk" ]; then
  STAGE=/home/shared/lcg/wn/current
  if [ -f $STAGE/external/etc/profile.d/grid-env.sh ] ; then
    . $STAGE/external/etc/profile.d/x509.sh
    . $STAGE/external/etc/profile.d/lcgenv.sh
    . $STAGE/external/etc/profile.d/grid-env.sh
  else
    echo "Could not find grid environment scripts"`hostname -f`
  fi
  unset STAGE
fi