Work to use Bristol HPC Cluster with TAR WN
Latest revision as of 12:39, 8 September 2007
Need-To-Know
There is some configuration that must be known before anything can be started. These values will be embedded in the site-info.def and users.conf files:
the absolute path to the LCG WN software area
the absolute path to the VO experiment software install area
the UID+GID of the pool accounts (408 accounts initially: 4 x 100 for CMS, LHCb, DTeam & Ops, plus 2 sgm/prd accounts each)
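For orientation, these values end up in site-info.def entries along the following lines (the paths here are illustrative placeholders, not the real Bristol values; variable names should be checked against the YAIM documentation):

```
# Illustrative site-info.def fragment -- all paths are placeholders
CONFIG_ROOT=/home/shared/lcg/wn/current               # LCG WN software area
VO_CMS_SW_DIR=/home/shared/exp_soft/cms               # VO experiment sw area (one per VO)
USERS_CONF=/home/shared/lcg/wn/yaim-conf/users.conf   # embeds pool account UID+GID
GROUPS_CONF=/home/shared/lcg/wn/yaim-conf/groups.conf
```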
HPC Admin Creates LCG Accounts
This must be done initially by the HPC Admins, to create the lcg "service" accounts: edginfo & rgma for maui monitoring, and an lcg account to own the WN tarball software+config. The lcg account should also have access to the VO experiment software area & may have sudo access to the VO pool account home dirs, for job debugging.
Create LCG Account = Maintains LCG WN software & config
groupadd lcg            # the GID can be anything
useradd -n -g lcg lcga  # the UID can be anything
Then set the lcga password to whatever.
Assume for example that /home/shared is where the LCG software will be installed. This directory must be seen on the WN.
LCGDIR=/home/shared/lcg
mkdir $LCGDIR
chown lcga:lcg $LCGDIR
chmod g+ws $LCGDIR
HPC Admin creates other LCG Service Accounts
The UID+GID of pool accounts are embedded in an LCG config file (users.conf). So the pool account UID+GID must be made/known now before the users.conf can be created.
The following groups are needed, for diagnose -g on the torque server:
edginfo:x:102:
rgma:x:103:
infosys:x:7001:
Accounts needed on torque server only:
edginfo:x:102:103:EDG Info user:/home/edginfo:/bin/bash
rgma:x:103:104:RGMA user:/opt/glite/etc/rgma:/bin/bash
They can have no homedir & no shell. edginfo & rgma must also be in the infosys group.
HPC Admin creates LCG Pool Accounts needed on torque server & all WN
If the UID+GID cannot be as shown, then the entire CE2 needs reconfiguring with the UID+GID that will be used. The CE2 pool account UIDs+GIDs must be identical to those defined on the HPC WNs and the torque server.
Groups needed on torque server & wn:
cms:x:3500:
dteam:x:4000:
lhcb:x:5500:
ops:x:7000:
Pool accounts to create:
cms001..cms100 with UIDs 35001..35100
The accounts may require a special .bashrc sourcing the LCG environment scripts (still being debugged). The path will have to be named in their .bashrc, so again that needs to be decided up front.
Also create experiment software management & production management accounts:
cmssgm:x:39000:
cmsprd:x:39001:
Similarly for lhcb, dteam & ops.
Ensure lcga account is in all the groups:
usermod -G cms,dteam,lhcb,ops lcga
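The 100-per-VO pool accounts can be generated with a small loop. The sketch below is an assumption about suitable useradd flags, not a tested Bristol recipe; it only prints the commands so they can be reviewed before being piped into a root shell:

```shell
# Dry-run sketch: emit the useradd commands for cms001..cms100
# (UIDs 35001..35100, primary group cms, GID 3500 assumed to exist).
# Review the output, then pipe it into a root shell; repeat for
# dteam, lhcb and ops with their own UID bases and groups.
for i in $(seq -w 1 100); do
    echo "useradd -u 35$i -g cms -m cms$i"
done
```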
LCG Admin installs LCG software
The following is done by the lcga account.
Create working dirs inside $LCGDIR
LCGDIR=/home/shared/lcg   # for example
cd $LCGDIR
mkdir java ui wn
It seems more useful to have yaim-conf inside wn, since the UI install has its own yaim-conf (not merged) and its own site-info.def.
It is still an open question whether the UI tarball install can be merged with the WN tarball install; there are concerns about an update to one of them causing problems with the other.
It will be useful to keep a history of yaim runs (logfiles) that is NOT kept inside the wn software, since a new WN install starts a brand-new yaimlog. So the directory structure inside $LCGDIR is wn/{src,yaim-conf,logs,3.1.X}, with current -> 3.1.X.
Install java by hand & define it in site-info.def
For now, copy j2sdk-1_4_2_15-linux-i586.bin from elsewhere (not sure it can be wget'd) and run it:
cd $LCGDIR/java
sh j2sdk-1_4_2_15-linux-i586.bin
ln -s j2sdk1.4.2_15 latest
ln -s j2sdk1.4.2_15 default
That way, in site-info.def,

JAVA_LOCATION="${CONFIG_ROOT}/java/default"

is not tied to a specific version.
Setup yaim-conf files in wn/yaim-conf
site-info.def  <- same for CE & WN
wn-list.conf   <-
groups.conf    <- should need no changing
users.conf     <- this has the UID & GID for each account embedded
How critical is users.conf for WN? Is it only for CE/SE? Ask.
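For reference, each users.conf line is expected to take the colon-separated form UID:LOGIN:GID:GROUP:VO:FLAG: (the exact format should be checked against the YAIM documentation). With the UIDs/GIDs chosen above, the CMS entries would look roughly like:

```
35001:cms001:3500:cms:cms::
35002:cms002:3500:cms:cms::
...
39000:cmssgm:39000,3500:cmssgm,cms:cms:sgm:
39001:cmsprd:39001,3500:cmsprd,cms:cms:prd:
```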
WN Software Install + Configure
cd $LCGDIR/src
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-WN/slc4/x86/glite-WN-3.1.0-4.tar.gz
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/glite-WN/slc4/x86/glite-WN-3.1.0-4-external.tar.gz
cd ..              # the current install is 3.1.0-4
mkdir 3.1.0-4
ln -s 3.1.0-4 current
cd current
tar zxf ../src/glite-WN-3.1.0-4.tar.gz
tar zxf ../src/glite-WN-3.1.0-4-external.tar.gz
Last chance: check over site-info.def for correct paths etc.
To configure:
cd $LCGDIR/wn
script logs/070905.initial.yaim
./current/glite/yaim/bin/yaim -c -s yaim-conf/site-info.def -n WN_TAR
exit   # finish script
yaim writes log output to $LCGDIR/wn/current/glite/yaim/log/yaimlog
Additional Install Tasks : Certs & X509 script
To download and install the certificates into $GLITE_EXTERNAL_ROOT/etc/grid-security/certificates, first ensure $X509_CERT_DIR is *not* defined in the environment or in site-info.def, then do
cd $LCGDIR/wn/current
./glite/yaim/bin/yaim -r -s ../yaim-conf/site-info.def -n WN_TAR -f config_certs_userland
To set up a crontab to update the certificate revocation lists,
./glite/yaim/bin/yaim -r -s ../yaim-conf/site-info.def -n WN_TAR -f config_crl
However, that won't work on gpfs due to a conflict with selinux. A fix is in the works.
The LCG WN-TAR documentation is also missing this: a script setting pointers to the certificate directories must be created.
Create script $LCGDIR/wn/current/external/etc/profile.d/x509.sh containing
. /home/shared/lcg/wn/current/external/etc/profile.d/grid-env-funcs.sh
gridenv_set "X509_CERT_DIR" "/home/shared/lcg/wn/current/external/etc/grid-security/certificates"
gridenv_set "X509_VOMS_DIR" "/home/shared/lcg/wn/current/external/etc/grid-security/vomsdir"
WN modifications
To properly source the glite environment, every WN must have a script in /etc/profile.d (currently we are trying to find a workaround for this requirement). Scripts in /etc/profile.d are executed by every job on the WN, so to ensure that glite environment variables do not affect non-PP jobs, the script only sets up the glite environment if the job is detected as being submitted by the PP CE2 (currently called testce02).
#!/bin/sh
if [ "$PBS_O_HOST" = "testce02.phy.bris.ac.uk" ]; then
    STAGE=/home/shared/lcg/wn/current
    if [ -f $STAGE/external/etc/profile.d/grid-env.sh ]; then
        . $STAGE/external/etc/profile.d/x509.sh
        . $STAGE/external/etc/profile.d/lcgenv.sh
        . $STAGE/external/etc/profile.d/grid-env.sh
    else
        echo "Could not find grid environment scripts on `hostname -f`"
    fi
    unset STAGE
fi