UCL-HEP
Local resources
The UCL HEP batch farm currently consists of 17 Dell PowerEdge 2650 servers, each with two 2.4 GHz CPUs with hyperthreading enabled and 1 to 4 GB of RAM. Job scheduling is handled by a Torque/PBS server running on a separate machine, and all the resources are shared with the local HEP users. These machines are additionally an integral part of the UCL HEP computing cluster, running a standard version of SLC 3.0.5. Network connectivity is Gbit Ethernet to the core UCL routers, which in turn connect via Gbit to the London MAN.
Monitoring and Accounting
SAM monitoring for UKI-LT2-UCL-HEP
Site LOG
LCG 2.7.0 upgrade
07-08/03/2006
MON (pc91), CE (pc90), SE_dpm_mysql (pc55), SE_dpm_disk (pc30) upgraded: apt + yaim configure
LFC (pc91) installed: apt+ yaim install + yaim configure
APEL setup properly to publish site accounting via RGMA
TAR_WN TAR_UI (pc97) upgraded: tarball + yaim configure
Minor bugs to iron out: (in progress)
SC4 Transfer Test
21-22/03/2006
Gianfranco needed some modifications to the filetransfer script to deal with a relocatable UI install using different paths, so Graeme started the transfer RAL->UCL.
Warmup tests had indicated a poor write rate, constrained by the NFS mounted DPM filesystems. In addition, the test could not be run to completion because of the SC4 Aggregate Throughput test starting on Wednesday morning. However, running overnight achieved:
Transfer Bandwidth Report:
421/1000 transferred in 46968.2349169 seconds
421000000000.0 bytes transferred. Bandwidth: 71.708038549Mb/s
Obviously this backs up the observation that NFS-mounted filesystems perform very poorly. Additionally, both the DPM head node and the pool node are networked through a 100 Mb switch. This should change to Gb in the near-ish future.
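The NFS write constraint can be quantified with a quick dd test (a sketch: run it once against a local disk path and once against the NFS-mounted DPM filesystem, and compare the rates dd reports; the /tmp target here is just a placeholder):

```shell
# Write ~100 MB of zeros and let dd report the effective rate.
# Point TARGET first at a local disk, then at the NFS mount, and
# compare. conv=fdatasync forces the data to disk so the figure
# isn't just page-cache speed.
TARGET=/tmp/write-test.$$
dd if=/dev/zero of="$TARGET" bs=1M count=100 conv=fdatasync 2>&1 | tail -n 1
rm -f "$TARGET"
```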
24-25/03/2006
The new version (0.3.4) of Graeme's filetransfer script, adapted to the relocatable UI installation we have at UCL-HEP, worked. It installs itself in /opt/lcg/bin with the library in /opt/lcg/lib/python. Copied it to the shared area, so that it is available from any PC in our cluster. A few transfer tests were then carried out:
gs> filetransfer.py --number=1 --ignore-status-error \
  srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G \
  srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/atlas/canned1G
gs> filetransfer.py --number=1 --delete --ignore-status-error \
  srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/atlas/canned1G/tfr000-file00000 \
  srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/atlas
While the RAL->UCL transfers would succeed, the UCL->RAL invariably failed:
gs> /usr/local/lcg/glite/bin/glite-transfer-status -l d139282b-bb51-11da-bd42-dee58a58037f
Active
  Source:      srm://pc55.hep.ucl.ac.uk:8443/srm/managerv1?SFN=/dpm/hep.ucl.ac.uk/home/atlas/canned1G_1/tfr000-file00000
  Destination: srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/atlas/tfr000-file00000
  State:       Waiting
  Retries:     1
  Reason:      Transfer failed. ERROR the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: Best pool <atlas_60_1> too high : 2.0E8
  Duration:    0
Graeme advised this meant that the RAL pools for Atlas were full. Since Gianfranco could only authenticate for Atlas, Graeme once again volunteered to initiate the transfer, which he did just before midnight.
Things were going quite nicely until 2006-03-25T01-21-09, then 11 transfers went into the waiting state "TRANSFER - Transfer timed out." There were a few more transient failures, then things started to go very pear shaped at 2006-03-25T12-01-32: "Transfer failed due to possible network problem - timed out."
After that nothing succeeded. It turns out that the network interface on the storage pool node pc30 went into an error state and the machine locked up at Mar 25 13:31:34. The console showed the following error:
eth0: too much work in interrupt, status e401
in /var/log/messages:
Mar 25 12:00:57 pc30 gridftpd[16253]: RETR /pc30.hep.ucl.ac.uk:/storage/lcgdteam/2006-03-24/canned1G.4452.0
Mar 25 12:04:33 pc30 gridftpd[16177]: Data connection. data_write() failed: Handle not in the proper state
Mar 25 12:04:33 pc30 gridftpd[16177]: lost connection to fts0344.gridpp.rl.ac.uk [130.246.179.20]
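Interface error counters of the kind that preceded the lock-up can be watched on the pool node via /proc/net/dev (a sketch, assuming the standard Linux column layout):

```shell
# After the two header lines of /proc/net/dev, field 4 is rx-errs and
# field 12 is tx-errs for each interface. A non-zero, growing count
# here is an early warning of the NIC trouble seen on pc30.
awk 'NR > 2 { gsub(":", "", $1); print $1, "rx_errs=" $4, "tx_errs=" $12 }' /proc/net/dev
```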
Here's the summary of the successful UCL->RAL transfers:
Transfer Bandwidth Report:
421/1000 transferred in 53551.5980821 seconds
421000000000.0 bytes transferred. Bandwidth: 62.8926142379Mb/s
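As a sanity check, the reported bandwidth is just bytes transferred over elapsed time, converted to megabits per second; for the report above:

```python
# Reproduce the bandwidth figure in the report: bytes transferred,
# times 8 bits per byte, over elapsed seconds, in Mb/s (1 Mb = 1e6 bits).
bytes_transferred = 421000000000.0
seconds = 53551.5980821
mbps = bytes_transferred * 8 / seconds / 1e6
print(round(mbps, 4))  # -> 62.8926, matching the report
```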
Success/Failure rates and bandwidth summary for the transfers:
File:Fts-RAL-UCL.gif File:Fts-UCL-RAL.gif
File:Fts-UCL-bandwidth.gif
SC4 Transfer Test (Autumn 06)
14-15/09/2006
Preparation
Downloaded and installed Graeme's latest version of the file transfer script:
wget http://www.physics.gla.ac.uk/~graeme/scripts/packages/filetransfer
rpm -Uvh filetransfer-0.5.2-1.noarch.rpm
rpm -ql filetransfer-0.5.2-1
/opt/lcg/bin/filetransfer.py
/opt/lcg/lib/python/fileTransferLib.py
Attempts to make the install available on the shared directory of our UI (using soft links) failed.
Make sure everything is set correctly and try a test transfer:
unset GLOBUS_TCP_PORT_RANGE
grid-proxy-init -valid 30:00
Your identity: /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca
Enter GRID pass phrase for this identity:
Creating proxy .................................... Done
Your proxy is valid until: Sat Sep 9 04:56:09 2006
myproxy-init -d
Your identity: /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca
Enter GRID pass phrase for this identity:
Creating proxy ............................................... Done
Proxy Verify OK
Your proxy is valid until: Fri Sep 15 16:55:16 2006
Enter MyProxy pass phrase:
Verifying password - Enter MyProxy pass phrase:
A proxy valid for 168 hours (7.0 days) for user /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca now exists on lcgrbp01.gridpp.rl.ac.uk.
glite-transfer-channel-list | grep UCLHEP
RALLCG2-UKILT2UCLHEP
STAR-UKILT2UCLHEP
UKILT2UCLHEP-RALLCG2
Tried to optimize the transfer settings, but didn't have management rights on the channels:
glite-transfer-channel-set -f 8 RALLCG2-UKILT2UCLHEP
set: setNumberOfFiles: You are not authorised for channel management upon this service
glite-transfer-channel-set -T 1 RALLCG2-UKILT2UCLHEP
set: setNumberOfStreams: You are not authorised for channel management upon this service
glite-transfer-channel-set -f 8 UKILT2UCLHEP-RALLCG2
set: setNumberOfFiles: You are not authorised for channel management upon this service
glite-transfer-channel-set -T 1 UKILT2UCLHEP-RALLCG2
set: setNumberOfStreams: You are not authorised for channel management upon this service
glite-transfer-channel-list RALLCG2-UKILT2UCLHEP | grep files
Number of files: 1, streams: 1
Jaimie pointed out that the relevant channel is STAR-UKILT2UCLHEP and set the number of files to 8:
glite-transfer-channel-set -f 8 -T 1 STAR-UKILT2UCLHEP
glite-transfer-channel-list STAR-UKILT2UCLHEP
Channel: STAR-UKILT2UCLHEP
Between: * and UKI-LT2-UCL-HEP
State: Active
Contact: lcg-support@gridpp.rl.ac.uk
Bandwidth: 0
Nominal throughput: 0
Number of files: 8, streams: 1
Number of VO shares: 5
VO 'dteam' share is: 20
VO 'alice' share is: 20
VO 'atlas' share is: 20
VO 'cms' share is: 20
VO 'lhcb' share is: 20
Made sure DPNS_HOST and DPM_HOST are set to the local SE (pc55.hep.ucl.ac.uk):
echo $DPNS_HOST
pc55.hep.ucl.ac.uk
echo $DPM_HOST
pc55.hep.ucl.ac.uk
Tried:
filetransfer.py --number=1 --delete \
  srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000 \
  srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/dteam/FileTransferTest/`date +%Y%m%d_%H%M%S`
queue: 30
outputfile: transfer-2006-09-11T14-19-53.log
verbose: 1
sleepTime: 60
maxTmpError: -1
cancel-time: 1800
number: 10
sourceSize: 0
delete: 1
logfile: filetransfer.log
srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000 0
MyProxy Password:
/usr/local/glite/d-cache/srm/bin/srm-get-metadata -retry_num=0 -debug=false srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000
Srm metadata query of srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000 exited with status 256
user credentials are: /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca
Failed to get source SURL size. Is it a valid file?
However, a few days later this worked. NOTE: when succeeding, it prints your passphrase to stdout. BEWARE!!!
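Until the script is fixed, anything capturing its output (shell history, screen logs, log files) should be treated as sensitive, or the secret scrubbed before logging. A minimal sketch of such scrubbing (a hypothetical helper, not part of filetransfer.py):

```python
def scrub(text, secret):
    """Replace any occurrence of a secret in captured output with a mask."""
    if not secret:
        return text
    return text.replace(secret, "********")

# Example: mask a passphrase that leaked into a captured log line.
line = "MyProxy Password: correct-horse entered"
print(scrub(line, "correct-horse"))  # -> MyProxy Password: ******** entered
```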
RAL->UKI-LT2-UCL-HEP 22-hour transfer
Waited for UCL-CENTRAL to complete their tests, then fired off the transfer with the duration set to 22 hours (started at 17:54, 14 Sept 2006), as UCL was to go down the next day for a scheduled power outage. Unfortunately, I used the "-duration" flag, as opposed to "--duration". Although the script didn't complain, it ignored the option and stopped the transfer at 23:25, 14 Sept 2006, after transferring 100 files.
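The pitfall here is that the script's option parser silently drops flags it doesn't recognise; a strict parser would have caught the typo immediately. A small sketch of the difference, using Python's argparse (not the script's actual parser):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--duration')

# Strict parsing: the misspelled single-dash flag aborts with an error.
try:
    parser.parse_args(['-duration', '22:00'])
    strict_rejected = False
except SystemExit:
    strict_rejected = True

# Permissive parsing (effectively what happened): the flag lands in
# 'unknown' and duration silently keeps its default of None.
args, unknown = parser.parse_known_args(['-duration', '22:00'])
print(strict_rejected, args.duration, unknown)
```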
Re-started the transfer again at 00:45, 15 Sept 2006, setting the duration to 15 hours. It completed at 15:45, 15 Sept 2006.
This is the correct submission command:
filetransfer.py --duration=22:00 --delete --uniform-source-size \
  srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile[000:099] \
  srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/dteam/FileTransferTest/`date +%Y%m%d_%H%M%S`
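The [000:099] range in the source SURL expands into 100 zero-padded file names. A hypothetical re-implementation of that expansion, just to illustrate the syntax (the real script's logic may differ):

```python
import re

def expand_range(pattern):
    """Expand a pattern like 'file[000:099]' into zero-padded names.

    Hypothetical re-implementation of the [NNN:MMM] syntax the
    filetransfer script accepts; shown for illustration only.
    """
    m = re.search(r'\[(\d+):(\d+)\]', pattern)
    if not m:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # preserve the zero padding of the lower bound
    return [pattern[:m.start()] + str(i).zfill(width) + pattern[m.end():]
            for i in range(int(lo), int(hi) + 1)]

names = expand_range('1GBcannedfile[000:099]')
print(len(names), names[0], names[-1])  # -> 100 1GBcannedfile000 1GBcannedfile099
```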
Here are the summaries of the two transfers:
Transfer Bandwidth Report Summary
=================================
transfer 0 (8f31ccee-4411-11db-bc53-e98919b6d2fb) 100/100 (100000000000.0) transferred.
  Started at 17:54:38, Done at 23:25:58, Duration = 5:31:19, Bandwidth = 40.2425322224Mb/s
Date of Submission was 14/9/2006
Total number of FTS submissions = 1
100/100 transferred in 19945.9034829 seconds
100000000000.0 bytes transferred. Average Bandwidth: 40.1084864712Mb/s

Transfer Bandwidth Report Summary
=================================
transfer 0 (a71b94eb-444b-11db-bc53-e98919b6d2fb) 16/100 (16000000000.0) transferred.
  Started at 0:50:29, Canceled at 1:50:26, Duration = 0:59:57, Bandwidth = 35.582794095Mb/s
transfer 1 (640396ad-4454-11db-bc53-e98919b6d2fb) 26/100 (26000000000.0) transferred.
  Started at 1:53:3, Canceled at 3:23:56, Duration = 1:30:53, Bandwidth = 38.141177635Mb/s
transfer 2 (75261cb2-4461-11db-bc53-e98919b6d2fb) 79/100 (79000000000.0) transferred.
  Started at 3:26:34, Canceled at 9:0:13, Duration = 5:33:39, Bandwidth = 31.5693706657Mb/s
transfer 3 (cc57a425-448b-11db-bc53-e98919b6d2fb) 52/100 (52000000000.0) transferred.
  Started at 8:30:44, Canceled at 12:3:59, Duration = 3:33:15, Bandwidth = 32.512081779Mb/s
transfer 4 (1aff0071-44aa-11db-bc53-e98919b6d2fb) 49/100 (49000000000.0) transferred.
  Started at 12:6:36, Canceled at 15:9:14, Duration = 3:2:37, Bandwidth = 35.7745810358Mb/s
transfer 5 (6a3155ec-44bc-11db-bc53-e98919b6d2fb) 9/100 (9000000000.0) transferred.
  Started at 15:10:51, Active at 15:49:50, Duration = 0:38:58, Bandwidth = 30.7883202266Mb/s
Date of Submission was 15/9/2006
Total number of FTS submissions = 6
231/600 transferred in 53960.877938 seconds
231000000000.0 bytes transferred. Average Bandwidth: 34.2470336032Mb/s
Note the strange "Canceled" messages. Not sure what causes them, as I didn't interact with the transfer at all, after firing it off.
Summary Graphs:
File:Fts-graph.pl10.gif File:Fts-graph.pl11.gif File:Fts-graph.pl4.gif
UKI-LT2-UCL-HEP->RAL 24-hour transfer
Upgrade Log
Glite 3.0
05-07/07/2006
Middleware Upgrade
Upgraded front end nodes (except CE) to the gLite flavour of the services they run.
On all nodes {MON (pc91), CE (pc90), SE_dpm_mysql (pc55), SE_dpm_disk (pc30)}, changed /etc/apt/sources.list.d/lcg.list to:
rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates
#rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security
Then:
apt-get update
Updated to the gLite rpms where necessary:
-pc91: Dump a snapshot of the DB first! -> NOT DONE - regularly backed up by a cron job
rpm -ve lcg-MON-2.7.0-sl3        OK
rpm -ve lcg-LFC_mysql-2.7.0-sl3  OK
apt-get install glite-MON        OK
apt-get install glite-LFC_mysql  OK
-pc90: no changes, will keep lcg-CE
apt-get upgrade lcg-CE
WARNING:
Does NOT get glite-yaim -> must get by hand (see below)
-pc55: Dump a snapshot of the DB first! -> NOT DONE - regularly backed up by a cron job
rpm -ve lcg-SE_dpm_mysql-2.7.0-sl3  OK
apt-get install glite-SE_dpm_mysql  OK
-pc30:
rpm -ve ???
(there's no lcg-SE-dpm_disk...) -> NOTHING TO DO
Then:
apt-get dist-upgrade
Make sure yaim is the latest version (glite-yaim-3.0.0-16). If glite-yaim is missing, get it and install it:
wget http://www.cern.ch/grid-deployment/gis/yaim/glite-yaim-x.x.x-x.noarch.rpm
rpm -ivh glite-yaim-x.x.x-x.noarch.rpm
Not necessary on pc91, pc30, pc55. DONE on pc90 (as it didn't get it). Had to remove lcg-yaim before being able to dist-upgrade:
rpm -ev lcg-yaim-2.7.0-5
Tarball and userdeps downloaded for UI/WN:
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/GLITE-3_0_0-sl3.tar.gz
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/GLITE-3_0_0-userdeps-sl3.tar.gz
tar xzvf GLITE-3_0_0-sl3.tar.gz
cd edg
mv ../GLITE-3_0_0-userdeps-sl3.tar.gz .
tar xzvf GLITE-3_0_0-userdeps-sl3.tar.gz
Re-configure all nodes
site-info.def now has a few new variables, and a few others have been ditched. A new version was prepared based on the prototype that came with the rpms. Note that the path to the file has changed to /opt/glite/yaim/ucl-hep/site-info.def.
The paths to users.conf and groups.conf have changed accordingly.
/opt/glite/yaim/scripts/configure_node /opt/glite/yaim/ucl-hep/site-info.def [NODE_TYPE]
This went well for all nodes, except for TAR_WN TAR_UI, which failed with the error:
/usr/local/glite/glite/yaim/scripts/configure_node /usr/local/glite/glite/yaim/ucl-hep/site-info-pc97.def TAR_WN TAR_UI
...
Configuring config_glite
Traceback (most recent call last):
  File "/usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py", line 418, in ?
    updateContainerParameter( param, environ[param] )
  File "/usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py", line 190, in updateContainerParameter
    stripQuotes( value.split( ' ' )[3] ), 'voms.cert.subj' )
IndexError: list index out of range
[ERROR] The user-defined parameter is not defined
[ERROR] An error has occurred while parsing the configuration files
An error occurred while configuring the service
gLite configuration script has returned nonzero return code
Error configuring config_glite
This fails when configuring TAR_WN TAR_UI, or TAR_WN only. It runs to completion with TAR_UI alone, although it still throws the error while running.
WORKAROUND (NOTE: not yet known if this breaks something else; so far so good):
Commenting out the offending lines (189 and 190) in /usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py
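The traceback shows why: `value.split( ' ' )[3]` assumes every parameter value has at least four space-separated tokens, so any shorter value raises IndexError. A minimal reproduction of the failure mode and a guarded alternative (a sketch with a hypothetical parameter value, not the converter itself):

```python
# The converter's line 190 effectively does this, which raises
# IndexError whenever the value has fewer than four tokens:
value = "'/C=UK/O=eScience/OU=UCL'"  # hypothetical short VOMS parameter
try:
    token = value.split(' ')[3]
except IndexError:
    token = None

# A guarded version degrades gracefully instead of crashing:
fields = value.split(' ')
token = fields[3] if len(fields) > 3 else None
print(token)  # -> None
```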
After that, configuring TAR_WN TAR_UI runs with no error.
Post configure_node fixes
Some post configure_node actions still required for WN/UI. These fixes need be applied everytime configure_node is executed.
Despite setting in site-info.def
INSTALL_ROOT=/usr/local/glite
to relocate the installation, that value is not yet correctly passed to some paths.
The following fixes needed to be applied:
cp -p /etc/grid-security/vomsdir/* /usr/local/glite/etc/grid-security/vomsdir
edit /usr/local/glite/etc/profile.d/grid_env.*sh and change to:
#X509_VOMS_DIR=/etc/grid-security/vomsdir X509_VOMS_DIR=/usr/local/glite/etc/grid-security/vomsdir
and
#setenv X509_VOMS_DIR /etc/grid-security/vomsdir setenv X509_VOMS_DIR /usr/local/glite/etc/grid-security/vomsdir
This is a fix for the broken lcg-infosites command:
ln -s /usr/local/glite/edg/lib/perl/vendor_perl/5.8.0/Net/ /usr/local/glite/edg/lib/perl/Net
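Since these fixes must be reapplied after every configure_node run, they could be collected into one small script (a sketch, using the paths above; each step is guarded so it no-ops where a path is absent, and the sed patterns are anchored so re-running is safe):

```shell
#!/bin/sh
# Re-apply the post-configure_node fixes for the relocatable install.
GLITE=${GLITE:-/usr/local/glite}

# 1. Copy the VOMS certificates into the relocated vomsdir.
[ -d "$GLITE/etc/grid-security/vomsdir" ] && \
    cp -p /etc/grid-security/vomsdir/* "$GLITE/etc/grid-security/vomsdir/" || true

# 2. Point X509_VOMS_DIR at the relocated vomsdir in both profile scripts.
for f in "$GLITE"/etc/profile.d/grid_env.sh "$GLITE"/etc/profile.d/grid_env.csh; do
    [ -f "$f" ] && sed -i \
        -e "s|=/etc/grid-security/vomsdir|=$GLITE/etc/grid-security/vomsdir|" \
        -e "s| /etc/grid-security/vomsdir| $GLITE/etc/grid-security/vomsdir|" \
        "$f" || true
done

# 3. Re-create the Perl symlink needed by lcg-infosites.
[ -d "$GLITE/edg/lib/perl/vendor_perl/5.8.0/Net" ] && \
    ln -sf "$GLITE/edg/lib/perl/vendor_perl/5.8.0/Net/" "$GLITE/edg/lib/perl/Net" || true
```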
Additional fixes
The RGMAsc SFT failed. Looking back at the reasons for a similar failure after the 2.7.0 upgrade, I found that the file
/usr/local/glite/edg/share/java/log4j.jar
was missing, and copied it over from a backup of the LCG 2.7.0 installation.
RGMA also failed, so tomcat5 had to be restarted a few times on the MON box.