UCL-HEP

Back To LT2

Local resources

The UCL HEP batch farm currently consists of 17 Dell PowerEdge 2650 servers, each with two 2.4 GHz CPUs with hyperthreading enabled and 1 to 4 GB of RAM. Job scheduling is handled by a Torque/PBS server running on a separate machine, and all the resources are shared with the local HEP users. These machines are also an integral part of the UCL HEP computing cluster and run a standard SLC305 installation. Network connectivity is Gbit Ethernet to the core UCL routers, which in turn connect via Gbit to the London MAN.
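For local users, jobs reach the farm through the Torque/PBS server. A minimal submission sketch follows; the queue name and resource request are assumptions, not the actual farm configuration:

 # submit a short test job to the Torque/PBS server and check on it
 echo "hostname; date" | qsub -q long -l nodes=1 -N testjob
 qstat -u $USER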

UCL HEP batch farm

Monitoring and Accounting

GSTAT for UKI-LT2-UCL-HEP

SFTs for UKI-LT2-UCL-HEP

SAM monitoring for UKI-LT2-UCL-HEP

APEL for UKI-LT2-UCL-HEP

Site LOG


LCG 2.7.0 upgrade

07-08/03/2006

MON (pc91), CE (pc90), SE_dpm_mysql (pc55), SE_dpm_disk (pc30) upgraded: apt + yaim configure

LFC (pc91) installed: apt+ yaim install + yaim configure

APEL setup properly to publish site accounting via RGMA

TAR_WN TAR_UI (pc97) upgraded: tarball + yaim configure
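For reference, the "apt + yaim configure" upgrades above all followed the same pattern. A minimal sketch (the LCG 2.7.0 yaim path is quoted from memory and should be treated as an assumption; [SITE_INFO_DEF] and [NODE_TYPE] are placeholders):

 # refresh the LCG apt repository and pull in the 2.7.0 packages
 apt-get update
 apt-get -y dist-upgrade
 # re-run yaim for the node type in question
 /opt/lcg/yaim/scripts/configure_node [SITE_INFO_DEF] [NODE_TYPE]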

Minor bugs to iron out: (in progress)

SC4 Transfer Test

21-22/03/2006

Gianfranco needed some modifications to the filetransfer script to deal with a relocatable UI install using different paths, so Graeme started the transfer RAL->UCL.

Warmup tests had indicated a poor write rate, constrained by the NFS-mounted DPM filesystems. In addition, the test could not be run to completion because the SC4 Aggregate Throughput test was starting on Wednesday morning. However, running overnight achieved:

 Transfer Bandwidth Report:
 421/1000 transferred in 46968.2349169 seconds
 421000000000.0 bytes transferred.
 Bandwidth: 71.708038549Mb/s

Obviously this backs up the observation that NFS-mounted filesystems perform very poorly. Additionally, both the DPM head node and the pool node are networked through a 100Mb switch. This should change to Gb in the near-ish future.
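For reference, the quoted bandwidth follows directly from the figures in the report above (421 files of 10^9 bytes in 46968 seconds). A quick check:

 # 421e9 bytes * 8 bits/byte / 46968.2349169 s, expressed in Mb/s
 echo "scale=3; 421*10^9*8/46968.2349169/10^6" | bc -l
 # -> 71.708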

24-25/03/2006

The new version (0.3.4) of Graeme's filetransfer script, adapted to the relocatable UI installation we have at UCL-HEP, worked. It installed itself in /opt/lcg/bin with the library in /opt/lcg/lib/python. Copied it to the shared area, so that it is available from any PC in our cluster. A few transfer tests were then carried out:

gs> filetransfer.py --number=1 --ignore-status-error \
srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G \
srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/atlas/canned1G
gs> filetransfer.py --number=1 --delete --ignore-status-error \
srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/atlas/canned1G/tfr000-file00000 \
srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/atlas

While the RAL->UCL transfers would succeed, the UCL->RAL invariably failed:

gs> /usr/local/lcg/glite/bin/glite-transfer-status -l d139282b-bb51-11da-bd42-dee58a58037f
Active
  Source:
srm://pc55.hep.ucl.ac.uk:8443/srm/managerv1?SFN=/dpm/hep.ucl.ac.uk/home/atlas/canned1G_1/tfr000-file00000
  Destination:
srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/atlas/tfr000-file00000
  State:       Waiting
  Retries:     1
  Reason:      Transfer failed. ERROR the server sent an error response:
425 425 Cannot open port: java.lang.Exception: Pool manager error: Best pool <atlas_60_1> too high : 2.0E8
  Duration:    0

Graeme advised this meant that the RAL pools for Atlas were full. Since Gianfranco could only authenticate for Atlas, Graeme once again volunteered to initiate the transfer, which he did just before midnight.

Things were going quite nicely until 2006-03-25T01-21-09, when 11 transfers went into the waiting state "TRANSFER - Transfer timed out." There were a few more transient failures, then things went very pear-shaped at 2006-03-25T12-01-32: "Transfer failed due to possible network problem - timed out."

After that nothing succeeded. It turned out that the network interface on the storage pool node pc30 went into an error state and the machine locked up at Mar 25 13:31:34. The console showed the following error:

eth0: too much work in interrupt, status e401

in /var/log/messages:

Mar 25 12:00:57 pc30 gridftpd[16253]: RETR /pc30.hep.ucl.ac.uk:/storage/lcgdteam/2006-03-24/canned1G.4452.0
Mar 25 12:04:33 pc30 gridftpd[16177]: Data connection. data_write() failed: Handle not in the proper state
Mar 25 12:04:33 pc30 gridftpd[16177]: lost connection to fts0344.gridpp.rl.ac.uk [130.246.179.20]

Here's the summary of the successful UCL->RAL transfers:

Transfer Bandwidth Report:
  421/1000 transferred in 53551.5980821 seconds
  421000000000.0 bytes transferred.
  Bandwidth: 62.8926142379Mb/s

Success/Failure rates and bandwidth summary for the transfers:

File:Fts-RAL-UCL.gif File:Fts-UCL-RAL.gif

File:Fts-UCL-bandwidth.gif


SC4 Transfer Test (Autumn 06)

14-15/09/2006

Preparation

Downloaded and installed Graeme's latest version of the file transfer script:

wget http://www.physics.gla.ac.uk/~graeme/scripts/packages/filetransfer
rpm -Uvh filetransfer-0.5.2-1.noarch.rpm
rpm -ql filetransfer-0.5.2-1
/opt/lcg/bin/filetransfer.py
/opt/lcg/lib/python/fileTransferLib.py

Attempts to make the install available in the shared directory of our UI (using soft links) failed.

Make sure everything is set correctly and try a test transfer:

unset GLOBUS_TCP_PORT_RANGE

grid-proxy-init -valid 30:00
Your identity: /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca
Enter GRID pass phrase for this identity:
Creating proxy .................................... Done
Your proxy is valid until: Sat Sep 9 04:56:09 2006

myproxy-init -d
Your identity: /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca
Enter GRID pass phrase for this identity:
Creating proxy ............................................... Done
Proxy Verify OK
Your proxy is valid until: Fri Sep 15 16:55:16 2006
Enter MyProxy pass phrase:
Verifying password - Enter MyProxy pass phrase:
A proxy valid for 168 hours (7.0 days) for user /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca now exists on
lcgrbp01.gridpp.rl.ac.uk.

glite-transfer-channel-list | grep UCLHEP
RALLCG2-UKILT2UCLHEP
STAR-UKILT2UCLHEP
UKILT2UCLHEP-RALLCG2

Tried to optimize transfer settings, but didn't have management rights on them:

glite-transfer-channel-set -f 8 RALLCG2-UKILT2UCLHEP
set: setNumberOfFiles: You are not authorised for channel management upon this service
glite-transfer-channel-set -T 1 RALLCG2-UKILT2UCLHEP
set: setNumberOfStreams: You are not authorised for channel management upon this service
glite-transfer-channel-set -f 8 UKILT2UCLHEP-RALLCG2
set: setNumberOfFiles: You are not authorised for channel management upon this service
glite-transfer-channel-set -T 1 UKILT2UCLHEP-RALLCG2
set: setNumberOfStreams: You are not authorised for channel management upon this service

glite-transfer-channel-list RALLCG2-UKILT2UCLHEP | grep files
Number of files: 1, streams: 1

Jaimie pointed out that the relevant channel is STAR-UKILT2UCLHEP and set the number of files to 8:

glite-transfer-channel-set -f 8 -T 1 STAR-UKILT2UCLHEP
glite-transfer-channel-list STAR-UKILT2UCLHEP
Channel: STAR-UKILT2UCLHEP
Between: * and UKI-LT2-UCL-HEP
State: Active
Contact: lcg-support@gridpp.rl.ac.uk
Bandwidth: 0
Nominal throughput: 0
Number of files: 8, streams: 1
Number of VO shares: 5
VO 'dteam' share is: 20
VO 'alice' share is: 20
VO 'atlas' share is: 20
VO 'cms' share is: 20
VO 'lhcb' share is: 20

Made sure DPNS_HOST and DPM_HOST are set to the local SE (pc55.hep.ucl.ac.uk):

echo $DPNS_HOST
pc55.hep.ucl.ac.uk
echo $DPM_HOST
pc55.hep.ucl.ac.uk

Tried:

filetransfer.py --number=1 --delete \
srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000 \
srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/dteam/FileTransferTest/`date +%Y%m%d_%H%M%S`
queue: 30
outputfile: transfer-2006-09-11T14-19-53.log
verbose: 1
sleepTime: 60
maxTmpError: -1
cancel-time: 1800
number: 10
sourceSize: 0
delete: 1
logfile: filetransfer.log
srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000
0
MyProxy Password:
/usr/local/glite/d-cache/srm/bin/srm-get-metadata -retry_num=0 -debug=false
srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000
Srm metadata query of srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile000 exited with
status 256
user credentials are: /C=UK/O=eScience/OU=UCL/L=EISD/CN=francesco g sciacca

Failed to get source SURL size. Is it a valid file?

However, a few days later this worked. NOTE: when it succeeds, it prints your passphrase to stdout. BEWARE!!!


RAL->UKI-LT2-UCL-HEP 22-hour transfer

Waited for UCL-CENTRAL to complete their tests, then fired off the transfer setting the duration to 22 hours (started at 17:54, 14 Sept 2006), as UCL was to go down the next day for a scheduled power outage. Unfortunately, I used the "-duration" flag as opposed to "--duration". Although the script didn't complain, it ignored that option and stopped the transfer at 23:25, 14 Sept 2006, after transferring 100 files.

Re-started the transfer again at 00:45, 15 Sept 2006, setting the duration to 15 hours. It completed at 15:45, 15 Sept 2006.

For the record, this is the correct form of the submission command:

filetransfer.py --duration=22:00 --delete --uniform-source-size \
srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/canned/1GBcannedfile[000:099] \
srm://pc55.hep.ucl.ac.uk:8443/dpm/hep.ucl.ac.uk/home/dteam/FileTransferTest/`date +%Y%m%d_%H%M%S`

Here are the summaries of the two transfers:

Transfer Bandwidth Report Summary
=================================
transfer 0 (8f31ccee-4411-11db-bc53-e98919b6d2fb)
100/100 (100000000000.0) transferred. Started at 17:54:38, Done at 23:25:58, Duration = 5:31:19, Bandwidth = 40.2425322224Mb/s

Date of Submission was 14/9/2006
Total number of FTS submissions = 1
100/100 transferred in 19945.9034829 seconds
100000000000.0bytes transferred.
Average Bandwidth:40.1084864712Mb/s


Transfer Bandwidth Report Summary
=================================
transfer 0 (a71b94eb-444b-11db-bc53-e98919b6d2fb)
16/100 (16000000000.0) transferred. Started at 0:50:29, Canceled at 1:50:26, Duration = 0:59:57, Bandwidth = 35.582794095Mb/s
transfer 1 (640396ad-4454-11db-bc53-e98919b6d2fb)
26/100 (26000000000.0) transferred. Started at 1:53:3, Canceled at 3:23:56, Duration = 1:30:53, Bandwidth = 38.141177635Mb/s
transfer 2 (75261cb2-4461-11db-bc53-e98919b6d2fb)
79/100 (79000000000.0) transferred. Started at 3:26:34, Canceled at 9:0:13, Duration = 5:33:39, Bandwidth = 31.5693706657Mb/s
transfer 3 (cc57a425-448b-11db-bc53-e98919b6d2fb)
52/100 (52000000000.0) transferred. Started at 8:30:44, Canceled at 12:3:59, Duration = 3:33:15, Bandwidth = 32.512081779Mb/s
transfer 4 (1aff0071-44aa-11db-bc53-e98919b6d2fb)
49/100 (49000000000.0) transferred. Started at 12:6:36, Canceled at 15:9:14, Duration = 3:2:37, Bandwidth = 35.7745810358Mb/s
transfer 5 (6a3155ec-44bc-11db-bc53-e98919b6d2fb)
9/100 (9000000000.0) transferred. Started at 15:10:51, Active at 15:49:50, Duration = 0:38:58, Bandwidth = 30.7883202266Mb/s

Date of Submission was 15/9/2006
Total number of FTS submissions = 6
231/600 transferred in 53960.877938 seconds
231000000000.0bytes transferred.
Average Bandwidth:34.2470336032Mb/s

Note the strange "Canceled" messages. Not sure what causes them, as I didn't interact with the transfer at all after firing it off.

Summary Graphs:

File:Fts-graph.pl10.gif File:Fts-graph.pl11.gif File:Fts-graph.pl4.gif

UKI-LT2-UCL-HEP->RAL 24-hour transfer

Upgrade Log

Glite 3.0

05-07/07/2006

Middleware Upgrade

Upgraded front end nodes (except CE) to the gLite flavour of the services they run.

On all nodes {MON (pc91), CE (pc90), SE_dpm_mysql (pc55), SE_dpm_disk (pc30)}, changed /etc/apt/sources.list.d/lcg.list to:

rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates

#rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_7_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security

Then:

apt-get update 

Updated to the gLite RPMs where necessary:

-pc91: Dump a snapshot of the DB first! -> NOT DONE - regularly backed-up by a cron job
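Had a snapshot been taken, something along these lines would have done it (a sketch only; the dump location is arbitrary):

 # dump all MySQL databases on the node before upgrading, in addition to the cron backups
 mysqldump -u root -p --all-databases > /root/mysql-pre-upgrade-`date +%Y%m%d`.sql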

rpm -ve lcg-MON-2.7.0-sl3		OK
rpm -ve lcg-LFC_mysql-2.7.0-sl3		OK
apt-get install glite-MON		OK
apt-get install glite-LFC_mysql		OK

-pc90: no changes, will keep lcg-CE

apt-get upgrade lcg-CE			WARNING

Does NOT get glite-yaim -> must get by hand (see below)

-pc55: Dump a snapshot of the DB first! -> NOT DONE - regularly backed-up by a cron job

rpm -ve lcg-SE_dpm_mysql-2.7.0-sl3	OK
apt-get install glite-SE_dpm_mysql	OK

-pc30:

rpm -ve ???

(there's no lcg-SE-dpm_disk...) -> NOTHING TO DO


Then:
apt-get dist-upgrade

Make sure yaim is the latest version (glite-yaim-3.0.0-16). If glite-yaim is not installed, get it here and install it:

wget http://www.cern.ch/grid-deployment/gis/yaim/glite-yaim-x.x.x-x.noarch.rpm
rpm -ivh glite-yaim-x.x.x-x.noarch.rpm

Not necessary on pc91, pc30 or pc55. DONE on pc90 (as it did not get glite-yaim). Had to remove lcg-yaim before being able to dist-upgrade:

rpm -ev lcg-yaim-2.7.0-5

Tarball and userdeps downloaded for UI/WN:

wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/GLITE-3_0_0-sl3.tar.gz
wget http://grid-deployment.web.cern.ch/grid-deployment/download/relocatable/GLITE-3_0_0-userdeps-sl3.tar.gz
tar xzvf GLITE-3_0_0-sl3.tar.gz
cd edg
mv ../GLITE-3_0_0-userdeps-sl3.tar.gz .
tar xzvf GLITE-3_0_0-userdeps-sl3.tar.gz

Re-configure all nodes

site-info.def now has a few new variables, and a few others have been ditched. A new version was prepared based on the prototype that came with the RPMs. Note that the path to the file has changed to: /opt/glite/yaim/ucl-hep/site-info.def

The paths to users.conf and groups.conf have changed accordingly.

/opt/glite/yaim/scripts/configure_node  /opt/glite/yaim/ucl-hep/site-info.def [NODE_TYPE]

These went well for all nodes, except for TAR_WN TAR_UI, which failed with the following error:

/usr/local/glite/glite/yaim/scripts/configure_node /usr/local/glite/glite/yaim/ucl-hep/site-info-pc97.def TAR_WN TAR_UI
...
Configuring config_glite
Traceback (most recent call last):
 File "/usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py", line 418, in ?
   updateContainerParameter( param, environ[param] )
 File "/usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py", line 190, in updateContainerParameter
   stripQuotes( value.split( ' ' )[3] ), 'voms.cert.subj' )
IndexError: list index out of range
[ERROR] The user-defined parameter  is not defined
[ERROR] An error has occurred while parsing the configuration files
An error occurred while configuring the service
gLite configuration script has returned nonzero return code
Error configuring config_glite 

The configuration fails when running TAR_WN TAR_UI or TAR_WN alone. With TAR_UI alone it runs to completion, although it still throws the error while running.


WORKAROUND (NOTE: not yet known whether this breaks something else; so far so good): comment out the offending lines (189 and 190) in /usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py
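One way to apply the workaround (line numbers as in the traceback above; this assumes a GNU sed with -i support and keeps a backup copy):

 # comment out lines 189-190 of the YAIM converter, keeping the original as .bak
 sed -i.bak '189,190 s/^/#/' \
     /usr/local/glite/glite/yaim/scripts/../functions//../libexec/YAIM2gLiteConvertor.py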

After that, configuring TAR_WN TAR_UI runs with no error.

Post configure_node fixes

Some post-configure_node actions are still required for the WN/UI. These fixes need to be applied every time configure_node is executed.

Despite setting the following in site-info.def

INSTALL_ROOT=/usr/local/glite

to relocate the installation, that value is not yet correctly passed to some paths.

The following fixes needed to be applied:

cp -p /etc/grid-security/vomsdir/* /usr/local/glite/etc/grid-security/vomsdir

Edit /usr/local/glite/etc/profile.d/grid_env.*sh and change the X509_VOMS_DIR setting to:

#X509_VOMS_DIR=/etc/grid-security/vomsdir
X509_VOMS_DIR=/usr/local/glite/etc/grid-security/vomsdir

and

#setenv X509_VOMS_DIR /etc/grid-security/vomsdir
setenv X509_VOMS_DIR /usr/local/glite/etc/grid-security/vomsdir


This is a fix for the broken lcg-infosites command:

ln -s /usr/local/glite/edg/lib/perl/vendor_perl/5.8.0/Net/ /usr/local/glite/edg/lib/perl/Net
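Since these fixes have to be reapplied after every configure_node run, they could be collected into a small script. This is a sketch only, using the paths from the notes above; re-check the sed substitution against the actual grid_env files before relying on it:

 #!/bin/bash
 # post_configure_fixes.sh - reapply the relocatable-install fixes after configure_node
 set -e
 GLITE=/usr/local/glite

 # 1. copy the VOMS certificates into the relocated vomsdir
 cp -p /etc/grid-security/vomsdir/* $GLITE/etc/grid-security/vomsdir/

 # 2. point X509_VOMS_DIR at the relocated vomsdir in both profile scripts
 sed -i.bak 's#\(X509_VOMS_DIR[= ]\)/etc/grid-security/vomsdir#\1'"$GLITE"'/etc/grid-security/vomsdir#' \
     $GLITE/etc/profile.d/grid_env.*sh

 # 3. fix for the broken lcg-infosites command (idempotent)
 ln -sfn $GLITE/edg/lib/perl/vendor_perl/5.8.0/Net/ $GLITE/edg/lib/perl/Net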

Additional fixes

The RGMAsc SFT failed. Checking back on the reasons for a similar failure after the 2.7.0 upgrade, I found that the file

/usr/local/glite/edg/share/java/log4j.jar

was missing, so I copied it over from a backup of the LCG 2.7.0 installation.

The RGMA test also failed, so tomcat5 had to be restarted a few times on the MON box.
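The corresponding commands were roughly as follows (the backup location is hypothetical):

 # restore the missing jar from a backup of the old LCG 2.7.0 install area
 cp -p /path/to/lcg-2.7.0-backup/edg/share/java/log4j.jar /usr/local/glite/edg/share/java/log4j.jar
 # restart tomcat on the MON box until R-GMA recovers
 service tomcat5 restart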