Scotgrid LCG 2.7 Pre-Release Testing


Experiences in testing the LCG 2.7 Pre-release on Scotgrid Testzone.


General Problems

APT Repository Problems

  • Command: /opt/lcg/yaim/scripts/install_node /etc/lcg/site-info.def CE_torque
  • Problem: 404 Errors from wget:
Err http://linuxsoft.cern.ch LCG/apt/LCG-2_7_0/sl3/en/i386/lcg_sl3 pkglist
404 Page Not Found
  • Analysis: Repositories do not exist (yet?)
  • Fix: Per Oliver's instructions, there's a different repository to use for testing 2.7.0:
 LCG_REPOSITORY="rpm http://grid-deployment.web.cern.ch/grid-deployment apt-cert/HEAD/sl3/en/i386 lcg_sl3 lcg_sl3.updates lcg_sl3.security"
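
As a sketch, the fix amounts to replacing the LCG_REPOSITORY line in site-info.def with the value above and re-running the failed install step (paths as used in the command above):

# after setting LCG_REPOSITORY in /etc/lcg/site-info.def to the apt-cert line above,
# re-run the install step that previously hit the 404 errors
/opt/lcg/yaim/scripts/install_node /etc/lcg/site-info.def CE_torque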

CE Installation

  • Command: /opt/lcg/yaim/scripts/configure_node /etc/lcg/site-info.def CE_torque

GridFTP

  • Configuration Step: Configuring config_globus
  • Problem: Error messages in lcg-mon-gridftp configuration:
error reading information on service lcg-mon-gridftp: No such file or directory
/opt/lcg/yaim/scripts/configure_node: line 184: /etc/rc.d/init.d/lcg-mon-gridftp: No such file or directory

Torque

  • Configuration Step: Configuring config_torque_submitter_ssh
  • Problem: pbsnodes cannot connect to the local PBS server (grid04.ph.gla.ac.uk) during configuration:
Configuring config_torque_submitter_ssh
Connection refused
/usr/bin/pbsnodes: cannot connect to server grid04.ph.gla.ac.uk, error=111
Connection refused
/usr/bin/pbsnodes: cannot connect to server grid04.ph.gla.ac.uk, error=111
  • Analysis:
    • config_torque_submitter_ssh causes a pbsnodes command to be issued
    • config_torque_submitter_ssh happens before config_torque_server, which starts PBS.
    • Hence, PBS isn't started and pbsnodes fails.
    • Re-running configure_node (i.e. with PBS actually running) doesn't produce this error.
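
A quick check of this explanation (a sketch, assuming the standard Torque pbs_server init script and that grid04 is the local PBS server): start the server by hand and confirm pbsnodes connects before re-running YAIM.

service pbs_server start                     # start Torque manually (init script name may differ)
/usr/bin/pbsnodes -a                         # should now list nodes instead of "Connection refused"
/opt/lcg/yaim/scripts/configure_node /etc/lcg/site-info.def CE_torque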

GridICE

  • Configuration Step: Configuring config_fmon_client
  • Problem: Error message about 'torque' not being available; it is not clear whether this is harmful:
Starting GridICE daemons: LRMS 'torque' not available! (possible pbs or lsf)

CE Tests

RGMA

After the MON box installation, I wanted to check that RGMA was happy on both machines. On the CE, I tried:

$ /opt/glite/bin/rgma-client-check


*** Running R-GMA client tests on grid04 ***

Checking C API: Failed to create producer: Can't connect to servlet (authentication failed).
Failure - failed to insert test tuple
Checking C++ API: R-GMA application error in PrimaryProducer: Neither TRUSTFILE nor X509_USER_PROXY environment variable set.
Failure - failed to insert test tuple
Checking CommandLine API: ERROR: Neither TRUSTFILE nor X509_USER_PROXY is set
Failure - failed to insert test tuple
Checking Java API: R-GMA error: Neither TRUSTFILE nor X509_USER_PROXY is set
Failure - failed to insert test tuple
Checking Python API: RGMA Error: Neither TRUSTFILE nor X509_USER_PROXY is set
Failure - failed to insert test tuple

*** R-GMA client test failed ***
  • Analysis: /opt/lcg/yaim/functions/config_lcgenv only configures $X509_USER_PROXY when configuring a UI:
if ( echo $NODE_TYPE_LIST | egrep -q UI ); then
    cat << EOF >> ${LCG_ENV_LOC}/lcgenv.sh
if [ "x\$X509_USER_PROXY" = "x" ]; then
    export X509_USER_PROXY=/tmp/x509up_u\$(id -u)
fi
EOF
fi
  • Fix: Performing a manual export X509_USER_PROXY=/tmp/x509up_u33334 and re-running rgma-client-check worked.
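
For reference, the workaround in shell form (the proxy path follows the same pattern that config_lcgenv writes for UIs):

# point the R-GMA client tools at an existing proxy, then re-run the check
export X509_USER_PROXY=/tmp/x509up_u$(id -u)   # or the explicit /tmp/x509up_u33334 used here
/opt/glite/bin/rgma-client-check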

MON Installation

  • No obvious errors in installation or configuration.

Initial Tests

  • /opt/glite/bin/rgma-server-check
    • Got the error: "You must export RGMA_HOME or GLITE_LOCATION variable before running this script."
    • I then sourced /etc/profile.d/gliteenv.sh in my shell and ran rgma-server-check successfully.
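
In shell form, the working sequence was simply:

# pick up the gLite environment (GLITE_LOCATION etc.), then re-run the server check
. /etc/profile.d/gliteenv.sh
/opt/glite/bin/rgma-server-check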

UI Installation

No obvious problems here.

UI Functionality

  • lcg-utils functional - no problems
  • dpm client utilities - no problems
  • lfc client utilities - no problems
  • srm clients - srm-get-metadata broken (known problem, now fixed in release)
  • fts clients - no problems

SE (dCache) Installation

Head node

  • Install command: /opt/lcg/yaim/scripts/install_node /etc/lcg/site-info.def lcg-SE_dcache
  • PostgreSQL installed and the postgres user exists:
# id postgres
uid=26(postgres) gid=26(postgres) groups=26(postgres)
  • Relevant rpms:
grid09:~# rpm -qa|grep pnfs
pnfs-3.1.10-15
grid09:~# rpm -qa|grep cache
d-cache-lcg-6.1.0-1
lcg-SE_dcache-20060117_1154-sl3
dcache-client-1.6.6-4
lcg-info-dynamic-dcache-1.0.9-1_sl3
dcache-server-1.6.6-4
  • Note: srm-get-metadata does not work in the 1.6.6-4 client. This is fixed in the 1.6.6-5 release [now included in LCG 2.7.0].
# rpm -qa|grep postg
postgresql-pl-8.0.4-2PGDG
postgresql-devel-8.0.4-2PGDG
postgresql-tcl-8.0.4-2PGDG
postgresql-libs-8.0.4-2PGDG
postgresql-server-8.0.4-2PGDG
postgresql-test-8.0.4-2PGDG 
postgresql-contrib-8.0.4-2PGDG
postgresql-docs-8.0.4-2PGDG
postgresql-python-8.0.4-2PGDG
postgresql-8.0.4-2PGDG
postgresql-jdbc-8.0.4-2PGDG
  • Note: The dCache developers recommend PostgreSQL v8.1 and above for users who choose a PostgreSQL database backend for PNFS. Unfortunately, YAIM does not install the PostgreSQL-backed PNFS, but continues to use gdbm. It would be better if both options were available: gdbm for sites that are upgrading from an old dCache and PostgreSQL for those that are installing from scratch.
  • Configure command: /opt/lcg/yaim/scripts/configure_node /etc/lcg/site-info.def SE_dcache
  • Configuration failed on the first attempt due to an issue with the domain name.
grid09:/etc/lcg# cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search ph.gla.ac.uk
nameserver 194.36.1.70
nameserver 130.209.16.6
nameserver 130.209.4.18
nameserver 130.209.4.16
grid09:/etc/lcg# hostname -f
grid09:/etc/lcg# hostname
grid09

so hostname -f does not return the fully qualified name (and hostname -d returns no domain). This causes a problem with the configuration, since configure_node contains:

if [ "x${DCACHE_ADMIN#$thishost}" = "x" ];

where

thishost=`hostname -f`

but hostname -f returns only grid09 on this machine. The solution was to change site-info.def to:

DCACHE_ADMIN="grid09"
DCACHE_POOLS="grid12:/srm-storage"

It must be noted that this subsequently breaks the information system since lcg-info-generic.conf does not then contain the FQDN of the machine in its entries. This could be fixed by editing site-info.def again and rerunning the information system configuration.
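
To illustrate why the short hostname breaks the test (a sketch; the FQDN value is an assumption based on the search domain above):

# with DCACHE_ADMIN set to the FQDN but hostname -f returning only the short name,
# the prefix-strip leaves a non-empty string and the admin-node test fails
thishost=grid09                              # what hostname -f returned here
DCACHE_ADMIN="grid09.ph.gla.ac.uk"
echo "x${DCACHE_ADMIN#$thishost}"            # prints "x.ph.gla.ac.uk", so the test is false
DCACHE_ADMIN="grid09"                        # the workaround used above
echo "x${DCACHE_ADMIN#$thishost}"            # prints "x", so the test succeeds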

  • During configuration, is there any need to try to start globus-gridftp?
Stopping globus-gridftp:                                   [FAILED]
Starting globus-gridftp:execvp: No such file or directory
                                                           [FAILED]
Shutting down lcg-mon-gridftp:                             [  OK  ]
Starting lcg-mon-gridftp                                   [  OK  ]
  • VOMS error:
voms
search(https://lcg-voms.cern.ch:8443/voms/dteam/services/VOMSCompatibility?
method=getGridmapUsers&container=%2Fdteam%2FRole%3Dproduction):  
Internal Server Error
  • PNFS not started after configuration.
service pnfs start
grid09:~# ls /pnfs
data  fs  ftpBase  ph.gla.ac.uk
grid09:~# ls /pnfs/ph.gla.ac.uk/
data
grid09:~# ls /pnfs/ph.gla.ac.uk/data/
  • The above listing shows that no VO directories have been created, so this needs to be done by hand; see the sketch below. Each VO should have its own database to prevent bottlenecks.
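A minimal sketch of creating the VO directories by hand (the VO list is illustrative; note that the lhcb directory used in the tag example further down lives directly under /pnfs/ph.gla.ac.uk). Giving each VO its own PNFS database is a further step described in the dCache documentation.

# create a top-level directory per supported VO (requires pnfs to be running and mounted)
cd /pnfs/ph.gla.ac.uk
for vo in dteam atlas cms lhcb; do
    mkdir $vo
done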
  • postmaster process not started after configuration.
grid09:/etc/init.d# su postgres
grid09:/etc/init.d$ postmaster -i -D /var/lib/pgsql/data/ > /tmp/logfile 2>&1 &
                                                                               
grid09:/etc/init.d# ps aux|grep postg
postgres 15117  0.1  0.3 18956 3152 pts/0    S    14:38   0:00 postmaster -i -D /var/lib/pgsql/data/
postgres 15119  0.0  0.2  8872 2148 pts/0    S    14:38   0:00 postgres: logger process
postgres 15121  0.0  0.3 19088 3240 pts/0    S    14:38   0:00 postgres: writer process
postgres 15122  0.0  0.2  9872 2184 pts/0    S    14:38   0:00 postgres: stats buffer process
postgres 15123  0.0  0.2  9060 2304 pts/0    S    14:38   0:00 postgres: stats collector process
postgres 15150  0.0  0.5 19936 5996 pts/0    S    14:39   0:00 postgres: srmdcache dcache 127.0.0.1(34262) idle in transaction
root     15173  0.0  0.0  3684  656 pts/0    S    14:39   0:00 grep postg
  • PNFS and dcache-core restarted after starting postmaster.
  • Web monitoring interface operational.
links http://localhost:2288

Not all relevant cells appear in the interface. Had to add the SRM-grid09 cell to /opt/d-cache/config/httpd.batch by hand and restart the web interface before the SRM cell appeared.

  • Relevant dCache processes running:
# netstat -lntp | grep java
tcp        0      0 0.0.0.0:33122               0.0.0.0:*                   LISTEN      4436/java
tcp        0      0 0.0.0.0:22125               0.0.0.0:*                   LISTEN      4585/java
tcp        0      0 0.0.0.0:22223               0.0.0.0:*                   LISTEN      4661/java
tcp        0      0 0.0.0.0:2288                0.0.0.0:*                   LISTEN      4740/java
tcp        0      0 0.0.0.0:8443                0.0.0.0:*                   LISTEN      5106/java
tcp        0      0 0.0.0.0:22111               0.0.0.0:*                   LISTEN      5009/java
  • No gsidcap or gridftp doors running on admin node by default.
  • By default, the dCache log files are created in /var/log, not /opt/d-cache/log. This is different from previous behaviour, presumably due to the logArea setting in /opt/d-cache/config/dCacheSetup. Manually created a logrotate file, setting the rotate count to 15 so as to keep logs for a sufficiently long time:
$ cat /etc/logrotate.d/dcache
/opt/d-cache/log/*.log {
        rotate 15
        daily
        missingok
        compress
        copytruncate
}
  • Manually created a grid-mapfile2dcache-kpwd script in /etc/cron.hourly and set its permissions to 755:
#! /bin/bash
/opt/d-cache/bin/grid-mapfile2dcache-kpwd
  • PNFS setup: PNFS is using gdbm. My initial install and configuration did not have any VO databases enabled, but I think this was due to the hostname issue with the nodes. Had trouble getting the VO databases set up on subsequent runs of the YAIM configuration. Noticed that only a single database per VO was being initialised; it may be better to have two (dteam and dteam/generated?). It was necessary to add additional VOs by hand (i.e. it is not possible to simply re-run YAIM; possibly a new site-info.def variable could allow for this).

To map VOs to pools:

grid09:/pnfs/ph.gla.ac.uk/lhcb# echo "StoreName    lhcb">".(tag)(OSMTemplate)"
grid09:/pnfs/ph.gla.ac.uk/lhcb# echo lhcb > ".(tag)(sGroup)" 
grid09:/pnfs/ph.gla.ac.uk/lhcb# grep "" $(cat ".(tags)()")
.(tag)(OSMTemplate):StoreName    lhcb
.(tag)(sGroup):lhcb
                                                                               

Information system

  • After creating /opt/lcg/var/gip/tmp with 777 permissions and starting globus-mds, the dynamic storage information was created.
  • dCache >1.6.6-2 comes with a dynamic information plugin that should integrate with LCG GIP and provide storage used per-VO.
  • VO-specific pool groups have been set up in the dCache PoolManager and pools have been added to them (i.e. psu create pgroup lhcb); see the sketch below.
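For reference, a sketch of the PoolManager commands involved (entered in the PoolManager cell of the dCache admin interface on the admin node; the fuller setup of storage units and links is described in the dCache documentation):

psu create pgroup lhcb                   # one pool group per VO, as in the example above
psu addto pgroup lhcb grid12_1           # grid12_1 is the pool defined in grid12.poollist
save                                     # make the change persistent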
  • Links to plugin made:
ln -s /opt/d-cache/jobs/infoDynamicSE-plugin-dcache /opt/lcg/var/gip/plugin/ 
ln -s /opt/d-cache/jobs/infoDynamicSE-provider-dcache /opt/lcg/var/gip/provider/
  • To make sure that everything will be OK, try running the plugin by hand. It should output the used and available storage per VO. If it doesn't, something is wrong. Make sure that you have the pool groups correctly labelled.
  • Modify /opt/lcg/share/doc/lcg-info-templates/lcg-info-static-se.conf to apply to your site configuration and add in extra blocks for each VO that you support. Generate the .ldif file that contains the static information by running:
/opt/lcg/sbin/lcg-info-static-create -c /opt/lcg/share/doc/lcg-info-templates/lcg-info-static-se.conf \
-t /opt/lcg/etc/GlueSE.template > /opt/lcg/var/gip/ldif/lcg-info-static-se.ldif
  • Need to change /opt/d-cache/config/dCacheSetup to make sure that it points to the correct .ldif file:
infoProviderStaticFile=/opt/lcg/var/gip/ldif/lcg-info-static-se.ldif
  • Remove /opt/lcg/var/gip/plugin/lcg-info-dynamic-se and the dynamic file /opt/lcg/var/gip/tmp/lcg-info-dynamic-se.ldif.4065 to stop the old LCG dynamic information provider. It is a good idea to restart globus-mds afterwards.
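A minimal sketch of that clean-up (paths as quoted above; the numeric suffix of the cached ldif file will differ from host to host):

# remove the old LCG dynamic SE provider and its cached output, then restart MDS
rm /opt/lcg/var/gip/plugin/lcg-info-dynamic-se
rm /opt/lcg/var/gip/tmp/lcg-info-dynamic-se.ldif.4065
service globus-mds restart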

Pool node

  • Installation command: /opt/lcg/yaim/scripts/install_node /etc/lcg/site-info.def lcg-SE_dcache
  • Installed without problems.
  • Configuration command: /opt/lcg/yaim/scripts/configure_node /etc/lcg/site-info.def SE_dcache
  • Made same modification to site-info.def regarding the hostnames of the admin and pool nodes.
  • Same VOMS and globus-gridftp errors as above.
  • Configuration picked up from hostname and DCACHE_POOLS that this machine was to be a pool node. Single dCache pool created:
grid12:~# ls /srm-storage/
pool
grid12:~# ls /srm-storage/pool/
control  data  setup  setup.orig  setup.temp
grid12:~# cat /opt/d-cache/config/grid12.poollist
grid12_1  /srm-storage/pool  sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=grid12
  • Default behaviour is for the pool to have an open gridftp door but no gsidcap or srm door. This is reasonable.
  • dcache-pool started, but dcache-core did not, since no symbolic link had been made between the file in /etc/init.d and /opt/d-cache/bin/dcache-core:
grid12:~# ls /etc/init.d/dcache*
/etc/init.d/dcache-pool

Made the link by hand, then tried to start dcache-core:

grid12:/etc/init.d# service dcache-core start
[ERROR] pnfs not mounted on /pnfs/ph.gla.ac.uk/ and ADMIN_NODE
        in etc/node_config or etc/door_config not set properly. Exiting.
  • PNFS not mounted on the pool node because the file /pnfs/fs/admin/etc/exports/<IP address of pool node> did not exist. Had to create it by echoing the following lines into it:
grid09:~# cat /pnfs/fs/admin/etc/exports/194.36.1.96
/admin     /0/root/fs/admin     0   nooptions
/pnfsdoors /0/root/fs/usr           0   nooptions

Also had to copy /pnfs/fs/admin/etc/exports/trusted/127.0.0.1 to a file named after the IP address of the pool node.
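
Roughly, the by-hand fix on the admin node looks like this (IP address as in the listing above):

# export the pnfs filesystem to the pool node, then add it to the trusted hosts
echo "/admin     /0/root/fs/admin     0   nooptions"  > /pnfs/fs/admin/etc/exports/194.36.1.96
echo "/pnfsdoors /0/root/fs/usr           0   nooptions" >> /pnfs/fs/admin/etc/exports/194.36.1.96
cp /pnfs/fs/admin/etc/exports/trusted/127.0.0.1 /pnfs/fs/admin/etc/exports/trusted/194.36.1.96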

  • After configuring all of this, we have a basic operational system. SRM commands working.
  • Able to add a new pool using the standard methods (add an entry to the grid12.poollist file; create the pool, data and control directories; modify httpd.batch).

FTS testing

  • Need to test interoperability of dCache 1.6.6 with FTS. Found that transferring large files into dCache with FTS does not succeed because dCache does not return performance markers quickly enough, causing FTS to conclude that the transfer has timed out. The files then go into WAITING state according to FTS, but the entire set of files is still transferred. Changed an option in the dCacheSetup file on the nodes with gridftp doors so that:
performanceMarkerPeriod=30 

The default is 180 (i.e. 3 minutes between markers). This appears to have solved the problem; it is a known issue. A sketch of applying the change follows.
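A sketch of applying the change on each node running a gridftp door (the sed pattern is illustrative; check /opt/d-cache/config/dCacheSetup by hand afterwards, since the setting may be commented out in the stock file):

# lower the performance-marker interval and restart the door processes so it is picked up
sed -i 's/^ *performanceMarkerPeriod=.*/performanceMarkerPeriod=30/' /opt/d-cache/config/dCacheSetup
service dcache-core restart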

WN install

  • lcg-WN was installed and configured on the dCache pool node. No dependency issues were found, but a few errors appeared when configuring WN_torque:
Configuring config_torque_client
/opt/lcg/yaim/scripts/configure_node: line 57:
/opt/edg/sbin/edg-pbs-knownhosts: No such file or directory
/opt/lcg/yaim/scripts/configure_node: line 59:
/var/spool/pbs/mom_priv/config: No such file or directory

SE (DPM) Installation

Head Node

Installed YAIM v14 onto grid08.

Got into a pickle with hostnames being ill defined, then wrong IP addresses in /etc/hosts (doh!).

Found a bug in config_gip - the directory /opt/lcg/var/gip/tmp needs to be created with mode 0777, otherwise globus-mds doesn't start properly. That aside, the static info and the dynamic plugin seem healthy.
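
Until the fix lands in YAIM, the manual workaround used here amounts to:

# create the GIP tmp directory with the permissions the plugin expects, then restart MDS
mkdir -p /opt/lcg/var/gip/tmp
chmod 0777 /opt/lcg/var/gip/tmp
service globus-mds restart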

Later noticed that my plugin info had gone stale: the /opt/lcg/var/gip/tmp directory hadn't been created with the right permissions. Mailed Oliver - fixed.

Otherwise you get a basic DPM. One pool, with one filesystem on $DPM_HOST:$DPMSTORAGE. srmcp works, and I've added other pools - seems basically functional.

Will add a disk pool and try making a VO specific pool.

Head Node Upgrade

Installed 1.3.8 (LCG 2.6.0) on grid11. Populated with files.

Upgraded to 2.7.0. Found that config_DPM_upgrade worked fine - db changed to FQDNs and all files remained accessible.

Notes:

  • Have to ensure that the MySQL password, pool names, etc. remain the same as in the previous YAIM install.
  • The DB schema upgrade is triggered by detecting schema 1.1.0. Sites that upgraded to 1.4.1 and upgraded the DB schema will not get the FQDN fix applied; will have to work on this for these GridPP sites and test.

Disk Node

Installed by YAIM (v15) onto grid11.

Functionality is significantly enhanced. Disk nodes are now trusted users of DPM, so can add their filesystems to pools.

Found a bug in config_DPM_disk: servers with more than one filesystem to add fail. Rewrote config_DPM_disk to fix this by wrapping the filesystem paths in a "for" loop (see the sketch below). Mailed it to Oliver - will be included.
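
The shape of the fix is roughly as follows (a sketch only, not the exact YAIM code; the pool name and filesystem list are placeholders, and the DPM environment is assumed to be set up as YAIM leaves it):

# placeholder values -- YAIM would take these from site-info.def
DPM_POOL=Permanent
DPM_FILESYSTEMS="/storage1 /storage2"
# loop over every filesystem configured for this disk server instead of assuming one
for fs in $DPM_FILESYSTEMS; do
    dpm-addfs --poolname $DPM_POOL --server $(hostname -f) --fs $fs
done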

lcg-rep

Tested the fix for dCache->DPM transfers. Found I was still getting slow transfers. J-P B helped track the problem down to the hostname being "short". He will fix the rfio code, but for the moment defining the hostname to be the FQDN works. (So the RFIO_TCP_NODELAY option does not need to be set.)

LFC

Tested a parallel install with the MON box on grid06 (ran out of machines). Set up a central catalog for dteam and a local one for atlas. YAIM worked fine, advertising the catalogs in the correct way.

Tested catalogs using lcg-util commands on grid13 (UI). No problems found.