Brunel

Back To LT2

Computing

Brunel University has had several clusters / CEs accessible via the Grid:

  • Triton
    • A farm of twin quad-core 64-bit Xeons (2.5 GHz, 16 GB).
    • Lead admin: Raul
  • Netlab
    • A SRIF-2 funded cluster of twin dual-core 64-bit Opterons (1.8 GHz, 4 GB).
    • Lead admin: Raul
  • GreenStripe
    • GreenStripe had just two nodes, on the public network. It was intended as a debugging aid, especially for networking issues and for understanding new middleware updates and VOs, and it also served the NGS.
      GreenStripe was originally built in 2002 using a 1.5 GHz Pentium 4, upgraded around 2004 to dual-Xeon servers (2.0 GHz, 2 GB, no hyperthreading), with the WNs replaced in 2008 by a pair reused from Argo. GreenStripe was retired in August 2012 after 10 years' nominal service.
  • Argo
    • Argo was a SRIF-1 funded cluster of 64 dual-Xeons (2.4 GHz, 2 GB, no hyperthreading). Half the nodes had IDE drives and half SCSI. LCG nominally received 50% of this resource.
      Argo was built in 2003 and retired in 2008. The final score was 8 IDE nodes failed (disk systems) vs. 1 SCSI node failed (system board).

Storage

We have three DPM-based SEs:

  • dgc-grid-38
    • dgc-grid-38 is a dedicated SE for use by the MICE experiment, in particular to test HTTPS interfaces. It was originally a 400GB IDE RAID5 system deployed as a gLite 3.1 PPS test system for DPM in 2007, and replaced by a gLite 3.2 DPM service on a single 6 TB SATA RAID system in Spring 2012.
    • Lead admin: Henry
  • dgc-grid-50
    • dgc-grid-50 was the main gLite 3.1 production SE, with three 20 TB SATA RAID pool servers. It is being retired in Autumn 2012.
  • dgc-grid-34
    • This was the old EDG/gLite 3.0 SE, with two 6 TB SATA RAID pool servers. It was retired in 2010.
  • In common with LT2 policy, the SE storage is not backed up and should be treated as "volatile" (this should be indicated by GlueSAPolicyFileLifeTime).

Log

Rotating SFT reservations in Maui

I have managed to get SFTs to use the reserved node only as a last resort by adding a "-" sign after the dteam and ops groups:

SRCFG[sft] GROUPLIST=dteam-,ops-

which works quite well if there are spare nodes. There is also the option

SRCFG[sft] FLAGS=SPACEFLEX

which is supposed to rotate the standing reservation onto different nodes when the new reservation is made:

"reservation is allowed to move from host to host over time in an attempt to optimize resource utilization"

http://www.clusterresources.com/products/maui/docs/7.1.5managingreservations.shtml

This is done at midnight when using the default PERIOD of DAY (with the default STARTTIME of 00:00:00:00 and ENDTIME of 24:00:00). I had to increase the DEPTH to 4 days so that a reservation could definitely be made, i.e. sufficiently far ahead to be sure all nodes would be free, but it always seems to add the new reservations onto the existing reserved node.
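For reference, a minimal sketch of how those pieces could sit together in maui.cfg (the DEPTH and TASKCOUNT values are illustrative, not necessarily what we run):

# standing reservation for SAM/SFT jobs, soft-bound to dteam/ops (note the trailing '-')
SRCFG[sft] GROUPLIST=dteam-,ops-
SRCFG[sft] FLAGS=SPACEFLEX
SRCFG[sft] PERIOD=DAY
SRCFG[sft] DEPTH=4
SRCFG[sft] TASKCOUNT=1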

Upgrade Log

EMI 3 / SL6

Storage Accounting

Enable storage accounting as per https://wiki.egi.eu/wiki/APEL/Storage:

Get the accounting script from https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Dev/Recipes/EGI_APEL_Accounting. Discover it's not accessible from there (GGUS ticket 131525). It turns out the script, /usr/share/lcgdm/scripts/star-accounting.py, is already present as it was shipped with DPM.

Get the SSM software: https://wiki.egi.eu/wiki/APEL/SSM leads to https://github.com/apel/ssm/releases - get v. 2.1.7-1 (apel-ssm-2.1.7-1.el6.noarch.rpm) and do yum localinstall.

 Package                          Arch                    Version                        Repository                                     Size
Installing:
 apel-ssm                         noarch                  2.1.7-1.el6                    /apel-ssm-2.1.7-1.el6.noarch                   54 k
Installing for dependencies:
 python-daemon                    noarch                  1.5.2-1.el6                    epel                                           27 k
 python-dirq                      noarch                  1.7.1-1.el6                    epel                                           50 k
 python-ldap                      x86_64                  2.3.10-1.el6                   sl                                            124 k
 python-lockfile                  noarch                  0.8-3.el6                      epel                                           17 k
 stomppy                          noarch                  3.1.6-1.el6                    epel                                           48 k

Next steps are in README.md on [1]

~ > useradd -r apel --comment "APEL user for storage accounting"
~ > chown apel:apel /var/spool/apel/
~ > chown apel:apel /var/log/apel/
~ > chown apel:apel /var/run/apel/

and create APEL-accessible copies of the host certificate and key.
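For example (the /etc/grid-security/apel location is just a suggestion - any directory readable by the apel user will do):

~ > mkdir /etc/grid-security/apel
~ > cp -p /etc/grid-security/hostcert.pem /etc/grid-security/apel/
~ > cp -p /etc/grid-security/hostkey.pem /etc/grid-security/apel/
~ > chown -R apel:apel /etc/grid-security/apel
~ > chmod 400 /etc/grid-security/apel/hostkey.pem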

Get authorised / Configure GOCDB

Register service of type "eu.egi.storage.accounting" for that SE with GOCDB, giving full host DN. Wait overnight for it to propagate.

Configure SSM...

In /etc/apel/sender.cfg

  • network: PROD
  • use_ssl: true
  • certificate: /path/to/your/host/certificate
  • key: /path/to/your/key
  • destination: /queue/global.accounting.storage.central
  • path: /var/spool/apel/outgoing
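Put together, the relevant parts of sender.cfg then look something like this (section names assumed from the apel-ssm template - check against the shipped sender.cfg; the certificate/key paths are the APEL-accessible copies made above):

  [broker]
  network: PROD
  use_ssl: true

  [certificates]
  certificate: /etc/grid-security/apel/hostcert.pem
  key: /etc/grid-security/apel/hostkey.pem

  [messaging]
  destination: /queue/global.accounting.storage.central
  path: /var/spool/apel/outgoing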

Test Run:

~ > mkdir -p /var/spool/apel/outgoing/`date +%Y%m%d` 
~ > /usr/share/lcgdm/scripts/star-accounting.py --reportgroups --nsconfig=/usr/etc/NSCONFIG --site="UKI-LT2-Brunel" >  /var/spool/apel/outgoing/`date +%Y%m%d`/`date +%Y%m%d%H%M%S`
~ > ls -l  /var/spool/apel/outgoing/20171103/20171103215443 
-rw-r--r--. 1 root root 4972 Nov  3 21:54 /var/spool/apel/outgoing/20171103/20171103215443

This produces the output files (but owned by root).

chown -R apel:apel /var/spool/apel/outgoing
su --command /usr/bin/ssmsend apel

then publishes OK.

cron Job:

/etc/cron.d/StorageAccounting.cron is

PATH=/sbin:/bin:/usr/sbin:/usr/bin
48 11 * * * root /root/StorageAccounting.sh >> /var/log/apel/StorageAccounting.log 2>&1
48 14 * * * apel /usr/bin/ssmsend >> /var/log/apel/StorageAccounting.log 2>&1

where StorageAccounting.sh simply does the mkdir / star-accounting.py / chown steps above. I'm sure the chown shouldn't really be needed but I haven't had time to dig.
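For completeness, a sketch of what StorageAccounting.sh amounts to (reconstructed from the steps above, not the verbatim script):

#!/bin/bash
# daily storage-accounting record for APEL (reconstruction of the steps above)
OUTDIR=/var/spool/apel/outgoing/`date +%Y%m%d`
mkdir -p $OUTDIR
/usr/share/lcgdm/scripts/star-accounting.py --reportgroups \
    --nsconfig=/usr/etc/NSCONFIG --site="UKI-LT2-Brunel" \
    > $OUTDIR/`date +%Y%m%d%H%M%S`
# shouldn't really be needed, but the records come out owned by root
chown -R apel:apel /var/spool/apel/outgoing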

DPM upgrade

This is expected to boil down to installing new repos and doing a yum update. Raul Lopes/Matt Doidge:

Check what version you're publishing with something like:

    dpm-listspaces --domain dgc-grid-38.brunel.ac.uk --gip --protocols --basedir home --site UKI-LT2-BRUNEL --glue2 | grep MiddlewareVersion

Re-YAIM, to pick up various VOMS server changes

  /opt/glite/yaim/bin/yaim -c -s /root/emi_2/site-info.def -n emi_dpm_mysql -n emi_dpm_disk

and reboot.

Noted since that, if I open port 80, it is possible to browse the directory tree anonymously (as user "nobody"), but accessing a file still flips to https. At the next re-YAIM, try setting DPM_DAV_SECURE_REDIRECT to "off"... That works, in the sense that anyone can now access the data. A robots.txt is needed, but it is inaccessible - see GGUS ticket #109503.

In /etc/httpd/conf.d/zlcgdm-dav.conf must add

   # Filesystem location
   <LocationMatch "^/(?!(dpm/brunel\.ac\.uk/|static|icons|robots.txt)).*">

Note that if /etc/httpd/conf.d/ssl.conf exists then Apache will crash on startup. YAIM removes this file, but it may re-appear if the daemon is updated (without re-running YAIM).

UI

See protected MICEmine page

EMI 2 / SL6

DPM upgrade

Upgrade to DPM on SL6: Basic plan is similar to 3.1 to 3.2 upgrade - install on new HW, test, then transfer DB and data across.

Install generic SL6 OS + EPEL repo, NTP etc. on new hardware from scratch

Install mysql-server, then DPM, then lcg-expiregridmapdir.noarch

Follow the instructions in https://twiki.cern.ch/twiki/bin/view/EMI/GenericInstallationConfigurationEMI2 and https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Admin/Configuration/Manual, plus the associated YAIM guide.

Basically the same siteinfo file as before. Left DMLITE off for now.

Got caught out first time round (early Nov) - need to open up ports in local firewall.

Got caught out on the revisit (early Dec.) - globus-gridftp was not starting automatically (chkconfig on) and I'd forgotten to include the GLOBUS_TCP_PORT_RANGE in the local firewall config. I also added a

   $GLOBUS_TCP_PORT_RANGE 20000,20050

line to /etc/gridftp.conf though I'm not sure it was actually necessary.

I also removed public write access from log files and directories.

The upshot was a working DPM 1.8.4 instance into which I then copied the old data and DB and re-ran YAIM to update the schema. This then ran happily overnight, at which point I tried to upgrade to 1.8.5 - yum update and re-run YAIM. This didn't show any problems, but the server didn't work after a reboot - no DPM daemon. It turned out that the badly-done init script can think the DPM service is already running if certain ports are in use - see GGUS ticket https://ggus.eu/tech/ticket_show.php?ticket=89889. It still didn't fully work after yet another reboot - I had to manually restart dpm-gsiftpd.

Had to open firewall ports for BDII, GLOBUS_TCP_PORT_RANGE, https.
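For reference, this is the sort of thing that needs adding to /etc/sysconfig/iptables (standard ports for the BDII, gsiftp control channel and https; the 20000:20050 range matches the GLOBUS_TCP_PORT_RANGE above, so substitute your own):

   # site/resource BDII
   -A INPUT -m state --state NEW -p tcp --dport 2170 -j ACCEPT
   # gsiftp control channel
   -A INPUT -m state --state NEW -p tcp --dport 2811 -j ACCEPT
   # https / WebDAV
   -A INPUT -m state --state NEW -p tcp --dport 443 -j ACCEPT
   # GLOBUS_TCP_PORT_RANGE as set above
   -A INPUT -m state --state NEW -p tcp --dport 20000:20050 -j ACCEPT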

Link to MySQL tweakage: https://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI2#3_2_MySQL_database_configuration (overkill for this server).

  • Not sure which packages are needed for WebDAV (will lcgdm-dav pull in all dependencies?)

For WebDAV follow https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/WebDAV/Setup

  • Also updated to 1.8.6
  • yum install lcgdm-dav-server dmlite-plugins-adapter
  • new YAIM variables:
  DMLITE="yes"               # Enable DMLite (Required!)
  DMLITE_TOKEN_PASSWORD="password" # This password is used by dmlite to generate the tokens
  DPM_DAV="yes"              # Enable DAV access
  DPM_DAV_NS_FLAGS="Write"   # Allow write access on the NS node
  DPM_DAV_DISK_FLAGS="Write" # Allow write access on the disk nodes
  DPM_DAV_SECURE_REDIRECT="On" # Enable redirection from head to disk using plain HTTP.

I used my own password...

Re-YAIM

  /opt/glite/yaim/bin/yaim -c -s /root/emi_2/site-info.def -n emi_dpm_mysql -n emi_dpm_disk

and reboot. Again a mysterious failure - this time dpm-gsiftpd wasn't starting properly; same symptoms as #89889. The WebDAV interface is working and I can download 6 GB files.

New kernel on 8 Feb and re-boot; again dpm-gsiftpd didn't start properly.

  • still not clear what the difference is between dpm-gsiftpd and globus-gridftp - originally both were enabled.
  • did chkconfig globus-gridftp-server off - dpm-gsiftpd still on

Will see what happens

N.B. DPM upgrade tips

Glite 3.2

DPM upgrade

Upgrade to 64-bit DPM: Install generic SL5.7 OS + DAG repo, NTP etc.

- Enable 64-bit inodes under XFS:

    LABEL=mice-brn          /storage/for            xfs     inode64         0 0

- Try to enforce InnoDB tables in mysql (in my.cnf):

    default-storage-engine=InnoDB

and create a root password in mysql for both localhost and the FQDN.
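Something like the following does that (the FQDN is a placeholder for the new SE's hostname):

    mysql -u root
    mysql> SET PASSWORD FOR 'root'@'localhost' = PASSWORD('yourpassword');
    mysql> SET PASSWORD FOR 'root'@'new-se.brunel.ac.uk' = PASSWORD('yourpassword');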

- Install egi-trust repo and certs; host certificate; glite repos as per generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320).

- Add priority rating to repo files.

- Install glite RPMs and run YAIM with existing siteinfo file:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_2_0/site-info.def -n SE_dpm_mysql

There was a bit of trouble relating to the dpminfo user, probably down to a corrupt edgusers.conf, and an info system error I didn't catch. dpmmgr and dpminfo were given the same numerical uids/gids as the old server.

- According to http://www.gridpp.ac.uk/wiki/DPM_Log_File_Tracing, "(N.B. The original srmv2 daemon should be switched off.)"

- Fix log rotation

This gives a working DPM, but with no data in it.

The raw data was already backed up to a second RAID system. For the database I followed Alessandra Forti's notes at http://northgrid-tech.blogspot.com/2011/11/dpm-upgrade-174-182-glite-32.html (though I've stuck with the OS' mysql 5.0x). On the old SE:

- upgrade from 1.7x to 1.8.0 as normal (to update schema)

- Stop all DPM and BDII services

- Dump the old database with

    mysqldump -C -Q -u root -p -B dpm_db cns_db > dpm.sql-20111125.gz

(this is equivalent to my existing backup command) and then drop the requests tables to shrink the DB:

    wget http://www.sysadmin.hep.ac.uk/svn/fabric-management/dpm/dpm-drop-requests-tables.sql 
    mysql -p < dpm-drop-requests-tables.sql

This then thrashes the hard disk (8 GB, 1997 vintage) while I bite my nails for tens of minutes.

- Dump the shrunken database (what we really want):

    mysqldump -C -Q -u root -p -B dpm_db cns_db > dpm.sql-20111125-v2.gz

The -C flag didn't seem to do anything - my output files were still plain SQL text. (Unsurprising: -C only compresses the client/server protocol traffic, not the dump itself.)
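If genuinely compressed dump files are wanted, piping through gzip is the simple fix, e.g.:

    mysqldump -Q -u root -p -B dpm_db cns_db | gzip > dpm.sql-20111125-v2.gz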

Copy them to new SE, where

- Stop all DPM and BDII services

- Upload the database

    mysql -u root -p -C < /root/dpm.sql-20111125-v2.gz

- rsync the data back in, then re-run yaim (to update schema); this time no errors. Reboot; SRM tests all pass, data there with correct permissions and access by SRM, but httpd-redirector segfaults when a file is selected.

I thought this might be a weirdness with the glite user permissions, but no luck... apparently terminally broken in gLite 3.2, told to use EMI instead.

UI

64-bit UI: Install generic SL5 OS + DAG repo, NTP etc. You need to make sure that hostname -f returns something like an FQDN, else yaim will fail (even though the UI doesn't really need one). Follow the install guide https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320 to set up the gLite and lcg-CA repos, then

  yum groupinstall glite-UI
  yum install lcg-CA
  /opt/glite/yaim/bin/yaim -c -s site-info.def -n glite-UI

- In /opt/edg/etc/edg_wl_ui_cmd_var.conf, /opt/edg/etc/edg_wl_ui_gui_var.conf and /opt/glite/etc/glite_wmsui_cmd_var.conf set a sensible default VO if needed.

- Set up a weekly cron job to touch /tmp/jobOutput, to stop it getting cleaned away. Also worth doing in rc.local for those longer downtimes...
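A minimal sketch of such a cron job (the file name and schedule are arbitrary):

  # /etc/cron.d/touch-joboutput
  0 6 * * 0 root touch /tmp/jobOutput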

- You will need to put the certificates of your VOMS servers in /etc/grid-security/vomsdir/, else glite-wms-job-status will fail with:

  **** Error: API_NATIVE_ERROR ****
  Error while calling the "UcWrapper::getExpiration" native api
  Cannot verify AC signature!
  (Please check if the host certificate of the VOMS server that has issued your proxy is installed on this machine)

even though submission worked fine!

- and either way glite-wms-job-output fails with

  Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
  
  Error - Operation Failed
  Unable to retrieve the output

but clears the sandbox from the WMS. Brilliant.

- You need the 32-bit version of the openldap libraries installed, else lcg-cp fails.
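On an SL5 x86_64 host that means pulling in the 32-bit build alongside the 64-bit one, e.g.:

  yum install openldap.i386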

- it looks like there might be an undeclared dependency of the UI on zlib-devel to get adler32 or CRC32 checksum validation of transfers (i.e. lcg-cp --checksum ...), though MD5 is OK.

Glite 3.1

Upgrade to SL(C)4

CE

LCG CE/WN: dgc-grid-35: 32-bit, essentially standard, but also had to add a DNS-style VO (SuperNEMO). No pool SGM accounts

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.). Restored local override functions from old YAIM.

- Installed packages with yum:

     yum install glite-BDII lcg-CE glite-TORQUE_client glite-TORQUE_utils glite-TORQUE_server glite-WN

This doesn't pull in the CA certs, so also need

     yum install lcg-CA

and local installs of needed VOMS server certificates (and host certs!).

- Comment out the __GROUP_ENABLE variable from the 'requires' list in config_gip_ce_check() in /opt/glite/yaim/functions/config_gip_ce, see https://savannah.cern.ch/bugs/index.php?35890 (I think this is no longer needed).

- Create dummy config_sw_dir in /opt/glite/yaim/functions/local on all WNs to save pummelling the disk server (see http://www.gridpp.ac.uk/wiki/GLite_Update_27#Software_area ). Also don't need it on the CE unless adding new VO or changing to SGM pool account

- In /opt/glite/yaim/functions/local on the CE, need to modify config_gip_ce in order to publish correctly the GlueCESEBindMountInfo for all CEs. Need to add

    "$SE_HOST1") accesspoint="/dpm/${MY_DOMAIN}/home";;
    "$SE_HOST2") accesspoint="/dpm/${MY_DOMAIN}/mice1";;
    "$SE_HOST3") accesspoint="/dpm/${MY_DOMAIN}/home2";;

in the case statement around line 350 (GGUS ticket #36792). (I think this is no longer needed)

- Updated site-info.def with various new stuff; esp. adding a comma to the port ranges and, on the site BDII, modifying the Mds-vo-name/port number to pick up the new BDII-powered GRISes.

- For SuperNEMO, created vo.d subdirectory with the SuperNEMO entries in supernemo.vo.eu-egee.org.

The YAIM guide (https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#vo_d_directory) and GridPP Wiki entry (http://www.gridpp.ac.uk/wiki/Enabling_the_DNS_VO_style_test.gridpp.ac.uk) contradict each other - in particular the YAIM guide mentions "a shorter name for the VO" which I think is bogus - the authors have got confused with Unix groups and PBS queues that happen to use the same name. Basically, two things need to happen: members (FQANs) of VO supernemo.vo.eu-egee.org need to get mapped to snemo001:snemo, etc.; and YAIM must ensure that users in group snemo are allowed to submit to the desired queue (here, "short"). The obvious place to spell out those connections is groups.conf, which has a field at the end for the VO, but as I've never seen that used I wimped out and kept it standard, e.g.

   "/VO=supernemo.vo.eu-egee.org/GROUP=/supernemo.vo.eu-egee.org":snemo:3700::

Instead, I put the full VO name in the users.conf file, e.g.

   37001:snemo001:3700:snemo:supernemo.vo.eu-egee.org::

In site-info.def the entry in VOS is the full VO name (supernemo.vo.eu-egee.org), the entry in QUEUES the queue name ("short") and these are connected by the disaster area that is SHORT_GROUP_ENABLE:

   SHORT_GROUP_ENABLE=" \
   ops     /VO=ops/GROUP=/ops/ROLE=lcgadmin         /VO=ops/GROUP=/ops/ROLE=production \
   supernemo.vo.eu-egee.org /VO=supernemo.vo.eu-egee.org/GROUP=/supernemo.vo.eu-egee.org/ROLE=production \
   "

(and so on with entries for every supported VO. Lovely.)

- From the directory containing site-info.def, finally ran yaim to configure the thing:

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n BDII_site -n glite-WN -n TORQUE_client -n lcg-CE -n TORQUE_server -n TORQUE_utils

(though possibly the order -n TORQUE_server -n lcg-CE -n TORQUE_utils would be better, glite bug #17585, but the above agrees with release notes for update 14)

- In /opt/edg/etc/edg-pbs-shostsequiv.conf and /opt/edg/etc/edg-pbs-knownhosts.conf, make sure the CE (torque server) is listed with both short and full hostnames, and re-run /opt/edg/sbin/edg-pbs-knownhosts and /opt/edg/sbin/edg-pbs-shostsequiv.

- Apel is now complaining that

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
       at org.glite.apel.pbs.ApelLogParser.<clinit>(ApelLogParser.java:34)

In /opt/glite/bin/apel-pbs-log-parser change

    LOG4J_CP="$APEL_HOME/share/glite-apel-core/java/log4j.jar"

to

    LOG4J_CP="/usr/share/java/log4j.jar"

I've also had to add a ulimit -n 3000 over the years...

Our customised APEL config means we can't use YAIM to configure it, so it needs updating manually. The BLAH parser config can be stolen as-is out of the template YAIM uses and added to the existing APEL config, with the gatekeeper and message parsing disabled.

Also, we separate multiple CEs. APEL got the CE SpecInts from the GRIS (on port 2135), which no longer exists; configuring it to use port 2170 makes it query the GIIS instead, and it will only ever pull out the first cluster's values. Instead I changed it to use hard-wired numbers.

- Spotted a weird problem with dynamic info publishing: queue status (production vs. draining, number of job slots available) was correct, but number of running jobs, ERT, etc. was stuck at zero. Fixed this by adding "edguser" to the list of allowed operators in PBS.
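i.e. something along the lines of (our torque server hostname shown - substitute your own):

    qmgr -c "set server operators += edguser@dgc-grid-35.brunel.ac.uk"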

- Failures from a loony number of dead globus-gma processes - see https://gus.fzk.de/ws/ticket_info.php?ticket=42981. Restarting the globus-gma service clears them out (temporary fix). Adding tout 120 to /opt/globus/etc/globus-gma.conf as suggested in the ticket seems to have fixed this for us.

- Add TMPDIR variable as per WNs

- Keep seeing

    dgc-grid-35 glite-lb-interlogd[24670]: queue_thread: event_queue_connect: edg_wll_gss_connect

once a minute; not cured by restarting the gLite service. I ended up turning the locallogger off by stopping the gLite service, without any obvious side effects.

WN

Simple WN: 32-bit, essentially standard, but also had to add a DNS-style VO (SuperNEMO). No pool SGM accounts

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.). Restored local override functions from old YAIM. For MPI, added repository, as per http://www.grid.ie/mpi/wiki/YaimConfig

- Installed packages with yum:

     yum install lcg-CA
   ( yum install glite-MPI_utils )
     yum install glite-TORQUE_client glite-WN

and local installs of needed VOMS server certificates.

- Create dummy config_sw_dir in /opt/glite/yaim/functions/local on all WNs to save pummelling the disk server (see http://www.gridpp.ac.uk/wiki/GLite_Update_27#Software_area ). Also don't need it on the CE unless adding new VO or changing to SGM pool account

- Copied site-info.def, etc., from CE and from the directory containing site-info.def, finally ran yaim to configure the thing:

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n glite-WN -n TORQUE_client

or

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n MPI_WN -n glite-WN -n TORQUE_client

- In /opt/edg/etc/edg-pbs-knownhosts.conf, make sure the CE (torque server) is listed with both short and full hostnames, and re-run /opt/edg/sbin/edg-pbs-knownhosts.

- In /opt/glite/lib/ fix up missing symbolic links to libraries.

- In /var/spool/pbs/mom_priv/config, check ideal_load and max_load have appropriate values.
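For example, for an 8-slot WN something like the following is plausible (tune to the hardware):

    $ideal_load 8.0
    $max_load 9.0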

- Apparently still not sorted out, after 5+ years: set TMPDIR to point to scratch space. In /etc/profile.d/glite_local.csh

# TMPDIR:
setenv TMPDIR /scratch

and in /etc/profile.d/glite_local.sh

# TMPDIR:
export TMPDIR=/scratch

This should probably be added via grid-env.sh

A better idea is to set

$tmpdir /scratch

in /var/spool/pbs/mom_priv/config, so that Torque creates a dedicated per-job subdirectory under /scratch and points TMPDIR at it; but that then also requires an extra customising step to cd into it, which may not work with the recent WMS change: in /etc/profile.d/glite_local.sh

 export GLITE_LOCAL_CUSTOMIZATION_DIR=/etc/glite

and in /etc/profile.d/glite_local.csh

 setenv GLITE_LOCAL_CUSTOMIZATION_DIR /etc/glite

then in $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh

# Move to per-job directory
cd $TMPDIR

but note Savannah bug 55237 and DON'T DO BOTH the glite_local and the torque fix!


- The Grid Monitoring is controlled by /opt/glite/etc/grid-cm-client-wn.conf - set active=0 if paranoid

DPM SE

Plain SE

32-bit, essentially standard. This is really an upgrade from the PPS version.

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.). Restored local override functions from old YAIM.

- The repo in the docs still points at the PPS repository; add the current ones:

    [glite-SE_dpm_mysql]
    name=gLite 3.1 DPM head node
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_mysql/sl4/i386/
    enabled=1
    
    [glite-SE_dpm_disk]
    name=gLite 3.1 DPM pool node
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_disk/sl4/i386/
    enabled=1
    
    #[dpm]
    #name=DPM PPS
    #baseurl=http://grid-deployment.web.cern.ch/grid-deployment/glite/pps/3.1/glite-SE_dpm_mysql/sl4/i386/
    #enabled=1

- Installed packages with yum:

     yum install lcg-CA
     yum install glite-SE_dpm_mysql glite-BDII

and local installs of needed VOMS server certificates.

- SE_GRIDFTP_LOGFILE now needs to be set explicitly in the siteinfo file.

- Configure with YAIM:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n SE_dpm_mysql

- According to http://www.gridpp.ac.uk/wiki/DPM_Log_File_Tracing, "(N.B. The original srmv2 daemon should be switched off.)" When reconfiguring, this should now be done by setting

     RUN_SRMV2DAEMON="no"

in /etc/sysconfig/srmv2 to prevent YAIM re-enabling it... Probably srmv1 too these days, though this results in an "srmv1: failed" message in syslog every minute although everything seems to work.

- Fix log rotation

- For httpd support, as well as enabling it in YAIM it will be necessary to run SELinux in permissive mode: SELinux will prevent the daemon from binding to port 884, and it also prevents direct access to the certificate and key in /etc/grid-security/. Add a daily cron job to restart the dpm-httpd daemon, as log rotation is broken.
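A sketch of those two pieces (the cron file name is arbitrary, and "dpm-httpd" is assumed to be the init script name):

     # SELinux: permissive now, and after reboots
     setenforce 0
     sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config

     # /etc/cron.daily/restart-dpm-httpd
     #!/bin/sh
     /sbin/service dpm-httpd restart > /dev/null 2>&1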

- If you get lots of

    send2nsd: NS002 - send error : Bad credentials

errors, check that the CRL update script/cron has run recently: if there are outdated CRLs present it seems this doesn't get run by YAIM.

- If renewing a certificate, you still need to update /home/edguser/.globus/usercert.pem and /home/edginfo/.globus/usercert.pem etc. by hand.

- Information publishing fails because /opt/glite/etc/gip/provider/glite-info-provider-release is a broken symbolic link to /opt/glite/libexec/glite-info-provider-release; the only thing anywhere near its stated target is /opt/glite/libexec/glite-info-provider-ldap. The link is created by config_gip_service_release but that's the only mention of either in YAIM.

Publishing wasn't working before either, but was then failing with

     Error for /opt/glite/etc/gip/plugin/glite-info-dynamic-se: line 2: /opt/lcg/libexec/lcg-info-dynamic-dpm: No such file or directory
     ==> slapadd: bad configuration file!


As this node's only used as a gridftp server the information publishing can wait.

According to GGUS ticket 31772, /opt/glite/libexec/glite-info-provider-release isn't supposed to exist, but YAIM thinks it does. One option is to create a dummy:

    touch /opt/glite/libexec/glite-info-provider-release
    chmod 755 /opt/glite/libexec/glite-info-provider-release
    chown edguser:infosys /opt/glite/libexec/glite-info-provider-release

but this then gives instead the error

    Error for dn: GlueSEControlProtocolLocalID=srm_v1,GlueSEUniqueID=dgc-grid-38.brunel.ac.uk,mds-vo-name=resource,o=grid
    ==> slapadd: bad configuration file!

The other option is to delete /opt/glite/etc/gip/provider/glite-info-provider-release - this gives the same error message as above. The file creation seems to have been fixed in YAIM, but wrongly-created files from previous installs must be removed by hand. At this point the info provider was working (updated LDIFs were being created OK) but the BDII refused to update.

I then had an extended battle with the thing. Note that I tried the fix of adding stuff to the /opt/glue/schema/ldap/Glue-CORE.schema file (see http://www.gridpp.ac.uk/wiki/DPM_SRMv2.2_Testing and Savannah bug 15532), which gave a different error message, and also applying the fix from https://savannah.cern.ch/bugs/?33202, both as an extra file in the specified directory and added to stub-resource, without success (predictable as this is openldap 2.2 on SLC4...)

As all else had failed, I tried reading the documentation. /opt/bdii/doc/README includes the URL of the main documentation, which in turn suggests looking in /opt/bdii/var/tmp/stderr.log for the raw bdii-update error messages; these turned out to be complaining (rightly) that /opt/lcg/schema/openldap-2.1/SiteInfo.schema doesn't exist. After commenting that line out of /opt/bdii/etc/schemas and restarting the bdii, information gets published.
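i.e. roughly (assuming the schemas file lists one schema path per line and the BDII init script is called bdii):

     sed -i 's|^/opt/lcg/schema/openldap-2.1/SiteInfo.schema|#&|' /opt/bdii/etc/schemas
     service bdii restart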

What puzzles me is that the glite-se_dpm_mysql YAIM target just runs config_bdii_only() which appears to only edit /opt/bdii/etc/schemas (rather than creating it). Now, I presumably have that file as I ran config_bdii() whilst blundering about (and it may have been there before since the PPS install), but why does YAIM try to modify it in config_bdii_only() when it presumably doesn't exist?

64-bit Head Node and Pool

The head node should be done first.

dpmmgr will need to have the same numerical user and group IDs on the head and pool nodes.

Head node

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.).

- Add current repo

    [glite-SE_dpm_mysql]
    name=glite 3.1 SE_dpm_mysql service
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_mysql/sl4/$basearch/
    enabled=1
    gpgcheck=0

- Installed packages with yum:

     yum install lcg-CA
     yum install glite-SE_dpm_mysql

and local installs of needed VOMS server and host certificates

- Restored local override functions from old YAIM and make local siteinfo

- Configure with YAIM:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n SE_dpm_mysql

- If this complains about the dpmmgr group, make sure to delete the edg-users and groups and re-config. The complaint from the BDII about the missing schema file is apparently normal. The complaint about /opt/bdii/var/ missing will disappear if you re-config.

- According to http://www.gridpp.ac.uk/wiki/DPM_Log_File_Tracing, "(N.B. The original srmv2 daemon should be switched off.)"

- In /etc/rc.d/init.d/dpm-gsiftp make sure

     ulimit -v 51200 

is removed from start() - was a fix for Savannah bug https://savannah.cern.ch/bugs/?func=detailitem&item_id=28922

- Fix log rotation

Pool Node

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.).

- Add current repo

    [glite-SE_dpm_disk]
    name=glite 3.1 SE_dpm_disk service
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_disk/sl4/$basearch/
    enabled=1
    gpgcheck=0

- Installed packages with yum:

     yum install lcg-CA
     yum install glite-SE_dpm_disk

and local installs of needed VOMS server and host certificates

- Restored local override functions from old YAIM and make local siteinfo

- Configure with YAIM:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n SE_dpm_disk

- In /etc/rc.d/init.d/dpm-gsiftp make sure

     ulimit -v 51200 

is removed from start() - was a fix for Savannah bug https://savannah.cern.ch/bugs/?func=detailitem&item_id=28922

- Fix log rotation

MON / site BDII

Install 32bit SLC4, jpackage/DAG repos, etc., lcg-CA and lcg-vomscerts, and host certificates. Then:

    yum install glite-MON glite-BDII /storage/jdk-1_5_0_14-linux-i586.rpm

(where that's a local copy of the Sun SDK RPM, as SLC doesn't get JDK 1.5.0-16 yet) and

    yum install mysql-server

and change the password. This leaves me with

    glite-yaim-core-4.0.5-7
    glite-yaim-mon-4.0.2-6
    glite-yaim-bdii-4.0.4-6
    mysql-4.1.22-2.el4.sl
    

Configure

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n MON -n BDII_site

It all succeeds apart from

    Starting rgma-servicetool:                                 [FAILED]
    For more details check /var/log/glite/rgma-servicetool.log

where /var/log/glite/rgma-servicetool.log contains the ever-useful

    Wed Nov  5 23:03:29 UTC 2008: Starting rgma-servicetool
    Wed Nov  5 23:03:32 UTC 2008: rgma-servicetool Failed to Start

This probably was because I got the Java path wrong in site-info.def, or else due to the changed MySQL name resolution.

Also, YAIM reports

    chown: cannot access `/opt/bdii/var': No such file or directory
    sed: can't read /opt/bdii/etc/schemas: No such file or directory

which is claimed in the docs to be a bug, but it's not clear what to do.

R-GMA/tomcat doesn't seem to be running properly - fix /etc/hosts to Red Hat style, i.e. with both FQDN and localhost set to 127.0.0.1, reboot and re-run yaim anyway.
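i.e. an /etc/hosts line along these lines (the hostname is a placeholder):

    127.0.0.1   mon.brunel.ac.uk mon localhost.localdomain localhost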

Make sure that all certificates have been updated - possible bug in YAIM:

    ls -l /etc/tomcat5/host*
    ls -l /etc/grid-security/host*
    ls -l /opt/glite/var/rgma/.certs/host*

Now APEL has been configured locally, so copy the local YAIM override functions into /opt/glite/yaim/functions/local/, and copy over our custom APEL config files and cron job from the old server.

After rebooting, rgma-server-check and rgma-client-check (from a Grid job) both succeed, and the info looks about right from jXplorer.

Glite 3.0

I followed the instructions on http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Manual-Upgrade/; the following points were of note:

All nodes: There is an error in /opt/glite/yaim/functions/config_lcgenv, line 17, described here:

http://savannah.cern.ch/bugs/?func=detailitem&item_id=16895

Worker nodes: clash with boost-g3:

  apt-get remove lcg-WN_torque
  apt-get install glite-WN glite-torque-client-config
  Reading Package Lists... Done
  Building Dependency Tree... Done
  The following extra packages will be installed:
  DPM-client (1.5.7-1sec)
  LFC-client (1.5.7-1sec)
  LFC-interfaces (1.5.7-1)
  boost (1.32.0-6)

then...

  boost-g3 = 1.29.1 is needed by edg-wl-common-api_gcc3_2_2-
  lcg2.1.74-3_sl3
  boost-g3 >= 1.29.1 is needed by glite-WN-3.0.2-0
  E: Transaction set check failed

this was solved by removing package boost-g3 and re-running

  apt-get install glite-WN glite-torque-client-config

SE DPM head: nothing critical:
For GridView, in /opt/lcg/etc/lcg-mon-gridftp.conf change LOG_FILE to /var/log/dpm-gsiftp/dpm-gsiftp.log and restart lcg-mon-gridftp service.

Disk pool server: nothing of note

MON box/LFC server: nothing critical:
In /etc/cron.d/check-tomcat, add a log file to save an hourly e-mail:
10 * * * * root /etc/rc.d/init.d/tomcat5 start >> /var/log/cron-check-tomcat 2>&1

UI

20:41. Henry is in the machine room. His task is to upgrade the UI.

  • Remove old metapackage, and change repository. As per upgrade instructions, do "apt-get install glite-UI" followed by "apt-get dist-upgrade"
  • "apt-get dist-upgrade" fails because of boost-g3 version problem. Manually remove boost-g3 makes "apt-get dist-upgrade" apparently succeed.
  • Edit site-info file as needed and run configure_node. It fails with
    ...
    Configuring config_lcgenv
    /opt/glite/yaim/scripts/configure_node: line 17: dgc-grid-34.brunel.ac.uk: command not found
    Configuring config_replica_manager
    ...

    Fix as per Savannah bug above.
  • Configure_node still fails, now with
    ...
    Configuring config_glite
    Traceback (most recent call last):
       File "/opt/glite/yaim/functions/../libexec/YAIM2gLiteConvertor.py", line 437, in ?
          buildUIsets( gLiteDom['glite-ui.cfg.xml'] )
       KeyError: glite-ui.cfg.xml
    Configuration Complete
    ...

    It turns out that /opt/glite/etc/config/templates/glite-ui.cfg.xml is missing - this comes from the glite-ui-config RPM which hasn't been installed. Running the full install_node fetches this RPM, as well as a number of others. See GGUS ticket 9230.
  • Now configure_node complains that WMS_HOST hasn't been set. Add "WMS_HOST=" and it apparently goes through OK. Successfully submit a couple of "hello world" test jobs.
  • In /opt/edg/etc/profile.d/edg-wl-ui-gui-env.[c]sh, /opt/edg/var/etc/profile.d/edg-wl-ui-gui-env.[c]sh and /opt/glite/etc/profile.d/glite-wmsui-gui-env.[c]sh, modify default value of JAVA_INSTALL_PATH to point at your Java.
  • In /opt/edg/etc/edg_wl_ui_cmd_var.conf, /opt/edg/etc/edg_wl_ui_gui_var.conf, and /opt/glite/etc/glite_wmsui_cmd_var.conf, set useful default value of DefaultVo.
  • If appropriate, enable load-sharing for RAL RBs ( http://www.gridpp.ac.uk/wiki/RAL_Tier1_Work_Load_Management#Local_Deployment_Information ).

22:58. Big Brother calls Henry into the Wiki room...

LCG CE: dgc-grid-40: nothing major of note
In /opt/lcg/var/gip/ldif/static-file-Cluster.ldif edit GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs to show correct values, and restart globus-mds.

LCG CE/WN: dgc-grid-35: nothing major of note

  • Shut down mysql service that gets installed
  • Avoid boost version issue by doing Upgrade step 4 (dist-upgrade) BEFORE step 3 (install glite-WN)
  • In /opt/lcg/var/gip/ldif/static-file-Cluster.ldif edit GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs to show correct values, and restart globus-mds.
  • In /etc/logrotate.d/gridftp set rotation time for /var/log/globus-gridftp.log to a lot more than 31 days.
  • In /opt/edg/etc/edg-pbs-knownhosts.conf the FQDN of the gatekeeper appears twice - make one the short name and re-run edg-pbs-knownhosts.
  • After upgrade to Torque v. 2, remove localhost entry from /var/spool/pbs/mom_priv/config on all WNs.
  • Create dummy config_sw_dir in /opt/glite/yaim/functions/local on all WNs to save pummelling the disk server (see http://www.gridpp.ac.uk/wiki/GLite_Update_27#Software_area ). Also don't need it on the CE unless adding new VO or changing to SGM pool account.
  • Gridice CPU usage sometimes goes through the roof; in /etc/cron.d/gridice_restart:
    8 5,17 * * * root /etc/rc.d/init.d/gridice_daemons restart >> /var/log/gridice_restart 2>&1