Brunel

Computing

Brunel University has three separate clusters / CEs accessible via the Grid:

  • GreenStripe:
    • GreenStripe has just two nodes, on the public network. It is really just meant as a debugging aid, esp. for networking issues, and for understanding new middleware updates and VOs. It also serves the NGS.
      GreenStripe is currently on SLC4/glite310 and is believed to be fully functional (i.e. passes SFTs).
    • Lead admin: Henry
  • Triton
    • A farm of twin quad-core 64-bit Xeons (2.5 GHz, 16 GB).
    • Lead admin: Raul
  • Netlab
    • A SRIF-2 funded cluster of twin dual-core 64-bit Opterons (1.8 GHz, 4 GB).
    • Lead admin: Raul
  • Argo
    • Argo was a SRIF-1 funded cluster of 64 dual-Xeons (2.4 GHz, 2 GB, no hyperthreading). Half the nodes had IDE drives and half SCSI. LCG nominally received 50% of this resource.
      Argo has now been retired. The final score was 8 IDE nodes failed (disk systems) vs. 1 SCSI node failed (system board).

Storage

We have three DPM-based SEs:

  • dgc-grid-50
    • dgc-grid-50 is now the main production SE. Currently this has three 20 TB SATA RAID pool servers.
    • Lead admin: Raul
  • dgc-grid-34
    • This is the old SE, with two 6 TB SATA RAID pool servers. It will be withdrawn from service shortly.
    • Lead admin: Raul
  • dgc-grid-38
    • dgc-grid-38 is a 400GB IDE RAID5 system used by the MICE (http://mice.iit.edu/) experiment.
    • Lead admin: Henry
  • In common with LT2 policy, the SE storage is not backed up and should be treated as "volatile" (this should be indicated by GlueSAPolicyFileLifeTime).

Log

Rotating SFT reservations in Maui

I have managed to get SFTs to use the reserved node only as a last resort by adding a "-" sign after the dteam and ops groups:

SRCFG[sft] GROUPLIST=dteam-,ops-

which works quite well if there are spare nodes. There is also the option

SRCFG[sft] FLAGS=SPACEFLEX

which is supposed to rotate the standing reservation onto different nodes when the new reservation is made:

"reservation is allowed to move from host to host over time in an attempt to optimize resource utilization"

http://www.clusterresources.com/products/maui/docs/7.1.5managingreservations.shtml

This is done at midnight (when using the default PERIOD of DAY with default STARTTIME of 00:00:00:00 and ENDTIME of 24:00:00). I had to increase the DEPTH to 4 days so that a reservation could definitely be made, in other words sufficiently far ahead to be sure all nodes would be free, but it always seems to add the new reservation onto the existing reserved node.

Upgrade Log

Glite 3.2

UI

64-bit UI: Install generic SL5 OS + DAG repo, NTP etc. You need to make sure that hostname -f returns something like a FQDN, else yaim will fail (even though the UI doesn't really need one). Follow the install guide https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320 in setting up the gLite and lcg-CA repos, then

  yum groupinstall glite-UI
  yum install lcg-CA
  /opt/glite/yaim/bin/yaim -c -s site-info.def -n glite-UI
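
The hostname requirement above can be checked before running yaim. The following is only a sketch: the function name is_fqdn is made up for illustration, and the test is purely syntactic (at least two dot-separated labels), which is an assumption about what yaim wants rather than a statement of what it actually checks.

```shell
# Succeed if the argument looks like a fully qualified domain name:
# two or more dot-separated labels made of letters, digits or hyphens.
# Syntactic check only; nothing is looked up in DNS.
is_fqdn() {
    printf '%s\n' "$1" | grep -Eq \
        '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$'
}

# Possible use before configuring (commented out so the sketch has no side effects):
# is_fqdn "$(hostname -f)" || echo "hostname -f does not return a FQDN" >&2
```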

- In /opt/edg/etc/edg_wl_ui_cmd_var.conf, /opt/edg/etc/edg_wl_ui_gui_var.conf and /opt/glite/etc/glite_wmsui_cmd_var.conf set a sensible default VO if needed.

- Set up a weekly cron job to touch /tmp/jobOutput, to stop it getting cleaned away. Also worth doing in rc.local for those longer downtimes...

- You will need to put the certificates of your VOMS servers in /etc/grid-security/vomsdir/, else glite-wms-job-status will fail with:

  **** Error: API_NATIVE_ERROR ****
  Error while calling the "UcWrapper::getExpiration" native api
  Cannot verify AC signature!
  (Please check if the host certificate of the VOMS server that has issued your proxy is installed on this machine)

even though submission worked fine!

- and either way glite-wms-job-output fails with

  Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
  
  Error - Operation Failed
  Unable to retrieve the output

but clears the sandbox from the WMS. Brilliant.

- You need the 32-bit version of the openldap libraries installed, else lcg-cp fails.

- It looks like there might be an undeclared dependency of the UI on zlib-devel to get adler32 or CRC32 checksum validation of transfers (i.e. lcg-cp --checksum ...), though MD5 is OK.
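
For the /tmp/jobOutput item above, a small helper along these lines could be called from both the weekly cron job and rc.local. It is a sketch only: the function name, the script path in the comment and the cron schedule are hypothetical, and recreating the directory with mode 1777 assumes that it is a world-writable directory that users' job output lands in.

```shell
# Refresh the timestamp on the job output directory so that periodic /tmp
# cleaners do not treat it as stale. Recreate it if it has disappeared
# (for example after a long downtime).
keep_job_output() {
    dir="${1:-/tmp/jobOutput}"
    if [ ! -d "$dir" ]; then
        # Assumption: users write here, so make it world-writable with the
        # sticky bit, like /tmp itself.
        mkdir -p "$dir" && chmod 1777 "$dir" || return 1
    fi
    touch "$dir"
}

# Hypothetical /etc/cron.d entry (weekly), assuming the function is wrapped in
# an executable script at /usr/local/sbin/keep-job-output:
#   0 3 * * 0 root /usr/local/sbin/keep-job-output
```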

Glite 3.1

Upgrade to SL(C)4

CE

LCG CE/WN: dgc-grid-35: 32-bit, essentially standard, but also had to add a DNS-style VO (SuperNEMO). No pool SGM accounts.

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.). Restored local override functions from old YAIM.

- Installed packages with yum:

     yum install glite-BDII lcg-CE glite-TORQUE_client glite-TORQUE_utils glite-TORQUE_server glite-WN

This doesn't pull in the CA certs, so also need

     yum install lcg-CA

and local installs of needed VOMS server certificates (and host certs!).

- Comment out the __GROUP_ENABLE variable from the 'requires' list in config_gip_ce_check() in /opt/glite/yaim/functions/config_gip_ce, see https://savannah.cern.ch/bugs/index.php?35890 (I think this is no longer needed)

- Create dummy config_sw_dir in /opt/glite/yaim/functions/local on all WNs to save pummelling the disk server (see http://www.gridpp.ac.uk/wiki/GLite_Update_27#Software_area ). Also don't need it on the CE unless adding new VO or changing to SGM pool account

- In /opt/glite/yaim/functions/local on the CE, need to modify config_gip_ce in order to publish correctly the GlueCESEBindMountInfo for all CEs. Need to add

    "$SE_HOST1") accesspoint="/dpm/${MY_DOMAIN}/home";;
    "$SE_HOST2") accesspoint="/dpm/${MY_DOMAIN}/mice1";;
    "$SE_HOST3") accesspoint="/dpm/${MY_DOMAIN}/home2";;

in the case statement around line 350 (GGUS ticket #36792). (I think this is no longer needed)

- Updated site-info.def with various new stuff; esp. add comma to port ranges and on site BDII modify Mds-vo-name/port number to pick up new BDII-powered GRISes.

- For SuperNEMO, created vo.d subdirectory with the SuperNEMO entries in supernemo.vo.eu-egee.org.

The YAIM guide (https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400#vo_d_directory) and GridPP Wiki entry (http://www.gridpp.ac.uk/wiki/Enabling_the_DNS_VO_style_test.gridpp.ac.uk) contradict each other - in particular the YAIM guide mentions "a shorter name for the VO" which I think is bogus - the authors have got confused with Unix groups and PBS queues that happen to use the same name. Basically, two things need to happen: members (FQANs) of VO supernemo.vo.eu-egee.org need to get mapped to snemo001:snemo, etc.; and YAIM must ensure that users in group snemo are allowed to submit to the desired queue (here, "short"). The obvious place to spell out those connections is groups.conf, which has a field at the end for the VO, but as I've never seen that used I wimped out and kept it standard, e.g.

   "/VO=supernemo.vo.eu-egee.org/GROUP=/supernemo.vo.eu-egee.org":snemo:3700::

Instead, I put the full VO name in the users.conf file, e.g.

   37001:snemo001:3700:snemo:supernemo.vo.eu-egee.org::

In site-info.def the entry in VOS is the full VO name (supernemo.vo.eu-egee.org), the entry in QUEUES the queue name ("short") and these are connected by the disaster area that is SHORT_GROUP_ENABLE:

   SHORT_GROUP_ENABLE=" \
   ops     /VO=ops/GROUP=/ops/ROLE=lcgadmin         /VO=ops/GROUP=/ops/ROLE=production \
   supernemo.vo.eu-egee.org /VO=supernemo.vo.eu-egee.org/GROUP=/supernemo.vo.eu-egee.org/ROLE=production \
   "

(and so on with entries for every supported VO. Lovely.)
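
The users.conf entries for the DNS-style VO follow a regular pattern, so they can be generated rather than typed. This is a sketch, not how YAIM or anyone else necessarily does it: the function name gen_users_conf is invented, and taking the UID as a base plus the account number is just an assumption that happens to reproduce the 37001/snemo001 line shown above.

```shell
# Print users.conf pool-account lines of the form
#   UID:LOGIN:GID:GROUP:VO:FLAG:
# with an empty FLAG field, as in the snemo001 example.
# Arguments: base UID, login prefix, GID, Unix group, full VO name, count.
gen_users_conf() {
    base_uid=$1; prefix=$2; gid=$3; group=$4; vo=$5; count=$6
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%d:%s%03d:%d:%s:%s::\n' \
            "$((base_uid + i))" "$prefix" "$i" "$gid" "$group" "$vo"
        i=$((i + 1))
    done
}

# Example: gen_users_conf 37000 snemo 3700 snemo supernemo.vo.eu-egee.org 50
```

Note that the full VO name goes in the fifth field, while the login prefix and Unix group stay short.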

- From the directory containing site-info.def, finally ran yaim to configure the thing:

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n BDII_site -n glite-WN -n TORQUE_client -n lcg-CE -n TORQUE_server -n TORQUE_utils

(though possibly the order -n TORQUE_server -n lcg-CE -n TORQUE_utils would be better, glite bug #17585, but the above agrees with release notes for update 14)

- In /opt/edg/etc/edg-pbs-shostsequiv.conf and /opt/edg/etc/edg-pbs-knownhosts.conf, make sure the CE (torque server) is listed with both short and full hostnames, and re-run /opt/edg/sbin/edg-pbs-knownhosts and /opt/edg/sbin/edg-pbs-shostsequiv.
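
A quick way to check the short-and-full hostname requirement above is sketched here. The function name is hypothetical, and the parsing assumes only that host names in these files are separated by whitespace, "=" or ","; I have not checked that against the actual file format.

```shell
# Succeed only if FILE mentions both the FQDN and its short form as
# separate entries (a short name that only appears as the prefix of the
# FQDN does not count).
lists_both_names() {
    file=$1; fqdn=$2
    short=${fqdn%%.*}
    tokens=$(tr -s ' \t=,' '\n\n\n\n' < "$file")
    printf '%s\n' "$tokens" | grep -Fxq -- "$fqdn" &&
        printf '%s\n' "$tokens" | grep -Fxq -- "$short"
}

# Example:
# lists_both_names /opt/edg/etc/edg-pbs-knownhosts.conf "$(hostname -f)" \
#     || echo "CE missing as short or full name" >&2
```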

- Apel is now complaining that

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
       at org.glite.apel.pbs.ApelLogParser.<clinit>(ApelLogParser.java:34)

In /opt/glite/bin/apel-pbs-log-parser change

    LOG4J_CP="$APEL_HOME/share/glite-apel-core/java/log4j.jar"

to

    LOG4J_CP="/usr/share/java/log4j.jar"

Our customised APEL config means we can't use YAIM to configure it, so needs updating manually. The BLAH parser config can be stolen as-is out of the template YAIM uses, and added to the existing APEL config, and the gatekeeper and message parsing disabled.

Also we separate multiple CEs. APEL got the CE specints from the GRIS (on 2135) which no longer exists; configuring it to use 2170 makes it query the GIIS instead and it will only ever pull out the first cluster's values. Instead changed it to use hard-wired numbers.
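
The LOG4J_CP change described above has to be redone whenever the parser script is replaced by an update, so it may be worth scripting. A minimal sketch, with a made-up function name, assuming the assignment sits on a line of its own:

```shell
# Point LOG4J_CP in the given script at the system log4j jar.
# The file is rewritten via a temporary copy and then overwritten in place,
# so its ownership and mode are kept.
fix_log4j_cp() {
    file=$1
    tmp="$file.tmp.$$"
    sed 's|^\([[:space:]]*LOG4J_CP=\).*$|\1"/usr/share/java/log4j.jar"|' \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example: fix_log4j_cp /opt/glite/bin/apel-pbs-log-parser
```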

- Spotted a weird problem with dynamic info publishing: queue status (production vs. draining, number of job slots available) was correct, but number of running jobs, ERT, etc. was stuck at zero. Fixed this by adding "edguser" to the list of allowed operators in PBS.

- Failures from loony number of dead globus-gma processes - see https://gus.fzk.de/ws/ticket_info.php?ticket=42981. Restarting the globus-gma service clears them out (temporary fix). Adding tout 120 to /opt/globus/etc/globus-gma.conf as suggested in the ticket seems to have fixed this for us.

- Add TMPDIR variable as per WNs

WN

Simple WN: 32-bit, essentially standard, but also had to add a DNS-style VO (SuperNEMO). No pool SGM accounts.

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.). Restored local override functions from old YAIM. For MPI, added repository, as per http://www.grid.ie/mpi/wiki/YaimConfig

- Installed packages with yum:

     yum install lcg-CA
   ( yum install glite-MPI_utils )
     yum install glite-TORQUE_client glite-WN

and local installs of needed VOMS server certificates.

- Create dummy config_sw_dir in /opt/glite/yaim/functions/local on all WNs to save pummelling the disk server (see http://www.gridpp.ac.uk/wiki/GLite_Update_27#Software_area ). Also don't need it on the CE unless adding new VO or changing to SGM pool account

- Copied site-info.def, etc., from the CE and, from the directory containing site-info.def, finally ran yaim to configure the thing:

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n glite-WN -n TORQUE_client

or

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n MPI_WN -n glite-WN -n TORQUE_client

- In /opt/edg/etc/edg-pbs-knownhosts.conf, make sure the CE (torque server) is listed with both short and full hostnames, and re-run /opt/edg/sbin/edg-pbs-knownhosts.

- In /opt/glite/lib/ fix up missing symbolic links to libraries.

- In /var/spool/pbs/mom_priv/config, check ideal_load and max_load have appropriate values.

- Apparently still not sorted out, after 5+ years: set TMPDIR to point to scratch space. In /etc/profile.d/glite_local.csh

    # TMPDIR:
    setenv TMPDIR /scratch

and in /etc/profile.d/glite_local.sh

    # TMPDIR:
    export TMPDIR=/scratch

This should probably be added via grid-env.sh

A better idea is to set

    $tmpdir /scratch

in /var/spool/pbs/mom_priv/config, so that Torque creates a dedicated per-job subdirectory under /scratch and points TMPDIR at it. That then also requires an extra customising step to cd into it, which may not work with recent WMS changes. In /etc/profile.d/glite_local.sh

 export GLITE_LOCAL_CUSTOMIZATION_DIR=/etc/glite

and in /etc/profile.d/glite_local.csh

 setenv GLITE_LOCAL_CUSTOMIZATION_DIR /etc/glite

then in $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh

    # Move to per-job directory
    cd $TMPDIR

but note Savannah bug 55237 (https://savannah.cern.ch/bugs/index.php?55237) and DON'T DO BOTH the glite_local and the torque fix!
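
Since the two TMPDIR fixes must not be combined, a check like the following could be run on each WN. It is only a sketch: the function name is invented, and it simply looks for an uncommented TMPDIR assignment in the profile script and an uncommented $tmpdir line in the pbs_mom config.

```shell
# Succeed (i.e. report a conflict) if TMPDIR is set in the given profile
# script AND $tmpdir is set in the given pbs_mom config.
tmpdir_fix_conflict() {
    profile=$1; momcfg=$2
    grep -Eq '^[[:space:]]*(export[[:space:]]+TMPDIR=|setenv[[:space:]]+TMPDIR[[:space:]])' \
        "$profile" 2>/dev/null &&
        grep -Eq '^[[:space:]]*\$tmpdir[[:space:]]' "$momcfg" 2>/dev/null
}

# Example:
# tmpdir_fix_conflict /etc/profile.d/glite_local.sh /var/spool/pbs/mom_priv/config \
#     && echo "both TMPDIR fixes are active, remove one" >&2
```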


- The Grid Monitoring is controlled by /opt/glite/etc/grid-cm-client-wn.conf - set active=0 if paranoid

DPM SE

Plain SE

32-bit, essentially standard. This is really an upgrade from the PPS version.

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.). Restored local override functions from old YAIM.

- Repo in docs still points at PPS repository; add current ones.

    [glite-SE_dpm_mysql]
    name=gLite 3.1 DPM head node
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_mysql/sl4/i386/
    enabled=1
    
    [glite-SE_dpm_disk]
    name=gLite 3.1 DPM pool node
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_disk/sl4/i386/
    enabled=1
    
    #[dpm]
    #name=DPM PPS
    #baseurl=http://grid-deployment.web.cern.ch/grid-deployment/glite/pps/3.1/glite-SE_dpm_mysql/sl4/i386/
    #enabled=1

- Installed packages with yum:

     yum install lcg-CA
     yum install glite-SE_dpm_mysql glite-BDII

and local installs of needed VOMS server certificates.

- SE_GRIDFTP_LOGFILE now needs to be set explicitly in the siteinfo file.

- Configure with YAIM:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n SE_dpm_mysql

- According to http://www.gridpp.ac.uk/wiki/DPM_Log_File_Tracing, "(N.B. The original srmv2 daemon should be switched off.)" When reconfiguring, this should now be done by setting

     RUN_SRMV2DAEMON="no"

in /etc/sysconfig/srmv2 to prevent YAIM re-enabling it... Probably srmv1 too these days, though this results in an "srmv1: failed" message in syslog every minute although everything seems to work.

- Fix log rotation

- For httpd support, as well as enabling in YAIM it will be necessary to run SE Linux in permissive mode. SE Linux will prevent the daemon from binding to port 884, and it also prevents direct access to the certificate and key in /etc/grid-security/. Add a daily cron job to restart the dpm-httpd daemon, as log rotation is broken.

- If you get lots of

    send2nsd: NS002 - send error : Bad credentials

errors, check that the CRL update script/cron has run recently: if there are outdated CRLs present it seems this doesn't get run by YAIM.
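
To see whether the CRLs really are out of date, one can list the ones that have not been refreshed recently. A sketch, assuming the CRLs are the *.r0 files in the usual CA directory and that anything untouched for more than a couple of days is suspect (the threshold is a guess, not a documented limit):

```shell
# List CRL files (*.r0) under DIR that were last modified more than DAYS
# days ago. No output means all CRLs have been refreshed recently.
stale_crls() {
    dir=$1; days=$2
    find "$dir" -name '*.r0' -mtime +"$days"
}

# Example: stale_crls /etc/grid-security/certificates 2
```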

- Information publishing fails as /opt/glite/etc/gip/provider/glite-info-provider-release is a broken symbolic link to /opt/glite/libexec/glite-info-provider-release; the only thing anywhere near its stated target is /opt/glite/libexec/glite-info-provider-ldap. The link is created by /config_gip_service_release but that's the only mention of either in YAIM.

Publishing wasn't working before, but then it failed with

     Error for /opt/glite/etc/gip/plugin/glite-info-dynamic-se: line 2: /opt/lcg/libexec/lcg-info-dynamic-dpm: No such file or directory
     ==> slapadd: bad configuration file!


As this node's only used as a gridftp server the information publishing can wait.

According to GGUS ticket 31772, /opt/glite/libexec/glite-info-provider-release isn't supposed to exist, but YAIM thinks it does. One option is to create a dummy:

    touch /opt/glite/libexec/glite-info-provider-release
    chmod 755 /opt/glite/libexec/glite-info-provider-release
    chown edguser:infosys /opt/glite/libexec/glite-info-provider-release

but this then gives instead the error

    Error for dn: GlueSEControlProtocolLocalID=srm_v1,GlueSEUniqueID=dgc-grid-38.brunel.ac.uk,mds-vo-name=resource,o=grid
    ==> slapadd: bad configuration file!

The other option is to delete /opt/glite/etc/gip/provider/glite-info-provider-release - this gives the same error message as above. The file creation seems to have been fixed in YAIM, but wrongly-created files from previous installs must be removed by hand. At this point the info provider was working (updated LDIFs were being created OK) but the BDII refused to update.

I then had an extended battle with the thing. Note that I tried the fix of adding stuff to the /opt/glue/schema/ldap/Glue-CORE.schema file (see http://www.gridpp.ac.uk/wiki/DPM_SRMv2.2_Testing and Savannah bug 15532), which gave a different error message, and also applying the fix from https://savannah.cern.ch/bugs/?33202, both as an extra file in the specified directory and added to stub-resource, without success (predictable as this is openldap 2.2 on SLC4...)

As all else had failed, I tried reading the documentation. /opt/bdii/doc/README includes the URL of the main documentation, which in turn suggests looking in /opt/bdii/var/tmp/stderr.log for the raw bdii-update error messages, which turned out to be complaining (rightly) that /opt/lcg/schema/openldap-2.1/SiteInfo.schema doesn't exist. After commenting that line out of /opt/bdii/etc/schemas and restarting the BDII, information gets published.

What puzzles me is that the glite-se_dpm_mysql YAIM target just runs config_bdii_only() which appears to only edit /opt/bdii/etc/schemas (rather than creating it). Now, I presumably have that file as I ran config_bdii() whilst blundering about (and it may have been there before since the PPS install), but why does YAIM try to modify it in config_bdii_only() when it presumably doesn't exist?

64-bit Head Node and Pool

The head node should be done first.

dpmmgr will need to have the same numerical user and group IDs on the head and pool nodes.
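
The ID requirement is easy to get wrong when the nodes are installed at different times, so here is a sketch of a check. The function names are made up, and it assumes you have saved the passwd entry from each node to a file (e.g. with getent passwd dpmmgr > head.passwd on the head node and likewise on the pool node).

```shell
# Print "UID:GID" for USER from a passwd-format file
# (name:password:UID:GID:gecos:home:shell).
ids_for_user() {
    awk -F: -v u="$2" '$1 == u { print $3 ":" $4 }' "$1"
}

# Succeed if USER exists in both files with identical numeric UID and GID.
same_ids() {
    a=$(ids_for_user "$1" "$3")
    b=$(ids_for_user "$2" "$3")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example: same_ids head.passwd pool.passwd dpmmgr || echo "dpmmgr IDs differ" >&2
```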

Head node

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.).

- Add current repo

    [glite-SE_dpm_mysql]
    name=glite 3.1 SE_dpm_mysql service
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_mysql/sl4/$basearch/
    enabled=1
    gpgcheck=0

- Installed packages with yum:

     yum install lcg-CA
     yum install glite-SE_dpm_mysql

and local installs of needed VOMS server and host certificates

- Restored local override functions from old YAIM and make local siteinfo

- Configure with YAIM:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n SE_dpm_mysql

- If this complains about the dpmmgr group, make sure to delete the edg-users and groups and re-config. The complaint from the BDII about the missing schema file is apparently normal. The complaint about /opt/bdii/var/ missing will disappear if you re-config.

- According to http://www.gridpp.ac.uk/wiki/DPM_Log_File_Tracing, "(N.B. The original srmv2 daemon should be switched off.)"

- In /etc/rc.d/init.d/dpm-gsiftp make sure

     ulimit -v 51200 

is removed from start() - was a fix for Savannah bug https://savannah.cern.ch/bugs/?func=detailitem&item_id=28922

- Fix log rotation

Pool Node

- Installed OS, Java and added repositories according to generic install guide (https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide310). For jpackage repository used the UK mirror (i.e. baseurl=http://www.mirrorservice.org/sites/jpackage.org/1.7/generic/free/, etc.).

- Add current repo

    [glite-SE_dpm_disk]
    name=glite 3.1 SE_dpm_disk service
    baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-SE_dpm_disk/sl4/$basearch/
    enabled=1
    gpgcheck=0

- Installed packages with yum:

     yum install lcg-CA
     yum install glite-SE_dpm_disk

and local installs of needed VOMS server and host certificates

- Restored local override functions from old YAIM and make local siteinfo

- Configure with YAIM:

     /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n SE_dpm_disk

- In /etc/rc.d/init.d/dpm-gsiftp make sure

     ulimit -v 51200 

is removed from start() - was a fix for Savannah bug https://savannah.cern.ch/bugs/?func=detailitem&item_id=28922

- Fix log rotation

MON / site BDII

Install 32bit SLC4, jpackage/DAG repos, etc., lcg-CA and lcg-vomscerts, and host certificates. Then:

    yum install glite-MON glite-BDII /storage/jdk-1_5_0_14-linux-i586.rpm

(where that's a local copy of the Sun SDK RPM, as SLC doesn't get JDK 1.5.0-16 yet) and

    yum install mysql-server

and change the password. This leaves me with

    glite-yaim-core-4.0.5-7
    glite-yaim-mon-4.0.2-6
    glite-yaim-bdii-4.0.4-6
    mysql-4.1.22-2.el4.sl
    

Configure

    /opt/glite/yaim/bin/yaim -c -s /root/glite3_1_0/site-info.def -n MON -n BDII_site

It all succeeds apart from

    Starting rgma-servicetool:                                 [FAILED]
    For more details check /var/log/glite/rgma-servicetool.log

where /var/log/glite/rgma-servicetool.log contains the ever-useful

    Wed Nov  5 23:03:29 UTC 2008: Starting rgma-servicetool
    Wed Nov  5 23:03:32 UTC 2008: rgma-servicetool Failed to Start

This probably was because I got the Java path wrong in site-info.def, or else due to the changed MySQL name resolution.

Also, YAIM reports

    chown: cannot access `/opt/bdii/var': No such file or directory
    sed: can't read /opt/bdii/etc/schemas: No such file or directory

which is claimed in the docs to be a bug, but it is not clear what to do.

R-GMA/tomcat doesn't seem to be running properly - fix /etc/hosts to Red Hat style, i.e. with both FQDN and localhost set to 127.0.0.1, reboot and re-run yaim anyway.

Make sure that all certificates have been updated - possible bug in YAIM:

    ls -l /etc/tomcat5/host*
    ls -l /etc/grid-security/host*
    ls -l /opt/glite/var/rgma/.certs/host*
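
Comparing the listings by eye works, but the copies can also be compared byte for byte. A sketch with a hypothetical function name; the hostcert.pem file name in the example is an assumption, as the listings above only match on host*.

```shell
# Compare a reference file against one or more copies. Succeeds only if
# every copy exists and is byte-identical to the reference; otherwise
# names the first offending copy.
copies_match() {
    ref=$1; shift
    for f in "$@"; do
        cmp -s "$ref" "$f" || { echo "differs or missing: $f"; return 1; }
    done
}

# Example (file names assumed):
# copies_match /etc/grid-security/hostcert.pem \
#     /etc/tomcat5/hostcert.pem /opt/glite/var/rgma/.certs/hostcert.pem
```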

Now APEL has been configured locally, so copy the local YAIM override functions into /opt/glite/yaim/functions/local/, and copy over our custom APEL config files and cron job from the old server.

After rebooting, rgma-server-check and rgma-client-check (from a Grid job) both succeed, and the info looks about right from jXplorer.

Glite 3.0

I followed the instructions on http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Manual-Upgrade/; the following were of note:

All nodes: There is an error in /opt/glite/yaim/functions/config_lcgenv, line 17, described here:

http://savannah.cern.ch/bugs/?func=detailitem&item_id=16895

Worker nodes: clash with boost-g3:

  apt-get remove lcg-WN_torque
  apt-get install glite-WN glite-torque-client-config
  Reading Package Lists... Done
  Building Dependency Tree... Done
  The following extra packages will be installed:
  DPM-client (1.5.7-1sec)
  LFC-client (1.5.7-1sec)
  LFC-interfaces (1.5.7-1)
  boost (1.32.0-6)

then...

  boost-g3 = 1.29.1 is needed by edg-wl-common-api_gcc3_2_2-lcg2.1.74-3_sl3
  boost-g3 >= 1.29.1 is needed by glite-WN-3.0.2-0
  E: Transaction set check failed

this was solved by removing package boost-g3 and re-running

  apt-get install glite-WN glite-torque-client-config

SE DPM head: nothing critical:
For GridView, in /opt/lcg/etc/lcg-mon-gridftp.conf change LOG_FILE to /var/log/dpm-gsiftp/dpm-gsiftp.log and restart lcg-mon-gridftp service.

Disk pool server: nothing of note

MON box/LFC server: nothing critical:
In /etc/cron.d/check-tomcat, add a log file to save an hourly e-mail:
    10 * * * * root /etc/rc.d/init.d/tomcat5 start >> /var/log/cron-check-tomcat 2>&1

UI

20:41 p.m. Henry is in the machine room. His task is to upgrade the UI.

  • Remove old metapackage, and change repository. As per upgrade instructions, do "apt-get install glite-UI" followed by "apt-get dist-upgrade"
  • "apt-get dist-upgrade" fails because of boost-g3 version problem. Manually removing boost-g3 makes "apt-get dist-upgrade" apparently succeed.
  • Edit site-info file as needed and run configure_node. It fails with
    ...
    Configuring config_lcgenv
    /opt/glite/yaim/scripts/configure_node: line 17: dgc-grid-34.brunel.ac.uk: command not found
    Configuring config_replica_manager
    ...

    Fix as per Savannah bug above.
  • Configure_node still fails, now with
    ...
    Configuring config_glite
    Traceback (most recent call last):
       File "/opt/glite/yaim/functions/../libexec/YAIM2gLiteConvertor.py", line 437, in ?
          buildUIsets( gLiteDom['glite-ui.cfg.xml'] )
       KeyError: glite-ui.cfg.xml
    Configuration Complete
    ...

    It turns out that /opt/glite/etc/config/templates/glite-ui.cfg.xml is missing - this comes from the glite-ui-config RPM which hasn't been installed. Running the full install_node fetches this RPM, as well as a number of others. See GGUS ticket 9230.
  • Now configure_node complains that WMS_HOST hasn't been set. Add "WMS_HOST=" and it apparently goes through OK. Successfully submit a couple of "hello world" test jobs.
  • In /opt/edg/etc/profile.d/edg-wl-ui-gui-env.[c]sh, /opt/edg/var/etc/profile.d/edg-wl-ui-gui-env.[c]sh and /opt/glite/etc/profile.d/glite-wmsui-gui-env.[c]sh, modify default value of JAVA_INSTALL_PATH to point at your Java.
  • In /opt/edg/etc/edg_wl_ui_cmd_var.conf, /opt/edg/etc/edg_wl_ui_gui_var.conf, and /opt/glite/etc/glite_wmsui_cmd_var.conf, set useful default value of DefaultVo.
  • If appropriate, enable load-sharing for RAL RBs ( http://www.gridpp.ac.uk/wiki/RAL_Tier1_Work_Load_Management#Local_Deployment_Information ).

22:58 p.m. Big Brother calls Henry into the Wiki room...

LCG CE: dgc-grid-40: nothing major of note
In /opt/lcg/var/gip/ldif/static-file-Cluster.ldif edit GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs to show correct values, and restart globus-mds.
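
The LDIF edit gets overwritten by a reconfigure, so it can be handy to script it. A sketch only, with an invented function name, assuming each attribute appears at the start of a line as "Name: value"; the CPU counts are whatever is right for the cluster and are passed in as arguments.

```shell
# Set GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs in an LDIF
# file. Other lines are left untouched; the file is overwritten in place.
set_subcluster_cpus() {
    file=$1; phys=$2; logical=$3
    tmp="$file.tmp.$$"
    sed -e "s/^\(GlueSubClusterPhysicalCPUs:\).*/\1 $phys/" \
        -e "s/^\(GlueSubClusterLogicalCPUs:\).*/\1 $logical/" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (then restart globus-mds as described above):
# set_subcluster_cpus /opt/lcg/var/gip/ldif/static-file-Cluster.ldif "$PHYS" "$LOGICAL"
```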

LCG CE/WN: dgc-grid-35: nothing major of note

  • Shut down mysql service that gets installed
  • Avoid boost version issue by doing Upgrade step 4 (dist-upgrade) BEFORE step 3 (install glite-WN)
  • In /opt/lcg/var/gip/ldif/static-file-Cluster.ldif edit GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs to show correct values, and restart globus-mds.
  • In /etc/logrotate.d/gridftp set rotation time for /var/log/globus-gridftp.log to a lot more than 31 days.
  • In /opt/edg/etc/edg-pbs-knownhosts.conf the FQDN of the gatekeeper appears twice - make one the short name and re-run edg-pbs-knownhosts.
  • After upgrade to Torque v. 2, remove localhost entry from /var/spool/pbs/mom_priv/config on all WNs.
  • Create dummy config_sw_dir in /opt/glite/yaim/functions/local on all WNs to save pummelling the disk server (see http://www.gridpp.ac.uk/wiki/GLite_Update_27#Software_area ). Also don't need it on the CE unless adding new VO or changing to SGM pool account.
  • Gridice CPU usage sometimes goes through roof; in /etc/cron.d/gridice_restart:
    8 5,17 * * * root /etc/rc.d/init.d/gridice_daemons restart >> /var/log/gridice_restart 2>&1