Bham Upgrade to LCG-2.7.0

Initial objectives and motivation

  • An early upgrade to 2.7.0 was a logical continuation of the UK pre-release testing.
  • I opted for an upgrade with no downtime (well, hopefully!) rather than a fresh install for this release.

Yaim config files

Yaim tool for VO management

I tried out the online yaim tool for VO management [1], which produced a configuration segment that I pasted into site-info.def. It's a great tool! The VO names in the VOS list it produced were all in capital letters:

VOS="ALICE ATLAS BABAR BIOMED CMS DTEAM HONE ILC LHCB SIXT ZEUS"

I didn't want this, and I expect yaim would complain about it? It is also a good idea to check the other values in the generated segment against those in the 2.6.0 site-info file.
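A minimal sketch of converting the generated list to lower case before pasting it into site-info.def. The helper name lowercase_vos is hypothetical and not part of yaim; whether yaim actually rejects upper-case VO names is not confirmed here.

```shell
# Hypothetical helper (not part of yaim): lower-case a space-separated VO list.
lowercase_vos() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# The list as produced by the online tool.
VOS="ALICE ATLAS BABAR BIOMED CMS DTEAM HONE ILC LHCB SIXT ZEUS"
VOS=$(lowercase_vos "$VOS")

# Print the line in the form used in site-info.def.
printf 'VOS="%s"\n' "$VOS"
```

The printed line can be pasted over the generated one; the other variables in the generated segment are left untouched.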

Changes to watch out for

Here are the changes in the yaim config file to watch out for. I initially unset the following variable as we don't have a classic SE:

CLASSIC_HOST="classic SE host"

Leaving it blank causes yaim to crash; leaving it as it is wrongly configures the bdii-update.conf file, inserting "classic SE host" in the LDAP contact string instead of the SE hostname. So I had to set it to my SE/DPM server hostname. I was unsure what value to give DPMDATA. I set it to:

DPMDATA="/dpm/ph.bham.ac.uk/home"

on my CE, as this is the value I want my site BDII to publish for GlueSARoot, but in the site-info.def file on our SE I had to set it to one of my DPM filesystems. If you don't, this is the type of thing you're going to end up with:

[root@epgse1 install-scripts]# dpm-qryconf
POOL dpmPart DEFSIZE 200.00M GC_START_THRESH 0 GC_STOP_THRESH 0 DEFPINTIME 0 PUT_RETENP 86400 FSS_POLICY maxfreespace GC_POLICY lru RS_POLICY fifo GID 0 S_TYPE -
                              CAPACITY 1.79T FREE 1.60T ( 89.4%)
  epgse1 /disk/f3a CAPACITY 601.08G FREE 561.87G ( 93.5%)
  epgse1 /disk/f3b CAPACITY 601.08G FREE 500.11G ( 83.2%)
  epgse1 /disk/f3c CAPACITY 601.08G FREE 555.95G ( 92.5%)
  epgse1.ph.bham.ac.uk /dpm/ph.bham.ac.uk/home CAPACITY 32.60G FREE 24.19G ( 74.2%)

As can be seen from the output, a new DPM filesystem has been created under the FQDN of my SE (epgse1.ph.bham.ac.uk), which is not what I wanted. I deleted it with dpm-rmfs.
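The clean-up can be sketched as a small filter over the dpm-qryconf output. The helper suggest_rmfs is hypothetical; it only prints a candidate command and never runs it. The --server and --fs options are written from memory of the DPM admin tools and should be checked against the dpm-rmfs man page on your SE.

```shell
# Hypothetical helper: read dpm-qryconf output on stdin and print (not run)
# a dpm-rmfs command for every filesystem whose path starts with /dpm/,
# i.e. looks like a namespace path rather than a real disk partition.
# Assumption: dpm-rmfs takes --server and --fs; check the man page first.
suggest_rmfs() {
    awk '$3 == "CAPACITY" && $2 ~ /^\/dpm\// {
        print "dpm-rmfs --server " $1 " --fs " $2
    }'
}

# On the SE one would run:  dpm-qryconf | suggest_rmfs
# Here the filter is fed two lines taken from the output above.
suggest_rmfs <<'EOF'
  epgse1 /disk/f3a CAPACITY 601.08G FREE 561.87G ( 93.5%)
  epgse1.ph.bham.ac.uk /dpm/ph.bham.ac.uk/home CAPACITY 32.60G FREE 24.19G ( 74.2%)
EOF
```

Printing the command rather than executing it leaves the final decision, and the check of the server name, to the administrator.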

The users.conf file is slightly different. I generated a new file with an increased number of pool accounts and the new prd accounts, based on my existing pool account configuration.

There is a new yaim file, groups.conf, which will be needed for VOMSES. I think it's safe to leave it as it is, though I generated the entries for my extra VOs based on the template. In hindsight, I would not recommend doing this? See also ROLLOUT.

Upgrade plan

As explained in the yaim release notes, it is not possible to upgrade with no downtime unless two MON boxes run simultaneously, one with the old and one with the new version of LCG. I was lucky to have a spare node with a certificate, on which I installed a 2.7 MON. I pointed my newly upgraded 2.7 WNs to this new MON while keeping the 2.6 MON node in operation; it was the last node I upgraded to 2.7. I then reran the yaim configuration scripts on all my nodes to point them back to my original (formerly 2.6) MON.
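The two switches of MON can be sketched as an edit of site-info.def followed by a yaim rerun. This assumes MON_HOST is the variable that selects the MON box; the helper set_mon_host and both hostnames below are placeholders, and the yaim rerun itself is left out because it depends on the node type.

```shell
# Hypothetical helper: rewrite the MON_HOST line of a site-info.def file.
# Assumption: MON_HOST is the yaim variable that names the MON box.
set_mon_host() {
    conf=$1
    host=$2
    tmp=$(mktemp "${conf}.XXXXXX") || return 1
    sed "s|^MON_HOST=.*|MON_HOST=${host}|" "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Demonstration on a scratch file; both hostnames are placeholders.
demo=$(mktemp /tmp/site-info.XXXXXX) || exit 1
printf 'CE_HOST=ce.example.org\nMON_HOST=old-mon.example.org\n' > "$demo"

# Step 1: upgraded WNs report to the temporary 2.7 MON.
set_mon_host "$demo" tmp-mon.example.org
grep '^MON_HOST=' "$demo"

# Step 2: once the original MON runs 2.7, point everything back to it.
set_mon_host "$demo" old-mon.example.org
grep '^MON_HOST=' "$demo"

rm -f "$demo"
```

After each change the yaim configuration scripts have to be rerun on the affected nodes for the new value to take effect.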

Signed rpms?

Not yet, but on the way? In the pre-release, it was mentioned that packages were signed with

http://glite.web.cern.ch/glite/packages/keys/EGEE.gLite.GPG.public.asc

So I thought I would give it a go, but ended up with yum complaining:

edg-mkgridmap-conf-2.6.0- 100% |=========================|  19 kB    00:00
Error: Unsigned Package /var/cache/yum/sl-lcg/packages/edg-mkgridmap-conf-2.6.0-1_sl3.noarch.rpm
Error: You may want to run yum clean or remove the file:
 /var/cache/yum/sl-lcg/packages/edg-mkgridmap-conf-2.6.0-1_sl3.noarch.rpm
Error: You may need to disable gpg checking to install this package

(Yes, edg-mkgridmap-conf-2.6 is in the 2.7 repository)
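To see which packages trip the GPG check, the yum output can be filtered for the "Unsigned Package" lines. This is a sketch only; the helper name unsigned_packages is hypothetical, and the sample input is taken from the messages above.

```shell
# Hypothetical helper: read yum output on stdin and print the cached path
# of every package that yum rejected as unsigned.
unsigned_packages() {
    sed -n 's/^Error: Unsigned Package //p'
}

# Each path could then be inspected by hand, e.g. with: rpm -K <path>
unsigned_packages <<'EOF'
edg-mkgridmap-conf-2.6.0- 100% |=========================|  19 kB    00:00
Error: Unsigned Package /var/cache/yum/sl-lcg/packages/edg-mkgridmap-conf-2.6.0-1_sl3.noarch.rpm
Error: You may need to disable gpg checking to install this package
EOF
```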

Temporary MON installation

yaim suddenly stopped, complaining about the mktemp usage. I also saw this error message when upgrading LFC and DPM. A typical error message is:

Configuring config_lfc_upgrade
Usage: mktemp [-d] [-q] [-u] template

mktemp needed an argument on my SL-3.0.4. The following functions were affected: config_DPM_upgrade, config_lfc_upgrade and config_rgma_server. This seems to be an issue only with mktemp on SL-3.0.4; I did not see this error in the pre-release installation on SL-3.0.5. See also [2].
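A call that should work on both systems is one that always passes an explicit template, as the usage message above asks for. The wrapper make_tmpfile is a hypothetical name for illustration; it is not how the yaim functions were eventually fixed.

```shell
# Hypothetical helper: always hand mktemp an explicit template, so the call
# does not depend on mktemp supplying a default one.
make_tmpfile() {
    mktemp "${TMPDIR:-/tmp}/$1.XXXXXX"
}

tmpfile=$(make_tmpfile yaim-upgrade) || exit 1
echo "scratch file: $tmpfile"
rm -f "$tmpfile"
```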

In the fresh install of my MON, the MySQL password was not set for localhost, so I set it manually; see [3].

WN and CE upgrade

Was smooth!

DPM backups and upgrade!

I pushed the wrong installation scripts to my SE and corrupted my DPM database as a result! Fortunately, I had made a backup just before the upgrade; unfortunately, I initially had some problems restoring the database. The moral of this story is not only to back up your database but also to make sure you can recreate it from the backup. I did some tests on a spare machine (my laptop) with MySQL installed:

[mysql@localhost mysql]# mysql -u root -p < /home/yrc/mysql-dump-2006-02-03T06-00.sql
Enter password:
ERROR 1005 (HY000) at line 268: Can't create table './dpm_db/dpm_fs.frm' (errno: 150)

I traced this unhelpful error message to a problem with foreign key checks (the thingy that makes sure every reference from one table points to an existing row in another). I had to hack the SQL backup file and add:

SET FOREIGN_KEY_CHECKS=0;
all the SQL stuff
SET FOREIGN_KEY_CHECKS=1;

at the beginning and the end of the file.
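Rather than editing the dump by hand, the two statements can be wrapped around it on the fly. A minimal sketch, with wrap_dump as a hypothetical helper name; the demonstration uses a one-line stand-in for the real dump.

```shell
# Hypothetical helper: print a dump with foreign key checks switched off
# before its contents and back on after them.
wrap_dump() {
    echo 'SET FOREIGN_KEY_CHECKS=0;'
    cat "$1"
    echo 'SET FOREIGN_KEY_CHECKS=1;'
}

# For a real restore one would pipe the result into the client:
#   wrap_dump /home/yrc/mysql-dump-2006-02-03T06-00.sql | mysql -u root -p

# Demonstration with a stand-in dump file.
demo=$(mktemp /tmp/dump.XXXXXX) || exit 1
printf 'CREATE DATABASE dpm_db;\n' > "$demo"
wrap_dump "$demo"
rm -f "$demo"
```

This leaves the backup file itself unmodified, so the same dump can be reused for further restore tests.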

Anyway, once this problem was solved, Graeme's function worked wonders! Here is an extract of the output for info:

Examining DPM for required upgrades to domain names and db schema
MODIFIED FUNCTION!
Examining DPM server hostname epgse1... looks unqualified.
Found simple hostnames. Converting to FQDNs in DPM/DPNS databases.
Sat Feb  4 18:11:17 2006 : Starting to add the domain name.
Please wait...
Sat Feb  4 18:11:17 2006 : 1000 entries migrated
Sat Feb  4 18:11:18 2006 : 2000 entries migrated
Sat Feb  4 18:11:19 2006 : 3000 entries migrated
Sat Feb  4 18:11:20 2006 : 4000 entries migrated
Sat Feb  4 18:11:21 2006 : 5000 entries migrated
Sat Feb  4 18:11:21 2006 : 6000 entries migrated
Sat Feb  4 18:11:22 2006 : 7000 entries migrated
Sat Feb  4 18:11:23 2006 : 8000 entries migrated
Sat Feb  4 18:11:24 2006 : 9000 entries migrated
Sat Feb  4 18:11:25 2006 : 10000 entries migrated
Sat Feb  4 18:11:26 2006 : 11000 entries migrated
Sat Feb  4 18:11:26 2006 : 12000 entries migrated
Sat Feb  4 18:11:27 2006 : The update of the DPNS database is over
3 disk server names have been modified in the configuration.
12311 entries have been migrated.
domain name = ph.bham.ac.uk
db vendor = MySQL
db = epgse1.ph.bham.ac.uk
DPNS database user = dpmmgr
DPNS database password = IwasNotGoingToLeaveitHere!
DPNS database name = cns_db
DPM database name = dpm_db
Mysql database version used: 2.1.0
Found schema version 2.1.0. No need to upgrade the DPM database schema

A dpm-qryconf then showed FQDNs for all my DPM filesystems!

P.S. Following my database problem, I kept my site in production, as no jobs other than dteam ones landed on our CE because of the VO software publishing bug (now fixed) [4].

Tweaks and bugs

Testing was made very difficult because the SFT results web page got stuck at 2006-02-05 01:05:01. I could still run the SFTs but couldn't get any information on why I had failed a test. It is on occasions like this that we realise just how important the SFT test suite is!

SE information system

I had problems starting globus-mds on my DPM SE. A "service globus-mds start" reports that the daemon has started, but netstat shows that no process is listening on port 2135. After some debugging I found that yaim had overwritten the globus-script-initializer file in /opt/globus/libexec with an empty file. I copied this file across from my CE, and I could at last start globus-mds properly. For more details, see [5].
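A quick check for this failure mode is to test whether the initializer file is non-empty before starting the daemon. A sketch, with check_initializer as a hypothetical helper; the default path is the one named above, and the demonstration runs on a scratch file.

```shell
# Hypothetical helper: report whether the initializer script exists and
# has some content. The default path is the one mentioned in the text.
check_initializer() {
    f=${1:-/opt/globus/libexec/globus-script-initializer}
    if [ -s "$f" ]; then
        echo "ok: $f is non-empty"
    else
        echo "problem: $f is missing or empty"
        return 1
    fi
}

# On the SE one might follow a successful check with:
#   netstat -tln | grep ':2135 '

# Demonstration on a scratch file: first empty, then with content.
demo=$(mktemp /tmp/initializer.XXXXXX) || exit 1
check_initializer "$demo" || echo "(an empty file is reported, as expected)"
echo '# placeholder content' > "$demo"
check_initializer "$demo"
rm -f "$demo"
```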

Maui

There is a cool new dynamic plug-in for Maui. It published the information correctly with the default Maui config in yaim, but it did not like my configuration based on Steve's Cookbook method, even after I added the edginfo and rgma users to the Maui admins. This is being looked at; see the thread in [6]. Actually, not so cool at all!

I disabled the vomaxjobs-maui plugin after the Bham CE got flooded by more than 1000 jobs! I'm now relying on the 2.6.0 lcg-info-dynamic-pbs plugin to report job status until an official fix for vomaxjobs-maui is released.

RGMA

I failed the RGMA client test many times. I observed different types of errors when the RGMA client test ran on the same WN:

Checking C API: Failure - failed to query test tuple
Checking CommandLine API: Failure - failed to query test tuple
Checking Java API: Failure - failed to query test tuple
Checking C++ API: Failure - failed to query test tuple
Checking C API: Failed to create producer: Mangled HTTP response from servlet.
Failure - failed to insert test tuple
Checking C++ API: R-GMA application error in PrimaryProducer: No xml returned
Failure - failed to insert test tuple
Checking CommandLine API: ERROR: Could not contact R-GMA server at epgmo1.ph.bham.ac.uk:8443 - HTTP error 400 (No Host matches server name  epgmo1.ph.bham.ac.uk)
Failure - failed to insert test tuple
Checking Java API: Failed to contact R-GMA server: Server returned HTTP response code: 400 for URL: https://epgmo1.ph.bham.ac.uk:8443/R-GMA/PrimaryProducerServlet/createPrimaryProducer?terminationIntervalSec=600&type=memory&isLatest=false&isHistory=false
Failure - failed to insert test tuple
Checking Python API: RGMA Error: Could not contact R-GMA server at epgmo1.ph.bham.ac.uk:8443 - HTTP error 400 (No Host matches server name  epgmo1.ph.bham.ac.uk)
Failure - failed to insert test tuple

It seems that restarting RGMA fixes this? I'm looking into this at the moment.

Information publishing

We should all make sure we publish only correct information; there are various very interesting messages about this on ROLLOUT and GRIDPP-STORAGE.

GridIce

GridIce runs on my LFC/MON node. I had to add a blank line followed by

[mds/gris/provider/gridice]

after [mds/gris/provider/edg] in /etc/globus.conf. Unfortunately, I still do not publish anything and see the same problem as the one described in [7].
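The edit to /etc/globus.conf can be sketched as a filter that prints the modified file. It assumes the [mds/gris/provider/edg] header has no keys of its own directly beneath it; the helper name add_gridice_provider and the sample file contents are hypothetical.

```shell
# Hypothetical helper: print the config with a blank line and the gridice
# provider section inserted right after the edg provider header.
# If the gridice section is already present, the file is printed unchanged.
add_gridice_provider() {
    conf=$1
    if grep -qxF '[mds/gris/provider/gridice]' "$conf"; then
        cat "$conf"
        return 0
    fi
    awk '{ print }
         $0 == "[mds/gris/provider/edg]" {
             print ""
             print "[mds/gris/provider/gridice]"
         }' "$conf"
}

# Demonstration on a scratch file with made-up contents.
demo=$(mktemp /tmp/globus-conf.XXXXXX) || exit 1
printf '[mds/gris/provider/edg]\n\n[mds/gris/registration/site]\nregname=example\n' > "$demo"
add_gridice_provider "$demo"
rm -f "$demo"
```

Writing to standard output rather than editing in place makes it easy to diff the result against the live file before replacing it.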