Ed Upgrade 152 To 166


This page contains the notes that were kept during an upgrade of dCache 1.5.{2,3} to 1.6.6. The upgrade was carried out both on a test system, which consisted of a single node running all services (including two pools), and on the production admin node. Both upgrades were successful. It is possible to operate dCache with a 1.6.6 admin node and 1.5.{2,3} pool nodes.

postgreSQL database upgrade

Previous versions of dCache used gdbm to store the PNFS database. From now on, postgreSQL will be used (removes the 2GB limit) and it is recommended that v8.1 (or above) be used (v7.4 is standard on LCG machines). Some work needs to be done to migrate the old databases to the new format.
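Before starting, it is worth checking which postgreSQL version is currently installed, for example:

 $ rpm -q postgresql-server
 $ psql --version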

1. Make sure that the database is running. If not, start it by becoming the postgres user (su - postgres) and running:

 $ postmaster -i -D /var/lib/pgsql/data/ > /tmp/logfile 2>&1 &                                                                                                                                                            

2. Still as the postgres user, dump the database and then stop it:

 $ pg_dumpall > /bigdisk/db.out                                                                                                                                                             
 $ pg_ctl stop -D /var/lib/pgsql/data                                                                                                                                                            

3. Rename the old database:

 $ mv /var/lib/pgsql/data /var/lib/pgsql/data_7.4                                                                                                                                                            

4. Download the postgreSQL 8 rpms and install them (one possible download-and-install sequence is sketched after the package list below). A version suitable for current versions of Scientific Linux 3 can be found at http://www.postgresql.org/ftp/binary/v8.1.0/linux/rpms/redhat/rhel-es-3.0/. Afterwards you should have:

 $ rpm -qa|grep post
 postgresql-libs-8.1.0-2PGDG
 postgresql-devel-8.1.0-2PGDG
 postgresql-python-8.1.0-2PGDG
 postgresql-8.1.0-2PGDG
 postgresql-docs-8.1.0-2PGDG
 postgresql-test-8.1.0-2PGDG
 postgresql-server-8.1.0-2PGDG
 postgresql-jdbc-8.1.0-2PGDG
 postgresql-contrib-8.1.0-2PGDG
 postgresql-pl-8.1.0-2PGDG                                                                                                                                                            
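One possible download-and-install sequence for this step, only as a sketch; the exact rpm filenames depend on what the mirror provides, and <each-rpm-listed-above> is a placeholder for the packages shown in the list:

 $ mkdir /tmp/pgsql81 && cd /tmp/pgsql81
 $ wget http://www.postgresql.org/ftp/binary/v8.1.0/linux/rpms/redhat/rhel-es-3.0/<each-rpm-listed-above>
 $ rpm -Uvh postgresql-*.rpm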

5. Create a new database instance (this must be done as the postgres user):

 $ initdb -D /var/lib/pgsql/data                                                                                                                                                            

6. Start postgres:

 $ postmaster -i -D /var/lib/pgsql/data/ > /tmp/logfile 2>&1 &

7. Restore database from dump:

 $ psql -e template1 < /bigdisk/db.out                                                                                                                                                            

8. Turn this option on in /var/lib/pgsql/data/postgresql.conf:

 add_missing_from = on                                                                                                                                                            

and change

 #listen_addresses = 'localhost'    # what IP address(es) to listen
 #port = 5432                                                                                                                                                            

to

 listen_addresses = '*'         # what IP address(es) to listen
 port = 5432                                                                                                                                                           

9. Alter /var/lib/pgsql/data/pg_hba.conf so that it contains the entries:

 local   all         all                        trust
 host    all         all         127.0.0.1/32   trust
 host    all         all         ::1/128        trust
 host    all         all         <host-ip>/32   trust                                                                                                                                                            

10. Reload the database configuration (as root):

 $ /etc/init.d/postgres reload

Steps 8 and 9 above were required to get the SRM cell online in dCache once the upgrade to 1.6.6 had been completed.
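To confirm that the new settings have taken effect, check that postgres is now listening on TCP port 5432 and that a host-based connection works. A quick test along these lines (the srmdcache role is assumed to exist, having been restored from the dump; <host-ip> is the same address used in pg_hba.conf above):

 $ netstat -ltn | grep 5432
 $ psql -h <host-ip> -U srmdcache -l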

dCache 1.6.6 upgrade

In the case where only pool nodes need to be upgraded, it should simply be a case of following the procedure presented here, ensuring that NODE_TYPE leads to a pool install and that the pool_path entries are up to date. There should be no PNFS/postgreSQL issues to deal with. Problems may arise due to unmet dependencies on the pool nodes, particularly where they are not standard LCG nodes but are running some alternative operating system. For example, RedHat Advanced Server 2.1 uses glibc v2.2.5, but dCache 1.6.6 requires glibc v2.3.
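Before upgrading a pool node it is easy to check whether its glibc is new enough, for example:

 $ rpm -q glibc
 $ ldd --version | head -1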

For the upgrade, first back up the config and etc directories, just in case any configuration files are changed. Check that PNFS_OVERWRITE=no and that there is only one value assigned to NODE_TYPE.

 NODE_TYPE = admin dummy #admin or pool

This is not sufficient, as anything other than just admin will result in a pool node install.

 $ cat node_config
 NODE_TYPE = admin
 DCACHE_BASE_DIR = /opt/d-cache
 PNFS_ROOT = /pnfs
 PNFS_INSTALL_DIR = /opt/pnfs.3.1.10/pnfs
 PNFS_START = yes
 PNFS_OVERWRITE = no
 POOL_PATH = /opt/d-cache/etc
 NUMBER_OF_MOVERS = 100
 $ cat door_config
 ADMIN_NODE      wn4.epcc.ed.ac.uk

 door          active
 --------------------
 GSIDCAP         yes
 GRIDFTP         yes
 SRM             yes

The dcache-opt service no longer exists and the door_config file is read at startup, so there is no longer any need to run install_doors.sh.

Make sure that each pool_path entry has 'no' in the third field.

 $ cat pool_path
 /dcache-storage1 14 no
 /dcache-storage2 14 no

Stop all dCache services:

 $ service dcache-pool stop
 $ service dcache-opt stop
 $ service dcache-core stop

Leave pnfs running. Remove the LCG dCache metapackage.

 $ rpm -e lcg-SE_dcache-2.6.0-sl3

Download the dCache tarball and install the rpms:

 $ wget http://www.dcache.org/downloads/releases/dcache-bundle-1.6.6-1.tgz
 $ rpm -Uvh dcache-server-1.6.6-1.i386.rpm dcache-client-1.6.6-1.i386.rpm

Run the install.sh script:

 # /opt/d-cache/install/install.sh
 [INFO]  No 'SERVER_ID' set in 'node_config'. Using SERVER_ID=epcc.ed.ac.uk.
 
 [INFO]  Will be mounted to localhost:/fs by dcache-core start-up script.
 [INFO]  Link /pnfs/epcc.ed.ac.uk --> /pnfs/fs/usr already there.
 [INFO]  Creating link /pnfs/ftpBase --> /pnfs/fs which is used by the GridFTP door.
 
 [INFO]  Checking on a possibly existing dCache/PNFS configuration ...
 [INFO]  Found an existing dCache/PNFS configuration!
 [INFO]  Not allowed to overwrite existing configuration.
 
 [INFO]  Configuring pnfs export '/pnfsdoors' (needed from version 1.6.6 on)
         mountable by world.
 [INFO]  You may restrict access to this export to the GridFTP doors which
         are not on the admin node. See the documentation.
 
 [INFO]  Generating ssh keys:
 Generating public/private rsa1 key pair.
 Your identification has been saved in ./server_key.
 Your public key has been saved in ./server_key.pub.
 The key fingerprint is:
 43:59:86:33:b4:d0:f7:ad:9e:d5:41:eb:c1:b9:03:ff root@wn4.epcc.ed.ac.uk
 
 [INFO]  Not overwriting pool at /dcache-storage1.
 [INFO]  Not overwriting pool at /dcache-storage2.

Now start the dCache services.

 $ service dcache-core start
 $ service dcache-pool start

Note that the names of the doors have changed to include the hostname; I will probably have to alter the httpd.batch file to reflect this. The start-up scripts for the optional components are not needed anymore, so it is best to remove them:

 $ rm /opt/d-cache/bin/dcache-opt /etc/init.d/dcache-opt
 rm: cannot lstat `/opt/d-cache/bin/dcache-opt': No such file or directory
 $ ls /opt/d-cache/bin/
 dcache-core  dcache-pool  grid-mapfile2dcache-kpwd

So dcache-opt was not present anyway.

Check the dCache functionality to make sure everything is working:

 # ls /pnfs/epcc.ed.ac.uk/data/dteam/
 20051109_094634.txt        srm1-180240.txt           srm3-124244.txt
 20051111_122715.txt        srm1-20051109_104417.txt  srm3-180240.txt
 20051111_123444.txt        srm1-20051109_105040.txt  srm3-20051109_104417.txt
 20051111_123445.txt        srm1-20051111_115412.txt  srm3-20051109_105040.txt
 greig_20051111_122420.txt  srm1-20051111_115413.txt  srm3-20051111_115412.txt
 greig_20051111_122542.txt  srm1-20051111_115414.txt
 srm1-121814.txt            srm3-121814.txt

Testing the new install of dCache shows that everything is working. There have been changes to the layout of the web monitoring page, although the information pages remain the same. Log file names have changed: there is no longer (e.g.) a gridftpdoor.log, it is now gridftp-`hostname -s`Domain.log. SRM is not appearing on the web monitoring page, possibly due to httpd.batch configuration issues; this needs to be looked into. However, it all appears to be OK.

Your old config/PoolManager.conf will not be overwritten by the upgrade. Its format did not change, so it is fine to keep your old one. If you did not customise the pool manager configuration, make sure that the set costcuts line reads:

 set costcuts -idle=0.0 -p2p=2.0 -alert=0.0 -halt=0.0 -fallback=0.0

Prior versions installed a config/PoolManager.conf with -idle=1.0 which will lead to undesired behaviour of the pool manager.
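A quick way to check which value is currently in place:

 $ grep costcuts /opt/d-cache/config/PoolManager.conf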

PNFS Companion

The 'PNFS companion' contains only the 'cacheinfo' i.e. the location of the file on the pools (not on tape). This information used to be stored in PNFS (level 2, e.g. cat '.(use)(2)(myfile)'), but it is not there anymore. The 'PNFS companion' was devised to speed up 'cacheinfo' queries which appear very often. All the other information that used to be stored in PNFS (e.g. the filesystem structure) is now stored in the pnfsserver DBs (owned by the user pnfsserver). Following the instructions in the dCache book:

 [root]$ createdb -U srmdcache companion

Initially this command returned a permission denied error. This was due to the postgreSQL user srmdcache not having permission to create new databases. Within postgreSQL, the following command shows which postgreSQL users (roles) have been created and what permissions they have:

 postgres=# SELECT * FROM pg_user;
   usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
 ------------+----------+-------------+----------+-----------+----------+----------+-----------
  postgres   |       10 | t           | t        | t         | ******** |   |
  srmdcache  |    16384 | f           | f        | f         | ******** |   |
  pnfsserver |    16385 | t           | f        | f         | ******** |   |
 (3 rows)

This clearly shows that srmdcache could not create new databases. Change permissions by:

 postgres=# ALTER USER srmdcache CREATEDB;
 ALTER ROLE
 postgres=# SELECT * FROM pg_user;
   usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
 ------------+----------+-------------+----------+-----------+----------+----------+-----------
  postgres   |       10 | t           | t        | t         | ******** |   |
  pnfsserver |    16385 | t           | f        | f         | ******** |   |
  srmdcache  |    16384 | t           | f        | f         | ******** |   |
 (3 rows)

Now try again.

 [root]# createdb -U srmdcache companion
 CREATE DATABASE

You can view all databases in the system:

 postgres=# \l
          List of databases
    Name    |   Owner    | Encoding
 -----------+------------+-----------
  admin     | pnfsserver | SQL_ASCII
  alice     | pnfsserver | SQL_ASCII
  atlas     | pnfsserver | SQL_ASCII
  cms       | pnfsserver | SQL_ASCII
  companion | srmdcache  | UTF8
  data1     | pnfsserver | SQL_ASCII
  dcache    | postgres   | SQL_ASCII
  dteam     | pnfsserver | SQL_ASCII
  lhcb      | pnfsserver | SQL_ASCII
  postgres  | postgres   | UTF8
  replicas  | srmdcache  | SQL_ASCII
  sixt      | pnfsserver | SQL_ASCII
  template0 | postgres   | UTF8
  template1 | postgres   | UTF8
 (14 rows)

Now finish off this step:

 [root]# psql -U srmdcache companion -f /opt/d-cache/etc/psql_install_companion.sql
 psql:/opt/d-cache/etc/psql_install_companion.sql:6: NOTICE:  CREATE TABLE / UNIQUE will create implicit index "cacheinfo_pnfsid_key" for table "cacheinfo"
 CREATE TABLE
 CREATE INDEX
 CREATE INDEX
 # service pnfs start

Add

 cacheInfo=companion

to /opt/d-cache/config/dCacheSetup and then start all dCache services.

Now the dCache system will not be aware of any files stored on the pools. To make it aware again, go through the following steps. Since this will take a while and will put a considerable load on the PnfsManager, take care that it is done one pool at a time. In the admin interface, go to a pool:

 (local) admin > cd <hostname>_1

and issue the command

 (<poolname>) admin > pnfs register

Then go to the pnfs manager:

 (<poolname>) admin > ..
 (local) admin > cd PnfsManager


 (PnfsManager) admin > info
 ...
 Threads (4) Queue
     [0] 10
     [1] 12
     [2] 9
     [3] 13
 ...

and wait until the value for all four queues is zero. Then go to the next pool and repeat the process. After this is done, the upgrade is complete.
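Once the registration for a pool has completed, one way to confirm that the companion is being populated is a simple row count; this is just a sanity check, using the cacheinfo table created by psql_install_companion.sql above:

 $ psql -U srmdcache companion -c 'SELECT count(*) FROM cacheinfo;'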

Migration of PNFS databases from gdbm to postgreSQL

This step can be carried out independently of the dCache 1.6.6 upgrade, but should not be done concurrently with it. First prepare the PostgreSQL server by creating a database user for the pnfs server. It has to have permission to create databases. It is suggested to call it pnfsserver:

 $ su - postgres
 $ createuser --no-adduser --createdb pnfsserver
 Shall the new role be allowed to create more new roles? (y/n) y
 CREATE ROLE

You can check this via:

 postgres=# SELECT * FROM pg_user;
   usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
 ------------+----------+-------------+----------+-----------+----------+----------+-----------
  postgres   |       10 | t           | t        | t         | ******** |   |
  srmdcache  |    16384 | t           | f        | f         | ******** |   |
  pnfsserver |    17700 | t           | f        | f         | ******** |   |
 (3 rows)

Find the location of the databases by:

 $ . /usr/etc/pnfsSetup
 $ PATH=${pnfs}/tools:$PATH
 $ cat ${database}/D-* | cut -f 5 -d ':'
 /opt/pnfsdb/pnfs/databases/admin
 /opt/pnfsdb/pnfs/databases/data1
 /opt/pnfsdb/pnfs/databases/alice
 /opt/pnfsdb/pnfs/databases/atlas
 /opt/pnfsdb/pnfs/databases/dteam
 /opt/pnfsdb/pnfs/databases/cms
 /opt/pnfsdb/pnfs/databases/lhcb
 /opt/pnfsdb/pnfs/databases/sixt

Now we want to make a backup of the databases and check their integrity. First of all, stop dCache and PNFS:

 $ service dcache-core stop
 $ service pnfs stop

(if you do not stop PNFS you will not be able to get a lock on the PNFS database to perform the check). From your home directory:

 $ mkdir tmp-pnfs-scan
 $ md3tool scan /opt/pnfsdb/pnfs/databases/admin > tmp-pnfs-scan/admin.scan 2>&1
 $ md3tool scandir /opt/pnfsdb/pnfs/databases/admin > tmp-pnfs-scan/admin.scandir 2>&1
 $ md3tool scandirs /opt/pnfsdb/pnfs/databases/admin > tmp-pnfs-scan/admin.scandirs 2>&1
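The same three checks need to be run for each of the other databases; a loop along these lines (a sketch, using the database list found above) does them all in one go:

 $ for db in admin data1 alice atlas dteam cms lhcb sixt; do
 >   for t in scan scandir scandirs; do
 >     # same checks as above, one output file per database and per check
 >     md3tool $t /opt/pnfsdb/pnfs/databases/$db > tmp-pnfs-scan/$db.$t 2>&1
 >   done
 > done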

Check the contents of the resulting files by running commands like:

 $ grep -v "^Scan" *.scandir
 admin.scandir: Scanning DB id : 0
 admin.scandir:   External Reference at 0 : 000100000000000000001060 for data
 alice.scandir: Scanning DB id : 2
 atlas.scandir: Scanning DB id : 3
 cms.scandir: Scanning DB id : 5
 data1.scandir: Scanning DB id : 1
 data1.scandir:   External Reference at 0 : 000400000000000000001060 for dteam
 data1.scandir:   External Reference at 0 : 000700000000000000001060 for sixt
 data1.scandir:   External Reference at 0 : 000200000000000000001060 for alice
 data1.scandir:   External Reference at 0 : 000500000000000000001060 for cms
 data1.scandir:   External Reference at 0 : 000300000000000000001060 for atlas
 data1.scandir:   External Reference at 0 : 000600000000000000001060 for lhcb
 dteam.scandir: Scanning DB id : 4
 lhcb.scandir: Scanning DB id : 6
 sixt.scandir: Scanning DB id : 7

The contents of data1.scandir make sense, since the VO databases were created as subdirectories of the /data directory; the PNFS IDs are the roots of these new databases. I should really have created completely new directories. If your output is similar to that above, you can continue with the conversion to the postgresql version of pnfs.


Updating the pnfs Software

Back up the databases:

 $ mv /opt/pnfsdb/pnfs/databases pnfsdb-backup/databases
 $ ls pnfsdb-backup/databases/
 admin  alice  atlas  cms  data1  dteam  lhcb  sixt
 $ cp /usr/etc/pnfsSetup pnfsdb-backup/

Remove the old version of the pnfs software, then recreate the databases directory with an empty file for each database:

 $ apt-get remove pnfs
 $ mkdir /opt/pnfsdb/pnfs/databases
 $ cd /opt/pnfsdb/pnfs/databases
 $ touch admin data1 dteam alice atlas cms lhcb sixt

Adjust the central configuration file /usr/etc/pnfsSetup: Change the location of the pnfs software in the line

 pnfs=/opt/pnfs.3.1.10/pnfs

to

 pnfs=/opt/pnfs

and add a line reading:

 export dbConnectString="user=pnfsserver"
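After editing, the two relevant lines can be double-checked with, for example:

 $ grep -E '^pnfs=|dbConnectString' /usr/etc/pnfsSetup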

Install the pnfs-postgresql package.

 $ rpm -ivh pnfs-postgresql-3.1.10-1.i386.rpm

Conversion of the Databases

Source the pnfs environment:

 $ . /usr/etc/pnfsSetup
 $ PATH=${pnfs}/tools:$PATH

Run the migration script on all of the databases (admin, data1...):

 $ gdbm2psql -r -o -i pnfsdb-backup/databases/admin
 Connection string: dbname=template1 user=pnfsserver
 Connection string: dbname=admin user=pnfsserver
 WARNING:  there is no transaction in progress
 Put record #1 into the database...time=0
 key 000000000000xxxxxxxxxxxx is found at 0
 key 000000000000xxxxxxxxxxxx is found at 1
 key 000000000000xxxxxxxxxxxx is found at 2
 key 000000000000xxxxxxxxxxxx is found at 3
 key 000000000000xxxxxxxxxxxx is found at 4
 ...
 key 000000000000xxxxxxxxxxxx is found at 68
 There are 69 records in the database.

For each database created in the conversion, create the primary key by running the following (a loop covering all of the databases is sketched below):

 $ psql -U pnfsserver -c 'ALTER TABLE pnfs ADD primary key (pnfsid)' admin
 NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "pnfs_pkey" for table "pnfs"
 ALTER TABLE
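The key has to be created for every converted database; a loop such as this (a sketch, using the same database list as before) covers them all:

 $ for db in admin data1 alice atlas dteam cms lhcb sixt; do
 >   # add the primary key to the pnfs table in each converted database
 >   psql -U pnfsserver -c 'ALTER TABLE pnfs ADD primary key (pnfsid)' $db
 > done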

Testing the Converted Databases

This must be done before pnfs is started. I have a small installation, so should be able to use the conversion-scan.sh script. All that we are doing here is comparing the output of the tests (i.e. md3tool) on the old and new databases. conversion-scan.sh is available in the pnfs environment set up above.

 $ conversion-scan.sh pnfsdb-backup/databases /opt/pnfsdb/pnfs/databases

This produces lots of output, the last line being:

 Conversion check finished sucessfully. Both sets of databases are of identical content.

So everything seems to have gone OK with the migration. Now start pnfs and dCache:

 $ /opt/pnfs/bin/pnfs start
 Starting pnfs services (PostgreSQL version):
 Shmcom : Installed 8 Clients and 8 Servers
 Starting database server for admin (/opt/pnfsdb/pnfs/databases/admin) ... O.K.
 Starting database server for data1 (/opt/pnfsdb/pnfs/databases/data1) ... O.K.
 Starting database server for alice (/opt/pnfsdb/pnfs/databases/alice) ... O.K.
 Starting database server for atlas (/opt/pnfsdb/pnfs/databases/atlas) ... O.K.
 Starting database server for dteam (/opt/pnfsdb/pnfs/databases/dteam) ... O.K.
 Starting database server for cms (/opt/pnfsdb/pnfs/databases/cms) ... O.K.
 Starting database server for lhcb (/opt/pnfsdb/pnfs/databases/lhcb) ... O.K.
 Starting database server for sixt (/opt/pnfsdb/pnfs/databases/sixt) ... O.K.
 Waiting for dbservers to register ... Ready
 Starting Mountd : pmountd
 Starting nfsd : pnfsd

Note the new location of pnfs. Remove the old symbolic link /etc/init.d/pnfs and create a new one:

 $ rm /etc/init.d/pnfs
 $ ln -s /opt/pnfs/bin/pnfs /etc/init.d/pnfs


Upgrade to 1.6.6-3 (bug fix)

The change-log can be found here. 1.6.6-3 contains a new information provider for LCG. For this to operate properly, you first need to get and install a couple of updated rpms for the GIP:

 # rpm -Uvh lcg-info-generic-1.0.22-1.noarch.rpm lcg-info-templates-1.0.14-1.noarch.rpm
 Preparing...                ########################################### [100%]
   1:lcg-info-generic       ########################################### [ 50%]
   2:lcg-info-templates     ########################################### [100%]

Backups of the /opt/d-cache/etc and /opt/d-cache/config directories were made, just in case anything was overwritten during the upgrade.
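One way to take such backups so that they can be compared afterwards; the tar file names match those used in the md5sum comparison below, but the destination directory is arbitrary:

 $ cd /opt/d-cache
 $ tar cf /root/dcache-etc.tar etc
 $ tar cf /root/dcache-config.tar config

After the rpm upgrade below, tarring the same two directories again as dcache-etc-NEW.tar and dcache-config-NEW.tar gives the files that are compared with md5sum further down.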

 # rpm -Uvh dcache-server-1.6.6-3.i386.rpm dcache-client-1.6.6-3.i386.rpm pnfs-postgresql-3.1.10-3.i386.rpm

Comparing the md5sums of the tarred-up old and new etc and config directories shows that there have been changes:

 # md5sum dcache-*
 95039373e6d8e2b2aef215ca99de7298  dcache-config-NEW.tar
 2e8b438acadb6b35adae1076f6f1c302  dcache-config.tar
 fc2182d66285d9a46cfe57a0f6ac073d  dcache-etc-NEW.tar
 9780eaa417bb984cc335bccf04b86c77  dcache-etc.tar

Changes were made to psql_install_replicas.sql. This is presumably related to the improved performance of the replica manager achieved by adding an index to the database. This has to be done manually when upgrading:

 # psql -U srmdcache -d replicas -f /opt/d-cache/etc/psql_install_replicas.sql
 You are now connected as new user "srmdcache".
 psql:/opt/d-cache/etc/psql_install_replicas.sql:12: ERROR:  schema "proc" already exists
 SET
 psql:/opt/d-cache/etc/psql_install_replicas.sql:26: ERROR:  relation "replicas" already exists
 REVOKE
 psql:/opt/d-cache/etc/psql_install_replicas.sql:46: ERROR:  relation "pools" already exists
 REVOKE
 SET
 psql:/opt/d-cache/etc/psql_install_replicas.sql:65: ERROR:  relation "replicas" already exists
 SET
 psql:/opt/d-cache/etc/psql_install_replicas.sql:78: ERROR:  relation "action" already exists
 psql:/opt/d-cache/etc/psql_install_replicas.sql:90: ERROR:  relation "heartbeat" already exists
 psql:/opt/d-cache/etc/psql_install_replicas.sql:99: ERROR:  multiple primary keys for table "pools" are not allowed
 psql:/opt/d-cache/etc/psql_install_replicas.sql:108: ERROR:  multiple primary keys for table "replicas" are not allowed
 psql:/opt/d-cache/etc/psql_install_replicas.sql:117: ERROR:  multiple primary keys for table "heartbeat" are not allowed
 CREATE INDEX
 CREATE INDEX

You can safely ignore the errors produced by this command. Although dCacheSetup.template and node_config.template have slight changes to deal with the information provider, the default values should be OK and it should not be necessary to change dCacheSetup and node_config from the previous install. The new dCacheSetup.template also contains more comments regarding the meaning and structure of the file contents and a new way of configuring the various postgres databases that dCache uses.

Now start up the pnfs and dCache services. No problems were experienced here; srmGet, srmPut and srmCopy requests were all successful after the upgrade.
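For reference, a minimal SRM smoke test using srmcp from the dCache client looks something like the following. The hostname and target directory are only illustrative, and a valid grid proxy for a VO with write access to the directory is assumed:

 $ grid-proxy-init
 $ echo "srm test" > /tmp/srmtest.txt
 $ srmcp file:////tmp/srmtest.txt srm://wn4.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/srmtest.txt
 $ srmcp srm://wn4.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/srmtest.txt file:////tmp/srmtest-copy.txt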

It should be noted that the srm-get-metadata command in the dCache client does not work, returning a java.lang.NullPointerException.

Information Provider

Follow the instructions in the dCache book. New versions of the GIP are required.

There is currently a problem with the dCache GIP plugin which is being investigated. It is recommended to downgrade to version 1.0.20 of the GIP until the problem is resolved. After doing this, follow the steps in the dCache FAQ regarding the publishing of storage element information.
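The downgrade can be forced with rpm's --oldpackage option; the exact filename of the 1.0.20 rpm is an assumption here:

 # rpm -Uvh --oldpackage lcg-info-generic-1.0.20-1.noarch.rpm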

Upgrade to 1.6.6-4 (bug fix)

This is the version that will be included in LCG 2.7.0. The change log can be found here. It is only necessary to upgrade the server and client rpms; the pnfs rpms are identical to those in the 1.6.6-3 release. srm-get-metadata is still not working.