Ed Upgrade 152 To 166
This page contains the notes that were kept during an upgrade of dCache 1.5.{2,3} to 1.6.6. The upgrade was carried out both on a test system, where a single node ran all services (including two pools), and on the production admin node. Both upgrades were successful. It is possible to operate dCache with a 1.6.6 admin node and 1.5.{2,3} pool nodes.
postgreSQL database upgrade
Previous versions of dCache used gdbm to store the PNFS database. From now on, postgreSQL will be used (removing the 2GB limit) and it is recommended that v8.1 (or above) be used (v7.4 is the standard on LCG machines). Some work needs to be done to migrate the old databases to the new format.
1. Make sure that the database is running. If not, start it by becoming the postgres user (su - postgres) and running:
$ postmaster -i -D /var/lib/pgsql/data/ > /tmp/logfile 2>&1 &
2. Still as the postgres user, dump the database and then stop it:
$ pg_dumpall > /bigdisk/db.out
$ pg_ctl stop -D /var/lib/pgsql/data
3. Rename the old database:
$ mv /var/lib/pgsql/data /var/lib/pgsql/data_7.4
4. Download the postgreSQL 8 rpms and install them. A version that is suitable for current versions of Scientific Linux 3 can be found at http://www.postgresql.org/ftp/binary/v8.1.0/linux/rpms/redhat/rhel-es-3.0/. Afterwards you should have:
$ rpm -qa | grep post
postgresql-libs-8.1.0-2PGDG
postgresql-devel-8.1.0-2PGDG
postgresql-python-8.1.0-2PGDG
postgresql-8.1.0-2PGDG
postgresql-docs-8.1.0-2PGDG
postgresql-test-8.1.0-2PGDG
postgresql-server-8.1.0-2PGDG
postgresql-jdbc-8.1.0-2PGDG
postgresql-contrib-8.1.0-2PGDG
postgresql-pl-8.1.0-2PGDG
5. Create a new database instance (must be done as user postgres):
$ initdb -D /var/lib/pgsql/data
6. Start postgres:
$ postmaster -i -D /var/lib/pgsql/data/ > /tmp/logfile 2>&1 &
7. Restore database from dump:
$ psql -e template1 < /bigdisk/db.out
8. Turn this option on in /var/lib/pgsql/data/postgresql.conf:
add_missing_from = on
and changed
#listen_addresses = 'localhost'   # what IP address(es) to listen on
#port = 5432
to
listen_addresses = '*'   # what IP address(es) to listen on
port = 5432
9. Alter /var/lib/pgsql/data/pg_hba.conf so that it contains the entries:
local   all   all                    trust
host    all   all   127.0.0.1/32    trust
host    all   all   ::1/128         trust
host    all   all   <host-ip>/32    trust
10. Reload the database (as root):
# /etc/init.d/postgresql reload
Steps 8 and 9 above were required to get the SRM cell online in dCache once the upgrade to 1.6.6 had been completed.
dCache 1.6.6 upgrade
In the case where only pool nodes need to be upgraded, it should simply be a case of following the procedure presented here, ensuring that NODE_TYPE leads to a pool install and that the pool_path entries are up to date. There should be no PNFS/postgreSQL issues to deal with. Problems may arise due to unmet dependencies on the pool nodes, particularly where they are not standard LCG nodes but run some alternative operating system. For example, RedHat Advanced Server 2.1 uses glibc v2.2.5, but dCache 1.6.6 requires glibc v2.3.
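As a quick pre-flight check on a pool node, the glibc version can be inspected before attempting the upgrade (a minimal sketch; the rpm query assumes an rpm-based system):

```shell
# Print the glibc version reported by the dynamic loader; dCache 1.6.6
# requires glibc v2.3 or later.
ldd --version | head -n 1
# On rpm-based systems the package database can also be queried directly:
rpm -q glibc 2>/dev/null || echo "rpm query failed - check glibc manually"
```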
For the upgrade, first back up the config and etc directories, just in case any configuration files are changed. Check that PNFS_OVERWRITE=no and that there is only one value assigned to NODE_TYPE.
NODE_TYPE = admin dummy #admin or pool
A line like the above is not sufficient: anything other than exactly admin after NODE_TYPE = will result in a pool node install.
$ cat node_config
NODE_TYPE = admin
DCACHE_BASE_DIR = /opt/d-cache
PNFS_ROOT = /pnfs
PNFS_INSTALL_DIR = /opt/pnfs.3.1.10/pnfs
PNFS_START = yes
PNFS_OVERWRITE = no
POOL_PATH = /opt/d-cache/etc
NUMBER_OF_MOVERS = 100
$ cat door_config
ADMIN_NODE wn4.epcc.ed.ac.uk
door      active
--------------------
GSIDCAP   yes
GRIDFTP   yes
SRM       yes
The dcache-opt service no longer exists, and the door_config file is read at startup, so there is no longer any need to run install_doors.sh.
Make sure each pool_path entry has 'no' in the third field.
$ cat pool_path
/dcache-storage1 14 no
/dcache-storage2 14 no
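The backup of the config and etc directories mentioned above can be sketched as follows (the destination path is an assumption, and the echo makes this a dry run; remove it to actually create the tarballs):

```shell
# Back up the dCache config and etc directories before upgrading,
# timestamped so repeated runs do not overwrite earlier backups.
stamp=$(date +%Y%m%d_%H%M%S)
for dir in config etc; do
  echo tar czf /root/dcache-$dir-$stamp.tar.gz -C /opt/d-cache $dir
done
```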
Stop all dCache services:
$ service dcache-pool stop
$ service dcache-opt stop
$ service dcache-core stop
Leave pnfs running. Remove the LCG dCache metapackage.
$ rpm -e lcg-SE_dcache-2.6.0-sl3
Download the dCache tarball and install the rpms:
$ wget http://www.dcache.org/downloads/releases/dcache-bundle-1.6.6-1.tgz
$ tar xzf dcache-bundle-1.6.6-1.tgz
$ rpm -Uvh dcache-server-1.6.6-1.i386.rpm dcache-client-1.6.6-1.i386.rpm
Run the install.sh script:
# /opt/d-cache/install/install.sh
[INFO] No 'SERVER_ID' set in 'node_config'. Using SERVER_ID=epcc.ed.ac.uk.
[INFO] Will be mounted to localhost:/fs by dcache-core start-up script.
[INFO] Link /pnfs/epcc.ed.ac.uk --> /pnfs/fs/usr already there.
[INFO] Creating link /pnfs/ftpBase --> /pnfs/fs which is used by the GridFTP door.
[INFO] Checking on a possibly existing dCache/PNFS configuration ...
[INFO] Found an existing dCache/PNFS configuration!
[INFO] Not allowed to overwrite existing configuration.
[INFO] Configuring pnfs export '/pnfsdoors' (needed from version 1.6.6 on) mountable by world.
[INFO] You may restrict access to this export to the GridFTP doors which are not on the admin node. See the documentation.
[INFO] Generating ssh keys:
Generating public/private rsa1 key pair.
Your identification has been saved in ./server_key.
Your public key has been saved in ./server_key.pub.
The key fingerprint is:
43:59:86:33:b4:d0:f7:ad:9e:d5:41:eb:c1:b9:03:ff root@wn4.epcc.ed.ac.uk
[INFO] Not overwriting pool at /dcache-storage1.
[INFO] Not overwriting pool at /dcache-storage2.
Now start the dCache services.
$ service dcache-core start
$ service dcache-pool start
Note that the names of the doors have changed to include the hostname; the httpd.batch file will probably have to be altered to reflect this. Note also that the start-up script for the optional components is not needed anymore, so it is probably best to remove it:
$ rm /opt/d-cache/bin/dcache-opt /etc/init.d/dcache-opt
rm: cannot lstat `/opt/d-cache/bin/dcache-opt': No such file or directory
$ ls /opt/d-cache/bin/
dcache-core  dcache-pool  grid-mapfile2dcache-kpwd
So dcache-opt was already absent from bin/; only the /etc/init.d/dcache-opt script needed removing.
Check the dCache functionality to make sure everything is working:
# ls /pnfs/epcc.ed.ac.uk/data/dteam/
20051109_094634.txt        srm1-180240.txt           srm3-124244.txt
20051111_122715.txt        srm1-20051109_104417.txt  srm3-180240.txt
20051111_123444.txt        srm1-20051109_105040.txt  srm3-20051109_104417.txt
20051111_123445.txt        srm1-20051111_115412.txt  srm3-20051109_105040.txt
greig_20051111_122420.txt  srm1-20051111_115413.txt  srm3-20051111_115412.txt
greig_20051111_122542.txt  srm1-20051111_115414.txt
srm1-121814.txt            srm3-121814.txt
Testing the new install of dCache shows that everything is working. There have been changes to the layout of the web monitoring page, although the information pages remain the same. Log file names have changed: there is no longer (e.g.) gridftpdoor.log; it is now gridftp-`hostname -s`Domain.log. SRM is not appearing on the web monitoring page, possibly due to httpd.batch config issues; this needs looking into. However, it all appears to be OK.
Your old config/PoolManager.conf will not be overwritten by the upgrade, and its format did not change, so it is fine to keep your old one. If you did not customize the pool manager configuration, make sure that the set costcuts line reads:
set costcuts -idle=0.0 -p2p=2.0 -alert=0.0 -halt=0.0 -fallback=0.0
Prior versions installed a config/PoolManager.conf with -idle=1.0 which will lead to undesired behaviour of the pool manager.
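A quick way to spot the problematic -idle=1.0 default in an existing configuration (a sketch; the path is the standard install location and may differ on your system):

```shell
# Warn if PoolManager.conf still carries the old -idle=1.0 default.
CONF=${CONF:-/opt/d-cache/config/PoolManager.conf}
if grep -q -- '-idle=1.0' "$CONF" 2>/dev/null; then
  echo "WARNING: $CONF still sets -idle=1.0; change it to -idle=0.0"
else
  echo "costcuts idle setting looks OK (or file not present)"
fi
```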
PNFS Companion
The 'PNFS companion' contains only the 'cacheinfo', i.e. the locations of files on the pools (not on tape). This information used to be stored in PNFS (level 2, e.g. cat '.(use)(2)(myfile)'), but is not there anymore. The 'PNFS companion' was devised to speed up 'cacheinfo' queries, which occur very frequently. All the other information that used to be stored in PNFS (e.g. the filesystem structure) is now stored in the pnfsserver databases (owned by the user pnfsserver). Following the instructions in the dCache book:
[root]# createdb -U srmdcache companion
Initially this command returned a permission denied error, because the postgreSQL user srmdcache did not have permission to create new databases. Within postgreSQL, use this command to view which users (roles) have been created and what permissions they have:
postgres=# SELECT * FROM pg_user;
  usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 postgres   |       10 | t           | t        | t         | ******** |          |
 srmdcache  |    16384 | f           | f        | f         | ******** |          |
 pnfsserver |    16385 | t           | f        | f         | ******** |          |
(3 rows)
This clearly shows that srmdcache could not create new databases. Change permissions by:
postgres=# ALTER USER srmdcache CREATEDB;
ALTER ROLE
postgres=# SELECT * FROM pg_user;
  usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 postgres   |       10 | t           | t        | t         | ******** |          |
 pnfsserver |    16385 | t           | f        | f         | ******** |          |
 srmdcache  |    16384 | t           | f        | f         | ******** |          |
(3 rows)
Now try again.
[root]# createdb -U srmdcache companion CREATE DATABASE
You can view all databases in the system:
postgres=# \l
        List of databases
   Name    |   Owner    | Encoding
-----------+------------+-----------
 admin     | pnfsserver | SQL_ASCII
 alice     | pnfsserver | SQL_ASCII
 atlas     | pnfsserver | SQL_ASCII
 cms       | pnfsserver | SQL_ASCII
 companion | srmdcache  | UTF8
 data1     | pnfsserver | SQL_ASCII
 dcache    | postgres   | SQL_ASCII
 dteam     | pnfsserver | SQL_ASCII
 lhcb      | pnfsserver | SQL_ASCII
 postgres  | postgres   | UTF8
 replicas  | srmdcache  | SQL_ASCII
 sixt      | pnfsserver | SQL_ASCII
 template0 | postgres   | UTF8
 template1 | postgres   | UTF8
(14 rows)
Now finish off this step:
[root]# psql -U srmdcache companion -f /opt/d-cache/etc/psql_install_companion.sql
psql:/opt/d-cache/etc/psql_install_companion.sql:6: NOTICE:  CREATE TABLE / UNIQUE will create implicit index "cacheinfo_pnfsid_key" for table "cacheinfo"
CREATE TABLE
CREATE INDEX
CREATE INDEX
# service pnfs start
Add
cacheInfo=companion
to /opt/d-cache/config/dCacheSetup and then start all dCache services.
After this change the dCache system will not be aware of any files stored on the pools. To make it aware again, go through the following steps. Since this will take a while and will put a considerable load on the PnfsManager, take care that it is done with one pool at a time. In the admin interface, go to a pool:
(local) admin > cd <hostname>_1
and issue the command
(<poolname>) admin > pnfs register
Then go to the pnfs manager:
(<poolname>) admin > ..
(local) admin > cd PnfsManager
(PnfsManager) admin > info
...
 Threads (4) Queue
    [0]  10
    [1]  12
    [2]   9
    [3]  13
...
and wait until the value for all four queues is zero. Then go to the next pool and repeat the process. After this is done, the upgrade is complete.
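The per-pool registration can be scripted by generating the admin-shell command sequence and piping it into the admin interface over ssh. The pool names below are placeholders, and the ssh invocation (port 22223, blowfish cipher) is the usual dCache admin login; adapt both to your site, and remember to do one pool at a time:

```shell
# Emit the admin-shell commands to register each pool's files in PNFS.
# Pipe the output into e.g.: ssh -c blowfish -p 22223 admin@<admin-node>
for pool in wn4_1 wn4_2; do
  printf 'cd %s\npnfs register\n..\n' "$pool"
done
echo logoff
```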
Migration of PNFS databases from gdbm to postgreSQL
This step can be carried out independently of the dCache 1.6.6 upgrade, but should not be done concurrently with it. First prepare the PostgreSQL server by creating a database user for the pnfs server; it must have permission to create databases. It is suggested to call it pnfsserver:
$ su - postgres
$ createuser --no-adduser --createdb pnfsserver
Shall the new role be allowed to create more new roles? (y/n) y
CREATE ROLE
You can check this via:
postgres=# SELECT * FROM pg_user;
  usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 postgres   |       10 | t           | t        | t         | ******** |          |
 srmdcache  |    16384 | t           | f        | f         | ******** |          |
 pnfsserver |    17700 | t           | f        | f         | ******** |          |
(3 rows)
Find the location of the databases by:
$ . /usr/etc/pnfsSetup
$ PATH=${pnfs}/tools:$PATH
$ cat ${database}/D-* | cut -f 5 -d ':'
/opt/pnfsdb/pnfs/databases/admin
/opt/pnfsdb/pnfs/databases/data1
/opt/pnfsdb/pnfs/databases/alice
/opt/pnfsdb/pnfs/databases/atlas
/opt/pnfsdb/pnfs/databases/dteam
/opt/pnfsdb/pnfs/databases/cms
/opt/pnfsdb/pnfs/databases/lhcb
/opt/pnfsdb/pnfs/databases/sixt
Now we want to make a backup of the databases and check their integrity. First of all, stop dCache and PNFS:
$ service dcache-core stop
$ service pnfs stop
(if you do not stop PNFS you will not be able to get a lock on the PNFS database to perform the check). From your home directory:
$ mkdir tmp-pnfs-scan
$ md3tool scan /opt/pnfsdb/pnfs/databases/admin > tmp-pnfs-scan/admin.scan 2>&1
$ md3tool scandir /opt/pnfsdb/pnfs/databases/admin > tmp-pnfs-scan/admin.scandir 2>&1
$ md3tool scandirs /opt/pnfsdb/pnfs/databases/admin > tmp-pnfs-scan/admin.scandirs 2>&1
and repeat for the other databases. Check the contents of these files by running commands like:
$ grep -v "^Scan" *.scandir
admin.scandir: Scanning DB id : 0
admin.scandir: External Reference at 0 : 000100000000000000001060 for data
alice.scandir: Scanning DB id : 2
atlas.scandir: Scanning DB id : 3
cms.scandir: Scanning DB id : 5
data1.scandir: Scanning DB id : 1
data1.scandir: External Reference at 0 : 000400000000000000001060 for dteam
data1.scandir: External Reference at 0 : 000700000000000000001060 for sixt
data1.scandir: External Reference at 0 : 000200000000000000001060 for alice
data1.scandir: External Reference at 0 : 000500000000000000001060 for cms
data1.scandir: External Reference at 0 : 000300000000000000001060 for atlas
data1.scandir: External Reference at 0 : 000600000000000000001060 for lhcb
dteam.scandir: Scanning DB id : 4
lhcb.scandir: Scanning DB id : 6
sixt.scandir: Scanning DB id : 7
The contents of data1.scandir make sense, since the VO databases were created as subdirectories of the /data directory; the PNFS IDs are the roots of these new databases. (Completely new directories should really have been created.) If your output is similar to the above, you can continue with the conversion to the postgreSQL version of pnfs.
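Repeating the three md3tool scans over all eight databases can be scripted; the sketch below is a dry run (remove the echo to execute), with the database names taken from this install:

```shell
# Run the three md3tool integrity scans over every PNFS database.
mkdir -p tmp-pnfs-scan
for db in admin data1 alice atlas dteam cms lhcb sixt; do
  for scan in scan scandir scandirs; do
    echo "md3tool $scan /opt/pnfsdb/pnfs/databases/$db > tmp-pnfs-scan/$db.$scan 2>&1"
  done
done
```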
Updating the pnfs Software
Backup databases:
$ mv /opt/pnfsdb/pnfs/databases pnfsdb-backup/databases
$ ls pnfsdb-backup/databases/
admin  alice  atlas  cms  data1  dteam  lhcb  sixt
$ cp /usr/etc/pnfsSetup pnfsdb-backup/
Remove the old version of the pnfs software:
$ apt-get remove pnfs
Recreate the databases directory with empty database files:
$ mkdir /opt/pnfsdb/pnfs/databases
$ cd /opt/pnfsdb/pnfs/databases
$ touch admin data1 dteam alice atlas cms lhcb sixt
Adjust the central configuration file /usr/etc/pnfsSetup: Change the location of the pnfs software in the line
pnfs=/opt/pnfs.3.1.10/pnfs
to
pnfs=/opt/pnfs
and add a line reading:
export dbConnectString="user=pnfsserver"
Install the pnfs-postgresql package.
$ rpm -ivh pnfs-postgresql-3.1.10-1.i386.rpm
Conversion of the Databases
Source the pnfs environment:
$ . /usr/etc/pnfsSetup
$ PATH=${pnfs}/tools:$PATH
Run the migration script on all of the databases (admin, data1...):
$ gdbm2psql -r -o -i pnfsdb-backup/databases/admin
Connection string: dbname=template1 user=pnfsserver
Connection string: dbname=admin user=pnfsserver
WARNING:  there is no transaction in progress
Put record #1 into the database...time=0
key 000000000000xxxxxxxxxxxx is found at 0
key 000000000000xxxxxxxxxxxx is found at 1
key 000000000000xxxxxxxxxxxx is found at 2
key 000000000000xxxxxxxxxxxx is found at 3
key 000000000000xxxxxxxxxxxx is found at 4
...
key 000000000000xxxxxxxxxxxx is found at 68
There are 69 records in the database.
For each database created in the conversion, create the database key by running:
$ psql -U pnfsserver -c 'ALTER TABLE pnfs ADD primary key (pnfsid)' admin
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "pnfs_pkey" for table "pnfs"
ALTER TABLE
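The conversion and key creation can be looped over all eight databases rather than typed by hand; again a dry-run sketch (remove the echo to execute; database names from this install):

```shell
# Convert each gdbm backup to postgreSQL, then add the primary key on
# pnfsid so lookups use the implicit index.
for db in admin data1 alice atlas dteam cms lhcb sixt; do
  echo "gdbm2psql -r -o -i pnfsdb-backup/databases/$db"
  echo "psql -U pnfsserver -c 'ALTER TABLE pnfs ADD primary key (pnfsid)' $db"
done
```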
Testing the Converted Databases
This must be done before pnfs is started. I have a small installation, so should be able to use the conversion-scan.sh script. All we are doing here is comparing the outputs of the tests (i.e. md3tool) on the old and new databases. conversion-scan.sh is in the pnfs tools directory (on the PATH after sourcing pnfsSetup).
$ conversion-scan.sh pnfsdb-backup/databases /opt/pnfsdb/pnfs/databases
This produces lots of output, the last line being:
Conversion check finished sucessfully. Both sets of databases are of identical content.
So everything seems to have gone OK with the migration. Now start pnfs and dCache:
$ /opt/pnfs/bin/pnfs start
Starting pnfs services (PostgreSQL version):
Shmcom : Installed 8 Clients and 8 Servers
Starting database server for admin (/opt/pnfsdb/pnfs/databases/admin) ... O.K.
Starting database server for data1 (/opt/pnfsdb/pnfs/databases/data1) ... O.K.
Starting database server for alice (/opt/pnfsdb/pnfs/databases/alice) ... O.K.
Starting database server for atlas (/opt/pnfsdb/pnfs/databases/atlas) ... O.K.
Starting database server for dteam (/opt/pnfsdb/pnfs/databases/dteam) ... O.K.
Starting database server for cms (/opt/pnfsdb/pnfs/databases/cms) ... O.K.
Starting database server for lhcb (/opt/pnfsdb/pnfs/databases/lhcb) ... O.K.
Starting database server for sixt (/opt/pnfsdb/pnfs/databases/sixt) ... O.K.
Waiting for dbservers to register ... Ready
Starting Mountd : pmountd
Starting nfsd : pnfsd
Note the new location of pnfs. Remove the old symbolic link /etc/init.d/pnfs and create a new one:
$ ln -s /opt/pnfs/bin/pnfs /etc/init.d/pnfs
Upgrade to 1.6.6-3 (bug fix)
The change-log can be found here. 1.6.6-3 contains a new information provider for LCG. For this to operate properly, you first need to get and install a couple of updated rpms for the GIP:
# rpm -Uvh lcg-info-generic-1.0.22-1.noarch.rpm lcg-info-templates-1.0.14-1.noarch.rpm
Preparing...                ########################################### [100%]
   1:lcg-info-generic       ########################################### [ 50%]
   2:lcg-info-templates     ########################################### [100%]
Made backups of the /opt/d-cache/etc and /opt/d-cache/config directories, just in case anything was overwritten during the upgrade.
# rpm -Uvh dcache-server-1.6.6-3.i386.rpm dcache-client-1.6.6-3.i386.rpm pnfs-postgresql-3.1.10-3.i386.rpm
Comparing the md5sums of the tarred-up old and new etc and config directories shows that there have been changes:
# md5sum dcache-*
95039373e6d8e2b2aef215ca99de7298  dcache-config-NEW.tar
2e8b438acadb6b35adae1076f6f1c302  dcache-config.tar
fc2182d66285d9a46cfe57a0f6ac073d  dcache-etc-NEW.tar
9780eaa417bb984cc335bccf04b86c77  dcache-etc.tar
Changes were made in psql_install_replicas.sql; these relate to the improved performance of the replica manager obtained by adding an index to the database. This has to be applied manually when upgrading:
# psql -U srmdcache -d replicas -f /opt/d-cache/etc/psql_install_replicas.sql
You are now connected as new user "srmdcache".
psql:/opt/d-cache/etc/psql_install_replicas.sql:12: ERROR:  schema "proc" already exists
SET
psql:/opt/d-cache/etc/psql_install_replicas.sql:26: ERROR:  relation "replicas" already exists
REVOKE
psql:/opt/d-cache/etc/psql_install_replicas.sql:46: ERROR:  relation "pools" already exists
REVOKE
SET
psql:/opt/d-cache/etc/psql_install_replicas.sql:65: ERROR:  relation "replicas" already exists
SET
psql:/opt/d-cache/etc/psql_install_replicas.sql:78: ERROR:  relation "action" already exists
psql:/opt/d-cache/etc/psql_install_replicas.sql:90: ERROR:  relation "heartbeat" already exists
psql:/opt/d-cache/etc/psql_install_replicas.sql:99: ERROR:  multiple primary keys for table "pools" are not allowed
psql:/opt/d-cache/etc/psql_install_replicas.sql:108: ERROR:  multiple primary keys for table "replicas" are not allowed
psql:/opt/d-cache/etc/psql_install_replicas.sql:117: ERROR:  multiple primary keys for table "heartbeat" are not allowed
CREATE INDEX
CREATE INDEX
You can safely ignore the errors produced by this command. Although dCacheSetup.template and node_config.template have slight changes to deal with the information provider, the default values should be OK and it should not be necessary to change dCacheSetup and node_config from the previous install. The new dCacheSetup.template also contains more comments regarding the meaning and structure of the file contents and a new way of configuring the various postgres databases that dCache uses.
Now start up the pnfs and dCache services. No problems were experienced here; srmGet, srmPut and srmCopy requests were all successful after the upgrade.
It should be noted that the srm-get-metadata command in the dCache client does not work, returning a java.lang.NullPointerException.
Information Provider
Follow the instructions in the dCache book. New versions of the GIP are required.
There is currently a problem with the dCache GIP plugin which is being investigated. It is recommended to downgrade to v1.0.20 of the GIP until the problem is resolved. After doing this, follow the steps in the dCache FAQ regarding the publishing of storage element information.
Upgrade to 1.6.6-4 (bug fix)
This version will be the one included in LCG 2.7.0. The change log can be found here. It is only necessary to upgrade the server and client rpms; the pnfs rpms are identical to those in the 1.6.6-3 release. srm-get-metadata is still not working.