Glite3.0 RC2 dCache Experiences


Following the instructions from GLite 3.0 RC2 Testing (https://uimon.cern.ch/twiki/bin/view/LCG/GLite301), a dCache SE is being installed on wn4.epcc.ed.ac.uk. This has been done as part of the UK GLite 3.0 testing. The testing of this release will occur in two parts.

1. Upgrade the existing LCG 2.7.0 installation of dCache (1.6.6-5), which uses the upgraded postgreSQL PNFS backend, to the gLite 3.0 release of dCache.

2. Starting from a fresh SL3.0.5 build of wn4, install the gLite 3.0 version of dCache and then upgrade to the postgreSQL backend.

dCache upgrade

Version of dCache before upgrade to gLite:

[root@wn4 gcowan]# rpm -qa|grep cache
lcg-info-dynamic-dcache-1.0.8-1_sl3
dcache-client-1.6.6-5
dcache-server-1.6.6-5
d-cache-lcg-5.1.0-1
[root@wn4 examples]# rpm -qa|grep postgre
postgresql-libs-8.1.0-2PGDG
postgresql-devel-8.1.0-2PGDG
postgresql-python-8.1.0-2PGDG
pnfs-postgresql-3.1.10-3
postgresql-tcl-8.0.4-2PGDG
postgresql-8.1.0-2PGDG
postgresql-docs-8.1.0-2PGDG
postgresql-test-8.1.0-2PGDG
postgresql-slony1-engine-1.1.5_RC3-1_PG8.1.0
postgresql-server-8.1.0-2PGDG
postgresql-jdbc-8.1.0-2PGDG
postgresql-contrib-8.1.0-2PGDG
postgresql-pl-8.1.0-2PGDG
[root@wn4 examples]# rpm -qa|grep pnfs
pnfs-postgresql-3.1.10-3

Package Installation

Configure /etc/apt/sources.list.d/glite.list to have the line:

rpm http://lxb2042.cern.ch/gLite/APT/R3.0-pps rhel30 externals Release3.0 updates
[root@wn4 gcowan]# apt-get install glite-SE_dcache
Reading Package Lists... Done
Building Dependency Tree... Done
The following extra packages will be installed:
  d-cache-lcg (6.1.0-1)
  edg-mkgridmap (2.6.1-1_sl3)
  edg-mkgridmap-conf (2.6.1-1_sl3)
  edg-utils-system (1.8.2-1_sl3)
  fetch-crl (2.0-1)
  glite-rgma-api-c (5.0.6-1)
  glite-rgma-api-cpp (5.0.11-1)
  glite-rgma-api-java (5.0.3-1)
  glite-rgma-api-python (5.0.4-1)
  glite-rgma-base (5.0.4-1)
  glite-rgma-command-line (5.0.3-1)
  glite-rgma-gin (5.0.3-1)
  glite-rgma-log4cpp (5.0.3-1)
  glite-rgma-log4j (5.0.2-1)
  glite-rgma-stubs-servlet-java (5.0.5-1)
  glite-yaim (3.0.0-5)
  lcg-expiregridmapdir (2.0.0-1)
  lcg-info-dynamic-classicSE (1.0.8-1_sl3)
  lcg-info-dynamic-dcache (1.0.9-1_sl3)
  lcg-info-generic (1.0.22-1_sl3)
  lcg-info-templates (1.0.15-1_sl3)
  lcg-mon-gridftp (2.0.3-1_sl3)
  lcg-mon-logfile-common (1.0.6-1_sl3)
  lcg-profile (1.0.6-1)
  lcg-service-proxy (1.0.3-1_sl3)
  lcg-version (3.0.0-1)
  perl-Net-LDAP (0.2701-1.dag.rhel3)
  perl-Net-SSLeay (1.23-0.dag.rhel3)
  postgresql-tcl (8.0.4-2PGDG)
  uberftp-client (VDT1.2.2rh9_LCG-2)
The following packages will be upgraded
  d-cache-lcg (5.1.0-1 => 6.1.0-1)
  edg-mkgridmap (2.5.0-1_sl3 => 2.6.1-1_sl3)
  edg-mkgridmap-conf (2.5.0-1_sl3 => 2.6.1-1_sl3)
  edg-utils-system (1.8.0-1_sl3 => 1.8.2-1_sl3)
  glite-rgma-api-c (4.1.11-1 => 5.0.6-1)
  glite-rgma-api-cpp (4.1.14-1 => 5.0.11-1)
  glite-rgma-api-java (4.1.5-1 => 5.0.3-1)
  glite-rgma-api-python (4.1.13-1 => 5.0.4-1)
  glite-rgma-base (4.1.19-1 => 5.0.4-1)
  glite-rgma-command-line (4.1.13-1 => 5.0.3-1)
  glite-rgma-gin (4.1.17-1 => 5.0.3-1)
  glite-rgma-log4cpp (4.1.4-0 => 5.0.3-1)
  glite-rgma-log4j (4.1.5-1 => 5.0.2-1)
  glite-rgma-stubs-servlet-java (4.1.12-1 => 5.0.5-1)
  lcg-expiregridmapdir (1.0.0-2 => 2.0.0-1)
  lcg-info-dynamic-classicSE (1.0.7-1_sl3 => 1.0.8-1_sl3)
  lcg-info-dynamic-dcache (1.0.8-1_sl3 => 1.0.9-1_sl3)
  lcg-info-generic (1.0.22-1 => 1.0.22-1_sl3)
  lcg-info-templates (1.0.14-1 => 1.0.15-1_sl3)
  lcg-mon-gridftp (1.1.0-1_sl3 => 2.0.3-1_sl3)
  lcg-profile (1.0.4-1 => 1.0.6-1)
  lcg-version (2.6.0-1 => 3.0.0-1)
  uberftp-client (VDT1.2.2rh9_LCG-1 => VDT1.2.2rh9_LCG-2)
The following packages will be REPLACED:
  glite-rgma-system-tests (4.1.7-1) (by glite-rgma-base)
  lcg-yaim (2.6.0-9) (by glite-yaim)
  perl-ldap (0.31-sl3) (by perl-Net-LDAP)
The following NEW packages will be installed:
  fetch-crl (2.0-1)
  glite-SE_dcache (3.0.0-2)
  glite-yaim (3.0.0-5)
  lcg-mon-logfile-common (1.0.6-1_sl3)
  lcg-service-proxy (1.0.3-1_sl3)
  perl-Net-LDAP (0.2701-1.dag.rhel3)
  perl-Net-SSLeay (1.23-0.dag.rhel3)
  postgresql-tcl (8.0.4-2PGDG)

Now have the following:

[root@wn4 examples]# rpm -qa|grep cache
lcg-info-dynamic-dcache-1.0.9-1_sl3
glite-SE_dcache-3.0.0-2
dcache-client-1.6.6-5
d-cache-lcg-6.1.0-1
dcache-server-1.6.6-5

So the package installation appeared to go without any problems. Now onto the configuration step.

dCache Configuration

Set up YAIM with all the usual parameters. I had to move users.conf and groups.conf from /opt/glite/yaim/examples to /opt/glite/yaim/etc, but this was mentioned in the installation instructions, so no big deal.
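
For reference, that manual step amounts to the following (paths exactly as described above):

mv /opt/glite/yaim/examples/users.conf  /opt/glite/yaim/etc/
mv /opt/glite/yaim/examples/groups.conf /opt/glite/yaim/etc/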

The complete output from the configuration step is available at glite-3.0 dCache upgrade configuration. In summary, the configuration step broke the previously working dCache installation. PostgreSQL restarted after the configuration, but something is wrong with PNFS:

[root@wn4 gcowan]# service pnfs start
Starting pnfs services (PostgreSQL version):
 Shmcom : Installed 8 Clients and 8 Servers
 Starting database server for admin (/opt/pnfsdb/pnfs/databases/admin) ... Failed
 Starting database server for data1 (/opt/pnfsdb/pnfs/databases/data1) ... Failed
 Starting database server for alice (/opt/pnfsdb/pnfs/databases/alice) ... Failed
 Starting database server for atlas (/opt/pnfsdb/pnfs/databases/atlas) ... Failed
 Starting database server for dteam (/opt/pnfsdb/pnfs/databases/dteam) ... Failed
 Starting database server for cms (/opt/pnfsdb/pnfs/databases/cms) ... Failed
 Starting database server for lhcb (/opt/pnfsdb/pnfs/databases/lhcb) ... Failed
 Starting database server for sixt (/opt/pnfsdb/pnfs/databases/sixt) ... Failed
 Waiting for dbservers to register ... Ready
 Starting Mountd : pmountd
 Starting nfsd : pnfsd
mount: RPC: Unable to receive; errno = Connection refused

This is no surprise: if I list the available postgreSQL databases, the VO databases that were previously present are no longer there:

[root@wn4 server]# su - postgres
-bash-2.05b$ psql
Welcome to psql 8.1.0, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

postgres=# \l
        List of databases
   Name    |   Owner   | Encoding
-----------+-----------+----------
 billing   | srmdcache | UTF8
 companion | srmdcache | UTF8
 dcache    | srmdcache | UTF8
 postgres  | postgres  | UTF8
 replicas  | srmdcache | UTF8
 template0 | postgres  | UTF8
 template1 | postgres  | UTF8
(7 rows)

Has the YAIM configuration caused the postgreSQL setup to be overwritten? This shouldn't have happened since I set

RESET_DCACHE_RDBMS=no

in site-info.def. However, looking at the workings of config_pgsql and the configuration output, I can see what has happened. config_pgsql contains a function yaim_state_reset_postgresql():

yaim_state_reset_postgresql() {
echo start yaim_state_reset_postgresql
 # Returns 0 when postgres is fine
 # Returns 1 when postgres should be removed
 local result
 local rcrmpnfs
 result=0
                                                                               
 psql -U postgres -l 2>&1 > /dev/null
 rcrmpnfs=$?
 if [ "X$RESET_DCACHE_RDBMS" != Xyes ] ; then
   if [ "X$rcrmpnfs" != X0 ] ; then
       RESET_DCACHE_RDBMS=yes
   fi
 fi
 if [ "X$RESET_DCACHE_RDBMS" == Xyes ] ; then
   result=1
 fi
 echo stop yaim_state_reset_postgresql
 return $result
}

In addition to checking the value of the variable in site-info.def, it runs the command psql -U postgres -l, which lists the databases available to the postgres user, and uses the return code of this command to determine whether postgres is running or not. If it is running, the script decides that the setup should not be re-configured. However, in my case I had stopped postgres, pnfs and dcache before running the installation and configuration steps. Therefore, when psql -U postgres -l was run, no postgres server was available, the command failed with psql: could not connect to server: ...., and as a result the postgres setup was reset.

I'm not sure that this is the best way of doing things. If a sys-admin states in site-info.def that they do not want the postgres configuration reset, they would not expect YAIM to decide for them that it should be reset anyway. Unless a good case for this behaviour can be made, I would recommend taking it out of YAIM.
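
For comparison, a minimal sketch of the behaviour I would expect, where only an explicit RESET_DCACHE_RDBMS=yes triggers a reset (this is an illustration only, not the actual YAIM code):

yaim_state_reset_postgresql() {
 # Sketch: reset the postgres setup when, and only when, the sys-admin
 # has asked for it in site-info.def.
 if [ "X$RESET_DCACHE_RDBMS" == "Xyes" ] ; then
   return 1   # 1 => postgres setup will be removed and re-created
 fi
 return 0     # 0 => leave the existing postgres setup alone
}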

Fresh dCache install

dCache will be installed on 2 nodes, wn3.epcc.ed.ac.uk and wn4.epcc.ed.ac.uk. All dCache admin services, a pool, a GridFTP door and a GSIDCap door will be on wn4; wn3 will run a pool and a GridFTP door. This information was encoded in the relevant parameters in site-info.def:

DCACHE_ADMIN="wn4.$MY_DOMAIN"
DCACHE_POOLS="wn4.$MY_DOMAIN:/storage1 wn3.$MY_DOMAIN:/storage2"
DCACHE_PORT_RANGE="50000,52000"
DCACHE_DOOR_SRM="wn4.$MY_DOMAIN:8443"
DCACHE_DOOR_GSIFTP="wn4.$MY_DOMAIN:2811 wn3.$MY_DOMAIN:2811"
DCACHE_DOOR_GSIDCAP="wn4.$MY_DOMAIN"
# DCACHE_DOOR_DCAP="door_node1[:port] door_node2[:port]"
RESET_DCACHE_CONFIGURATION=yes
RESET_DCACHE_PNFS=yes
RESET_DCACHE_RDBMS=yes

In summary, dCache is not working after the installation and configuration steps. What follows is detailed information about what is and is not running in the system immediately after the YAIM configuration.

wn4 (admin node)

$ /opt/glite/yaim/scripts/install_node /opt/glite/yaim/etc/site-info.def glite-SE_dcache_gdbm | tee /tmp/dcache_admin_install.txt
$ /opt/glite/yaim/scripts/configure_node /opt/glite/yaim/etc/site-info.def SE_dcache | tee /tmp/dcache_admin_config.txt

1. The dcache-core and dcache-pool services are up and running and the links exist in /etc/init.d:

[root@wn4 gcowan]# /sbin/chkconfig --list|grep dcache
dcache-core     0:off   1:off   2:on    3:on    4:on    5:on    6:off
dcache-pool     0:off   1:off   2:on    3:on    4:on    5:on    6:off
[root@wn4 gcowan]# netstat -lntp | grep java
tcp        0      0 0.0.0.0:22125               0.0.0.0:*                   LISTEN      7328/java
tcp        0      0 0.0.0.0:22223               0.0.0.0:*                   LISTEN      7408/java
tcp        0      0 0.0.0.0:22128               0.0.0.0:*                   LISTEN      7930/java
tcp        0      0 0.0.0.0:2288                0.0.0.0:*                   LISTEN      7487/java
tcp        0      0 0.0.0.0:33273               0.0.0.0:*                   LISTEN      7158/java
tcp        0      0 0.0.0.0:8443                0.0.0.0:*                   LISTEN      8042/java
tcp        0      0 0.0.0.0:2811                0.0.0.0:*                   LISTEN      7839/java
tcp        0      0 0.0.0.0:22111               0.0.0.0:*                   LISTEN      7748/java

This shows that dCache is listening on the correct ports for SRM (8443), GridFTP (2811) and GSIDCap (22128). It is also listening on 22125, which is for a DCap door, even though I said in site-info.def that I didn't want one. The web monitoring interface (2288) is operational, so all of the required services appear to be running. However, see point 2.

2. The web interface shows that there are two operational pools, wn4_1 and wn4_2, although I only specified one pool on wn4 in site-info.def. If I list the contents of the wn4 root directory I see /storage1 and /storage2. /storage1 is a separate partition that I created on the machine; /storage2 has been created by YAIM and contains the usual contents of a dCache pool directory, i.e. pool and pool/{control,data,setup,setup.orig,setup.temp}. I have not looked into this yet, but the config_dcache function must be parsing site-info.def incorrectly and not checking the hostname against the pool list (see the sketch below).
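
A minimal sketch of the kind of hostname check that appears to be missing, using the DCACHE_POOLS format from site-info.def above (the loop is purely illustrative, not the real config_dcache code):

MY_HOST=$(hostname -f)          # e.g. wn4.epcc.ed.ac.uk
for pool in $DCACHE_POOLS ; do
  pool_host=${pool%%:*}         # host part of the entry, e.g. wn4.epcc.ed.ac.uk
  pool_path=${pool#*:}          # path part of the entry, e.g. /storage1
  if [ "$pool_host" = "$MY_HOST" ] ; then
    echo "would create a pool in $pool_path on this node"
  fi
done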

3. There is an entry for pnfs in /etc/init.d but PNFS was not running after the configuration.

[root@wn4 gcowan]# /sbin/chkconfig --list|grep pnfs
pnfs            0:off   1:off   2:on    3:on    4:on    5:on    6:off

/pnfs exists, but I cannot list its contents. If I try, the machine hangs. However, I am able to start PNFS by hand, and after that I can list the contents of /pnfs etc. and all looks OK. But, running df shows the following:

[root@wn4 gcowan]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3             18297956   1597252  15771204  10% /
none                    481428         0    481428   0% /dev/shm
/dev/hda1             19153528     32880  18147680   1% /storage1
localhost:/fs           400000     80000    284000  22% /pnfs/fs
localhost:/fs           400000     80000    284000  22% /pnfs/fs

This shows two identical entries for the /pnfs/fs mount, which seems strange. I am not sure what is going on here. If I stop pnfs and then run df again, the machine takes a long time to finish outputting the first 4 lines above.
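
One way to confirm that the filesystem really has been mounted twice, which is my assumption about the cause of the duplicate entries:

grep /pnfs/fs /proc/mounts   # two identical lines => a stacked duplicate mount
umount /pnfs/fs              # removes the most recently stacked mount
df                           # should now show a single /pnfs/fs entry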

4. All relevant PNFS databases have been created:

[root@wn4 gcowan]# . /usr/etc/pnfsSetup
[root@wn4 gcowan]# export PATH=$PATH:/opt/pnfs.3.1.10/pnfs/tools/
[root@wn4 gcowan]# mdb show
  ID   Name         Type    Status       Path
----------------------------------------------
  0    admin         r     enabled (x)   /opt/pnfsdb/pnfs/databases/admin
  1    data1         r     enabled (x)   /opt/pnfsdb/pnfs/databases/data1
  2    atlas         r     enabled (x)   /opt/pnfsdb/pnfs/databases//atlas
  3    atlas.generated    r     enabled (x)   /opt/pnfsdb/pnfs/databases//atlas.generated
  4    alice         r     enabled (x)   /opt/pnfsdb/pnfs/databases//alice
  5    alice.generated    r     enabled (x)   /opt/pnfsdb/pnfs/databases//alice.generated
  6    lhcb          r     enabled (x)   /opt/pnfsdb/pnfs/databases//lhcb
  7    lhcb.generated    r     enabled (x)   /opt/pnfsdb/pnfs/databases//lhcb.generated
  8    cms           r     enabled (x)   /opt/pnfsdb/pnfs/databases//cms
  9    cms.generated    r     enabled (x)   /opt/pnfsdb/pnfs/databases//cms.generated
 10    dteam         r     enabled (x)   /opt/pnfsdb/pnfs/databases//dteam
 11    dteam.generated    r     enabled (x)   /opt/pnfsdb/pnfs/databases//dteam.generated
 12    biomed        r     enabled (x)   /opt/pnfsdb/pnfs/databases//biomed
 13    biomed.generated    r     enabled (x)   /opt/pnfsdb/pnfs/databases//biomed.generated

This should all be OK, since the number of servers defined in /usr/etc/pnfsSetup has been increased:

shmservers=24

But why do all the databases say enabled (x)? I thought they should all have an enabled (r) entry.

5. I'm a bit confused about what the configuration steps are trying to do. The above step shows that the databases have been configured for gdbm. You can tell this by running:

[root@wn4 gcowan]# file /opt/pnfsdb/pnfs/databases/atlas
/opt/pnfsdb/pnfs/databases/atlas: GNU dbm 1.x or ndbm database, little endian

If they were for postgres, the database files /opt/pnfsdb/pnfs/databases/* would be empty, acting only as placeholders. But YAIM has installed postgres (not the latest version, I might add):

[root@wn4 gcowan]# rpm -qa|grep postgre
postgresql-libs-8.0.4-2PGDG
postgresql-docs-8.0.4-2PGDG
postgresql-python-8.0.4-2PGDG
postgresql-8.0.4-2PGDG
postgresql-jdbc-8.0.4-2PGDG
postgresql-contrib-8.0.4-2PGDG
postgresql-server-8.0.4-2PGDG
postgresql-tcl-8.0.4-2PGDG
postgresql-devel-8.0.4-2PGDG
postgresql-pl-8.0.4-2PGDG
postgresql-test-8.0.4-2PGDG

and has partially configured it for use with dCache. If I list the available postgres databases:

[root@wn4 gcowan]# psql -U postgres -l
        List of databases
   Name    |   Owner   | Encoding
-----------+-----------+----------
 billing   | srmdcache | UNICODE
 companion | srmdcache | UNICODE
 dcache    | srmdcache | UNICODE
 replicas  | srmdcache | UNICODE
 template0 | postgres  | UNICODE
 template1 | postgres  | UNICODE
(6 rows)

There is no postgres database, which looks wrong (although the default postgres database was only introduced with PostgreSQL 8.1, so its absence may simply be down to these being 8.0.4 packages), nor are there any VO-specific databases, yet the configuration is partly complete. Are all of these databases required when using the GDBM version of PNFS with dCache? In addition, postgresql is set to be turned on by default at boot time, but this is OK since it is needed for this version of dCache to operate properly.

[root@wn4 gcowan]# /sbin/chkconfig --list|grep postgr
postgresql      0:off   1:off   2:on    3:on    4:on    5:on    6:off

Try restarting all services to see if that makes a difference to dCache and PNFS.

[root@wn4 gcowan]# service dcache-core stop
...
Stopping srm-wn4Domain (pid=8042) 0 1 2 3 4 5 6 7 Done
Pid File (/opt/d-cache/config/lastPid.replica) doesn't contain valid PID
...
Stopping pnfsDomain (pid=7661) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Giving up : pnfsDomain might still be running
... 

So something strange is going on with pnfs, but it starts up OK and I can list the contents of the directories. dcache-core and dcache-pool also started up OK.

6. I noticed a problem in that the PNFS exports directory does not contain an entry for the IP address of wn3, which is required if wn3 is to be used as a GridFTP door. I would have expected YAIM to have set this up, since I specified it in site-info.def. A possible workaround is sketched below the listing.

[root@wn4 gcowan]# ls /pnfs/fs/admin/etc/exports/
0.0.0.0..0.0.0.0  127.0.0.1  trusted
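
A possible workaround (an assumption on my part, based on cloning the existing loopback entry rather than anything in the release notes) would be to copy the local export file for wn3's IP address:

cd /pnfs/fs/admin/etc/exports
cp 127.0.0.1 $WN3_IP    # $WN3_IP is a placeholder for the IP address of wn3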

Also, now the SRM cell is not online after the dCache restart. netstat shows nothing listening on 8443. I'll check the logs:

[root@wn4 log]# ls /opt/d-cache/log/

Nothing. All the logs are in /var/log due to the value of the relevant variable in dCacheSetup. This is a bit annoying: /var/log will get very messy due to the large number of dCache log files, especially if log rotation is used. OK, I realise why SRM wasn't coming online: it was because I had stopped postgres. Once postgres is started and dCache restarted, SRM comes online.
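
If the logs are wanted back under /opt/d-cache/log, the relevant line in /opt/d-cache/config/dCacheSetup would look something like the following (the parameter name logArea is my assumption from the dCache documentation; I have not changed it here):

logArea=/opt/d-cache/log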

7. It is good that YAIM automatically creates the VO pool groups, meaning that it should start to publish storage information via the information provider plugin once pools are added to the pool groups. All of the pnfs directory tags are set up as well. However, the VO-specific links that would direct a VO's transfers to its own pools have not been set up. The admin interface shows:

> psu ls link -l
default-link
 readPref  : 10
 cachePref : 10
 writePref : 10
 UGroups :
   any-store  (links=1;units=1)
   world-net  (links=1;units=1)

It would be good if there were corresponding links for each VO so that its transfers were mapped to its pools, e.g. for ATLAS:

psu create unit -store atlas:atlas@osm
psu create ugroup atlas
psu addto ugroup atlas atlas:atlas@osm
psu create link atlas-link world-net atlas
psu add link atlas-link atlas
psu set link atlas-link -readpref=10 -writepref=10 -cachepref=10


wn3 (pool node)

Installation and configuration of the pool node is not correct.

1. PNFS is installed and YAIM attempts to configure it, even though site-info.def specified that wn4 was the admin node. There is no entry in /etc/fstab to mount PNFS from the admin node, i.e. nothing like:

wn4.epcc.ed.ac.uk:/fs /pnfs/fs          nfs     hard,intr,rw,noac,auto 0 0

There was no epcc.ed.ac.uk symbolic link to fs/usr, and the symbolic link ftpBase was pointing to /pnfs/epcc.ed.ac.uk rather than /pnfs/fs. Also, pnfs was configured to start up at boot time.
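
Something like the following is needed to put the links right (the locations under /pnfs are assumptions based on the admin-node layout, so treat this as a sketch):

ln -s /pnfs/fs/usr /pnfs/epcc.ed.ac.uk             # missing domain link to fs/usr
rm /pnfs/ftpBase && ln -s /pnfs/fs /pnfs/ftpBase   # ftpBase should point at /pnfs/fs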

Currently having bother mounting pnfs on the pool node when the firewall is up on the admin node. I need to have a look at this, but interfering with firewall settings is probably not something that YAIM should be doing. OK, problem solved.

2. dcache-pool and dcache-core seem to be OK:

[root@wn3 etc]# service dcache-core start
Starting dcache services:
Starting gridftp-wn3Domain  6 5 4 3 2 1 0 Done (pid=1245)

[root@wn3 etc]# netstat -lntp |grep java
tcp        0      0 0.0.0.0:2811                0.0.0.0:*                   LISTEN      1245/java

This is good. /opt/d-cache/etc/node_config contains the correct entries:

ADMIN_NODE=wn4.epcc.ed.ac.uk
GSIDCAP=no
GRIDFTP=yes
SRM=no

3. Again, YAIM has set up 2 pools on this node when I only asked for 1. wn3_1 and wn3_2 correspond to /storage1 and /storage2, even though I only created the partition /storage2 to be used by dCache. I think the setup needs to be changed so that the hostname of the machine being configured is matched against the list of pool nodes in site-info.def, as sketched in point 2 for wn4 above.

4. Both the pool and the gridftp door appear in the web interface on the admin node, but initially there is a problem with the wn3 pools not coming online. It appears that this is again due to firewall issues between the two nodes. I can now copy files into the wn3 pools (checked by removing wn4 pools from the default group).

5. Firewall issues (again!) are stopping transfers to/from the wn3 pool (an srmGet transfers data directly from the pool node, not via the door). srmcp'ing (srmGet'ing) a file out of the dCache via this door hangs at this point:

debug: reading into data buffer 0xb75bd008, maximum length 131072

With the firewall off on the pool node, the transfer succeeds. Debugging shows that one of the default iptables rules is causing packets to be rejected. After removing this rule, all transfers succeed (see the sketch below).
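
For the record, a minimal sketch of the sort of rule that lets the transfers through, using the DCACHE_PORT_RANGE of 50000-52000 set in site-info.def above (the exact rule to insert or remove will depend on the default chain on the node):

iptables -I INPUT -p tcp --dport 50000:52000 -j ACCEPT   # allow incoming data connections to the pool's passive port range
service iptables save                                    # make the change persistent across reboots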