Edinburgh dCache troubleshooting

This page contains information about how to solve problems with the Edinburgh dCache installation.

dCache setup

This section gives some details of the dCache services that run on each node.

srm.epcc.ed.ac.uk

This can be termed the head node of the installation. It runs the following processes:

  • SRM (running in tomcat)
  • PNFS (the dCache namespace, /pnfs)
  • postgresql (used by the SRM and PNFS)
  • Web server running on port 2288
  • plus various other things that we don't need to go into right now.

Note that PNFS is a separate product from dCache, but it must be running for dCache to operate, and PostgreSQL must be running before PNFS can start. Assuming all configuration is correct, the following commands should successfully start up dCache on this node:

$ service postgresql start
$ service pnfs start
$ service dcache-core start
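
Once these are running, one quick sanity check is to confirm that PostgreSQL is up and that the web server on port 2288 (mentioned above) is answering, e.g.:

$ service postgresql status
$ curl -sI http://srm.epcc.ed.ac.uk:2288/ | head -1

If the curl command does not return an HTTP status line, the dcache-core services are probably not up yet.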

Obviously there are similar commands for stopping the processes.
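
For example, a clean shutdown of this node would stop the services in the reverse order, using the same init scripts:

$ service dcache-core stop
$ service pnfs stop
$ service postgresql stop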

pool1.epcc.ed.ac.uk

This is a pool node which is connected via HBAs to the IBM disk at the ACF. It runs the following:

  • dCache pool process for managing the disk pools
  • GridFTP door
  • dcap door (for local access from the WNs)

Since this machine runs a GridFTP door, /pnfs must be mounted (over NFS v2). This entry should appear in /etc/fstab:

srm.epcc.ed.ac.uk:/pnfsdoors /pnfs/epcc.ed.ac.uk nfs hard,intr,ro,bg,noac,auto 0 0

and the /pnfs directory should look like this:

$ ls -l /pnfs/
total 13
drwxr-xr-x  1 root bin   512 Jul 12  2006 epcc.ed.ac.uk
drwxr-xr-x  2 root root 4096 Jul 14  2006 fs
lrwxrwxrwx  1 root root   19 Oct 25 12:21 ftpBase -> /pnfs/epcc.ed.ac.uk
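
A quick way to check that pnfs is mounted, and to remount it by hand from the fstab entry above if it has disappeared (for example after a reboot), is something like:

$ mount | grep pnfs
$ mount /pnfs/epcc.ed.ac.uk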

If pnfs is not mounted then the ftpBase link will be broken. The relevant services to start and stop on this node are:

$ service dcache-core start
$ service dcache-pool start

The core services deal with the door processes, while the pool services deal with the pool (obviously). The list of configured pools that run on this node is found in /opt/d-cache/config/pool1.poollist. It has entries like:

pool1_01  /export/raid1-01//pool  sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=pool1
pool1_02  /export/raid1-02//pool  sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=pool1
pool1_03  /export/raid1-03//pool  sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=pool1

This clearly shows the mapping between each pool name and the filesystem on which its data resides. You should not change any of the other options in this file.
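
Before starting dcache-pool it is also worth checking that each filesystem named in the poollist is actually mounted, e.g. for the first pool:

$ df -h /export/raid1-01/pool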

pool2.epcc.ed.ac.uk

This is a second pool node, connected via a dual-port HBA to the University SAN. Everything that applies to pool1 also applies here. In addition, this node runs a

  • GSIDCap door (GSI-authenticated dcap)

If a problem occurs...

In the majority of cases the admin should simply restart the dCache services on the affected nodes. This generally fixes problems (unless something really serious happens). For example, if this page

http://srm.epcc.ed.ac.uk:2288/usageInfo

contains entries like this:

pool1_04 	pool1Domain 	[99] 	Repository got lost

rather than the usual pool usage information, then the dcache-pool service should be restarted on pool1.epcc.ed.ac.uk, since the entry refers to pool1_04, which runs on that node.
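
In that case, log on to pool1 as root and bounce the pool service:

$ service dcache-pool stop
$ service dcache-pool start

(or service dcache-pool restart, if the init script supports it).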

Upgrade procedure

Whenever dCache is upgraded, it is essential that the script /opt/d-cache/install/install.sh is run. This can be done by hand (as root) or using YAIM. It is essential because a dCache upgrade drops new jar files on top of the existing ones and replaces the *.batch files in /opt/d-cache/config/ with new versions. For minor upgrades, I would recommend just running the script. Since the upgrade replaces the *.batch files, you should not change any of the options in them; any configuration changes should be made in the /opt/d-cache/config/dCacheSetup file, which is the main place to alter the configuration.
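
For reference, running the script by hand on the head node looks something like this (as root, using the standard install location above):

$ /opt/d-cache/install/install.sh

The dCache services on each node will then need to be restarted to pick up the new version.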

Caveat

There's always an exception, isn't there? There are some instances where the *.batch files do have to be changed. For the Edinburgh dCache this has to happen in a couple of places. dCache has a habit of attaching the GridFTP, GSIDCap and DCap door processes to all available network interfaces. This is a problem for us since pool1 has its public interface and an internal one for communicating with the disk controllers. This can lead to the wrong IP address being used when dCache returns a TURL (transfer URL) to a client that is trying to read or write a file, e.g., instead of this

gsiftp://pool1.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/file1

this is returned

gsiftp://192.168.0.1:2811/pnfs/epcc.ed.ac.uk/data/dteam/file1

This problem can be fixed by adding a -listen line to the gridftpdoor.batch file, e.g.,

create dmg.cells.services.login.LoginManager GFTP-${thisHostname} \
            "${gsiFtpPortNumber} \
             -export \
             -listen=129.215.175.24 \
             diskCacheV111.doors.GsiFtpDoorV1 \
...

Similarly for the gsidcapdoor.batch file (although there appears to be a problem with this at the moment and it does not work, which is why there is no GSIDCap door running on pool1).
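
After editing any of these *.batch files, the relevant door has to be restarted before the change takes effect; since the doors are handled by the core services (see above), something like this on pool1 should do it:

$ service dcache-core stop
$ service dcache-core start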