Edinburgh dCache troubleshooting
This page contains information about how to solve problems with the Edinburgh dCache installation.
dCache setup
This section gives some details of the dCache services that run on each node.
srm.epcc.ed.ac.uk
This can be termed the head node of the installation. It runs the following processes:
- SRM (running in tomcat)
- PNFS (the dCache namespace, /pnfs)
- postgresql (used by the SRM and PNFS)
- Web server running on port 2288
- plus various other things that we don't need to go into right now.
It should be noted that PNFS is a separate product from dCache, but it must be running for dCache to operate. PostgreSQL, in turn, must be running before PNFS can start. Assuming the configuration is correct, the following commands should successfully start up dCache on this node.
$ service postgresql start
$ service pnfs start
$ service dcache-core start
Obviously there are similar commands for stopping the processes.
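The ordering above matters: PostgreSQL before PNFS before the dCache core. As a sketch (the helper function is made up; the service names are the init scripts above), a script that starts the head-node services in dependency order and stops at the first failure might look like:

```shell
#!/bin/sh
# Hypothetical helper, not part of the dCache distribution: start the
# head-node services in dependency order, aborting on the first failure.
start_head_node() {
    for svc in postgresql pnfs dcache-core; do
        service "$svc" start || { echo "failed to start $svc" >&2; return 1; }
    done
}
```

Stopping would run the same list in reverse, with `service ... stop`.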
pool1.epcc.ed.ac.uk
This is a pool node which is connected via HBAs to the IBM disk at the ACF. It runs the following:
- dCache pool process for managing the disk pools
- GridFTP door
- dcap door (for local access from the WNs)
Since this machine runs a GridFTP door, /pnfs must be mounted (over NFS v2). This entry should appear in /etc/fstab:
srm.epcc.ed.ac.uk:/pnfsdoors /pnfs/epcc.ed.ac.uk nfs hard,intr,ro,bg,noac,auto 0 0
and the /pnfs directory should look like this:
$ ls -l /pnfs/
total 13
drwxr-xr-x  1 root bin   512 Jul 12  2006 epcc.ed.ac.uk
drwxr-xr-x  2 root root 4096 Jul 14  2006 fs
lrwxrwxrwx  1 root root   19 Oct 25 12:21 ftpBase -> /pnfs/epcc.ed.ac.uk
If pnfs is not mounted then that link will be broken. The relevant processes to start and stop are:
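That broken-link symptom is easy to script. A sketch (the function name is made up; the ftpBase path is the one shown above):

```shell
#!/bin/sh
# Hypothetical check: if the ftpBase symlink does not resolve, the NFS
# mount of /pnfs from srm.epcc.ed.ac.uk is probably missing.
pnfs_mounted() {
    # [ -e ] follows the symlink and fails when its target is absent
    [ -e "${1:-/pnfs/ftpBase}" ]
}
```

Usage would be something like `pnfs_mounted || mount /pnfs/epcc.ed.ac.uk`, relying on the fstab entry above.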
$ service dcache-core start
$ service dcache-pool start
The dcache-core service manages the door processes, while dcache-pool manages the pools. The list of pools configured to run on this node is found in /opt/d-cache/config/pool1.poollist. It has entries like:
pool1_01 /export/raid1-01//pool sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=pool1
pool1_02 /export/raid1-02//pool sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=pool1
pool1_03 /export/raid1-03//pool sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=pool1
This shows the mapping between each pool name and the filesystem on which its data resides. You should not change any of the other options in this file.
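To see that pool-to-filesystem mapping at a glance, something like the following works against the format above (a sketch; the function name is made up):

```shell
#!/bin/sh
# Print "pool -> filesystem" for each entry in a poollist file,
# i.e. the first two whitespace-separated fields of every line.
show_pools() {
    awk 'NF >= 2 { print $1, "->", $2 }' "$1"
}
```

For example, `show_pools /opt/d-cache/config/pool1.poollist`.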
pool2.epcc.ed.ac.uk
This is a second pool node, connected via a dual-port HBA to the University SAN. Everything that applies to pool1 also applies here. In addition, this node runs a:
- GSIDCap door (GSI-authenticated dcap)
If a problem occurs...
In the majority of cases the admin should simply restart the dCache services on the affected nodes. This fixes most problems (unless something more serious has happened). For example, if this page
http://srm.epcc.ed.ac.uk:2288/usageInfo
contains entries like this:
pool1_04 pool1Domain [99] Repository got lost
rather than the usual pool usage information, then dcache-pool should be restarted on pool1.epcc.ed.ac.uk, since pool1_04 runs on that node.
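That check can be automated with a simple filter over the usageInfo page (a sketch; the function name is made up, the URL is the one above):

```shell
#!/bin/sh
# Filter the usageInfo page text (on stdin) for pools whose repository
# has been lost; any output means the corresponding dcache-pool service
# should be restarted on the node hosting that pool.
find_lost_pools() {
    grep 'Repository got lost'
}
```

Typical usage would be `wget -q -O - http://srm.epcc.ed.ac.uk:2288/usageInfo | find_lost_pools`.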
Upgrade procedure
Whenever dCache is upgraded, it is essential that the script /opt/d-cache/install/install.sh is run, either by hand (as root) or via YAIM. This is necessary because dCache upgrades drop new jar files on top of existing ones and replace the *.batch files in /opt/d-cache/config/ with new versions. For minor upgrades, just running the script is recommended. Since the upgrade replaces the *.batch files, you should not change any of the options in them; any configuration changes should instead be made in /opt/d-cache/config/dCacheSetup, which is the main place to alter the configuration.
Caveat
There's always an exception, isn't there? There are some instances where the *.batch files do have to be changed, and for the Edinburgh dCache this happens in a couple of places. dCache has a habit of attaching the GridFTP, GSIDCap and DCap door processes to all available network interfaces. This is a problem for us since pool1 has both its public interface and an internal one for communicating with the disk controllers. As a result, the wrong IP can be used when dCache returns a TURL (transfer URL) to a client that is trying to read or write a file, e.g., instead of this
gsiftp://pool1.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/file1
this is returned
gsiftp://192.168.0.1:2811/pnfs/epcc.ed.ac.uk/data/dteam/file1
This problem can be fixed by adding a -listen line to the gridftpdoor.batch file, e.g.,
create dmg.cells.services.login.LoginManager GFTP-${thisHostname} \
        "${gsiFtpPortNumber} \
        -export \
        -listen=129.215.175.24 \
        diskCacheV111.doors.GsiFtpDoorV1 \
        ...
A similar change should be made in the gsidcapdoor.batch file, although there appears to be a problem with this at the moment and it does not work; this is why no GSIDCap door is currently running on pool1.
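One way to verify that the -listen change has taken effect is to look at which local address the GridFTP port is bound to. A sketch (the function name is made up; it filters `netstat -ltn`-style output, where the local address is the fourth column):

```shell
#!/bin/sh
# Given `netstat -ltn` output on stdin, print the local address(es)
# bound on the GridFTP port (2811). Seeing 0.0.0.0 or the internal
# 192.168.x address means the -listen fix is not in effect.
gridftp_bind_addrs() {
    awk '$4 ~ /:2811$/ { split($4, a, ":"); print a[1] }'
}
```

After the fix, `netstat -ltn | gridftp_bind_addrs` on pool1 should print only the public IP, 129.215.175.24.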