DCache Problems and Workarounds
Contents
How to find DN of a user who owns a file
It is useful to be able to find the DN of users who create and use files within your dCache. This can be done via an SQL query of the postgreSQL billing database. Create a script like that below and run it as root, where the single argument is either a file path (/pnfs/epcc.ed.ac.uk/data/dteam/filename) or a directory path (/pnfs/epcc.ed.ac.uk/data/dteam/dirname/). In the directory case, the DNs of all files in the directory are returned. Since the SQL query is on the doorinfo table, the script may output multiple entries if the file has been accessed by different users.
#! /bin/bash su - postgres psql billing -c "select datestamp, owner, mappeduid, mappedgid from doorinfo where path like '$1%' and errorcode='0'"
PNFSid to SURL mapping
If a pool node fails then it is useful to be able to determine which files were on it. The dCache postgreSQL database contains a list of pools with the PNFSids of the files that they contain. You can use the pathfinder tool within the PNFS namespace to map these IDs onto actual logical file names. User:Greig_cowan has written a small shell script that will automate this process for you and create a list of SURLs for each of the files. This list could then be presented to the VO managers, telling them precisely which files they have lost. The script is pretty basic and there are a couple of things to note:
- It must be run as the postgres user on the system to allow (non-privileged) access to the database.
- It assumes that the PNFS postgres database is running on the same machine as the SRM service (see fileRoot line).
- The SQL command checks that the PNFSid does not exist on any other pool in the dCache.
- It uses the native dCache pathfinder command. There is an alternative pathfinder command that has been written that creates a local cache for pnfsids. This works faster than the native pathfinder and probably puts less of a load on the PNFS database. The new pathfinder can be found here and can be used in place of the native version. If using the new version, then you will need to replace \(/root/fs/usr\) with \(/pnfs/fs/usr\) in the script.
- Make sure that you give the full path to the output files to ensure that you don't try to write into the PNFS namespace. I should change the script to check for that.
Obviously, some of the above can be customised to suit your needs. The script should be sufficient for GridPP Tier-2's running dCache.
#! /bin/bash # Run this as the postgres user. # Usage: ./script-name poolName pnfsids="/var/lib/pgsql/pnfsids.txt" psql companion -t -c "select pnfsid from cacheinfo where pnfsid in \ (select pnfsid from cacheinfo where pool='$1') \ group by pnfsid having count(pool)=1" > $pnfsids . /usr/etc/pnfsSetup export PATH=$PATH:$pnfs/tools fileRoot="srm://`hostname -f`:8443/pnfs/`hostname -d`" cd /pnfs/`hostname -d`/data for i in `cat $pnfsids` do filePath=`pathfinder $i | sed -n 's_\(/root/fs/usr\)\(\)_\2_gp'` #### <- \(/pnfs/fs/usr\) if [ ! "$filePath" = "" ] then echo $fileRoot$filePath fi done
Too many open files
If on the dCache pool usage web page, instead of the usual information about a pool's disk usage there is a message like
000200000000000DEADBEEF java.io.FileNotFoundException /path/to/pool/control/.SI-00020000000000DEADBEEF (Too many open files)
Then you should add
ulimit -n 16384
immediately after the first comments in the /opt/d-cache/bin/dcache-pool script , and restart the pool
Finding entries by time in srm database
The srm request databases stores their creation times in a bigint column, and the values equal the number of milliseconds after epoch that the request was created. However, this isn't a particularly useful format if you wish to find the requests for a particular day/week/month etc, for this the PostgreSQL timestamptz type would be perfect. Unfortunately there isn't a function in postgres to do the necessary conversion, so we have to create our own:
CREATE FUNCTION epoch_to_timestamptz ( INTEGER ) RETURNS timestamptz AS ' SELECT ''epoch''::timestamp WITH TIME ZONE + $1 * ''1 second''::INTERVAL; ' language 'sql' IMMUTABLE RETURNS NULL ON NULL INPUT ;
(Taken from PostgreSQL General Bits Issue 40) Note that this function will need to be recreated if you ever drop and recreate the dcache database
However this function only converts integers to timestamptz so we have to divide the creationtime by 1000 to scale it to seconds and cast it to the integer type, which gives a query like this
select epoch_to_timestamptz (CAST ((creationtime/1000) as INTEGER)) from copyfilerequests;
or a slightly more useful example to find requests on a given day with an error message :
select (epoch_to_timestamptz (CAST ((creationtime/1000) as INTEGER))) as foo,fromurl,tourl,errormessage from copyfilerequests_b where date_trunc('day', (epoch_to_timestamptz (CAST ((creationtime/1000) as INTEGER)))) = '2006-01-25 00:00:00+00' and errormessage notNULL;
Checking pool migration
Running the query below on the companion database outputs, for all files in a particular pnfs database and on a particular pool, those files which are only present on that pool and no other. This is a useful check after having performed a copy of files from one pool to another to check that there are no files left behind. Remove the substr clause to check all files on a pool.
select pnfsid, count(pool) from cacheinfo where pnfsid in ( select pnfsid from cacheinfo where pool = 'csfnfs51_3' and substr(pnfsid,1,4)='0003') group by pnfsid having count(pool) =1;