DCache SRM v2.2 Testing

From GridPP Wiki

This page keeps track of the issues that arose during the testing of dCache 1.8.0 SRMv2.2 endpoints, as part of a programme of testing coordinated by the WLCG deployment group at CERN. All flavours of SRM endpoint (dCache, CASTOR, DPM, StoRM, BeStMan) are having their basic functionality tested via the s2 client and lcg_utils. Stress testing will also be done once the basic functionality and inter-operability tests have been completed.

A problem is marked SOLVED if the solution is now known.

Endpoints

GridPP has one dCache endpoint involved in the tests.

  • httpg://wn3.epcc.ed.ac.uk:8443/srm/managerv2?SFN=/pnfs/epcc.ed.ac.uk/data/dteam - this is a single test machine, running all services and 2 pools of ~100GB.

The other endpoints can be found in the GSSD twiki page.

Ping (SOLVED)

Initially this did not work because Flavia's DN was not in the correct format.

Exception thrown by diskCacheV111.services.authorization.KPWDAuthorizationPlugin: Permission Denied: Cannot determine Username for 
DN /C=IT/O=INFN/OU=Personal  Certificate/L=Pisa/CN=Flavia Donno/EMAILADDRESS=<email>" returnStatus.statusCode=SRM_AUTHORIZATION_FAILURE

The default dcache.kpwd contained 4 (yes, 4) different versions of Flavia's DN, for the different combinations of /Email, /EMAIL, /E etc. These are automatically generated by the grid-mapfile2dcache-kpwd utility for DNs that contain an email address. Unfortunately, none of these contained /EMAILADDRESS. Only when this variant was added were the tests successful. It seems that different versions of OpenSSL are still causing these problems. Since the dcache.kpwd file is automatically generated from the grid-mapfile using the grid-mapfile2dcache-kpwd script, I have disabled the script so that this endpoint can continue passing the S2 tests. According to Flavia:

It looks like there is a special script to populate the gridmapfile.
It is present in the YAIM configurator for dCache 1.7.
Apparently there is somebody working on the YAIM configurator for dCache 1.8 but it is still not ready. 

The dCache developers will now add in the EMAILADDRESS variant.
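The kind of variant expansion that grid-mapfile2dcache-kpwd performs can be sketched roughly as follows (illustrative Python only, not the actual utility; the email address is a placeholder, since the real one is elided in the log above):

```python
# Illustrative sketch of how an email-bearing DN multiplies into several
# kpwd entries, one per spelling of the email attribute. The observed
# failure was that the /EMAILADDRESS spelling was missing from the set.
EMAIL_KEYS = ["Email", "EMAIL", "E", "EMAILADDRESS"]

def dn_email_variants(dn_prefix, email):
    """Return one DN string per known spelling of the email attribute."""
    return ["%s/%s=%s" % (dn_prefix, key, email) for key in EMAIL_KEYS]

# Placeholder email address (the real one is not shown in the log above).
variants = dn_email_variants(
    "/C=IT/O=INFN/OU=Personal Certificate/L=Pisa/CN=Flavia Donno",
    "user@example.org")
for v in variants:
    print(v)
```

Adding the /EMAILADDRESS spelling to the generated set is exactly the fix the developers agreed to make.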

srmcp client

dCache SRM used to select the transfer protocol incorrectly on the basis of its internal priorities, treating the array of protocols sent by the SRM client as an unordered set. The latest code has been changed to treat the client's protocol list as ordered and prioritised. Unfortunately, this exposed a defect in the current srmcp client: it sends the dcap protocol ahead of the gsiftp protocol in its list. The next release of the srmcp client will address this; in the meantime, please specify the gsiftp protocol first in the list, for example with the option "-protocols=gsiftp,dcap". The corresponding server and client releases (1.7.0-35) are now available for download.
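The behavioural change can be sketched as follows (illustrative Python, not dCache source; the server-side priority list is an assumption made for the example):

```python
# Hypothetical server-side preference, for illustration only.
SERVER_PRIORITY = ["gsiftp", "dcap"]

def pick_old(client_protocols):
    """Old behaviour: client list treated as an unordered set,
    server's own internal priority decides."""
    supported = set(client_protocols)
    for p in SERVER_PRIORITY:
        if p in supported:
            return p
    return None

def pick_new(client_protocols):
    """New behaviour: the first client protocol the server supports wins."""
    for p in client_protocols:
        if p in SERVER_PRIORITY:
            return p
    return None

# The current srmcp client sends dcap before gsiftp:
print(pick_old(["dcap", "gsiftp"]))  # gsiftp (server priority decided)
print(pick_new(["dcap", "gsiftp"]))  # dcap (client order now honoured)
```

This shows why the server-side fix exposed the client defect: once the client's ordering is honoured, a list that leads with dcap selects dcap, hence the advice to pass "-protocols=gsiftp,dcap".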

Space Reservation

Reserving space of replica quality (temporary) on the Edinburgh dCache came back with "NO_FREE_SPACE" status. Alex Sim asked whether he should request custodial-quality (permanent) space only.

srmReserveSpace at Wed May 02 09:48:18 PDT 2007
ServicePath=srm://wn3.epcc.ed.ac.uk:8443/srm/managerv2
Status=SRM_NO_FREE_SPACE
RetentionPolicy=Replica

REPLICA-ONLINE

I need some advice on how to set up the REPLICA-ONLINE (Tape0Disk1) storage class. In the default YAIM installation of v1.8, the SpaceManager is not turned on by default.

grep SpaceMan /opt/d-cache/config/dCacheSetup
# srmSpaceManagerEnabled=no
# SpaceManagerDefaultRetentionPolicy=CUSTODIAL
# SpaceManagerDefaultAccessLatency=NEARLINE
# SpaceManagerReserveSpaceForNonSRMTransfers=false

These were set to

srmSpaceManagerEnabled=yes
SpaceManagerDefaultRetentionPolicy=REPLICA
SpaceManagerDefaultAccessLatency=ONLINE
SpaceManagerReserveSpaceForNonSRMTransfers=false

and the SRM restarted. There should be some additional instructions in the dCache Book.

Update

Starting the SpaceManager appears to have broken some aspects of the SRM2.2 service in the dCache. Ever since making this change we have been failing certain tests with "no space available" messages.

08_AbortRequest: Executed PrepareToPut, requestToken: -2147463375
08_AbortRequest: srmPrepareToPut returnStatus: SRM_REQUEST_QUEUED
08_AbortRequest: Executing srmAbortRequest, requestToken=-2147463375
08_AbortRequest: srmAbortRequest returnStatus: SRM_SUCCESS
08_AbortRequest: srmStatusOfPutRequest returnStatus=SRM_FAILURE
08_AbortRequest: srmStatusOfPutRequest  fileStatuses=SURL0=srm://wn3.epcc.ed.ac.uk:8443/srm/managerv2?SFN=/pnfs/epcc.ed.ac.uk/data/dteam/20070517-080637-11212-0.txt returnStatus.explanation0= at Thu May 17 07:06:38 BST 2007 state Failed :  no space available
returnStatus.statusCode0=SRM_FAILURE fileSize0=2691 estimatedWaitTime0=1 remainingPinLifetime0=0

It is not clear what is going on here. Some information:

psu create linkGroup default-link-group
psu set linkGroup attribute default-link-group HSM=None
psu addto linkGroup default-link-group default-link
psu create linkGroup write-link-group
psu set linkGroup attribute write-link-group dteamRole=/dteam/NULL/production
psu set linkGroup attribute write-link-group dteamRole=/dteam/NULL/lcgadmin
psu set linkGroup attribute write-link-group HSM=None
psu set linkGroup attribute write-link-group VO=dteam
psu addto linkGroup write-link-group dteam-link

As you can see, I only have Replica-Online space (no HSM).

From the admin shell:

> psu ls -l linkGroup
default-link-group : [  default-link ]
    Attributes:
           HSM = None
 
write-link-group : [  dteam-link ]
    Attributes:
           dteamRole = /dteam/NULL/production /dteam/NULL/lcgadmin
           HSM = None
           VO = dteam
 
> psu ls -l pgroup
dteam
 linkList :
   dteam-link  (pref=20/20/-1/20;;ugroups=2;pools=1)
 poolList :
   wn3_1  (enabled=true;active=7;rdOnly=false;links=0;pgroups=2)

This is interesting, from the database:

dcache=# select * from srmlinkgroup;
 id |        name        | hsmtype | freespaceinbytes | lastupdatetime
----+--------------------+---------+------------------+----------------
  0 | default-link-group | None    |                0 |  1177429333736
  1 | write-link-group   | None    |     104152955222 |  1177435636041
 
dcache=# select * from spacemanagerpoolreservation;
 spacetoken | reservedspacesize | lockedspacesize | creationtime | lifetime | poolname | pnfspath | createdpnfsentry | utilized
------------+-------------------+-----------------+--------------+----------+----------+----------+------------------+----------
(0 rows)
 
dcache=# select * from srmlinkgroupvos;
 vogroup |         vorole         | linkgroupid
---------+------------------------+-------------
 *       | *                      |           0
 dteam   | /dteam/NULL/production |           1
 dteam   | /dteam/NULL/lcgadmin   |           1
 
dcache=# select * from srmretentionpolicy;
 id |   name
----+-----------
  2 | REPLICA
  1 | OUTPUT
  0 | CUSTODIAL

dcache=# select * from srmaccesslatency;
 id |   name
----+----------
  1 | ONLINE
  0 | NEARLINE 

After increasing the printout level to 5 in srm.batch, I was able to grab some more information from the SRM log file (catalina.out). This bit of SQL is interesting:

    WHERE UNUSEDLINKS.id =srmlinkgroupvos.linkGroupId
         AND UNUSEDLINKS.hsmType = 'None'
         AND ( srmlinkgroupvos.VOGroup = 'dteam001'
                OR srmlinkgroupvos.VOGroup = '*' )
         AND ( srmlinkgroupvos.VORole = 
                OR srmlinkgroupvos.VORole = '*' )

From the above tables in the SRM database, the srmlinkgroupvos.vogroup values are either * or dteam, but never dteam001 as appears in the SQL above. Therefore,

          AND ( srmlinkgroupvos.VOGroup = 'dteam001'
                  OR srmlinkgroupvos.VOGroup = '*' )

causes the whole statement to return false, meaning that the SRM does not think there is any space available. It appears that there is some confusion over the names of the VOs used in OSG and EGEE (dteam001 vs dteam).
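The effect of the mismatch can be illustrated with a small sketch of the matching predicate (illustrative Python, not dCache source), using the srmlinkgroupvos rows shown above:

```python
# Rows (vogroup, vorole, linkgroupid) as shown in the srmlinkgroupvos table.
LINKGROUP_VOS = [
    ("*",     "*",                      0),
    ("dteam", "/dteam/NULL/production", 1),
    ("dteam", "/dteam/NULL/lcgadmin",   1),
]

def matching_linkgroups(vogroup, vorole):
    """Mimic the SQL: a row matches if its vogroup is '*' or equals the
    requester's group, and likewise for the role."""
    return sorted({lgid for g, r, lgid in LINKGROUP_VOS
                   if g in (vogroup, "*") and r in (vorole, "*")})

# The request is authorised under the OSG-style name 'dteam001', so only
# the wildcard (default) link group matches -- hence "no free space":
print(matching_linkgroups("dteam001", ""))                     # [0]
# With the EGEE-style name 'dteam' the write link group would also match:
print(matching_linkgroups("dteam", "/dteam/NULL/production"))  # [0, 1]
```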

CUSTODIAL-NEARLINE

To fully test the SRM2.2 functionality, it is necessary to configure a CUSTODIAL-NEARLINE storage class that will simulate a tape backend. This can be done in the dCache by using the Eurogate HSM simulator. In this case, additional link groups will have to be configured.

Link creation

SARA report that it appears impossible to create a link that is not part of any link group, attach a dCache pool to it, and write a file to that pool. In this case, you get a message like "No write pool configured". Is this a bug, or should every link be in a link group? Transfers do not even work when using the SRMv1 interface.

SRM_NO_FREE_SPACE

Flavia was seeing this error consistently at a number of sites. The S2 client asks for Retention_Policy=REPLICA and gets CUSTODIAL in return with SRM_NO_FREE_SPACE.

I am able to use the srm-reserve-space client. As an example, if I ask for REPLICA and ONLINE:

# /opt/d-cache/srm/bin/srm-reserve-space -space_desc=gc-test-1 -retention_policy=REPLICA -access_latency=ONLINE -desired_size 1024 -guaranteed_size 1024 
srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir
Space token =37651

Although I am slightly confused as to how I can reserve space in the dteam area when I am being mapped to an LHCb user in the dCache (I am only using a grid-proxy; there are no VOMS extensions).

# /opt/d-cache/srm/bin/srm-get-space-metadata srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir 37651
Space Reservation with token=37651
                    owner:VoGroup=lhcb001 VoRole=
                totalSize:1024
           guaranteedSize:1024
               unusedSize:1024
         lifetimeAssigned:86400000
             lifetimeLeft:500514
            accessLatency:ONLINE
          retentionPolicy:REPLICA

Here is the link group configuration:

(PoolManager) admin > psu ls -l linkGroup
atlas-link-group : [  atlas-link ]
    Attributes:
           atlasRole = /atlas/production
           HSM = None
           VO = atlas001
default-link-group : [  default-link ]
    Attributes:
           HSM = None 
write-link-group : [  dteam-link ops-link ]
    Attributes:
           HSM = None
           VO = dteam001 ops001 

I was also able to ask for NEARLINE space and it returned OK, even though we have no tape.

# /opt/d-cache/srm/bin/srm-reserve-space -space_desc=gc-test-1 -retention_policy=REPLICA -access_latency=NEARLINE -desired_size 1024 -guaranteed_size 1024 
srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir
Space token =37652
# /opt/d-cache/srm/bin/srm-get-space-metadata srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir 37652
Space Reservation with token=37652
                   owner:VoGroup=lhcb001 VoRole=
               totalSize:1024
          guaranteedSize:1024
              unusedSize:1024
        lifetimeAssigned:86400000
            lifetimeLeft:500634
           accessLatency:NEARLINE
         retentionPolicy:REPLICA

Hopefully Timur can shed some more light on this.

Status of reserve tests changing over time

Thanks to the NDGF guys for this.

Link groups are defined in the pool selection unit of the pool manager. The space manager periodically queries the pool manager for the configured link groups and stores this information in a table in PostgreSQL (called 'srmlinkgroup'). That table contains a column called 'lastUpdateTime', which one may guess should be used to filter out old link groups that were removed from the pool selection unit configuration. However, lookups in this table do not filter on this column, and it appears to me that the field is not even updated in subsequent updates of the information.

Thus what happened in our case was that the table contained two link groups matching the criteria. The space manager randomises the choice of which link group to use. One was no longer defined, and so the pool manager would fail to make a match; the other would work fine. This explains why approximately half of all requests succeeded.

The stateful behaviour observed by Mattias, in which once a file with a particular name fails with the above error message it keeps failing forever, is explained as follows: when a file is uploaded, a space reservation is recorded in another table called 'srmspacefile'. If the transfer fails, the entry is apparently not removed, and on the next attempt (with this particular file name) the link-group information from this table is reused. Thus the upload will now keep failing.

The entry in the 'srmspacefile' table has a lifetime of 1 hour, as recorded in the lifetime column of that table. However, the lifetime column is not taken into account when checking whether an entry for a particular file name exists in the table. I have not been able to find a place where old entries are removed from the table, either, so it is certainly possible that the transfer will keep failing even after the entry has expired.
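A minimal sketch of the lookup problem (illustrative Python only; the table layout is simplified from the columns shown earlier, and time units are abstract):

```python
# Simplified stand-in for the 'srmspacefile' table: one stale entry left
# behind by a failed upload, with a one-hour lifetime as observed.
table = [{"pnfspath": "/pnfs/site/data/f1",
          "creationtime": 0,
          "lifetime": 3600}]

def entry_exists_buggy(entries, pnfspath, now):
    """Existence check as described above: the lifetime is ignored,
    so a stale entry matches forever."""
    return any(e["pnfspath"] == pnfspath for e in entries)

def entry_exists_fixed(entries, pnfspath, now):
    """A lifetime-aware check would also require the entry to be unexpired."""
    return any(e["pnfspath"] == pnfspath and
               now < e["creationtime"] + e["lifetime"]
               for e in entries)

now = 7200  # two hours later: the entry has expired
print(entry_exists_buggy(table, "/pnfs/site/data/f1", now))  # True
print(entry_exists_fixed(table, "/pnfs/site/data/f1", now))  # False
```

With the lifetime ignored, retries of the same file name keep hitting the stale reservation, which matches the "keeps failing forever" behaviour described above.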

reserve space tool

The dCache team have provided this tool to allow admins to statically reserve space against a link group that they have already created. It is accessed through the SrmSpaceManager cell. At the moment, it reports that the wrong number of parameters is being passed to it. It is also not very clear what -vog (VO group) and -vor (VO role) should be set to in each case. You can specify -1 for an infinite space lifetime.

Authorization with gPlazma (SOLVED)

Information about configuring gPlazma can be found in the page Configuring gPlazma in dCache. For OSG sites that do not use VOMS, there is a SAML plugin for gPlazma that will allow the dCache to use VOMS. EGEE sites should use the grid-vorolemap plugin for gPlazma to perform the mapping from DN/VOMS attributes to a local user id. This configuration file has the form:

"*" "/dteam/Role=NULL/Capability=NULL" dteam001
"*" "/dteam" dteam001

Here the * refers to "all DNs": all DNs with that VOMS role are mapped to dteam001. gPlazma uses the /etc/grid-security/storage-authzdb file to control which user accounts can access which parts of the namespace, e.g.,

authorize dteam001 read-write 18118 2688 / /pnfs/epcc.ed.ac.uk/data/dteam /pnfs/epcc.ed.ac.uk/data/dteam
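The two-step mapping described above can be sketched as follows (illustrative Python, not the gPlazma parser; only the grid-vorolemap step is modelled here):

```python
import shlex

# The grid-vorolemap lines shown above: ("DN pattern", "FQAN", username).
VOROLEMAP = '''
"*" "/dteam/Role=NULL/Capability=NULL" dteam001
"*" "/dteam" dteam001
'''

def map_user(dn, fqan):
    """Return the local username for a (DN, FQAN) pair, or None.
    A DN pattern of "*" matches any DN."""
    for line in VOROLEMAP.strip().splitlines():
        pattern_dn, pattern_fqan, user = shlex.split(line)
        if pattern_dn in ("*", dn) and pattern_fqan == fqan:
            return user
    return None

dn = "/C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan"
print(map_user(dn, "/dteam"))  # dteam001
```

The resulting username (dteam001 here) is then looked up in storage-authzdb, which supplies the uid/gid, access mode, and namespace roots for that account.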

Upgrade to v1.8.0-3

Authorization stopped working again. dcachesrm-gplazma.policy changed after the upgrade. This meant that the grid-vorolemap plugin was not turned on, resulting in the old dcache.kpwd authorisation method being used. However, even after turning it back on again, I am still unable to copy into the dCache as a dteam user. I can use my lhcb proxy without any problems.

This is the output during an unsuccessful dteam transfer into the dCache (printout level is 3):

06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 Desired Username not requested. Will attempt a mapping.
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 Requesting mapping for User with DN and role: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan/dteam/Role=NULL/Capability=NULL
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 A null record was received from the storage authorization service.
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 Grid VO Role Authorization Service plugin: Authorization denied for user
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 with subject DN: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan and role /dteam/Role=NULL/Capability=NULL
06/15 17:48:04,886 KPWDAuthorizationPlugin: authRequestID 2050738840 Requesting mapping for User with DN: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan
06/15 17:48:04,886 KPWDAuthorizationPlugin: authRequestID 2050738840 dcache.kpwd service returned Username: lhcb001

You can see that the vorole plugin is failing, so gPlazma falls back to the kpwd plugin; I am mapped to lhcb001 and therefore cannot write into dteam. This is strange, since I have explicitly defined the mapping from my DN to a user in the grid-vorolemap file.

catalina.out contains only this during the unsuccessful transfer:

06/15 14:55:41 Cell(SRM-wn3@srm-wn3Domain) : PutFileRequest #: PutCallbacks error: user has no permission to write into path  /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir
06/15 14:55:42 Cell(SRM-wn3@srm-wn3Domain) : PutRequestHandler error: copy request state changed to Failed

The source of this problem was traced to an incorrect configuration of the /etc/grid-security/storage-authzdb file. A space was missing after the first "/" in all lines. The configuration should be something like this:

authorize dteam001 read-write 18118 2688 / /pnfs/epcc.ed.ac.uk/data/dteam /

The last field is not used, so it is OK to just put a "/" there (instead of /pnfs/epcc.ed.ac.uk/data/dteam). After checking a v1.7.0 install of dCache and gPlazma, authorisation works *without* the space after the first "/". It appears that something changed between these versions.

File exists, but not found on pool by pnfs

05_StatusOfGetRequest: Executing srmPrepareToPut, putRequestToken=-2147472416
05_StatusOfGetRequest: fileRequests.expectedFileSize[{2691 }]
05_StatusOfGetRequest: desiredFileStorageType=PERMANENT
05_StatusOfGetRequest: srmPrepareToPut, returnStatus=SRM_REQUEST_QUEUED
05_StatusOfGetRequest: srmStatusOfPutRequest, returnStatus=SRM_SUCCESS
05_StatusOfGetRequest: srmStatusOfPutRequest, remainingTotalRequestTime=
05_StatusOfGetRequest: srmPutDone, fileStatuses=surl0=srm://dct00.usatlas.bnl.gov:8443/srm/managerv2?SFN=//pnfs/usatlas.bnl.gov/data/dteam/20070524-210113-28241-0.txt returnStatus.explanation0=Done returnStatus.statusCode0=SRM_SUCCESS
05_StatusOfGetRequest: Put cycle succeeded
05_StatusOfGetRequest: srmPrepareToGet, getRequestToken=-2147472414

By looking at the pnfs id assigned to this file, 000100000000000000095F48 (/pnfs/usatlas.bnl.gov/data/dteam/20070524-210113-28241-0.txt), in the SRM log, it should be possible to locate the file on a particular pool:

(PnfsManager) admin > cacheinfoof 000100000000000000095F48
cacheinfoof 000100000000000000095F48
No pool was returned

However, the file does exist on the pool:

[root@dct00 data]# pwd
/data/data5/dcache_pool_5/pool/data
[root@dct00 data]# ls -l 000100000000000000095F48
-rw-r--r--  1 root root 2691 May 24 15:01 000100000000000000095F48

Then, looking in the admin shell on the pool:

(dct00_5) admin > rep ls -l 000100000000000000095F48
rep ls -l 000100000000000000095F48
000100000000000000095F48 <C-------X--(0)[0]> 2691 si={myStore:STRING}

It is not clear why the admin shell cannot locate the file if it exists and the pool is online. The C in the above line indicates that the file is a cached copy. It is not clear what the X means.

globus-url-copy hanging (SOLVED)

globus-url-copy and srmcp (v1.23 and v1.25) are hanging at this point.

$ globus-url-copy -dbg file:/etc/group gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/`date +%s`
debug: starting to put gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110
debug: connecting to gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110
debug: response from gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110:
220 GSI FTP Door ready
...
...
debug: response from gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110:
227 OK (129,215,175,66,201,33)

debug: sending command:
STOR /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110 

debug: response from gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110:
150 Opening BINARY data connection for /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110

debug: data callback, no error, buffer 0xb74d2008, length 613, offset=0, eof=true

Empty files are appearing in the pnfs namespace:

# ls -l /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/
total 1
-rw-r--r--  1 dteam001 dteam 613 Jun  4 10:37 1180949835
-rw-r--r--  1 dteam001 dteam   0 Jun  7 10:52 1181209955
-rw-r--r--  1 dteam001 dteam   0 Jun  7 10:57 1181210243
-rw-r--r--  1 dteam001 dteam   0 Jun  7 10:59 1181210329
-rw-r--r--  1 dteam001 dteam   0 Jun  7 12:36 1181216110

It is strange, but nothing much is appearing in the dCache log files. Why has nothing new appeared in dCacheDomain.log for over a week?

# ls -lt /var/log/*Domain.log
-rw-r--r--  1 root root 2525514 Jun  7 12:38 /var/log/gPlazma-wn3Domain.log
-rw-r--r--  1 root root  310517 Jun  7 12:36 /var/log/gridftp-wn3Domain.log
-rw-r--r--  1 root root 5488759 Jun  7 08:40 /var/log/pnfsDomain.log
-rw-r--r--  1 root root  194297 Jun  7 06:00 /var/log/httpdDomain.log
-rw-r--r--  1 root root  449473 Jun  4 21:29 /var/log/wn3Domain.log
-rw-r--r--  1 root root  331112 Jun  4 17:26 /var/log/utilityDomain.log
-rw-r--r--  1 root root    3381 Jun  4 10:36 /var/log/dcap-wn3Domain.log
-rw-r--r--  1 root root    5567 Jun  4 10:27 /var/log/adminDoorDomain.log
-rw-r--r--  1 root root       0 May 29 16:54 /var/log/gridftpdoorDomain.log
-rw-r--r--  1 root root  161496 May 29 16:44 /var/log/dCacheDomain.log
-rw-r--r--  1 root root   24969 May 29 16:42 /var/log/gsidcap-wn3Domain.log
-rw-r--r--  1 root root   10216 May 29 16:41 /var/log/infoProviderDomain.log
-rw-r--r--  1 root root    1015 May 29 16:41 /var/log/dirDomain.log
-rw-r--r--  1 root root    1002 May 29 16:40 /var/log/lmDomain.log
-rw-r--r--  1 root root    8642 May 29 10:33 /var/log/gPlazmaDomain.log

The gridftp log files suggest that when using SRMv2.2 to initiate the transfer, the pool is expected to connect on the address /0.0.0.0:50279. Clearly something is not quite right here. All services are running on the same node, so there should not be a problem with the different dCache components talking to each other (particularly since everything was working before). A few minutes after a restart of dcache-core and dcache-pool on the node, transfers started working again. It took a few minutes for the SpaceManager to calculate the freespaceinbytes:

dcache=# select * from srmlinkgroup;
 id |        name        | hsmtype | freespaceinbytes | lastupdatetime
----+--------------------+---------+------------------+----------------
  0 | default-link-group | None    |                0 |  1177429333736
  1 | write-link-group   | None    |                0 |  1177435636041
(2 rows)

This eventually changed to:

dcache=# select * from srmlinkgroup; 
 id |        name        | hsmtype | freespaceinbytes | lastupdatetime
----+--------------------+---------+------------------+----------------
  0 | default-link-group | None    |                0 |  1177429333736
  1 | write-link-group   | None    |     208305910444 |  1177435636041
(2 rows)

When copying files in using SRMv2.2, it is necessary to specify the access latency and retention policy. Why are the defaults not being picked up?

[root@wn3 ~]# grep REPLICA /opt/d-cache/config/dCacheSetup
SpaceManagerDefaultRetentionPolicy=REPLICA
[root@wn3 ~]# grep ONLINE /opt/d-cache/config/dCacheSetup
SpaceManagerDefaultAccessLatency=ONLINE

It should also be noted that I have modified the link groups to use the different VO names:

dcache=# select * from srmlinkgroupvos;
 vogroup  | vorole | linkgroupid
----------+--------+-------------
 *        | *      |           0
 ops001   | *      |           1
 dteam001 | *      |           1
(3 rows)

This means that the SAM and S2 tests should now pass.

Pool not starting up after upgrade to v1.8.0-3 (SOLVED)

After upgrading to v1.8.0-3 (using YAIM to perform the configuration) I still had to tweak things by hand, including turning the SpaceManager on in the dCacheSetup file and reinstating the old PoolManager.conf file, as the upgrade had wiped the definitions of the link groups etc. It should be noted that this information is still contained in the srmlinkgroup and srmlinkgroupvos tables in the dcache database. From the spec files used to build the RPMs:

%config(noreplace) /opt/d-cache/config/PoolManager.conf
%config(noreplace) /opt/d-cache/etc/dcachesrm-gplazma.policy

So the loss of the conf files is due to me doing:

rpm -e dcache-server
rpm -i dcache-server

One good thing is that the server components now start immediately, rather than after the annoying countdown (although this persists when stopping services).

Unfortunately, one of my pools refuses to come up:

06/14 19:06:30 Cell(wn3_1@wn3Domain) : New Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,)
- Invalid state [receiving.cient] for entry /part1/pool/control/000A0000000000000005EB20
- Missing or bad SI-file for 000A0000000000000005EB20 : SI-file not found
06/14 19:06:32 Cell(wn3_1@wn3Domain) : PnfsHandler : CacheException (10001) : Pnfs error : 000A0000000000000005EB20
- Invalid state [receiving.cient] for entry /part1/pool/control/000A0000000000000005EB20
- mark as bad non removeble missing entry : 000A0000000000000005EB20
06/14 19:06:32 Cell(wn3_1@wn3Domain) : Repository reported Throwable : java.lang.IllegalStateException: No state transition for files in error state
java.lang.IllegalStateException: No state transition for files in error state
         at org.dcache.pool.repository.entry.CacheRepositoryEntryState.setSticky(CacheRepositoryEntryState.java:109)
         at org.dcache.pool.repository.RepositoryEntryHealer.entryOf(RepositoryEntryHealer.java:137)
         at org.dcache.pool.repository.CacheRepositoryV3.runInventory(CacheRepositoryV3.java:430)
         at diskCacheV111.pools.MultiProtocolPoolV3$InventoryScanner.run(MultiProtocolPoolV3.java:659)
         at java.lang.Thread.run(Thread.java:595)

It doesn't like the state receiving.cient (should this be client?) in the control files. I've checked the pool, and there are hundreds of files in this state.

dCacheDomain.log is reporting various RoutingManager errors:

06/14 19:05:06 Cell(RoutingMgr@dCacheDomain) : update can't send update  to RoutingMgr{uoid=<1181844306156:139>;path=[>RoutingMgr@local];msg=Missing routing entry for   RoutingMgr@local}
06/14 19:05:11 Cell(RoutingMgr@dCacheDomain) : Couldn't add wellknown route : java.lang.IllegalArgumentException: Duplicated Entry
06/14 19:05:30 Cell(RoutingMgr@dCacheDomain) : update can't send update  to RoutingMgr{uoid=<1181844330825:141>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
06/14 19:10:26 Cell(RoutingMgr@dCacheDomain) : update can't send update  to RoutingMgr{uoid=<1181844626958:179>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}

The routing errors might be caused by two components with the same name, in two different domains, trying to start up and register.

Update

This problem was caused by the pool containing many files in a state in which dCache thought they were unusable. An upgrade to 1.8.0-6 fixed this, as dCache now ignores these error messages reported by the pool.

Edinburgh dCache has been failing basic and use case tests (SOLVED)

It was not clear from the S2 output where the problem lay; input from the developers was needed. An upgrade to 1.8.0-8 solved the problems.

SRM_INTERNAL_ERROR after srmMv

srmMv https://ccsrmtestv2.in2p3.fr:8443/srm/managerv2 fromSURL=srm://ccsrmtestv2.in2p3.fr:8443/srm/managerv2?SFN=/pnfs/in2p3.fr/data/dteam/s-2/20070806-160003-28233-0.txt toSURL=srm://ccsrmtestv2.in2p3.fr:8443/srm/managerv2?SFN=/pnfs/in2p3.fr/data/dteam/s-2/20070806-160003-28233

It returns:

returnStatus.statusCode=SRM_SUCCESS returnStatus.statusCode=SRM_INTERNAL_ERROR

The client creates a file and then tries to move it into a just-created directory. The file status code is SUCCESS, but the request return code is SRM_INTERNAL_ERROR.

srmRmdir

The srmRmdir function returns SRM_FAILURE instead of SRM_NOT_EMPTY_DIRECTORY at IN2P3 and BNL (2nd instance) when a directory is not empty:

srmRmdir https://ccsrmtestv2.in2p3.fr:8443/srm/managerv2 SURL=srm://ccsrmtestv2.in2p3.fr:8443/srm/managerv2?SFN=/pnfs/in2p3.fr/data/dteam/s-2/20070806-170626-30258

It returns:

returnStatus.explanation="/pnfs/in2p3.fr/data/dteam/s-2/20070806-170626-30258 Delete Failed : ///pnfs/in2p3.fr/data/dteam/s-2/20070806-170626-30258 PnfsDeleteEntryMessage return code=5 reason : java.lang.IllegalArgumentException: Directory 00010000000000000008B2F0 not empty" returnStatus.statusCode=SRM_FAILURE

This problem was fixed at FNAL already.

srmMkdir

There seems to be a problem when using lcg_cp to create new directories in dCache. Reported during LHCb testing.

SQL error in release 1.8.0-11 (SOLVED)

This resulted in "No write pools available" errors during srmcp requests. Fixed in patch version -13.