DCache SRM v2.2 Testing
This page keeps track of the issues that arose while testing dCache 1.8.0 SRMv2.2 endpoints as part of a testing programme coordinated by the WLCG deployment group at CERN. All flavours of SRM endpoint (dCache, CASTOR, DPM, StoRM, BeSTMan) are having their basic functionality tested via the s2 client and lcg_utils. Stress testing will also be done once the basic functionality and inter-operability tests have been completed.
A problem is marked SOLVED if the solution is now known.
Contents
- 1 Endpoints
- 2 Ping (SOLVED)
- 3 srmcp client
- 4 Space Reservation
- 5 Authorization with gPlazma (SOLVED)
- 6 File exists, but not found on pool by pnfs
- 7 globus-url-copy hanging (SOLVED)
- 8 Pool not starting up after upgrade to v1.8.0-3 (SOLVED)
- 9 Edinburgh dCache has been failing basic and use case tests (SOLVED)
- 10 SRM_INTERNAL_ERROR after srmMv
- 11 srmRmdir
- 12 srmMkdir
- 13 SQL error in release 1.8.0-11 (SOLVED)
Endpoints
GridPP has one dCache endpoint involved in the tests.
- httpg://wn3.epcc.ed.ac.uk:8443/srm/managerv2?SFN=/pnfs/epcc.ed.ac.uk/data/dteam - this is a single test machine, running all services and 2 pools of ~100GB.
The other endpoints can be found in the GSSD twiki page.
Ping (SOLVED)
Initially not working due to problems with Flavia's DN not being of the correct format.
Exception thrown by diskCacheV111.services.authorization.KPWDAuthorizationPlugin: Permission Denied: Cannot determine Username for DN /C=IT/O=INFN/OU=Personal Certificate/L=Pisa/CN=Flavia Donno/EMAILADDRESS=<email>" returnStatus.statusCode=SRM_AUTHORIZATION_FAILURE
The default dcache.kpwd contained 4 (yes, 4) different versions of Flavia's DN, for the different combinations of /Email, /EMAIL, /E etc. These are automatically generated by the grid-mapfile2dcache-kpwd utility for DNs that contain an email address. Unfortunately, none of these contained /EMAILADDRESS. Only when this variant was added were the tests successful. It seems that different versions of OpenSSL are still causing these problems. Since the dcache.kpwd file is automatically generated from the grid-mapfile by the grid-mapfile2dcache-kpwd script, I have disabled the script so that this endpoint can continue passing the S2 tests. According to Flavia:
It looks like there is a special script to populate the gridmapfile. It is present in the YAIM configurator for dCache 1.7. Apparently there is somebody working on the YAIM configurator for dCache 1.8 but it is still not ready.
The dCache developers will now add in the EMAILADDRESS variant.
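The DN mismatch described above can be illustrated with a short sketch. The helper below is hypothetical (it is not the grid-mapfile2dcache-kpwd script), the example DN is made up, and the set of spellings is just the ones named in the text. It expands one DN into every spelling of the email attribute so that each could be listed in dcache.kpwd.

```python
import re

# Spellings of the email attribute named in the text above. Whether real
# tools emit exactly this set is an assumption made for illustration.
EMAIL_KEYS = ("Email", "EMAIL", "E", "EMAILADDRESS")
_EMAIL_RE = re.compile(r"/(?:%s)=" % "|".join(EMAIL_KEYS))


def dn_variants(dn):
    """Return the DN once per known spelling of its email attribute."""
    if not _EMAIL_RE.search(dn):
        return [dn]  # no email attribute, nothing to expand
    return [_EMAIL_RE.sub("/%s=" % key, dn, count=1) for key in EMAIL_KEYS]


# Hypothetical DN, used only to show the expansion.
for variant in dn_variants("/C=XX/O=Example/CN=Some User/Email=user@example.org"):
    print(variant)
```

A kpwd file containing only the first three variants would still reject a client whose DN is presented with /EMAILADDRESS, which is the failure seen here.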
srmcp client
The dCache SRM used to select the transfer protocol incorrectly, on the basis of its own internal priorities, treating the array of protocols sent by the SRM client as an unordered set. The latest code treats the client's protocol list as ordered and prioritised. Unfortunately this exposed a defect in the current srmcp client: it sends the dcap protocol ahead of the gsiftp protocol in its list. The next release of the srmcp client will address this. In the meantime, please specify gsiftp as the first protocol in the list, for example with the option "-protocols=gsiftp,dcap". The corresponding server and client releases (1.7.0-35) are now available for download.
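The difference between the two selection rules can be sketched as follows. This is only an illustration of the behaviour described above, not the dCache code; the server-side priority list is an assumption.

```python
def select_unordered(client_protocols, server_priority):
    """Old behaviour: the client list is a set, the server's priority decides."""
    offered = set(client_protocols)
    for proto in server_priority:
        if proto in offered:
            return proto
    return None


def select_ordered(client_protocols, server_supported):
    """New behaviour: the first client protocol the server supports wins."""
    supported = set(server_supported)
    for proto in client_protocols:
        if proto in supported:
            return proto
    return None


server = ["gsiftp", "dcap"]  # assumed server-side priority, for illustration

print(select_unordered(["dcap", "gsiftp"], server))  # gsiftp
print(select_ordered(["dcap", "gsiftp"], server))    # dcap
print(select_ordered(["gsiftp", "dcap"], server))    # gsiftp
```

With the ordered rule, a client that lists dcap first gets dcap, which is why putting gsiftp first with -protocols=gsiftp,dcap works around the problem.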
Space Reservation
Reserving space of replica quality (temporary) on the Edinburgh dCache came back with a "NO_FREE_SPACE" status. Alex Sim asked whether he should request only custodial quality (permanent) space.
srmReserveSpace at Wed May 02 09:48:18 PDT 2007 ServicePath=srm://wn3.epcc.ed.ac.uk:8443/srm/managerv2 Status=SRM_NO_FREE_SPACE RetentionPolicy=Replica
REPLICA-ONLINE
I need some advice on how to set up the REPLICA-ONLINE (Tape0Disk1) storage class. In the default YAIM installation of v1.8 the SpaceManager is not turned on.
grep SpaceMan /opt/d-cache/config/dCacheSetup
# srmSpaceManagerEnabled=no
# SpaceManagerDefaultRetentionPolicy=CUSTODIAL
# SpaceManagerDefaultAccessLatency=NEARLINE
# SpaceManagerReserveSpaceForNonSRMTransfers=false
These were set to
srmSpaceManagerEnabled=yes
SpaceManagerDefaultRetentionPolicy=REPLICA
SpaceManagerDefaultAccessLatency=ONLINE
SpaceManagerReserveSpaceForNonSRMTransfers=false
and the SRM restarted. There should be some additional instructions in the dCache Book.
Update
Starting the SpaceManager appears to have broken some aspects of the SRM2.2 service in the dCache. Ever since making this change we have been failing certain tests with no space available messages.
08_AbortRequest: Executed PrepareToPut, requestToken: -2147463375
08_AbortRequest: srmPrepareToPut returnStatus: SRM_REQUEST_QUEUED
08_AbortRequest: Executing srmAbortRequest, requestToken=-2147463375
08_AbortRequest: srmAbortRequest returnStatus: SRM_SUCCESS
08_AbortRequest: srmStatusOfPutRequest returnStatus=SRM_FAILURE
08_AbortRequest: srmStatusOfPutRequest fileStatuses=SURL0=srm://wn3.epcc.ed.ac.uk:8443/srm/managerv2?SFN=/pnfs/epcc.ed.ac.uk/data/dteam/20070517-080637-11212-0.txt returnStatus.explanation0= at hu May 17 07:06:38 BST 2007 state Failed : no space available returnStatus.statusCode0=SRM_FAILURE fileSize0=2691 estimatedWaitTime0=1 remainingPinLifetime0=0
It is not clear what is going on here. Some information:
psu create linkGroup default-link-group
psu set linkGroup attribute default-link-group HSM=None
psu addto linkGroup default-link-group default-link
psu create linkGroup write-link-group
psu set linkGroup attribute write-link-group dteamRole=/dteam/NULL/production
psu set linkGroup attribute write-link-group dteamRole=/dteam/NULL/lcgadmin
psu set linkGroup attribute write-link-group HSM=None
psu set linkGroup attribute write-link-group VO=dteam
psu addto linkGroup write-link-group dteam-link
As you can see, I only have Replica-Online space (no HSM).
From the admin shell:
> psu ls -l linkGroup
default-link-group : [ default-link ]
    Attributes:
        HSM = None
write-link-group : [ dteam-link ]
    Attributes:
        dteamRole = /dteam/NULL/production /dteam/NULL/lcgadmin
        HSM = None
        VO = dteam
> psu ls -l pgroup dteam
 linkList :
   dteam-link (pref=20/20/-1/20;;ugroups=2;pools=1)
 poolList :
   wn3_1 (enabled=true;active=7;rdOnly=false;links=0;pgroups=2)
This is interesting, from the database:
dcache=# select * from srmlinkgroup;
 id |        name        | hsmtype | freespaceinbytes | lastupdatetime
----+--------------------+---------+------------------+----------------
  0 | default-link-group | None    |                0 |  1177429333736
  1 | write-link-group   | None    |     104152955222 |  1177435636041

dcache=# select * from spacemanagerpoolreservation;
 spacetoken | reservedspacesize | lockedspacesize | creationtime | lifetime | poolname | pnfspath | createdpnfsentry | utilized
------------+-------------------+-----------------+--------------+----------+----------+----------+------------------+----------
(0 rows)

dcache=# select * from srmlinkgroupvos;
 vogroup |         vorole         | linkgroupid
---------+------------------------+-------------
 *       | *                      |           0
 dteam   | /dteam/NULL/production |           1
 dteam   | /dteam/NULL/lcgadmin   |           1

dcache=# select * from srmretentionpolicy;
 id |   name
----+-----------
  2 | REPLICA
  1 | OUTPUT
  0 | CUSTODIAL

dcache=# select * from srmaccesslatency;
 id |   name
----+----------
  1 | ONLINE
  0 | NEARLINE
After increasing the printout level to 5 in srm.batch, I was able to grab some more information from the SRM log file (catalina.out). This bit of SQL is interesting:
WHERE UNUSEDLINKS.id =srmlinkgroupvos.linkGroupId
  AND UNUSEDLINKS.hsmType = 'None'
  AND ( srmlinkgroupvos.VOGroup = 'dteam001' OR srmlinkgroupvos.VOGroup = '*' )
  AND ( srmlinkgroupvos.VORole = OR srmlinkgroupvos.VORole = '*' )
From the above tables in the SRM database, the srmlinkgroupvos.vogroups are either * or dteam, but never dteam001 as occurs in the SQL above. Therefore,
AND ( srmlinkgroupvos.VOGroup = 'dteam001' OR srmlinkgroupvos.VOGroup = '*' )
causes the whole statement to return false, meaning that the SRM does not think there is any space available. It appears that there is some confusion over the names of the VOs used in OSG and EGEE (dteam001 vs dteam).
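The mismatch can be reproduced with a small in-memory sketch. The rows are copied from the srmlinkgroupvos listing above; the query is a simplified stand-in for the VOGroup/VORole part of the logged WHERE clause (the full statement is not shown in the log excerpt), and the empty role string is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE srmlinkgroupvos (vogroup TEXT, vorole TEXT, linkgroupid INTEGER)"
)
# Rows as shown in the srmlinkgroupvos table above.
conn.executemany(
    "INSERT INTO srmlinkgroupvos VALUES (?, ?, ?)",
    [
        ("*", "*", 0),
        ("dteam", "/dteam/NULL/production", 1),
        ("dteam", "/dteam/NULL/lcgadmin", 1),
    ],
)


def matching_link_groups(vo_group, vo_role):
    """Return the link group ids whose VO group and role match the request."""
    rows = conn.execute(
        "SELECT DISTINCT linkgroupid FROM srmlinkgroupvos "
        "WHERE (vogroup = ? OR vogroup = '*') "
        "AND (vorole = ? OR vorole = '*') "
        "ORDER BY linkgroupid",
        (vo_group, vo_role),
    ).fetchall()
    return [row[0] for row in rows]


print(matching_link_groups("dteam001", ""))                     # [0]
print(matching_link_groups("dteam", "/dteam/NULL/production"))  # [0, 1]
```

In this sketch a request arriving as dteam001 can only match the wildcard row of link group 0, which reports 0 bytes free in srmlinkgroup, and never reaches write-link-group (id 1), where the space actually is.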
CUSTODIAL-NEARLINE
To fully test the SRM2.2 functionality, it is necessary to configure a CUSTODIAL-NEARLINE storage class that will simulate a tape backend. This can be done in the dCache by using the Eurogate HSM simulator. In this case, additional link groups will have to be configured.
Link creation
SARA are reporting that it seems to be impossible to create a link that is not part of any link group, attach a dCache pool to it and write a file to that pool. In this case, you get a message like "No write pool configured". Is this a bug or should every link be in a link group? Transfers do not even work when using the SRMv1 interface.
SRM_NO_FREE_SPACE
Flavia was seeing this error consistently at a number of sites. The S2 client asks for Retention_Policy=REPLICA and gets CUSTODIAL in return with SRM_NO_FREE_SPACE.
I am able to use the srm-reserve-space client. As an example, if I ask for REPLICA and ONLINE:
# /opt/d-cache/srm/bin/srm-reserve-space -space_desc=gc-test-1 -retention_policy=REPLICA -access_latency=ONLINE -desired_size 1024 -guaranteed_size 1024 srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir
Space token =37651
Although I am slightly confused as to how I can reserve space in the dteam area when I am being mapped to an LHCb user in the dCache (I'm only using a grid-proxy, there are no VOMs extensions).
# /opt/d-cache/srm/bin/srm-get-space-metadata srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir 37651
Space Reservation with token=37651
owner:VoGroup=lhcb001 VoRole=
totalSize:1024
guaranteedSize:1024
unusedSize:1024
lifetimeAssigned:86400000
lifetimeLeft:500514
accessLatency:ONLINE
retentionPolicy:REPLICA
Here is the link group configuration:
(PoolManager) admin > psu ls -l linkGroup
atlas-link-group : [ atlas-link ]
    Attributes:
        atlasRole = /atlas/production
        HSM = None
        VO = atlas001
default-link-group : [ default-link ]
    Attributes:
        HSM = None
write-link-group : [ dteam-link ops-link ]
    Attributes:
        HSM = None
        VO = dteam001 ops001
I was also able to ask for NEARLINE space and it returned OK, even though we have no tape.
# /opt/d-cache/srm/bin/srm-reserve-space -space_desc=gc-test-1 -retention_policy=REPLICA -access_latency=NEARLINE -desired_size 1024 -guaranteed_size 1024 srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir
Space token =37652
# /opt/d-cache/srm/bin/srm-get-space-metadata srm://wn3.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir 37652
Space Reservation with token=37652
owner:VoGroup=lhcb001 VoRole=
totalSize:1024
guaranteedSize:1024
unusedSize:1024
lifetimeAssigned:86400000
lifetimeLeft:500634
accessLatency:NEARLINE
retentionPolicy:REPLICA
Hopefully Timur can shed some more light on this.
Status of reserved test changing over time
Thanks to the NDGF guys for this.
Link groups are defined in the pool selection unit of the pool manager. The space manager periodically queries the pool manager for the configured link groups and stores this information in a table in PostgreSQL (called 'srmlinkgroup'). That table contains a column called 'lastUpdateTime', which one may guess should be used to filter out old link groups that were removed from the pool selection unit configuration. However, lookups in this table do not filter on this column, and it appears to me that the field is not even updated in subsequent updates of the information.
Thus what happened in our case was that the table contained two link groups matching the criteria. The space manager then randomises the choice of which link group to use. One was no longer defined, and thus the pool manager would fail to make a match. The other would work fine. This explains why approximately half of all requests succeeded.
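The "about half of all requests" figure follows directly from a random choice between one live and one stale link group. A minimal simulation, with a hypothetical name for the stale group:

```python
import random

# Hypothetical state: two rows in srmlinkgroup match the request, but only
# one of them is still defined in the pool selection unit.
table_rows = ["old-link-group", "write-link-group"]
configured = {"write-link-group"}


def attempt_upload(rng):
    """Pick a link group at random; succeed only if it is still configured."""
    chosen = rng.choice(table_rows)
    return chosen in configured


rng = random.Random(0)  # fixed seed so the run is repeatable
results = [attempt_upload(rng) for _ in range(10000)]
success_rate = sum(results) / len(results)
print("success rate: %.2f" % success_rate)
```

Filtering the rows against the currently configured link groups before choosing would bring the success rate in this sketch back to 1.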
The stateful behaviour observed by Mattias, in which once a file with a particular name has failed with the above error message it keeps failing forever, is explained as follows: when a file is uploaded, a space reservation is recorded in another table called 'srmspacefile'. If the transfer fails, the entry is apparently not removed, and on the next attempt (with this particular file name) the link-group information from this table is used. Thus the upload will now keep failing.
The entry in the 'srmspacefile' table has a lifetime of 1 hour, as recorded in the lifetime column of that table. However, the lifetime column is not taken into account when checking whether an entry for a particular file name exists in the table. I have not been able to find a place where old entries are removed from the table, either, so it is certainly a possibility that the transfer will keep failing even after the entry has expired.
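The effect of ignoring the lifetime column can be sketched like this. The dictionary stands in for the 'srmspacefile' table; the creationtime field, the path and the link group name are assumptions made for the example.

```python
HOUR_MS = 60 * 60 * 1000


def find_entry(entries, pnfs_path, now_ms, honour_lifetime):
    """Return the recorded entry for pnfs_path, or None if there is none."""
    entry = entries.get(pnfs_path)
    if entry is None:
        return None
    expired = now_ms > entry["creationtime"] + entry["lifetime"]
    if honour_lifetime and expired:
        return None  # treat an expired entry as if it were not there
    return entry


path = "/pnfs/example.org/data/dteam/file.txt"  # hypothetical file name
entries = {
    path: {"linkgroup": "old-link-group", "creationtime": 0, "lifetime": HOUR_MS}
}

now = 2 * HOUR_MS  # two hours after the failed upload
stale = find_entry(entries, path, now, honour_lifetime=False)
fresh = find_entry(entries, path, now, honour_lifetime=True)
print(stale is not None, fresh is None)  # True True
```

With honour_lifetime=False the stale entry (and its dead link group) is still returned after it has expired, which matches the "keeps failing forever" behaviour described above.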
reserve space tool
The dCache team have provided this tool to allow admins to statically reserve space against a link group that they have already created. It is accessed through the SrmSpaceManager cell. At the moment, it reports that the wrong number of parameters is being passed to it. Also, it is not very clear what the -vog (VO group) and -vor (VO role) should be set to in each case. You can specify -1 for infinite space lifetime.
Authorization with gPlazma (SOLVED)
Information about configuring gPlazma can be found in the page Configuring gPlazma in dCache. For OSG sites who do not use VOMs, there is a saml plugin for gPlazma that will allow the dCache to use VOMs. EGEE sites should use the grid-vorolemap plugin to gPlazma in order to perform the mapping from DN/VOMs attributes to local user id. This configuration file has the form:
"*" "/dteam/Role=NULL/Capability=NULL" dteam001
"*" "/dteam" dteam001
Where the * refers to "all DNs". This means that all DNs with that VOMs role get mapped to dteam001. gPlazma uses the /etc/grid-security/storage-authzdb file to control which user accounts can access which parts of the namespace, e.g.,
authorize dteam001 read-write 18118 2688 / /pnfs/epcc.ed.ac.uk/data/dteam /pnfs/epcc.ed.ac.uk/data/dteam
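The mapping performed from the grid-vorolemap file can be sketched as below. This is a simplified illustration, not gPlazma's matching logic: it assumes exact string comparison on the role and treats "*" as matching any DN.

```python
import shlex

# The two example lines from the grid-vorolemap file shown above.
VOROLEMAP = '''
"*" "/dteam/Role=NULL/Capability=NULL" dteam001
"*" "/dteam" dteam001
'''


def parse_vorolemap(text):
    """Parse lines of the form: "<DN>" "<role>" <username>."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # comment syntax is an assumption
            continue
        dn, role, user = shlex.split(line)
        rules.append((dn, role, user))
    return rules


def map_user(rules, dn, role):
    """Return the username of the first matching rule, or None."""
    for rule_dn, rule_role, user in rules:
        if rule_dn in ("*", dn) and rule_role == role:
            return user
    return None


rules = parse_vorolemap(VOROLEMAP)
print(map_user(rules,
               "/C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan",
               "/dteam/Role=NULL/Capability=NULL"))  # dteam001
```

The returned username is then the key looked up in storage-authzdb to find the uid, gid and paths for that account.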
Upgrade to v1.8.0-3
Authorization stopped working again. dcachesrm-gplazma.policy changed after the upgrade. This meant that the grid-vorolemap plugin was not turned on, resulting in the old dcache.kpwd authorisation method being used. However, even after turning it back on again, I am still unable to copy into the dCache as a dteam user. I can use my lhcb proxy without any problems.
This is the output during an unsuccessful dteam transfer into the dCache (printout level is 3):
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 Desired Username not requested. Will attempt a mapping.
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 Requesting mapping for User with DN and role: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan/dteam/Role=NULL/Capability=NULL
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 A null record was received from the storage authorization service.
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 Grid VO Role Authorization Service plugin: Authorization denied for user
06/15 17:48:04,886 GPLAZMALiteVORoleAuthzPlugin: authRequestID 2050738840 with subject DN: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan and role /dteam/Role=NULL/Capability=NULL
06/15 17:48:04,886 KPWDAuthorizationPlugin: authRequestID 2050738840 Requesting mapping for User with DN: /C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan
06/15 17:48:04,886 KPWDAuthorizationPlugin: authRequestID 2050738840 dcache.kpwd service returned Username: lhcb001
You can see that the vorole plugin is failing, so gPlazma falls back onto the kpwd plugin, resulting in me being mapped to lhcb001, meaning I can't write into dteam. This is strange since I have explicitly defined the mapping from my DN to a user in the grid-vorolemap file.
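The fallback seen in the log can be modelled as a chain of plugins tried in order. The two functions below are stand-ins written for this sketch, not gPlazma classes; the first always returns None to mimic the "null record" in the log.

```python
def vorole_plugin(dn, role):
    """Stand-in for the grid-vorolemap plugin; mimics the failed lookup."""
    return None


def kpwd_plugin(dn, role):
    """Stand-in for dcache.kpwd, which maps on the DN alone."""
    kpwd = {"/C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan": "lhcb001"}
    return kpwd.get(dn)


def authorize(dn, role, plugins):
    """Try each plugin in turn and return (plugin name, username)."""
    for plugin in plugins:
        user = plugin(dn, role)
        if user is not None:
            return plugin.__name__, user
    return None, None


dn = "/C=UK/O=eScience/OU=Edinburgh/L=NeSC/CN=greig cowan"
role = "/dteam/Role=NULL/Capability=NULL"
print(authorize(dn, role, [vorole_plugin, kpwd_plugin]))
# ('kpwd_plugin', 'lhcb001')
```

Because the second plugin ignores the role, a dteam proxy ends up mapped to lhcb001 whenever the first plugin fails, exactly as in the log above.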
catalina.out contains only this during the unsuccessful transfer:
06/15 14:55:41 Cell(SRM-wn3@srm-wn3Domain) : PutFileRequest #: PutCallbacks error: user has no permission to write into path /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir
06/15 14:55:42 Cell(SRM-wn3@srm-wn3Domain) : PutRequestHandler error: copy request state changed to Failed
The source of this problem could be traced to an incorrect configuration of the /etc/grid-security/storage-authzdb file. There was a space missing after the first "/" in all lines. The configuration should be something like this:
authorize dteam001 read-write 18118 2688 / /pnfs/epcc.ed.ac.uk/data/dteam /
The last field is not used, so it's OK to just put a "/" there (instead of /pnfs/epcc.ed.ac.uk/data/dteam). After checking a v1.7.0 install of dCache and gPlazma, authorisation is working *without* the space after the first "/". It appears that something changed between these versions.
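A quick sanity check for this kind of mistake is to count the whitespace-separated fields on each authorize line. The sketch below assumes the eight-field layout of the example line above; the "bad" line is my reconstruction of the broken configuration, not a copy of the original file.

```python
EXPECTED_FIELDS = 8  # authorize, user, mode, uid, gid and three paths


def check_authzdb_line(line):
    """Return (ok, tokens) for one 'authorize' line of storage-authzdb."""
    tokens = line.split()
    ok = len(tokens) == EXPECTED_FIELDS and tokens[0] == "authorize"
    return ok, tokens


good = "authorize dteam001 read-write 18118 2688 / /pnfs/epcc.ed.ac.uk/data/dteam /"
# Reconstructed example of the missing space after the first "/".
bad = "authorize dteam001 read-write 18118 2688 //pnfs/epcc.ed.ac.uk/data/dteam /"

for line in (good, bad):
    ok, tokens = check_authzdb_line(line)
    print(ok, len(tokens))
```

When the space is missing, two path fields fuse into one and the line is a field short, which this check flags.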
File exists, but not found on pool by pnfs
05_StatusOfGetRequest: Executing srmPrepareToPut, putRequestToken=-2147472416
05_StatusOfGetRequest: fileRequests.expectedFileSize[{2691 }]
05_StatusOfGetRequest: desiredFileStorageType=PERMANENT
05_StatusOfGetRequest: srmPrepareToPut, returnStatus=SRM_REQUEST_QUEUED
05_StatusOfGetRequest: srmStatusOfPutRequest, returnStatus=SRM_SUCCESS
05_StatusOfGetRequest: srmStatusOfPutRequest, remainingTotalRequestTime=
05_StatusOfGetRequest: srmPutDone, fileStatuses=surl0=srm://dct00.usatlas.bnl.gov:8443/srm/managerv2?SFN=//pnfs/usatlas.bnl.gov/data/dteam/20070524-210113-28241-0.txt returnStatus.explanation0=Done returnStatus.statusCode0=SRM_SUCCESS
05_StatusOfGetRequest: Put cycle succeeded
05_StatusOfGetRequest: srmPrepareToGet, getRequestToken=-2147472414
Using the pnfs id assigned to this file in the SRM log, 000100000000000000095F48 (/pnfs/usatlas.bnl.gov/data/dteam/20070524-210113-28241-0.txt), it should be possible to locate the file on a particular pool:
(PnfsManager) admin > cacheinfoof 000100000000000000095F48
cacheinfoof 000100000000000000095F48
No pool was returned
However, the file does exist on the pool:
[root@dct00 data]# pwd
/data/data5/dcache_pool_5/pool/data
[root@dct00 data]# ls -l 000100000000000000095F48
-rw-r--r-- 1 root root 2691 May 24 15:01 000100000000000000095F48
Then, looking in the admin shell on the pool:
(dct00_5) admin > rep ls -l 000100000000000000095F48
rep ls -l 000100000000000000095F48
000100000000000000095F48 <C-------X--(0)[0]> 2691 si={myStore:STRING}
It is not clear why the admin shell cannot locate the file if it exists and the pool is online. The C in the above line indicates that the file is a cached copy. It is not clear what the X means.
globus-url-copy hanging (SOLVED)
globus-url-copy and srmcp (v1.23 and v1.25) are hanging at this point.
$ globus-url-copy -dbg file:/etc/group gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/`date +%s`
debug: starting to put gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110
debug: connecting to gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110
debug: response from gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110: 220 GSI FTP Door ready
... ...
debug: response from gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110: 227 OK (129,215,175,66,201,33)
debug: sending command: STOR /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110
debug: response from gsiftp://wn3.epcc.ed.ac.uk:2811/pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110: 150 Opening BINARY data connection for /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/1181216110
debug: data callback, no error, buffer 0xb74d2008, length 613, offset=0, eof=true
Empty files are appearing in the pnfs namespace:
# ls -l /pnfs/epcc.ed.ac.uk/data/dteam/greig_test_dir/
total 1
-rw-r--r-- 1 dteam001 dteam 613 Jun 4 10:37 1180949835
-rw-r--r-- 1 dteam001 dteam 0 Jun 7 10:52 1181209955
-rw-r--r-- 1 dteam001 dteam 0 Jun 7 10:57 1181210243
-rw-r--r-- 1 dteam001 dteam 0 Jun 7 10:59 1181210329
-rw-r--r-- 1 dteam001 dteam 0 Jun 7 12:36 1181216110
It is strange, but nothing much is appearing in the dCache log files. Why has there been nothing new in dCacheDomain.log for over a week?
# ls -lt /var/log/*Domain.log
-rw-r--r-- 1 root root 2525514 Jun 7 12:38 /var/log/gPlazma-wn3Domain.log
-rw-r--r-- 1 root root 310517 Jun 7 12:36 /var/log/gridftp-wn3Domain.log
-rw-r--r-- 1 root root 5488759 Jun 7 08:40 /var/log/pnfsDomain.log
-rw-r--r-- 1 root root 194297 Jun 7 06:00 /var/log/httpdDomain.log
-rw-r--r-- 1 root root 449473 Jun 4 21:29 /var/log/wn3Domain.log
-rw-r--r-- 1 root root 331112 Jun 4 17:26 /var/log/utilityDomain.log
-rw-r--r-- 1 root root 3381 Jun 4 10:36 /var/log/dcap-wn3Domain.log
-rw-r--r-- 1 root root 5567 Jun 4 10:27 /var/log/adminDoorDomain.log
-rw-r--r-- 1 root root 0 May 29 16:54 /var/log/gridftpdoorDomain.log
-rw-r--r-- 1 root root 161496 May 29 16:44 /var/log/dCacheDomain.log
-rw-r--r-- 1 root root 24969 May 29 16:42 /var/log/gsidcap-wn3Domain.log
-rw-r--r-- 1 root root 10216 May 29 16:41 /var/log/infoProviderDomain.log
-rw-r--r-- 1 root root 1015 May 29 16:41 /var/log/dirDomain.log
-rw-r--r-- 1 root root 1002 May 29 16:40 /var/log/lmDomain.log
-rw-r--r-- 1 root root 8642 May 29 10:33 /var/log/gPlazmaDomain.log
The gridftp log files suggest that when using SRMv2.2 to initiate the transfer, the pool is expected to connect on the address /0.0.0.0:50279. Clearly something is not quite right here. All services are running on the same node, so there should not be a problem with the different dCache components talking to each other (particularly since everything was working before). A few minutes after a restart of dcache-core and dcache-pool on the node, transfers started working again. It took a few minutes for the SpaceManager to calculate the freespaceinbytes:
dcache=# select * from srmlinkgroup;
 id |        name        | hsmtype | freespaceinbytes | lastupdatetime
----+--------------------+---------+------------------+----------------
  0 | default-link-group | None    |                0 |  1177429333736
  1 | write-link-group   | None    |                0 |  1177435636041
(2 rows)
This eventually changed to:
dcache=# select * from srmlinkgroup;
 id |        name        | hsmtype | freespaceinbytes | lastupdatetime
----+--------------------+---------+------------------+----------------
  0 | default-link-group | None    |                0 |  1177429333736
  1 | write-link-group   | None    |     208305910444 |  1177435636041
(2 rows)
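Note that lastupdatetime is identical in both listings even though freespaceinbytes changed. Assuming the column holds milliseconds since the Unix epoch (an assumption; the schema is not documented here), the values can be decoded with a few lines of Python:

```python
from datetime import datetime, timedelta, timezone


def from_epoch_ms(ms):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base + timedelta(milliseconds=millis)


# Values copied from the lastupdatetime column above.
for name, ms in [("default-link-group", 1177429333736),
                 ("write-link-group", 1177435636041)]:
    print(name, from_epoch_ms(ms).strftime("%Y-%m-%d %H:%M:%S"))
# default-link-group 2007-04-24 15:42:13
# write-link-group 2007-04-24 17:27:16
```

Under that assumption both timestamps date from 24 April 2007, weeks before these June queries, which is consistent with the earlier observation that the field is not refreshed on subsequent updates.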
When copying files in using SRMv2.2, it is necessary to specify the access latency and retention policy. Why are the defaults not being picked up?
[root@wn3 ~]# grep REPLICA /opt/d-cache/config/dCacheSetup
SpaceManagerDefaultRetentionPolicy=REPLICA
[root@wn3 ~]# grep ONLINE /opt/d-cache/config/dCacheSetup
SpaceManagerDefaultAccessLatency=ONLINE
It should also be noted that I have modified the link groups to use the different VO names:
dcache=# select * from srmlinkgroupvos;
 vogroup  | vorole | linkgroupid
----------+--------+-------------
 *        | *      |           0
 ops001   | *      |           1
 dteam001 | *      |           1
(3 rows)
This should mean that the SAM and S2 tests now pass.
Pool not starting up after upgrade to v1.8.0-3 (SOLVED)
After upgrading to v1.8.0-3 (using YAIM to perform the configuration) I still had to tweak things by hand, including turning the SpaceManager on in the dCacheSetup file, and reinstating the old PoolManager.conf file as the upgrade had wiped the definitions of the link groups etc. It should be noted that this information is still contained in the srmlinkgroup and srmlinkgroupvos tables in the dcache database. From the spec files which label the rpm files:
%config(noreplace) /opt/d-cache/config/PoolManager.conf
%config(noreplace) /opt/d-cache/etc/dcachesrm-gplazma.policy
So the loss of the conf files is due to me doing:
rpm -e dcache-server
rpm -i dcache-server
One good thing is that the server components now start immediately, rather than having the annoying countdown (although this persists when stopping services).
Unfortunately, one of my pools refuses to come up:
06/14 19:06:30 Cell(wn3_1@wn3Domain) : New Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,) - Invalid state [receiving.cient] for entry /part1/pool/control/000A0000000000000005EB20 - Missing or bad SI-file for 000A0000000000000005EB20 : SI-file not found
06/14 19:06:32 Cell(wn3_1@wn3Domain) : PnfsHandler : CacheException (10001) : Pnfs error : 000A0000000000000005EB20 - Invalid state [receiving.cient] for entry /part1/pool/control/000A0000000000000005EB20 - mark as bad non removeble missing entry : 000A0000000000000005EB20
06/14 19:06:32 Cell(wn3_1@wn3Domain) : Repository reported Throwable : java.lang.IllegalStateException: No state transition for files in error state
java.lang.IllegalStateException: No state transition for files in error state
    at org.dcache.pool.repository.entry.CacheRepositoryEntryState.setSticky(CacheRepositoryEntryState.java:109)
    at org.dcache.pool.repository.RepositoryEntryHealer.entryOf(RepositoryEntryHealer.java:137)
    at org.dcache.pool.repository.CacheRepositoryV3.runInventory(CacheRepositoryV3.java:430)
    at diskCacheV111.pools.MultiProtocolPoolV3$InventoryScanner.run(MultiProtocolPoolV3.java:659)
    at java.lang.Thread.run(Thread.java:595)
It doesn't like the state receiving.cient (should this be client?) in the control files. I've checked the pool, and there are hundreds of files in this state.
dCacheDomain.log is reporting various RoutingManager errors:
06/14 19:05:06 Cell(RoutingMgr@dCacheDomain) : update can't send update to RoutingMgr{uoid=<1181844306156:139>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
06/14 19:05:11 Cell(RoutingMgr@dCacheDomain) : Couldn't add wellknown route : java.lang.IllegalArgumentException: Duplicated Entry
06/14 19:05:30 Cell(RoutingMgr@dCacheDomain) : update can't send update to RoutingMgr{uoid=<1181844330825:141>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
06/14 19:10:26 Cell(RoutingMgr@dCacheDomain) : update can't send update to RoutingMgr{uoid=<1181844626958:179>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
The routing errors might be caused by two components with the same name, in two different domains, trying to start up and register.
Update
This problem was caused by the pool containing many files in a state where the dCache thought that they were unusable. An upgrade to 1.8.0-6 fixed this, as the dCache now ignores these error messages reported by the pool.
Edinburgh dCache has been failing basic and use case tests (SOLVED)
Not clear where the problem is from the S2 output. Need input from developers. Upgrade to 1.8.0-8 solved problems.
SRM_INTERNAL_ERROR after srmMv
srmMv https://ccsrmtestv2.in2p3.fr:8443/srm/managerv2 fromSURL=srm://ccsrmtestv2.in2p3.fr:8443/srm/managerv2?SFN=/pnfs/in2p3.fr/data/dteam/s-2/20070806-160003-28233-0.txt toSURL=srm://ccsrmtestv2.in2p3.fr:8443/srm/managerv2?SFN=/pnfs/in2p3.fr/data/dteam/s-2/20070806-160003-28233
It returns:
returnStatus.statusCode=SRM_SUCCESS
returnStatus.statusCode=SRM_INTERNAL_ERROR
The client creates a file and then tries to move it into a newly created directory. The file status code is SRM_SUCCESS but the request return code is SRM_INTERNAL_ERROR.
srmRmdir
The srmRmdir function returns SRM_FAILURE instead of SRM_NON_EMPTY_DIRECTORY at IN2P3 and BNL (2nd instance) when a directory is not empty:
srmRmdir https://ccsrmtestv2.in2p3.fr:8443/srm/managerv2 SURL=srm://ccsrmtestv2.in2p3.fr:8443/srm/managerv2?SFN=/pnfs/in2p3.fr/data/dteam/s-2/20070806-170626-30258
It returns:
returnStatus.explanation="/pnfs/in2p3.fr/data/dteam/s-2/20070806-170626-30258 Delete Failed : ///pnfs/in2p3.fr/data/dteam/s-2/20070806-170626-30258 PnfsDeleteEntryMessage return code=5 reason : java.lang.IllegalArgumentException: Directory 00010000000000000008B2F0 not empty"
returnStatus.statusCode=SRM_FAILURE
This problem was fixed at FNAL already.
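The behaviour the test expects can be sketched against a local filesystem. This is an illustration only, not the dCache implementation; the status names are those of the SRM v2.2 specification (where the non-empty case is spelled SRM_NON_EMPTY_DIRECTORY), and srm_rmdir_status is a hypothetical helper.

```python
import os
import tempfile


def srm_rmdir_status(path):
    """Return the status an srmRmdir-like call would be expected to report."""
    if not os.path.isdir(path):
        return "SRM_INVALID_PATH"
    if os.listdir(path):
        # Report the specific code rather than a generic SRM_FAILURE.
        return "SRM_NON_EMPTY_DIRECTORY"
    os.rmdir(path)
    return "SRM_SUCCESS"


with tempfile.TemporaryDirectory() as top:
    full = os.path.join(top, "full")
    empty = os.path.join(top, "empty")
    os.mkdir(full)
    os.mkdir(empty)
    open(os.path.join(full, "file.txt"), "w").close()
    print(srm_rmdir_status(full))   # SRM_NON_EMPTY_DIRECTORY
    print(srm_rmdir_status(empty))  # SRM_SUCCESS
```

The point is simply that the "directory not empty" condition is detected and mapped to its own status code before falling through to a generic failure.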
srmMkdir
There seems to be a problem when using lcg_cp to create new directories in dCache. Reported during LHCb testing.
SQL error in release 1.8.0-11 (SOLVED)
This resulted in "No write pools available" errors during srmcp requests. Fixed in patch version -13.