Random dCache failures in SAM

From GridPP Wiki

As most sites will know, SAM occasionally reports random failures. The cause is often unclear (poor error messages do not help), and the "problem" does not recur in the next test. This is one reason why a BDII-independent SRM test would help: it would decouple problems with the information system from problems with the dCache itself. This page is still being added to.
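Because the same failures turn up at several sites, it can help to sort SAM output by signature before investigating further. The following is a minimal Python sketch that maps a log excerpt to the matching section of this page. The function name `classify_sam_failure` and the signature table are illustrative assumptions, built only from the error strings quoted in the sections below, and the list is not exhaustive.

```python
# Each entry pairs a substring seen in SAM output with the section of this
# page that discusses it. The table is illustrative and not exhaustive.
# "Communication error on send" is deliberately left out because it appears
# in several unrelated failures.
SIGNATURES = [
    ("Message to gPlazma timed out", "gPlazma timed out"),
    ("InvocationTargetException", "InvocationTargetException"),
    ("Error Delete failed", "Communication error on send / Error Delete failed"),
    ("No valid credential found", "No valid credential found"),
    ("GSS Major Status: Authentication Failed",
     "CGSI-gSOAP: GSS Major Status: Authentication Failed"),
    ("GSS Major Status: General failure",
     "CGSI-gSOAP: GSS Major Status: General failure"),
    ("Name server not active", "Name server not active"),
    ("gethostbyname_r failed", "gethostbyname"),
    ("Timeout when executing test", "Timeout when executing test"),
]


def classify_sam_failure(log_text):
    """Return the section title for the first known signature, or None."""
    for needle, section in SIGNATURES:
        if needle in log_text:
            return section
    return None
```

Anything that returns `None` is a candidate for a new section on this page.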

No valid credential found

User credentials or host credentials? We reckon this is a problem with the LFC, but how can we check?

+ lcg-cp -v --vo ops lfn:SRM-put-gfe02.hep.ph.ic.ac.uk-1185595194 file:/home/samops/.same/SRM/nodes/gfe02.hep.ph.ic.ac.uk/testFile.txt
send2nsd: NS002 - send error : No valid credential found
Bad credentials
lcg_cp: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops

Timeout when executing test ??? after 600 seconds!

Does anyone know what is timing out? Is this a network problem, or is the server too busy?

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185947904 -d heplnx204.pp.rl.ac.uk
Timeout when executing test SRM-put after 600 seconds!
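One way to start separating the two possibilities is to time a plain TCP connection to the storage element from the machine that runs the test. The sketch below is only a rough probe under stated assumptions: the helper name `tcp_connect_time` is made up for this page, port 8443 is assumed to be the SRM port (check your site's configuration), and 2811 is the GridFTP port seen in the transfer URLs elsewhere on this page. A quick connect does not prove the service is healthy. It only suggests that the network path is open and that the delay lies further up the stack.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    None covers DNS failures, refused connections and timeouts alike, since
    all of them are raised as OSError subclasses by the socket module.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


if __name__ == "__main__":
    # Host taken from the log above; the SRM port number is an assumption.
    for port in (8443, 2811):
        print(port, tcp_connect_time("heplnx204.pp.rl.ac.uk", port))
```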

Communication error on send / Error Delete failed

Internal dCache communication problem perhaps? Or is it a problem communicating with the client?

+ lcg-del -v --vo ops -a lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185982360
java.rmi.RemoteException: srm advisoryDelete failed; nested exception is: 
	java.lang.RuntimeException:  advisoryDelete(User [name=ops001, uid=40101, gid=24336,  root=/],pnfs/pp.rl.ac.uk/data/ops/generated/2007-08-01/file919f58c3-7ca9-41ee-8f3b-1f384eca7d11) Error Delete failed: NULL
lcg_del: Communication error on send
VO name: ops
Using GUID : 7a5c2db6-fcd6-4ec5-8899-4dcad45ec6d3
set timeout to 0 seconds
srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/ops/generated/2007-08-01/file919f58c3-7ca9-41ee-8f3b-1f384eca7d11 is NOT deleted
+ result=1

Explanation

This happens when the dCache does not delete the file within 10 seconds of receiving the request from the SRM. The SRM then returns an error to the client, even though the operation might go on to succeed after, say, 11 seconds. So although lcg-del reports a failure, the file will actually be removed from the dCache. At the time of writing (v1.7.0-39) this timeout cannot be configured; the dCache developers are working on it.
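The effect can be illustrated with a toy model in Python. This is not dCache code: `delete_with_deadline` is a hypothetical helper in which a thread stands in for the slow delete, and the times are scaled down so that it runs quickly. It shows how the caller can be told that the delete failed while the delete itself still completes.

```python
import threading
import time


def delete_with_deadline(backend_delay, deadline):
    """Toy model of a delete request guarded by a fixed reply deadline.

    Returns (client_saw_success, file_removed_eventually).
    """
    removed = threading.Event()

    def backend_delete():
        time.sleep(backend_delay)  # stands in for the slow delete in the backend
        removed.set()

    worker = threading.Thread(target=backend_delete)
    worker.start()
    # The front end waits only until the deadline, then reports an error.
    client_saw_success = removed.wait(timeout=deadline)
    # The backend is not cancelled, so the delete still runs to completion.
    worker.join()
    return client_saw_success, removed.is_set()
```

Calling `delete_with_deadline(0.5, 0.1)` returns `(False, True)`: the client sees a failure, yet the file is gone. With `delete_with_deadline(0.05, 0.5)` the reply arrives in time and the result is `(True, True)`.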

InvocationTargetException

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1194051383 -d heplnx204.pp.rl.ac.uk
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
the server sent an error response: 500 500  java.lang.reflect.InvocationTargetException: 
java.rmi.RemoteException: srm advisoryDelete failed; nested exception is: 
	java.lang.RuntimeException:  advisoryDelete(User [name=ops001, uid=40101, gid=24336,  root=/],pnfs/pp.rl.ac.uk/data/ops/generated/2007-11-03/file1802f1e7-b022-4052-ae57-dddf45e8641b) Error file does not exist, cannot delete
lcg_cr: No such file or directory
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-heplnx204.pp.rl.ac.uk-1194051383
Using SURL : srm://heplnx204.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/ops/generated/2007-11-03/file1802f1e7-b022-4052-ae57-dddf45e8641b
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: heplnx204.pp.rl.ac.uk
Destination URL for copy: gsiftp://heplnx173.pp.rl.ac.uk:2811//pnfs/pp.rl.ac.uk/data/ops/generated/2007-11-03/file1802f1e7-b022-4052-ae57-dddf45e8641b
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-heplnx204.pp.rl.ac.uk-1194051383
Copy Failed: Unregistering alias from catalog.
+ result=1

CGSI-gSOAP: GSS Major Status: Authentication Failed

Problem getting hold of server certificates?

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-dcache02.tier2.hep.manchester.ac.uk-1186123222 -d dcache02.tier2.hep.manchester.ac.uk
CGSI-gSOAP: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
(null)
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-dcache02.tier2.hep.manchester.ac.uk-1186123222
Using SURL : srm://dcache02.tier2.hep.manchester.ac.uk/pnfs/tier2.hep.manchester.ac.uk/data/ops/generated/2007-08-03/file82f5de94-7cf2-4de7-9b32-4f07eb2a23a2
+ result=1

gPlazma timed out

This is probably because the gPlazma cell is being used rather than the module. With the module, other dCache cells call the gPlazma methods directly to do the authorisation. With the cell, there is a dedicated process that the other cells must talk to, which can lead to timeouts if there are problems with inter-cell communication.

+ lcg-cp -v --vo ops lfn:SRM-put-heplnx204.pp.rl.ac.uk-1186472286 file:/home/samops/.same/SRM/nodes/heplnx204.pp.rl.ac.uk/testFile.txt
the server sent an error response: 530 530 Authorization Service failed: diskCacheV111.services.authorization.AuthorizationServiceException: authRequestID  761915796 Message to gPlazma timed out for authentification of /C=CH/O=CERN/OU=GRID/CN=Judit Novak 0973 - ops
lcg_cp: Invalid argument
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
+ result=1
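The difference between the two arrangements can be sketched with a toy model in Python. None of this is dCache or gPlazma code: `authorise`, `AuthCell` and the `delivering` switch are hypothetical names invented for illustration, and the authorisation rule is a placeholder. The point is only that a direct call cannot lose a message, whereas a request sent to a separate worker depends on the reply arriving before the caller's timeout.

```python
import queue
import threading


def authorise(dn):
    """Placeholder decision: accept any non-empty DN (hypothetical rule)."""
    return bool(dn)


def authorise_via_module(dn):
    """Module style: the caller runs the authorisation code directly."""
    return authorise(dn)


class AuthCell:
    """Cell style: a dedicated worker answers requests sent as messages."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.delivering = True  # set to False to mimic lost inter-cell replies
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            dn, reply_box = self.inbox.get()
            if self.delivering:
                reply_box.put(authorise(dn))
            # Otherwise the reply is dropped and the caller's wait times out.


def authorise_via_cell(cell, dn, timeout):
    """Send a request to the cell and wait a bounded time for the reply."""
    reply_box = queue.Queue(maxsize=1)
    cell.inbox.put((dn, reply_box))
    try:
        return reply_box.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("message to authorisation cell timed out") from None
```

With a healthy `AuthCell` both paths give the same answer. Once `delivering` is set to `False`, only `authorise_via_cell` fails, and it fails with a timeout rather than an authorisation decision, which is the shape of the error in the log above.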

Name server not active

+ lcg-cr -v --vo ops -d srm.epcc.ed.ac.uk -l lfn:sft-lcg-rm-cr-wn0.epcc.ed.ac.uk.070820122357  file:///home/opssgm/globus-tmp.wn0.7879.0/WMS_wn0_08352_https_3a_2f_2frb127.cern.ch_3a9000_2fpS xaLLFVkJREpHb51kEa4g/work/testjob/nodes/ce.epcc.ed.ac.uk/sft-lcg-rm-cr.txt
Name server not active
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/sft-lcg-rm-cr-wn0.epcc.ed.ac.uk.070820122357
+ result=1

gethostbyname

+ lcg-cp -v --vo ops lfn:SRM-put-srm.epcc.ed.ac.uk-1193901323 file:/home/samops/.same/SRM/nodes/srm.epcc.ed.ac.uk/testFile.txt
globus_ftp_control_connect: globus_libc_gethostbyname_r failed
lcg_cp: Invalid argument
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
+ result=1

CGSI-gSOAP: GSS Major Status: General failure

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-srm.epcc.ed.ac.uk-1194098403 -d srm.epcc.ed.ac.uk
CGSI-gSOAP: GSS Major Status: General failure
GSS Minor Status Error Chain:
acquire_cred.c:125: gss_acquire_cred: Error with GSI credential
globus_i_gsi_gss_utils.c:1310: globus_i_gsi_gss_cred_read: Error with gss credential handle
globus_gsi_credential.c:721: globus_gsi_cred_read: Valid credentials could not be found in any of the possible locations specified by the credential search order.
globus_gsi_credential.c:447: globus_gsi_cred_read: Error reading host credential
globus_gsi_system_config.c:3977: globus_gsi_sysconfig_get_host_cert_filename_unix: Error with certificate filename
globus_gsi_system_config.c:380: globus_i_gsi_sysconfig_create_cert_string: Error with certificate filename: /etc/grid-security/hostcert.pem not owned by current user.
globus_gsi_credential.c:239: globus_gsi_cred_read: Error reading proxy credential
globus_gsi_system_config.c:4589: globus_gsi_sysconfig_get_proxy_filename_unix: Could not find a valid proxy certificate file location
globus_gsi_system_config.c:446: globus_i_gsi_s
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-srm.epcc.ed.ac.uk-1194098403
Using SURL : srm://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/ops/generated/2007-11-03/filee5e1619f-9329-47a9-a4ee-42aca32127ac
+ result=1
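
The error chain above names two things that can be checked directly on the node that ran the test: who owns the credential file, and whether a proxy exists in one of the usual locations. The sketch below is a minimal Python check, not part of any grid toolkit, and the helper names are made up for this page. It assumes the conventional Globus proxy locations, namely the `X509_USER_PROXY` environment variable followed by `/tmp/x509up_u<uid>`.

```python
import os


def proxy_candidates(environ=None, uid=None):
    """List the paths where a user proxy is conventionally looked for."""
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    paths = []
    if environ.get("X509_USER_PROXY"):
        paths.append(environ["X509_USER_PROXY"])
    paths.append("/tmp/x509up_u%d" % uid)
    return paths


def credential_problems(path, uid=None):
    """Return a list of human-readable problems with one credential file."""
    uid = os.getuid() if uid is None else uid
    try:
        st = os.stat(path)
    except OSError as exc:
        return ["%s: cannot stat (%s)" % (path, exc.strerror)]
    problems = []
    if st.st_uid != uid:
        problems.append("%s: not owned by current user" % path)
    if st.st_size == 0:
        problems.append("%s: file is empty" % path)
    return problems


if __name__ == "__main__":
    # The host certificate path is the one named in the error chain above.
    for path in proxy_candidates() + ["/etc/grid-security/hostcert.pem"]:
        for problem in credential_problems(path):
            print(problem)
```

Run it as the account that executes the SAM test, since both the proxy path and the ownership check depend on the current uid.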