Random DPM errors in SAM

From GridPP Wiki
Revision as of 18:15, 15 February 2008 by Greig cowan (Talk | contribs)

See also Random dCache failures in SAM.

Transport endpoint is not connected

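The classic Grid error message. It is a generic failure, so the useful information is usually in the lines around it, as several of the examples on this page show.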

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1185897573 -d serv02.hep.phy.cam.ac.uk
           0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
           0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
       41472 bytes     41.71 KB/sec avg      41.71 KB/sec inst
Internal error
lcg_cr: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1185897573
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-07-31/file94854fad-f01d-42e3-bf59-41b6aaa2a0a7
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: serv02.hep.phy.cam.ac.uk
Destination URL for copy: gsiftp://serv02.hep.phy.cam.ac.uk/serv02.hep.phy.cam.ac.uk:/lcg_flatf-2/ops/2007-07-31/file94854fad-f01d-42e3-bf59-41b6aaa2a0a7.463393.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1185897573
Transfer took 2030 ms
Setting SRM transfer to 'done' failed: Unregistering alias from catalog.
+ result=1

One possible cause is a broken grid-mapfile on the DPM server.
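
If a broken grid-mapfile is suspected, a quick format check can narrow things down. The sketch below assumes the conventional location /etc/grid-security/grid-mapfile and the usual layout of one entry per line, a double-quoted DN followed by a single mapping name. Both are assumptions about the site rather than something taken from the SAM output above, and the function name check_gridmap is made up for this example.

```shell
#!/bin/sh
# check_gridmap FILE: print every line that does not look like
#   "<quoted DN>" <mapping>
# and return non-zero if any such line is found.
# Blank lines and lines starting with '#' are ignored.
check_gridmap() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*$/ || /^[ \t]*#/ { next }
        !/^"[^"]+"[ \t]+[^ \t]+[ \t]*$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

On the DPM server it could be run as check_gridmap /etc/grid-security/grid-mapfile. A clean result only means the syntax looks sane; it does not show that the mappings themselves are right.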

globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-se1.pp.rhul.ac.uk-1186235305 -d se1.pp.rhul.ac.uk
globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed
lcg_cr: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch 
Using LFN : /grid/ops/SAM/SRM-put-se1.pp.rhul.ac.uk-1186235305
Using SURL : srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/ops/generated/2007-08-04/file3210f9f0-28e3-4e96-b4a7-3e5d0a418fae
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: se1.pp.rhul.ac.uk
Destination URL for copy: gsiftp://gridraid2.pp.rhul.ac.uk/gridraid2.pp.rhul.ac.uk:/grid2/pool/ops/2007-08-04/file3210f9f0-28e3-4e96-b4a7-3e5d0a418fae.650031.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-se1.pp.rhul.ac.uk-1186235305
Copy Failed: Unregistering alias from catalog.
+ result=1
+ set +x

GSS failure

+ lcg-cp -v --vo ops lfn:SRM-put-dgc-grid-34.brunel.ac.uk-1185695992 file:/home/samops/.same/SRM/nodes/dgc-grid-34.brunel.ac.uk/testFile.txt
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
globus_l_ftp_control_send_cmd_cb:  gss_init_sec_context failed
 GSS failure: 
GSS Major Status: General failure
GSS Minor Status Error Chain:
init_sec_context.c:114: gss_init_sec_context: Error with gss context
globus_i_gsi_gss_utils.c:308: globus_i_gsi_gss_create_and_fill_context: Error with GSI credential
acquire_cred.c:125: gss_acquire_cred: Error with GSI credential
globus_i_gsi_gss_utils.c:1310: globus_i_gsi_gss_cred_read: Error with gss credential handle
globus_gsi_credential.c:721: globus_gsi_cred_read: Valid credentials could not be found in any of the possible locations specified by the credential search order.
globus_gsi_credential.c:447: globus_gsi_cred_read: Error reading host credential
globus_gsi_system_config.c:3977: globus_gsi_sysconfig_get_host_cert_filename_unix: Error with certificate filename
globus_gsi_system_config.c:380: globus_i_gsi_sysconfig_create_cert_string: Error with certificate filename: /etc/grid-security/hostcert.pem not owned by current user.
globus_gsi_credential.c:239: globus_gsi_cred_read: Error reading proxy credential
globus_gsi_system_config.c:4589: globus_gsi_sysconfig_get_proxy_filename_unix: Could not find a valid proxy certificate file location
globus_gsi_system_config.c:446: globus_i_gsi_sysconfig_create_key_string: Error with key filename: /tmp/x509up_u23550 has zero length.
globus_gsi_credential.c:351: globus_gsi_cred_read: Error reading user credential
globus_gsi_credential.c:1086: globus_gsi_cred_read_key: Key is password protected: GSI does not currently support password protected private keys.
OpenSSL Error: pem_lib.c:434: in library: PEM routines, function PEM_do_header: bad password read
lcg_cp: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
Source URL: lfn:/grid/ops/SAM/SRM-put-dgc-grid-34.brunel.ac.uk-1185695992
File size: 41472
Source URL for copy: gsiftp://dgc-grid-50.brunel.ac.uk/dgc-grid-50.brunel.ac.uk:/data2/dpmfs/ops/2007-07-29/fileb14fd55c-fa97-4291-a824-7c1456a83a35.969231.0
Destination URL: file:/home/samops/.same/SRM/nodes/dgc-grid-34.brunel.ac.uk/testFile.txt
# streams: 1
# set timeout to  0 (seconds)
+ result=1
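
The error chain above names three credential problems: the host certificate is not owned by the current user, the proxy file /tmp/x509up_u23550 has zero length, and the user key is password protected. The zero-length proxy is the one that can be tested for directly. A minimal sketch, assuming the proxy is at the default /tmp/x509up_u<uid> location unless X509_USER_PROXY is set; the function name check_proxy is hypothetical.

```shell
#!/bin/sh
# check_proxy [FILE]: report whether a proxy file is usable at the
# filesystem level. This checks only existence, size and ownership;
# it does not verify the certificate chain or its remaining lifetime.
check_proxy() {
    proxy=${1:-${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}}
    if [ ! -e "$proxy" ]; then
        echo "$proxy: missing"; return 1
    elif [ ! -s "$proxy" ]; then
        echo "$proxy: zero length"; return 1
    elif [ ! -O "$proxy" ]; then
        echo "$proxy: not owned by current user"; return 1
    fi
    echo "$proxy: ok"
}
```

Run as the account that executes the test (samops in the output above), a "zero length" result would match the failure shown here.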

Permission denied

This failure is seen frequently at QMUL, so it may be explained by the site's use of poolfs (a home-brew filesystem providing round-robin access to servers), on top of which DPM runs.

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-se01.esc.qmul.ac.uk-1185674118 -d se01.esc.qmul.ac.uk
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
        41472 bytes     29.09 KB/sec avg      29.09 KB/sec inst
Permission denied
Permission denied
lcg_cr: Permission denied
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-se01.esc.qmul.ac.uk-1185674118
Using SURL : srm://se01.esc.qmul.ac.uk/dpm/esc.qmul.ac.uk/home/ops/generated/2007-07-29/file217c8e1a-8152-417c-a6f2-56a6add150e8
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: se01.esc.qmul.ac.uk
Destination URL for copy: gsiftp://se01.esc.qmul.ac.uk/se01.esc.qmul.ac.uk:/pool/data/dpm/ops/2007-07-29/file217c8e1a-8152-417c-a6f2-56a6add150e8.610503.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-se01.esc.qmul.ac.uk-1185674118
Transfer took 2110 ms
Setting SRM transfer to 'done' failed: Unregistering alias from catalog.
+ result=1


No such file or directory

The cause is not known; presumably it is a server-side problem. It occurs fairly often.

+ lcg-cp -v --vo ops lfn:SE-lcg-cr-heplnx204.pp.rl.ac.uk-1185813804 file:/home/samops/.same/SE/nodes/heplnx204.pp.rl.ac.uk/testFile.txt
lcg_cp: No such file or directory
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
+ result=1

Timeout when executing test ??? after 600 seconds!

Does anyone know what is timing out?

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185947904 -d heplnx204.pp.rl.ac.uk
Timeout when executing test SRM-put after 600 seconds!

DB fetch error / Communication error on send

Is this a problem with the DPM DB, or the LFC? Or is it a BDII issue?

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1186270777 -d serv02.hep.phy.cam.ac.uk
DB fetch error
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1186270777
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-08-05/file07c4b265-cffc-4f9e-846d-4d5ebda714ec
+ result=1

Can't get req uniqueid

Related to the "DB fetch error" failure above.

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1186281814 -d serv02.hep.phy.cam.ac.uk
Can't get req uniqueid
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1186281814
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-08-05/file9c228227-8a48-47b0-bf20-8d1a26e15c4e
+ result=1

No valid credential found

+ lcg-del -v --vo ops -a lfn:SE-lcg-cr-epgse1.ph.bham.ac.uk-1187157563
send2nsd: NS002 - send error : No valid credential found
Bad credentials
lcg_del: Communication error on send
VO name: ops
Using GUID : 194c1754-3d97-464b-b8ea-d576966d1aaf
set timeout to 0 seconds
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/ops/generated/2007-08-15/file36dc7166-4316-400f-a15e-00d99b30260b is deleted
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/ops/generated/2007-08-15/file36dc7166-4316-400f-a15e-00d99b30260b is NOT unregistered
+ result=1

Error reading token data

This can be caused by an expired certificate or by out-of-date CRLs. Check the certificate validity dates with:

openssl x509 -in /etc/grid-security/dpmmgr/dpmcert.pem -noout -dates

To refresh the CRLs, run the command from /etc/cron.d/fetch-crl by hand.
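
These checks can also be scripted. The sketch below relies on two standard openssl options: x509 -checkend N, which exits non-zero if the certificate expires within N seconds, and crl -nextupdate, which prints when a CRL is due to be replaced. The CRL directory /etc/grid-security/certificates and the *.r0 file suffix are the usual grid conventions and are assumed here; the function names are made up for this example.

```shell
#!/bin/sh
# cert_valid_for FILE [SECONDS]: succeed if the certificate in FILE is
# still valid SECONDS from now (default 0, i.e. not yet expired).
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "${2:-0}" >/dev/null
}

# list_crl_updates [DIR]: print the nextUpdate field of every CRL in DIR
# (assumed default: /etc/grid-security/certificates, files named *.r0).
list_crl_updates() {
    for crl in "${1:-/etc/grid-security/certificates}"/*.r0; do
        [ -e "$crl" ] || continue
        printf '%s: ' "$crl"
        openssl crl -in "$crl" -noout -nextupdate
    done
}
```

For example, cert_valid_for /etc/grid-security/dpmmgr/dpmcert.pem 604800 || echo "expires within a week" would flag a certificate that is about to run out, and any nextUpdate date in the past printed by list_crl_updates points at a stale CRL.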

Another possibility is that the srmv1 service is in a bad state. One symptom is that nothing has been written to the log on the SE since the problem started (see /var/log/srmv1/log). If this happens, take a core dump of the srmv1 process for later analysis by the DPM experts, then restart the service.
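
One way to spot the silent-log symptom is to test the modification time of the log file. This is only a sketch: a log that has not been touched recently is used as a stand-in for "nothing logged since the problem started", the 60-minute threshold is arbitrary, and the function name log_is_stale is hypothetical. The follow-up commands are shown as comments because the process and service names (srmv1) and the availability of gcore and pidof are assumptions about the installation.

```shell
#!/bin/sh
# log_is_stale FILE MINUTES: succeed if FILE exists and has not been
# modified in the last MINUTES minutes (uses find -mmin, a GNU/BSD
# extension to find).
log_is_stale() {
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}

# Possible follow-up on the SE, left commented out because the tool
# and service names are assumptions:
#   if log_is_stale /var/log/srmv1/log 60; then
#       gcore -o /tmp/srmv1.core "$(pidof srmv1)"   # gcore ships with gdb
#       service srmv1 restart
#   fi
```

Keeping the core dump step before the restart matters, since the restart discards the state the DPM experts would want to look at.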

+ lcg-cr -v --vo ops -d se1.pp.rhul.ac.uk -l lfn:sft-lcg-rm-cr-node31.beowulf.cluster.071102071038 file:///tmp/WMS_node31_015939_https_3a_2f_2frb113.cern.ch_3a9000_2fB3JV17rWKETkD83YwuRugg/work/testjob/nodes/ce1.pp.rhul.ac.uk/sft-lcg-rm-cr.txt
httpg://se1.pp.rhul.ac.uk:8443/srm/managerv1: CGSI-gSOAP: Error reading token data: Connection reset by peer
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/sft-lcg-rm-cr-node31.beowulf.cluster.071102071038
Using SURL : srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/ops/generated/2007-11-02/file107cb9d1-d1fc-4d52-bf87-08726e1a5d02
+ result=1

Unknown Error

$ lcg-cr -v --vo ops file:/etc/group -d some-dpm.some-domain
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/generated/2007-10-01/file-714ebdda-131c-42dc-b87e-e78e80a36485
Using SURL : srm://some-dpm.some-domain/dpm/some-domain/home/ops/generated/2007-10-01/.....
httpg://some-dpm.some-domain:8443/srm/managerv1: Unknown error
lcg_cr: Communication error on send

Please check the GOC wiki article on this subject.