Random DPM errors in SAM

From GridPPwiki

See also Random dCache failures in SAM.

Transport endpoint is not connected

The classic Grid error message. Enjoy debugging this one.

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1185897573 -d serv02.hep.phy.cam.ac.uk
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
        41472 bytes     41.71 KB/sec avg      41.71 KB/sec inst
Internal error
lcg_cr: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1185897573
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-07-31/file94854fad-f01d-42e3-bf59-41b6aaa2a0a7
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: serv02.hep.phy.cam.ac.uk
Destination URL for copy: gsiftp://serv02.hep.phy.cam.ac.uk/serv02.hep.phy.cam.ac.uk:/lcg_flatf-2/ops/2007-07-31/file94854fad-f01d-42e3-bf59-41b6aaa2a0a7.463393.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1185897573
Transfer took 2030 ms
Setting SRM transfer to 'done' failed: Unregistering alias from catalog.
+ result=1

This may actually be caused by a broken grid-mapfile on the DPM server.
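A quick way to look for a broken grid-mapfile is to scan it for lines that do not parse. The sketch below is only a rough check under stated assumptions: it assumes each entry is a double-quoted certificate DN followed by a local account or pool account name (such as .ops), and that blank lines and lines starting with # can be ignored. The function name is made up for this page, and what exactly was broken in the case above is not recorded here.

```python
import re

# Assumed entry format: "<certificate DN>" <account>, e.g.
#   "/C=UK/O=eScience/CN=some user" .ops
ENTRY = re.compile(r'^"(/[^"]+)"\s+(\S+)$')


def find_bad_mapfile_lines(lines):
    """Return (line_number, text) pairs for lines that do not look like entries."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        # Assumption: blank lines and '#' comments are harmless.
        if not text or text.startswith("#"):
            continue
        if not ENTRY.match(text):
            bad.append((number, text))
    return bad
```

To use it, open the mapfile on the DPM head node (commonly /etc/grid-security/grid-mapfile, but check your own installation) and pass the file object to find_bad_mapfile_lines; an empty list means every line matched the assumed format.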

globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-se1.pp.rhul.ac.uk-1186235305 -d se1.pp.rhul.ac.uk
globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed
lcg_cr: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch 
Using LFN : /grid/ops/SAM/SRM-put-se1.pp.rhul.ac.uk-1186235305
Using SURL : srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/ops/generated/2007-08-04/file3210f9f0-28e3-4e96-b4a7-3e5d0a418fae
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: se1.pp.rhul.ac.uk
Destination URL for copy: gsiftp://gridraid2.pp.rhul.ac.uk/gridraid2.pp.rhul.ac.uk:/grid2/pool/ops/2007-08-04/file3210f9f0-28e3-4e96-b4a7-3e5d0a418fae.650031.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-se1.pp.rhul.ac.uk-1186235305
Copy Failed: Unregistering alias from catalog.
+ result=1
+ set +x
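gethostbyaddr is a reverse (address to name) lookup, and the log above shows the copy being redirected to the disk server gridraid2.pp.rhul.ac.uk, so one thing worth checking is whether the disk server's name resolves forwards and backwards consistently. The sketch below is a minimal check of that, not a confirmed diagnosis: the source does not say on which machine the lookup failed. The function name and its forward/reverse parameters are made up for this page; the parameters exist only so the check can be tried without a network.

```python
import socket


def check_reverse_dns(hostname, forward=socket.gethostbyname,
                      reverse=socket.gethostbyaddr):
    """Resolve hostname to an address and back again; return (ok, detail)."""
    try:
        address = forward(hostname)
    except OSError as error:
        return False, f"forward lookup of {hostname} failed: {error}"
    try:
        name, aliases, _addresses = reverse(address)
    except OSError as error:
        return False, f"reverse lookup of {address} failed: {error}"
    # The reverse lookup should give back the name we started from.
    known = {name.lower(), *(alias.lower() for alias in aliases)}
    if hostname.lower() not in known:
        return False, f"{address} maps back to {name}, not {hostname}"
    return True, f"{hostname} <-> {address}"
```

Calling check_reverse_dns("gridraid2.pp.rhul.ac.uk") from both the client and the head node would show whether either side sees a missing or mismatched reverse record.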

GSS failure

+ lcg-cp -v --vo ops lfn:SRM-put-dgc-grid-34.brunel.ac.uk-1185695992 file:/home/samops/.same/SRM/nodes/dgc-grid-34.brunel.ac.uk/testFile.txt
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
globus_l_ftp_control_send_cmd_cb:  gss_init_sec_context failed
 GSS failure: 
GSS Major Status: General failure
GSS Minor Status Error Chain:
init_sec_context.c:114: gss_init_sec_context: Error with gss context
globus_i_gsi_gss_utils.c:308: globus_i_gsi_gss_create_and_fill_context: Error with GSI credential
acquire_cred.c:125: gss_acquire_cred: Error with GSI credential
globus_i_gsi_gss_utils.c:1310: globus_i_gsi_gss_cred_read: Error with gss credential handle
globus_gsi_credential.c:721: globus_gsi_cred_read: Valid credentials could not be found in any of the possible locations specified by the credential search order.
globus_gsi_credential.c:447: globus_gsi_cred_read: Error reading host credential
globus_gsi_system_config.c:3977: globus_gsi_sysconfig_get_host_cert_filename_unix: Error with certificate filename
globus_gsi_system_config.c:380: globus_i_gsi_sysconfig_create_cert_string: Error with certificate filename: /etc/grid-security/hostcert.pem not owned by current user.
globus_gsi_credential.c:239: globus_gsi_cred_read: Error reading proxy credential
globus_gsi_system_config.c:4589: globus_gsi_sysconfig_get_proxy_filename_unix: Could not find a valid proxy certificate file location
globus_gsi_system_config.c:446: globus_i_gsi_sysconfig_create_key_string: Error with key filename: /tmp/x509up_u23550 has zero length.
globus_gsi_credential.c:351: globus_gsi_cred_read: Error reading user credential
globus_gsi_credential.c:1086: globus_gsi_cred_read_key: Key is password protected: GSI does not currently support password protected private keys.
OpenSSL Error: pem_lib.c:434: in library: PEM routines, function PEM_do_header: bad password read
lcg_cp: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
Source URL: lfn:/grid/ops/SAM/SRM-put-dgc-grid-34.brunel.ac.uk-1185695992
File size: 41472
Source URL for copy: gsiftp://dgc-grid-50.brunel.ac.uk/dgc-grid-50.brunel.ac.uk:/data2/dpmfs/ops/2007-07-29/fileb14fd55c-fa97-4291-a824-7c1456a83a35.969231.0
Destination URL: file:/home/samops/.same/SRM/nodes/dgc-grid-34.brunel.ac.uk/testFile.txt
# streams: 1
# set timeout to  0 (seconds)
+ result=1
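The error chain above walks through three credential locations: the host certificate (not owned by the current user), the proxy file /tmp/x509up_u23550 (zero length) and the user key (password protected). For a SAM job the proxy is the one that should have worked, so an empty proxy file looks like the real problem. The sketch below, with function names made up for this page, checks a proxy file for the conditions named in that log plus loose permissions; it is an illustration and does not validate the certificate itself.

```python
import os
import stat


def check_proxy_file(path):
    """Return a list of problems found with a proxy credential file."""
    try:
        info = os.stat(path)
    except OSError as error:
        return [f"cannot stat {path}: {error}"]
    problems = []
    if info.st_size == 0:
        problems.append(f"{path} has zero length")
    if info.st_uid != os.getuid():
        problems.append(f"{path} is not owned by the current user")
    # Any permission bit set for group or others counts as too open.
    if stat.S_IMODE(info.st_mode) & 0o077:
        problems.append(f"{path} is accessible by group or others")
    return problems


def default_proxy_path():
    # X509_USER_PROXY overrides the conventional /tmp/x509up_u<uid> location.
    return os.environ.get("X509_USER_PROXY", f"/tmp/x509up_u{os.getuid()}")
```

Running check_proxy_file(default_proxy_path()) as the account that runs the test returns an empty list when none of these problems is present.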

Permission denied

This failure is seen often at QMUL, so it may be explained by their use of poolfs, a home-brew filesystem for round-robin access to servers, on top of which DPM runs.

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-se01.esc.qmul.ac.uk-1185674118 -d se01.esc.qmul.ac.uk
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
            0 bytes      0.00 KB/sec avg      0.00 KB/sec inst
        41472 bytes     29.09 KB/sec avg      29.09 KB/sec inst
Permission denied
Permission denied
lcg_cr: Permission denied
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-se01.esc.qmul.ac.uk-1185674118
Using SURL : srm://se01.esc.qmul.ac.uk/dpm/esc.qmul.ac.uk/home/ops/generated/2007-07-29/file217c8e1a-8152-417c-a6f2-56a6add150e8
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: se01.esc.qmul.ac.uk
Destination URL for copy: gsiftp://se01.esc.qmul.ac.uk/se01.esc.qmul.ac.uk:/pool/data/dpm/ops/2007-07-29/file217c8e1a-8152-417c-a6f2-56a6add150e8.610503.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-se01.esc.qmul.ac.uk-1185674118
Transfer took 2110 ms
Setting SRM transfer to 'done' failed: Unregistering alias from catalog.
+ result=1


No such file or directory

The cause is not known; presumably it is a server-side problem. This failure is fairly common.

+ lcg-cp -v --vo ops lfn:SE-lcg-cr-heplnx204.pp.rl.ac.uk-1185813804 file:/home/samops/.same/SE/nodes/heplnx204.pp.rl.ac.uk/testFile.txt
lcg_cp: No such file or directory
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
+ result=1

Timeout when executing test ??? after 600 seconds!

Does anyone know what is timing out?

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185947904 -d heplnx204.pp.rl.ac.uk
Timeout when executing test SRM-put after 600 seconds!

DB fetch error / Communication error on send

Is this a problem with the DPM DB, or the LFC? Or is it a BDII issue?

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1186270777 -d serv02.hep.phy.cam.ac.uk
DB fetch error
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1186270777
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-08-05/file07c4b265-cffc-4f9e-846d-4d5ebda714ec
+ result=1

Can't get req uniqueid

Related to the DB fetch error above.

+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1186281814 -d serv02.hep.phy.cam.ac.uk
Can't get req uniqueid
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1186281814
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-08-05/file9c228227-8a48-47b0-bf20-8d1a26e15c4e
+ result=1

No valid credential found

+ lcg-del -v --vo ops -a lfn:SE-lcg-cr-epgse1.ph.bham.ac.uk-1187157563
send2nsd: NS002 - send error : No valid credential found
Bad credentials
lcg_del: Communication error on send
VO name: ops
Using GUID : 194c1754-3d97-464b-b8ea-d576966d1aaf
set timeout to 0 seconds
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/ops/generated/2007-08-15/file36dc7166-4316-400f-a15e-00d99b30260b is deleted
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/ops/generated/2007-08-15/file36dc7166-4316-400f-a15e-00d99b30260b is NOT unregistered
+ result=1

Error reading token data

This could be caused by expired certificates or out-of-date CRLs. Check the validity dates of the DPM certificate with:

openssl x509 -in /etc/grid-security/dpmmgr/dpmcert.pem -noout -dates

To refresh the CRLs, run the command found in /etc/cron.d/fetch-crl by hand.
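The certificate date check can also be scripted, for example to warn before the certificate expires. The sketch below parses the notBefore=/notAfter= lines that openssl x509 -noout -dates prints; it assumes the usual date form (for example "Aug  5 12:00:00 2008 GMT") and an English/C locale for month names. The function names are made up for this page.

```python
import subprocess
from datetime import datetime, timezone


def parse_openssl_dates(output):
    """Parse the notBefore=/notAfter= lines from `openssl x509 -noout -dates`."""
    dates = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in ("notBefore", "notAfter"):
            text = value.strip()
            # Assumption: timestamps are given in GMT, as openssl normally prints them.
            if text.endswith("GMT"):
                text = text[:-3].strip()
            stamp = datetime.strptime(text, "%b %d %H:%M:%S %Y")
            dates[key] = stamp.replace(tzinfo=timezone.utc)
    return dates


def is_valid_at(dates, when=None):
    """True if 'when' (default: now) lies inside the validity period."""
    when = when or datetime.now(timezone.utc)
    return dates["notBefore"] <= when <= dates["notAfter"]


def read_cert_dates(cert_path):
    """Run openssl on a certificate file and return its parsed dates."""
    result = subprocess.run(
        ["openssl", "x509", "-in", cert_path, "-noout", "-dates"],
        capture_output=True, text=True, check=True,
    )
    return parse_openssl_dates(result.stdout)
```

For example, is_valid_at(read_cert_dates("/etc/grid-security/dpmmgr/dpmcert.pem")) returns False once the certificate has expired. This does not check CRLs.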

Another possibility is that the srmv1 service is in a bad state. One symptom is that nothing has been written to the log on the SE (/var/log/srmv1/log) since the problem started. If this happens, make a core dump of the srmv1 process for later analysis by the DPM experts, then restart the service.

+ lcg-cr -v --vo ops -d se1.pp.rhul.ac.uk -l lfn:sft-lcg-rm-cr-node31.beowulf.cluster.071102071038 file:///tmp/WMS_node31_015939_https_3a_2f_2frb113.cern.ch_3a9000_2fB3JV17rWKETkD83YwuRugg/work/testjob/nodes/ce1.pp.rhul.ac.uk/sft-lcg-rm-cr.txt
httpg://se1.pp.rhul.ac.uk:8443/srm/managerv1: CGSI-gSOAP: Error reading token data: Connection reset by peer
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/sft-lcg-rm-cr-node31.beowulf.cluster.071102071038
Using SURL : srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/ops/generated/2007-11-02/file107cb9d1-d1fc-4d52-bf87-08726e1a5d02
+ result=1

Unknown Error

$ lcg-cr -v --vo ops file:/etc/group -d some-dpm.some-domain
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/generated/2007-10-01/file-714ebdda-131c-42dc-b87e-e78e80a36485
Using SURL : srm://some-dpm.some-domain/dpm/some-domain/home/ops/generated/2007-10-01/.....
httpg://some-dpm.some-domain:8443/srm/managerv1: Unknown error
lcg_cr: Communication error on send

See the GOC wiki article on this subject: http://goc.grid.sinica.edu.tw/gocwiki/Unknown_error_..._Communication_error_on_send
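When going through many failed SAM results, it can help to sort the test output by the error signatures listed on this page. The sketch below simply searches a log for those strings and returns the matching section titles. The table and function names are made up for this page, and a match only points at a section to read; it does not establish the cause.

```python
# (signature to search for, section title on this page); the more specific
# signatures come first because many logs also end with the generic
# "Transport endpoint is not connected" message.
SIGNATURES = [
    ("globus_libc_gethostbyaddr_r failed",
     "globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed"),
    ("GSS failure", "GSS failure"),
    ("Permission denied", "Permission denied"),
    ("No such file or directory", "No such file or directory"),
    ("Timeout when executing test", "Timeout when executing test ??? after 600 seconds!"),
    ("DB fetch error", "DB fetch error / Communication error on send"),
    ("Can't get req uniqueid", "Can't get req uniqueid"),
    ("No valid credential found", "No valid credential found"),
    ("Error reading token data", "Error reading token data"),
    ("Unknown error", "Unknown Error"),
    ("Transport endpoint is not connected", "Transport endpoint is not connected"),
]


def classify(log_text):
    """Return the titles of all sections whose signature appears in the log."""
    return [title for needle, title in SIGNATURES if needle in log_text]
```

A log can match more than one section, in which case the earlier, more specific entries are usually the better place to start.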