Random DPM errors in SAM
See also Random dCache failures in SAM.
Transport endpoint is not connected
The classic Grid error message. Enjoy debugging this one.
+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1185897573 -d serv02.hep.phy.cam.ac.uk
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
41472 bytes 41.71 KB/sec avg 41.71 KB/sec inst
Internal error
lcg_cr: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1185897573
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-07-31/file94854fad-f01d-42e3-bf59-41b6aaa2a0a7
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: serv02.hep.phy.cam.ac.uk
Destination URL for copy: gsiftp://serv02.hep.phy.cam.ac.uk/serv02.hep.phy.cam.ac.uk:/lcg_flatf-2/ops/2007-07-31/file94854fad-f01d-42e3-bf59-41b6aaa2a0a7.463393.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1185897573
Transfer took 2030 ms
Setting SRM transfer to 'done' failed: Unregistering alias from catalog.
+ result=1
This may actually be caused by a broken grid-mapfile on the DPM server.
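What "broken" means is not recorded here, so the following is only a minimal sketch of a sanity check one might run on the DPM server. It assumes the usual grid-mapfile layout (a double-quoted DN, whitespace, then an account name) and the usual location /etc/grid-security/grid-mapfile. The function name check_gridmap is hypothetical.

```shell
# Hypothetical helper: print lines of a grid-mapfile that do not look like
#   "<quoted DN>" <account>
# Assumption: blank lines and lines starting with '#' are harmless.
check_gridmap() {
    mapfile_path="${1:-/etc/grid-security/grid-mapfile}"
    # -s is true only for a file that exists and is not empty.
    if [ ! -s "$mapfile_path" ]; then
        echo "missing or empty: $mapfile_path"
        return 1
    fi
    # -v inverts the match, -n prefixes each offending line with its number.
    if grep -n -v -E '^([[:space:]]*|#.*|"[^"]+"[[:space:]]+[^[:space:]]+[[:space:]]*)$' "$mapfile_path"; then
        return 1    # grep selected at least one malformed line
    fi
    echo "ok: $mapfile_path"
}

# Example call (as root on the DPM head node):
#   check_gridmap
```

A clean result does not prove the mapping is correct (for example, a DN may be mapped to the wrong account); it only rules out an empty file and obviously malformed lines.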
globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed
+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-se1.pp.rhul.ac.uk-1186235305 -d se1.pp.rhul.ac.uk
globus_ftp_control_connect: globus_libc_gethostbyaddr_r failed
lcg_cr: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-se1.pp.rhul.ac.uk-1186235305
Using SURL : srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/ops/generated/2007-08-04/file3210f9f0-28e3-4e96-b4a7-3e5d0a418fae
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: se1.pp.rhul.ac.uk
Destination URL for copy: gsiftp://gridraid2.pp.rhul.ac.uk/gridraid2.pp.rhul.ac.uk:/grid2/pool/ops/2007-08-04/file3210f9f0-28e3-4e96-b4a7-3e5d0a418fae.650031.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-se1.pp.rhul.ac.uk-1186235305
Copy Failed: Unregistering alias from catalog.
+ result=1
+ set +x
GSS failure
+ lcg-cp -v --vo ops lfn:SRM-put-dgc-grid-34.brunel.ac.uk-1185695992 file:/home/samops/.same/SRM/nodes/dgc-grid-34.brunel.ac.uk/testFile.txt
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
globus_l_ftp_control_send_cmd_cb: gss_init_sec_context failed
GSS failure:
GSS Major Status: General failure
GSS Minor Status Error Chain:
init_sec_context.c:114: gss_init_sec_context: Error with gss context
globus_i_gsi_gss_utils.c:308: globus_i_gsi_gss_create_and_fill_context: Error with GSI credential
acquire_cred.c:125: gss_acquire_cred: Error with GSI credential
globus_i_gsi_gss_utils.c:1310: globus_i_gsi_gss_cred_read: Error with gss credential handle
globus_gsi_credential.c:721: globus_gsi_cred_read: Valid credentials could not be found in any of the possible locations specified by the credential search order.
globus_gsi_credential.c:447: globus_gsi_cred_read: Error reading host credential
globus_gsi_system_config.c:3977: globus_gsi_sysconfig_get_host_cert_filename_unix: Error with certificate filename
globus_gsi_system_config.c:380: globus_i_gsi_sysconfig_create_cert_string: Error with certificate filename: /etc/grid-security/hostcert.pem not owned by current user.
globus_gsi_credential.c:239: globus_gsi_cred_read: Error reading proxy credential
globus_gsi_system_config.c:4589: globus_gsi_sysconfig_get_proxy_filename_unix: Could not find a valid proxy certificate file location
globus_gsi_system_config.c:446: globus_i_gsi_sysconfig_create_key_string: Error with key filename: /tmp/x509up_u23550 has zero length.
globus_gsi_credential.c:351: globus_gsi_cred_read: Error reading user credential
globus_gsi_credential.c:1086: globus_gsi_cred_read_key: Key is password protected: GSI does not currently support password protected private keys.
OpenSSL Error: pem_lib.c:434: in library: PEM routines, function PEM_do_header: bad password read
lcg_cp: Transport endpoint is not connected
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
Source URL: lfn:/grid/ops/SAM/SRM-put-dgc-grid-34.brunel.ac.uk-1185695992
File size: 41472
Source URL for copy: gsiftp://dgc-grid-50.brunel.ac.uk/dgc-grid-50.brunel.ac.uk:/data2/dpmfs/ops/2007-07-29/fileb14fd55c-fa97-4291-a824-7c1456a83a35.969231.0
Destination URL: file:/home/samops/.same/SRM/nodes/dgc-grid-34.brunel.ac.uk/testFile.txt
# streams: 1
# set timeout to 0 (seconds)
+ result=1
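The error chain in this log complains that the proxy file /tmp/x509up_u23550 has zero length, so the failure appears to be on the client side of the transfer. A minimal sketch for reproducing that check by hand, assuming the standard Globus proxy lookup (the X509_USER_PROXY variable, falling back to /tmp/x509up_u followed by the numeric user id), might look like the following. The function name check_proxy_file is hypothetical.

```shell
# Hypothetical helper: report why a proxy file would be rejected.
# With no argument it looks where Globus normally looks:
# $X509_USER_PROXY if set, otherwise /tmp/x509up_u<uid>.
check_proxy_file() {
    proxy="${1:-${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}}"
    if [ ! -e "$proxy" ]; then
        echo "missing: $proxy"
        return 1
    fi
    # -s is true only when the file has a size greater than zero.
    if [ ! -s "$proxy" ]; then
        echo "zero length: $proxy"
        return 1
    fi
    # -O is true only when the file is owned by the effective user.
    if [ ! -O "$proxy" ]; then
        echo "not owned by current user: $proxy"
        return 1
    fi
    echo "ok: $proxy"
}

# Example call, run as the user whose transfer failed:
#   check_proxy_file
```

This only inspects the file on disk; it does not verify that the proxy is still within its lifetime.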
Permission denied
This failure is seen a lot at QMUL, so it may be explained by their use of poolfs (a home-brew filesystem for round-robin access to servers), on top of which DPM runs.
+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-se01.esc.qmul.ac.uk-1185674118 -d se01.esc.qmul.ac.uk
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
41472 bytes 29.09 KB/sec avg 29.09 KB/sec inst
Permission denied
Permission denied
lcg_cr: Permission denied
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-se01.esc.qmul.ac.uk-1185674118
Using SURL : srm://se01.esc.qmul.ac.uk/dpm/esc.qmul.ac.uk/home/ops/generated/2007-07-29/file217c8e1a-8152-417c-a6f2-56a6add150e8
Source URL: file:/home/samops/.same/SRM/testFile.txt
File size: 41472
VO name: ops
Destination specified: se01.esc.qmul.ac.uk
Destination URL for copy: gsiftp://se01.esc.qmul.ac.uk/se01.esc.qmul.ac.uk:/pool/data/dpm/ops/2007-07-29/file217c8e1a-8152-417c-a6f2-56a6add150e8.610503.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/ops/SAM/SRM-put-se01.esc.qmul.ac.uk-1185674118
Transfer took 2110 ms
Setting SRM transfer to 'done' failed: Unregistering alias from catalog.
+ result=1
No such file or directory
Not sure what causes this, although it is fairly common. Presumably it's a server-side problem?
+ lcg-cp -v --vo ops lfn:SE-lcg-cr-heplnx204.pp.rl.ac.uk-1185813804 file:/home/samops/.same/SE/nodes/heplnx204.pp.rl.ac.uk/testFile.txt
lcg_cp: No such file or directory
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
VO name: ops
+ result=1
Timeout when executing test ??? after 600 seconds!
Anyone know what is timing out?
+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-heplnx204.pp.rl.ac.uk-1185947904 -d heplnx204.pp.rl.ac.uk
Timeout when executing test SRM-put after 600 seconds!
DB fetch error / Communication error on send
Is this a problem with the DPM DB, or the LFC? Or is it a BDII issue?
+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1186270777 -d serv02.hep.phy.cam.ac.uk
DB fetch error
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1186270777
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-08-05/file07c4b265-cffc-4f9e-846d-4d5ebda714ec
+ result=1
Can't get req uniqueid
Related to the one above.
+ lcg-cr -v --vo ops file:/home/samops/.same/SRM/testFile.txt -l lfn:SRM-put-serv02.hep.phy.cam.ac.uk-1186281814 -d serv02.hep.phy.cam.ac.uk
Can't get req uniqueid
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/SRM-put-serv02.hep.phy.cam.ac.uk-1186281814
Using SURL : srm://serv02.hep.phy.cam.ac.uk/dpm/hep.phy.cam.ac.uk/home/ops/generated/2007-08-05/file9c228227-8a48-47b0-bf20-8d1a26e15c4e
+ result=1
No valid credential found
+ lcg-del -v --vo ops -a lfn:SE-lcg-cr-epgse1.ph.bham.ac.uk-1187157563
send2nsd: NS002 - send error : No valid credential found
Bad credentials
lcg_del: Communication error on send
VO name: ops
Using GUID : 194c1754-3d97-464b-b8ea-d576966d1aaf
set timeout to 0 seconds
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/ops/generated/2007-08-15/file36dc7166-4316-400f-a15e-00d99b30260b is deleted
srm://epgse1.ph.bham.ac.uk/dpm/ph.bham.ac.uk/home/ops/generated/2007-08-15/file36dc7166-4316-400f-a15e-00d99b30260b is NOT unregistered
+ result=1
Error reading token data
This could be caused by expired certificates or out-of-date CRLs. Check the certificate dates with:
openssl x509 -in /etc/grid-security/dpmmgr/dpmcert.pem -noout -dates
To refresh the CRLs, run the command in /etc/cron.d/fetch-crl.
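The certificate check can also be scripted so that it gives a yes/no answer instead of dates to read by eye. The sketch below relies on the -checkend option of openssl x509, which exits non-zero when the certificate expires within the given number of seconds. The function name cert_needs_attention and the 7-day default window are assumptions made for illustration.

```shell
# Hypothetical helper: succeed (exit 0) when a certificate needs attention,
# i.e. it expires within the window, has already expired, or cannot be read.
cert_needs_attention() {
    cert="$1"
    window="${2:-604800}"    # seconds; 604800 = 7 days (arbitrary choice)
    # -checkend N exits 0 only if the certificate is still valid N seconds from now.
    if openssl x509 -in "$cert" -noout -checkend "$window" >/dev/null 2>&1; then
        return 1    # still valid beyond the window
    fi
    return 0
}

# Example call on the DPM head node:
#   if cert_needs_attention /etc/grid-security/dpmmgr/dpmcert.pem; then
#       echo "dpmcert.pem expires within 7 days, has expired, or is unreadable"
#   fi
```

An unreadable file is deliberately treated the same as an expiring certificate, since either one deserves a look.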
Another possibility is that the srmv1 service is in a bad state. Symptoms of this include an empty log on the SE since the problem started (see /var/log/srmv1/log). If this happens, make a core dump of the srmv1 process for later analysis by the DPM experts and then restart it.
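One way to do the dump and restart is sketched below. It assumes that gcore (shipped with gdb) and pidof are installed, and that the daemon is controlled by an init script with the same name as the process (srmv1); none of this is confirmed by the page, so adjust to the local setup. The function name dump_and_restart and the DRY_RUN switch are hypothetical.

```shell
# Hypothetical helper: dump core of a running daemon, then restart it.
# Set DRY_RUN=1 to print the commands instead of running them.
dump_and_restart() {
    svc="$1"
    outdir="${2:-/var/tmp}"
    # pidof -s prints a single PID and fails if no such process exists.
    pid=$(pidof -s "$svc" 2>/dev/null) || {
        echo "$svc is not running"
        return 1
    }
    run=""
    if [ -n "${DRY_RUN:-}" ]; then
        run="echo"
    fi
    # gcore appends the PID, so the dump lands in $outdir/$svc.core.<pid>.
    $run gcore -o "$outdir/$svc.core" "$pid" || return 1
    # Assumption: the init script is named after the daemon.
    $run service "$svc" restart
}

# Example call (as root on the SE):
#   DRY_RUN=1 dump_and_restart srmv1    # show what would be run
#   dump_and_restart srmv1              # actually dump and restart
```

Taking the dump before the restart matters, because the restart discards the state the DPM experts would want to examine.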
+ lcg-cr -v --vo ops -d se1.pp.rhul.ac.uk -l lfn:sft-lcg-rm-cr-node31.beowulf.cluster.071102071038 file:///tmp/WMS_node31_015939_https_3a_2f_2frb113.cern.ch_3a9000_2fB3JV17rWKETkD83YwuRugg/work/testjob/nodes/ce1.pp.rhul.ac.uk/sft-lcg-rm-cr.txt
httpg://se1.pp.rhul.ac.uk:8443/srm/managerv1: CGSI-gSOAP: Error reading token data: Connection reset by peer
lcg_cr: Communication error on send
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/SAM/sft-lcg-rm-cr-node31.beowulf.cluster.071102071038
Using SURL : srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/ops/generated/2007-11-02/file107cb9d1-d1fc-4d52-bf87-08726e1a5d02
+ result=1
Unknown Error
$ lcg-cr -v --vo ops file:/etc/group -d some-dpm.some-domain
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Using LFN : /grid/ops/generated/2007-10-01/file-714ebdda-131c-42dc-b87e-e78e80a36485
Using SURL : srm://some-dpm.some-domain/dpm/some-domain/home/ops/generated/2007-10-01/.....
httpg://some-dpm.some-domain:8443/srm/managerv1: Unknown error
lcg_cr: Communication error on send
Please check the GOC wiki article on this subject.