The aim of this section is to document problems encountered whilst running a DPM in production, so that hopefully the DPM community can learn from each other's problems.
Only globus-url-copy Works
This problem was encountered on Durham's DPM:
If the user is unknown to the DPM then lcg-cr will fail, with the cryptic error "transport endpoint not connected". Attempting to srmcp reveals an authentication failure: "SRMClientV1 : CGSI-gSOAP: Could not find mapping for: USERS_DN". Somehow the srm daemon is refusing to authenticate the user properly. It's actually quite a deep problem in the GSI chain, because the daemon doesn't even get as far as logging anything about the connection.
Drilling further down to globus-url-copy revealed a bizarre workaround: a globus-url-copy command will create the mapping from the user's DN to a pool account. After this is done everything starts to work properly.
We've checked to see if it's a problem with VOMS proxies and it isn't. It also doesn't seem to affect DPNS itself - even users who can't copy in files can do a dpns-mkdir and they get a new entry in Cns_userinfo fine.
Ensure that the directory /etc/grid-security/gridmapdir is group-owned by dpmmgr:
[root@gallows gridmapdir]# ls -ld /etc/grid-security/gridmapdir
drwxrwxr-x 2 root dpmmgr 71680 Feb 21 14:00 /etc/grid-security/gridmapdir
[root@gallows gridmapdir]#
The reason that globus-url-copy worked is that /opt/lcg/sbin/dpm.ftpd runs as root, whereas all the other DPM daemons run as dpmmgr.
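The ownership check above can be scripted. The following is a minimal sketch, not part of DPM or YAIM: the function name check_group_owner is hypothetical, and it simply reads the group field from the same ls -ld output shown in the listing.

```shell
# Report whether a directory is group-owned by the expected group.
# Usage: check_group_owner DIR GROUP   (hypothetical helper name)
check_group_owner() {
    dir=$1
    want=$2
    # The fourth field of `ls -ld` output is the group owner.
    have=$(ls -ld "$dir" | awk '{print $4}')
    if [ "$have" = "$want" ]; then
        echo "OK: $dir is group-owned by $want"
    else
        echo "MISMATCH: $dir is group-owned by $have, expected $want"
        return 1
    fi
}

# Assumed usage on the DPM head node, as root:
#   check_group_owner /etc/grid-security/gridmapdir dpmmgr \
#       || chgrp dpmmgr /etc/grid-security/gridmapdir
```

The listing above also shows group write permission on the directory (drwxrwxr-x), so it may be worth confirming the mode as well as the group after any change.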
James Casey suggested that this situation might also arise if YAIM's config_mkgridmap is run by hand, as this leaves the ownership of the directory as root:edginfo. Normally YAIM will then run config_DPM_mgr, which resets the permissions to root:dpmmgr, but running config_mkgridmap alone leaves things broken.
This is fixed in YAIM 3.0.1 - it's a definite gotcha!
The number of MySQL processes continues to increase, eventually preventing any new connections from being made. This causes DPM to break.
There may be multiple causes of this effect. However, one possible explanation is that the mysql_fix_privileges.sql script has not been executed. The /var/lib/mysql/hostname.err file suggested this was the case. Before running the script, you should kill off all of the existing MySQL processes.
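To watch whether the number of MySQL threads is growing, something like the sketch below may help. It assumes the tabular output format of mysqladmin processlist; the function name count_mysql_threads is hypothetical, and the credentials in the usage comment are placeholders for whatever the site uses.

```shell
# Count client threads in `mysqladmin processlist` output read from stdin.
# (hypothetical helper name)
count_mysql_threads() {
    # Data rows start with "| <numeric Id>"; table borders and the
    # header row do not, so they are not counted.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -c '^| *[0-9]' || true
}

# Assumed usage on the database host (user and options are placeholders):
#   mysqladmin -u root -p processlist | count_mysql_threads
```

Running this periodically and comparing the counts gives a rough indication of whether connections are accumulating.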
File transfers fail due to "could not get user mapping"
Transfers in and out fail. DPM and DPNS logs show messages saying that host is not trusted, for example:
dpns log: 12/22 14:02:41.611 5771,0 Cns_serv: [188.8.131.52] (se2.ppgrid1.rhul.ac.uk): Host is not trusted, identity provided was (ID,"dpmmgr")
dpm log: 12/22 13:52:16.184 6246,3 dpm_serv: [10.141.255.254] (newton.cm.cluster): Could not establish an authenticated connection: Csec_server_negociate_protocol: The client did not send an authentication negotiation request; _Csec_recv_token: Connection closed !
12/22 13:53:06.330 6246,3 dpm_serv: [184.108.40.206] (se2.ppgrid1.rhul.ac.uk): Host is not trusted, identity provided was (ID,"dpmmgr")
This is a complete service failure, not intermittent, and pretty much only these messages are being logged. Several GGUS tickets were opened. The dpm and dpns daemons are both running but failing.
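To see which hosts are being rejected, the "Host is not trusted" entries can be summarised per client host. This is a rough sketch that assumes the log line layout shown in the excerpts above; the function name untrusted_hosts is hypothetical, and the log paths in the usage comment are assumptions that may differ on your installation.

```shell
# Summarise "Host is not trusted" log entries by client host name.
# Usage: untrusted_hosts LOGFILE   (hypothetical helper name)
untrusted_hosts() {
    # Keep only the rejection messages, extract the host name from the
    # parentheses that precede the message, then count per host.
    grep 'Host is not trusted' "$1" \
        | sed 's/.*(\([^)]*\)): Host is not trusted.*/\1/' \
        | sort | uniq -c
}

# Assumed usage (log locations are assumptions):
#   untrusted_hosts /var/log/dpns/log
#   untrusted_hosts /var/log/dpm/log
```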
Using dpm-1.8.10-1.el5.x86_64 on SL5.11, DB on separate node.
Restarting dpns and dpm seems to fix the problem, i.e.
service dpnsdaemon restart
service dpm restart
The system may have been in a strange state but this happened following a reboot so one would hope it was ok.
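The restart sequence can be wrapped in a small helper that prints the commands by default and only runs them when asked. This is an illustrative sketch: the function name restart_dpm_services and the --apply flag are hypothetical, and the service names and their order are taken from the commands given above.

```shell
# Restart dpns first, then dpm, stopping at the first failure.
# Without arguments the commands are only printed (dry run);
# pass --apply to execute them. (hypothetical helper and flag)
restart_dpm_services() {
    runner="echo"
    if [ "${1:-}" = "--apply" ]; then
        runner=""
    fi
    for svc in dpnsdaemon dpm; do
        $runner service "$svc" restart || return 1
    done
}

# Assumed usage, as root on the head node:
#   restart_dpm_services           # show what would be run
#   restart_dpm_services --apply   # actually restart the daemons
```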
If anyone can explain this please enter the reason here.