DPM Gotchas

From GridPP Wiki
Revision as of 13:56, 6 August 2007 by Greig cowan (Talk | contribs)


The aim of this section is to document problems encountered whilst running a DPM in production, so that the DPM community can learn from each other's problems.

Only globus-url-copy Works

Problem

This problem was encountered on Durham's DPM.

If the user is unknown to the DPM then lcg-cr will fail, with the cryptic error "transport endpoint not connected". Attempting to srmcp reveals an authentication failure: "SRMClientV1 : CGSI-gSOAP: Could not find mapping for: USERS_DN". Somehow the srm daemon is refusing to authenticate the user properly. It's actually quite a deep problem in the GSI chain, because the daemon doesn't even get as far as logging anything about the connection.

Drilling further down, to globus-url-copy, revealed a bizarre workaround: a globus-url-copy command will create the mapping from the user's DN to a pool account. Once this is done, everything starts to work properly.

We've checked to see if it's a problem with VOMS proxies and it isn't. It also doesn't seem to affect DPNS itself - even users who can't copy in files can do a dpns-mkdir and they get a new entry in Cns_userinfo fine.

Solution

Ensure that the directory /etc/grid-security/gridmapdir is group-owned by dpmmgr and group-writable:


 [root@gallows gridmapdir]# ls -ld /etc/grid-security/gridmapdir
 drwxrwxr-x    2 root     dpmmgr      71680 Feb 21 14:00 /etc/grid-security/gridmapdir
 [root@gallows gridmapdir]#
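If the listing on your system shows a different group or no group write bit, something like the following sketch should restore the state shown above. The function name fix_gridmapdir is made up for illustration; the default directory and group are taken from the listing, and mode 775 corresponds to the drwxrwxr-x permissions shown there. It needs to be run as root on a real DPM head node.

```shell
#!/bin/sh
# Hypothetical helper (not part of DPM or YAIM): make a gridmapdir
# group-owned by the DPM service account and group-writable.
# Defaults follow the listing above; both can be overridden.
fix_gridmapdir() {
    dir="${1:-/etc/grid-security/gridmapdir}"
    group="${2:-dpmmgr}"
    # Change only the group; the owner (root in the listing) is left alone.
    chgrp "$group" "$dir" || return 1
    # 775 = drwxrwxr-x, matching the permissions in the listing.
    chmod 775 "$dir"
}
```

Calling fix_gridmapdir with no arguments applies the defaults; re-run the ls -ld command above afterwards to confirm the result.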


Reason

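The reason that globus-url-copy worked is that /opt/lcg/sbin/dpm.ftpd runs as root, whereas all the other DPM daemons run as dpmmgr. Root can always write the new mapping into gridmapdir; the dpmmgr daemons can only do so if the directory's group ownership and permissions allow it.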

James Casey suggested that this situation might also arise if YAIM's config_mkgridmap is run by hand, as this leaves the ownership of the directory as root:edginfo. Normally YAIM will then run config_DPM_mgr, which resets the permissions to root:dpmmgr, but running config_mkgridmap alone leaves things broken.

This is fixed in YAIM 3.0.1, but it's a definite gotcha!


MySQL database

Problem

The number of MySQL processes continues to increase, eventually preventing any new connections from being made. This causes DPM to break.

Solution

There may be multiple causes of this effect. However, one possible explanation is that the mysql_fix_privileges.sql script has not been executed. In our case, the /var/lib/mysql/hostname.err file suggested that this was so. Before running the script, you should kill off all of the existing MySQL processes.
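As a rough sketch of the clean-up step, assuming that "processes" here means the client connections reported by mysqladmin processlist (if it means operating-system processes, restarting the MySQL service is the alternative), the helper below pulls the connection IDs out of that command's table output. The function name processlist_ids is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `mysqladmin processlist` table output on
# stdin and print one connection Id per line. Border lines and the
# header row are skipped because their second field is not numeric.
processlist_ids() {
    awk -F'|' 'NF > 2 && $2 ~ /^ *[0-9]+ *$/ { gsub(/ /, "", $2); print $2 }'
}

# Possible usage (the credentials and the script path are assumptions;
# the location of mysql_fix_privileges.sql varies between installations):
#   mysqladmin -u root -p processlist | processlist_ids
#   mysqladmin -u root -p kill <id>[,<id>...]
#   mysql -u root -p mysql < /path/to/mysql_fix_privileges.sql
```

The IDs printed can be passed, comma-separated, to mysqladmin kill before the privileges script is run.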