DPM SRMv2.2 Testing

From GridPP Wiki

This page keeps track of the issues that arose during the testing of GridPP's DPM 1.6.4 SRMv2.2 endpoints. This testing forms part of a wider programme of testing SRMv2.2 endpoints that is being coordinated by the WLCG deployment group at CERN. All flavours of SRM endpoint (dCache, CASTOR, DPM, StoRM, BeStMan) are having their basic functionality tested via the s2 client and lcg_utils. Stress testing will also be done once the basic functionality and interoperability tests have been completed.

GridPP endpoints

GridPP has two DPM endpoints involved in the tests.

  • httpg://svr018.gla.scotgrid.ac.uk:8446/srm/managerv2 - this is the Glasgow production machine, running on high-end hardware.
  • httpg://wn4.epcc.ed.ac.uk:8443/srm/managerv2 - this is a single test machine, running all services and 2 pools.

Ping

OK for Glasgow.

Note that wn4 initially had the SRMv2.2 daemon running on port 8446, which is the default, but firewall restrictions at Edinburgh meant this had to be changed to 8443. To do this, add the following line to /etc/shift.conf and restart the service:

SRMV2_2 PORT 8443

Ping now OK.

StatusOfPutRequest

Initially this failed for Flavia at both Glasgow and Edinburgh because the central LFC was not writable for VOMS proxies with "Role=lcgadmin". Once this was fixed, the ACLs on the dteam area of the namespace also had to be updated:

[root@wn4 dpm]# dpns-entergrpmap --group "dteam/Role=production"
[root@wn4 dpm]# dpns-entergrpmap --group "dteam/Role=lcgadmin"
nsentergrpmap -1: Group exists already
[root@wn4 dpm]# dpns-setacl -m "g:dteam/Role=lcgadmin:rwx,m:rwx" /dpm/epcc.ed.ac.uk/home/dteam
[root@wn4 dpm]# dpns-setacl -m "g:dteam/Role=production:rwx,m:rwx" /dpm/epcc.ed.ac.uk/home/dteam
[root@wn4 dpm]# dpns-setacl -m "d:g:dteam/Role=production:rwx,d:m:rwx" /dpm/epcc.ed.ac.uk/home/dteam
[root@wn4 dpm]# dpns-setacl -m "d:g:dteam/Role=lcgadmin:rwx,d:m:rwx" /dpm/epcc.ed.ac.uk/home/dteam
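
The same three steps repeat for every role that needs write access: register the group, grant it an access ACL, then grant the matching default (d:) ACL. The following is a minimal Python sketch that only builds those command lines and does not run them. The helper name acl_commands is hypothetical, and the command forms are copied from the session above rather than from the DPM documentation.

```python
def acl_commands(vo, roles, path):
    """Build the dpns-* command lines shown above for each VOMS role.

    Only strings are produced; nothing is executed. The command forms
    mirror the shell session in this section.
    """
    commands = []
    for role in roles:
        group = f"{vo}/Role={role}"
        # Register the VOMS group in the DPNS group map.
        commands.append(f'dpns-entergrpmap --group "{group}"')
        # Access ACL on the directory itself, plus the mask entry.
        commands.append(f'dpns-setacl -m "g:{group}:rwx,m:rwx" {path}')
        # Default ACL (d: entries), plus the default mask entry.
        commands.append(f'dpns-setacl -m "d:g:{group}:rwx,d:m:rwx" {path}')
    return commands


if __name__ == "__main__":
    for line in acl_commands("dteam", ["production", "lcgadmin"],
                             "/dpm/epcc.ed.ac.uk/home/dteam"):
        print(line)
```

Printing the commands first makes it easy to review them before pasting them into a root shell on the DPM head node.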

ReserveSpace

The basic functionality testing requires ReserveSpace to be called. These reserved spaces were not being released, so the available space in the Glasgow and Edinburgh DPMs has been gradually decreasing. This was spotted by Graeme and is clearly illustrated in the plot below. There must be a bug somewhere, since the s2 test suite includes a ReleaseSpace test.

File:Glasgow usage store week.png

The reserve space information is contained in the dpm_space_reserv table of the dpm_db database.

mysql> select * from dpm_space_reserv where s_token="3210e804-1e71-41ff-9e2b-f8f3c1530dbc";

The problem was that the DPM does not yet have an automatic space garbage collector. Flavia's tests reserve "volatile" space with a lifetime of 3 minutes, after which the system is supposed to reclaim the space. She has now modified the tests so that they always explicitly release unused space. A test has also been added to the usecase family to check that the space collector works properly for all implementations (except CASTOR, which does not support this feature at the moment).
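
To make the expected behaviour concrete, here is a minimal Python sketch of the check a space garbage collector would need to perform. It is not DPM code: the Reservation record and its field names are hypothetical and do not reflect the real columns of dpm_space_reserv.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    # Hypothetical record layout, for illustration only.
    token: str       # space token
    created: float   # creation time, seconds since the epoch
    lifetime: float  # requested lifetime in seconds (inf = never expires)


def expired_tokens(reservations, now):
    """Return the tokens whose lifetime has elapsed at time `now`."""
    return [r.token for r in reservations if now >= r.created + r.lifetime]


if __name__ == "__main__":
    spaces = [
        Reservation("volatile-test", created=0.0, lifetime=180.0),
        Reservation("permanent", created=0.0, lifetime=float("inf")),
    ]
    # Only the 3-minute volatile reservation is due for collection.
    print(expired_tokens(spaces, now=200.0))
```

Without such a collector, a reservation like the volatile one above stays allocated until a client calls ReleaseSpace explicitly, which is what the modified tests now do.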

More DN fun

If a user uses grid-proxy-init, DPM gets the VO name from /opt/lcg/etc/lcgdm-mapfile. If a user uses voms-proxy-init, DPM gets the VO name from the VOMS proxy.

/opt/lcg/etc/lcgdm-mapfile is kept up-to-date via a cron job. Local users and mappings can be added via /opt/lcg/etc/lcgdm-mapfile-local.
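
The decision described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DPM's implementation: it assumes the mapfile uses grid-mapfile style lines of the form "<DN>" <vo>, and the function names parse_mapfile and resolve_vo are hypothetical.

```python
import shlex


def parse_mapfile(text):
    """Parse lines of the assumed form: "<DN>" <vo>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # handles the quoted DN
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def resolve_vo(dn, voms_fqans, mapping):
    """Pick the VO name following the rule described in the text.

    voms_fqans is the list of VOMS attributes in the proxy; it is
    empty for a plain grid-proxy-init proxy.
    """
    if voms_fqans:
        # An FQAN looks like /dteam/Role=lcgadmin; the VO is its
        # first path component.
        return voms_fqans[0].lstrip("/").split("/")[0]
    # No VOMS attributes: fall back to the mapfile lookup by DN.
    return mapping.get(dn)


if __name__ == "__main__":
    mapfile = parse_mapfile('"/C=UK/O=eScience/CN=some user" dteam\n')
    print(resolve_vo("/C=UK/O=eScience/CN=some user", [], mapfile))
```

A DN that is missing from the mapfile and carries no VOMS attributes resolves to None here, which corresponds to a user the DPM cannot place in any VO.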

BDII problems on SL4

DPM 1.6.4 uses a BDII as the local information provider, replacing globus-mds. This is fine until you install the DPM on SL4 (32-bit, since 64-bit is still broken), in which case the information provider stops working.

# ldapsearch -LLL -x -H ldap://wn4.epcc.ed.ac.uk:2170 -b mds-vo-name=resource,o=grid
Invalid DN syntax (34)
Additional information: invalid DN
# cat /opt/bdii/etc/bdii-update.conf
GIP file:///opt/lcg/libexec/lcg-info-wrapper

I can run lcg-info-wrapper as edguser, and the correct output is generated. It is clear that this script is being executed, since the /opt/bdii/var/cache/ directories all contain identical GIP.ldif files. Looking at the stderr log:

# cat tmp/stderr.log
bdb_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December 3, 2003)
bdb_initialize: Sleepycat Software: Berkeley DB 4.2.52: (December 3, 2003)
/opt/bdii//var/2172/bdii-slapd.conf: line 9: schema checking disabled! your mileage may vary!
/opt/bdii//var/2172/bdii-slapd.conf: line 28: unknown directive "defaultaccess" inside backend database definition (ignored)
ldbm_cache_open (blksize 8192) (maxids 2046) (maxindirect 5)
str2entry: entry -1 has invalid DN "GlueServiceUniqueID=httpg://wn4.epcc.ed.ac.uk:8443/srm/managerv1,mds-vo-name=resource,o=grid"
slapadd: could not parse entry (line=29)
str2entry: entry -1 has invalid DN "GlueServiceUniqueID=httpg://wn4.epcc.ed.ac.uk:8446/srm/managerv2,mds-vo-name=resource,o=grid"
slapadd: could not parse entry (line=58)  

It appears that the version of OpenLDAP (2.2) that comes with SL4 uses stricter schema checking than version 2.0, which comes with SL3. There was actually a Savannah bug about this from over a year ago (March 2006)! The solution is to add

attributetype ( 1.3.6.1.4.1.3536.2.6.1.4.0.1
NAME 'Mds-Vo-name'
DESC 'Locally unique VO name'
EQUALITY caseIgnoreMatch
ORDERING caseIgnoreOrderingMatch
SUBSTR caseIgnoreSubstringsMatch
SYNTAX 1.3.6.1.4.1.1466.115.121.1.44
SINGLE-VALUE
)

to /opt/glue/schema/ldap/Glue-CORE.schema and then restart ldap and bdii.

Changing MySQL passwords

Just a note to say that if you change the password that the DPM user (default: dpmmgr) uses to access MySQL (found in /opt/lcg/etc/DPMCONFIG), then you must restart the srmv2.2 daemon.

dpm-updatespace

I set up the DISK space token description on the Edinburgh DPM. This is for /atlas/Role=production:

# dpm-reservespace --token_desc DISK --gspace 70G --ac_latency O --ret_policy R --s_type P --poolname pool1 --lifetime Inf --group /atlas/Role=production
027fc452-937e-4b86-98ed-ab9af57b5cdd

I noticed that there is a problem when using dpm-updatespace with the token description:

# dpm-updatespace --token_desc DISK --gspace 5G
dpm_getspacetoken: Unknown user space token description
dpm-updatespace: Invalid argument

I can update the space if I use the space token itself:

# dpm-updatespace --space_token 027fc452-937e-4b86-98ed-ab9af57b5cdd --gspace 5G 

This inconsistency has been submitted as Savannah bug 27627.

Problem with srmLs

The Edinburgh, Glasgow and LAL DPMs all suffered an srmv2.2 server crash after the person running the FTS2 tests tried to srmLs the contents of the home/dteam directory in each of their namespaces. As the S2 tests have been writing files directly into the dteam area (and not into a sub-directory), there are currently ~25,000 files, which must have overwhelmed the server. It appears this bug was introduced when support for recursive listing was added to the DPM.

Testing with GFAL

GFAL does not appear to have an internal timeout for SRMv2.2 queries.

lcg_utils

Lana has been running some tests against wn4.epcc.ed.ac.uk with the latest lcg-* client tools.

lcg-ls

Due to the S2 tests, the home/dteam area in each of the DPMs contains ~25,000 files. When someone attempts to list this part of the namespace, the srmv2.2 daemon breaks and the process stops. The server does not appear to 'crash', as no core dump is produced in /home/dpmmgr when ALLOW_COREDUMPS="yes" is set in /etc/sysconfig/srmv2.2.

lcg-ls does have a -t option that lets the user set a timeout, but this is not ideal, as users will not know a priori whether a timeout should be set or what a reasonable value would be. I had also thought that there was a client-side (or maybe server-side?) setting for limiting the number of files returned in a directory listing, to prevent precisely this problem.
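
Where a client has no usable internal timeout, as noted for GFAL above, a caller can at least bound the wait from the outside. The sketch below uses only the Python standard library; the helper name run_with_timeout is hypothetical, and the sleeping child process is a stand-in for a slow listing, not a real lcg-ls invocation.

```python
import subprocess
import sys


def run_with_timeout(argv, seconds):
    """Run a command; return its stdout, or None if it exceeds `seconds`."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True,
                                timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising.
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a listing that never returns: a child that sleeps.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, 0.5))
```

This does not answer the question of what a reasonable timeout value is; it only keeps a hung listing from blocking the calling script indefinitely.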

lcg-cp

lcg-cp --vo lhcb -v -D srmv2 -S lhcb-srm2test file:/etc/group srm://wn4.epcc.ed.ac.uk/dpm/epcc.ed.ac.uk/home/lhcb/greig_test_dir/`date +%s`
httpg://wn4.epcc.ed.ac.uk:8446/srm/managerv2: dpm_getspacetoken: Unknown user space token description
lcg_cp: Communication error on send

But this space token does exist. Looking at the dpm_space_reserv table:

mysql> select u_token from dpm_space_reserv;
+---------------+
| u_token       |
+---------------+
| DISK          |
| lhcb-srm2test |
|               |
+---------------+
3 rows in set (0.00 sec)

If no space token is supplied, the lcg-cp works.

Solution

I think I now understand the cause of this confusion. The DPM does not appear to be parsing the VOMS attributes contained in the user proxy correctly. Even though Andrew belongs to the /lhcb/lcgprod sub-group, when he attempts to copy a file into the DPM he is always mapped to the lhcb/lcgprod sub-group (note the lack of a leading "/").

Therefore, if sites reserve spaces for /lhcb/lcgprod, no one will be able to use them.

mysql> select * from Cns_groupinfo where groupname like '%lhcb%';
+-------+------+---------------------+
| rowid | gid  | groupname           |
+-------+------+---------------------+
|     5 |  106 | lhcb                |
|    16 |  117 | /lhcb/lcgprod       |
|    18 |  119 | lhcb/administrators |
|    19 |  120 | lhcb/pilotphysics   |
|    20 |  121 | lhcb/support        |
|    21 |  122 | lhcb/lcgprod        |
+-------+------+---------------------+
6 rows in set (0.01 sec)

Users who belong to the /lhcb/lcgprod sub-group according to VOMS are mapped to lhcb/lcgprod internally by the DPM.
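
The effect of the missing leading "/" can be shown with a small Python sketch. It assumes that the DPM compares group names as exact strings, which is consistent with /lhcb/lcgprod and lhcb/lcgprod appearing as separate rows (gid 117 and gid 122) in Cns_groupinfo. The function groups_match is hypothetical and is not DPM code.

```python
def groups_match(reserved_group, mapped_group, ignore_leading_slash=False):
    """Compare the group a space is reserved for with the user's mapped group."""
    if ignore_leading_slash:
        # Treat "/lhcb/lcgprod" and "lhcb/lcgprod" as the same group.
        reserved_group = reserved_group.lstrip("/")
        mapped_group = mapped_group.lstrip("/")
    return reserved_group == mapped_group


if __name__ == "__main__":
    reserved_for = "/lhcb/lcgprod"  # group named in the reservation
    mapped_to = "lhcb/lcgprod"      # group the user is actually mapped to
    print(groups_match(reserved_for, mapped_to))
    print(groups_match(reserved_for, mapped_to, ignore_leading_slash=True))
```

Under the exact-match assumption the first comparison fails, so until the parsing is fixed a site would have to reserve the space for lhcb/lcgprod (without the leading "/") for its users to be able to write into it.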

lcg-cr

Works without a space token:

lcg-cr --vo lhcb -v -D srmv2 file:/etc/group -d wn4.epcc.ed.ac.uk
Using grid catalog type: lfc
Using grid catalog : lfc-lhcb.cern.ch
Using LFN : /grid/lhcb/generated/2007-08-14/file-2c216368-89c5-4c24-8c93-dc5804676e42
Using SURL : srm://wn4.epcc.ed.ac.uk/dpm/epcc.ed.ac.uk/home/lhcb/generated/2007-08-14/file209d655e-16c6-4f8d-a073-05af284edadd
Source URL: file:/etc/group
File size: 2581
VO name: lhcb
Destination specified: wn4.epcc.ed.ac.uk
Destination URL for copy: gsiftp://wn4.epcc.ed.ac.uk/wn4.epcc.ed.ac.uk:/part2/lhcb/2007-08-14/file209d655e-16c6-4f8d-a073-05af284edadd.53017.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/lhcb/generated/2007-08-14/file-2c216368-89c5-4c24-8c93-dc5804676e42
         2581 bytes      2.22 KB/sec avg      2.22 KB/sec inst
Transfer took 2050 ms
Destination URL registered in Catalog: srm://wn4.epcc.ed.ac.uk/dpm/epcc.ed.ac.uk/home/lhcb/generated/2007-08-14/file209d655e-16c6-4f8d-a073-05af284edadd
guid:75a6aedf-277a-4846-8632-fbcfa3f1abff

But with the same space token as above:

[lxplus098] ~/gfal > lcg-cr --vo lhcb -v -D srmv2 --st lhcb-srm2test file:/etc/group -d wn4.epcc.ed.ac.uk
Using grid catalog type: lfc
Using grid catalog : lfc-lhcb.cern.ch
Using LFN : /grid/lhcb/generated/2007-08-14/file-fa0482c4-e98c-4ffd-b72d-e09f65ec6383
Using SURL : srm://wn4.epcc.ed.ac.uk/dpm/epcc.ed.ac.uk/home/lhcb/generated/2007-08-14/filef8454de6-3d48-44d6-8da1-9d5fbb556a22
httpg://wn4.epcc.ed.ac.uk:8446/srm/managerv2: dpm_getspacetoken: Unknown user space token description
lcg_cr: Communication error on send

Solution

See solution for lcg-cp.