DPM to DCache Migration

General Thoughts

The Good

dCache seems to be more performant, to handle high load better, and to have more advanced features. This is a nice break from our production DPM, which has long-standing issues stemming from core technologies for which we applied mitigations rather than getting fixes.

The permissions model is pleasant to use when it isn't segfaulting due to a badly configured plugin.

The migration was mostly sensible and straightforward.

The Bad

We had a disk node drop out of our storage pool with a message (from dCache) telling us to restart the dCache services on the node. We didn't have any useful way of monitoring for this, and it was only spotted because of other work being done.

We struggled to enable debug logging, which might have been useful for some setup issues.

We struggled to get effective logging beyond seeing that external read/write requests are coming in through the various doors.

A large amount of service logging still ends up in journalctl, which makes managing centralised logging fun.

ACLs are a little strange coming from DPM, and we would need to invest some time to understand how they interact with NFS.

systemd hides the fact that dCache has internally got itself into a bad state and can't process requests. External monitoring is needed, and there is little documentation on this.

Intro

Main article on CERN public Twiki

https://twiki.cern.ch/twiki/bin/view/DPM/DpmDCache

The main set of instructions for the migration are here:

https://twiki.cern.ch/twiki/bin/view/DPM/DpmDCache#Migration_steps_quick_overview


Advice from Edinburgh

Pre-Migration

READ THE INSTRUCTIONS FULLY AND CAREFULLY. Go back, do it again, then think about starting.

During Migration

I recommend completely stopping DPM-related tasks from running. (This really should be an official recommendation: at ECDF we have several systems watching for service failures, and they kicked in once our storage had been offline due to a migration bug.)

dpmheadnode$ systemctl mask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir 
dpmdisknodes$ systemctl mask httpd rfiod dpm-gsiftp xrootd@dpmdisk

This should be done immediately after disabling the services. It can be reverted with `systemctl unmask` if you need to roll back.
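
For reference, rolling back is simply the reverse, using the same service lists (a sketch, assuming nothing else has masked these units):

dpmheadnode$ systemctl unmask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmdisknodes$ systemctl unmask httpd rfiod dpm-gsiftp xrootd@dpmdisk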

Careful with Disk Nodes

You should NOT attempt to migrate the disk nodes until the head node has finished importing the database. The migration tool writes out and then APPENDS to the migration lists as it runs, because the PNFSID of each file is derived using a random runtime quantity rather than being a pre-generated value for each file.

This caught us at Edinburgh off guard, because the tool writes out its other config files instantly.

Space Tokens Strike Again

DCache is a LOT stricter with reserved spaces and quotas than DPM.

In DPM you could effectively over-commit storage without any problems. In dCache this will result in negative free space being reported and may cause further problems. I recommend you reduce these to sensible values, and if a VO needs some time to adjust as a result, do this a few days before the migration.

Edinburgh had experienced a bug where we had ~0.5% more data than was technically allowed in a space token.

dCache will NOT allow you to import more data than a reserved space is set up to allow.

You will need to manually make sure this is consistent before you begin.

Time to Migrate

The migration was a bit slower at ECDF than estimated for a 2M-entry DB on Optane storage with a Xeon Gold (possibly because the maximum single-thread speed is lower?).

We were able to do our migration in under 24 hours once we had debugged what didn't work for our setup.

Help. It's all on fire!

If you have to back off because something has gone wrong, you can undo everything that has been done right up until you remove the hard links on the storage nodes. Having run through the very well written instructions and used the tools, everything is reversible until you type the final `rm` command on your disk nodes.

This is a really nice way of managing the risk of the migration.


Some Further comments

Restarting dcache

This is most easily done via the dcache.target unit, which manages the services that depend on it.

systemctl restart dcache.target
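
To see which dCache domains the target actually manages, or to ask dCache itself what it thinks is running, something like the following works (the `dcache` wrapper script ships with dCache):

systemctl list-dependencies dcache.target
dcache status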

Firewall

This is based on testing with netstat and comparing against the documentation; after migration we have the following setup.

This assumes a hub-and-spoke style of setup centred on the dCache head node.

I'm assuming you're just running Postgres on the head node, with no dedicated external instance.

WAN access

The following ports should be globally accessible:

Headnode:

80 HTTP
443 HTTPS
1094 XRootD
2170 BDII
2811 GridFTP
3880 SRR
8446 (optional SRM)
20000-25000 GridFTP

Disknode:

1094 XrootD
2811 GridFTP
2880 Webdav
20000-25000 GridFTP
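
As a sketch, assuming firewalld is in use (adjust for your own firewall tooling), a disk node's WAN ports above can be opened with something like the following; the head node is the same idea with its own port list:

for p in 1094 2811 2880; do firewall-cmd --permanent --add-port=${p}/tcp; done
firewall-cmd --permanent --add-port=20000-25000/tcp
firewall-cmd --reload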

LAN access

Headnode:

2181 ZooKeeper
11111 (deprecated, but dCache is listening on it)

Disknode:

2181 ZooKeeper

Problems with gplazma

Out of the box the migration creates a gplazma config which caused hours of headaches at Edinburgh.

Comment out the line with a hash:

auth     optional     scitoken

And restart dcache.

Failed auth attempts should now be easier to debug, rather than just giving you Java stack traces because something is misconfigured.
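
For reference, the scitoken line lives in the gPlazma configuration, which by default is /etc/dcache/gplazma.conf; after the change the entry reads:

#auth     optional     scitoken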

Database dumps

You should automate dumps of all Postgres databases; this can be done with pg_dump.

This shouldn't be too difficult to automate with a cron job, but I would recommend 6-hourly backups to avoid potential data loss if something explodes or you enter the wrong command.
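
A minimal sketch of such a script, assuming the relevant databases are called `chimera` and `spacemanager` and that backups go to /var/backups/dcache (the database names and paths are assumptions to adjust for your site):

#!/bin/bash
# dump-dcache-dbs.sh -- illustrative sketch only; database names and paths are assumptions.
# Run as the postgres user so pg_dump can use local peer authentication.
set -euo pipefail
BACKUP_DIR=/var/backups/dcache        # must exist and be writable by the postgres user
STAMP=$(date +%Y%m%d-%H%M)
for db in chimera spacemanager; do
    # dump each database and keep a timestamped, compressed copy
    pg_dump "$db" | gzip > "${BACKUP_DIR}/${db}-${STAMP}.sql.gz"
done

Driven from cron with an entry such as `0 */6 * * * postgres /usr/local/sbin/dump-dcache-dbs.sh` in /etc/cron.d/dcache-db-dumps.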

RUCIO dumps

This is recommended for ATLAS/Rucio support at the site.

I recommend going the same route that was used with DPM. i.e.:

Set up gPlazma to allow a gridmap file and a storage-authzdb. Add the host certificate as an allowed entity which can access ATLAS data.

Use the script from here to generate the dumps: https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#dCache

Use the host-cert to copy the generated dump to the correct space with xrdcp.

Automate this using a bash script and crontab.
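
As an illustration of the copy step, assuming the host certificate in /etc/grid-security is the identity allowed through gPlazma as above; the dump filename and destination URL are purely examples, and the X509_* variables are the ones the XRootD GSI plugin normally honours:

#!/bin/bash
# Illustrative sketch only: push the generated dump into ATLAS space using the host certificate.
export X509_USER_CERT=/etc/grid-security/hostcert.pem
export X509_USER_KEY=/etc/grid-security/hostkey.pem
DUMP=/tmp/dump_$(date +%Y%m%d)
# destination path is an example, not a real site value
xrdcp -f "$DUMP" "root://dcache.example.com:1094//pnfs/example.com/data/atlas/atlasscratchdisk/dumps/$(basename "$DUMP")"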

SRR

You will need to setup the SRR for the storage with DCache.

In the simplest case this means enabling the reporting, fixing any problems, enabling external access to the SRR, and emailing ATLAS.

An alternative is to set up a site proxy that redirects queries aimed at the old SRR to the new output, but this is trickier.

Formatting of SRR

The SRR reports on the spaces which have been reserved within the link group.

However, out of the box the VOs for each space were `Null` for Edinburgh after the migration.

If this is the case and you want to fix it, you need to use the admin console.

Connect to the appropriate cell in DCache: `\c SrmSpaceManager`

Update the reservation with the correct owner: `update space -owner=CORRECT-VO SPACETOKEN`

There is no need to restart; after this the reporting should show the correct VO against the correct space token.
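
Put together, a sketch of the whole admin session; 22224 is the usual default admin SSH port, ATLASDATADISK is only an example token name, and `ls spaces` lists the existing reservations so you can find the token:

ssh -p 22224 admin@dcache.example.com
\c SrmSpaceManager
ls spaces
update space -owner=atlas ATLASDATADISK
\q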

Location of SRR

The SRR location is typically:

http://dcache.example.com:3880/api/v1/srr

with the simplest case being to open port 3880, since the Java server will simply return the correct errors when alternate paths are accessed.
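
A quick way to confirm the endpoint is serving (using the example hostname above):

curl -s http://dcache.example.com:3880/api/v1/srr | head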

You may prefer to put this behind a proxy to control access, which also gives you the ability to change the port number.

Meaningful Monitoring

We are currently investigating https://github.com/NDPF/dcache-exporter in combination with Prometheus/Grafana.


Why

The biggest problems after migration at ECDF have been pools dropping into states like the following:

--- pool (Main pool component) ---
Base directory    : /tank-4TB/gridstorage/dcache/atlas_002
Version           : 8.2.11(8.2.11) (Sub=4)
Report remove     : ON
Pool Mode         : disabled(fetch,store,stage,p2p-client,p2p-server,dead)
Detail            : [666] Pool restart required: Internal repository error
Hsm Load Suppr.   : OFF
Ping Heartbeat    : 30 seconds
Breakeven         : 0.7
LargeFileStore    : NONE
P2P File Mode     : CACHED
Mover Queue (regular) 0(500)/0

This is a problem because as far as systemd is concerned the services are still running even though they're in a bad state.

Addendum

It turns out that the Java library used to create a database on the "disk nodes" has problems running on ZFS. The issue appears to be related to the way in which the database software writes data to disk: the database expects the data to be flushed, whilst (as far as we know) there is no explicit call for the data to be flushed.

This has been fixed/mitigated by mounting the pool node metadata on a filesystem where flushes are forced on every write. This has some minor performance impact, but nothing severe, as the filesystem is different from the one managing the grid data. The result is that the services no longer collapse in a way that systemd doesn't reflect in the service state. (This has been running hands-off for ~2 months at this point.)
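
As a sketch of the mitigation, assuming the pool keeps its metadata in a meta/ subdirectory of the pool path shown earlier and that a small dedicated ext4 volume is used for it (the device, the subdirectory name and the choice of ext4 are assumptions):

# /etc/fstab -- illustrative entry; the sync option forces writes to be flushed immediately
/dev/sdb1  /tank-4TB/gridstorage/dcache/atlas_002/meta  ext4  defaults,sync  0 2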

How do I administer dCache?

TODO: I plan to add more detail here on how dCache works (for someone coming from DPM) and how to do some common tasks that would have been needed for DPM administration.

What are the important parts of DCache?

How do I find a file on disk to check it?

How do I go from a file on disk to a logical filename?

How do I check a file?

How do I change ACLs on an object?

Whitelisting a cert

Draining a Disk Node

Re-Balancing Storage Usage

Handling lost/damaged files/nodes

Load Balancing

How do I monitor for dropped nodes?

How do I setup token support?