DPM to DCache Migration
- 1 DPM to DCACHE migration
- 2 General Thoughts
- 3 Intro
- 4 Advice from Edinburgh
- 5 Some Further comments
- 6 Meaningful Monitoring
- 7 How do I administer DCache?
- 7.1 What are the important parts of DCache?
- 7.2 How do I find a file on disk to check it?
- 7.3 How do I go from a file on disk to a logical filename?
- 7.4 How do I check a file
- 7.5 How do I change ACL on an object
- 7.6 Whitelisting a cert
- 7.7 Draining a Disk Node
- 7.8 Re-Balancing Storage Usage
- 7.9 Handling lost/damaged files/nodes
- 7.10 Load Balancing
- 7.11 How do I monitor-for dropped nodes?
- 7.12 How do I setup token support?
DPM to DCACHE migration
General Thoughts
DCache seems to be more performant, handles high load better, and has more advanced features. This is a nice break from our production DPM, which had long-standing issues stemming from its core technologies that we mitigated rather than got fixed.
The permissions model is pleasant to use when it isn't segfaulting due to a badly configured plugin.
The migration was mostly sensible and straightforward.
We had a disk node drop from our storage pool with a message (from dcache) that we should restart the dcache services on the node. We didn't have any useful way of monitoring for this and it was only spotted due to other work being done.
We struggled to enable debug logging, which might have been useful for some setup issues.
We struggled to get effective logging beyond the external read/write requests coming in through the various doors.
A large amount of service logging still ends up in the journal (journalctl), which makes managing centralised logging fun.
ACLs are a little strange coming from DPM, and we would need to invest some time to understand how they interact with NFS.
systemctl does not reveal when DCache has internally got into a bad state and can't process requests. External monitoring is needed, and there is little documentation on this.
Intro
Main article on CERN public Twiki
The main set of instructions for the migration is here:
Advice from Edinburgh
READ THE INSTRUCTIONS FULLY AND CAREFULLY. Go back, do it again, then think about starting.
I recommend completely stopping DPM-related tasks from running. (This really should be a recommendation, as at ECDF we have several systems that watch for service failures, and these kicked in once our storage had been offline due to a migration bug.)
dpmheadnode$ systemctl mask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmdisknodes$ systemctl mask httpd rfiod dpm-gsiftp xrootd@dpmdisk
This should be done immediately after 'disabling' the services. If you need to roll back, it can be reverted with `systemctl unmask`.
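The mask and rollback steps could be wrapped in a small helper so that the same unit lists are used in both directions. This is only a sketch: the unit names are copied from the commands above, while the function name `dpm_systemctl_cmd` and the head/disk role labels are hypothetical. The helper only prints the command so that it can be reviewed before anything is run.

```shell
#!/bin/sh
# Unit names taken from the mask commands above; adjust to match your nodes.
HEAD_UNITS="httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir"
DISK_UNITS="httpd rfiod dpm-gsiftp xrootd@dpmdisk"

# dpm_systemctl_cmd ACTION ROLE   (hypothetical helper)
# Prints the systemctl command for ACTION (mask|unmask) on ROLE (head|disk).
dpm_systemctl_cmd() {
    action=$1
    role=$2
    case "$action" in
        mask|unmask) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    case "$role" in
        head) units=$HEAD_UNITS ;;
        disk) units=$DISK_UNITS ;;
        *) echo "unknown role: $role" >&2; return 1 ;;
    esac
    echo "systemctl $action $units"
}
```

For example, `dpm_systemctl_cmd unmask disk` prints the rollback command for a disk node, which can be checked and then piped to `sh`.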
Careful with Disk Nodes
You should NOT attempt to migrate the disk nodes until the head node has finished importing the database. The tool writes out the migration lists and then keeps APPENDING to them during the import, because the PNFSID of each file is derived from some random runtime quantity rather than being a pre-generated value for each file.
This caught us off guard at Edinburgh, because the tool writes out its other config files instantly.
Space Tokens Strike Again
DCache is a LOT stricter with reserved spaces and quotas than DPM.
In DPM you could effectively over-commit storage without any problems. In DCache this will result in negative free space being reported and may cause problems. I recommend you reduce the reservations to sensible values, and if a VO needs some time to adjust as a result, do this a few days before the migration.
Edinburgh hit a bug where we had ~0.5% more data than was technically allowed in a space token.
DCache will NOT allow you to import more data than a reserved space is set up to allow.
You will need to manually make sure this is consistent before you begin.
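A pre-migration consistency check could look something like the sketch below. It assumes you have already extracted the reserved and used sizes for each space token into a plain-text listing of `TOKEN RESERVED_BYTES USED_BYTES` lines; that input format and the function name `find_overfull_tokens` are hypothetical, not something the migration tools produce.

```shell
#!/bin/sh
# find_overfull_tokens   (hypothetical helper)
# Reads "TOKEN RESERVED_BYTES USED_BYTES" lines on stdin and prints every
# token whose used size exceeds its reservation. Lines starting with '#'
# and lines with fewer than three fields are ignored.
find_overfull_tokens() {
    awk 'NF >= 3 && $1 !~ /^#/ && ($3 + 0) > ($2 + 0) {
        printf "%s over by %.0f bytes\n", $1, $3 - $2
    }'
}
```

Any token it reports would need its reservation raised, or its data reduced, before the import is started.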
Time to Migrate
The migration took a bit longer at ECDF than estimated for a 2M-entry DB on Optane storage with a Xeon Gold (possibly because its maximum single-thread speed is lower?).
We were able to do our migration in <24hr once we had debugged what didn't work for our setup.
Help. It's all on fire!
If you have to back out because something goes wrong, you can undo everything that has been done up to the point where you remove the hard links on the storage nodes. Having run through the very well written instructions and used the tools, we found that you can undo everything until you type the final `rm` command on your disk nodes.
This is a really nice way of managing the migration.
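Before typing that final `rm`, it may be worth confirming how many files on a disk node still carry more than one name. The sketch below assumes, as the hard-link description above implies, that each migrated file is reachable under two names for the same data until the cleanup step; the function name `count_hardlinked_files` is hypothetical.

```shell
#!/bin/sh
# count_hardlinked_files DIR   (hypothetical helper)
# Counts regular files under DIR whose link count is greater than one,
# i.e. files that still have at least two names pointing at the same data.
count_hardlinked_files() {
    find "$1" -type f -links +1 | wc -l | tr -d ' '
}
```

Running it against the storage directory before and after the cleanup gives a rough indication of whether the rollback path is still open on that node.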
Some Further comments
Restarting DCache is most easily done via the systemd target, which manages the services that depend on it.
systemctl restart dcache.target
Based on testing with netstat and comparing against the documentation, we have the following port setup after migration.
This is based on a hub-and-spoke sort of setup, with the DCache head node as the hub.
I'm assuming you're just running postgres on your head node with no dedicated external instance.
The following ports should be globally accessible:
80 HTTP, 443 HTTPS, 1094 XRootD, 2170 BDII, 2811 GridFTP, 3880 SRR, 8446 (optional SRM), 20000-25000 GridFTP
1094 XRootD, 2811 GridFTP, 2880 WebDAV, 20000-25000 GridFTP
2181 ZooKeeper, 11111 (deprecated, but DCache is listening)
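If the head node uses firewalld (an assumption, as the firewall in use is not stated here), the globally accessible ports listed above could be opened with something like the following sketch. The function name `firewall_cmds` is hypothetical and all ports are assumed to be TCP. It only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Ports listed above as globally accessible; protocol assumed to be TCP.
GLOBAL_PORTS="80 443 1094 2170 2811 3880 8446 20000-25000"

# firewall_cmds   (hypothetical helper)
# Prints one firewall-cmd line per port (or port range), then a reload.
firewall_cmds() {
    for port in $GLOBAL_PORTS; do
        echo "firewall-cmd --permanent --add-port=${port}/tcp"
    done
    echo "firewall-cmd --reload"
}
```

Drop 8446 from the list if SRM is not enabled at your site.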
Problems with gplazma
Out of the box the migration creates a gplazma config which caused hours of headaches at Edinburgh.
Comment out the line with a hash:
auth optional scitoken
And restart dcache.
Failed auth attempts should now be easier to debug, rather than giving you Java stack traces because something is misconfigured.
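The edit itself can be scripted. The following is a minimal sketch: the function name `disable_scitoken` is hypothetical, and it works as a filter so that the result can be inspected before it replaces the live file.

```shell
#!/bin/sh
# disable_scitoken   (hypothetical helper)
# Reads a gplazma configuration on stdin and writes it to stdout with the
# "auth optional scitoken" line commented out. All other lines, including
# a line that is already commented out, pass through unchanged.
disable_scitoken() {
    sed -E 's/^[[:space:]]*auth[[:space:]]+optional[[:space:]]+scitoken([[:space:]].*)?$/#&/'
}
```

A possible use, assuming the configuration lives at `/etc/dcache/gplazma.conf` (check the path on your own installation), is `disable_scitoken < /etc/dcache/gplazma.conf > /tmp/gplazma.conf.new`, followed by a `diff` of the two files before copying the new one into place and restarting.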
You should automate dumps of all PostgreSQL databases; this can be done with `pg_dump`.
This shouldn't be too difficult to automate with a cron job. I would recommend backups every 6 hours to avoid potential data loss if something explodes, or you enter the wrong command.
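A backup script along these lines might be enough to start with. This is a sketch under assumptions: the backup directory, the script path in the crontab comment, and the function names are all hypothetical, and it is meant to run as the `postgres` user on the head node.

```shell
#!/bin/sh
BACKUP_DIR=${BACKUP_DIR:-/var/backups/dcache}   # hypothetical location

# backup_path DB TIMESTAMP   (hypothetical helper)
# Builds the dump file name for one database.
backup_path() {
    echo "${BACKUP_DIR}/${1}-${2}.dump"
}

# dump_all_databases   (hypothetical helper)
# Dumps every non-template database in pg_dump's custom format.
dump_all_databases() {
    stamp=$(date +%Y%m%dT%H%M)
    mkdir -p "$BACKUP_DIR" || return 1
    psql -At -c "SELECT datname FROM pg_database WHERE NOT datistemplate" |
    while read -r db; do
        pg_dump -Fc -f "$(backup_path "$db" "$stamp")" "$db" ||
            echo "dump of $db failed" >&2
    done
}

# Example crontab entry for the postgres user, every 6 hours
# (the script would end with a call to dump_all_databases):
# 0 */6 * * * /usr/local/sbin/dcache-pg-backup.sh
```

Old dumps are not pruned here, so some rotation would need to be added before leaving it unattended.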
Generating storage dumps is recommended for ATLAS/Rucio support at the site.
I recommend going the same route that was used with DPM. i.e.:
Set up GPlazma to allow gridmapfile and a storage authzdb. Add the host-cert as an allowed entity which can access ATLAS data.
Use the script to generate the dumps from here https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#dCache
Use the host-cert to copy the generated dump to the correct space with xrdcp.
Automate this using a bash script and crontab.
You will need to set up the SRR for the storage with DCache.
In the simplest case this means enabling the reporting, fixing any problems, enabling external access to the SRR, and emailing ATLAS.
An alternative is to set up a site proxy that redirects queries for the old SRR to the new output, but this is trickier.
Formatting of SRR
The SRR reports on the spaces which have been reserved within the link group.
However, out of the box the VOs for each space were `Null` at Edinburgh after the migration.
If this is the case and you want to fix it you need to use the admin console.
Connect to the appropriate cell in DCache: `\c SrmSpaceManager`
Update the reservation with the correct owner: `update space -owner=CORRECT-VO SPACETOKEN`
There is no need to restart; after this the reporting should show the correct VO against the correct space token.
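With several space tokens to fix, the update can be sent non-interactively instead of typing it into the console each time. The sketch below only builds the command line. It assumes the admin interface is reachable over ssh on port 22224 as user `admin`, and that the admin shell accepts `\s CELL COMMAND` to send a single command to a cell; check both against your DCache version. The host name and the function name `set_space_owner_cmd` are hypothetical.

```shell
#!/bin/sh
ADMIN_HOST=${ADMIN_HOST:-dcache-head.example.org}   # hypothetical host name
ADMIN_PORT=${ADMIN_PORT:-22224}                     # assumed admin ssh port

# set_space_owner_cmd VO SPACETOKEN   (hypothetical helper)
# Prints the ssh command that would send the owner update to the
# SrmSpaceManager cell. Nothing is executed.
set_space_owner_cmd() {
    printf "ssh -p %s admin@%s '%s SrmSpaceManager update space -owner=%s %s'\n" \
        "$ADMIN_PORT" "$ADMIN_HOST" '\s' "$1" "$2"
}
```

Printing rather than executing keeps a record of exactly what was changed for each token.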
Location of SRR
The SRR location is typically:
with the simplest case being to open port 3880, as the Java server will simply return the correct errors when someone tries to access alternate paths.
You may prefer to put this behind a proxy to control access, which also lets you change the port number.
Meaningful Monitoring
We are currently investigating https://github.com/NDPF/dcache-exporter in combination with Prometheus/Grafana.
Biggest problems after migration at ECDF have been:
--- pool (Main pool component) ---
Base directory : /tank-4TB/gridstorage/dcache/atlas_002
Version : 8.2.11(8.2.11) (Sub=4)
Report remove : ON
Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,dead)
Detail : Pool restart required: Internal repository error
Hsm Load Suppr. : OFF
Ping Heartbeat : 30 seconds
Breakeven : 0.7
LargeFileStore : NONE
P2P File Mode : CACHED
Mover Queue (regular) 0(500)/0
This is a problem because as far as systemd is concerned the services are still running even though they're in a bad state.
It turns out that the Java library used for creating a database on the "disk nodes" has problems running on ZFS. The issue appears to be related to the way the database software writes data to disk: the database expects the data to have been flushed, whilst (afaik) there is no explicit call to flush it.
This has been fixed/mitigated by mounting the pool node metadata on a filesystem where flushes are forced on every write. This has some minor performance impact, but nothing severe, as the filesystem is separate from the one managing the grid data. The result is that the services no longer collapse in a way that systemd doesn't reflect. (This has been running hands-off for ~2 months at this point.)
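Since systemd cannot see this failure mode, an external check has to look at the pool's own status. A minimal sketch is to test the `Pool Mode` line of the pool information shown above. The function name `pool_is_disabled` is hypothetical, and how the status text is fetched is left to the site.

```shell
#!/bin/sh
# pool_is_disabled   (hypothetical helper)
# Reads pool status text on stdin and exits 0 if the "Pool Mode" line
# reports the pool as disabled, non-zero otherwise.
pool_is_disabled() {
    grep -Eq 'Pool Mode[[:space:]]*:[[:space:]]*disabled'
}
```

One way to feed it, assuming the admin ssh interface on port 22224 and a pool named `atlas_002` (both assumptions), would be `ssh -p 22224 admin@HEADNODE '\s atlas_002 info' | pool_is_disabled && echo "CRITICAL: pool disabled"`, run from cron or a monitoring agent.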
How do I administer DCache?
TODO: I plan to add more data here on how DCache works (for someone coming from DPM) and how to do some common tasks that would be needed for DPM administration