Contents
- 1 DPM to DCACHE migration
- 2 Intro
- 3 Advice from Edinburgh
- 4 Some Further comments
- 5 How do I administer DCache?
- 5.1 What are the important parts of DCache?
- 5.2 How do I find a file on disk to check it?
- 5.3 How do I go from a file on disk to a logical filename?
- 5.4 How do I check a file
- 5.5 How do I change ACL on an object
- 5.6 Whitelisting a cert
- 5.7 Draining a Disk Node
- 5.8 Re-Balancing Storage Usage
- 5.9 Handling lost/damaged files/nodes
- 5.10 Load Balancing
- 5.11 How do I monitor for dropped nodes
DPM to DCACHE migration
Intro
Main article on CERN public Twiki
https://twiki.cern.ch/twiki/bin/view/DPM/DpmDCache
The main set of instructions for the migration are here:
https://twiki.cern.ch/twiki/bin/view/DPM/DpmDCache#Migration_steps_quick_overview
Advice from Edinburgh
Pre-Migration
READ THE INSTRUCTIONS FULLY AND CAREFULLY. Go back, do it again, then think about starting.
During Migration
I recommend completely stopping all DPM-related services from running. (This really should be an official recommendation: at ECDF we have several systems watching for service failures, and they kicked in once our storage had been offline due to a migration bug.)
dpmheadnode$ systemctl mask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmdisknodes$ systemctl mask httpd rfiod dpm-gsiftp xrootd@dpmdisk
This should be done immediately after disabling the services. The action can be reverted with a systemctl `unmask` if you need to roll back.
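For reference, a minimal sketch of the corresponding rollback, assuming the same service lists as above (the services will also need re-enabling if you disabled them before masking):
dpmheadnode$ systemctl unmask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmheadnode$ systemctl start httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmdisknodes$ systemctl unmask httpd rfiod dpm-gsiftp xrootd@dpmdisk
dpmdisknodes$ systemctl start httpd rfiod dpm-gsiftp xrootd@dpmdisk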
Careful with Disk Nodes
You should NOT attempt to migrate the disk nodes until the head node has finished importing the database. The migration tool writes out and then APPENDS to the migration lists as it runs, because the PNFSID of each file is derived with a random runtime component rather than being a pre-generated value for each file.
This caught us off guard at Edinburgh, because the tool produces its other config files instantly.
Space Tokens Strike Again
DCache is a LOT stricter with reserved spaces and quotas than DPM.
In DPM you could effectively over-commit storage without any problems. In DCache this results in negative free space being reported and can cause problems. I recommend reducing the reservations to sensible values, and if a VO needs some time to adjust as a result, doing this a few days before the migration.
Edinburgh hit a bug here because we had ~0.5% more data in a space token than was technically allowed.
DCache will NOT allow you to import more data than a reserved space is set up to allow.
You will need to manually make sure this is consistent before you begin.
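As a rough sketch of that pre-migration check, assuming the standard DPM client tools are installed on the head node (option names and output vary between DPM versions, so check the man pages before relying on this):
# List each space token with its reserved and used space.
dpm-listspaces
# For any over-committed token, either clean up data with the VO or adjust the
# reservation so the used space fits inside it before running the migration tools.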
Time to Migrate
The migration took a bit longer at ECDF than the estimate, for a 2M-entry DB on Optane storage with a Xeon Gold (possibly because the maximum single-thread speed is lower?).
We were able to do our migration in <24hr once we had debugged what didn't work for our setup.
Help. It's all on fire!
If something goes wrong and you have to back off, everything that has been done can be undone right up until you remove the hard links on the storage nodes. In other words, having followed the very well written instructions and used the tools, you can still roll everything back until you type the final `rm` command on your disk nodes.
This is a really nice way of managing the risk.
Some Further comments
Restarting dcache
This is most easily done via the systemd target, which manages the dCache services that depend on it.
systemctl restart dcache.target
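To confirm which dCache domains on a node hang off the target (and so what a restart will actually touch), standard systemd queries are enough:
systemctl list-dependencies dcache.target
systemctl status dcache.target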
Firewall
Based on testing with netstat and comparing against the documentation, the following is the setup we have after migration.
This is based on a hub-and-spoke sort of arrangement around the DCache head node.
I'm assuming you're just running postgres on your head node with no dedicated external instance.
WAN access
The following ports should be globally accessible:
Headnode:
80 HTTP
443 HTTPS
1094 XRootD
2170 BDII
2811 GridFTP
3880 SRR
8446 SRM (optional)
20000-25000 GridFTP
Disknode:
1094 XRootD
2811 GridFTP
2880 WebDAV
20000-25000 GridFTP
LAN access
Headnode:
2181 ZooKeeper
11111 (deprecated, but DCache is listening)
Disknode:
2181 ZooKeeper
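A firewalld sketch matching the WAN lists above, in case that is what your nodes use (zone handling and the LAN-only ZooKeeper rule are site-specific, so treat this as illustrative rather than prescriptive):
# Head node, WAN-facing ports
firewall-cmd --permanent --add-port=80/tcp --add-port=443/tcp --add-port=1094/tcp --add-port=2170/tcp
firewall-cmd --permanent --add-port=2811/tcp --add-port=3880/tcp --add-port=8446/tcp --add-port=20000-25000/tcp
firewall-cmd --reload
# Disk nodes, WAN-facing ports
firewall-cmd --permanent --add-port=1094/tcp --add-port=2811/tcp --add-port=2880/tcp --add-port=20000-25000/tcp
firewall-cmd --reload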
Problems with gplazma
Out of the box the migration creates a gplazma config which caused hours of headaches at Edinburgh.
Comment out the line with a hash:
auth optional scitoken
And restart dcache.
Failed auth attempts should now be easier to debug, rather than just giving you Java stack traces because something is misconfigured.
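A minimal sketch of that change, assuming the default config location of /etc/dcache/gplazma.conf (take a backup first and adjust the path if your layout differs):
cp /etc/dcache/gplazma.conf /etc/dcache/gplazma.conf.bak
# Prefix the scitoken auth line with a hash
sed -i 's/^auth[[:space:]]\+optional[[:space:]]\+scitoken/#&/' /etc/dcache/gplazma.conf
systemctl restart dcache.target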
Database dumps
You should automate dumps of all PostgreSQL databases; this can be done with pg_dump.
It shouldn't be too difficult to automate with a cron job, but I would recommend 6-hourly backups to limit potential data loss if something explodes or you enter the wrong command.
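A rough example of such a cron job; the file location, destination directory and retention policy below are illustrative only:
# /etc/cron.d/dcache-pg-backup -- dump all databases every 6 hours as the postgres user
0 */6 * * * postgres pg_dumpall | gzip > /var/backups/pgsql/all-$(date +\%Y\%m\%d-\%H\%M).sql.gz
# Prune old dumps, e.g. keep two weeks of them:
30 3 * * * root find /var/backups/pgsql -name '*.sql.gz' -mtime +14 -delete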
RUCIO dumps
This is recommended for ATLAS/Rucio support at the site.
I recommend going the same route that was used with DPM, i.e.:
Set up gPlazma to allow a gridmap file and a storage-authzdb. Add the host certificate as an allowed entity which can access ATLAS data.
Use the script to generate the dumps from here https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#dCache
Use the host-cert to copy the generated dump to the correct space with xrdcp.
Automate this using a bash script and crontab (a rough sketch follows below).
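A hedged skeleton of that wrapper; the dump-script name, local paths, door hostname and destination path are all placeholders (the real destination should follow the convention described in the twiki page above), and it assumes the xrootd client picks the host certificate up from the standard X509 environment variables:
#!/bin/bash
set -euo pipefail
export X509_USER_CERT=/etc/grid-security/hostcert.pem
export X509_USER_KEY=/etc/grid-security/hostkey.pem
DATE=$(date +%Y%m%d)
# Generate the dump with the script from the ATLAS twiki (name and path are hypothetical).
/usr/local/sbin/make-dcache-dump.sh "/tmp/dumps/dump_${DATE}"
# Copy it into the ATLAS area via the local door (endpoint and path are examples only).
xrdcp -f "/tmp/dumps/dump_${DATE}" "root://dcache-head.example.ac.uk//pnfs/example.ac.uk/data/atlas/atlasdatadisk/dumps/dump_${DATE}"
Once it works by hand, drop the script into cron, e.g. a daily entry in /etc/cron.d.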
SRR
You will need to setup the SRR for the storage with DCache.
In the simplest case this means enabling the reporting, fixing any problems, enabling external access to the SRR, and emailing ATLAS.
An alternative is to set up a site proxy that redirects queries for the old SRR to the new output, but this is trickier.
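A quick way to check the report is being generated and is reachable from outside (3880 is the frontend port from the firewall section above; the hostname is a placeholder and the exact path can differ between dCache versions, so confirm it against the dCache SRR documentation):
curl -sk https://dcache-head.example.ac.uk:3880/api/v1/srr | python3 -m json.tool | head -n 30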
Formatting of SRR
The SRR reports on the spaces which have been reserved within the link group.
However, out of the box the VOs for each space were `Null` at Edinburgh after the migration.
If this is the case and you want to fix it you need to use the admin console.
Connect to the appropriate cell in DCache: `\c SrmSpaceManager`
Update the reservation with the correct owner: `update space -owner=CORRECT-VO SPACETOKEN`
There is no need to restart; after this the reporting should show the correct VO against the correct space token.
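Put together, the whole fix from a shell looks roughly like this (22224 is the default dCache admin-door port, the hostname is a placeholder, and CORRECT-VO and SPACETOKEN are the same placeholders as above):
ssh -p 22224 admin@your-dcache-headnode
# inside the admin shell:
\c SrmSpaceManager
update space -owner=CORRECT-VO SPACETOKEN
\q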
How do I administer DCache?
TODO: I plan to add more detail here on how DCache works (for someone coming from DPM) and how to do some common tasks that were part of day-to-day DPM administration.