Contents
- 1 DPM to DCACHE migration
- 2 Intro
- 3 Advice from Edinburgh
- 4 Some Further comments
- 5 How do I administer DCache?
- 5.1 What are the important parts of DCache?
- 5.2 How do I find a file on disk to check it?
- 5.3 How do I go from a file on disk to a logical filename?
- 5.4 How do I check a file
- 5.5 How do I change ACL on an object
- 5.6 Whitelisting a cert
- 5.7 Draining a Disk Node
- 5.8 Re-Balancing Storage Usage
- 5.9 Handling lost/damaged files/nodes
- 5.10 Load Balancing
- 5.11 How do I monitor for dropped nodes
DPM to DCACHE migration
Intro
Main article on CERN public Twiki
https://twiki.cern.ch/twiki/bin/view/DPM/DpmDCache
The main set of instructions for the migration are here:
https://twiki.cern.ch/twiki/bin/view/DPM/DpmDCache#Migration_steps_quick_overview
Advice from Edinburgh
Pre-Migration
READ THE INSTRUCTIONS FULLY AND CAREFULLY. Go back, do it again, then think about starting.
During Migration
I recommend completely stopping all DPM-related services from running. (This really should be an official recommendation: at ECDF we have several systems watching for service failures, and they kicked in once our storage had been offline due to a migration bug.)
dpmheadnode$ systemctl mask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmdisknodes$ systemctl mask httpd rfiod dpm-gsiftp xrootd@dpmdisk
This should be done immediately after disabling the services. The action can be reverted with a systemctl `unmask` if you need to roll back.
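For reference, a minimal sketch of the corresponding rollback, assuming the same service lists as above (the services will also need re-enabling if you disabled them before masking):
dpmheadnode$ systemctl unmask httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmheadnode$ systemctl start httpd rfiod srmv2.2 dpnsdaemon dpm dpm-gsiftp xrootd@dpmredir
dpmdisknodes$ systemctl unmask httpd rfiod dpm-gsiftp xrootd@dpmdisk
dpmdisknodes$ systemctl start httpd rfiod dpm-gsiftp xrootd@dpmdisk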
Careful with Disk Nodes
You should NOT attempt to migrate the disk nodes until the head node has finished importing the database. The migration tool writes out and then APPENDS to the migration lists as it runs, because the PNFSID of each file is derived with a random runtime component rather than being a pre-generated value for each file.
This caught us off guard at Edinburgh, because the tool produces its other config files instantly.
Space Tokens Strike Again
DCache is a LOT stricter with reserved spaces and quotas than DPM.
In DPM you could effectively over-commit storage without any problems. In DCache this results in negative free space being reported and can cause problems. I recommend reducing the reservations to sensible values, and if a VO needs some time to adjust as a result, doing this a few days before the migration.
Edinburgh hit a bug here because we had ~0.5% more data in a space token than was technically allowed.
DCache will NOT allow you to import more data than a reserved space is set up to allow.
You will need to manually make sure this is consistent before you begin.
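As a rough sketch of that pre-migration check, assuming the standard DPM client tools are installed on the head node (option names and output vary between DPM versions, so check the man pages before relying on this):
# List each space token with its reserved and used space.
dpm-listspaces
# For any over-committed token, either clean up data with the VO or adjust the
# reservation so the used space fits inside it before running the migration tools.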
Time to Migrate
The migration took a bit longer at ECDF than the estimate, for a 2M-entry DB on Optane storage with a Xeon Gold (possibly because the maximum single-thread speed is lower?).
We were able to do our migration in <24hr once we had debugged what didn't work for our setup.
Help. It's all on fire!
If something goes wrong and you have to back off, everything that has been done can be undone right up until you remove the hard links on the storage nodes. In other words, having followed the very well written instructions and used the tools, you can still roll everything back until you type the final `rm` command on your disk nodes.
This is a really nice way of managing the risk.
Some Further comments
Restarting dcache
This is most easily done via the systemd target, which manages the dCache services that depend on it.
systemctl restart dcache.target
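To confirm which dCache domains on a node hang off the target (and so what a restart will actually touch), standard systemd queries are enough:
systemctl list-dependencies dcache.target
systemctl status dcache.target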
Firewall
Based on testing with netstat and comparing against the documentation, the following is the setup we have after migration.
This is based on a hub-and-spoke sort of arrangement around the DCache head node.
I'm assuming you're just running postgres on your head node with no dedicated external instance.
WAN access
The following ports should be globally accessible:
Headnode:
80 HTTP
443 HTTPS
1094 XRootD
2170 BDII
2811 GridFTP
3880 SRR
8446 SRM (optional)
20000-25000 GridFTP
Disknode:
1094 XRootD
2811 GridFTP
2880 WebDAV
20000-25000 GridFTP
LAN access
Headnode:
2181 ZooKeeper
11111 (deprecated, but DCache is listening)
Disknode:
2181 ZooKeeper
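A firewalld sketch matching the WAN lists above, in case that is what your nodes use (zone handling and the LAN-only ZooKeeper rule are site-specific, so treat this as illustrative rather than prescriptive):
# Head node, WAN-facing ports
firewall-cmd --permanent --add-port=80/tcp --add-port=443/tcp --add-port=1094/tcp --add-port=2170/tcp
firewall-cmd --permanent --add-port=2811/tcp --add-port=3880/tcp --add-port=8446/tcp --add-port=20000-25000/tcp
firewall-cmd --reload
# Disk nodes, WAN-facing ports
firewall-cmd --permanent --add-port=1094/tcp --add-port=2811/tcp --add-port=2880/tcp --add-port=20000-25000/tcp
firewall-cmd --reload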
Problems with gplazma
Out of the box the migration creates a gplazma config which caused hours of headaches at Edinburgh.
Comment out the line with a hash:
auth optional scitoken
And restart dcache.
Failed auth attempts should now be easier to debug, rather than just giving you Java stack traces because something is misconfigured.
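A minimal sketch of that change, assuming the default config location of /etc/dcache/gplazma.conf (take a backup first and adjust the path if your layout differs):
cp /etc/dcache/gplazma.conf /etc/dcache/gplazma.conf.bak
# Prefix the scitoken auth line with a hash
sed -i 's/^auth[[:space:]]\+optional[[:space:]]\+scitoken/#&/' /etc/dcache/gplazma.conf
systemctl restart dcache.target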
Database dumps
You should automate dumps of all PostgreSQL databases; this can be done with pg_dump.
It shouldn't be too difficult to automate with a cron job, but I would recommend 6-hourly backups to limit potential data loss if something explodes or you enter the wrong command.
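A rough example of such a cron job; the file location, destination directory and retention policy below are illustrative only:
# /etc/cron.d/dcache-pg-backup -- dump all databases every 6 hours as the postgres user
0 */6 * * * postgres pg_dumpall | gzip > /var/backups/pgsql/all-$(date +\%Y\%m\%d-\%H\%M).sql.gz
# Prune old dumps, e.g. keep two weeks of them:
30 3 * * * root find /var/backups/pgsql -name '*.sql.gz' -mtime +14 -delete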
RUCIO dumps
This is recommended for ATLAS/Rucio support at the site.
I recommend going the same route that was used with DPM, i.e.:
Set up gPlazma to allow a gridmap file and a storage-authzdb. Add the host certificate as an allowed entity which can access ATLAS data.
Use the script to generate the dumps from here https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#dCache
Use the host-cert to copy the generated dump to the correct space with xrdcp.
Automate this using a bash script and crontab (a rough sketch follows below).
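A hedged skeleton of that wrapper; the dump-script name, local paths, door hostname and destination path are all placeholders (the real destination should follow the convention described in the twiki page above), and it assumes the xrootd client picks the host certificate up from the standard X509 environment variables:
#!/bin/bash
set -euo pipefail
export X509_USER_CERT=/etc/grid-security/hostcert.pem
export X509_USER_KEY=/etc/grid-security/hostkey.pem
DATE=$(date +%Y%m%d)
# Generate the dump with the script from the ATLAS twiki (name and path are hypothetical).
/usr/local/sbin/make-dcache-dump.sh "/tmp/dumps/dump_${DATE}"
# Copy it into the ATLAS area via the local door (endpoint and path are examples only).
xrdcp -f "/tmp/dumps/dump_${DATE}" "root://dcache-head.example.ac.uk//pnfs/example.ac.uk/data/atlas/atlasdatadisk/dumps/dump_${DATE}"
Once it works by hand, drop the script into cron, e.g. a daily entry in /etc/cron.d.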
SRR
You will need to setup the SRR for the storage with DCache.
In the simplest case this means enabling the reporting, fixing any problems, enabling external access to the SRR, and emailing ATLAS.
An alternative is to set up a site proxy that redirects queries for the old SRR to the new output, but this is trickier.
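A quick way to check the report is being generated and is reachable from outside (3880 is the frontend port from the firewall section above; the hostname is a placeholder and the exact path can differ between dCache versions, so confirm it against the dCache SRR documentation):
curl -sk https://dcache-head.example.ac.uk:3880/api/v1/srr | python3 -m json.tool | head -n 30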
Formatting of SRR
The SRR reports on the spaces which have been reserved within the link group.
However, out of the box the VOs for each space were `Null` at Edinburgh after the migration.
If this is the case and you want to fix it you need to use the admin console.
Connect to the appropriate cell in DCache: `\c SrmSpaceManager`
Update the reservation with the correct owner: `update space -owner=CORRECT-VO SPACETOKEN`
There is no need to restart; after this the reporting should show the correct VO against the correct space token.
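Put together, the whole fix from a shell looks roughly like this (22224 is the default dCache admin-door port, the hostname is a placeholder, and CORRECT-VO and SPACETOKEN are the same placeholders as above):
ssh -p 22224 admin@your-dcache-headnode
# inside the admin shell:
\c SrmSpaceManager
update space -owner=CORRECT-VO SPACETOKEN
\q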
How do I administer DCache?
TODO: I plan to add more detail here on how DCache works (for someone coming from DPM) and how to do some common tasks that were part of day-to-day DPM administration.