Resilient dCache
Enabling the ReplicaManager within your dCache causes dCache to try to keep its files available for use by creating replicas of them on different pools. This means that if one pool goes offline, the file can still be accessed via one of the replicas on another pool. It is possible to define a range [min,max] for the number of replicas that should be retained.
If dCache is being used to aggregate the available storage on the worker nodes (WNs) at a site, then it is recommended to operate the system in resilient mode with the ReplicaManager turned on. Without replication, as soon as a WN goes down (whether due to a disk problem or a failure of some other component) the files on the pool(s) in that WN will be unavailable (i.e. jobs will fail). As the size of a batch farm increases, so does the probability that some WN in it fails. With the ReplicaManager turned on, replicas of each file are created on different pools, so the failure of a single node will not prevent access to data. The price is capacity: the amount of storage that is usable at your site will depend on the number of replicas that are made of each file.
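As a rough illustration of the capacity trade-off described above, the sketch below estimates how much unique data fits on a site for a given [min,max] replica range. It assumes every file carries the same number of replicas and ignores any space reserved by the pools; the function names and the 10 TB figure are hypothetical and not taken from dCache.

```python
def usable_capacity(raw_capacity: float, replicas: int) -> float:
    """Approximate unique-data capacity when every file is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_capacity / replicas


def capacity_range(raw_capacity: float, min_replicas: int, max_replicas: int):
    """Return (worst case, best case) usable capacity for a [min,max] replica range."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    # More replicas per file means less room for unique data.
    return (usable_capacity(raw_capacity, max_replicas),
            usable_capacity(raw_capacity, min_replicas))


if __name__ == "__main__":
    raw_tb = 10.0  # hypothetical total pool space on the worker nodes, in TB
    low, high = capacity_range(raw_tb, 2, 3)
    print(f"Between {low:.2f} TB and {high:.2f} TB of unique data fits in {raw_tb} TB")
```

With a range of [2,3], at most half of the raw pool space can hold unique data.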
Initial GridPP evaluation
I installed a basic dCache using YAIM on a single node. Initially there were 2 pools on the node (I subsequently added more to test the functionality of the ReplicaManager). The replicas PostgreSQL database was created by the YAIM installation. To start the ReplicaManager you must add the line:
replicaManager=yes
to /opt/d-cache/etc/node_config, and add these options to the "create" commands in dCache.batch:
...
-poolStatusRelay=replicaManager \
-watchdog=300:300 \
...
and in pnfs.batch:
...
-cmRelay=replicaManager \
...
I just replaced broadcast with replicaManager in each case. This step is necessary; otherwise the ReplicaManager will not automatically start replicating a file once it has been copied to a pool. The ReplicaManager can be started by hand via:
/opt/d-cache/jobs/replica -logfile=/opt/d-cache/log/replicaDomain.log start
After starting, any files that are already in pools will automatically be replicated according to the [min,max] range and the input from the pool selection mechanism.
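The substitution of broadcast with replicaManager in the batch files could be scripted along the following lines. This is only a sketch: it assumes the existing options read -poolStatusRelay=broadcast and -cmRelay=broadcast, as the note above implies, and it works on the file contents as a string. The function name is hypothetical, and the -watchdog option is not touched.

```python
import re

# Only the two relay options named in the text are rewritten.
RELAY_PATTERN = re.compile(r"-(poolStatusRelay|cmRelay)=broadcast\b")


def point_relays_at_replica_manager(batch_text: str) -> str:
    """Return batch-file text with the relay options set to replicaManager."""
    return RELAY_PATTERN.sub(r"-\1=replicaManager", batch_text)


if __name__ == "__main__":
    # Hypothetical fragment; the real "create" commands contain more options.
    fragment = "... \\\n  -poolStatusRelay=broadcast \\\n  ...\n"
    print(point_relays_at_replica_manager(fragment))
```

Keeping a backup copy of dCache.batch and pnfs.batch before writing any change back would be prudent.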
Database structure
postgres=# \c replicas
You are now connected to database "replicas".
replicas=# \d
          List of relations
 Schema |   Name    | Type  |   Owner
--------+-----------+-------+-----------
 public | action    | table | srmdcache
 public | heartbeat | table | srmdcache
 public | pools     | table | srmdcache
 public | replicas  | table | srmdcache
(4 rows)
replicas=# select * from action;
  pool   |          pnfsid          |         datestamp          |   timestamp
---------+--------------------------+----------------------------+---------------
 exclude | 000200000000000000008B50 | 2006-07-04 16:20:54.237744 | 1152026454237
 exclude | 000200000000000000008B88 | 2006-07-04 16:20:54.346473 | 1152026454346
(2 rows)
replicas=# select * from heartbeat;
   process    | description  |         datestamp
--------------+--------------+----------------------------
 PoolWatchDog | no changes   | 2006-07-04 16:20:53.782767
 Adjuster     | waitDbUpdate | 2006-07-04 16:20:53.970839
(2 rows)
replicas=# select * from pools;
 pool  | status |         datestamp
-------+--------+----------------------------
 wn4_1 | online | 2006-07-04 16:10:53.63589
 wn4_2 | online | 2006-07-04 16:10:53.712997
(2 rows)
replicas=# select * from replicas;
  pool   |          pnfsid          |         datestamp
---------+--------------------------+----------------------------
 wn4_1   | 000200000000000000001298 | 2006-07-04 16:10:53.561946
 wn4_1   | 000200000000000000001120 | 2006-07-04 16:10:53.563056
 wn4_1   | 000200000000000000001230 | 2006-07-04 16:10:53.563909
 wn4_1   | 0002000000000000000011B0 | 2006-07-04 16:10:53.564719
 wn4_1   | 000200000000000000008B50 | 2006-07-04 16:10:53.565526
 wn4_1   | 0002000000000000000012E8 | 2006-07-04 16:10:53.566269
 wn4_1   | 0002000000000000000012C0 | 2006-07-04 16:10:53.56707
 wn4_1   | 000200000000000000001260 | 2006-07-04 16:10:53.567789
 wn4_1   | 0002000000000000000010A0 | 2006-07-04 16:10:53.568569
 wn4_1   | 000200000000000000008B88 | 2006-07-04 16:10:53.569352
 wn4_2   | 000200000000000000001298 | 2006-07-04 16:10:53.705932
 wn4_2   | 000200000000000000001120 | 2006-07-04 16:10:53.706747
 wn4_2   | 000200000000000000001230 | 2006-07-04 16:10:53.707502
 wn4_2   | 0002000000000000000011B0 | 2006-07-04 16:10:53.708247
 wn4_2   | 0002000000000000000012E8 | 2006-07-04 16:10:53.709028
 wn4_2   | 0002000000000000000012C0 | 2006-07-04 16:10:53.709988
 wn4_2   | 000200000000000000001260 | 2006-07-04 16:10:53.710748
 wn4_2   | 0002000000000000000010A0 | 2006-07-04 16:10:53.711532
 exclude | 000200000000000000008B50 | 2006-07-04 16:20:54.237744
 exclude | 000200000000000000008B88 | 2006-07-04 16:20:54.346473
(20 rows)
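The exclude state means that the file is skipped when the ReplicaManager checks which files should be replicated to other pools. In the replicas table an excluded file appears as an extra row whose pool column reads exclude, and the same entry is recorded in the action table. It was not clear what caused some files to go into this state. They can be released from it by running: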
release <pnfsid>
but I noticed on a few occasions that the file went back into exclude status.
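To make the layout of the replicas table concrete, the following sketch groups (pool, pnfsid) rows into a per-file list of pools, separates out the exclude entries, and lists the files that hold fewer replicas than the minimum. The rows are a subset of the query output above; the minimum of 2 is an assumption (the configured range is not stated here), and the function name is hypothetical.

```python
from collections import defaultdict

EXCLUDE = "exclude"  # pseudo-pool name used in the replicas table

# A subset of the (pool, pnfsid) pairs shown in the query output above.
SAMPLE_ROWS = [
    ("wn4_1", "000200000000000000001298"),
    ("wn4_1", "000200000000000000008B50"),
    ("wn4_1", "000200000000000000008B88"),
    ("wn4_2", "000200000000000000001298"),
    (EXCLUDE, "000200000000000000008B50"),
    (EXCLUDE, "000200000000000000008B88"),
]


def summarise_replicas(rows, min_replicas):
    """Return (pools per file, excluded files, files below min_replicas)."""
    pools_by_file = defaultdict(set)
    excluded = set()
    for pool, pnfsid in rows:
        if pool == EXCLUDE:
            excluded.add(pnfsid)           # marker row, not a real replica
        else:
            pools_by_file[pnfsid].add(pool)
    deficient = sorted(
        pnfsid for pnfsid, pools in pools_by_file.items()
        if len(pools) < min_replicas
    )
    return dict(pools_by_file), excluded, deficient


if __name__ == "__main__":
    pools, excluded, deficient = summarise_replicas(SAMPLE_ROWS, min_replicas=2)
    for pnfsid in deficient:
        note = " (excluded)" if pnfsid in excluded else ""
        print(pnfsid, sorted(pools[pnfsid]), note)
```

In this sample, the files with too few replicas are exactly the ones in exclude state.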
Behaviour
From the initial evaluation, I have determined that the following logic applies:
- If a pool goes down then the ReplicaManager will automatically attempt to create new replicas in order that each file has a number of replicas in the defined range.
- If the pool then comes back online, some of the replicas (not necessarily on that pool) will be deleted to ensure the constraint is met. The pools from which the files are deleted are chosen using the dCache pool selection mechanism.
- If a file is precious, then the replica will also be precious. This prevents the garbage collection mechanism from removing any of the replicas.
- If the ReplicaManager tries to replicate a file and finds that there is insufficient available space for the replication, then no replica will be made. The log file (replicaDomain.log) will contain a message like:
12.21.00 DEBUG: cacheLocationModified, notify All
12.21.00 cacheLocationModified : pnfsID 000200000000000000009588 added to pool wn4_1 - DB updated
12.21.00 DB updated, rerun Adjust cycle
12.21.00 DEBUG: runAdjustment - started
12.21.00 DEBUG: runAdjustment - scan Redundant
12.21.00 DEBUG: runAdjustment - scan DrainOff
12.21.00 DEBUG: runAdjustment - scan offLine-prepare
12.21.00 DEBUG: runAdjustment - scan Deficient
12.21.00 DEBUG: replicateAsync - get worker
12.21.00 DEBUG: replicateAsync - got worker OK
12.21.00 Replicator ID=47, pnfsId=000200000000000000009588 starting, now 1/10 workers are active
12.21.01 DEBUG: getCostTable(): sendMessage, command=[xcm ls] message=<CM: S=[empty];D=[>PoolManager@local];C=java.lang.String;O=<1152098461166:2529>;LO=<1152098461166:2529>>
12.21.01 DEBUG: DEBUG: Cost table reply arrived
12.21.01 replicate(000200000000000000009588) reported : java.lang.IllegalArgumentException: No pools found, can not get destination pool with available space=1000000000
12.21.01 pnfsId=000200000000000000009588 excluded from replication.
12.21.01 DEBUG: msg='No pools found, can not get destination pool with available space=1000000000' signature OK
12.21.01 Replicator ID=47, pnfsId=000200000000000000009588 finished, now 0/10 workers are active
The manager will continue to operate and replicate files for which there is sufficient space. If pools with enough free space for the excluded file later come online, the ReplicaManager does not appear to replicate the file automatically. This has to be carried out by hand using the admin interface to the ReplicaManager:
replicate <pnfsid>
There may be an option that triggers this replication automatically; this needs to be checked. If not, it is very annoying and probably means that the system could not be used in production.
- If a new pool is added, the ReplicaManager has to be told about it, either by restarting or by entering the command:
set pool wn4_4 online
There is a PoolWatchDog that I thought should pick up the addition of a new pool, but this did not seem to happen. Maybe I did not give it enough time. Bringing a pool online again will cause redundant replicas to be removed.
- If you want to take a pool out, then:
set pool wn4_4 down
The contents of the replicas table in the database can be displayed in the admin interface by running:
ls pnfsid
0002000000000000000010A0 wn4_1 wn4_2
000200000000000000001120 wn4_1 wn4_2
0002000000000000000011B0 wn4_1 wn4_2
000200000000000000001230 wn4_1 wn4_2
000200000000000000001260 wn4_1 wn4_2
000200000000000000001298 wn4_1 wn4_2
0002000000000000000012C0 wn4_1 wn4_2
0002000000000000000012E8 wn4_1 wn4_2
000200000000000000008B50 wn4_1
000200000000000000008B88 wn4_1
Some files were not replicated because they were in exclude status.
- If the [min,max] range is changed, then upon restart the ReplicaManager can pick this up and will alter the number of replicas.
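Since files excluded for lack of space currently have to be replicated by hand, one workaround is to collect their pnfsids from replicaDomain.log and generate the matching admin commands. The sketch below assumes the log line format shown in the excerpt above ("pnfsId=... excluded from replication."); the function names are hypothetical, and the resulting commands would still need to be entered in the ReplicaManager admin interface.

```python
import re

# Matches the "excluded from replication" line seen in replicaDomain.log.
EXCLUDED_RE = re.compile(r"pnfsId=([0-9A-Fa-f]+) excluded from replication")


def excluded_pnfsids(log_lines):
    """Return pnfsids reported as excluded, in first-seen order, without duplicates."""
    seen = []
    for line in log_lines:
        match = EXCLUDED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def replicate_commands(pnfsids):
    """Build one 'replicate <pnfsid>' admin command per file."""
    return [f"replicate {pnfsid}" for pnfsid in pnfsids]


if __name__ == "__main__":
    # Two lines copied from the log excerpt above.
    sample = [
        "12.21.00 DEBUG: runAdjustment - scan Deficient",
        "12.21.01 pnfsId=000200000000000000009588 excluded from replication.",
    ]
    for command in replicate_commands(excluded_pnfsids(sample)):
        print(command)
```

To run it against a real log, pass an open file object for /opt/d-cache/log/replicaDomain.log as log_lines.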