Resilient dCache

Enabling the ReplicaManager within your dCache will cause the dCache to attempt to keep its files available for use by creating replicas of them on different pools. This means that if one pool goes offline, a file can still be accessed via one of its replicas on another pool. It is possible to define a range [min,max] for the number of replicas that should be retained.

If dCache is being used to aggregate the available storage on the WNs at a site, then it is recommended to operate the system in resilient mode with the ReplicaManager turned on. Without replication, as soon as a WN goes down (whether due to a disk problem or some other component) the files on the pool(s) in that WN will be unavailable (i.e. jobs will fail). As the size of batch farms increases, so does the probability that a WN will fail. With the ReplicaManager turned on, replicas of each file are made on different pools, so the failure of a single node will not prevent access to data. Clearly, the amount of usable storage at your site will depend on the number of replicas made of each file.
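How the [min,max] range is set depends on the dCache version. As a sketch only, assuming the range is passed as -min/-max options on the ReplicaManager "create" command in /opt/d-cache/config/replica.batch (both the option names and the file location are assumptions and should be checked against the documentation for your release), keeping between two and three copies of each file would look like:

...
-min=2 \
-max=3 \
...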

Initial GridPP evaluation

I installed a basic dCache using YAIM on a single node. Initially there were two pools on the node (I subsequently added more to test out the functionality of the ReplicaManager). The replicas PostgreSQL database was created by the YAIM installation. To start the ReplicaManager you must add the line:

replicaManager=yes

to /opt/d-cache/etc/node_config, and add these lines to the "create" commands in dCache.batch:

...
-poolStatusRelay=replicaManager \
-watchdog=300:300 \
...

and in pnfs.batch:

 ...
 -cmRelay=replicaManager \
 ...

I just replaced broadcast with replicaManager in each case. This step is necessary, otherwise the ReplicaManager will not automatically start replicating a file once it has been copied to a pool. The replica domain can be started by hand via:

/opt/d-cache/jobs/replica -logfile=/opt/d-cache/log/replicaDomain.log start

After starting, any files that are already in pools will automatically be replicated according to the [min,max] range and the input from the pool selection mechanism.
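Putting these steps together, a minimal sketch of the edits described above is shown below. It assumes the batch files live in /opt/d-cache/config/ and that the stock files already contain -poolStatusRelay=broadcast and -cmRelay=broadcast lines to be switched over (check, and back up, the files before editing):

# enable the ReplicaManager on this node
echo "replicaManager=yes" >> /opt/d-cache/etc/node_config

# relay pool status and cache location messages to the ReplicaManager
# instead of the broadcast cell; the -watchdog=300:300 option must also
# be present in the same "create" command in dCache.batch
sed -i.bak 's/-poolStatusRelay=broadcast/-poolStatusRelay=replicaManager/' /opt/d-cache/config/dCache.batch
sed -i.bak 's/-cmRelay=broadcast/-cmRelay=replicaManager/' /opt/d-cache/config/pnfs.batch

# start the replica domain
/opt/d-cache/jobs/replica -logfile=/opt/d-cache/log/replicaDomain.log start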

Database structure

postgres=# \c replicas
You are now connected to database "replicas".
replicas=# \d
           List of relations
 Schema |   Name    | Type  |   Owner
--------+-----------+-------+-----------
 public | action    | table | srmdcache
 public | heartbeat | table | srmdcache
 public | pools     | table | srmdcache
 public | replicas  | table | srmdcache
(4 rows)

replicas=# select * from action;
  pool   |          pnfsid          |         datestamp          |   timestamp
---------+--------------------------+----------------------------+---------------
 exclude | 000200000000000000008B50 | 2006-07-04 16:20:54.237744 | 1152026454237
 exclude | 000200000000000000008B88 | 2006-07-04 16:20:54.346473 | 1152026454346
(2 rows)
 
replicas=# select * from heartbeat;
   process    | description  |         datestamp
--------------+--------------+----------------------------
 PoolWatchDog | no changes   | 2006-07-04 16:20:53.782767
 Adjuster     | waitDbUpdate | 2006-07-04 16:20:53.970839
(2 rows)

replicas=# select * from pools;
 pool  | status |         datestamp
-------+--------+----------------------------
 wn4_1 | online | 2006-07-04 16:10:53.63589
 wn4_2 | online | 2006-07-04 16:10:53.712997
(2 rows)
 
replicas=# select * from replicas;
  pool   |          pnfsid          |         datestamp
---------+--------------------------+----------------------------
 wn4_1   | 000200000000000000001298 | 2006-07-04 16:10:53.561946
 wn4_1   | 000200000000000000001120 | 2006-07-04 16:10:53.563056
 wn4_1   | 000200000000000000001230 | 2006-07-04 16:10:53.563909
 wn4_1   | 0002000000000000000011B0 | 2006-07-04 16:10:53.564719
 wn4_1   | 000200000000000000008B50 | 2006-07-04 16:10:53.565526
 wn4_1   | 0002000000000000000012E8 | 2006-07-04 16:10:53.566269
 wn4_1   | 0002000000000000000012C0 | 2006-07-04 16:10:53.56707
 wn4_1   | 000200000000000000001260 | 2006-07-04 16:10:53.567789
 wn4_1   | 0002000000000000000010A0 | 2006-07-04 16:10:53.568569
 wn4_1   | 000200000000000000008B88 | 2006-07-04 16:10:53.569352
 wn4_2   | 000200000000000000001298 | 2006-07-04 16:10:53.705932
 wn4_2   | 000200000000000000001120 | 2006-07-04 16:10:53.706747
 wn4_2   | 000200000000000000001230 | 2006-07-04 16:10:53.707502
 wn4_2   | 0002000000000000000011B0 | 2006-07-04 16:10:53.708247
 wn4_2   | 0002000000000000000012E8 | 2006-07-04 16:10:53.709028
 wn4_2   | 0002000000000000000012C0 | 2006-07-04 16:10:53.709988
 wn4_2   | 000200000000000000001260 | 2006-07-04 16:10:53.710748
 wn4_2   | 0002000000000000000010A0 | 2006-07-04 16:10:53.711532
 exclude | 000200000000000000008B50 | 2006-07-04 16:20:54.237744
 exclude | 000200000000000000008B88 | 2006-07-04 16:20:54.346473
(20 rows)
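Since the schema is this simple, the replicas database can also be queried directly to spot files that are short of copies. A minimal sketch using psql, assuming the database owner srmdcache shown above and a minimum of two replicas per file (adjust the connection options and the threshold to your installation; the exclude entries are discussed below):

psql -U srmdcache -d replicas -c "
  SELECT pnfsid, count(*) AS copies
    FROM replicas
   WHERE pool <> 'exclude'
   GROUP BY pnfsid
  HAVING count(*) < 2
   ORDER BY copies;"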

The exclude state means that the file will not be considered when the ReplicaManager checks which files should be replicated to other pools. It was not clear what caused some files to go into this state. They can be released from it by running:

release <pnfsid>

but I noticed on a few occasions that the file went back into exclude status.
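To see which files are currently excluded, the action table can be queried directly (same connection caveats as above); each returned pnfsid can then be passed to release in the ReplicaManager admin cell:

psql -U srmdcache -d replicas -c "SELECT pnfsid, datestamp FROM action WHERE pool = 'exclude';"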

Behaviour

From the initial evaluation, I have determined that the following logic applies:

  • If a pool goes down then the ReplicaManager will automatically attempt to create new replicas in order that each file has a number of replicas in the defined range.
  • If the pool then comes back online, some of the replicas (not necessarily on that pool) will be deleted to ensure the constraint is met. The pools from which the files are deleted are chosen using the dCache pool selection mechanism.
  • If a file is precious, then its replicas will also be precious. This prevents the garbage-collection mechanism from removing any of the replicas.
  • If the ReplicaManager tries to replicate a file and finds that there is insufficient available space for the replication, then no replica will be made. The log file (replicaDomain.log) will contain a message like:
12.21.00  DEBUG: cacheLocationModified, notify All
12.21.00  cacheLocationModified : pnfsID 000200000000000000009588 added to pool wn4_1 - DB updated
12.21.00  DB updated, rerun Adjust cycle
12.21.00  DEBUG: runAdjustment - started
12.21.00  DEBUG: runAdjustment - scan Redundant
12.21.00  DEBUG: runAdjustment - scan DrainOff
12.21.00  DEBUG: runAdjustment - scan offLine-prepare
12.21.00  DEBUG: runAdjustment - scan Deficient
12.21.00  DEBUG: replicateAsync - get worker
12.21.00  DEBUG: replicateAsync - got worker OK
12.21.00  Replicator ID=47, pnfsId=000200000000000000009588 starting, now 1/10 workers are active
12.21.01  DEBUG: getCostTable(): sendMessage,  command=[xcm ls]
message=<CM: S=[empty];D=[>PoolManager@local];C=java.lang.String;O=<1152098461166:2529>;LO=<1152098461166:2529>>
12.21.01  DEBUG: DEBUG: Cost table reply arrived
12.21.01  replicate(000200000000000000009588) reported : java.lang.IllegalArgumentException: No pools found, can not get  destination pool with available space=1000000000 
12.21.01  pnfsId=000200000000000000009588 excluded from replication. 
12.21.01  DEBUG: msg='No pools found, can not get destination pool with available space=1000000000' signature OK
12.21.01  Replicator ID=47, pnfsId=000200000000000000009588 finished, now 0/10 workers are active

The manager will continue to operate and replicate files for which there is sufficient space. If pools with enough free space for the large file later come online, the ReplicaManager does not appear to replicate it automatically. This has to be carried out by hand using the admin interface to the ReplicaManager (see the sketch at the end of this section):

replicate <pnfsid>

Maybe there is an option that will cause this replication to happen automatically; this needs to be checked. If not, it is very annoying and probably means that the system could not work in production.

  • If a new pool is added, the ReplicaManager has to be told about it, either by restarting or by entering the command:
set pool wn4_4 online

There is a PoolWatchDog that I thought would pick up the addition of a new pool, but this did not seem to happen. Maybe I did not give it enough time. Bringing a pool back online will cause redundant replicas to be removed.

  • If you want to take a pool out, then:
set pool wn4_4 down


The contents of the replicas table in the database can be displayed in the admin interface by running:

ls pnfsid
0002000000000000000010A0 wn4_1 wn4_2 
000200000000000000001120 wn4_1 wn4_2 
0002000000000000000011B0 wn4_1 wn4_2 
000200000000000000001230 wn4_1 wn4_2 
000200000000000000001260 wn4_1 wn4_2 
000200000000000000001298 wn4_1 wn4_2 
0002000000000000000012C0 wn4_1 wn4_2 
0002000000000000000012E8 wn4_1 wn4_2 
000200000000000000008B50 wn4_1 
000200000000000000008B88 wn4_1 

Some files were not replicated because they were in exclude status.

  • If the [min,max] range is changed, then upon restart the ReplicaManager will pick this up and adjust the number of replicas accordingly.
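The ReplicaManager commands used above (replicate, release, set pool and ls pnfsid) are all issued from the dCache admin interface. As a minimal sketch, assuming the admin interface listens on port 22223 of the admin node and that the cell is called replicaManager, as in the batch-file options above (host, port and cell name should be checked against your installation):

ssh -c blowfish -p 22223 admin@<admin-node>

cd replicaManager
ls pnfsid
set pool wn4_4 online
replicate <pnfsid>
..
logoff

Here ".." leaves the replicaManager cell and logoff closes the admin session.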

Documentation