RALPP dCache Problem Solving

From GridPP Wiki
Revision as of 16:43, 24 October 2006 by Chris brew (Talk | contribs)

Your dCache is failing the SFTs - What should you look at and test to diagnose and fix the problem?

First test the stack by uploading a file - it might just have been a temporary problem:

lcg-cr --vo dteam -v -d heplnx204.pp.rl.ac.uk file:/etc/group -P site/lcg-cr-test`date +%Y%m%d%H%M`

If the problem is in dCache, you would expect the file to be registered successfully in the LFC and the data transfer to fail afterwards: the channel may fail to open, fail to transfer any data, or fail to close.

If you get errors like:

No information found for SE : heplnx201.pp.rl.ac.uk
Could not establish context
lcg_cr: Communication error on send

or

send2nsd: NS002 - connect error : Connection timed out
Communication error

These suggest the problem isn't in dCache. The first is probably the site BDII, so have a look at gstat; the second two are probably LFC issues.
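The triage above can be sketched as a small shell helper. This is only an illustration: the function name classify_lcg_cr_error is hypothetical, and the mapping simply follows the error strings quoted above (anything else is reported as "unknown" and should be treated as a possible dCache problem).

```shell
# Hypothetical helper: read lcg-cr output on stdin and print a guess at
# which component is at fault, based on the error strings listed above.
classify_lcg_cr_error() {
    output=$(cat)
    case "$output" in
        # The SE lookup failed: probably the site BDII, so check gstat.
        *"No information found for SE"*)
            echo "bdii" ;;
        # Catalogue connection errors: probably the LFC.
        *"Could not establish context"*|*"Communication error"*|*"send2nsd"*)
            echo "lfc" ;;
        # Nothing recognised: carry on with the dCache tests.
        *)
            echo "unknown" ;;
    esac
}

# Example use (not run here):
#   lcg-cr --vo dteam -v -d heplnx204.pp.rl.ac.uk file:/etc/group \
#       -P site/lcg-cr-test`date +%Y%m%d%H%M` 2>&1 | classify_lcg_cr_error
```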

If lcg-cr fails with a dCache problem then the next step down in the chain to test is globus-url-copy.

globus-url-copy -verbose file:/etc/group gsiftp://heplnx165.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/site/gurlcopy-test-165-`date +%Y%m%d%H%M`

Run this for each of heplnx165, heplnx172 and heplnx173, changing the hostname (and the number in the test file name) to match.
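Testing all three doors in one go could look something like the sketch below. The function name test_gridftp_doors and the DOORS and GUC variables are made up for this example; GUC only exists so that the copy command can be swapped out when trying the loop without a grid proxy.

```shell
# GridFTP door nodes to test, as listed above.
DOORS="heplnx165 heplnx172 heplnx173"
# Copy command; overridable for dry runs (assumption, not part of dCache).
GUC=${GUC:-globus-url-copy}

# Hypothetical helper: copy /etc/group through each door in turn and
# report which ones worked. Returns non-zero if any door failed.
test_gridftp_doors() {
    stamp=$(date +%Y%m%d%H%M)
    failed=""
    for door in $DOORS; do
        n=${door#heplnx}   # e.g. heplnx165 -> 165
        dest="gsiftp://$door.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/site/gurlcopy-test-$n-$stamp"
        if $GUC -verbose file:/etc/group "$dest"; then
            echo "$door: OK"
        else
            echo "$door: FAILED"
            failed="$failed $door"
        fi
    done
    [ -z "$failed" ]
}
```

Calling test_gridftp_doors after sourcing this prints one OK or FAILED line per door, which shows which node(s) need attention.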

If one or more of these fail, try restarting the GridFTP door on the relevant node(s):

service dcache-core restart

Then wait a few minutes before rerunning the tests.

If the door tests pass, or if restarting the doors fails to fix the problem, the next thing to try is restarting all the services on heplnx204, again with:

service dcache-core restart

and waiting a few minutes before trying the tests again.

If that doesn't fix the problem, then the only simple thing left to try is a general shutdown by running:

service dcache-core stop
service dcache-pool stop

on the disk servers (heplnx165, heplnx172 and heplnx173).

and:

service dcache-core stop

on heplnx204.

Then restart them in the following order:

1. dcache-core on heplnx204
2. dcache-pool on the pool nodes
3. dcache-core on the pool nodes
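The whole shutdown and ordered restart could be scripted roughly as follows. This is a sketch under assumptions not stated above: that you can ssh to each node as a user allowed to run service, and that the init scripts accept start as well as stop and restart. The names full_restart, HEAD_NODE, POOL_NODES and RUN are invented for the example.

```shell
HEAD_NODE=heplnx204
POOL_NODES="heplnx165 heplnx172 heplnx173"
# How to reach each node; set RUN=echo to print the commands instead
# of running them (assumes working ssh access otherwise).
RUN=${RUN:-ssh}

# Hypothetical helper: stop everything, then start in the order given above.
full_restart() {
    # General shutdown: pool nodes first, then the head node.
    for h in $POOL_NODES; do
        $RUN "$h" service dcache-core stop
        $RUN "$h" service dcache-pool stop
    done
    $RUN "$HEAD_NODE" service dcache-core stop

    # Restart: dcache-core on the head node, then dcache-pool on the
    # pool nodes, then dcache-core on the pool nodes.
    $RUN "$HEAD_NODE" service dcache-core start
    for h in $POOL_NODES; do
        $RUN "$h" service dcache-pool start
    done
    for h in $POOL_NODES; do
        $RUN "$h" service dcache-core start
    done
}
```

Running it once with RUN=echo is a cheap way to check the order before doing it for real.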

If it still doesn't work after that, start posting to the mailing lists.