RALPP dCache Problem Solving
Your dCache is failing the SFTs - What should you look at and test to diagnose and fix the problem?
First test the stack by uploading a file - it might just have been a temporary problem:
lcg-cr --vo dteam -v -d heplnx204.pp.rl.ac.uk file:/etc/group -P site/lcg-cr-test`date +%Y%m%d%H%M`
If the problem is in dCache, you would expect the file to be registered in the LFC successfully and the data transfer to fail afterwards: the channel may fail to open, fail to transfer any data, or fail to close.
If you get errors like:
No information found for SE : heplnx201.pp.rl.ac.uk
Could not establish context lcg_cr: Communication error on send
or
send2nsd: NS002 - connect error : Connection timed out Communication error
then the problem probably isn't dCache. The first is probably the site BDII, so have a look at gstat; the second two are probably LFC issues.
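As a quick triage aid, the mapping above can be sketched as a small shell function. The function name `classify_error` is hypothetical, and the patterns are only the three messages quoted above; anything else is reported as a possible dCache problem, which is an assumption rather than a diagnosis.

```shell
#!/bin/sh
# Map an lcg-cr error message to the component blamed in the text above.
# classify_error is an illustrative helper, not part of any grid tool.
classify_error() {
    case "$1" in
        *"No information found for SE"*) echo "site BDII" ;;
        *"Could not establish context"*) echo "LFC" ;;
        *"send2nsd"*)                    echo "LFC" ;;
        *)                               echo "possibly dCache" ;;
    esac
}

classify_error "send2nsd: NS002 - connect error : Connection timed out"
```

Paste the error line you actually got as the argument; a "possibly dCache" answer means carrying on with the checks below.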
If lcg-cr fails with a dCache problem then the next step down in the chain to test is globus-url-copy:
globus-url-copy -verbose file:/etc/group gsiftp://heplnx165.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/site/gurlcopy-test-165-`date +%Y%m%d%H%M`
Run this for each of heplnx165, heplnx172 and heplnx173 in turn.
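To avoid typing the command three times, the test can be looped over the doors. This is a minimal sketch: the helper names `dest_url` and `check_doors` and the `RUN` switch are made up for illustration, and it assumes the test file name carries the node number as in the example above. By default it only prints the commands; set `RUN=1` to perform the copies.

```shell
#!/bin/sh
# Door nodes taken from the text above.
DOORS="heplnx165 heplnx172 heplnx173"

# Build the destination URL for one door (hypothetical helper).
# The file name carries the node number, e.g. gurlcopy-test-165-<stamp>.
dest_url() {
    host=$1
    stamp=$2
    echo "gsiftp://${host}.pp.rl.ac.uk/pnfs/pp.rl.ac.uk/data/dteam/site/gurlcopy-test-${host#heplnx}-${stamp}"
}

# Copy /etc/group to every door and report the ones that fail.
# With RUN unset or 0 the commands are printed instead of executed.
check_doors() {
    stamp=$(date +%Y%m%d%H%M)
    for host in $DOORS; do
        url=$(dest_url "$host" "$stamp")
        if [ "${RUN:-0}" = "1" ]; then
            globus-url-copy -verbose file:/etc/group "$url" || echo "FAILED: $host"
        else
            echo "globus-url-copy -verbose file:/etc/group $url"
        fi
    done
}

check_doors
```

Any host reported as FAILED is a candidate for the door restart described next.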
If one or more of these fail, try restarting the gridftp door on the relevant node(s):
service dcache-core restart
And wait a few minutes before rerunning the tests.
If the copies work, or if restarting the doors fails to fix the problem, then the next thing to try is restarting all the services on heplnx204, again with:
service dcache-core restart
and waiting a few minutes before trying the tests again.
If that doesn't fix the problem, then the only simple thing left to try is a general shutdown by running:
service dcache-core stop
service dcache-pool stop
on the disk servers (heplnx165, heplnx172 and heplnx173), and:
service dcache-core stop
on heplnx204.
Then restart them in the following order:
dcache-core on heplnx204
dcache-pool on the pool nodes
dcache-core on the pool nodes
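The full shutdown and ordered restart can be sketched as a script. It assumes you can ssh as root from an admin machine to each node by its short host name, which the text above does not state; `on_host`, `full_restart` and the `RUN` switch are hypothetical names. By default it only prints the commands in order, so it doubles as a checklist.

```shell
#!/bin/sh
HEAD_NODE="heplnx204"
POOL_NODES="heplnx165 heplnx172 heplnx173"

# Print one service command for one host, or run it over ssh when RUN=1.
# Arguments: host, service name, action.
on_host() {
    if [ "${RUN:-0}" = "1" ]; then
        ssh "root@$1" "service $2 $3"
    else
        echo "$1: service $2 $3"
    fi
}

full_restart() {
    # Stop everything: core and pool on the disk servers, then core on the head node.
    for h in $POOL_NODES; do
        on_host "$h" dcache-core stop
        on_host "$h" dcache-pool stop
    done
    on_host "$HEAD_NODE" dcache-core stop

    # Start again in the order given above.
    on_host "$HEAD_NODE" dcache-core start
    for h in $POOL_NODES; do on_host "$h" dcache-pool start; done
    for h in $POOL_NODES; do on_host "$h" dcache-core start; done
}

full_restart
```

As with the earlier restarts, wait a few minutes afterwards before rerunning the tests.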
If it still doesn't work after that, start posting to the mailing lists.