DCache Log Message Archive

This page is intended to contain a list of dCache log messages and corresponding explanations of their cause. If a message is caused by a real problem in the dCache, potential solutions are suggested. Much of the material and the solutions come from the dCache user-forum mailing list.

lm -> Cell Not Found

lm is the dCache Location Manager. This service is required to ensure that all of the dCache components (cells) can communicate with each other. When dCache is starting up, it is common to see error messages related to lm in the logs, since not all of the components are able to talk to one another immediately. These errors should disappear after a minute or so.

Cell not found

(local) admin > cd Pnfsmanager
(Pnfsmanager) admin > help
No Route to cell for packet
{uoid=<1170490140376:118>;path=[>Pnfsmanager@local];msg=Tunnel cell
>Pnfsmanager@local< not found at >dCacheDomain<

This is a typo: cell names are case-sensitive, so the cell should be addressed as PnfsManager.
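
With the correct capitalisation the cell is found, for example:

(local) admin > cd PnfsManager
(PnfsManager) admin > help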

TransferManager errortoo many transfers

In catalina.out:

01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:861)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1182)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at java.lang.Thread.run(Thread.java:595)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : CopyFileRequest #-2147439634: copy  failed
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : org.dcache.srm.SRMException: TransferManager errortoo many transfers!
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at diskCacheV111.srm.dcache.Storage.performRemoteTransfer(Storage.java:3801)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at diskCacheV111.srm.dcache.Storage.getFromRemoteTURL(Storage.java:3662)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:718)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) :    at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:833) 

Make sure that the RemoteGsiftpTransferManager maximum number of transfers (the variable remoteGsiftpMaxTransfers, used in utility.batch) is at least as high as the maximum number of copy requests running (the variable srmCopyReqThreadPoolSize, used in srm.batch). You should not edit srm.batch or utility.batch directly; instead, use dCacheSetup to override the default values of the above-mentioned variables.
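
For example, the defaults can be overridden in config/dCacheSetup (a sketch; the values are illustrative and should be tuned to the site):

# In /opt/d-cache/config/dCacheSetup
# The number of concurrent SRM copy requests must not exceed the number
# of transfers the RemoteGsiftpTransferManager will accept.
srmCopyReqThreadPoolSize=250
remoteGsiftpMaxTransfers=250

The domains that read these batch files (the SRM and utility domains) have to be restarted for the new values to take effect.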

Active transfers on webpage: all dcap connections are "staging"

In dCacheDomain.log:

01/15 16:25:12 Cell(l-100-Unknown-126@dCacheDomain) : runIO : java.io.EOFException
01/15 16:25:12 Cell(l-100-Unknown-126@dCacheDomain) : acceptThread : java.io.EOFException
01/15 16:25:12 Cell(RoutingMgr@dCacheDomain) : update can't send update  to RoutingMgr{uoid=<1168874712338:7058545>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
01/15 16:25:16 Cell(RoutingMgr@dCacheDomain) : update can't send update  to RoutingMgr{uoid=<1168874716509:7058744>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
java.lang.IllegalArgumentException: Protocol not found : null
        at diskCacheV111.poolManager.RequestContainerV5.addRequest(RequestContainerV5.java:502)
        at diskCacheV111.poolManager.PoolManagerV5.messageArrived(PoolManagerV5.java:500)
        at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:918)
        at dmg.cells.nucleus.CellNucleus.run(CellNucleus.java:412)
        at java.lang.Thread.run(Thread.java:595)

This was fixed by restarting the dCacheDomain:

ourHomeDir=/opt/d-cache/
. ${ourHomeDir}/config/dCacheSetup
${ourHomeDir}/jobs/dCache ${logParam} stop
${ourHomeDir}/jobs/dCache ${logParam} start

Authorization Service failed

The gridftpdoor.log file is reporting:

error response: 530 530 Authorization Service failed: diskCacheV111.services.authorization.AuthorizationServiceException: \
authRequestID 8465657 recevied exception null

This was seen when there were a large number of TCP connections in a CLOSE_WAIT state on the door node. It may be that the delegation step to gPlazma is failing because of the open connections. There is an uncaught exception coming from a dependent jar, hence there is little information in the error message. However, when the errors start occurring you can log in to the gPlazma cell and execute:

set LogLevel DEBUG

and debugging output should appear showing at what point authentication is failing.
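
For reference, the debug level is set through the dCache admin interface; a minimal sketch, assuming the default admin port (22223) and that the cell is named gPlazma:

ssh -c blowfish -p 22223 admin@<admin-node>
(local) admin > cd gPlazma
(gPlazma) admin > set LogLevel DEBUG

Remember to lower the log level again (e.g. set LogLevel INFO) once the problem has been traced, as DEBUG output is very verbose.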

n must be positive

lcg-cr is one of the LCG/gLite data management tools. It allows users to copy a file to an SE and register it in the file catalog.

lcg-cr -v --vo ops -d globe-door.ifh.de -l lfn:sft-lcg-rm-cr-globe50.ifh.de.07021406262 file:///var/home/ops000/globus-tmp.globe50.6984.0/WMS_globe50_07566_https_3a_2f_2frb115.cern.ch_3a9000_2f7Vvtb6Xut29X79yHC8Cr9w/work/testjob/nodes/lcg-ce0.ifh.de/sft-lcg-rm-cr.txt
getting request #-2145945304 generated error : java.lang.IllegalArgumentException: n must be positive
lcg_cr: Communication error on send 

In the srm.log file:

02/14 07:26:55 Cell(SRM-globe-door@srm-globe-doorDomain) : srm: creating a failed request status with a message: getting request #-2145945302 generated error :  java.lang.IllegalArgumentException: n must be positive 

This is, however, only a temporary problem; a few seconds later everything works again. (The wording matches the IllegalArgumentException thrown by java.util.Random.nextInt() when it is called with a non-positive argument.)

Handle not in the proper state

When I try to srmcp or globus-url-copy a file in, I see errors like this:

debug: fault on connection to gsiftp://pool1.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/srm3-18139-1.txt: Handle not in the proper state
debug: error reading response from gsiftp://pool1.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/srm3-18139-1.txt: the handle 0x9330734 was already registered for  closing
debug: data callback, no error, buffer 0xb74d7008, length 0, offset=0, eof=true
debug: operation complete

The relevant lines from the pool domain and pnfs domain logs are below.

pool domain:

03/27 23:17:05 Cell(pool1_23@pool1Domain) : getChecksumFromPnfs : No crc available for 000E0000000000000021CE00
03/27 23:17:05 Cell(pool1_23@pool1Domain) : PnfsHandler : CacheException (4) : Pnfs error : Pnfs lookup failed
03/27 23:17:05 Cell(pool1_23@pool1Domain) : Exception in runIO for : 000E0000000000000021CE00 CacheException(rc=4;msg=Pnfs error : Pnfs lookup failed)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : CacheException(rc=4;msg=Pnfs error : Pnfs lookup failed)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.util.PnfsHandler.pnfsRequest(PnfsHandler.java:156)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.util.PnfsHandler.setFileSize(PnfsHandler.java:96)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1455)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at java.lang.Thread.run(Thread.java:595)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : removeing empty file: 000E0000000000000021CE00
03/27 23:17:05 Cell(pool1_23@pool1Domain) : PnfsHandler : CacheException (10001) : Pnfs error : 000E0000000000000021CE00 not found
03/27 23:17:05 Cell(pool1_23@pool1Domain) : Stacked Exception (Original) for : 000E0000000000000021CE00 <-P----RD---(0)[0]> 0 si={dteam:STATIC} :   CacheException(rc=4;msg=Pnfs error : Pnfs lookup failed )
03/27 23:17:05 Cell(pool1_23@pool1Domain) : Stacked Throwable (Resulting) for : 000E0000000000000021CE00 <-P----RD---(0)[0]> 0 si={dteam:STATIC} :  CacheException(rc=10001;msg=Pnfs error : 000E0000000000000021CE00 not found )
03/27 23:17:05 Cell(pool1_23@pool1Domain) : CacheException(rc=10001;msg=Pnfs error : 000E0000000000000021CE00 not found)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.util.PnfsHandler.pnfsRequest(PnfsHandler.java:156)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.util.PnfsHandler.deletePnfsEntry(PnfsHandler.java:357)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1525)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
03/27 23:17:05 Cell(pool1_23@pool1Domain) :     at java.lang.Thread.run(Thread.java:595) 

pnfs domain:

03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : Exception in setLengthjava.lang.IllegalArgumentException: Failed : CacheException(rc=10001;msg=no such file or  directory000E0000000000000021CE00 )
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : java.lang.IllegalArgumentException: Failed : CacheException(rc=10001;msg=no such file or directory000E0000000000000021CE00)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) :   at diskCacheV111.namespace.provider.BasicNameSpaceProvider.getFileMetaData(BasicNameSpaceProvider.java:274)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) :   at diskCacheV111.namespace.PnfsManagerV3.setLength(PnfsManagerV3.java:873)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) :   at diskCacheV111.namespace.PnfsManagerV3.processPnfsMessage(PnfsManagerV3.java:1081)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) :   at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:984)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) :   at java.lang.Thread.run(Thread.java:595) 

This problem can be caused by an incorrectly configured GridFTP port range on the dCache. Check that all of the port ranges in the dCacheSetup file on your GridFTP doors are set correctly and that the relevant ports are open in your firewall.
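
As a sketch (the variable name and range are examples; check the dCacheSetup shipped with your dCache version), the passive data port range and a matching firewall rule might look like:

# In config/dCacheSetup on the GridFTP door nodes (example range)
clientDataPortRange=20000:25000

# Corresponding iptables rule on the nodes that accept the data connections
iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT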

Repository got lost

During normal operation, the dCache runs a background process that tries to write a small test file to each pool in order to determine that the pool is still operating. When this operation fails, the following message will appear in the usageInfo table of the dCache webpage:

pool1_04 	pool1Domain 	[99] 	Repository got lost

This situation arises due to a filesystem problem or a slow disk response. The background process does not appear to retry the operation once a failure has occurred, so typically a restart of the pool process is needed to rectify the situation.
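
A sketch of such a restart, assuming the standard /opt/d-cache layout (script names vary between dCache versions):

# On the affected pool node; this restarts the pool domains on that host
/opt/d-cache/bin/dcache-pool stop
/opt/d-cache/bin/dcache-pool start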

Low Level Exc : java.io.IOException: Too many open files

The usageInfo webpage will sometimes display this message:

pool1_05 	pool1Domain 	[204] 	Low Level Exc : java.io.IOException: Too many open files

It should be possible to fix this issue by increasing the maximum number of open file descriptors available to the pool process:

 ulimit -n 32768

This command can be placed in the file d-cache/jobs/dcache.local.sh, which is sourced on startup, so that the changed ulimit only affects the dCache processes.
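
For example (a sketch, assuming the standard /opt/d-cache layout):

# /opt/d-cache/jobs/dcache.local.sh
# Raise the open file descriptor limit before the dCache domains start
ulimit -n 32768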

Pool too high

For example, a transfer fails with:

"Pool manager error: Best pool <f01-101-103-e_T_dteam> too high : 2.0000000001E8"

This is because the most suitable pool to satisfy the request is full (or almost full), making the cost of the transfer too high. Check that the pools have space left.
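
The space cost of each pool can be inspected from the PoolManager cell in the admin interface, for example:

(local) admin > cd PoolManager
(PoolManager) admin > cm ls -r

A pool with little or no free space shows a very large space cost, such as the 2.0E8 value in the error above.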

Client error: TransferManager errortoo many transfers!

RequestFileStatus#-2146468676 failed with error:[  at Wed Aug 08 15:56:42 BST 2007 state RetryWait :  nonfatal error [org.dcache.srm.scheduler.NonFatalJobFailure:  org.dcache.srm.SRMException: TransferManager errortoo many transfers!] retrying 
 at Wed Aug 08 15:57:42 BST 2007 state RetryWait :  nonfatal error [org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!] retrying 

These are errors on the dCache side: it looks like the SRM was trying to schedule more transfers than the RemoteGsiftpTransferManager was configured to handle. The exact limits depend on the dCache version and on the parameters used for the TransferManager and SRM cells.

This looks like a misconfiguration of the dCache system. Check the contents of the dCacheSetup file for the following:

The maximum number of active SRM copy jobs (specified by the -copy-req-thread-pool-size=${srmCopyReqThreadPoolSize} option) should be less than or equal to the maximum number of TransferManager copies (the -max_transfers=${remoteGsiftpMaxTransfers} option). The latter number should be roughly equal to the sum of the WAN movers on the pools.
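
Expressed in dCacheSetup variables (the same ones discussed in the "TransferManager errortoo many transfers" section above; the values are illustrative):

# srmCopyReqThreadPoolSize should be <= remoteGsiftpMaxTransfers, and
# remoteGsiftpMaxTransfers roughly equal to the total number of WAN
# movers configured across the pools.
srmCopyReqThreadPoolSize=250
remoteGsiftpMaxTransfers=250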

Pool request interrupted

In the gridftp logs on the pool nodes I see log entries like:

10/24 13:30:39 Cell(GFTP-pool2-Unknown-3590@gridftp-pool2Domain) : pool request interrupted
10/24 13:30:41 Cell(GFTP-pool2-Unknown-3579@gridftp-pool2Domain) : pool request interrupted
10/24 13:30:50 Cell(GFTP-pool2-Unknown-3588@gridftp-pool2Domain) : pool request interrupted

This is normally a result of the FTP client dropping the control channel; the FTP door then breaks out of whatever it was doing, which is what is noted in the log.

Disk I/O error

Pools go offline with a Disk I/O error. The files are readily accessible, the error pops up randomly across different pools which are all on different RAID 5 arrays, and the offending files can be copied around.

10/27 21:13:16 Cell(europa_1@europaDomain) : (DCap_3_nio) Problem in command block : RequestBlock [Size=12 Code=2 Buffer=java.nio.HeapByteBuffer[pos=12 lim=12 cap=8192]
10/27 21:13:16 Cell(europa_1@europaDomain) : java.io.IOException: Connection reset by peer
10/27 21:13:16 Cell(europa_1@europaDomain) :    at sun.nio.ch.FileDispatcher.write0(Native Method)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at sun.nio.ch.IOUtil.write(IOUtil.java:60)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1166)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:570)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1474)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at java.lang.Thread.run(Thread.java:595)
10/27 21:13:16 Cell(europa_1@europaDomain) : (DCap_3_nio) FIN : SEEK_READ failed (IO not ok)
10/27 21:13:16 Cell(europa_1@europaDomain) : Exception in runIO for : 000200000000000000C93650 CacheException(rc=204;msg=Disk I/O Error )
10/27 21:13:16 Cell(europa_1@europaDomain) : CacheException(rc=204;msg=Disk I/O Error )
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:823)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1474)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
10/27 21:13:16 Cell(europa_1@europaDomain) :    at java.lang.Thread.run(Thread.java:595)
10/27 21:13:16 Cell(europa_1@europaDomain) : New Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,)
10/27 21:13:16 Cell(europa_1@europaDomain) : Disk I/O Error 

This is probably caused by an underlying disk error; diagnostic tools should be used to investigate the hardware.
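
For example, on the affected pool node (a sketch; device names are site-specific):

# Look for kernel-level I/O errors around the time of the failure
dmesg | grep -i 'i/o error'

# Check the SMART health of the disks behind the RAID set
smartctl -a /dev/sda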