dCache Log Message Archive
This page is intended to contain a list of dCache log messages and corresponding explanations of their cause. Where a message is caused by a real problem in dCache, potential solutions are suggested. Much of the material and the solutions come from the dCache user-forum mailing list.
Contents
- 1 lm -> Cell Not Found
- 2 Cell not found
- 3 TransferManager errortoo many transfers
- 4 active transfers on webpage: all dcap connections are "staging"
- 5 Authorization Service failed
- 6 n must be positive
- 7 Handle not in the proper state
- 8 Repository got lost
- 9 Low Level Exc : java.io.IOException: Too many open files
- 10 Pool too high
- 11 Client error: TransferManager errortoo many transfers!
- 12 Pool request interrupted
- 13 Disk I/O error
lm -> Cell Not Found
lm is the dCache Location Manager. This service is required to ensure that all of the dCache components (cells) can communicate with each other. When dCache is starting up, it is common to see error messages related to lm in the logs, since not all of the components are able to talk to one another yet. These errors should disappear after a minute or so.
Cell not found
(local) admin > cd Pnfsmanager
(Pnfsmanager) admin > help
No Route to cell for packet {uoid=<1170490140376:118>;path=[>Pnfsmanager@local];msg=Tunnel cell >Pnfsmanager@local< not found at >dCacheDomain<
This is a typo: cell names are case-sensitive, and the correct name is PnfsManager.
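With the correct capitalization the commands work, for example:
(local) admin > cd PnfsManager
(PnfsManager) admin > help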
TransferManager errortoo many transfers
In catalina.out:
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:861)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1182)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at java.lang.Thread.run(Thread.java:595)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : CopyFileRequest #-2147439634: copy failed
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : org.dcache.srm.SRMException: TransferManager errortoo many transfers!
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at diskCacheV111.srm.dcache.Storage.performRemoteTransfer(Storage.java:3801)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at diskCacheV111.srm.dcache.Storage.getFromRemoteTURL(Storage.java:3662)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:718)
01/31 13:55:08 Cell(SRM-dcsrm@srm-dcsrmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:833)
Make sure that the RemoteGsiftpTransferManager maximum number of transfers (the variable remoteGsiftpMaxTransfers, used in utility.batch) is at least as high as the maximum number of copy requests running (the variable srmCopyReqThreadPoolSize, used in srm.batch). Do not edit srm.batch or utility.batch directly; instead, use dCacheSetup to override the default values of these variables.
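As a minimal sketch, assuming the defaults are overridden in config/dCacheSetup (the values below are purely illustrative, not recommendations):
# config/dCacheSetup (illustrative values only)
remoteGsiftpMaxTransfers=150
srmCopyReqThreadPoolSize=150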
active transfers on webpage: all dcap connections are "staging"
In dCacheDomain.log:
01/15 16:25:12 Cell(l-100-Unknown-126@dCacheDomain) : runIO : java.io.EOFException
01/15 16:25:12 Cell(l-100-Unknown-126@dCacheDomain) : acceptThread : java.io.EOFException
01/15 16:25:12 Cell(RoutingMgr@dCacheDomain) : update can't send update to RoutingMgr{uoid=<1168874712338:7058545>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
01/15 16:25:16 Cell(RoutingMgr@dCacheDomain) : update can't send update to RoutingMgr{uoid=<1168874716509:7058744>;path=[>RoutingMgr@local];msg=Missing routing entry for RoutingMgr@local}
java.lang.IllegalArgumentException: Protocol not found : null
at diskCacheV111.poolManager.RequestContainerV5.addRequest(RequestContainerV5.java:502)
at diskCacheV111.poolManager.PoolManagerV5.messageArrived(PoolManagerV5.java:500)
at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:918)
at dmg.cells.nucleus.CellNucleus.run(CellNucleus.java:412)
at java.lang.Thread.run(Thread.java:595)
This was fixed by restarting dCache via its startup scripts:
ourHomeDir=/opt/d-cache/
. ${ourHomeDir}/config/dCacheSetup
${ourHomeDir}/jobs/dCache ${logParam} stop
${ourHomeDir}/jobs/dCache ${logParam} start
Authorization Service failed
The gridftpdoor.log file reports:
error response: 530 530 Authorization Service failed: diskCacheV111.services.authorization.AuthorizationServiceException: \ authRequestID 8465657 recevied exception null
This was seen when there was a large number of TCP connections in a CLOSE_WAIT state on the door node. It may be that the delegation step to gPlazma is failing because of the open connections. The underlying exception is uncaught and comes from a dependent jar, hence the lack of detail in the error message. However, when the errors start occurring, you can log in to the gPlazma cell and execute:
set LogLevel DEBUG
Additional debug output should then show at what point authentication is failing.
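For illustration, the number of half-closed connections on the door node can be counted with netstat, and the log level changed through the admin interface (the host name and the legacy admin port 22223 are assumptions here, not taken from the report above):
netstat -tn | grep CLOSE_WAIT | wc -l
ssh -c blowfish -p 22223 -l admin admin-node.example.org
(local) admin > cd gPlazma
(gPlazma) admin > set LogLevel DEBUG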
n must be positive
lcg-cr is one of the LCG/gLite data management tools. It allows users to copy a file to a Storage Element (SE) and register it in the file catalogue.
lcg-cr -v --vo ops -d globe-door.ifh.de -l lfn:sft-lcg-rm-cr-globe50.ifh.de.07021406262 file:///var/home/ops000/globus-tmp.globe50.6984.0/WMS_globe50_07566_https_3a_2f_2frb115.cern.ch_3a9000_2f7Vvtb6Xut29X79yHC8Cr9w/work/testjob/nodes/lcg-ce0.ifh.de/sft-lcg-rm-cr.txt
getting request #-2145945304 generated error : java.lang.IllegalArgumentException: n must be positive
lcg_cr: Communication error on send
In the srm.log file:
02/14 07:26:55 Cell(SRM-globe-door@srm-globe-doorDomain) : srm: creating a failed request status with a message: getting request #-2145945302 generated error : java.lang.IllegalArgumentException: n must be positive
This is, however, only a temporary problem; a few seconds later everything works again.
Handle not in the proper state
When copying a file in with srmcp or globus-url-copy, errors like the following are seen:
debug: fault on connection to gsiftp://pool1.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/srm3-18139-1.txt: Handle not in the proper state
debug: error reading response from gsiftp://pool1.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/srm3-18139-1.txt: the handle 0x9330734 was already registered for closing
debug: data callback, no error, buffer 0xb74d7008, length 0, offset=0, eof=true
debug: operation complete
The relevant lines from the pool domain and PNFS domain logs are shown below.
pool domain:
03/27 23:17:05 Cell(pool1_23@pool1Domain) : getChecksumFromPnfs : No crc available for 000E0000000000000021CE00
03/27 23:17:05 Cell(pool1_23@pool1Domain) : PnfsHandler : CacheException (4) : Pnfs error : Pnfs lookup failed
03/27 23:17:05 Cell(pool1_23@pool1Domain) : Exception in runIO for : 000E0000000000000021CE00 CacheException(rc=4;msg=Pnfs error : Pnfs lookup failed)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : CacheException(rc=4;msg=Pnfs error : Pnfs lookup failed)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.util.PnfsHandler.pnfsRequest(PnfsHandler.java:156)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.util.PnfsHandler.setFileSize(PnfsHandler.java:96)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1455)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at java.lang.Thread.run(Thread.java:595)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : removeing empty file: 000E0000000000000021CE00
03/27 23:17:05 Cell(pool1_23@pool1Domain) : PnfsHandler : CacheException (10001) : Pnfs error : 000E0000000000000021CE00 not found
03/27 23:17:05 Cell(pool1_23@pool1Domain) : Stacked Exception (Original) for : 000E0000000000000021CE00 <-P----RD---(0)[0]> 0 si={dteam:STATIC} : CacheException(rc=4;msg=Pnfs error : Pnfs lookup failed )
03/27 23:17:05 Cell(pool1_23@pool1Domain) : Stacked Throwable (Resulting) for : 000E0000000000000021CE00 <-P----RD---(0)[0]> 0 si={dteam:STATIC} : CacheException(rc=10001;msg=Pnfs error : 000E0000000000000021CE00 not found )
03/27 23:17:05 Cell(pool1_23@pool1Domain) : CacheException(rc=10001;msg=Pnfs error : 000E0000000000000021CE00 not found)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.util.PnfsHandler.pnfsRequest(PnfsHandler.java:156)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.util.PnfsHandler.deletePnfsEntry(PnfsHandler.java:357)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1525)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
03/27 23:17:05 Cell(pool1_23@pool1Domain) : at java.lang.Thread.run(Thread.java:595)
pnfs domain:
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : Exception in setLengthjava.lang.IllegalArgumentException: Failed : CacheException(rc=10001;msg=no such file or directory000E0000000000000021CE00 )
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : java.lang.IllegalArgumentException: Failed : CacheException(rc=10001;msg=no such file or directory000E0000000000000021CE00)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.namespace.provider.BasicNameSpaceProvider.getFileMetaData(BasicNameSpaceProvider.java:274)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.namespace.PnfsManagerV3.setLength(PnfsManagerV3.java:873)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.namespace.PnfsManagerV3.processPnfsMessage(PnfsManagerV3.java:1081)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:984)
03/27 23:17:05 Cell(PnfsManager@pnfsDomain) : at java.lang.Thread.run(Thread.java:595)
This problem can be caused by an incorrectly configured GridFTP port range in dCache. Check that all of the port ranges in the dCacheSetup file of your GridFTP doors are set correctly and that the relevant ports are open in your firewall.
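As a rough sketch, assuming the door uses the legacy clientDataPortRange setting (the variable name and the range below are assumptions and may differ between dCache versions), the setup and a matching firewall rule might look like:
# config/dCacheSetup on the GridFTP door (assumed variable name, example range)
clientDataPortRange=20000:25000
# matching firewall rule on the door/pool nodes (example)
iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT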
Repository got lost
During normal operation, the dCache runs a background process that tries to write a small test file to each pool in order to determine that the pool is still operating. When this operation fails, the following message will appear in the usageInfo table of the dCache webpage:
pool1_04 pool1Domain [99] Repository got lost
This situation typically arises from a filesystem problem or a slow disk response. The background process does not appear to retry the operation once a failure has occurred, so a restart of the pool process will usually rectify the situation.
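A hypothetical restart, assuming the classic /opt/d-cache layout used elsewhere on this page (the exact script depends on how the pool domain was started on your node):
/opt/d-cache/bin/dcache-pool stop
/opt/d-cache/bin/dcache-pool start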
Low Level Exc : java.io.IOException: Too many open files
The usageInfo webpage will sometimes display this message:
pool1_05 pool1Domain [204] Low Level Exc : java.io.IOException: Too many open files
It should be possible to fix this issue by increasing the maximum number of open file descriptors available to the pool process.
ulimit -n 32768
This command can be placed in the file d-cache/jobs/dcache.local.sh, which is sourced on startup, so that the changed ulimit only affects the dCache processes.
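For illustration, the file then simply contains the new limit, and the number of descriptors actually in use by a pool's Java process can be checked under /proc (the pid lookup below is only an example and assumes the domain is called pool1Domain):
# /opt/d-cache/jobs/dcache.local.sh
ulimit -n 32768
# count the descriptors currently held by the pool domain's Java process (example)
ls /proc/$(pgrep -f pool1Domain | head -1)/fd | wc -l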
Pool too high
For example, a transfer fails with:
"Pool manager error: Best pool <f01-101-103-e_T_dteam> too high : 2.0000000001E8"
This happens because the most suitable pool for the request is full (or nearly full), which makes the cost of the transfer too high. Check that there is space left on the pools in question.
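For example, the free space on a suspect pool can be inspected from the admin interface; the pool name below is taken from the message above, and the info command on the pool cell summarises its space usage:
(local) admin > cd f01-101-103-e_T_dteam
(f01-101-103-e_T_dteam) admin > info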
Client error: TransferManager errortoo many transfers!
RequestFileStatus#-2146468676 failed with error:[ at Wed Aug 08 15:56:42 BST 2007 state RetryWait : nonfatal error [org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!] retrying
at Wed Aug 08 15:57:42 BST 2007 state RetryWait : nonfatal error [org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!] retrying
These are errors on the dCache side: the SRM was trying to schedule more transfers than the RemoteGsiftpTransferManager was configured to handle. This usually indicates a misconfiguration of the dCache system. Check the contents of the dCacheSetup file for the following:
The maximum number of active SRM copy jobs (specified by the -copy-req-thread-pool-size=${srmCopyReqThreadPoolSize} option) should be less than or equal to the maximum number of TransferManager copies (the -max_transfers=${remoteGsiftpMaxTransfers} option). The latter should be roughly equal to the sum of the WAN movers on the pools; a worked example is sketched below.
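Purely for illustration, assuming 10 pools each configured with 25 WAN movers, the sum of WAN movers is 250, so the dCacheSetup values might be:
remoteGsiftpMaxTransfers=250
srmCopyReqThreadPoolSize=250
where srmCopyReqThreadPoolSize must not exceed remoteGsiftpMaxTransfers.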
Pool request interrupted
In the GridFTP logs on the pool nodes, entries like the following appear:
10/24 13:30:39 Cell(GFTP-pool2-Unknown-3590@gridftp-pool2Domain) : pool request interrupted
10/24 13:30:41 Cell(GFTP-pool2-Unknown-3579@gridftp-pool2Domain) : pool request interrupted
10/24 13:30:50 Cell(GFTP-pool2-Unknown-3588@gridftp-pool2Domain) : pool request interrupted
This is normally a result of the FTP client dropping the control channel. The FTP door then breaks out of whatever it was doing, and this is what is noted in the log.
Disk I/O error
Pools go offline with a Disk I/O error. The files are readily accessible, the error pops up randomly across different pools, each on a different RAID 5 array, and the offending files can be copied around without problems.
10/27 21:13:16 Cell(europa_1@europaDomain) : (DCap_3_nio) Problem in command block : RequestBlock [Size=12 Code=2 Buffer=java.nio.HeapByteBuffer[pos=12 lim=12 cap=8192]
10/27 21:13:16 Cell(europa_1@europaDomain) : java.io.IOException: Connection reset by peer
10/27 21:13:16 Cell(europa_1@europaDomain) : at sun.nio.ch.FileDispatcher.write0(Native Method)
10/27 21:13:16 Cell(europa_1@europaDomain) : at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
10/27 21:13:16 Cell(europa_1@europaDomain) : at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
10/27 21:13:16 Cell(europa_1@europaDomain) : at sun.nio.ch.IOUtil.write(IOUtil.java:60)
10/27 21:13:16 Cell(europa_1@europaDomain) : at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:302)
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.movers.DCapProtocol_3_nio.doTheRead(DCapProtocol_3_nio.java:1166)
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:570)
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1474)
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
10/27 21:13:16 Cell(europa_1@europaDomain) : at java.lang.Thread.run(Thread.java:595)
10/27 21:13:16 Cell(europa_1@europaDomain) : (DCap_3_nio) FIN : SEEK_READ failed (IO not ok)
10/27 21:13:16 Cell(europa_1@europaDomain) : Exception in runIO for : 000200000000000000C93650 CacheException(rc=204;msg=Disk I/O Error )
10/27 21:13:16 Cell(europa_1@europaDomain) : CacheException(rc=204;msg=Disk I/O Error )
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.movers.DCapProtocol_3_nio.runIO(DCapProtocol_3_nio.java:823)
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.pools.MultiProtocolPool2$RepositoryIoHandler.run(MultiProtocolPool2.java:1474)
10/27 21:13:16 Cell(europa_1@europaDomain) : at diskCacheV111.util.SimpleJobScheduler$SJob.run(SimpleJobScheduler.java:64)
10/27 21:13:16 Cell(europa_1@europaDomain) : at java.lang.Thread.run(Thread.java:595)
10/27 21:13:16 Cell(europa_1@europaDomain) : New Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,)
10/27 21:13:16 Cell(europa_1@europaDomain) : Disk I/O Error
This is most likely caused by an underlying disk or RAID error; hardware diagnostic tools should be used on the affected nodes.
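As a generic example (the device name is a placeholder and smartmontools is assumed to be installed), the disks behind the affected pool can be checked with:
# SMART health and error counters for one disk (device name is an example)
smartctl -a /dev/sda
# look for kernel-level I/O errors on the pool node
dmesg | grep -i -e "I/O error" -e sd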