RAL Tier1 SC3 Log

This page refers to events during WLCG Service Challenge 3 from the RAL perspective, which took place in January 2006; as such, it is largely of historical interest.


Log of RAL's participation in Service Challenge 3. This log was previously maintained elsewhere. The main service at RAL involved in SC3 was the RAL dCache service.


24/01/2006

[Plot (File:CERN-RAL-24012006.jpg): CERN to RAL, 24th January 2006]
  • 10:10 - Uploaded plot showing 12 hours receiving data at >= 150MB/s
  • 09:30 - csfnfs60 has stopped receiving data, nothing in log file, dcache-pool restarted
  • 00:20 - csfnfs61 has stopped receiving data, nothing in log file, dcache-pool restarted

23/01/2006

[Plot (File:CERN-RAL-20060123.jpg): CERN to RAL, 23rd January 2006]
  • 15:56 - Uploaded plot of our record hour so far.
  • 14:06 - forced clear out of duplicate CMS files on csfnfs39_1 and csfnfs39_2 pools to increase free space
  • 13:44 - did manual clear out of sc3 dir, plus misc files in dteam directory; set clearout cron to hourly (a cron sketch follows this list)
  • 13:25 - restarted dcache-core on gftp0446 & gftp0447 - transfers stuck not doing much
  • 13:17 - no traffic going through gftp0445 - restarted dcache-core
  • 11:50 - csfnfs62: "Too many open files" - restarted dcache-pool; the init.d script had been updated, but only after the restart on the 13th
  • 11:45 - gftp0444 showed 43 gftp transfers active, but very little traffic, restarted dcache-core
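
The hourly clear-out referred to above was driven by cron. As a minimal sketch only - the actual script name and location are not recorded in this log and are assumed here - an hourly entry might look like:

 # /etc/cron.d/sc3-clearout (hypothetical file and script path)
 # Run the SC3 clear-out script at the top of every hour as root
 0 * * * *  root  /opt/dcache-admin/sc3-clearout.sh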

21/01/2006

  • 20:44 - csfnfs42 and 54 show "Too many open files", updated startup scripts, restarted dcache-pool

20/01/2006

  • 23:25 - noticed that CERN-RAL channel is now at 40 transfers
  • 10:06 - csfnfs51 giving too many open files errors, added ulimit, restarted

19/01/2006

  • 16:25 - confirmed as FTS problems at CERN
  • 16:15 - transfers stopped at 15:40, srm seems fine, but FTS has no active transfers for CERN-RAL channel, mailed CERN
  • 09:55 - increased frequency of cleaning script to 3 hours

18/01/2006

  • 15:55 - transfers had stopped to csfnfs51 - all pools filled - did manual run of deleter cron
  • 13:17 - gmond on pnfs.gridpp hung, restarted
  • 13:02 - Increased number of FTS files from 20 to 30 (see FTS Interaction below)
  • 11:10 - gftp0447 had hung about 10:00, rebooted
  • 10:53 - copied updated sysctl.conf onto gftp0445 to get picked up at next reboot - rate to gftp0444 not any worse and it hasn't crashed yet
  • 10:50 - "pool restarted" messages in PoolManager from csfnfs63's pools have stopped; it looks like two dcache-pool instances were running on csfnfs63
  • 09:50 - more transfers queued on csfnfs63_4 pool, restarting dcache-pool on csfnfs63
  • 09:25 - gftp0445,446 locked up around 05:00

17/01/2006

  • 23:23 - All is well.
  • 17:16 - 57 transfers queued on csfnfs63_4 pool, increased max movers from 50 to 100 (see the admin-shell sketch after this list)
  • 09:30 - All gftp systems locked up at 05:00, have now been rebooted; gftp0444 has picked up the new sysctl settings
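
Changing a pool's mover limit is done from the dCache admin shell. A hedged sketch, assuming the standard admin door on port 22223 and a made-up admin host name; the pool name is the one from the log entry above:

 # Connect to the dCache admin interface (host name is an assumption)
 $ ssh -c blowfish -p 22223 admin@dcache-admin.gridpp.rl.ac.uk
 (local) admin > cd csfnfs63_4
 (csfnfs63_4) admin > mover set max active 100
 (csfnfs63_4) admin > save
 (csfnfs63_4) admin > ..
 (local) admin > logoff

The save command persists the new limit to the pool's setup file so it survives a dcache-pool restart.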

16/01/2006

  • 19:30 - No data has been flowing through gftp0445 for the last couple of hours but I don't know why.
  • 09:29 - csfnfs39 - Too many open files - dcache-pool restarted
  • 01:07 - Transfer balance has improved - restarted gftp0447's dcache-core

15/01/2006

  • 23:10 - Restarted dcache-core on gftp0444 to see if it sorts out transfer imbalance
[Plot (File:CERN-RAL-FTP-SERVER-20050115.jpg): CERN to RAL, 15th Jan 2006]
  • 15:48 - Uploaded picture showing that two dead GridFTP servers do not make a large difference.
  • 13:43 - Noticed gftp0445 & gftp0446 had hung - rebooted remotely
  • 13:40 - csfnfs53 showing "Too many open files", copied startup script from nfs60, restarted dcache-pool
  • 10:38 - csfnfs60 showing "Too many open files" messages, dcache-pool restarted

14/01/2006

  • 18:50 - CASTOR running and data arriving at RAL again.
  • 03:45 - Transfers have completely stopped - CERN CASTOR problem?
  • 03:45 - Maarten Litmaath reported failures to two pools on csfnfs63, on investigation noticed frequent PoolRestarted messages for csfnfs63's pools in PoolManager pinboard, decided to restart dcache-pool service on csfnfs63.

13/01/2006

  • 10:56 - Added ulimit -n 16384 to csfnfs63:/etc/init.d/dcache-pool to match the other disk servers; this will take effect at the next restart. I had missed this one out. (A sketch of the change follows this list.)
  • 09:50 - csfnfs62's pools were showing "Too many open files" errors - restarted dcache-pool service
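
The "Too many open files" fix referred to throughout this log is a raised file-descriptor limit in the pool start-up script. A minimal sketch of the change, assuming a typical init script layout (the surrounding lines will differ on the real servers):

 # /etc/init.d/dcache-pool (excerpt)
 start() {
     # Raise the per-process file-descriptor limit before the pool JVM starts,
     # so the pools stop hitting "Too many open files"
     ulimit -n 16384
     # ... existing dcache-pool start-up commands unchanged ...
 }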

12/01/2006

[Plot (File:CERN-RAL-20050112.jpg): CERN to RAL, 12 January 2006]

Start of the Service Challenge CERN-disk to RAL-disk transfers.

  • A good rate overnight, up to 1.3 Gbit/s.
  • 17:50 - Added all dCache pools to dteam pool group on UKLight visible servers, except for 4 tape buffer pools.
  • 17:00 - Updated gftp0444's /etc/sysctl.conf with new values, which will take effect on the next reboot (an illustrative example follows this list).
  • 15:00 - Two of the GridFTP doors crashed, shortly after Martin L raised us to 30 concurrent files.
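
The actual sysctl.conf values are not reproduced in this log; the snippet below is only illustrative of the kind of TCP buffer tuning applied to GridFTP doors for high bandwidth-delay-product links such as UKLight, not the settings that were used:

 # /etc/sysctl.conf (illustrative values only)
 # Larger socket buffers for wide-area, high-bandwidth transfers
 net.core.rmem_max = 8388608
 net.core.wmem_max = 8388608
 net.ipv4.tcp_rmem = 4096 87380 8388608
 net.ipv4.tcp_wmem = 4096 65536 8388608

As the log notes, the servers only pick the new values up at the next reboot; running sysctl -p would apply them immediately.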

FTS Interaction

The state of the FTS channel can be queried with:

 $ glite-transfer-channel-list \ 
     -s https://sc3-fts-external.cern.ch:8443/site-fts/glite-data-transfer-fts/services/ChannelManagement  \
     CERN-RAL
 Channel: CERN-RAL
 Between: CERN-SC and RAL
 State: Active
 Contact: lcg-support@gridpp.rl.ac.uk
 Bandwidth: 0
 Nominal throughput: 0
 Number of files: 30, streams: 5
 Number of VO shares: 5
 VO 'dteam' share is: 20
 VO 'alice' share is: 20
 VO 'atlas' share is: 20
 VO 'cms' share is: 20
 VO 'lhcb' share is: 20
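
Channel parameters such as the number of concurrent files (raised from 20 to 30 on 18/01) were changed with the companion glite-transfer-channel-set command. A hedged sketch only; the -f option for the number of files is an assumption and should be checked against the command's --help output:

 $ glite-transfer-channel-set \
     -s https://sc3-fts-external.cern.ch:8443/site-fts/glite-data-transfer-fts/services/ChannelManagement \
     -f 30 CERN-RAL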


Links