RAL Tier1 SC4 Status

From GridPP Wiki

This page refers to events during WLCG's Service Challenge 4 from the RAL perspective in April through June 2006 and is thus largely of historical interest.

RAL's log for SC4 - similar to RAL Tier1 SC3 Log. Times are GMT+1


Tape Rates

Day             #files  seconds  MB/sec  MB/sec/drive  wr/mount
Wed 19th April    1283    31080   41.28         30.2       4.4
Thu 20th April    2530    86400   29.28         33.9       3.8
Fri 21st April    3454    86400   39.98         29.7       5.2
Sat 22nd April    2569    86400   29.73        36.15       4.2
Sun 23rd April    1377    86400   15.94         38.5       7.9
Mon 24th April    2680    86400   31.02        40.15     10.34
Tue 25th April    1833    86400   21.22        34.04       7.6
Wed 26th April    2676    86400   30.97        29.06      7.39
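
The MB/sec column appears to follow directly from the file count and the length of the measurement window. A minimal sketch that reproduces it, assuming every file is 1000 MB (an inference from the figures themselves, not something the log states); the function and constant names are hypothetical:

```python
# Assumption: each transferred file is 1000 MB. This is inferred from the
# table (files * 1000 / seconds matches the MB/sec column), not stated in the log.
ASSUMED_FILE_SIZE_MB = 1000


def average_rate_mb_per_sec(n_files, seconds, file_size_mb=ASSUMED_FILE_SIZE_MB):
    """Average throughput over the window, rounded to two decimals like the table."""
    return round(n_files * file_size_mb / seconds, 2)


# (day, files, seconds) copied from the Tape Rates table.
rows = [
    ("Wed 19th April", 1283, 31080),
    ("Thu 20th April", 2530, 86400),
    ("Fri 21st April", 3454, 86400),
    ("Mon 24th April", 2680, 86400),
]

for day, n_files, seconds in rows:
    print(day, average_rate_mb_per_sec(n_files, seconds))
```

The first row covers only 31080 seconds rather than a full day of 86400, which is why a modest file count still gives the highest average rate.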

Event Log

30th June 2006

Unloaded disk server

440 start Fri Jun 30 18:20:30 BST 2006
440   end Fri Jun 30 18:20:47 BST 2006
444 start Fri Jun 30 18:20:47 BST 2006
444   end Fri Jun 30 18:21:02 BST 2006
445 start Fri Jun 30 18:21:02 BST 2006
445   end Fri Jun 30 18:21:17 BST 2006
446 start Fri Jun 30 18:21:17 BST 2006
446   end Fri Jun 30 18:21:32 BST 2006
447 start Fri Jun 30 18:21:32 BST 2006
447   end Fri Jun 30 18:21:48 BST 2006
450 start Fri Jun 30 18:21:48 BST 2006
450   end Fri Jun 30 18:22:03 BST 2006
451 start Fri Jun 30 18:22:03 BST 2006
451   end Fri Jun 30 18:22:19 BST 2006
452 start Fri Jun 30 18:22:19 BST 2006
452   end Fri Jun 30 18:22:37 BST 2006

Busy disk server

440 start Fri Jun 30 18:22:47 BST 2006
440   end Fri Jun 30 18:25:12 BST 2006
444 start Fri Jun 30 18:25:12 BST 2006
444   end Fri Jun 30 18:26:56 BST 2006
445 start Fri Jun 30 18:26:56 BST 2006
445   end Fri Jun 30 18:27:30 BST 2006
446 start Fri Jun 30 18:27:30 BST 2006
446   end Fri Jun 30 18:28:15 BST 2006
447 start Fri Jun 30 18:28:15 BST 2006
447   end Fri Jun 30 18:28:38 BST 2006
450 start Fri Jun 30 18:28:38 BST 2006
450   end Fri Jun 30 18:30:39 BST 2006
451 start Fri Jun 30 18:30:39 BST 2006
451   end Fri Jun 30 18:31:04 BST 2006
452 start Fri Jun 30 18:31:04 BST 2006
452   end Fri Jun 30 18:31:30 BST 2006
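
The start/end pairs above can be reduced to one elapsed time per door. A minimal sketch, assuming each line has the form `DOOR start|end <date output>` as shown; the function names are hypothetical, and the zone token is simply dropped because every stamp in this log is BST:

```python
from datetime import datetime


def parse_timestamp(fields):
    """Parse `date` output fields, e.g. ['Fri', 'Jun', '30', '18:20:30', 'BST', '2006'].

    The weekday and zone tokens are ignored: all stamps share one zone,
    and strptime's %Z does not reliably recognise 'BST'.
    """
    _weekday, month, day, clock, _zone, year = fields
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def door_durations(log_text):
    """Return {door: elapsed seconds} from 'DOOR start|end <date output>' lines."""
    starts, durations = {}, {}
    for line in log_text.strip().splitlines():
        door, kind, *stamp = line.split()
        when = parse_timestamp(stamp)
        if kind == "start":
            starts[door] = when
        else:
            durations[door] = int((when - starts[door]).total_seconds())
    return durations


# Two doors taken from the "Busy disk server" run.
sample = """
440 start Fri Jun 30 18:22:47 BST 2006
440   end Fri Jun 30 18:25:12 BST 2006
445 start Fri Jun 30 18:26:56 BST 2006
445   end Fri Jun 30 18:27:30 BST 2006
"""
print(door_durations(sample))
```

Run over both blocks, this gives 15 to 18 seconds per door against the unloaded disk server and 23 to 145 seconds against the busy one.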

28th June 2006

Two gridftp doors timing out this morning - restarted.

Implemented automatic restart of dcache doors on gftp0440 & gftp0444 at 4am to see if that improves things - hopefully they should have a lower transfer count and behave more reliably.

Grabbed gftp0445 all to myself by removing the LoginBroker link from the gridftpdoor.batch file, which stops the SRM using it for TURLs, and ran some single stream transfers against some disk servers.

The first 2 results are after a restart of the dcache-core service, but without rebooting the system; the last 3 are after a reboot. Times are taken by running the date command, storing its output in an environment variable, doing the transfer, echoing the saved value and running date again. The busy and full disk servers are dCache's pick of csfnfs60-63. The quiet and empty disk server is on a different switch to the other two disk servers and the gftp door, which probably explains the full, quiet server being faster. Going from quiet to busy appears to reduce speed by a factor of about 5.

File size is 954MB (1000000000 Bytes)

Non rebooted dedicated gftp server to quiet & empty disk server

Wed Jun 28 17:55:13 BST 2006
Wed Jun 28 17:56:04 BST 2006
= 51 seconds

Non rebooted dedicated gftp server to busy & full disk server

Wed Jun 28 17:56:24 BST 2006
Wed Jun 28 18:00:25 BST 2006
= 241 seconds

Rebooted dedicated gftp server to quiet & empty disk server

Wed Jun 28 18:06:53 BST 2006
Wed Jun 28 18:07:27 BST 2006
= 34 seconds

Rebooted dedicated gftp server to busy & full disk server

Wed Jun 28 18:08:08 BST 2006
Wed Jun 28 18:11:09 BST 2006
= 181 seconds

Rebooted dedicated gftp server to quiet & fullish (140GB free) disk server

Wed Jun 28 18:16:49 BST 2006
Wed Jun 28 18:17:20 BST 2006
= 31 seconds
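
To put the elapsed times in the same units as the other rates on this page, they can be converted to throughput. A minimal sketch, assuming the 1000000000-byte (954MB) test file used for the door tests in this log and decimal megabytes; the names are hypothetical:

```python
FILE_SIZE_BYTES = 1_000_000_000  # test file size used for the door tests in this log


def throughput_mb_per_sec(elapsed_seconds, size_bytes=FILE_SIZE_BYTES):
    """Throughput in decimal MB/s for one transfer of size_bytes."""
    return size_bytes / 1_000_000 / elapsed_seconds


# Rebooted door: quiet & fullish server (31 s) versus busy & full server (181 s).
quiet = throughput_mb_per_sec(31)
busy = throughput_mb_per_sec(181)

print(f"quiet: {quiet:.2f} MB/s, busy: {busy:.2f} MB/s, slowdown: x{quiet / busy:.1f}")
```

For this pair the slowdown comes out a little under 6, in line with the rough factor of 5 noted above.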


Afterthought - The busy disk servers have been busy for a while without dcache-pool being restarted, while the quiet ones have been idle since their last restart of the dcache-pool service. The comparison is perhaps slightly unfair on those grounds - the busy servers may have suffered a performance drop because of this. Will retest tomorrow by restarting dcache-pool on one of the busy disk servers and doing targeted transfers to one of its pools.

22nd June 2006

  • 10:00 - No gridftp doors dropped overnight, but the powercut at CERN probably helped.

G-U-C speeds :

 gftp0440 : Thu Jun 22 09:54:38 BST 2006 to Thu Jun 22 09:55:06 BST 2006
 gftp0444 : Thu Jun 22 09:55:06 BST 2006 to Thu Jun 22 09:55:24 BST 2006
 gftp0445 : Thu Jun 22 09:55:24 BST 2006 to Thu Jun 22 09:55:55 BST 2006
 gftp0446 : Thu Jun 22 09:55:55 BST 2006 to Thu Jun 22 09:57:55 BST 2006
 gftp0447 : Thu Jun 22 09:57:55 BST 2006 to Thu Jun 22 09:59:57 BST 2006
 gftp0450 : Thu Jun 22 09:59:57 BST 2006 to Thu Jun 22 10:01:58 BST 2006
 gftp0451 : Thu Jun 22 10:01:58 BST 2006 to Thu Jun 22 10:02:26 BST 2006
 gftp0452 : Thu Jun 22 10:02:26 BST 2006 to Thu Jun 22 10:04:27 BST 2006

Looking at these results and yesterday's, later runs are slower than earlier runs, so I've reversed the list of doors to see if the slowdown is host based or time based:

 gftp0452 : Thu Jun 22 10:16:37 BST 2006 to Thu Jun 22 10:18:39 BST 2006
 gftp0451 : Thu Jun 22 10:18:39 BST 2006 to Thu Jun 22 10:20:41 BST 2006
 gftp0450 : Thu Jun 22 10:20:41 BST 2006 to Thu Jun 22 10:22:43 BST 2006
 gftp0447 : Thu Jun 22 10:22:43 BST 2006 to Thu Jun 22 10:24:45 BST 2006
 gftp0446 : Thu Jun 22 10:24:45 BST 2006 to Thu Jun 22 10:27:13 BST 2006
 gftp0445 : Thu Jun 22 10:27:13 BST 2006 to Thu Jun 22 10:27:57 BST 2006
 gftp0444 : Thu Jun 22 10:27:57 BST 2006 to Thu Jun 22 10:28:49 BST 2006
 gftp0440 : Thu Jun 22 10:28:49 BST 2006 to Thu Jun 22 10:30:08 BST 2006

Looking at these, it appears the higher numbered doors are on the whole slower than the lower numbered doors.
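
The comparison between the two orderings can be made explicit. A minimal sketch using the elapsed seconds computed from the timestamps above; the 100-second threshold and the function name are hypothetical, chosen only for illustration:

```python
# Elapsed seconds per door, computed from the two sets of timestamps above.
forward = {"gftp0440": 28, "gftp0444": 18, "gftp0445": 31, "gftp0446": 120,
           "gftp0447": 122, "gftp0450": 121, "gftp0451": 28, "gftp0452": 121}
reverse = {"gftp0440": 79, "gftp0444": 52, "gftp0445": 44, "gftp0446": 148,
           "gftp0447": 122, "gftp0450": 122, "gftp0451": 122, "gftp0452": 122}

SLOW_THRESHOLD_S = 100  # hypothetical cut-off, not taken from the log


def classify(door):
    """Label a door by whether it was slow in the forward run, the reversed run, or both."""
    slow_forward = forward[door] >= SLOW_THRESHOLD_S
    slow_reverse = reverse[door] >= SLOW_THRESHOLD_S
    if slow_forward and slow_reverse:
        return "slow in both orders"
    if not slow_forward and not slow_reverse:
        return "fast in both orders"
    return "order-dependent"


for door in sorted(forward):
    print(door, forward[door], reverse[door], classify(door))
```

A door that is slow regardless of its position in the run points at the host rather than at elapsed time. With this threshold, gftp0446, gftp0447, gftp0450 and gftp0452 are slow in both orders, gftp0440, gftp0444 and gftp0445 are fast in both, and only gftp0451 changes.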


21st June 2006

  • 14:00 - Long time no log.

I need somewhere to document my efforts to understand the current issues we're having with gridftp doors slowing down and hanging, and this is probably the most appropriate place.

Configuration Changes @ 1400

gftp0440 - currently using 1 stream per transfer instead of 10
    0444 - pnfs timeout increased to 1200 from 120 (was only running a gridftp door not a gsidcap door, but gsidcap door has now been restarted)
    0445 - /etc/sysctl.conf removed
    0446 - New version of dcache-server installed, pnfs timeout increased as gftp0444
    0447 - Using j2sdk 1.4.2-08 rather than j2re 1.4.2-01. pnfs timeout increased
    0450 - Performance Markers disabled
    0451 - PoolManager timeout to 54000 from 5400
    0452 - Same as gftp0444


So far, only gftp0440 & gftp0445 have not yet stopped working with the

  kernel: TCP: too many of orphaned sockets

error message


gftp Door time test - 446 and 452 were broken at the time, file size is 954MB (1000000000 Bytes), from RAL ui.

gftp0440 : Wed Jun 21 14:25:36 BST 2006 to Wed Jun 21 14:25:58 BST 2006
gftp0444 : Wed Jun 21 14:25:58 BST 2006 to Wed Jun 21 14:27:59 BST 2006
gftp0445 : Wed Jun 21 14:27:59 BST 2006 to Wed Jun 21 14:28:18 BST 2006
 Cancelling copy...
gftp0446 : Wed Jun 21 14:28:18 BST 2006 to Wed Jun 21 14:35:42 BST 2006
gftp0447 : Wed Jun 21 14:35:42 BST 2006 to Wed Jun 21 14:37:41 BST 2006
gftp0450 : Wed Jun 21 14:37:41 BST 2006 to Wed Jun 21 14:39:41 BST 2006
gftp0451 : Wed Jun 21 14:39:41 BST 2006 to Wed Jun 21 14:39:57 BST 2006
 Cancelling copy...
gftp0452 : Wed Jun 21 14:39:57 BST 2006 to Wed Jun 21 14:40:49 BST 2006

2nd May 2006

  • 17:00 - Network restored at ~16:30 - transfers immediately back and running. It took a speculative prod to one of the router ports on UKLR to get the links back, after some cards were reset on the Ciena kit. No obvious reason why this should have cured the 'only 3 of the 4 links working' problem.

Martin Bly 17:10, 2 May 2006 (BST)

29th April 2006

  • 19:00 - Transfers stopped at ~10:00 - looks like OPN down though difficult to say for sure. Ping to castorgridsc.cern.ch fails. Two warnings 'major critical' up on the Ciena box. Not good. To whom do we report this?

Martin Bly 19:15, 29 Apr 2006 (BST)

28th April 2006

  • 15:45 - Minor network break for cable swap
  • 15:00 - Upped PnfsManager threads to 8, restarted dcache-core on pnfs to take effect
  • 12:00 - Converted two pools on nfs57 into hsm backed pools and added in.
  • 11:30 - Restarted dcache-core on gftp0451 and gftp0452 - being idle

26th April 2006

  • 13:12 - Various network problems manifested themselves this morning after the stack reset yesterday, seeming to indicate a fault in one unit of stack 5510-2: the one that had the dcache units attached (before they were moved). A further reset of the suspect unit failed to clear the problems so the unit was power-cycled at around 10am. This appears to have cured the problem - all the systems (mostly batch workers) on the affected unit have returned to normal operation.

dcache.gridpp.rl.ac.uk has been returned to 5510-2 from 5530-0b. dcache-head still on 5530-0b. dcache-tape has been on 5510-2 the whole time. gftp0440 and gftp0444 restarted.

Martin Bly 13:27, 26 Apr 2006 (BST)

25th April 2006

  • 18:00 - Down from approximately 5am to 5pm due to an odd network issue - lots of timeouts, and the various dCache components were having insurmountable difficulty communicating with each other. Moved two systems onto a different stack, which allowed us to get the SE(s) back up and running.


24th April 2006

  • 16:13 - Increased number of streams back to 5
  • 11:00 - Set sysctl -w vm.max-readahead=512 vm.min-readahead=256 on the 4 tape cache disk servers

23rd April 2006

  • 23:30 - Transfers resumed from CERN end at ~22:00.
  • 20:30 - I tried that (on the same box) and it didn't work! Ho Hum. No traffic on UKLight since around 11:45, link absolutely dead (on cacti) save for low level chatter from around 15:45. Probably stopped by CERN end. M.
  • 19:33 - Changed the permissions on the directory, someone else reported a similar problem a couple of days ago I think where a directory changed permission for no apparent reason.
  pnfs.gridpp.rl.ac.uk# chown dteam001:dteam /pnfs/gridpp.rl.ac.uk/tape/dteam/sc3/2006-04-23
 

Steve Traylen 19:42, 23 Apr 2006 (BST)

  • 16:15 - Looking at this report from Maarten L:
   "xfers to RAL have produced more and more failures during Sun. morning, and since 11 GMT 
    all attempts fail as follows:
   
   user has no permission to write into path
   /pnfs/gridpp.rl.ac.uk/tape/dteam/sc3/2006-04-23"

This has permissions root:root rather than dteam001:dteam, gained around 4am. Tried doing the obvious with chown on various boxes but no dice: dcache obviously not the same as a standard fs. Derek will need to reset the permissions.

22nd April 2006

  • 01:30 - Changed io elevator tuning from r=64,w=8192 to r=8192,w=8192 on nfs61
  • 00:20 - Rate to tape had also dropped at same time - Failed_Tape_Open alarm for dougal possibly related?, seemed to recover without any intervention on my part
  • 00:15 - Report on sc-tech of timeouts on files since 23:30, killed lingering idle transfers, rate increased


21st April 2006

  • 17:10 - Reduced fts channel to 1 stream from 5
  • 17:00 - Changed io elevator tuning from r=64,w=8192 to r=8192,w=64 on nfs60
  • 15:00 - Increased concurrent stores to 3 per server (from 2)
  • 11:00 - OPN to CERN went down

20th April 2006

  • 17:15 - OPN recovered
  • 13:43 - OPN to CERN went down
  • 10:30 - Rates to tape slow overnight - increase movers on each pool to 10 to give a pool more chance of a gap with no incoming traffic

19th April 2006

  • 17:03 - Reduced movers on the 4 tape pools to maximum of 5 (from 50) to try to smooth out network traffic on each server
  • 13:55 - Restarted gftp0440

18th April 2006

  • 08:15: Yee-hah #2! 160.10MB/s average for 17/04.

17th April 2006

  • 12:27: Yee-hah! 151.04MB/s average for 16/04. Mind, it's interesting that this happens when no one is on-line a-fiddling!
  • 09:16: Observation: Looking at the Cacti plots, there is a cyclic drop in transfer rate every 6.5 hours.

13th April 2006

  • 09:43 - Created new PoolManager link to give csfnfs42 pools less priority than others
  • Observation: RAL rate appeared to drop from the initial 200MB/s straight after the restart from DB maintenance yesterday to less than 150MB/s, in correlation with the upturn in traffic to PIC.

12th April 2006

  • 15:00 - dCache servers csfnfs51,54,56,57 now connected to 5530-0b.
  • 11:30 - Moved csfnfs63 onto 5530-0b
  • 08:20 - Big spike in network stats on Cacti for UKLR and most links, due to jam on server caused by /var becoming full.

11th April 2006

  • 22:52 - Restarted doors on gftp0446
  • 22:10 - See a lot of pnfs nfs server not responding - upped pnfscopies from 4 to 8 and shmclients from 8 to 12 in /usr/etc/pnfssetup on pnfs, to see if it helps
  • 13:35 - Rate still varying widely - implemented SARA scheme - reduced movers from 10 to 2 on pools
  • 10:30 - csfnfs61 moved to 5530-0b
  • 10:15 - csfnfs60 moved to 5530-0b
  • 09:20 - Removed dteam access from csfnfs39 pools - insufficient space

10th April 2006

  • 15:30 - added gftp0440 & gftp0450 back.
  • 15:00 - network interruption to put switch in between UKLight Router and Tier 1 head switch to hang gftp doors from, intention is to move disk servers up also.
  • 13:25 - restarted dcache-core on gftp0452
  • 13:11 - added ulimit tweak to nfs57 and restarted
  • 11:40 - gftp0445 & gftp0452 moved from 5510-2 to 5530-0a
  • 11:10 - gftp0444 moved from 5510-2 to 5530-0a
  • 11:00 - csfnfs57 back into production

8th April 2006

  • 22:30 - restarted dcache-core on gftp0446

7th April 2006

  • 12:57 - removed gftp0440 & 450 to see what effect it has on link distribution
  • 09:24 - nfs57 has broken overnight

6th April 2006

  • 17:00 - There is a correlation between the addition of the 4 new gridftp doors and the unbalanced link
[Figure: Trunk-maxed.png - 1 link near max]
  • 16:30 - One of the gigabit links in a 4x1 trunk link between 2 T1 switches is close to maxed out
  • 13:00 - csfnfs54 just too full - removed dteam access, added fixed csfnfs42 pools
  • 10:07 - pools csfnfs54_1 & _3 were getting full, switched dteam activity to csfnfs54_2 & _4
  • 09:50 - ditto gftp0444
  • 09:30 - Restarted dcache-core on gftp0440 - going slow

5th April 2006

  • 20:55 - Restarted gftp0440 & gftp0447 - lots of sockets with data queued but not going anywhere
  • 16:30 - Speed has dipped, added pool from other array from each system back into dteam pool group
  • 16:00 - Removed dteam access from all but one pool on each system - trying to increase distribution across systems
  • 14:00 - Set space cost factor to 0 - free space is so vastly different between pools
  • 11:25 - Set all non-tape/non csfnfs58 pools to 10 gftp transfers (from 50) to try and improve distribution
  • 11:00 - Transfers stopped - mailed sc-tech
  • 09:20 - CERN paused for DB move: restarted dcache-core on gftp0447 as it's not been pulling its weight, upped max running put transfers per user (if other transfers queued) from 10 to 30 on srm to see if it makes a difference.

4th April 2006

  • 21:25 - restarted dcache-core on gftp0445 & gftp0446 - both idle for >1 hour
  • 18:20 - restarted dcache-core on gftp0444 - LoginBroker claiming 11 transfers, but no network traffic evident
  • 18:00 - added 4 new gftp servers
  • 16:15 - csfnfs42 has dropped a drive from one of its arrays, removed dteam access from both to give it a break

3rd April 2006

  • 14:10 - No significant effect from cpu cost change - remove 3 of 4 pools of csfnfs57 from dteam pool group
  • 13:50 - Rate has dropped since addition of pools - set cpu cost factor to 2.0 (from 1.0) to get dCache to prefer quiet pools over empty pools
  • 12:00 - Added 4 pools from csfnfs57 to dteam pool group