This page records events during WLCG's Service Challenge 4 from the RAL perspective, April through June 2006, and is thus largely of historical interest.

RAL's log for SC4 - similar to the RAL Tier1 SC3 Log. Times are GMT+1.

Tape Rates

Day	#files	seconds	MB/sec	MB/sec/drive	wr/mount
Wed 19th April	1283	31080	41.28	30.2	4.4
Thu 20th April	2530	86400	29.28	33.9	3.8
Fri 21st April	3454	86400	39.98	29.7	5.2
Sat 22nd April	2569	86400	29.73	36.15	4.2
Sun 23rd April	1377	86400	15.94	38.5	7.9
Mon 24th April	2680	86400	31.02	40.15	10.34
Tue 25th April	1833	86400	21.22	34.04	7.6
Wed 26th April	2676	86400	30.97	29.06	7.39
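
The MB/sec column appears to be the day's file count multiplied by a nominal file size and divided by the elapsed seconds. A minimal sketch of that arithmetic, assuming every file is 1000 MB (the log does not state this, although the assumption reproduces the figures in the table):

```python
# Assumption: every transferred file is 1000 MB; the log does not say so.
ASSUMED_FILE_MB = 1000


def tape_rate_mb_per_s(files, seconds, file_mb=ASSUMED_FILE_MB):
    """Average rate to tape over the interval, in MB/s."""
    return files * file_mb / seconds


# (day, #files, seconds) copied from the first rows of the table above
days = [
    ("Wed 19th April", 1283, 31080),
    ("Thu 20th April", 2530, 86400),
    ("Fri 21st April", 3454, 86400),
]

for day, files, seconds in days:
    print(f"{day}: {tape_rate_mb_per_s(files, seconds):.2f} MB/sec")
```

The first row covers only 31080 seconds (a partial day), which is why its rate is not directly comparable by file count alone.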

Event Log

30th June 2006

Unloaded disk server

440 start Fri Jun 30 18:20:30 BST 2006
440   end Fri Jun 30 18:20:47 BST 2006
444 start Fri Jun 30 18:20:47 BST 2006
444   end Fri Jun 30 18:21:02 BST 2006
445 start Fri Jun 30 18:21:02 BST 2006
445   end Fri Jun 30 18:21:17 BST 2006
446 start Fri Jun 30 18:21:17 BST 2006
446   end Fri Jun 30 18:21:32 BST 2006
447 start Fri Jun 30 18:21:32 BST 2006
447   end Fri Jun 30 18:21:48 BST 2006
450 start Fri Jun 30 18:21:48 BST 2006
450   end Fri Jun 30 18:22:03 BST 2006
451 start Fri Jun 30 18:22:03 BST 2006
451   end Fri Jun 30 18:22:19 BST 2006
452 start Fri Jun 30 18:22:19 BST 2006
452   end Fri Jun 30 18:22:37 BST 2006

Busy disk server

440 start Fri Jun 30 18:22:47 BST 2006
440   end Fri Jun 30 18:25:12 BST 2006
444 start Fri Jun 30 18:25:12 BST 2006
444   end Fri Jun 30 18:26:56 BST 2006
445 start Fri Jun 30 18:26:56 BST 2006
445   end Fri Jun 30 18:27:30 BST 2006
446 start Fri Jun 30 18:27:30 BST 2006
446   end Fri Jun 30 18:28:15 BST 2006
447 start Fri Jun 30 18:28:15 BST 2006
447   end Fri Jun 30 18:28:38 BST 2006
450 start Fri Jun 30 18:28:38 BST 2006
450   end Fri Jun 30 18:30:39 BST 2006
451 start Fri Jun 30 18:30:39 BST 2006
451   end Fri Jun 30 18:31:04 BST 2006
452 start Fri Jun 30 18:31:04 BST 2006
452   end Fri Jun 30 18:31:30 BST 2006
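
To compare the two runs door by door, the start/end pairs can be turned into durations. This is only a sketch: the function name `door_durations` is made up for illustration, only the first two doors of each run are included, and it assumes an English locale for parsing the day and month names.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"


def door_durations(log_text):
    """Return {door: seconds} from '<door> start|end <timestamp>' lines."""
    starts, durations = {}, {}
    for line in log_text.strip().splitlines():
        door, event, stamp = line.split(None, 2)
        # Drop the zone token; every stamp is BST, so differences are unaffected.
        when = datetime.strptime(stamp.replace(" BST", ""), FMT)
        if event == "start":
            starts[door] = when
        else:
            durations[door] = int((when - starts[door]).total_seconds())
    return durations


unloaded = door_durations("""
440 start Fri Jun 30 18:20:30 BST 2006
440   end Fri Jun 30 18:20:47 BST 2006
444 start Fri Jun 30 18:20:47 BST 2006
444   end Fri Jun 30 18:21:02 BST 2006
""")
busy = door_durations("""
440 start Fri Jun 30 18:22:47 BST 2006
440   end Fri Jun 30 18:25:12 BST 2006
444 start Fri Jun 30 18:25:12 BST 2006
444   end Fri Jun 30 18:26:56 BST 2006
""")

# door, unloaded seconds, busy seconds, slowdown ratio
for door in unloaded:
    print(door, unloaded[door], busy[door], f"{busy[door] / unloaded[door]:.1f}x")
```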

28th June 2006

Two gridftp doors timing out this morning - restarted.

Implemented automatic restart of dcache doors on gftp0440 & gftp0444 at 4am to see if that improves things - hopefully they should have a lower transfer count and behave more reliably.

Grabbed gftp0445 all to myself by removing the LoginBroker link from the gridftpdoor.batch file (this stops the SRM using it for TURLs), then ran some single stream transfers against some disk servers.

The first 2 results are after a restart of the dcache-core service without a reboot; the last 3 are after a reboot. Times are taken by running the date command, storing the output in an environment variable, doing the transfer, echoing the saved value and running date again. The busy and full disk servers are dCache's pick of csfnfs60-63. The quiet and empty disk server is on a different switch to the other two disk servers and the gftp door, which probably explains the full, quiet server being faster. Going from quiet to busy appears to reduce speed by a factor of about 5.
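
The timing method described above was done with the shell's date command. An equivalent sketch in Python is shown here instead, for consistency with the other examples on this page; `timed_run` is a hypothetical helper and the command is a placeholder, not the actual transfer command used.

```python
import subprocess
import sys
import time
from datetime import datetime


def timed_run(command):
    """Run a command and return (start, end, elapsed_seconds)."""
    start = datetime.now()
    t0 = time.monotonic()
    subprocess.run(command, check=True)
    elapsed = time.monotonic() - t0
    return start, datetime.now(), elapsed


# Placeholder standing in for the real transfer command (an assumption).
placeholder = [sys.executable, "-c", "pass"]

start, end, elapsed = timed_run(placeholder)
print(start)
print(end)
print(f"= {elapsed:.0f} seconds")
```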

File size is 954MB (1000000000 Bytes)

Non rebooted dedicated gftp server to quiet & empty disk server

Wed Jun 28 17:55:13 BST 2006
Wed Jun 28 17:56:04 BST 2006
= 51 seconds

Non rebooted dedicated gftp server to busy & full disk server

Wed Jun 28 17:56:24 BST 2006
Wed Jun 28 18:00:25 BST 2006
= 241 seconds

Rebooted dedicated gftp server to quiet & empty disk server

Wed Jun 28 18:06:53 BST 2006
Wed Jun 28 18:07:27 BST 2006
= 34 seconds

Rebooted dedicated gftp server to busy & full disk server

Wed Jun 28 18:08:08 BST 2006
Wed Jun 28 18:11:09 BST 2006
= 181 seconds

Rebooted dedicated gftp server to quiet & fullish (140GB free) disk server

Wed Jun 28 18:16:49 BST 2006
Wed Jun 28 18:17:20 BST 2006
= 31 seconds
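
For comparison, the five timings above can be converted into transfer rates. A rough sketch, assuming the test file is 1000000000 bytes (the size quoted for the 954MB file in the 21st June entry); elapsed times are computed directly from the timestamp pairs:

```python
from datetime import datetime

# Assumption: 1000000000-byte test file, as quoted in the 21st June entry.
FILE_BYTES = 1_000_000_000
FMT = "%H:%M:%S"

# (case, start, end) taken from the timestamp pairs above
runs = [
    ("non-rebooted, quiet & empty", "17:55:13", "17:56:04"),
    ("non-rebooted, busy & full", "17:56:24", "18:00:25"),
    ("rebooted, quiet & empty", "18:06:53", "18:07:27"),
    ("rebooted, busy & full", "18:08:08", "18:11:09"),
    ("rebooted, quiet & fullish", "18:16:49", "18:17:20"),
]


def elapsed_seconds(start, end):
    """Seconds between two same-day HH:MM:SS stamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


for label, start, end in runs:
    secs = elapsed_seconds(start, end)
    print(f"{label}: {secs} s, {FILE_BYTES / secs / 1e6:.1f} MB/s")
```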

Afterthought - The busy disk servers have been busy for a while without dcache-pool being restarted, while the quiet ones have been idle since their last restart of the dcache-pool service. This is perhaps a slightly unfair comparison, as the busy servers may have suffered a performance drop because of it. Will retest tomorrow by restarting dcache-pool on one of the busy disk servers and doing targeted transfers to one of its pools.

22nd June 2006

10:00 - No gridftp doors dropped overnight, but the powercut at CERN probably helped.

G-U-C speeds :

 gftp0440 : Thu Jun 22 09:54:38 BST 2006 to Thu Jun 22 09:55:06 BST 2006
 gftp0444 : Thu Jun 22 09:55:06 BST 2006 to Thu Jun 22 09:55:24 BST 2006
 gftp0445 : Thu Jun 22 09:55:24 BST 2006 to Thu Jun 22 09:55:55 BST 2006
 gftp0446 : Thu Jun 22 09:55:55 BST 2006 to Thu Jun 22 09:57:55 BST 2006
 gftp0447 : Thu Jun 22 09:57:55 BST 2006 to Thu Jun 22 09:59:57 BST 2006
 gftp0450 : Thu Jun 22 09:59:57 BST 2006 to Thu Jun 22 10:01:58 BST 2006
 gftp0451 : Thu Jun 22 10:01:58 BST 2006 to Thu Jun 22 10:02:26 BST 2006
 gftp0452 : Thu Jun 22 10:02:26 BST 2006 to Thu Jun 22 10:04:27 BST 2006

Looking at these results and yesterday's, later runs are slower than earlier runs, so I've reversed the list of doors to see if the slowdown is host based or time based:

 gftp0452 : Thu Jun 22 10:16:37 BST 2006 to Thu Jun 22 10:18:39 BST 2006
 gftp0451 : Thu Jun 22 10:18:39 BST 2006 to Thu Jun 22 10:20:41 BST 2006
 gftp0450 : Thu Jun 22 10:20:41 BST 2006 to Thu Jun 22 10:22:43 BST 2006
 gftp0447 : Thu Jun 22 10:22:43 BST 2006 to Thu Jun 22 10:24:45 BST 2006
 gftp0446 : Thu Jun 22 10:24:45 BST 2006 to Thu Jun 22 10:27:13 BST 2006
 gftp0445 : Thu Jun 22 10:27:13 BST 2006 to Thu Jun 22 10:27:57 BST 2006
 gftp0444 : Thu Jun 22 10:27:57 BST 2006 to Thu Jun 22 10:28:49 BST 2006
 gftp0440 : Thu Jun 22 10:28:49 BST 2006 to Thu Jun 22 10:30:08 BST 2006

Looking at these, it appears the higher numbered doors are on the whole slower than the lower numbered doors.
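
The host-versus-time comparison can be made explicit by putting each door's duration from the forward run next to its duration from the reversed run. A small sketch using three of the doors from the two lists above; `parse_run` is an illustrative name and an English locale is assumed for the date parsing.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"


def parse_run(text):
    """Map door name to transfer seconds for '<door> : <start> to <end>' lines."""
    result = {}
    for line in text.strip().splitlines():
        door, span = line.split(" : ")
        start, end = (
            datetime.strptime(stamp.replace(" BST", "").strip(), FMT)
            for stamp in span.split(" to ")
        )
        result[door.strip()] = int((end - start).total_seconds())
    return result


forward = parse_run("""
 gftp0440 : Thu Jun 22 09:54:38 BST 2006 to Thu Jun 22 09:55:06 BST 2006
 gftp0451 : Thu Jun 22 10:01:58 BST 2006 to Thu Jun 22 10:02:26 BST 2006
 gftp0452 : Thu Jun 22 10:02:26 BST 2006 to Thu Jun 22 10:04:27 BST 2006
""")
reverse = parse_run("""
 gftp0452 : Thu Jun 22 10:16:37 BST 2006 to Thu Jun 22 10:18:39 BST 2006
 gftp0451 : Thu Jun 22 10:18:39 BST 2006 to Thu Jun 22 10:20:41 BST 2006
 gftp0440 : Thu Jun 22 10:28:49 BST 2006 to Thu Jun 22 10:30:08 BST 2006
""")

# door, seconds in forward run, seconds in reversed run
for door in sorted(forward):
    print(door, forward[door], reverse[door])
```

A door whose time stays roughly constant in both positions points to a host effect; one whose time changes with its position in the run points to a time or load effect.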

21st June 2006

14:00 - Long time no log.

I need somewhere to document my efforts to understand the current issues we're having with gridftp doors slowing down and hanging, and this is probably the most appropriate place.

Configuration Changes @ 1400

gftp0440 - currently using 1 stream per transfer instead of 10
    0444 - pnfs timeout increased to 1200 from 120 (was only running a gridftp door not a gsidcap door, but gsidcap door has now been restarted)
    0445 - /etc/sysctl.conf removed
    0446 - New version of dcache-server installed,pnfs timeout increased as gftp0444
    0447 - Using j2sdk 1.4.2-08 rather than j2re 1.4.2-01. pnfs timeout increased
    0450 - Performance Markers disabled
    0451 - PoolManager timeout to 54000 from 5400
    0452 - Same as gftp0444

So far, only gftp0440 & gftp0445 have not yet stopped working with the

  kernel: TCP: too many of orphaned sockets

error message
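
One way to watch for this condition is to track the orphan count that Linux reports on the TCP line of /proc/net/sockstat. The sketch below parses text in that layout; the sample values are invented, `orphan_count` is a hypothetical helper, and the exact fields on the kernels in use here are an assumption.

```python
def orphan_count(sockstat_text):
    """Return the 'orphan' value from the TCP line of sockstat-style text."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]
            # Fields alternate name, value: inuse 5 orphan 0 ...
            pairs = dict(zip(fields[0::2], fields[1::2]))
            return int(pairs["orphan"])
    raise ValueError("no TCP line found")


# Invented sample in the /proc/net/sockstat layout
sample = """sockets: used 283
TCP: inuse 5 orphan 0 tw 0 alloc 7 mem 1
UDP: inuse 3 mem 2
"""

print(orphan_count(sample))
```

On a live host the text would come from reading /proc/net/sockstat, and the count could be compared against the net.ipv4.tcp_max_orphans sysctl.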

gftp Door time test - 446 and 452 were broken at the time, file size is 954MB (1000000000 Bytes), from RAL ui.

gftp0440 : Wed Jun 21 14:25:36 BST 2006 to Wed Jun 21 14:25:58 BST 2006
gftp0444 : Wed Jun 21 14:25:58 BST 2006 to Wed Jun 21 14:27:59 BST 2006
gftp0445 : Wed Jun 21 14:27:59 BST 2006 to Wed Jun 21 14:28:18 BST 2006
 Cancelling copy...
gftp0446 : Wed Jun 21 14:28:18 BST 2006 to Wed Jun 21 14:35:42 BST 2006
gftp0447 : Wed Jun 21 14:35:42 BST 2006 to Wed Jun 21 14:37:41 BST 2006
gftp0450 : Wed Jun 21 14:37:41 BST 2006 to Wed Jun 21 14:39:41 BST 2006
gftp0451 : Wed Jun 21 14:39:41 BST 2006 to Wed Jun 21 14:39:57 BST 2006
 Cancelling copy...
gftp0452 : Wed Jun 21 14:39:57 BST 2006 to Wed Jun 21 14:40:49 BST 2006

2nd May 2006

17:00 - Network restored at ~16:30 - transfers immediately back and running. It took a speculative prod to one of the router ports on UKLR to get the links back, after some cards were reset on the Ciena kit. No viable reason why this should have cured the 'only 3 of the 4 links working' problem.

Martin Bly 17:10, 2 May 2006 (BST)

29th April 2006

19:00 - Transfers stopped at ~10:00 - looks like OPN down, though difficult to say for sure. Ping to castorgridsc.cern.ch fails. Two warnings 'major critical' up on the Ciena box. Not good. To whom do we report this?

Martin Bly 19:15, 29 Apr 2006 (BST)

28th April 2006

15:45 - Minor network break for cable swap
15:00 - Upped PnfsManager threads to 8, restarted dcache-core on pnfs to take effect
12:00 - Converted two pools on nfs57 into hsm backed pools and added in.
11:30 - Restarted dcache-core on gftp0451 and gftp0452 - being idle

26th April 2006

13:12 - Various network problems manifested themselves this morning after the stack reset yesterday, seeming to indicate a fault in one unit of stack 5510-2: the one that had the dcache units attached (before they were moved). A further reset of the suspect unit failed to clear the problems, so the unit was power-cycled at around 10am. This appears to have cured the problem - all the systems (mostly batch workers) on the affected unit have returned to normal operation.

dcache.gridpp.rl.ac.uk has been returned to 5510-2 from 5530-0b. dcache-head still on 5530-0b. dcache-tape has been on 5510-2 the whole time. gftp0440 and gftp0444 restarted.

Martin Bly 13:27, 26 Apr 2006 (BST)

25th April 2006

18:00 - Down from approximately 5am to 5pm due to an odd network issue: lots of timeouts, and the various dCache components were having insurmountable difficulty communicating with each other. Moved two systems onto a different stack, which allowed us to get the SE(s) back up and running.

24th April 2006

16:13 - Increased number of streams back to 5
11:00 - Set sysctl -w vm.max-readahead=512 vm.min-readahead=256 on the 4 tape cache disk servers

23rd April 2006

23:30 - Transfers resumed from CERN end at ~22:00.

20:30 - I tried that (on the same box) and it didn't work! Ho Hum. No traffic on UKLight since around 11:45, link absolutely dead (on cacti) save for low level chatter from around 15:45. Probably stopped by CERN end. M.

19:33 - Changed the permissions on the directory, someone else reported a similar problem a couple of days ago I think where a directory changed permission for no apparent reason.

  pnfs.gridpp.rl.ac.uk# chown dteam001:dteam /pnfs/gridpp.rl.ac.uk/tape/dteam/sc3/2006-04-23

Steve Traylen 19:42, 23 Apr 2006 (BST)

16:15 - Looking at this report from Maarten L:

   "xfers to RAL have produced more and more failures during Sun. morning, and since 11 GMT 
    all attempts fail as follows:
   
   user has no permission to write into path
   /pnfs/gridpp.rl.ac.uk/tape/dteam/sc3/2006-04-23"

This has permissions root:root rather than dteam001:dteam, gained around 4am. Tried doing the obvious with chown on various boxes but no dice: dcache obviously not the same as a standard fs. Derek will need to reset the permissions.

22nd April 2006

01:30 - Changed io elevator tuning from r=64,w=8192 to r=8192,w=8192 on nfs61
00:20 - Rate to tape had also dropped at the same time - Failed_Tape_Open alarm for dougal possibly related? Seemed to recover without any intervention on my part
00:15 - Report on sc-tech of timeouts on files since 23:30, killed lingering idle transfers, rate increased

21st April 2006

17:10 - Reduced fts channel to 1 stream from 5
17:00 - Changed io elevator tuning from r=64,w=8192 to r=8192,w=64 on nfs60
15:00 - Increased concurrent stores to 3 per server (from 2)
11:00 - OPN to CERN went down

20th April 2006

17:15 - OPN recovered
13:43 - OPN to CERN went down
10:30 - Rates to tape slow overnight - increase movers on each pool to 10 to give a pool more chance of a gap with no incoming traffic

19th April 2006

17:03 - Reduced movers on the 4 tape pools to maximum of 5 (from 50) to try smooth out network traffic on each server
13:55 - Restarted gftp0440

18th April 2006

08:15: Yee-hah #2! 160.10MB/s average for 17/04.

17th April 2006

12:27: Yee-hah! 151.04MB/s average for 16/04. Mind, it's interesting that this happens when no one is on-line a-fiddling!
09:16: Observation: Looking at the Cacti plots, there is a cyclic drop in transfer rate every 6.5 hours.

13th April 2006

09:43 - Created new PoolManager link to give csfnfs42 pools less priority than others
Observation: RAL rate appeared to drop from the initial 200MB/s straight after the restart from DB maintenance yesterday to less than 150MB/s, in correlation with the upturn in traffic to PIC.

12th April 2006

15:00 - dCache servers csfnfs51,54,56,57 now connected to 5530-0b.
11:30 - Moved csfnfs63 onto 5530-0b
08:20 - Big spike in network stats on Cacti for UKLR and most links, due to jam on server caused by /var becoming full.

11th April 2006

22:52 - Restarted doors on gftp0446
22:10 - See a lot of pnfs nfs server not responding - upped pnfscopies from 4 to 8 and shmclients from 8 to 12 in /usr/etc/pnfssetup on pnfs, to see if it helps
13:35 - Rate still varying widely - implemented SARA scheme - reduced movers from 10 to 2 on pools
10:30 - csfnfs61 moved to 5530-0b
10:15 - csfnfs60 moved to 5530-0b
09:20 - Removed dteam access from csfnfs39 pools - insufficient space

10th April 2006

15:30 - added gftp0440 & gftp0450 back.
15:00 - network interruption to put switch in between UKLight Router and Tier 1 head switch to hang gftp doors from, intention is to move disk servers up also.
13:25 - restarted dcache-core on gftp0452
13:11 - added ulimit tweak to nfs57 and restarted
11:40 - gftp0445 & gftp0452 moved from 5510-2 to 5530-0a
11:10 - gftp0444 moved from 5510-2 to 5530-0a
11:00 - csfnfs57 back into production

8th April 2006

22:30 - restarted dcache-core on gftp0446

7th April 2006

12:57 - removed gftp0440 & 450 to see what effect it has on link distribution
09:24 - nfs57 has broken overnight

6th April 2006

17:00 - There is a correlation between the addition of the 4 new gridftp doors and the unbalanced link

[Figure: Trunk-maxed.png - 1 link near max]

16:30 - One of the gigabit links in a 4x1 trunk link between 2 T1 switches is close to maxed out
13:00 - csfnfs54 just too full - removed dteam access, added fixed csfnfs42 pools
10:07 - pools csfnfs54_1 & _3 were getting full, switched dteam activity to csfnfs54_2 & _4
09:50 - ditto gftp0444
09:30 - Restarted dcache-core on gftp0440 - going slow

5th April 2006

20:55 - Restarted gftp0440 & gftp0447 - lots of sockets with data queued but not going anywhere
16:30 - Speed has dipped, added pool from other array from each system back into dteam pool group
16:00 - Removed dteam access from all but one pool on each system - trying to increase distribution across systems
14:00 - Set space cost factor to 0 - free space is so vastly different between pools
11:25 - Set all non-tape/non csfnfs58 pools to 10 gftp transfers (from 50) to try and improve distribution
11:00 - Transfers stopped - mailed sc-tech
09:20 - CERN paused for DB move: restarted dcache-core on gftp0447 as it's not been pulling its weight, and upped max running put transfers per user (if other transfers queued) from 10 to 30 on srm to see if it makes a difference.
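
The cost-factor changes in this log (space cost factor set to 0 at 14:00 above, cpu cost factor raised to 2.0 on 3rd April) can be illustrated with a toy model. This sketch assumes that dCache picks the pool with the lowest weighted sum of a space cost and a cpu (performance) cost; the pool names and cost values are invented, and the real cost calculation is more involved.

```python
def total_cost(space_cost, cpu_cost, space_factor=1.0, cpu_factor=1.0):
    """Weighted sum of the two costs (assumed model; lower is preferred)."""
    return space_factor * space_cost + cpu_factor * cpu_cost


# Hypothetical pools: (space cost, cpu cost). All values are invented.
pools = {
    "empty_but_busy": (0.1, 0.6),
    "full_but_quiet": (0.8, 0.1),
}


def pick_pool(space_factor, cpu_factor):
    """Name of the pool with the lowest total cost for the given factors."""
    return min(
        pools,
        key=lambda name: total_cost(*pools[name], space_factor, cpu_factor),
    )


print(pick_pool(1.0, 1.0))  # default weighting
print(pick_pool(0.0, 1.0))  # space cost factor 0: free space is ignored
print(pick_pool(1.0, 2.0))  # cpu cost factor 2.0: load counts double
```

In this toy model, both changes shift the choice from the empty but busy pool towards the quiet one, which is the effect the entries describe aiming for.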

4th April 2006

21:25 - restarted dcache-core on gftp0445 & gftp0446 - both idle for >1 hour
18:20 - restarted dcache-core on gftp0444 - LoginBroker claiming 11 transfers, but no network traffic evident
18:00 - added 4 new gftp servers
16:15 - csfnfs42 has dropped a drive from one of its arrays, removed dteam access from both to give it a break

3rd April 2006

14:10 - No significant effect from cpu cost change - remove 3 of 4 pools of csfnfs57 from dteam pool group
13:50 - Rate has dropped since addition of pools - set cpu cost factor to 2.0 (from 1.0) to get dCache to prefer quiet pools over empty pools
12:00 - Added 4 pools from csfnfs57 to dteam pool group

RAL Tier1 SC4 Status
