RAL Tier1 SC4 Status
This page refers to events during WLCG's Service Challenge 4 from the RAL perspective in April through June 2006 and is thus largely of historical interest.
RAL's log for SC4 - similar to RAL Tier1 SC3 Log. Times are GMT+1
Contents
- 1 Tape Rates
- 2 Event Log
- 2.1 30th June 2006
- 2.2 28th June 2006
- 2.3 22nd June 2006
- 2.4 21st June 2006
- 2.5 2nd May 2006
- 2.6 29th April 2006
- 2.7 28th April 2006
- 2.8 26th April 2006
- 2.9 25th April 2006
- 2.10 24th April 2006
- 2.11 23rd April 2006
- 2.12 22nd April 2006
- 2.13 21st April 2006
- 2.14 20th April 2006
- 2.15 19th April 2006
- 2.16 18th April 2006
- 2.17 17th April 2006
- 2.18 13th April 2006
- 2.19 12th April 2006
- 2.20 11th April 2006
- 2.21 10th April 2006
- 2.22 8th April 2006
- 2.23 7th April 2006
- 2.24 6th April 2006
- 2.25 5th April 2006
- 2.26 4th April 2006
- 2.27 3rd April 2006
Tape Rates
Day | #files | seconds | MB/sec | MB/sec/drive | wr/mount |
---|---|---|---|---|---|
Wed 19th April | 1283 | 31080 | 41.28 | 30.2 | 4.4 |
Thu 20th April | 2530 | 86400 | 29.28 | 33.9 | 3.8 |
Fri 21st April | 3454 | 86400 | 39.98 | 29.7 | 5.2 |
Sat 22nd April | 2569 | 86400 | 29.73 | 36.15 | 4.2 |
Sun 23rd April | 1377 | 86400 | 15.94 | 38.5 | 7.9 |
Mon 24th April | 2680 | 86400 | 31.02 | 40.15 | 10.34 |
Tue 25th April | 1833 | 86400 | 21.22 | 34.04 | 7.6 |
Wed 26th April | 2676 | 86400 | 30.97 | 29.06 | 7.39 |
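The MB/sec column above appears to be files x file size / elapsed seconds. A minimal sketch of that arithmetic, assuming each file is 1000 MB (the file size is not stated for these transfers, but this value reproduces the figures in the table); the names `FILE_MB` and `tape_rate` are made up for illustration:

```python
# Assumption: every file written to tape is 1000 MB. This is not stated in the
# log, but it reproduces the MB/sec column of the table.
FILE_MB = 1000


def tape_rate(files, seconds, file_mb=FILE_MB):
    """Average rate to tape in MB/sec over the measurement window."""
    return files * file_mb / seconds


# (files, seconds) pairs copied from the first two rows of the table
rows = {
    "Wed 19th April": (1283, 31080),
    "Thu 20th April": (2530, 86400),
}

for day, (files, seconds) in rows.items():
    print(f"{day}: {tape_rate(files, seconds):.2f} MB/sec")
```

The MB/sec/drive and wr/mount columns depend on drive and mount counts that are not recorded here, so they cannot be rederived the same way.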
Event Log
30th June 2006
Unloaded disk server
440 start Fri Jun 30 18:20:30 BST 2006
440 end Fri Jun 30 18:20:47 BST 2006
444 start Fri Jun 30 18:20:47 BST 2006
444 end Fri Jun 30 18:21:02 BST 2006
445 start Fri Jun 30 18:21:02 BST 2006
445 end Fri Jun 30 18:21:17 BST 2006
446 start Fri Jun 30 18:21:17 BST 2006
446 end Fri Jun 30 18:21:32 BST 2006
447 start Fri Jun 30 18:21:32 BST 2006
447 end Fri Jun 30 18:21:48 BST 2006
450 start Fri Jun 30 18:21:48 BST 2006
450 end Fri Jun 30 18:22:03 BST 2006
451 start Fri Jun 30 18:22:03 BST 2006
451 end Fri Jun 30 18:22:19 BST 2006
452 start Fri Jun 30 18:22:19 BST 2006
452 end Fri Jun 30 18:22:37 BST 2006
Busy disk server
440 start Fri Jun 30 18:22:47 BST 2006
440 end Fri Jun 30 18:25:12 BST 2006
444 start Fri Jun 30 18:25:12 BST 2006
444 end Fri Jun 30 18:26:56 BST 2006
445 start Fri Jun 30 18:26:56 BST 2006
445 end Fri Jun 30 18:27:30 BST 2006
446 start Fri Jun 30 18:27:30 BST 2006
446 end Fri Jun 30 18:28:15 BST 2006
447 start Fri Jun 30 18:28:15 BST 2006
447 end Fri Jun 30 18:28:38 BST 2006
450 start Fri Jun 30 18:28:38 BST 2006
450 end Fri Jun 30 18:30:39 BST 2006
451 start Fri Jun 30 18:30:39 BST 2006
451 end Fri Jun 30 18:31:04 BST 2006
452 start Fri Jun 30 18:31:04 BST 2006
452 end Fri Jun 30 18:31:30 BST 2006
28th June 2006
Two gridftp doors timing out this morning - restarted.
Implemented automatic restart of dcache doors on gftp0440 & gftp0444 at 4am to see if that improves things - hopefully they should have a lower transfer count and behave more reliably.
Grabbed gftp0445 all to myself by removing the LoginBroker link from the gridftpdoor.batch file (this stops the SRM using it for TURLs), then ran some single-stream transfers against some disk servers.
The first 2 results are after a restart of the dcache-core service, but the system had not been rebooted; the last 3 are after a reboot. Times were taken by running the date command, storing its output in an environment variable, doing the transfer, echoing the saved value and running date again. The busy and full disk servers are dCache's pick of csfnfs60-63. The quiet and empty disk server is on a different switch from the other two disk servers and the gftp door, which probably explains the full, quiet server being faster. Going from quiet to busy appears to reduce speed by a factor of 5.
File size is 954MB (1000000000 Bytes)
Non rebooted dedicated gftp server to quiet & empty disk server
Wed Jun 28 17:55:13 BST 2006 Wed Jun 28 17:56:04 BST 2006 = 49 seconds
Non rebooted dedicated gftp server to busy & full disk server
Wed Jun 28 17:56:24 BST 2006 Wed Jun 28 18:00:25 BST 2006 = 241 seconds
Rebooted dedicated gftp server to quiet & empty disk server
Wed Jun 28 18:06:53 BST 2006 Wed Jun 28 18:07:27 BST 2006 = 37 seconds
Rebooted dedicated gftp server to busy & full disk server
Wed Jun 28 18:08:08 BST 2006 Wed Jun 28 18:11:09 BST 2006 = 181 seconds
Rebooted dedicated gftp server to quiet & fullish (140GB free) disk server
Wed Jun 28 18:16:49 BST 2006 Wed Jun 28 18:17:20 BST 2006 = 31 seconds
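The timing method used for the results above (note the time, do the transfer, note the time again) can be sketched as below. This is only an illustration: the original tests used the shell's date command, and `timed_run` is a hypothetical helper. The transfer command is passed in as an argument rather than spelled out, since the exact command line is not recorded in this log.

```python
import subprocess
from datetime import datetime


def timed_run(command):
    """Run one command and return (start, end, elapsed_seconds).

    Same idea as saving the output of `date`, doing the transfer,
    then running `date` again.
    """
    start = datetime.now()
    subprocess.run(command, check=True)  # raises if the command fails
    end = datetime.now()
    return start, end, (end - start).total_seconds()
```

Calling `timed_run(cmd)` once per target disk server, with `cmd` being the argument list of whatever copy command is in use, gives directly comparable wall-clock figures.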
Afterthought - The busy disk servers have been busy for a while without dcache-pool being restarted, while the quiet ones have been idle since their last restart of the dcache-pool service, so this is perhaps a slightly unfair comparison: the busy servers may have suffered a performance drop because of this. Will retest tomorrow by restarting dcache-pool on one of the busy disk servers and doing targeted transfers to one of its pools.
22nd June 2006
- 10:00 - No gridftp doors dropped overnight, but the power cut at CERN probably helped.
G-U-C speeds :
gftp0440 : Thu Jun 22 09:54:38 BST 2006 to Thu Jun 22 09:55:06 BST 2006
gftp0444 : Thu Jun 22 09:55:06 BST 2006 to Thu Jun 22 09:55:24 BST 2006
gftp0445 : Thu Jun 22 09:55:24 BST 2006 to Thu Jun 22 09:55:55 BST 2006
gftp0446 : Thu Jun 22 09:55:55 BST 2006 to Thu Jun 22 09:57:55 BST 2006
gftp0447 : Thu Jun 22 09:57:55 BST 2006 to Thu Jun 22 09:59:57 BST 2006
gftp0450 : Thu Jun 22 09:59:57 BST 2006 to Thu Jun 22 10:01:58 BST 2006
gftp0451 : Thu Jun 22 10:01:58 BST 2006 to Thu Jun 22 10:02:26 BST 2006
gftp0452 : Thu Jun 22 10:02:26 BST 2006 to Thu Jun 22 10:04:27 BST 2006
Looking at these results and yesterday's, later runs are slower than earlier runs, so I've reversed the list of doors to see if the slowdown is host based or time based:
gftp0452 : Thu Jun 22 10:16:37 BST 2006 to Thu Jun 22 10:18:39 BST 2006
gftp0451 : Thu Jun 22 10:18:39 BST 2006 to Thu Jun 22 10:20:41 BST 2006
gftp0450 : Thu Jun 22 10:20:41 BST 2006 to Thu Jun 22 10:22:43 BST 2006
gftp0447 : Thu Jun 22 10:22:43 BST 2006 to Thu Jun 22 10:24:45 BST 2006
gftp0446 : Thu Jun 22 10:24:45 BST 2006 to Thu Jun 22 10:27:13 BST 2006
gftp0445 : Thu Jun 22 10:27:13 BST 2006 to Thu Jun 22 10:27:57 BST 2006
gftp0444 : Thu Jun 22 10:27:57 BST 2006 to Thu Jun 22 10:28:49 BST 2006
gftp0440 : Thu Jun 22 10:28:49 BST 2006 to Thu Jun 22 10:30:08 BST 2006
Looking at these, it appears the higher-numbered doors are on the whole slower than the lower-numbered doors.
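To compare doors more easily, the start/end stamps can be turned into seconds and an approximate rate. A minimal sketch, assuming the test file is the same 1000000000-byte file used in the other door tests (not stated for this run); `parse_stamp`, `door_seconds` and `rate_mb_per_s` are illustrative names only:

```python
from datetime import datetime

# Assumption: same 1000000000-byte test file as in the other door tests.
FILE_BYTES = 1_000_000_000


def parse_stamp(text):
    """Parse `date` output such as 'Thu Jun 22 09:54:38 BST 2006'.

    The zone token is dropped; both stamps of a pair share the same zone,
    so the difference between them is unaffected.
    """
    parts = text.split()
    return datetime.strptime(" ".join(parts[:4] + parts[5:]),
                             "%a %b %d %H:%M:%S %Y")


def door_seconds(lines):
    """Map door name -> elapsed seconds for 'door : start to end' lines."""
    result = {}
    for line in lines:
        door, _, rest = line.partition(" : ")
        start, _, end = rest.partition(" to ")
        elapsed = parse_stamp(end) - parse_stamp(start)
        result[door.strip()] = elapsed.total_seconds()
    return result


def rate_mb_per_s(seconds, file_bytes=FILE_BYTES):
    """Approximate transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return file_bytes / 1e6 / seconds


sample = [
    "gftp0440 : Thu Jun 22 09:54:38 BST 2006 to Thu Jun 22 09:55:06 BST 2006",
    "gftp0446 : Thu Jun 22 09:55:55 BST 2006 to Thu Jun 22 09:57:55 BST 2006",
]
for door, seconds in door_seconds(sample).items():
    print(door, seconds, round(rate_mb_per_s(seconds), 1))
```

Under that file-size assumption, the 28 second run on gftp0440 works out at roughly 36 MB/s and the 120 second run on gftp0446 at roughly 8 MB/s.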
21st June 2006
- 14:00 - Long time no log.
I need somewhere to document my efforts to understand the current issues we're having with gridftp doors slowing down and hanging, and this is probably the most appropriate place.
Configuration Changes @ 1400
gftp0440 - currently using 1 stream per transfer instead of 10
0444 - pnfs timeout increased to 1200 from 120 (was only running a gridftp door not a gsidcap door, but gsidcap door has now been restarted)
0445 - /etc/sysctl.conf removed
0446 - New version of dcache-server installed, pnfs timeout increased as gftp0444
0447 - Using j2sdk 1.4.2-08 rather than j2re 1.4.2-01. pnfs timeout increased
0450 - Performance Markers disabled
0451 - PoolManager timeout to 54000 from 5400
0452 - Same as gftp0444
So far, only gftp0440 and gftp0445 have not yet stopped working with the
kernel: TCP: too many of orphaned sockets
error message
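For context, this kernel message is typically logged when the number of orphaned TCP sockets (closed by the application but not yet released by the kernel) reaches the limit in net.ipv4.tcp_max_orphans. The sketch below was not part of the original investigation: it shows one way to read the orphan count from text in the format of /proc/net/sockstat. The sample text and the limit of 8192 are invented for illustration, and `orphan_usage` is a hypothetical helper.

```python
def orphan_usage(sockstat_text, max_orphans):
    """Return (orphans, fraction_of_limit) from /proc/net/sockstat-style text."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            # The line is 'TCP:' followed by alternating name/value fields
            fields = line.split()[1:]
            stats = dict(zip(fields[0::2], map(int, fields[1::2])))
            orphans = stats["orphan"]
            return orphans, orphans / max_orphans
    raise ValueError("no TCP line found")


# Invented sample values, in the layout used by /proc/net/sockstat
SAMPLE = (
    "sockets: used 231\n"
    "TCP: inuse 40 orphan 8100 tw 12 alloc 60 mem 300\n"
    "UDP: inuse 3 mem 2\n"
)

orphans, fraction = orphan_usage(SAMPLE, max_orphans=8192)
print(f"{orphans} orphaned sockets, {fraction:.0%} of limit")
```

On a live Linux host the two inputs would come from /proc/net/sockstat and /proc/sys/net/ipv4/tcp_max_orphans.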
gftp Door time test - 446 and 452 were broken at the time, file size is 954MB (1000000000 Bytes), from RAL ui.
gftp0440 : Wed Jun 21 14:25:36 BST 2006 to Wed Jun 21 14:25:58 BST 2006
gftp0444 : Wed Jun 21 14:25:58 BST 2006 to Wed Jun 21 14:27:59 BST 2006
gftp0445 : Wed Jun 21 14:27:59 BST 2006 to Wed Jun 21 14:28:18 BST 2006
Cancelling copy...
gftp0446 : Wed Jun 21 14:28:18 BST 2006 to Wed Jun 21 14:35:42 BST 2006
gftp0447 : Wed Jun 21 14:35:42 BST 2006 to Wed Jun 21 14:37:41 BST 2006
gftp0450 : Wed Jun 21 14:37:41 BST 2006 to Wed Jun 21 14:39:41 BST 2006
gftp0451 : Wed Jun 21 14:39:41 BST 2006 to Wed Jun 21 14:39:57 BST 2006
Cancelling copy...
gftp0452 : Wed Jun 21 14:39:57 BST 2006 to Wed Jun 21 14:40:49 BST 2006
2nd May 2006
- 17:00 - Network restored at ~16:30 - transfers immediately back and running. It took a speculative prod to one of the router ports on UKLR to get the links back, after some cards were reset on the Ciena kit. No obvious reason why this should have cured the 'only 3 of the 4 links working' problem.
Martin Bly 17:10, 2 May 2006 (BST)
29th April 2006
- 19:00 - Transfers stopped at ~10:00 - looks like the OPN is down, though difficult to say for sure. Ping to castorgridsc.cern.ch fails. Two warnings 'major critical' up on the Ciena box. Not good. To whom do we report this?
Martin Bly 19:15, 29 Apr 2006 (BST)
28th April 2006
- 15:45 - Minor network break for cable swap
- 15:00 - Upped PnfsManager threads to 8, restarted dcache-core on pnfs to take effect
- 12:00 - Converted two pools on nfs57 into hsm backed pools and added in.
- 11:30 - Restarted dcache-core on gftp0451 and gftp0452 - being idle
26th April 2006
- 13:12 - Various network problems manifested themselves this morning after the stack reset yesterday, seeming to indicate a fault in one unit of stack 5510-2: the one that had the dcache units attached (before they were moved). A further reset of the suspect unit failed to clear the problems so the unit was power-cycled at around 10am. This appears to have cured the problem - all the systems (mostly batch workers) on the affected unit have returned to normal operation.
dcache.gridpp.rl.ac.uk has been returned to 5510-2 from 5530-0b. dcache-head still on 5530-0b. dcache-tape has been on 5510-2 the whole time. gftp0440 and gftp0444 restarted.
Martin Bly 13:27, 26 Apr 2006 (BST)
25th April 2006
- 18:00 - Down from approximately 5am to 5pm due to an odd network issue - lots of timeouts, and the various dCache components were having insurmountable difficulty communicating with each other. Moved two systems onto a different stack, which allowed us to get the SE(s) back up and running.
24th April 2006
- 16:13 - Increased number of streams back to 5
- 11:00 - Set sysctl -w vm.max-readahead=512 vm.min-readahead=256 on the 4 tape cache disk servers
23rd April 2006
- 23:30 - Transfers resumed from CERN end at ~22:00.
- 20:30 - I tried that (on the same box) and it didn't work! Ho Hum. No traffic on UKLight since around 11:45, link absolutely dead (on cacti) save for low level chatter from around 15:45. Probably stopped by CERN end. M.
- 19:33 - Changed the permissions on the directory. Someone else reported a similar problem a couple of days ago, I think, where a directory changed permission for no apparent reason.
pnfs.gridpp.rl.ac.uk# chown dteam001:dteam /pnfs/gridpp.rl.ac.uk/tape/dteam/sc3/2006-04-23
Steve Traylen 19:42, 23 Apr 2006 (BST)
- 16:15 - Looking at this report from Maarten L:
"xfers to RAL have produced more and more failures during Sun. morning, and since 11 GMT all attempts fail as follows: user has no permission to write into path /pnfs/gridpp.rl.ac.uk/tape/dteam/sc3/2006-04-23"
This has permissions root:root rather than dteam001:dteam, gained around 4am. Tried doing the obvious with chown on various boxes but no dice: dcache obviously not the same as a standard fs. Derek will need to reset the permissions.
22nd April 2006
- 01:30 - Changed io elevator tuning from r=64,w=8192 to r=8192,w=8192 on nfs61
- 00:20 - Rate to tape had also dropped at same time - Failed_Tape_Open alarm for dougal possibly related?, seemed to recover without any intervention on my part
- 00:15 - Report on sc-tech of timeouts on files since 23:30, killed lingering idle transfers, rate increased
21st April 2006
- 17:10 - Reduced fts channel to 1 stream from 5
- 17:00 - Changed io elevator tuning from r=64,w=8192 to r=8192,w=64 on nfs60
- 15:00 - Increased concurrent stores to 3 per server (from 2)
- 11:00 - OPN to CERN went down
20th April 2006
- 17:15 - OPN recovered
- 13:43 - OPN to CERN went down
- 10:30 - Rates to tape slow overnight - increased movers on each pool to 10 to give a pool more chance of a gap with no incoming traffic
19th April 2006
- 17:03 - Reduced movers on the 4 tape pools to a maximum of 5 (from 50) to try to smooth out network traffic on each server
- 13:55 - Restarted gftp0440
18th April 2006
- 08:15: Yee-hah #2! 160.10MB/s average for 17/04.
17th April 2006
- 12:27: Yee-hah! 151.04MB/s average for 16/04. Mind, it's interesting that this happens when no one is on-line a-fiddling!
- 09:16: Observation: Looking at the Cacti plots, there is a cyclic drop in transfer rate every 6.5 hours.
13th April 2006
- 09:43 - Created new PoolManager link to give csfnfs42 pools less priority than others
- Observation: RAL rate appeared to drop from the initial 200MB/s straight after the restart from DB maintenance yesterday to less than 150MB/s, in correlation with the upturn in traffic to PIC.
12th April 2006
- 15:00 - dCache servers csfnfs51,54,56,57 now connected to 5530-0b.
- 11:30 - Moved csfnfs63 onto 5530-0b
- 08:20 - Big spike in network stats on Cacti for UKLR and most links, due to jam on server caused by /var becoming full.
11th April 2006
- 22:52 - Restarted doors on gftp0446
- 22:10 - Seeing a lot of pnfs 'nfs server not responding' errors - upped pnfscopies from 4 to 8 and shmclients from 8 to 12 in /usr/etc/pnfssetup on pnfs, to see if it helps
- 13:35 - Rate still varying widely - implemented SARA scheme - reduced movers from 10 to 2 on pools
- 10:30 - csfnfs61 moved to 5530-0b
- 10:15 - csfnfs60 moved to 5530-0b
- 09:20 - Removed dteam access from csfnfs39 pools - insufficient space
10th April 2006
- 15:30 - added gftp0440 & gftp0450 back.
- 15:00 - network interruption to put switch in between UKLight Router and Tier 1 head switch to hang gftp doors from, intention is to move disk servers up also.
- 13:25 - restarted dcache-core on gftp0452
- 13:11 - added ulimit tweak to nfs57 and restarted
- 11:40 - gftp0445 & gftp0452 moved from 5510-2 to 5530-0a
- 11:10 - gftp0444 moved from 5510-2 to 5530-0a
- 11:00 - csfnfs57 back into production
8th April 2006
- 22:30 - restarted dcache-core on gftp0446
7th April 2006
- 12:57 - removed gftp0440 & 450 to see what effect it has on link distribution
- 09:24 - nfs57 has broken overnight
6th April 2006
- 17:00 - There is a correlation between the addition of the 4 new gridftp doors and the unbalanced link
- 16:30 - One of the gigabit links in a 4x1 trunk link between 2 T1 switches is close to maxed out
- 13:00 - csfnfs54 just too full - removed dteam access, added fixed csfnfs42 pools
- 10:07 - pools csfnfs54_1 & _3 were getting full, switched dteam activity to csfnfs54_2 & _4
- 09:50 - ditto gftp0444
- 09:30 - Restarted dcache-core on gftp0440 - going slow
5th April 2006
- 20:55 - Restarted gftp0440 & gftp0447 - lots of sockets with data queued but not going anywhere
- 16:30 - Speed has dipped, added pool from other array from each system back into dteam pool group
- 16:00 - Removed dteam access from all but one pool on each system - trying to increase distribution across systems
- 14:00 - Set space cost factor to 0 - free space is so vastly different between pools
- 11:25 - Set all non-tape/non csfnfs58 pools to 10 gftp transfers (from 50) to try and improve distribution
- 11:00 - Transfers stopped - mailed sc-tech
- 09:20 - CERN paused for DB move: restarted dcache-core on gftp0447 as it's not been pulling its weight, upped max running put transfers per user (if other transfers queued) from 10 to 30 on srm to see if it makes a difference.
4th April 2006
- 21:25 - restarted dcache-core on gftp0445 & gftp0446 - both idle for >1 hour
- 18:20 - restarted dcache-core on gftp0444 - LoginBroker claiming 11 transfers, but no network traffic evident
- 18:00 - added 4 new gftp servers
- 16:15 - csfnfs42 has dropped a drive from one of its arrays, removed dteam access from both to give it a break
3rd April 2006
- 14:10 - No significant effect from cpu cost change - remove 3 of 4 pools of csfnfs57 from dteam pool group
- 13:50 - Rate has dropped since addition of pools - set cpu cost factor to 2.0 (from 1.0) to get dCache to prefer quiet pools over empty pools
- 12:00 - Added 4 pools from csfnfs57 to dteam pool group
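
The cost factor changes above (cpu cost factor to 2.0 on 3rd April, space cost factor to 0 on 5th April) rely on dCache choosing the pool with the lowest weighted sum of a space cost and a performance (cpu) cost. A minimal sketch of that weighting, with made-up pool names and cost values; it illustrates the idea only and is not dCache's actual cost calculation:

```python
def total_cost(space_cost, perf_cost, space_factor=1.0, cpu_factor=1.0):
    """Weighted sum of the two costs; lower is better."""
    return space_factor * space_cost + cpu_factor * perf_cost


def pick_pool(pools, space_factor=1.0, cpu_factor=1.0):
    """Return the name of the pool with the lowest total cost."""
    return min(
        pools,
        key=lambda name: total_cost(*pools[name], space_factor, cpu_factor),
    )


# Hypothetical (space_cost, perf_cost) pairs: one empty but busy pool,
# one fuller but quiet pool.
pools = {
    "empty_busy": (0.1, 0.4),
    "full_quiet": (0.5, 0.1),
}

print(pick_pool(pools))                    # default weights
print(pick_pool(pools, cpu_factor=2.0))    # cpu cost factor raised to 2.0
print(pick_pool(pools, space_factor=0.0))  # space cost ignored
```

With these invented numbers the default weights favour the empty but busy pool, while raising the cpu factor or zeroing the space factor moves the choice to the quiet pool, which is the effect the changes were aiming for.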