Monday 1st October 14.30 BST
36 open tickets this week, and it's the start of the month so we get to go over all of them! Maybe every month is a little too often for such a review...
NGI
https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)
COMET VO creation. On hold pending the other VO creation gubbins (6/9). UPDATE - THE GUBBINS TICKET (https://ggus.eu/ws/ticket_info.php?ticket=85736) SUGGESTS THAT THE VO IS IN PRODUCTION
https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/5)
Chris's VOMS request rejig ticket. On hold until the UK VOMS reshuffle is complete; the reminder date (24/9) has passed. (6/9)
TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=86570 (1/10)
GGUS is moving to a SHA-2 certificate in their next release (~24th), and have asked if the SHA-2 cert will cause any trouble. Gareth has noted the ticket, but it's unclear if others will take notice. In progress (1/10)
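Since the worry is what will choke on a SHA-2 cert, it's easy enough to check what signature algorithm a service is presenting once GGUS switch over. A quick sketch using the stock openssl tools (the hostname is just an example):

#!/usr/bin/env python
# Fetch a host's certificate and report its signature algorithm
# (sha1WithRSAEncryption vs sha256WithRSAEncryption, etc.).
# Equivalent to: openssl s_client -connect HOST:443 </dev/null \
#                | openssl x509 -noout -text
import subprocess

HOST = "ggus.eu"  # example target

s_client = subprocess.Popen(
    ["openssl", "s_client", "-connect", "%s:443" % HOST],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    universal_newlines=True)
pem, _ = s_client.communicate("")
x509 = subprocess.Popen(
    ["openssl", "x509", "-noout", "-text"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    universal_newlines=True)
text, _ = x509.communicate(pem)
for line in text.splitlines():
    if "Signature Algorithm" in line:
        print(line.strip())
        break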
https://ggus.eu/ws/ticket_info.php?ticket=86552 (30/9)
Atlas transfers from/to RAL-LCG2 failed, apparently due to high load at the RAL end. Found to be caused by a database problem. Should be fixed, at risk for a little while longer. In Progress (1/10) SOLVED - ORACLE WORKAROUND PUT BACK IN PLACE
https://ggus.eu/ws/ticket_info.php?ticket=86541 (29/9)
Before the above problem atlas transfers were failing with SECURITY_ERRORs. A known FTS bug caused this (https://ggus.eu/tech/ticket_show.php?ticket=81844). Patch applied this morning. In progress (1/10) SOLVED - PATCH WORKED
https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)
Duncan has ticketed the Tier 1 over packet loss seen on many (but not all) perfSONAR tests where the RAL perfSONAR host is the destination. The RAL chaps are looking into it, but aren't expecting a solution to easily present itself. In progress (19/9)
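For what it's worth, a crude cross-check of the loss figures can be done with plain ping from one of the complaining hosts. A sketch (the hostname is a placeholder, and the real perfSONAR tests use owamp/bwctl rather than ICMP, so treat this only as a sanity check):

#!/usr/bin/env python
# Measure ICMP packet loss towards a host by parsing ping's summary line.
import re
import subprocess

HOST = "perfsonar.example.rl.ac.uk"  # placeholder
COUNT = 100

out = subprocess.Popen(["ping", "-c", str(COUNT), HOST],
                       stdout=subprocess.PIPE,
                       universal_newlines=True).communicate()[0]
m = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
if m:
    print("%s: %s%% loss over %d pings" % (HOST, m.group(1), COUNT))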
https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)
Biomed nagios jobs can't register files on srm-biomed.gridpp.rl.ac.uk. An odd problem that only seemed to affect biomed jobs. Looked to be dealt with for a while, but seems to have re-emerged. In progress (24/9)
https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)
SL4 DPM retirement master ticket. On hold, but should be put In Progress with a view to closing it (6/9)
DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=86578 (1/10)
Ops srm-put tests are failing. Only put in this morning though, still assigned. (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=86534 (28/9)
Ops wn-rep tests failing. Related to the above? Still just assigned (28/9)
https://ggus.eu/ws/ticket_info.php?ticket=86281 (21/9)
Another wn-rep related ticket (for a different CE). This one too is just assigned. Are these getting to Mike? (21/9) UPDATE - MACHINES ARE HAVING CERTIFICATE ISSUES
https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)
Biomed having trouble submitting to cream02, "no space left on device" errors. Not much movement, just in progressed (24/9)
https://ggus.eu/ws/ticket_info.php?ticket=85181 (20/8)
One of the last two glite 3.1 retirement tickets. No reply since Daniela asked if the BDII was indeed glite 3.1. In Progress (13/9)
https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)
High atlas production failure rate at Durham. Durham's rocky summer hasn't helped, but hopefully they're out of the woods(?). On Hold (3/9)
https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)
Ancient compchem ticket. On hold, but might no longer be relevant as all the CEs have been reinstalled (6/9)
https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)
SL4 retirement ticket. It looks like it can be closed, just need some confirmation from someone Durham side. In progress (28/9)
OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=86544 (29/9)
Problems after running out of atlas pool accounts at Oxford. Probably caused by the lcg-expiregridmapdir bug (I missed the discussion of this, maybe it was offline?), fix in place. Long term plan is to up the number of atlas pool accounts. In progress (29/9)
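For anyone else keeping an eye on their pool accounts: the gridmapdir mechanism leases an account by hard-linking the account's file to a file named after the user's DN, so a leased account file has a link count above one. A minimal sketch for counting free leases, assuming the conventional /etc/grid-security/gridmapdir layout and an "atlas" account prefix:

#!/usr/bin/env python
# Rough check for exhausted pool accounts in a gridmapdir.
import os

GRIDMAPDIR = "/etc/grid-security/gridmapdir"  # the usual location
PREFIX = "atlas"                              # pool account prefix to check

free = used = 0
for name in os.listdir(GRIDMAPDIR):
    if not name.startswith(PREFIX):
        continue
    # A leased account file gains a hard link (the DN-named file),
    # so st_nlink > 1 means "in use".
    if os.stat(os.path.join(GRIDMAPDIR, name)).st_nlink > 1:
        used += 1
    else:
        free += 1

print("%s pool: %d used, %d free" % (PREFIX, used, free))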
https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)
Low atlas sonar rate seen between Oxford & BNL. Ewan has been, and still is, looking into it (17/9)
https://ggus.eu/ws/ticket_info.php?ticket=85968 (10/9)
Oxford being bitten by the EMI lcg_utils bug. On hold pending EMI pulling their finger out. (20/9)
LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=86542 (29/9)
Liverpool suffered a bunch of SRM transfer failures in a short timeframe, with no obvious causes found at the time. They were investigating, but were probably interrupted by their unexpected cable bisecting incident today. In progress (29/9). SOLVED - PROBLEM WAS TRANSIENT
https://ggus.eu/ws/ticket_info.php?ticket=86095 (14/9)
Liverpool's encounter with the EMI lcg-utils bug mucking up their WN-rep ops tests. On hold, but has been green for a while - maybe they've just been lucky? (20/9)
BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=86540 (28/9)
Atlas transfers to Birmingham failed with "SRM_ABORTED" messages. Mark reports that the VM they are using as a headnode isn't beefy enough to cope with the demand, causing SRM responses to be too slow. He upped the power of the VM but that wasn't a full fix, and he's hoping to get a reinstall in today. A note from atlas this morning mentions that transfers fail for DATADISK but not for PRODDISK, which is odd. Are there any differences in the nature of these transfers? In progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)
One of the tickets clocking poor atlas sonar rates between Birmingham and BNL. Mark and Laurie have looked into this, but not come up with anything conclusive. In progress (19/9)
BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=86533 (28/9)
Ops "WN-RepDel" tests failing, likely due to the known EMI WN lcg-utils timing-out bug. As Brunel already have a ticket about this issue on a different CE (presumably fronting the same WNs), Daniela asks if the ROD team can roll it all into one ticket rather than several. In progress (1/10) CLOSED DUE TO BEING A DUPLICATE
https://ggus.eu/ws/ticket_info.php?ticket=85973 (10/9)
The "original" RepDel test failure ticket at Brunel. On hold (awaiting a fix from EMI) (20/9)
IMPERIAL
https://ggus.eu/ws/ticket_info.php?ticket=86426 (26/9)
Hone have trouble submitting to the Imperial WMSi. Daniela reports that the machines are suffering from being too old (something we can all relate to); replacements should have arrived on Friday but didn't. Dell report a new delivery date of the 8th. In progress (could be on hold until the kit arrives?) (29/9).
GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=86391 (25/9)
Atlas were having staging-in problems due to high disk server load. The problems however persisted for a while after the load on the server calmed down. Did things sort themselves out after the weekend? In progress (27/9)
https://ggus.eu/ws/ticket_info.php?ticket=85183 (14/8)
One of the last few glite 3.1 retirement tickets. Due to the severe crustiness of the old WMS hardware Glasgow powered it down rather than upgrading (was it only 32-bit hardware?) and are now pondering the next steps. In progress (28/9) UPDATE - DO WE NEED TO REMOVE IT FROM THE GOCDB OR IS DOWNTIME ENOUGH?
https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)
Sno+ WMS problems at Glasgow. AFAICS the WMS in question has been switched off for the reasons above? It might be useful to make that clear to Sno+! In progress (10/9).
RHUL
https://ggus.eu/ws/ticket_info.php?ticket=86383 (25/9)
RHUL stopped publishing UserDN accounting after "upgrading" from glite to EMI apel in August. Apel support have been called in, and Daniela suggests checking the FAQ. In progress (1/10)
QMUL
https://ggus.eu/ws/ticket_info.php?ticket=86378 (25/9)
Hone had jobs waiting "too long" at QM, but the problems disappeared - along with a bunch of jobs. Looks like the QM creams suffered from the database resetting issue (https://ggus.eu/tech/ticket_show.php?ticket=85970, as advertised by Daniela). In progress (27/9)
https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)
Queen Mary is being swamped by unkillable lhcb zombie pilots. Neither the submitters nor the site admins can do aught about them using "normal" tools. Daniela has suggested some DB queries to try, or attempting to use the JobPurger tool (which would be my suggestion too). In progress (1/10). UPDATE - Some success with the JobPurger with a 5 day time frame
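For the record, CREAM keeps its job state in a MySQL database, so the stuck jobs can at least be enumerated before letting the JobPurger loose. A hedged sketch along the lines of Daniela's suggestion - the table and column names here (job, job_status and friends) are assumptions from memory, so check them against your actual creamdb schema first:

#!/usr/bin/env python
# List CREAM jobs whose last status change is older than N days,
# as candidates for purging. Schema names below are assumptions.
import datetime
import MySQLdb

DAYS = 5  # the time frame that reportedly worked at QM
cutoff = datetime.datetime.now() - datetime.timedelta(days=DAYS)

conn = MySQLdb.connect(host="localhost", user="cream_reader",
                       passwd="changeme", db="creamdb")
cur = conn.cursor()
cur.execute("""
    SELECT j.creamJobId, MAX(s.timestamp) AS last_change
    FROM job j JOIN job_status s ON s.jobId = j.id
    GROUP BY j.creamJobId
    HAVING MAX(s.timestamp) < %s
""", (cutoff,))
for jobid, last_change in cur.fetchall():
    print("%s  last status change %s" % (jobid, last_change))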
https://ggus.eu/ws/ticket_info.php?ticket=85967 (10/9)
QM failing ops Apel tests. Chris ticketed apel support for help (https://ggus.eu/ws/ticket_info.php?ticket=84326), but isn't having much luck due to the sheer size of their DB, and progress was interrupted by GridPP last week. Hopefully this problem will be cracked this week. On hold (21/9)
ECDF
https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)
Poor atlas sonar rates between BNL and ECDF. Waiting on moving disk servers to new switches and other general network wizardry scheduled for this week. On hold till then (28/9).
CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=86108 (14/9)
Duncan noticed a WAN bandwidth asymmetry at Cambridge. John contacted the local networking guys, who've investigated and found nothing. Still in progress (26/9)
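If the networking guys found nothing, a back-to-back throughput test in each direction should at least show whether the asymmetry lives in the path or in the endpoints. iperf is the obvious tool, but for completeness here's a self-contained sketch (port and transfer sizes are arbitrary choices): run it with no arguments on one end as the receiver, with the receiver's hostname on the other end as the sender, then swap the roles and compare the rates.

#!/usr/bin/env python
# Minimal one-way throughput probe for chasing asymmetric WAN rates.
import socket
import sys
import time

PORT = 5001
CHUNK = 64 * 1024          # 64 KiB writes
TOTAL = 256 * 1024 * 1024  # send 256 MiB per test

def server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, addr = srv.accept()
    got, start = 0, time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        got += len(data)
    secs = time.time() - start
    print("received %d MiB in %.1fs = %.1f Mbit/s"
          % (got / 2**20, secs, got * 8 / secs / 1e6))

def client(host):
    s = socket.create_connection((host, PORT))
    payload = b"x" * CHUNK
    sent, start = 0, time.time()
    while sent < TOTAL:
        s.sendall(payload)
        sent += CHUNK
    s.close()
    print("sent %d MiB in %.1fs" % (sent / 2**20, time.time() - start))

if len(sys.argv) > 1:
    client(sys.argv[1])
else:
    server()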
LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)
ilc were having trouble submitting jobs to one of Lancaster's CEs. Robin tracked the issues to high disk IO load, and we're figuring out some ways of mitigating these problems. In progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=84583 (26/7)
lhcb jobs failing on a Lancaster CE, originally due to a pool account misconfiguration. The problem has been fixed (probably...) but files don't seem to be being staged in for lhcb, and there are no errors (or any mention of lhcb at all) in the gridftp logs. Debugging is not being helped by the load issues documented above. In progress (27/9)
https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)
t2k.org transfers from RAL to Lancaster timing out. We hoped the gateway upgrade would improve things, but we were disappointed. Back to the network investigation. In progress (1/10)
RALPP
https://ggus.eu/ws/ticket_info.php?ticket=85019 (9/8)
ILC had some adventures due to VO misconfiguration at RALPP, but it looks like things are fixed and the ticket can be closed now. In progress (1/10)
SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
Emyr wondered last week if this was the longest ticket ever? Sadly I doubt it! The baton has, oddly enough, passed to Lancaster, as we've come across a bizarre problem whereby communication from the Sussex cream CE (and only the cream CE) is being refused by machines on a specific Lancaster subnet. Sadly this is the subnet where the Lancaster nagios box is sitting. We've ruled out firewalls and had the network chaps on both sides take a look. Traffic is being stopped at the Lancaster end, but by the servers themselves (not the network gateways). I'm currently investigating to see if there's any oddity with our network settings. In progress (26/9)
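To pin down exactly which hosts are doing the refusing, a systematic connect test run from the Sussex CE against a list of Lancaster machines (some on the awkward subnet, some off it) would help separate "silently dropped" from "actively refused". A throwaway sketch - the hostnames and port are placeholders:

#!/usr/bin/env python
# TCP connect probe: run on the Sussex CE against a list of targets
# and report timeout (dropped) versus connection refused (rejected).
import socket

TARGETS = ["nagios.example.lancs.ac.uk",   # on the problem subnet
           "se01.example.lancs.ac.uk"]     # off it, for comparison
PORT = 443
TIMEOUT = 5.0

for host in TARGETS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(TIMEOUT)
    try:
        s.connect((host, PORT))
        print("%s:%d OK" % (host, PORT))
    except socket.timeout:
        print("%s:%d timed out (silently dropped?)" % (host, PORT))
    except socket.error as e:
        print("%s:%d error: %s" % (host, PORT, e))
    finally:
        s.close()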
Ticket of Interest:
https://ggus.eu/tech/ticket_show.php?ticket=85970
As mentioned above, the ticket documenting the EMI2 cream database "reset" problems.
Solved Tickets
Ran out of time for these, but I notice that most of the glite 3.1 tickets are closed and the neurogrid VO has taken off. Good stuff!