Monday 2nd November 2015, 13.30 GMT
22 open UK tickets this week. It's the first Monday of the month, so all the tickets get looked at, however run of the mill they are.
First, the link to all the UK tickets.
SUSSEX
116915 (14/10)
Low availability Ops ticket. On hold whilst the numbers soothe themselves. On hold (23/10)
116865 (12/10)
Sno+ job submission failures. Not much on this ticket since it was set In Progress. Looks like an argus problem. How are things at Sussex before Matt RB moves on? (We'll miss you, Matt!) In progress (20/10)
RALPP
117261 (28/10)
Atlas jobs failing with stage out failures. Federico notes that the failures are due to odd errors ("file already existing") and that things seem to be calming down. He's at a loss as to what RALPP can do. Checking the panda link suggests the errors are still there today. Waiting for reply (29/10)
BRISTOL
116775 (6/10)
Bristol's CMS glexec ticket. It looks like the solution is to have more cms pool accounts (which of course requires time to deploy). In progress (28/10)
117303 (30/10)
CMS, not Highlander fans, don't seem to believe that There can be only One (glexec ticket). Poor old Bristol seem to be playing whack-a-mole with duplicate tickets. Is there a note that can be left somewhere to stop this happening? Assigned (30/10)
ECDF
95303 (Long long ago)
Edinburgh's (and indeed Scotgrid's) only ticket is this tarball glexec ticket. A bit more on this later. On hold (18/5)
SHEFFIELD
114460 (18/6)
Gridpp (and other) VO pilot roles at Sheffield. No news for a while; snoplus are now trying to use pilot roles for dirac, so this is becoming very relevant. In progress (9/10)
116560 (30/9)
Sno+ jobs failing, likely due to too many being submitted to the 10 slots that Sno+ has. Maybe a WMS scheduling problem - Stephen B has given advice. A few weeks ago Elena asked whether the problem persisted. Waiting for reply (12/10)
116967 (17/10)
A ROD availability ticket, on hold as per SOP. On hold (20/10)
LANCASTER
116478 (28/9)
Another availability ticket. Autumn was not kind to many of us! On hold (8/10)
116882 (13/10)
Enabling pilot snoplus users at Lancaster. Shouldn't have been a problem, but turned into a bit of a comedy/tragedy of errors by yours truly mucking up. Hopefully fixed now - thanks to Daniela for her patience. In progress (2/11)
95299 (Far far away)
glexec tarball ticket. There's been a lot of communication with the glexec devs about this - hopefully the last hurdle is sorting out the RPATHs for the libraries. It's not a small hurdle though... On hold (2/11)
QMUL
117151 (23/10)
A ticket about jumbo frame problems, submitted to QM. After Dan provided some education the user replied that he only sees this problem at two atlas sites, but he is contacting the network admins at his institution to see if it is their end. On hold (29/10)
117011 (19/10)
ROD ticket for glue-validate errors. Went away for a while after Dan re-yaimed his site bdii, but possibly back again. Daniela suggests re-running the glue-validate test. Reopened (2/11)
116689 (6/10)
Another ROD ticket, where Ops glexec test jobs are seemingly timing out for QM (this is the ticket Daniela mentioned on the ops mailing list). Dan noted that with the cluster half full tests were passing, suggesting some kind of load correlation (but as he also notes - what's getting loaded and causing the problem - Batch, CE or WNs?). Kashif reckons the argus server, and suggests a handy glexec time test which he posted. In progress (2/11)
BRUNEL
117324 (2/11)
A fresh looking ROD ticket - Raul had to restart the BDII and hopefully that got it. In progress (2/11)
100IT
116358 (22/9)
Missing Image at 100IT. 100IT have asked for more details, no news since. Waiting for reply (19/10)
THE TIER 1
116866 (12/10)
Snoplus pilot enablement (not actually a word) at the Tier 1. New accounts were being requested after some internal discussion. On hold (19/10)
116864 (12/10)
CMS AAA tests failing (the submitter notes "again..."). There are some oddities with other sites, which might be remote problems, but Andrew notes that previous manual fixes have been overwritten, which likely explains why the problems came back. In progress (does it need to be waiting for a reply?) (26/10)
117171 (24/10)
LHCB had problems with an arc CE that was misbehaving for everyone. Things were fixed, and this ticket can now be closed. Waiting for reply (can be closed) (27/10)
117277 (30/10)
Atlas have spotted "bring online timeout has been exceeded" errors. This appears to be a mixture of problems adding up, such as a number of broken disk nodes and heavy write access by atlas. In progress (2/11)
117248 (28/10)
I believe this is related to the discussion on tb-support: the ticket asks for new SRM host certs, meeting the specified requirements, to be requested for the RAL SRMs. Jens was on it, and the new certs are ready to be deployed. In progress (30/10)
Other VO Nagios - some badness at Sussex, but they have a ticket open for that.