Past Ticket Bulletins 2015

Monday 2nd November 2015, 13.30 GMT
22 Open UK Tickets this week. First Monday of the Month, so all the tickets get looked at, however run of the mill they are.

First, the link to all the UK tickets.

116915 (14/10)
Low availability Ops ticket. On hold whilst the numbers soothe themselves. On Hold (23/10)

116865 (12/10)
Sno+ job submission failures. Not much on this ticket since it was set In Progress. Looks like an argus problem. How go things at Sussex before Matt RB moves on? (We'll miss you Matt!). In progress (20/10)

117261 (28/10)
Atlas jobs failing with stage out failures. Federico notices that the failures are due to odd errors - "file already existing" - and that things seem to be calming themselves. He's at a loss as to what RALPP can do. Checking the panda link suggests the errors are still there today. Waiting for reply (29/10)

116775 (6/10)
Bristol's CMS glexec ticket. It looks like the solution is to have more cms pool accounts (which of course requires time to deploy). In progress (28/10)

117303 (30/10)
CMS, not Highlander fans, don't seem to believe that There can be only One (glexec ticket). Poor old Bristol seem to be playing whack-a-mole with duplicate tickets. Is there a note that can be left somewhere to stop this happening? Assigned (30/10)

95303 (Long long ago)
Edinburgh's (and indeed Scotgrid's) only ticket is this tarball glexec ticket. A bit more on this later. On hold (18/5)

114460 (18/6)
Gridpp (and other) VO pilot roles at Sheffield. No news for a while, but snoplus are trying to use pilot roles now for dirac so this is becoming very relevant. In progress (9/10)

116560 (30/9)
Sno+ jobs failing, likely due to too many being submitted to the 10 slots that Sno+ has. Maybe a WMS scheduling problem - Stephen B has given advice. Elena asked if the problem persisted a few weeks ago. Waiting for reply (12/10)

116967 (17/10)
A ROD availability ticket, on hold as per SOP. On hold (20/10)

116478 (28/9)
Another availability ticket. Autumn was not kind to many of us! On hold (8/10)

116882 (13/10)
Enabling pilot snoplus users at Lancaster. Shouldn't have been a problem, but turned into a bit of a comedy/tragedy of errors by yours truly mucking up. Hopefully fixed now - thanks to Daniela for her patience. In progress (2/11)

95299 (Far far away)
glexec tarball ticket. There's been a lot of communication with the glexec devs about this - the hopefully last hurdle is sorting out the RPATHs for the libraries. It's not a small hurdle though... On hold (2/11)

117151 (23/10)
A ticket about jumbo frame problems, submitted to QM. After Dan provided some education the user replied that he only sees this problem at two atlas sites. But he is contacting the network admins at his institution to see if it is their end. On hold (29/10)

117011 (19/10)
ROD ticket for glue-validate errors. Went away for a while after Dan re-yaimed his site bdii, but possibly back again. Daniela suggests re-running the glue-validate test. Reopened (2/11)

116689 (6/10)
Another ROD ticket, where Ops glexec test jobs are seemingly timing out for QM (this is the ticket Daniela mentioned on the ops mailing list). Dan noted that with the cluster half full tests were passing, suggesting some kind of load correlation (but as he also notes - what's getting loaded and causing the problem - Batch, CE or WNs?). Kashif reckons the argus server, and suggests a handy glexec time test which he posted. In progress (2/11)

117324 (2/11)
A fresh looking ROD ticket - Raul had to restart the BDII and hopefully that got it. In progress (2/11)

116358 (22/9)
Missing Image at 100IT. 100IT have asked for more details, no news since. Waiting for reply (19/10)

116866 (12/10)
Snoplus pilot enablement (not actually a word) at the Tier 1. New accounts were being requested after some internal discussion. On hold (19/10)

116864 (12/10)
CMS AAA tests failing (the submitter notes "again..."). There are some oddities with other sites, which might be remote problems, but Andrew notes that previous manual fixes have been overwritten which likely explains why problems came back. In progress (does it need to be waiting for a reply?) (26/10)

117171 (24/10)
LHCB had problems with an arc CE that was misbehaving for everyone. Things were fixed, and this ticket can now be closed. Waiting for reply (can be closed) (27/10)

117277 (30/10)
Atlas have spotted "bring online timeout has been exceeded" errors. This appears to be a mixture of problems adding up, such as a number of broken disk nodes and heavy write access by atlas. In progress (2/11)

117248 (28/10)
Related, I believe, to the discussion on tb-support: this ticket asks for new SRM host certs, meeting the requirements specified, to be requested for the RAL SRMs. Jens was on it, and the new certs are ready to be deployed. In progress (30/10)

Other VO Nagios - some badness at Sussex, but they have a ticket open for that.

Monday 26th October 2015, 15.30 GMT
26 Open UK Tickets this week. Not many seem all that exciting though.

The link to all the UK tickets.

The few (two) tickets that really caught my eye are:

117151 (23/10)
This ticket is quite interesting, mainly for Dan schooling the submitter. QM received a ticket complaining that their jumbo frames were breaking stuff - it doesn't look like the problem is at QM though. Naught wrong with the ticket handling. Waiting for reply (23/10)

117065 (20/10)
Bristol have a CMS glexec error ticket that looks very similar to an existing one (116683), which is in turn spookily similar to a ticket being worked on by the Bristol admins (116775). At the very least I would say that if the two problems are different one would likely obfuscate the other. Is this a case of over-keen shifters submitting tickets without checking? I'd be tempted to close one or both of 116683 and this one (117065). Tell them I said it was okay to[1]. In progress (21/10)

[1] It probably is okay to.

There are also 4 availability tickets, all On Hold waiting for 30 days or so to pass.

Other VO Nagios looks clean at time of writing.

Monday 19th October 2015, 14.30 BST
28 Open UK Tickets today.

116920 (14/10)
UCL have an availability ticket, and Andrew M wonders what can be done for them as a VAC site to stop getting this sort of ticket? Assigned (15/10) Update - Alessandra has updated and On-Holded the ticket. Thanks!

116865 (12/10)
116915 (14/10)
Sussex have a Sno+ ticket and a ROD ticket that don't seem to have been looked at since they were submitted last week. Both just Assigned.

116918 (14/10)
Another ROD ticket (Invalid glue), I think this one has snuck past the Liver-Lad's watch. Assigned (14/10)

116782 (7/10)
Another ROD ticket. Dan looks like he's tracked down why one of his CEs is misbehaving (MaxStartups in sshd_config). Looks like this ticket at least can be closed, as Gareth confirms that the test is green. Waiting for reply (19/10)
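
For anyone unfamiliar with the setting: MaxStartups takes the form start:rate:full and caps the number of concurrent unauthenticated connections sshd will hold, refusing new ones at random once the count passes "start". The ticket summary doesn't say which values were in play at QM, so the sketch below just uses the stock OpenSSH default (10:30:100). The function name is made up for illustration, and the real daemon uses integer arithmetic, so treat the numbers as approximate.

```python
def refusal_probability(pending, start=10, rate=30, full=100):
    """Approximate chance (0.0-1.0) that sshd refuses a new connection.

    `pending` is the number of unauthenticated connections already open;
    start/rate/full are the three fields of MaxStartups (defaults assumed).
    """
    if pending < start:
        return 0.0
    if pending >= full or rate >= 100:
        return 1.0
    # Linear ramp from rate% at `start` up to 100% at `full`.
    return (rate + (100 - rate) * (pending - start) / (full - start)) / 100.0


if __name__ == "__main__":
    for pending in (5, 10, 55, 100):
        print(pending, refusal_probability(pending))
```

With the default, nothing is refused below 10 pending connections, and roughly a third are refused as soon as that threshold is hit, which is enough to make monitoring tests flap on a busy host.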

116560 (30/9)
Stephen B has added some more information to try to figure out why Sno+ jobs are flooding Sheffield's 10 Sno+ slots. Elena is not sure if the problem persists though. Waiting for reply (12/10)

116864 (12/10)
It looks like this CMS AAA test problem has resolved itself - Federica asks if you chaps at RAL changed anything? Looks like the ticket can be closed. In progress (15/10)

116866 (12/10)
Enabling Sno+ pilots at the Tier 1. Only the LHC VOs had pilot roles enabled at the Tier 1, Andrew was going to discuss how best to make these changes. As with a similar issue at Lancaster - probably best to do it for all the VOs that will be using Dirac. In progress (13/10)

Monday 12th October 2015, 14.30 BST
23 Open UK Tickets this week. Just a light review.

116752 (1/10)
Oxford's CMS Phedex renewal ritual ticket. Chris B kindly offered in his ticket for RALPP to officiate this (un)holy task for Oxford, so I advise some communication between you chaps and him - which may well be going on, but it ain't in the ticket! Assigned (6/10)

116683 (5/10)
116775 (6/10)
CMS have by the looks of it thrown a pair of duplicate tickets Bristol's way. How rude! Lukasz has rightfully suggested closing one of them (I suggest 116683. Winnie's put a good reply in t'other one). The underlying problem appears to be a shortage of pool accounts - what is the recommended number of accounts for VOs these days? In progress.

114460 (18/6)
Sheffield pilot role tickets. Daniela has pointed out that as Sno+ have started using Sheffield in earnest they really could do with pilot roles enabled. In progress (9/10)

116812 (8/10)
LHCB asked Andy to clean up the $LOGIN_POST_SCRIPT for LHCB at his site, removing an 'export CMTEXTRATAGS="host-slc5"' line. Naught wrong with the ticket, but I liked how lhcb did some good debugging, and there's something about a profile script problem that reminds me of a simpler time... In progress (9/10)

A link to the rest of the tickets for completeness.

The other VO nagios.
Looks okay, some (known I think) errors at QM, and Sheffield seems to have a few SRM errors going on since Saturday.

Monday 5th October 2015, 14.15 BST

22 Open UK Tickets this month, all of them, Site by Site:

116136 (9/9)
Sussex got a snoplus ticket for a high number of job failures, although simple test jobs ran okay. Matt asks if the problem persists, the reply was a resounding "not sure". In progress (think about closing) (21/9)

A ticket from CMS, about some important Phedex ritual that must occur on the 3rd of November, when the stars are right. The ticket needs some confirmation and feedback, plus the nomination of one site acolyte to receive the DBParam secrets from CMS - but the ticket only got assigned to sites this morning. Assigned (5/10)

Same as the RALPP ticket, Winnie has volunteered Dr Kreczko for the task. In progress (5/10)

95303 (Long, long ago)
glexec ticket. On hold (18/5)

Atlas ticket asking Durham to delete all files outside of the datadisk path. Oliver asks what this means for the other tokens (I think they can be sacrificed to feed datadisk, but Brian et al can confirm that). Waiting for reply (5/10)

Sno+ jobs having trouble at Sheffield. Looks like a proxy going stale problem as only 10 Sno+ jobs at a time can run at Sheffield. Matt M asks if/how the WMS can be notified to stop sending jobs in such a case. In progress (30/9)
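
As a back-of-envelope illustration of why a small slot count can turn into stale proxies: if jobs run in waves of 10, the later waves may not start until after the submitting proxy has expired. The sketch below assumes every job takes the same time and that the proxy is never renewed; the function name and all the numbers are hypothetical, not taken from the ticket.

```python
def jobs_starting_after_expiry(n_jobs, slots, job_hours, proxy_hours):
    """Count queued jobs whose turn to start comes after the proxy expires.

    Assumes identical job lengths, a fixed number of slots and no proxy
    renewal - a deliberate simplification.
    """
    late = 0
    for i in range(n_jobs):
        start_time = (i // slots) * job_hours  # jobs run in waves of `slots`
        if start_time >= proxy_hours:
            late += 1
    return late


if __name__ == "__main__":
    # 100 two-hour jobs, 10 slots, 12-hour proxy (made-up figures).
    print(jobs_starting_after_expiry(100, 10, 2, 12))
```

On those made-up figures the last four waves would reach the front of the queue with a dead proxy, which is the sort of pattern that looks like "a proportion of jobs failing" from the VO side.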

Gridpp Pilot roles. No news on this for a while, after the last attempt seemed to not quite work. In progress (30/7)

Biomed ticketed Manchester with problems from their VO nagios box - which Alessandra points out is due to there being no spare cycles for biomed to run on. Assigned (can be put on hold or closed?) (1/10)

A classic Rod Availability ticket. On Hold (7/9)

LANCASTER (a little embarrassing that my own site has the most tickets)
116478 (28/9)
Another availability ticket, this time for Lancaster (which has been through the wars in September). Still trying to dig our way out, but even the Admin's broke. On hold (5/10)

116676 (5/10)
Another ROD ticket, Lancaster's not quite out of the woods. We think WMS access is somewhat broken. We have no idea about the sha2 error. In progress (5/10)

116366 (22/9)
Sno+ spotted malloc errors at Lancaster. The problems seemed to survive one batch of fixes, but I asked again if they still see problems after running a good number of jobs over the weekend. Waiting for reply (5/10)

95299 (In a galaxy far, far away)
glexec ticket. This was supposed to be done last week, after I had figured out "the formula" - but then last week happened. On hold (5/10)

115959 (31/8)
LHCB job errors at QM, with a 70% pilot failure rate on ce05. Dan couldn't see where things are breaking (only that the CE wasn't publishing to APEL - and he asks if this is the cause of the problem). Waiting for reply (5/10)

116662 (5/10)
LHCB job failures on ce05 - almost certainly a duplicate of 115959, but it might have some useful information in it. Assigned (probably can be closed as a duplicate) (5/10)

116650 (1/10)
Imperial's invitation to the CMS Phedex DBParam ritual. Daniela's on it, as well as the other CMS sites. On hold (5/10)

116649 (1/10)
Brunel's ticket for the great DBParam alignment of 2015. On hold (5/10)

116455 (28/9)
A CMS request to change the xrootd monitoring configs. Did you get round to doing this last week Raul? In progress (29/9)

115448 (3/8)
Biomed having trouble tagging the jet CE. The Jet admins think this is the same underlying issue as their other ticket 115496. In progress (25/9)

115496 (5/8)
Biomed unable to remove files from the jet SE. There are clues that suggest that some dns oddness is the cause, but it's not clear. In progress (18/9)

116358 (22/9)
Ticket complaining about a missing image at the site. Some to and fro, the ball is back in the site's court. In progress (2/10)

116618 (1/10)
The Tier 1's CMS DBParam ritual ticket. In progress (5/10)

Let me know if I missed ought.

At time of writing things look a bit rough at QM, Liverpool (just getting over their downtime) and for Sno+ at Sheffield (likely related to their ticket).

Matt's on holiday until the 29th, so he's being replaced with links or any update Jeremy is kind enough to provide.



"Normal" service will resume in October. I'll leave y'all anticipating that lovely review of all the tickets.

Monday 14th September 2015, 15.00 BST

Yet another brief ticket review - it's been a tad busy! Hopefully normal service will resume, err, next month. My apologies.

Only 20 Open UK tickets this week.

115805 - This RAL ticket from Sno+ looks like it can be closed, with there not actually being a problem with the site, or a feature that RAL would really want to implement.

Keeping on the Tier 1, does this ticket concerning the FTS3 certificates (115290) have anyone looking into it yet?

Are these two QMUL LHCB tickets duplicates, or related? 115959 & the more recent 116153

This Sussex ticket looks like it could do with someone taking a look - still just "assigned" since the 9th - 116136

Finally, an interesting VAC ticket for Oxford - "no space left on device" for atlas jobs - 116123. Bit of an odd one!

Monday 7th September 2015, 15.00 BST
20 Open UK Tickets this week. Due to Interesting Downtimes (apologies for reusing that pun!) yet another fairly light review. But not much is going on in ticketland.

The Sno+ ticket 115805 is interesting, Sno+ are looking at monitoring jobs submitted via WMS through the port 9000 URLS, but the RAL WMS behaves differently from the Glasgow one and doesn't let others with the same roles look at the links. Sno+ are still "developing" their grid infrastructure with the WMS in mind by the looks of it.

The pilot role at Sheffield ticket 114460 could still do with some attention, or an update.

Bristol had a ticket from CMS that looks interesting - 115883. The CMS SAM3 tests are confused by Bristol having an SRM-less SE endpoint. Waiting for reply after things were clarified.

T'Other VO Nagios Page
A few "The CREAM service cannot accept jobs at the moment" style errors at QM, but they're only a few hours old. Otherwise looking alright beyond the usual noise.

Of course with these light reviews I could well be missing something, so feel free to let me know - sites or VO representatives.

Monday 24th August 2015, 15.45 BST

21 Open UK tickets this week, most of which are being looked at or are understood. There will be no ticket update from Matt next week (1st September) either, as he will be flapping about during a local downtime.

114460 (gridpp pilots at Sheffield) could do with an update. The Liverpool ticket discussed last week (114248) has received an update from the user saying that the ticket can be closed. As Steve mentioned last week the underlying issue is still very much there, but I don't think this ticket is a suitable banner for us to fight that battle under.

T'Other VO Nagios Page
As usual nothing to see here at time of writing, sites are doing a grand job of working for the monitored VOs.

And that's all folks! See you in Liverpool.

Monday 17th August 2015, 15.00 BST
29 Open UK Tickets this week.

114992 (10/7)
It looks like this CMS transfer problem ticket can be closed after a user update last week, which reported that enabling multiple streams solved the initial failures. In progress (13/8)

115613 (assigned) Update- looks like this was a temporary problem, and the ticket can be closed.
115448 (in progress, but empty)
115496 (in progress, but might not be a site problem).
Jet have 3 biomed tickets, 2 of which are looking a little neglected.

115655 (12/8)
John rightfully asks why a lengthy but very scheduled downtime sets off a ROD alarm. Other than that it'll be worth "on-holding" this ticket whilst the unrighteous red flag fades. In progress (17/8) Update - On holded

114248 (10/6)
This Sno+ ticket looks like it needs some chasing up, no news for nearly 2 months. In progress (21/7) Update - Steve commented on this on TB-SUPPORT

Monday 10th August 2015, 15.00 BST

T'Other VO Nagios looks alright at time of writing.

24 Open UK tickets this week.

Lots of activity on GGUS since yesterday. Most positive, but I see the number of tickets at EFDA-JET increasing - all three from biomed.

Tier 1
115512 (5/8)
An interesting ticket - where a banned user is still banned after moving to LHCB from Biomed (no, he's not called Heinz). In Progress (6/8) Update - Andrew has asked that the ticket be reassigned to the argus devs as the pepd logs are showing oddness. Waiting for reply now.

115565 (7/8)
Bristol's phedex agents are down, and have been for a few days. I might have dreamt this, but thought that the Bristol Phedex service might not be hosted at Bristol, especially after RALPP had a similar ticket (115566) at the same time. Assigned (7/8) Update - solved

115504 (ROD Ticket) Solved
115399 (Wiki Ticket) Still Open
Both these tickets look like they can be closed.

115525 (5/8)
Atlas deletion errors after a disk server fell over - nothing wrong with the ticket handling, but Alessandra brings up a point that always niggles me - the emphasis on the total number of transaction errors and not the number of affected unique files. In progress (8/8)
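
To make the distinction concrete: the same file failing on every retry inflates the error count without any more data being affected. A minimal sketch, assuming you can get hold of a list with one entry per failed attempt (the function name and paths are invented for illustration):

```python
from collections import Counter


def summarise_errors(failed_paths):
    """Return (total failed attempts, number of distinct files affected)."""
    per_file = Counter(failed_paths)
    return sum(per_file.values()), len(per_file)


if __name__ == "__main__":
    # Hypothetical failure list: one file retried three times, another once.
    attempts = [
        "/atlas/datadisk/file_a",
        "/atlas/datadisk/file_a",
        "/atlas/datadisk/file_a",
        "/atlas/datadisk/file_b",
    ]
    total, unique = summarise_errors(attempts)
    print(f"{total} errors across {unique} unique files")
```

Quoting both numbers in a ticket gives the site a much better idea of how bad things actually are.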

114573 (23/6)
LHCB ticket sparked by those IPv6 problems. Still no word from Vladimir; Raja - could you comment? I suspect there's been plenty of room for LHCB jobs in QM's (and everyone else who mainlines atlas jobs) queues this weekend. Waiting for reply (21/7)

Monday 3rd August 2015, 14.30 BST

23 Open tickets this month, full review time.

Newish this morning
As discussed on TB-SUPPORT, a few sites have been getting "I can't lcg-tag your CE" tickets from Biomed. The Liverpool ticket 115449, solved and verified, was the flagship of these issues. Brunel (115445) and EFDA-JET (115448) also have tickets about this.

Sno+ "glite-wms-job-status warning" (3/8)
Glasgow: 115435
Tier 1: 115434
Matt M submitted these tickets to Glasgow and the Tier 1 after having trouble with a proportion of Sno+ jobs. Both are being looked at- definitely worth collaborating on this one.

Jeremy noticed that the wiki didn't work for him on Friday - but it seems to work for Jeremy, Alessandra and myself now. As Jeremy notes the ticket can be closed, but out of interest did anyone else spot any problems? In progress (3/8)

Spare the ROD...
115433 (3/8)
Some CE problems noticed on the dashboard for the Liver-lads - who might be in mourning. Assigned (3/8) Update - aaannd Solved by upping gridftp max connections

Both of these are "availability" alarm tickets, on-holding until they clear. I hope RALPP managed to get a re-computation for their unfair failures (IGTF-1.65 problems on ARC).

Sno+'d Under
115387 (30/7)
I'm uncertain if this Sno+ ticket, probably somewhat related to Matt M's recent thread on TB-SUPPORT and concerning xrootd access, is meant for the Tier 1 or RALPP. Assigned (3/8)

First Tier Problems.
115417 (2/8)
LHCB spotted a number of nodes with cvmfs problems at the Tier 1, which the RAL team had already jumped on and repaired this morning. They wonder if the problem persists. Waiting for reply (3/8)

115290 (28/7)
An FTS problem requiring some special CA magic to solve, but the current CA-wizard isn't about. On hold (29/7)

113836 (20/5)
Glue 1 vs Glue 2 queue mismatches. Work is under way on perfecting cluster publishing for ARC CEs, but the ticket could either do with an update or on-holding. In progress (24/6)

114992 (10/7)
CMS transfers failing between RAL and, err, TAMU in the US. Assigned to RAL, where Brian has been investigating and Andrew has posed a good question, asking if the user has considered managing the transfers with FTS. Quiet on the user side. In progress (21/7)

108944 (1/10/2014)
One of the tickets from the before times, about CMS AAA access tests. It has become a long and confusing saga, but Gareth rescued it with a handy summary of the issue in his last update. How goes the battle? In progress (17/7)

Oxford Squid is red.
115230 (24/7)
Which might be the colour Ewan's seeing right now! The ticket is reopened, with a comment from Alessandra that the current recommendation is to allow all CERN addresses, and asks if this is something Oxford could do. Reopened (3/8) Update - solved after a squid restart. It's not the end of this ordeal, but Ewan would like to tackle it in a different arena than an Oxford ticket.

Mavericks and Gooses - GridPP Pilot roles.
114485 Bristol
114460 Sheffield
114442 RALPP
114441 RHUL
Daniela hopped right back into the pilot seat after getting back from her holidays. Bristol and RALPP are looking good, Sheffield and RHUL are still in the Danger Zone - RHUL in particular were having troubles with argus and could do with some working configs from elsewhere to compare and contrast with their own.
Update - RHUL are looking better, just a few queue permission tweaks to go by the looks of it.

My shame - tarball glexec
95303 ECDF
95299 Lancaster
The tarball glexec tickets. Actually this is likely to become a defunct (or at least different) problem at Edinburgh with their SL7 move. Lancaster has a plan - we plan to deploy *something* (amazing plan there Matt) during our next big reinstall in September. Between now and then I have a test CE, cluster and most importantly some time.

Pot-luck tickets (or those I couldn't group).

114381 (16/6)
A tiny fraction of jobs publishing 0 cores used. Looks to be a slurm oddity. Oliver upgraded their CEs to ARC5 last week and hopes this has fixed things. Fingers crossed! In progress (29/7)

100566 (27/1/2014)
My other shame - Lancaster's poor perfsonar performance. It's being worked on. On Hold (should be back in progress soon) (30/7)

114248 (10/6)
Sno+ production problems at Liverpool, probably due to a lack of space in the shared area. Things are back in Sno+'s court, with the submitter consulting the Sno+ gurus (I think). In progress (21/7)

114573 (23/6)
LHCB job submission problems due to the known dual-stacking problems. Waiting for input from LHCB for a while now, as things look okay at QM now but at last check LHCB jobs still weren't running for some reason. Waiting for reply (21/7)

That's all folks!

Monday 27th July 2015, 16.10 BST

Only 20 UK Tickets this week, and many are on hold for summer holidays. I pruned a few tickets, but none are striking me as needing urgent action, so this will be brief.

Mandatory UK GGUS Link
Nothing to see here really, but maybe I'm missing something? Nit-picking I see:

114381 (publishing problems at Durham) could still do with an update - not sure if work is progressing offline on the issue.

The Snoplus ticket 115165 looks like it might be of interest for others - in it Matt M asks about tape-functionality in gfal2 tools. Brian has updated the ticket clueing us in about gfal-xattr.

UK T'other VO Nagios
A few failures here at time of writing - although only one at Brunel seems to be more than a few hours old (it is failing pheno CE tests).

Let me know if I missed ought!

Monday 20th July 2015, 14.30 BST
27 Open UK Tickets this week.

115113 (17/7)
This Brunel Ops ticket (otherwise okay) is a good reminder that when removing CEs from the GOCDB for your site make sure to get all the services, or else you just might end up still monitored (and thus end up ticketed!). Reopened (can probably be closed soon) (20/7) Update - ticket solved

114006 (31/5)
This ticket about problems with Brunel's accounting figures was looking promisingly close to being solved (at least to my layman's eyes) 3 weeks ago. Any word offline? In progress (30/6)

114845 (6/7)
LHCB jobs were failing at Lancaster a fortnight ago, but things should have been fixed quite promptly. Are they still broken? Waiting for reply (14/7)

114573 (23/6)
A similar case for this LHCB ticket for QM (part of their ongoing dual-stack saga). Waiting for reply (13/7)

Note: The related issue 115017 has been "solved" pending further testing.

114649 (26/6)
This Sno+ ticket has been in "Waiting for Reply" for a little while, no word from the user (who isn't Matt M). Could we poke Sno+ through another channel about this? Waiting for reply (6/7)

114248 (10/6)
We could do with finding out from Sno+ the state of play at Liverpool too, although we might have to wait until Steve is back from his hols later this week to field any replies. In progress (17/6)

114381 (16/6)
Ticket concerning the small percentage of Durham jobs that aren't publishing their core count (probably a slurm oddity). Now that Oliver's back from holiday has he had time to look at this? On Hold (19/6)

Tier 1 113836 (20/5)
I suggest that whilst the work described chugs along in the background we consider on-holding this ticket. In progress (24/6)

UCL 114746 (30/6)
Ben has been battling getting his DPM working, and spotted an interesting problem where SELinux was blocking the httpd from accessing mysql. It's nice to see someone not just switching SELinux off (like I have a habit of doing...). In progress (20/7)

115003 (12/7)
Andy having some problems on a test SE at ECDF - he seems to be suffering a series of unfortunate errors. Maybe the storage group could help? In progress (17/7)

GridPP Pilot Roles.
Bristol are ready for testing, Govind discovered a possible bug in argus that needs a bit more testing at RHUL. Things seem quiet at RALPP and Sheffield, and Brunel too.

Supplemental - In the region's multicore publishing ticket (114233) only Oxford have a CE still appearing to publish 0 cores - but I thought this CE was Old-Yeller'ed?

Tuesday 14th July 2015, 9.30 BST

Lazy update today, due to some fun and games at Lancaster yesterday.

UK GGUS Tickets

Other VO Nagios

Tickets that pop:

114952 and 114951 are both atlas frontier tickets at RALPP and Oxford, both have been reopened - although the underlying issues seem different. The Oxford ticket is similar to one at RAL (114957), which looks to be caused by an unannounced change in IP for some important atlas squids (AIUI - speed reading the tickets this morning).

QM IPv6 Woes
Followers of the atlas uk lists will have noticed some heroic attempts to diagnose and repair problems at QM which appear to be someone else's fault. LHCB's ticket to QM on the matter: 114573
Dan's ticket concerning the "rogue routes": 115017

GridPP Pilot Roles
Durham, Bristol, Sheffield, Brunel and RHUL still have open tickets about this. Bristol are working on it, as are Durham - Oliver's ready for their setup to be tested again (Puppet overwrote his last changes!). Not much recent news from the other three.

That's all from me folks, let me know if I missed ought!

Monday 6th July 2015, 14.00 BST
30 Open UK Tickets this month. Looking at them all!

114233 (10/6)
The UK not publishing core counts at all sites. Some progress, but at last check John G couldn't see a change for Oxford or Glasgow. In progress (30/6) Update - Glasgow seems to be okay after de-creaming, checking the July list we have t2ce6 at Oxford, ce3 and ce4 at Durham (see their ticket) and cetest02 at IC (but that node has test in its hostname!).

114442 (18/6)
Gridpp Pilot role ticket. Accounts need to be created, but no word for a few weeks. In progress (19/6)

114764 (1/7)
Ticket tracking (false) availability issues, created to appease COD - the problem caused by a broken CA rpm release for Arc CEs. Kashif has created a counter-ticket 114742. Gordon's sage advice is to submit a recalculation request once the issue is fixed. Assigned (1/7)

114485 (19/6)
Bristol's gridpp pilot role ticket. No news, could do with an update really. In progress (22/6)

114426 (18/6)
CMS AAA reading test problems. The Bristol admins have transferred data to their new shiny SE and have asked CMS to test again. No word since. Waiting for reply (30/6)

95303 (1/7/13...)
Tarball glexec ticket, now 2 years old. After a really promising burst the last 6 weeks haven't seen any progress, due to a lot of other "normal" tarball work taking up the time. Sorry! On hold (18/5)

114536 (22/6)
Durham's gridpp pilot role ticket. Not acknowledged yet, is Oliver back yet? Assigned (22/6)

114765 (1/7)
See RALPP ticket 114764. Assigned (1/7)

114727 (30/6)
Catalin ticketed that a number of SW_DIR variables at Durham are still pointing to the old school cvmfs space. Assigned (30/6)

114381 (16/6)
John G ticketed Durham over a small percentage of jobs being published as "zero core". Looks like a SLURM timeout problem, although a fix isn't obvious. Put on the back burner whilst Oliver is on holiday. On Hold (19/6)

114649 (26/6)
A ticket from a Sno+ user about not being able to access software using the Sheffield CEs. Acknowledged but no news. In progress (26/6) Update - Elena can't find anything wrong, cvmfs seems to be working fine. Perhaps a problem with the environment?

114460 (18/6)
Sheffield's gridpp pilot role ticket. Did you get round to rolling them out? In progress (19/6)

114444 (18/6)
LHCB ticket concerning the DPM's SRM not returning checksum information. On hold whilst a related ticket is being looked at (111403). On Hold (22/6)

114248 (10/6)
Another Sno+ ticket, about grid production jobs failing at Liverpool. AIUI caused by Sno+ running out of space on the shared pool. At last check Steve posted the usage information for Sno+ but no word since (and Steve's off on his hols). In progress (17/6)

114845 (6/7)
LHCB pilots failing at Lancaster. Looks like a simple node misconfiguration, hopefully fixed, waiting to see if it is. On hold (6/7)

95299 (1/7/2013)
glexec ticket - see Edinburgh description. On hold (15/5)

100566 (27/1)
Bad bandwidth performance at Lancaster. Hoping that IPv6 will shake things up a bit so pushing that. On hold (18/5)

114746 (30/6)
SRM-put failures ROD ticket. No news at all. Assigned (30/6)

114851 (6/7)
Low availability ROD ticket, related to above. Assigned (6/7)

114441 (18/6)
Another GridPP pilot role ticket. Pilots rolled out, but something isn't quite right and they're not working - Govind is looking again. In progress (6/7)

114573 (23/6)
LHCB ticket about two out of three QM CEs not responding for them. Dan spotted the broken CEs were dual-stacked, the working one wasn't. The ticket seemed to have trailed off into some confusion over who needs to do some testing where. I agree with Dan that the "who" needs to be someone with LHCB credentials! The waters still seem muddied. In progress (1/7)

114737 (30/6)
The IC voms wasn't updating properly, due to what I infer from the ticket as "SSL/mysql madness". Simon and Robert have been heroically battling this one - it's a good read. On hold (3/7)

114379 (16/6)
Sam's ticket about SE support in Dirac. Sam will shortly try testing things out on the new Dirac to see how it fares. In progress (6/7)

114447 (18/6)
Brunel's gridpp pilot ticket. Being worked on, with one CE with the pilots enabled. In progress (26/6)

114006 (31/5)
A ticket from APEL, about Brunel under-reporting the number of jobs they are doing. Turned out to be a problem with Arc, which Raul upgraded to the fixed 5.0 version. The APEL team deleted the sync records, but no word since. In progress (30/6)

114850 (6/7)
Another APEL ticket, likely the fallout of the previous one - it looks like GAP publishing has been left on for the Brunel CREAM CEs. Assigned (6/7) Update - solved

114786 (2/7)
Low availability ticket - see RALPP ticket 114442 - probably could do with On holding. In progress (2/7) Update - Onholded

113910 (26/5)
Sno+ data staging problems. Brian gave some advice on how the large VOs do data staging from tape, and has asked if Sno+ still has problems. Matt M might still be on leave though. Waiting for reply (23/6)

108944 (1/10/14)
CMS AAA problems, which eventually brought to light a problem with super-hot datasets, which was alleviated (I think). Despite an update to castor that improved performance, the last batch of tests didn't show improved results. No news since. In progress (17/6)

113836 (20/5)
Glue mismatch problems at RAL. Working on getting "many-Arcs" to correctly publish. In progress (24/6)

Monday 29th June 2015, 14.30 BST

Looking at the "Other VO" Nagios.
Things look generally alright - but Durham look like they need to update their CA rpms - but that might have to wait until Oliver is back from leave.

I don't think this affects many, but there was a ticket to produce a new version of the WN tarball (which is done): 114574
Although AFAICS there is no urgent need to upgrade tarball WNs.

26 UK Tickets, although not many stand out.

Gridpp VO Pilot tickets:
Largely doing alright. With Oliver away the Durham ticket hasn't been looked at yet. Sheffield and Bristol's tickets could do with an update (or on-holding if there's going to be a delay). The RHUL ticket has been reopened as their deployment of the pilot roles hasn't quite worked out.

114573 (23/6)
LHCB having trouble with two out of three QM CEs. Dan notes that the two "broken" CEs have been recently dual-stacked, and asks if this could be the problem. The answer is a resounding "maybe", and Raja asks if problems could be duplicated by others using lxplus. Waiting for reply (24/6)

114379 (16/6)
Sam's ticket trying to get SE support with Dirac, spruced up. Daniela has asked if the tests can be redone with the "new" dirac. Waiting for reply (22/6)

Let me know if I missed any tickets.

Monday 22nd June 2015, 14.00 BST

35 Open UK Tickets this week (!!!)

GridPP Pilot Role
A dozen of them are from Daniela (who painstakingly submitted them all) concerning getting the gridpp (and other) pilot role enabled on the site in question's CEs.

An example of one of these tickets is:
114440 (Lancaster's smug solved ticket).

Ticketed sites are: Durham, Bristol, Cambridge, Glasgow, ECDF (who are also having general gridpp VO support problems), EFDA-JET (looking solved), Oxford, Liverpool, Sheffield, Brunel, RALPP and RHUL. Most tickets are being worked on fine, but the Bristol and Liverpool ones were still just in an "assigned" state at time of writing.
Update - good progress on this, just one ticket left "assigned". Cambridge are done, as are JET (ticket needs to be closed). Oxford and Manchester are ready to have their new setups tried out, with Oxford kindly road-testing glexec for the pilot roles. Good stuff.

Core Count Publishing
114233 (10/6)
Of the sites mentioned in this ticket (Durham[1], IC, Liverpool, Glasgow, Oxford) who *hasn't* had a go at changing their core count publishing? I know Oxford have. Daniela had a pertinent question about publishing for VMs, which John answered. In progress (17/6)

[1] Durham have another ticket on this which may explain their lack of core count publishing: 114381 (16/6)

114379 (16/6)
Sam S formed this ticket over having trouble accessing the majority of SEs over Dirac, after some discussion around this last week. Sam acknowledges that this could be a site problem, not a DIRAC problem, but you gotta start somewhere (he worded that point more eloquently). Daniela has posted her latest and greatest DIRAC setup gubbins for Sam to try out. Another, unrelated, point is the names missing from Sam's list - for example I'm pretty sure Lancaster should support gridpp VO storage but I've forgotten to roll it out! Waiting for reply (22/6)

114248 (10/6)
Final ticket today, and another one discussed last week in the storage meeting. Steve's explanation of why (and how) Sno+ would need to start using space tokens was fantastically well worded in a way to not spook easily startled users. David is digesting the information, but it will likely need to wait for Matt M's return before we'll see progress. In progress (16/6)

I told a pork pie when I said that was the last ticket - this one caught my eye. A ticket from lhcb over files not having their checksums stored on Manchester's DPM. A link was given to another ticket at CBPF for a similar issue which got the DPM devs involved (111403) - although Andrew McNab was already subscribed to the ticket. In progress (19/6)

Monday 15th June 2015, 14.15 BST

Other VO nagios: and seem to be having a spot of bother for multiple VOs, and seems to be starting to have trouble too.

19 Open UK Tickets this week.

114233 (10/6)
John Gordon ticketed the NGI (as well as others) about some sites in the UK not publishing core counts with their APEL numbers (or more precisely have submission hosts at that site not publishing). Following the link it looks like Imperial, Liverpool, Durham, Glasgow and Oxford are on this list. I've listed the submission hosts reporting "0" core jobs below to help people clear up their rogues! If you've only fixed things in the last fortnight you'd still show up on this list. In progress (15/6)

"0" core job submission hosts:

Tier 1
113914 (26/5)
Sno+ file copying ticket. Matt M is away on his hols, but Dave Auty has taken over his duties and reports that this problem seems to have gone away - it can probably be closed. In progress (9/6)

114248 (10/6)
Another Sno+ ticket here from David concerning job failures. Nothing wrong with the ticket handling, but I thought that David's errors in submitting test jobs are worth documenting, as they were very understandable. David has since asked if the Sno+ job failures are linked to Sno+ nagios test failures at Liverpool. In progress (12/6)

114157 (8/6)
After Simon and Daniela have cleared up the atlas dark data and expanded their Space Tokens using the space freed up there still seems to be some confusion in Rucio, disagreeing with the SRM numbers. In progress (10/6)

114208 (9/6)
Oxford being ticketed for UKI-SOUTHGRID-OX-HEP_IPV6TEST failing connection tests, tests for which Oxford should not be getting ticketed afaics. I remember this being mentioned in the Thursday cloud meeting, but I'm ashamed to say I wasn't paying attention. Were any conclusions drawn/decisions made? In progress (10/6)

114153 (7/6)
Atlas transfer failures from Manchester. Errors are still occurring as of yesterday, any news? In progress (14/6)

Monday 8th June 2015, 15.00 BST
21 Open Tickets this week.

113914 (26/5)
Sno+ had problems at the Tier 1 where jobs failed whilst uploading data, believed to be due to an incorrect VOInfoPath. There's been a failure at replicating the issue, and the VOInfoPath advertised is correct. Very confusing, as I assume it all worked at some point before! In progress (2/6)

113910 (26/5)
Another Sno+ ticket, concerning lcg-cp timeouts whilst data-staging from tape. Matt M has asked for advice on the best practice for doing this, or if Sno+ would be better off just upping their timeouts. Brian has given some advice on using the "bringonline" command, but is himself unsure of the best way of seeing what files are currently in a VO's cache. Not much news since. In progress (28/5)

114004 (31/5)
Atlas transfers fail due to the "bring-online" timeout being exceeded. Brian spotted a problem with file timestamps mismatching, but no news on this ticket since. In progress (1/6)

114006 (31/5)
APEL accounting oddness at Brunel, noticed by the APEL team. After much to-and-fro-ing John noticed that multiple CEs were using the same SubmitHost, and thus overwriting each other's sync records. Something to watch out for. In progress (7/6)
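
Why a shared SubmitHost loses jobs can be shown with a minimal sketch. It assumes, purely for illustration, that sync records are stored against a key of site, submit host, year and month; the field names, host names and job counts below are made up and are not taken from the ticket or from APEL's actual schema.

```python
# Hypothetical illustration only: records keyed on (site, submit_host, year, month).
def store_sync_records(records):
    """Store job counts by key; a later record with the same key replaces the earlier one."""
    store = {}
    for rec in records:
        key = (rec["site"], rec["submit_host"], rec["year"], rec["month"])
        store[key] = rec["jobs"]
    return store


def total_jobs(store):
    """Sum the job counts that survived in the store."""
    return sum(store.values())


# Two CEs wrongly publishing under one SubmitHost value (made-up names and numbers).
shared = [
    {"site": "EXAMPLE-SITE", "submit_host": "ce.example.org", "year": 2015, "month": 5, "jobs": 1000},
    {"site": "EXAMPLE-SITE", "submit_host": "ce.example.org", "year": 2015, "month": 5, "jobs": 400},
]

# The same two CEs, each publishing under its own SubmitHost value.
distinct = [
    {"site": "EXAMPLE-SITE", "submit_host": "ce01.example.org", "year": 2015, "month": 5, "jobs": 1000},
    {"site": "EXAMPLE-SITE", "submit_host": "ce02.example.org", "year": 2015, "month": 5, "jobs": 400},
]

shared_total = total_jobs(store_sync_records(shared))
distinct_total = total_jobs(store_sync_records(distinct))
print(shared_total, distinct_total)
```

With the shared key only the last CE's count survives, so the site appears to have done fewer jobs than it really did.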

114157 (8/6)
There's been some debate on the atlas lists about this ticket, a classic "not enough space at the site" ticket. Rising above the indignation over being ticketed for this, Daniela has offered a couple of TB to give some space, and pointed out that IC have some atlas data outside space tokens, and that this could be used to expand the tokens if cleaned up. Waiting for reply (8/6)

Tuesday 26th May 2015
Matt's on leave until the 8th of June. But he's replaceable with handy links... 23 tickets today:

Other VO Nagios

UK NGI GGUS tickets

Monday 18th May 2015, 14.30 BST
Full review this week.

Other VO Nagios
At time of writing I see problems with test jobs at Brunel for pheno and Liverpool for a number of VOs (see Sno+ ticket for probable cause and fix at Liverpool).

22 Open UK Tickets this week. Going site-by-site:

113473 (4/5)
Missing accounting data for April for some sites. Raul is discussing things for Brunel in the ticket, although they have republished. I think it's only ECDF left to republish their April data. In progress (16/5)

113482 (26/4)
Loss of accounting data for Oxford needing an APEL republish. The Oxford guys republished, but there is some confusion with the resulting numbers. Discussion is ongoing, John G is currently looking at the records. In progress (14/5)

113650 (11/5)
CMS glideins failing at Oxford. The original problem was with a config tweak being left out of the cvmfs setup, but the ticket has been reopened citing problems persisting on the ARC CE (the CREAM appears to be fixed). Reopened (16/5)

GLASGOW 113095 (17/4)
ROD ticket about batch system BDII failures, left open to avoid unnecessary ticket filing. Gareth noted that the full migration to ARC and HTCondor, which should see the end of these issues, will hopefully be completed by the end of June. On Hold (12/5)

ECDF 95303 (31/7/13)
Somehow left this one out of the e-mail update. Edinburgh's glexec ticket, dependent on the tarball. I put in my tuppence worth today with my tarball hat on. On hold (18/5)

113769 (18/5)
LHCB see a cvmfs problem at Sheffield. Elena has probably fixed the problem (restarted the sssd), just waiting to see if it all pans out. In progress (18/5)

113744 (15/5)
For the VOMS rather than the site, Jens' request for the creation of the dIrac VO. In progress (18/5)

113692 (13/5)
A request from pheno to add support for their new cvmfs area at Manchester, and as I understand it, to support them in a new "form" ( In progress (13/5)

113742 (15/5)
Sno+ noticed their nagios failures at Liverpool. Rob reckons this was a problem with the DPM BDII service certificate not being updated (that's bitten me too), and fixed things this morning. Let's see how that goes. In progress (18/5)

95299 (1/7/13!)
Lancaster's vintage glexec ticket. An update on this - after having a roundtuit session last week I was building glexec for different paths. It still needs some testing to make sure it works properly. However, there definitely won't be a one-size-fits-all tarball solution. On hold (15/5)

100566 (27/1/14)
Only the crustiest old tickets for us at Lancaster! Poor perfsonar performance. Sadly didn't get roundtuit on this one - we're pushing getting these nodes dual stacked as Ewan had pointed out that it would be interesting to see if IPv6 tests also saw this issue. On hold (18/5)

113721 (14/5)
The only UCL ticket, this is an egi "low availability" ticket. However Daniela notes that the plots are on the rise, so things are looking alright. Probably want to "On Hold" it but otherwise not much to be done. In progress (14/5)

113743 (15/5)
A ticket from Durham concerning the Dirac instance at Imperial's settings for their site. Daniela hopes to get it fixed soon. In progress (15/5)

112948 (10/4)
CA certificate update at 100IT leading to a discussion of other authentication based failures. David has asked for voms information after posting his configs. In progress (13/5)

113035 (14/4)
Ticket tracking the decommissioning of the Tier 1 CREAM CEs. I think things are just about done now, this ticket can soon be closed. In progress (11/5)

109694 (28/10/14)
Sno+ gfal-copy ticket. Brian reports that the Tier 1 is upgrading gfal2 on their WNs, and notes that there's a lot of active debugging work going on in the area. As he eloquently puts it "situation is quite fluid". In progress (13/5)

108944 (1/10/14)
CMS AAA tests failing at the Tier 1. There's been a lot of work on this, deploying then trying to get the new xrootd director configured. New problems have cropped up, and are under investigation. In progress (11/5)

112721 (28/3)
Atlas transfer failures ("failed to get source file size"). Tracked to an odd double transfer error, possibly introduced in one of the recent "upgrades". Brian has been declaring these files as bad, and a workaround or solution is being thought about. In progress (14/5)

113705 (13/5)
Atlas transfer failures from RAL tape. Checksum failures, which Brian tracked to the checksums not being of a type Castor supports. Brian has asked if this can be changed at the CERN FTS or in rucio. Waiting for reply (14/5)

113748 (16/5)
Another atlas transfer ticket, but as the error indicates no space left at the Brunel space token being transferred to, Elena has noted that this isn't a site problem, telling the submitter to put in a JIRA ticket instead. Waiting for reply, but probably can be just closed (16/5)

112866 (7/4)
Lots of cms job failures at RAL. This has been traced to some super-hot files, mitigation is being looked into. A candidate for perhaps On Holding, depends on the time frame of a work around. In progress (13/5)

113320 (27/4)
CMS data transfer issues. I'm not actually too sure what's going on. There are files that need invalidating, which seems to be the root of the evil befalling transfers. The issue is being actively worked on though. In progress (18/5)

Monday 11th May 2015, 14.10 BST
22 Open UK Tickets this week.

There are a few tickets at the Tier 1 that are set "In Progress" but haven't received an update yet this month:
108944 (CMS AAA Tests, 30/4)
112721 (Atlas Transfer problems, 16/4)
109694 (SNO+ gfal copy trouble, 15/4)
112866 (CMS job failures, 7/4)
112819 (SNO+ arcsync troubles, 20/4)

Other Tier 1 Tickets (sorry to be picking on you guys!)
111699 (10/2)
Atlas glexec hammercloud test jobs at the Tier 1. It appears to be working, but a batch of test jobs failed because they couldn't find the "mkgltempdir" utility on some nodes ("" and ""). In progress (4/5)

113320 (27/4)
Maybe repeating what Daniela is going to say in the CMS update - trouble with CMS data transfers within RAL. It's under investigation, but it looks like the files in question will need to be invalidated - even if it's just to paint a clearer picture. In progress (10/5)

At last update Brunel, Liverpool, Edinburgh, Birmingham and Oxford need to republish still. Oxford have their own ticket about it due to complications (113482).

UCL Tickets - Ben is starting to move to close these, some are going to be "unsolved".

113095 (17/4)
Andrew asks if the timeframe for the move to Condor could be added to this ticket, for the ROD team's information. On Hold (7/4)

112948 (10/4)
No news on this 100IT ticket for a while. In progress (27/4)

Friday 1st May
The Bank Holiday weekend might muck up plans for a Ticket review this week. Just in case, some links!

Other VO Nagios page.

UK GGUS Tickets

Hope you all have a nice weekend!

A quick check of the Other VO Nagios page.

26 Open UK tickets this week.

ITWO Decommissioning
Three of the tickets are to the VOMS sites (Manchester, Oxford, IC), concerning the decommissioning of the ITWO VO. Just an FYI to y'all.

113293 (26/4)
There was an APEL problem last month where a lot of sites needed to republish their data for the month. I think Edinburgh are the only UK site that suffered this problem, but another FYI ticket. Assigned (26/4) And solved

Atlas production jobs not running at ECDF. Andy noticed that analysis jobs were running fine, and believes that this might be a problem scheduling pilots in time. Perhaps a multicore issue if this is only affecting production jobs. In progress (22/4) Update - solved

111699 (TIER 1)
111703 (RALPP)
It was discovered that there was a problem in the test code, so the ball is very much in atlas' court for this one. The problem has been fixed and the tests are being rebuilt and resubmitted.

112721 (28/3)
ATLAS FTS failures to RAL. A rucio issue causing double-transfers has been discovered, which would explain the behaviour seen. No news since this revelation. In progress (16/4)

There are a number of other Tier 1 tickets that could do with either an update or On Holding

Monday 20th April 2015, 14.30 BST
24 Open UK tickets this week, only a light review. Update - down to 20 open tickets as of this morning

113150 (20/4)
Fresh in - the NGI has been ticketed to change the regional VO from emi.argus to ngi.argus in the gocdb. Seems a bit pedantic, but hey! I assigned it to the NGI ops, and notified RAL as keepers of the regional argus. Assigned (20/4) Update - solved

113035 (14/4)
Just for people's interest, the ticket tracking the decommissioning of the last of the RAL CREAM CEs. In progress (14/4)

112819 (2/4)
A SNO+ ticket I must have somehow missed last week, concerning SNO+'s manual renewing of proxies on ARC machines. Matt M has noticed that ArcSync occasionally hangs rather than timing out smoothly (although he later notes that he doesn't see the initial problems working from a different network). I'm thinking that this should be redirected at the arc devs, but I don't think they have a GGUS support group (I could be wrong, I'm well behind on the ARC curve). In progress (7/4)

113110 (17/4)
Looks like this atlas low transfer efficiency ticket can be closed. Waiting for reply (20/4) Update - solved

113095 (17/4)
ROD ticket for some BDII misreporting at Glasgow. The botheration seems to be ephemeral in nature, the blunders passing with the abating of their batch system's burden. This ticket can probably be solved. In progress (17/4)

Monday 13th April 2015, 14.00 BST
24 Open tickets this week - going over all of them this week, site by site.

Fresh in this morning - 113010 and 113011 - Sno+ tickets concerning the RAL and Glasgow WMSes not updating job statuses.

Atlas glexec hammercloud tests failing. There's been a lot of waiting on atlas to build new HC jobs. The most recent exchange (delayed due to Easter), was asking about SELinux - but no news since the first. In progress (1/4)

Low availability ROD ticket. Availability is crawling back up, just need it to go green. On hold (13/4)

Another ROD ticket for bdii errors at Glasgow. Gareth has been doing everything right investigating this. Kashif recommended ticketing the midmon unit, but Gareth has spotted that the errors correspond to high load on their ARC CE - so it might be a site problem after all - Gareth asks for clarification. Waiting for reply (13/4)

95303 (1/7/13)
Tarball glexec ticket. No news (sorry). End of April I believe was the "deadline" I set for having this made. On Hold (9/3)

100566 (27/1/14)
Lancaster's poor perfsonar performance. I'm not believing quite what I was seeing with the tests I performed so I'm aiming to rerun them. On hold (13/4)

95299 (1/7/13)
Lancaster's tarball glexec ticket. Same as ECDF. On hold (9/3)

112966 (13/3)
A ROD cream job submit ticket, freshly assigned this afternoon. It's a bit mean of me to bring notice to it. Assigned (13/4) And POW, Raul closed this after kicking torque into shape - solved

112948 (10/4)
100IT needed to upgrade to the latest CA release. They've done this, but there are still authentication problems. In progress (13/4)

108356 (10/9/14)
Deploying vmcatcher at 100IT. After David's questions falling on deaf ears for a while it has been advised that the ticket be closed as this issue will be dealt with elsewhere. Whether or not it is to be "solved" or "unsolved" is open to debate! In progress (can possibly be closed) (13/4)

108944 (1/10/14)
CMS AAA tests failing at RAL. After a lot of work and new xrootd redirectors, problems persist. It's looking to be a problem that needs the CASTOR and/or xrootd devs to look at. In progress (30/3)

112713 (27/3)
CMS asking to clean up the "unmerged area". Andrew conjured up a list of files and asked if they could be deleted - CMS responded with a "yes please then close the ticket". Has the deed been done? In progress (31/3)

109694 (28/10/14)
The Sno+ gfal copy ticket. Matt M still sees gfal-copy hang for files at RAL when he uses the GUID (SURL works). A Castor oddity perhaps? Matt asks a question about what problems like this (coupled with the move away from lcg tools) will mean for VOs that rely on the LFC. In progress (31/3)

112977 (10/3)
CMS high job failure rate at RAL. Related to 112896 (below) - the jobs all want that file! In progress (13/3)

112896 (9/4)
CMS Dataset access problems - caused by over a million access attempts on a single file over an 18 hour period. Andrew L comments that CMS needs to have a think about how they access pileup datasets. In progress (9/4)
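
For a feel of the scale, here is a back-of-the-envelope calculation. It takes "over a million" as exactly one million, so the result is only a lower bound on the average rate, and it says nothing about how bursty the accesses actually were.

```python
# Lower bound on the average access rate for the hot file.
attempts = 1_000_000            # "over a million", so treat this as a minimum
window_seconds = 18 * 60 * 60   # the 18 hour period quoted in the ticket summary

rate_per_second = attempts / window_seconds
print(f"at least {rate_per_second:.1f} access attempts per second on average")
```

That works out at roughly fifteen attempts every second, sustained for the whole period, all aimed at one file.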

111699 (10/2)
Tier 1 counterpart to 111703. A new HC stress test was submitted near the end of March, but no news on how it did. In progress (23/3)

112866 (7/4)
A different "lots of CMS job failures" ticket. Again a "hot file" seems to be the root cause. In progress (7/4)

112721 (28/3)
An atlas file access ticket, seemingly caused by some odd FTS behaviour. No answers to Shaun's question about this odd occurrence or much noise at all till today. Waiting for reply (13/4)

UCL has 6 tickets - 4 just "assigned". I'll just list them in the interests of brevity.
112371 (ROD low availability, On Hold)
112841 (atlas 0% transfer efficiency, assigned)
112873 (ROD srm put failures, assigned)
95298 (glexec ticket, on hold)
112722 (atlas checksum timeouts, in progress)
112966 (ROD job submit failures, assigned)

Tuesday 7th April

Monday 23rd March 2015, 15.30 GMT
19 Open tickets this week.

A ticket fresh off the ROD dashboard - the Birmingham CREAMs aren't being matched ("BrokerHelper: no compatible resources"). Matt W has double checked their setup and can't spot anything wrong - they've been running "normal" atlas/lhcb etc jobs fine over the last few weeks. Any advice appreciated. In progress (23/3)

TIER 1 (21/3)
MICE problems running jobs at RAL, which Andrew L discovered coincided with WMS problems that he fixed. Probably should be in "Waiting for Reply/Seeing if the problem's evaporated". In progress (23/3)

(14/3)
The cause of this Sno+ ticket, about a recent user not being able to access files due to not being in the gridmap, has been discovered. As Robert F sagely pointed out the latest version of the mkgridmap rpm is required to talk to the voms server. Just waiting on the time for it to be updated now. In progress (17/3)

100IT (10/9/14)
This 100IT ticket is still waiting for a reply (since mid-January). The question needs to be answered by someone familiar with the technologies and terminologies. Is anyone up on vmcatcher? Anyone know what other channel to pass David's query onto? Waiting for reply (15/1)

Monday 16th March 2015, 15.30 GMT

16 Open UK tickets this week. Half Red, Half Green.

RALPP and TIER 1 glexec HC tickets
111703 (RALPP)
111699 (TIER 1)
No news on these tickets since it was expected that a new stable HC job would be released last Tuesday. All very quiet.

Similar for this CMS glexec ticket - no news after Kashif asked for some more information way back. Waiting for reply (27/2)

The Sno+ gfal copying ticket. A lot of people are working on this, and attempts to recreate the problems seem to occasionally be devolving into "did we get this complicated command right?". At some point it might be necessary to get the gfal devs involved (are there gfal devs?). Waiting for reply (11/3)

Also there's the poor 100IT ticket, still waiting for a reply. The JET ticket also needs wrapping up, I put a reminder in to the end of it.

Monday 9th March 2015, 15.00 GMT
From last week's crusty ticket round up:

Tier 1
109694- 28/10/14
Matt M has managed to reclaim his tickets after a certificate change orphaned his old ones. Progress has resumed. Duncan has asked Matt to retry his failing tests with a simple copy to local disk example.

108944- 1/10/14
CMS AAA access at RAL. Andrew posed a question to the CMS xroot experts last week - if you have their details it might be a good idea to involve them in the ticket.

108356- 10/9/14
This VMCatcher ticket is still stuck waiting for a reply. Deafening silence for our 100IT colleagues.

97485 - 21/9/13
The Jet LHCB ticket is in the state of being wrapped up. LHCB have been removed from the local configs, and the site has been removed from LHCB's. I believe that this ticket can be terminated.

110353 - 25/11/14
Dan's managed to get webdav working on one SE, but not t'other. Very strange, but Dan is investigating (see also 111942).

No movement on the three glexec tickets (none expected on the two tarball ones in the last week though), the Lancaster perfsonar ticket is still waiting on another batch of local tweaks (and I still need to make sense of what I'm seeing). Matt RB closed Sussex's perfsonar ticket though - nice one.

The "Normal" tickets:

Atlas gLexec Hammercloud failures (RALPP and Tier 1)
111699 (Tier 1)
111703 (RALPP)
These tests were waiting on a new stable job release being made - this has been delayed (hopefully out tomorrow).

This LHCB ticket about stalled jobs looks like it can be closed (LHCB no longer see a problem). Update - set to solved, the jobs were being killed for using too much memory.

A CMS user saw glexec failures on some nodes - Kashif asked for some more information but there has been no reply. I'd consider giving the user till the end of the week then closing the ticket if there's still no word. Waiting for reply (27/2)

Tuesday 3rd March

Concentrating on pre-2015 tickets this week in an attempt to Spring Clean the UK ggus presence. I will review these again next week - can people please take a look at these tickets if they're owned by them (or if they think they can help!).

SUSSEX - 26/11/14
This is a perfsonar ticket - the initial request (reinstalling the perfsonar node) has been done a while ago but things weren't quite right. Matt RB did some soothing to this last week and asks if he's missed anything - I put it to waiting for reply this morning. Waiting for reply (24/2)

QMUL - 25/11/14
Atlas wanting https access on QM's SE. Dan's been working on this nicely, carefully testing each stage of his rollout. The end is in sight here. In progress (17/2)

TIER 1 - 28/10/14
This is a SNO+ ticket about getting gfal tools working for the Tier 1 - with the new version out Brian has tested it correctly (and I saw a related thread on lcg-rollout) - but no word from Sno+. Who is wrangling the other VOs in these post-Walker times? Waiting for reply (24/2)

TIER 1 - 1/10/14
A CMS ticket about AAA tests at the Tier 1. This ticket is being actively worked on, with a new xrootd redirector at RAL and problems with the EU redirectors mucking things up. No problems that I can see. Waiting for reply (2/3)

100IT - 10/9/14
Getting VMcatcher stuff to work at 100IT. This ticket seems to keep stalling due to lack of documentation or replies from the submitters. Waiting for reply (19/1)

LANCASTER - 27/1/14
Lancaster's poor perfsonar performance. Being poked and prodded on and off over the last year, but the problems remain a mystery - Ewan's lending of an iperf endpoint has helped out greatly though, waiting on yet another network tweak. On Hold (23/2)

EFDA-JET - 21/9/13
LHCB job failures at EFDA-JET. The causes of this remain a mystery, and this is the first ticket on my "to be set to unsolved" list.

UCL - 1/7/13
UCL's glexec ticket. Ben's been working hard at this recently, but keeps hitting show stoppers - the latest being a performance problem possibly due to the VM he's running argus on. On hold (19/2)

ECDF and LANCASTER - 1/7/13
95303 (ECDF)
95299 (Lancaster)
glexec for the tarball. This is *still* waiting on the tarball glexec, which is again waiting on me, which is waiting on me magicking some extra tarball development time. Will be reviewed by the end of March.

Monday 23rd February 2015, 15.00 GMT
Only 15 Open UK tickets this week. Feel free to bring up any ticket-based issues of your own on this quiet week.

This CMS glexec hammercloud ticket is looking a little quiet - no update for a while. If it's continuing offline or waiting on input could it at least be put On Hold? In progress (11/2) (The Tier-1 version of this ticket, 111699, seems to be chugging along fine - there might be useful snippets in there).

The Cloud accounting probe ticket was reopened, asking if 100IT ticketed apel support (I assume contacting them via other means would work too) otherwise the new cloud accounting won't be properly republished. Reopened (20/2)

IMPERIAL (but not really their issue)
Tom opened a ticket after another cern@school user had troubles using the IC SE - there have been some problems with the newer versions of the dirac UI. Sometimes it's better to go Vintage! Although after trying Simon and Daniela haven't been able to reproduce the failure - perhaps something's up with the user's UI? Waiting for reply (23/2)

Sussex's Perfsonar ticket. I know Matt RB has put the ticket On Hold and is very busy - is there any news/anything we can do to help? On Hold (21/1)

Sno+ gfal copy problems at RAL. Brian informs us that the latest version of the gfal tools works for him and has asked if they work for Matt M. and Co. Did you get these packages out of epel/epel-testing or somewhere else Brian? Waiting for reply (18/2)

Monday 16th February 2015, 14.30 GMT
Only 19 open UK tickets today.

An atlas ticket concerning transfer failures between RAL and BNL. Brian mentioned last week that the lack of recent failures is due to atlas not attempting to transfer any older data recently. Perhaps this could do with being put into the ticket (and the ticket being put On Hold, or prodded some more)? Waiting for reply (29/1)

Also 111800 (17/2)
ARC CE issues at RAL detected.

Atlas running glexec hammerclouds - having trouble at RALPP (and RAL, see 111699). The glexec experts have gotten involved on this one, and asked to take a peek at a proxy - not sure about anyone else, but I'd feel a tad uncomfortable sharing proxies, even with known and trusted experts as in this case. Am I being overly paranoid? Either way the ticket has gone a bit quiet. In progress (11/2)

108356 & 111333
Both the 100IT tickets are in Waiting for reply - the oldest one for quite a while - David asked a question a while back and no answer. The newest one asks if the 100IT logs made it to apel safely - I think what David has to do is submit a ticket with this question to the apel support team - have I got the right end of the stick?

Biomed Tickets at Manchester and Imperial
111356 & 111357
FYI There's a note at the bottom of both of these tickets that the version of CREAM that should fix this has been delayed until the end of February(ish).
Update - I read those updates wrong - the cream update has been released and these tickets have been (perhaps erroneously) closed.

Monday 9th February 2015, 15.00 GMT

Other VO Nagios Results
At the time of writing the only site showing red that isn't suffering from an understood problem is RALPP, with org.nordugrid.ARC-CE-submit and SRM-submit test failures for gridpp, pheno, t2k and southgrid for both its CEs and its SE. The failures are between 1 and 12 hours old, so it doesn't seem to be a persistent failure, but it seems to be quite consistent. They all seem to be failing with "Job submission failed... arcsub exited with code 256: ...ERROR: Failed to connect to XXXX(IPv4):443 .... Job submission failed, no more possible targets". Anyone seen something like this before?

Only 20 Open UK tickets this week.

Biomed tickets:
111356 (Manchester)
111357 (Imperial)
Biomed have linked both these tickets as children of 110636, being worked on by the cream blah team. AFAICS no sign of Cream 1.16.5 just yet.

111347 (22/1)
CMS consistency checks for January 2015. It looks like everything that was asked of RAL has been done by RAL, so hopefully this can be successfully closed. In progress (3/2)

111120 (12/1)
Another ticket, this time concerning a period of Atlas transfer failures between RAL and BNL, that looks like it can be closed as the failures seem to have stopped (and might well have been at the BNL end). Waiting for reply (22/1)

108944 (1/10/14)
CMS AAA test failures at RAL. Federica can't connect to the new xrootd service according to the error messages. No news for a while. In progress (29/1)

100IT 108356
Both of these 100IT tickets are looking a bit crusty - the first is waiting for advice, the second was just put "In progress".

110353 (25/11/14) Dan has set up a node to test out the latest https-accessible version of storm for dteam and atlas. As a cherry on top this node is also IPv6 enabled. I'm not sure if Dan wants others in the UK to "give it a go"? In progress (6/2)

100566 (27/1/14)
(Blatantly scrounging for advice) Trying to figure out why Lancaster's perfsonar is under-performing. Ewan kindly gave us access to an iperf endpoint and it's been very useful in characterising some of the weirdness - although I'm still confused. Ewan also gave us a bunch of suggestions for testing that have been useful - next stop, window sizes. If anyone else wants to throw advice at me, all wisdom donations are thankfully accepted. My advice for others is to be careful trying to connect to the default iperf port on a working DPM pool node.... In Progress (9/2)

Monday 2nd February 2015, 14.00 GMT
22 Open UK tickets this month.

110389 (26/11/14)
A perfsonar ticket for Sussex. Their perfsonar has been reinstalled, but needs soothing. Matt has informed us that this might have to wait a few weeks due to other issues. On Hold (21/1)

110536 (2/12/14)
MICE job failures at RALPP - it looked like they were dying due to running out of memory. The queues have been tweaked to give MICE more, but no word from MICE on whether this has solved the problem. Waiting for reply (12/1)

110365 (25/11/14)
Another perfsonar ticket. Again the node is reinstalled, just not quite working right. Winnie is waiting for news from the other sites in a similar boat. In progress (maybe On Hold it?) (20/1)

111118 (12/1)
ECDF "low availability" ticket - just waiting for the silly alarm to clear. Daniela submitted a ticket about this foolish alarm a while ago - 107689. On Hold (19/1)

95303 (1/7/13)
glexec tarball ticket. With my tarball hat on - still no positive news on this front - it's beginning to look like this can't be done but we're having one last go. Sorry! On Hold (19/12)

110225 (18/11/14)
Change of VO Manager for It looks like this ticket is being held up at the user end a lot. I'm not sure there's anything we can do as it involves outside CAs. On Hold (20/1)

111356 (23/1)
One of Manchester's CEs not working for biomed, due to problems with the new CREAM/old WMS communication. Alessandra gave biomed some sage advice, but I suspect this ticket will need to be prodded soon to get a response from biomed (who I agree should use a newer WMS and close it). On Hold (26/1)

111547 (2/2)
I'm reporting on a ticket that I submitted to myself today. I'm not sure what that says about the world. Anyway - a ticket to track the decommissioning of one of Lancaster's CEs, as we try to do it all proper like. On Hold (2/2)

100566 (21/1/14)
Lancaster's perfsonar ticket, which I sadly let reach its first birthday. I've been prodding this offline, does anyone have the address for a regular, open iperf endpoint I could borrow? On Hold (9/1)

95299 (1/7/13)
Lancaster's tarball glexec ticket, as with the ECDF one. On hold (26/1)

95299 (1/7/13)
UCL's glexec ticket. They've been having trouble getting it to behave, and at last check Ben was off ill - probably due to dealing with glexec :-) On Hold (20/1)

110353 (25/11/14)
Atlas asking for QM's storage to be made available via https. Waiting on a production ready STORM that can provide this - Dan is trying it out on his testbed, which still needs tweaking. In progress (28/1)

One of the IC CEs not working for biomed. Similar to the Manchester ticket, Daniela points to ticket 110635 and is waiting on an EMI release to fix it (due out imminently AIUI). On Hold (28/1)

97485 (21/9/13)
Jet's LHCb job failure ticket. I'm afraid I haven't been able to chase this up (partly due to only ever remembering on the first Monday of the month) - there's been no news for a while. On Hold (1/10/14)

111333 (22/1)
A ticket to 100IT and the NGI to get the cloud accounting probe upgraded. I notified 100IT, but forgot to reassign the ticket - thanks to Jeremy for doing it. Assigned (2/2)

Getting VMcatcher working at 100IT. David from 100IT has asked for some answers on which "glancepush" to use, but no reply for a while. Waiting for reply (19/1)

CMS would like to run some staging tests to warm up for Run2. The Tier 1 warned CMS of today's outage and they're happy to proceed tomorrow (the 3rd) - I think they'd like a response. In progress (30/1)

A ticket regarding inconsistent BDII and SRM storage numbers. Waiting on a fix from the developers regarding read-only disk accounting (I think), Brian is still on the case. Stephen B let us know that Maria the ticket submitter is on maternity leave, and asks in her stead if the numbers are expected to align now. On hold (28/1)

An atlas ticket about a large number of data transfer errors seen between RAL and BNL. Brian reckoned that this was due to shallow checksums on the old data being transferred, but had trouble looking at the BNL FTS. Regardless, the ADCoS shifter hadn't seen any errors for a week and suggests the ticket can be closed. Waiting for reply (29/1)

CMS AAA test problems at RAL. After setting up a new xrootd box the test failures have changed in nature, but sadly they're still failures. In progress (29/1)

CMS Consistency Check for RAL, January 2015 edition. Filelists were generated, orphan files were identified, then purged. Just need to know what CMS want to do next. Waiting for reply (26/1)

Sno+ ticket concerning gfal tool problems, waiting on the new release to come out (middle of this month I believe). If you don't want to wait that long then I believe the 2.8 gfal2 tools can be found in the fts3 repo at last check. On hold (20/1)

Monday 26th January 2015, 14.15 GMT
Back after being forgotten about by me:
Other VO Nagios Status:

At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old. An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old.

And of course the srm-$VONAME failures at the Tier 1, which are caused by incompatibility between the tests and Castor AIUI. Things are generally looking good.

22 Open UK Tickets this week.
The NGI has been asked to upgrade the cloud accounting probe, and then notify our (only at the moment) cloud site to republish their accounting. Not entirely sure what this entails or who this falls on, I assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the "notify site" list - odd). Assigned (22/1)

CMS AAA test failures. Andrew Lahiff reported last week that the Tier 1 is building a replacement xrootd box which is currently being prepared. If that will take a while can the ticket be put on hold? In progress (19/1)

An atlas ticket, asking for httpd access to storage at QMUL. The QM chaps were waiting on a production ready Storm that could handle this, and are preparing to test one out. This is another ticket that looks like it might need to be put On Hold (will leave that up to you chaps - there's a big difference between "slow and steady" progress and "no progress for a while"). In progress (21/1)

A dteam ticket - concerning http access to RHUL's SE. Although the initial observation about the SE certificate being expired was incorrect (the expiry date was reported as 5/1/15, which to be fair I would read as the 5th of January and not the 1st of May!) there still is some underlying problem here with intermittent test failures. Also this ticket raises the question of under what context are these tests being conducted? Anyone know, or shall we ask the submitter? In progress (26/1)
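
For anyone wanting to see the date mix-up spelled out, here is a minimal sketch of the two readings of "5/1/15". It is an illustration only: it assumes the expiry was quoted as a bare day/month string and uses the date of this bulletin as "today"; it is not how the tests themselves check certificates.

```python
from datetime import datetime

raw = "5/1/15"                 # expiry date as quoted in the ticket
today = datetime(2015, 1, 26)  # assumption: the date of this bulletin

# The same string parsed two ways
day_first = datetime.strptime(raw, "%d/%m/%y")    # UK reading: 5 January 2015
month_first = datetime.strptime(raw, "%m/%d/%y")  # US reading: 1 May 2015

for label, expiry in (("day first", day_first), ("month first", month_first)):
    status = "expired" if expiry < today else "still valid"
    print(f"{label}: {expiry:%Y-%m-%d} -> {status}")
```

Read day first, the certificate looks expired; read month first, it is still valid, which matches what the site found.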

Manchester: 111356(23/1)
Imperial: 111357(23/1)
Biomed are having job problems, which look to be caused by using crusty old WMSes to communicate with these sites' shiny up-to-date CEs. According to ticket 110635 a cream side fix should be out by the end of January (CREAM 1.16.5), although Alessandra suggests that Biomed should try to use newer, working WMSes - or Dirac instead!

Monday 19th January 2015, 14.30 GMT
23 Open UK Tickets this week.

CMS seeing AAA test failures at RAL. The tests have been restarted recently and now seem to be having some suspicious looking authentication failures. In Progress (13/1)

Atlas complaining about httpd doors not working on Sheffield's SE. After schooling the submitter in how to submit more useful information Elena is working on it. I bring this up as in the last few days I've had quite a few of my pool nodes have their httpd daemons crash on them (they're up to date, but still SL5), which may or may not be related. In Progress (19/1)

ECDF "low availability" ticket after a few days of argus trouble, which Wahid fixed. Now the ticket will languish for a few weeks as the alarm clears. Daniela has reminded us of her ticket against these fairly silly alarms: . In the meantime this ticket could do with being put On Hold whilst the alarm clears. In progress (19/1)

Setting up VMCatcher at 100IT. After some troubles things seem to be looking up, although the 100IT chaps still have some questions about the configurations and what they should be using that aren't getting answers. I set the ticket to "Waiting for Reply" hoping that this will help get the attention of those in the know. Waiting for reply (15/1)

Perfsonar Tickets
110382(TIER 1)
Everyone seems to have updated their perfsonar hosts, so we're all good on that front, but a number of sites are either having trouble with their reinstalled hosts, or are having problems that they had pre-reinstall still haunt them. I'm afraid I have no suggestions of what to do about the growing number of these tickets though!