Past Ticket Bulletins 2014


Monday 15th December 2014, 14.15 GMT

Last Christmas - you sent me a ticket,
And the very next day, you escalated it anyway.
This year, to save me from tears,
I'm going to put them On Hold (on hold).


It's the last ticket update from me for 2014, and as is my Christmas ticket tradition I won't go into too much detail as I suspect people will be winding down this week rather than rolling out changes. Still, it would be good if sites take the time to tidy up their tickets before we all go and enjoy our Winter festivities, and for any that will be left open could sites please make sure to update them and put them On Hold if they're not going to be looked at for a few weeks.

36 Open UK tickets this winter's day.

Obligatory link to all the UK tickets:
http://tinyurl.com/p37ey64

Here's a few that really, really could do with a Solstice Update or at least On Holding:
110570 (lhcb cvmfs problems at Durham - looks like it can be closed).

108944 (cms AAA tests at the TIER 1)
109712 (cms glexec errors at the TIER 1)

110389 (reinstalling perfsonar at Sussex)

108356 (fedcloud and vmcatcher rollout at 100IT)

110384 (perfsonar reinstall at UCL)

110608 (Sheffield low availability ticket due to accidental 1.8.9 upgrade - always worth On Holding here as they take so long to clear).

110606 (a similar story for UCL).

110482 (Lancaster still suffering in SAM tests after upgrading to 1.8.9 too soon).

A bit of Christmas cheer - the related CMS tickets at Bristol and the Tier 1 look to have a solution in sight, thanks to the Condor Masters:
106324
106325

Let me know if I've missed out any tickets you would like brought up.

I'll leave it there - it would be nice if everyone could have a look at all their tickets this week, just in case. But more importantly, everyone have a good Festive Season!

Merry Christmas, and a Happy New Year!

Monday 8th December 2014, 14.45 GMT

38 Open UK Tickets this week.
A few of the perfsonar upgrade tickets are still open (although most don't seem to be stalled per se, I see that a lot of them are of the form "we didn't quite get time to finish it this week"). We have a number of nagios tickets open - two belonging to Lancaster (our premature DPM upgrade is biting us) - Gareth opened a few of them this morning. I also see a couple of LHCB cvmfs tickets - it looks like the lhcb cvmfs areas might be "clogging up" at sites and are probably worth a preemptive health check.

TIER 1
109694(28/10)
Matt M's ticket to the Tier 1 concerning not being able to get the gfal commands to work accessing Castor. Duncan has posted to the ticket that things are working for him now, along with the details of his setup. On Hold (4/12)

110397(26/11)
Duncan ticketed the Tier 1 about not being able to access the LFC via webdav. Catalin fixed a few misconfigurations on the LFC, but notes the limitations concerning VOMS proxies and browsers (i.e. they don't work together), and proposes to close the ticket. Waiting for reply (4/12)

LIVERPOOL
110391(26/11)
This "http at Liverpool not quite working" ticket raised some valid points about what Atlas wants/expects from its http access. I think the original problem is "fixed", which leaves this ticket in danger of limbo-ing (like me after a few too many Pina Coladas), sadly there was no atlas cloud meeting last week to bring this up at. In Progress (2/12)

ECDF
110599(7/12)
ECDF got an atlas transfer ticket, but as Andy correctly pointed out the rest of the UK cloud isn't looking pretty at all on the DDM matrix. Why did poor old Edinburgh get singled out for a ticket? Waiting for reply (8/12)

That's about it for the tickets that really caught my eye. Feel free to bring up any tickets that you think I should have picked up on in the meeting.

Monday 1st December 2014, 14.30 GMT
34 Open UK Tickets this month. Quite a few of them are from Duncan, asking sites to please reinstall their perfsonar hosts.

THE CA
110484(1/12)
Simon F ticketed the CA concerning a possible problem with the ticket reminder system. JK has responded, and asked that similar tickets in the future use the helpdesk at support@grid-support.ac.uk rather than GGUS (and definitely don't use both!). He's looking into it at his end, and has asked Simon to check the spam filters. Assigned (should be In Progress?) (1/12) Update - in progress now, and Jens has been roped into the ticket as well - there was a problem after all (see JK's email).

SUSSEX
110389(26/11)
Duncan has reminded Matt RB to reinstall his Perfsonar with the latest release. Matt reckons he'll get to this in the first half of this week. Nothing more to say. In Progress (26/11)

BRISTOL
110365(25/11)
Another perfsonar ticket: Bristol's perfsonar seems ill, but Duncan gave the URL for the Sheffield perfsonar. Probably just a copy and paste error when he wrote the ticket though. In progress (26/11) Update - Winnie confirms that the perfsonar has been reinstalled, poked and prodded. Things are still off with the box, and the site firewall admins are being consulted - but if it isn't a firewall problem Winnie would appreciate assistance debugging the problem.

106325(18/6)
CMS pilots losing connection at Bristol. The Bristol admins are still looking at this, and the problems are still happening. They've asked some questions (which likely will need a ticket status switch), and have tried disabling IPv6 on their workers for the time being to cross another factor off the list. On Hold (27/11)

BIRMINGHAM
110388(26/11)
Duncan has also asked Birmingham to update their perfsonar boxen - no reply from Matt or Mark yet. Maybe they missed the ticket. Assigned (26/11)

GLASGOW
110387(26/11)
Another request to upgrade perfsonar boxes. Gareth has replied, hopefully it'll get done this week. In progress (26/11)

ECDF
110386(26/11)
The Edinburgh "please upgrade your Perfsonar" ticket. Wahid has replied with the ECDF stance on perfsonar, and put the ticket On Hold. On Hold (26/11)

95303(1/7/13)
ECDF's glexec tarball ticket. Same position as last month I'm afraid. On Hold (29/8)

DURHAM
108273(5/9)
Durham's perfsonar results going just plain weird. The Durham chaps have reinstalled their perfsonar, but as expected things are still odd. They hope to test a new routing arrangement later this week. Is that still on course? On hold (12/11)

MANCHESTER
110457(29/11)
Atlas have ticketed Manchester about the same issue again (see 110366), which boils down to lost files not being able to be declared lost due to the rucio migration. Not much that can be done Manchester side until the file deletion service is back in full swing - On Hold the ticket? In progress (1/12)

110225(18/11)
A ticket for the voms service host at Manchester, detailing the change in VO manager for vo.helios-vo.eu. There's a bit of confusion over the new VO manager's certificate to be used for this; this ticket might need some shepherding, perhaps even On Holding if it gets too close to Christmas. In Progress (21/11)

LIVERPOOL
110391(26/11)
Atlas have noticed that the Liverpool DPM has some kind of webdav access problem, browsing worked but downloads didn't. This was on purpose as a security measure, but John enabled http access offsite from the disk nodes. There was some discussion in the ticket about http/https access within DPM, but I suspect this ticket is done unless these points need to be thrashed out a bit. In progress (26/11)

LANCASTER
110482(1/12)
I upgraded my DPM to 1.8.9, and all I got was this ticket! Lancaster's failing the second half of the getTURL test due to what I believe is an incompatibility with the latest DPM version and the SAM tests (and I wasn't rolling back to pass nagios tests!). Waiting on a new set of tests to be rolled out. On Hold (1/12)

100566(27/1)
Lancaster's bad perfsonar performance ticket. No win after upgrading to the latest perfsonar; we hope to run some other tests in the pre-Christmas quiet period.

95299(1/7/2013)
Lancaster's glexec tarball ticket. No news - my hope is to work on this in the two week pre-Christmas quiet period, same as our perfsonar problem. On Hold (14/11)

UCL
110442(28/11)
Atlas have noticed transfer problems to UCL. Ben is trying to investigate, and Wahid is lending a hand. In Progress (28/11)

110384(26/11)
UCL's "please reinstall your perfsonar" ticket. In progress (26/11)

110358(25/11)
Nagios ticket for UCL, concerning glexec test failures. Ben has replied that he is trying to debug their glexec installation. In progress (28/11)

95298(1/7/13)
UCL's glexec ticket. Ben's working on it, but the site got hit by problems last week. In progress (24/11)

QMUL
110353(25/11)
Another atlas httpd access ticket, although this one is quite different from the Liverpool one as it appears they are trying from within a job. I don't think this has been noticed by the QM chaps yet. Assigned (25/11) Update - In Progress now, Dan's checking if https should be working. Elena has involved uk cloud support.

107880(26/8)
The not-really-a-QM problem snoplus/suse/srmcp ticket. We discussed how to handle this last week, but no news - it seems we're waiting for Matt M to re-engage? Waiting for reply (20/11)

BRUNEL
110383(26/11)
Brunel's "please reinstall your perfsonar" ticket. Raul is on it. In progress (26/11)

EFDA-JET
97485(21/9/13)
The Jet LHCB job failure ticket. If ever there was a candidate for setting a ticket to unsolved, this is it. On Hold (1/10)

100IT
108356(10/9)
Our commercial cloud site's vmcatcher ticket. After Owen's help it looks like things are on the up, but the images still aren't being published. An interesting link was posted with instructions on how to do that. In progress (28/11)

THE TIER 1
106324(18/6)
CMS Pilots losing connectivity at RAL, sister to the Bristol ticket. Not much news, but Andrew L has a plan to discuss the problem with the HTCondor devs at CERN when he's there. On Hold (27/11)

109694(28/10)
Sno+ not being able to copy files out of RAL with the gfal tools. It appears to be a non-snoplus specific gfal problem. Perhaps an install problem with wrong versions of gfal2-utils? Andrew L is going to contact the gfal2 devs for help. On hold (26/11)

107935(27/8)
Inconsistent published BDII/SRM storage numbers. This has been discussed recently in the Ops meeting; a conversation is ongoing with the Castor devs about this, but there wasn't much noise from them at last check. The ticket could do with a mini-update, even if it's "nothing to see here, move along". On Hold (3/11)

109276(11/10)
Some CMS users having trouble with the RAL FTS REST web interface. Everything seems to be fixed now, so it looks like this ticket can be closed. In progress (27/11)

110397(26/11)
Duncan has ticketed the Tier 1 regarding not being able to access the LFC via his browser. Catalin confirmed that the problem was occurring for him for his non-dteam identities. Things seem to be working for Chris though. How goes it? In progress (27/11)

109712(29/10)
CMS glexec errors at the Tier 1. Andrew is back on the case, but needs to test things out first before rolling them out. In progress (27/11)

108944(1/10)
Another CMS ticket, this time AAA tests failing at RAL. Andrew L asked for the testing scripts so that RAL can test themselves - Duncan provided a link that will help point the way. In progress (26/11)

110382(26/11)
And the last ticket, the Tier 1's "please upgrade your perfsonar" ticket. In progress (26/11)

Monday 24th November 2014, 15.00 GMT
22 Open UK tickets this week: 11 On Hold, 3 Waiting for Reply, 8 In Progress.

Ticket with No Home
107880(26/8)
It's that srmcp ticket that has been assigned to QMUL after being assigned to RAL. Chris has suggested that the ticket be assigned to the srmcp devs (if there are any left...). Not a bad suggestion (although I would suggest closing this ticket and opening a fresher one for clarity, as the initial problems are solved AIUI), let's make a decision on this one in the meeting. Waiting for reply (20/11)

100IT
108356(10/9)
Much like when I was learning to drive around my hilly home town, this vmcatcher ticket seems to keep stalling. Owen has updated with some good information. In progress (13/11) Update - David replied to Owen, with positive news.

BRUNEL
110059(11/11)
This ticket (Brunel's DPM being shut down by spider attacks!) was being kept open for fear of the issue showing up again (as this is the second incarnation of the issue) - however Henry has had a chance to reyaim his DPM this time and all seems alright, so maybe it can be closed? On Hold (17/11) Update - Henry closed this ticket.

TIER 1
109712(29/10)
CMS glexec error at the tier 1. Andrew L said he'd look into this again after he's back from a well-deserved break, but that was a while ago. Any news? On Hold (10/11)

107935(27/8)
BDII/SRM storage capacity mismatch. At last word Brian had submitted a request to Castor to find out how it reports read-only volumes. Any news? On Hold (3/11)

(I realise that both these tickets are On Hold and therefore no update should necessarily be expected, but they both seemed like they might not be held up for long).

MANCHESTER
109272(11/10)
Atlas having transfer problems, related to a filesystem loss at Manchester. The files are *still* going through recovery (http://bourricot.cern.ch/dq2/recovery/ - thanks Wahid, I had forgotten about this page). They're very nearly done though, I was going to suggest On Holding this ticket but I doubt it will be worth it now. In progress (18/11)

Monday 10th November 2014, 15.00 GMT

21 Open UK Tickets this week.

Tier 1
109712(29/10)
CMS seeing glexec errors at the Tier 1, likely due to a lack of "wildcard mapping" at RAL. Andrew L was investigating, but no news on the ticket since. In progress (29/10)

100IT
108356(10/9)
The setting up vmcatcher ticket at 100IT. It looks like this ticket is done, or at least getting there. I've prodded the ticket. In progress (29/10) Update - David has replied saying that all is not well with vmcatcher - it's not doing what it's supposed to and not giving back any error messages!

Sheffield
109906(5/11)
Some publishing problems have caused Sheffield to get a low availability ticket. Things are fixed, but as has been pointed out before this alarm requires time to soothe it. My advice is to put the ticket On Hold whilst waiting for the alarm to clear on its own. Waiting for reply (7/11) Also how goes the Sno+ ticket 109207?

Durham and Lancaster
108273
108715
Both these sites have perfsonar tickets on hold after the shellshock scare, so here's a gentle reminder that the new perfsonar and accompanying instructions are available. (I had hoped not to have to mention Lancaster by sneaking in a reinstall this morning, but I had trouble getting my iDRAC interfaces to work).

The Ticket with no home
107880(26/8)
Those funny SNO+ SUSE users and their problems with srmcp. The Tier 1 has cast this ticket out onto the streets, and assigned it to QMUL. Who don't really want it (or deserve it!). As mentioned in his last update, Chris has been having a chat with Matt M and it looks like srmcp is working...kinda (if you give it the correct port numbers, and somehow magically know these for each SE). Chris mentions that this could be viewed as a bug in srmcp, or solved with a wrapper script that he doesn't have time to write. My suggestion is to give the SUSE users the necessary ldap query to pull the information they need (a sketch of such a query is below) and let them sort out the rest! Assigned (7/11) Update - Henry has asked if anyone has had a chance to try out the VBrowser (SRM GUI thing)?
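
For anyone passing that suggestion on, here's a rough sketch of the sort of BDII query meant - the BDII hostname is a placeholder and the filter may need tweaking per site, so treat it as a starting point rather than a recipe:

  # Ask a top-level (or site) BDII for the SRM service endpoints, which carry
  # the host and port that srmcp needs to be pointed at.
  ldapsearch -x -LLL -H ldap://lcg-bdii.example.ac.uk:2170 -b "o=grid" \
      '(&(objectClass=GlueService)(GlueServiceType=SRM))' GlueServiceEndpoint
  # Each GlueServiceEndpoint returned looks something like
  # httpg://some-se.example.ac.uk:8443/srm/managerv2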

Monday 3rd of November 2014, 14.45 GMT
26 Open UK tickets this month.

Sussex
109539(22/10)
Sussex publishing "all the 4s" (bdii bingo!) for their waiting jobs. Matt RB has a ticket in with the developers over these problems (109263), although he has bravely said that he might try to tackle the problem himself...and it looks like lcg-infosites returns a sensible number now. On Hold (can be closed?) (23/10)
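
For anyone wanting to make the same check, a minimal sketch using lcg-infosites from a UI (the VO is just an example - pick one the site supports and look for the site's CEs in the output):

  # Published CPU/job numbers per CE, as seen by the information system.
  lcg-infosites --vo atlas ce
  # A waiting-job count of 444444 is the information system's "unknown"
  # placeholder - the "all the 4s" referred to above.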

108765(24/9)
Cross-referenced with the above ticket; looking at the last few updates it looks like Matt RB released a spooky Hallowe'en patch, and now they look to be green. Another ticket that can be closed? On hold (31/10)

Bristol
106325(18/6)
CMS pilots losing connection at Bristol. No news for a while, it looks to me like Bristol are still in downtime though? This has been a tough issue to debug. On hold (14/10)

Glasgow
109807(1/11)
Someone at atlas was trying to raise the dead at Glasgow over Hallowe'en, although rather than zombies it was long-lost files. It appears that despite these files being declared lost last summer the deletion/recovery ritual hadn't been completed. UK cloud support are on the case. In progress (3/11)

Edinburgh
95303(1/7/13)
Tarball glexec ticket. On Hold (29/8)

Durham
108273(5/9)
Durham's perfsonar results going "proper weird" suddenly. The local networking team were on the case, but the perfsonar got offlined from fear of shellshock and there has been no news since (is it alright to reinstall perfsonar yet?). On hold (6/10)

Sheffield
109207(8/10)
Sno+ asking for their VO_SW_DIR to point to cvmfs. Elena rolled this out, but sadly the ticket was reopened due to some job failures accessing cvmfs, and a few holdouts still with the wrong environment variable (Matt M threw in some CE errors he was seeing too, but he was very apologetic about it). Elena's investigating. In progress (30/10) Update - Catalin posted a reminder of the new cvmfs-keys release (1.5-1), and suggested moving snoplus' cvmfs area to the egi.eu domain - /cvmfs/snoplus.egi.eu
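
For sites making the same change, a minimal sketch of the YAIM variable involved (the paths are the ones quoted in these tickets; exactly where it lives - site-info.def or a vo.d/ file - varies by site):

  # In site-info.def (or vo.d/snoplus.snolab.ca):
  VO_SNOPLUS_SNOLAB_CA_SW_DIR=/cvmfs/snoplus.gridpp.ac.uk
  # (or /cvmfs/snoplus.egi.eu once the move Catalin suggests happens)
  # Then re-run YAIM on the affected node types so the WN grid environment
  # picks up the new value.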

Manchester
109272(11/10)
Atlas have been seeing transfer problems, although it looks like these failures have mutated since the ticket was opened (checksum errors to srm type errors by the looks of it). Alessandra is on the case. In progress (3/11)

Lancaster
108715(23/9)
Getting Sno+ jobs running at Lancaster. It looks like everything is in place, just waiting for Sno+ to confirm (or give us a list of errors!). Waiting for reply (30/10)

95299(1/7/13)
Tarball glexec ticket... no news other than my last attempt a few weeks ago failed (not as simple as I hoped). On hold (8/9)

100566(27/1)
Poor Perfsonar Performance. Has hit a bit of a roadblock with both perfsonar boxes being switched off for the last month... have I missed an announcement saying that the latest perfsonar release is ready? On hold (31/10)

UCL
95298(1/7/13)
UCL's glexec ticket. Ben hit a snag installing this mid-October, no news since then after some feedback from Maarten. In progress (14/10)

Imperial
109526(22/10)
LHCB having cvmfs trouble at IC, which was likely caused by a batch of naughty CMS jobs ruining it for everyone else. LHCB re-enabled IC to see if things were back on track, no news since. Waiting for reply (24/10)

EFDA-JET
109571(23/10)
Ops "availability" test failures at Jet. The cause of the alarms is known (Jet had a certificate problem on a few hosts). Just waiting for alarm to clear now. On Hold (28/10)

97485(21/9/13)
The case of the mysterious lhcb failures at Jet. No progress, none expected really though. On hold (1/10)

100IT
108356(10/9)
AFAICS this ticket now distills down to "Getting vmcatcher working at 100IT". Things seem to be progressing well, although the 100IT chaps aren't very good at setting their ticket statuses correctly! In progress (28/10)

109573(23/10)
Ticket listing the requirements for a cloud site. All three actions have been (or already were) completed, but there is a question over the state of the 100IT site BDII. In progress (30/10)

La Grada Uno
109712(29/10)
CMS are seeing glexec errors ("status 203") at the Tier 1. Looks to be caused by a lack of wildcard mapping, only just coming to light with the recent cms analysis jobs coming into the site. Andrew L is on it like a scotch bonnet. Or just on it. In progress (29/10)

109694(28/10)
Matt M from Sno+ has noticed gfal-copy errors when trying to access the Tier 1 using those tools. He's not sure if this is a problem with the Tier 1 or the tools themselves (or even his setup), Duncan is already helping him out. In progress (3/11)
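
As an aside, a hedged sketch of the sort of gfal2 smoke test being talked about - the SURL below is a placeholder, not a real RAL path:

  # Try a simple copy out of the SE with the gfal2 tools, verbosely.
  gfal-copy -v \
      "srm://some-se.example.ac.uk:8443/srm/managerv2?SFN=/dpm/example/home/snoplus/test.dat" \
      file:///tmp/test.dat
  # Also worth checking the installed client versions, given the suspicion
  # elsewhere of a gfal2-utils version mismatch:
  rpm -qa 'gfal2*'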

107880(26/8)
(possibly related) Sno+ "srmcp failures" for a bunch of SUSE users. Some great input on how to get the tools working from Duncan and Chris, but no word since. My suspicion is Matt is waiting to hear back from this user group. Maybe their mail clients don't work under SUSE either? In progress (21/10)

106324(18/6)
The Tier 1 version of the Bristol CMS pilots losing connection ticket. On hold after exhausting all ideas. On hold (13/10)

109276(11/10)
Submissions to the RAL FTS3 "REST" interface failing for some reason - AIUI thought to be a problem with the CRLs and apache. After some advice the system has been tweaked, and is in the waiting-to-see-if-that-fixed-it stage. On hold (3/11)

108944(1/10)
CMS AAA access tests failing at RAL. Reading down the ticket it looks to be a cms redirector problem at RAL... or something... Andrew has been working to fix things, adding another redirector and other tweaks. Andrew has asked the xrootd experts (cc'd?) why the behaviour they are seeing is occurring (and also notes some references to RALPP slipping into the Tier 1 discussion). Waiting for reply (27/10)

109608(24/10)
T2K noticed the LFC denying the existence of a new user. The problem seems to have gone away from the T2K side, but Catalin has spotted a potential problem and asked for some voms-proxy-info output. Waiting for reply (28/10)

109814(3/11)
Atlas have noticed a lot of lost job heartbeats over the last day, the Tier One guys are on it. In progress (3/11)

107935(27/8)
Inconsistent BDII/SRM numbers. Looks to be a problem with how castor reports read-only disk servers, Brian has put in a request to the Castor team for information on this. On hold (3/11)

Tuesday 28th October

  • There are 26 open UK tickets at the moment. Click here to list them.

Scanning the VO Nagios, what do I see...
VO Nagios

Brunel's had gridpp failures for the last 4 days.
Lancaster's still failing for pheno (something in the authorisation chain is broken, can't see what).
RalPP is having a brief problem with T2K (Job submission failed, no more possible targets).
Bristol's in downtime.
Sheffield is having trouble with gridpp, and has started having trouble with pheno, Sno+ and T2K for the last week.
And all the srms at the Tier 1 are failing their tests (which I believe is the status quo, although this set of tests has only been failing for the last 11 days).

Tickets
26 Open UK Tickets this morning.

100IT
109573(23/10)
100IT have got a ticket describing the Requirements for Fed Cloud sites, detailing requirements such as what they need to put into the GOC DB and publish in their BDII, and that they should support dteam. Still just Assigned (23/10) (The other 100IT ticket, 108356, is progressing nicely and might be of interest to anyone thinking of playing with vmcatcher).

TIER 1
109276(11/10)
Some users were having trouble using the FTS3 REST interface at RAL. Ticket is progressing, but I just flagged it up as it has a few possibly-of-interest technical tidbits about reloading CRLs. In progress (28/10)

SHEFFIELD
109644(27/10)
Sheffield failing SAM tests, likely caused by a reyaiming. Elena has noticed errors in her bnotifier and bupdater logs (errors like "key job_registry_add_remote not found"), and has asked for help - I've not seen these errors before, has anyone else? In progress (27/10)

EFDA-JET
109571(23/10)
Nagios failures at Jet after some certificate troubles last week. The Jet admins have said they've fixed things, and they're looking all green, so can this ticket be closed? In Progress (23/10)

MANCHESTER
109272(11/10)
Atlas transfer failure ticket. The original problem looked to be at the NDGF end, but atlas have spotted (possibly unrelated) problems with other transfers - the example was one between Manchester and Liverpool. Atlas also observed that the error rate seemed to be very different for each space token. In progress (26/10)


Monday 20th October 2014, 14.30 BST

Non-LHC VO Nagios Failures:
VO Nagios

Liverpool, Lancaster (we're getting better), Sheffield, EFDA-JET, The Tier 1, Bristol (in downtime) and Cambridge are on "the list". Most are transient, load based errors. gridpp, pheno and southgrid seem to be the VOs having most problems.

We're up to 30 Open UK Tickets this week.

TIER 1
109276(11/10)
Submissions to the FTS3 REST interface were failing for some, probably after the certs or crls got stale. Andrew L suggested implementing an httpd restart, which Maarten thought was overkill - but anyhoo the submitter has come back to say that he hasn't seen a problem all week, so this ticket can likely be closed. In progress (20/10)
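
For reference, a rough sketch of the usual remedy when CRLs go stale under Apache - the commands are standard, though whether this is exactly what was done at RAL is an assumption on my part:

  # Refresh the CRLs under /etc/grid-security/certificates
  fetch-crl
  # Then gracefully reload httpd so the workers pick up the fresh CRLs,
  # without the full restart that was deemed overkill.
  apachectl graceful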

108845(27/9)
Just a heads up that this atlas transfer failure ticket has been reopened. Reopened (18/10)

RALPP
109360(15/10)
This SNO+ ticket, about failing nagios tests at RALPP, hasn't been noticed yet. Assigned (15/10) Update - it was a ticket meant for the Tier 1 all along, In Progress now - actually, waiting for reply

SHEFFIELD
109207(8/10)
SNO+ would like the VO_SW_DIR environmental variable to point to cvmfs - I know Elena has looked at this, any progress? In progress (9/10)

Similar story with another Sno+ ticket at Sheffield:
109223(9/10)

BRUNEL
109379(16/10)
SRM Nagios test failures. It looks like Brunel's SE is in a dodgy state - too many ftp connection failures have been seen in the gridftp logs, httpd causing heavy load, possible SELinux problems after a DB move. I'm sure if anyone has any input on this it would be appreciated. In progress (17/10)

IMPERIAL/DIRAC
108723(23/9)
I think this ticket from Chris W, containing questions for the DIRAC team, can be closed in favour of the new line of communication Daniela set up (https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users). Waiting for reply (7/10) Update - closed.

ECDF AND GLASGOW
Two very similar LHCB cvmfs tickets at these sites, any chance of a link? Or perhaps just a coincidence?
ECDF: 109440
GLASGOW: 109439
Update - probably not, the Edinburgh ticket is now closed.

Another Update
I think that the SnoPlus ticket asking for srmcp help for SUSE users (107880) can be closed now, thanks to Duncan's tip for getting srmcp to work (num_streams=1).

Another another Update
The 100IT ticket (108356), about making fedcloud.egi.eu available at the site, has been updated by David B after some silence - currently waiting for a reply. Perhaps a sign that the process needs better documentation, in the event any sites go down this rabbit hole?

Monday 13th October 2014, 15.00 BST

Other VO Nagios:
VO Nagios

Sites seeing problems at the time of writing are:
Lancaster - long term errors for pheno & gridpp on one CE (still to be fixed).
Liverpool - short term errors for snoplus on a cream CE (looks like it's rejecting jobs).
RALPP - short term errors for southgrid and pheno on ARC CEs (job submission problem).
Bristol - short term error for southgrid ("Job submission to LRMS failed").
Sheffield - long term errors for gridpp on multiple CEs, short term errors for pheno, t2k and snoplus (timeouts affecting job submission).
QMUL - long term errors for t2k on their SE ("GlueVOInfoPath or GlueSAPath not published").
TIER 1 - long term errors for t2k and snoplus on their respective SEs.

I'm still figuring out how best to present this, please bear with me.

23 Open UK tickets this week: 10 Green, 3 Yellow, 1 Orange and 10 Red.

Tier 1
108944(1/10)
CMS having trouble finding some files at RAL during an AAA access test. The RAL team has satisfied the ticket to the first order (confirming that the files in question are indeed in castor), so the ticket could be solved - or at least CMS could be asked to see if they still have trouble accessing the files. In progress (1/10)

108546(16/9)
An atlas ticket, about some job failures that might well not be relevant any more. Looking very stale, and possibly like it could be closed. In progress (22/9)

Also on the "probably should be On Hold" list: 106324 (CMS)

And Chris W, could you please take a peek at: 107880
(Sno+'s odd suse user group needing help).

SUSSEX
108765(24/9)
ROD ticket about the state of the Sussex BDII output. Matt RB tracked it to a problem with their (updated) SGE and has submitted a ticket (109263) which appears to have been picked up. Correctly On Hold (13/10)

IMPERIAL/DIRAC
108723(23/9)
Ticket from Chris W, asking some questions about DIRAC. It really could do with some input from him, and Daniela points out the existence of the new dirac user mailing list as a better place for such discussion: https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users. Waiting for reply (1/10)

SHEFFIELD
Could this Sno+ ticket: 109223 (jobs not being assigned to Sheffield) be related to this Sno+ ticket: 109207 (Sno+ SW DIR needs to be pointed to cvmfs)? Just a naive thought, if the SW_DIR was one of the requirements for jobs.

That's all Folks, please let me know if I've missed anything out.

Monday 6th October 2014, 14.30 BST
On top of the lovely tickets there was a discussion in the Ops team last week, and it was mentioned how it would be handy to look at how sites were doing on the VO nagios, so I thought I'd go over that here.

VO Nagios

Sites that seem to be having trouble on one or more of their nodes at the time of writing are:
Durham: pheno and gridpp
Lancaster: pheno and gridpp
Sussex: snoplus
EFDA-JET: gridpp, pheno, southgrid
Liverpool: gridpp, snoplus
Sheffield: gridpp, snoplus
QMUL: t2k.org
TIER 1: snoplus and t2k
Although only Lancaster, Sheffield and the Tier 1 seem to be having really long term problems.

(I'm still trying to think how best to parse this information, so my apologies that it's poorly presented).

On to the tickets.

Only 24 open UK tickets this month (organised by site).

SUSSEX
108765(24/9) Sussex have a ROD ticket, originating from a glue validation error (although it's just picked up some SHA-2 failures). Matt RB was away though, so not much progress - Matt can you get to it this week? In progress (3/10)

RALPP
109115(6/10)
A fresh ticket from cms, complaining that RALPP don't have any backup squids listed in their site xml file. Assigned (6/10) and closed on (7/10) as the site name was old (the old one being too long!).

BRISTOL
106325(18/6)
CMS pilots losing network connectivity. CMS have confirmed that it is only a subset of the Bristol clusters seeing pilots dropping connections. Winnie has continued to poke and prod this, and between her and CMS they've (more or less) ruled out natting as the cause of the problem. Bristol are still quite stuck, and kind of hoping some unrelated network tweaks might sweep this issue away. On Hold (2/10)

ECDF
95303(1/7/13)
tarball glexec deployment - see Lancaster entry on the same issue. On hold (29/8)

DURHAM
108273(5/9)
Durham experienced a sudden, odd change in their perfsonar results (outbound bandwidth went up, inbound dropped). The Durham chaps were looking into this but were interrupted by this shellshock business. Oliver has included some long term plans in the ticket and will update it again when they have their perfsonar back. On hold (6/10)

SHEFFIELD
108716(23/9)
Snoplus jobs not running at Sheffield. Elena had to bash one of her CEs into shape, but it should be fixed now, and she has asked Matt M if he still sees a problem. Waiting for reply (6/10)

MANCHESTER
109001(2/10)
Not quite a site problem, but David M was having trouble committing to the SVN hosted at Manchester (and a reminder that I believe the "official" way of reporting problems with these services is to ticket the site). It looks like this has been solved and the ticket can probably be closed. In progress (3/10)

109049(4/10)
Atlas transfer problems - the underlying issue being a downed (and dead) disk server. Alessandra is doing the lost file declaration stuff and offered to provide lists of these files to the users directly. Not much more that Manchester can do. In progress (6/10)

LANCASTER
100566(27/1)
Poor, unexplained perfsonar performance. Although some ideas have been put forward on how to tackle this, holidays and then shellshock have got in the way of implementing them. On hold (1/10)

108715(23/9)
Sno+ jobs not running at Lancaster. Hopefully after a tweak to the information system on our CEs I fixed this - as Duncan pointed out things are looking okay on the VO nagios. I've asked Matt M how things are looking for "real" Sno+ work. Waiting for reply (1/10)

95299(1/7/13)
tarball glexec ticket. As mentioned in last week's Ops meeting, due to holidays there has been no progress over the last month but things look hopeful. On hold (9/9)

UCL
95298(1/7/13)
Non-tarball glexec ticket. Ben's been trying to install this, but having dependency troubles - did anyone who uses rpms notice this when they last tried to install the glexec WN? In progress (29/9)

109039(3/10)
Another Glue2 validation ROD ticket. In progress (3/10)

IMPERIAL
108723(23/9)
Chris W has ticket Imperial with a few dirac file catalogue queries. Duncan responded with some documentation that others might also find useful and some other information. I believe the ticket is now waiting for feedback from Chris (who may in turn be waiting for feedback from the other VO user groups). Waiting for reply (1/10)

EFDA-JET
108735(23/9)
biomed have asked that JET activate the biomed cvmfs repo at their site. Ticket seen but no news or action. In progress (23/9)

97485(21/9/13)
One of the ancient tickets. LHCB having authentication errors at Jet. No change. On hold (1/10)

109080(6/10)
A fresh ROD ticket about a number of alarms - at first glance I would say a certificate has expired. In progress (6/10)

100IT
108356(10/9)
VM images from fedcloud.egi.eu not available at 100IT. This ticket showed up an issue with creating an AppDB profile, but that has since been solved. No news on the state of this ticket other than that the issue persists. In progress (1/10)

THE TIER 1
107935(27/8)
"BDII vs SRM inconsistent storage capacity numbers". No news on this for a long time. This ticket really could do with some love (or at least on holding!). In progress (3/9)

106324(18/6)
CMS pilots losing connection, similar to the Bristol ticket. The issue has been tracked to being *something* in the Tier 1's internal network after comparing firewall rules to RALPP. CMS have updated the ticket with some more information and some nice plots, but the long and the short of it is the problem persists. In progress (1/10)

108546(16/9)
atlas seeing failures on the RAL-LCG2_HIMEM_SL6 queue. Ticket in an odd state - the atlas shifters seem to think the problem was transient, but Gareth and co are seeing a lot of load on diskservers despite nothing on BiGpanda. The RAL team is keeping an eye on it, but this ticket could do with some updates/on holding in the mean time. In progress (22/9)

107880(26/8)
Sno+ asking RAL for help/alternatives with srmcping for a small group of seemingly awkward SUSE-using users. Some input from others but not much word from Sno+ or the Tier 1 - Chris, could you please take a peek with your small VO hat on? In progress (30/9)

108944(1/10)
CMS running into a lot of "file not found" errors when running an AAA check at RAL, and asking if things are alright. When looking over the whole Castor namespace it appears that all files are present and correct, which doesn't explain why CMS had trouble finding them. In progress (1/10)

108845(27/9)
Atlas seeing gridftp timeouts. This looks to be a hotspot problem (at this point in the review I'm just skim reading tickets). Atlas also report seeing deletion errors, and have included links. I'm not sure if this ticket will be impacted by this afternoon's Castor intervention. Still very much In Progress (5/10)

Monday 29th of September 2014, 15.00 BST
24 Open UK tickets this week.

TIER 1
107935(27/8)
Inconsistent BDII/SRM reported storage numbers for ATLASHOTDISK. No news on this ticket for quite a while. In Progress (3/9)

107880(26/8)
Not quite a RAL ticket; Sno+ asking about how to accommodate users who only have access to srm tools to access data. Nothing since Henry's helpful input on the 10th. Anyone with any ideas? I've mentioned the UI tarball but I think that's clutching at the weediest of straws. In progress (29/9)

Sno+ having trouble running jobs.
Sheffield: 108716(23/9)
Lancaster: 108715(23/9)
Matt M reports that Sno+ are having troubles running at Lancaster and Sheffield - it looks like we could be seeing the same problem, I'll let you (Elena) know what I find out. Both In Progress.

100IT
108356(10/9)
100IT not running VMCatcher at their site. There was some trouble creating an AppDB profile, but this was solved (108548). No news on this ticket since then. In progress (17/9)

SUSSEX
108765(24/9)
Sussex has a BDII nagios check ticket that has escalated - can you please give it an update and if you're stuck let us know - Lancaster got a similar ticket on Friday and these issues are annoying to tackle. In Progress (29/9)

RHUL
108448(12/9)
Just a warning that this atlas data transfer ticket has been re-opened on you, with a new set of transfers detected. Reopened (29/9) (It could well be that 108856 is a duplicate of this ticket.).

Tuesday 23rd September

  • Down to 19 tickets this week.
  • The GGUS summary is available for review.

Tuesday 16th September 2014, 10.00 BST
We seem to once again be in the Dark Ages here at Lancaster, with yet another power outage that is overrunning. Hopefully I'll be at the meeting though, thanks to a huge laptop battery and a brave little eduroam wireless hub that somehow is still up and running. Sorry for the lack of email warning, and trampling over Jeremy's update!

26 Open UK tickets this week.

Tier 1
107935 (27/8)
Regarding the mismatch between BDII and SRM storage numbers for ATLASHOTDISK at RAL. Maria asked a question about this last week, but no answer yet. In progress (3/9)

107880
Not really a Tier 1 issue: Sno+ trying to allow some remote SUSE users to access data. Henry has given some nice suggestions from his experience with MICE, using a java srm web gui thing. In progress (10/9)

100IT
108356
100IT have been asked to set up the vmcatcher tool at their site, but have hit a snag at the first hurdle of creating an appdb account to allow them to download the images. Anyone have any experience with this? Looks like a counter ticket is needed, which the 100IT guys might not be confident in doing themselves. Waiting for reply (10/9)

OXFORD
107911
Sno+ ticket about wanting to set software tags on ARC CEs. Ewan has replied with a comprehensive but not particularly positive (from Sno+'s point of view) post, but there's some hope that they can get what information they need from the VO nagios pages (which Kashif has got working for ARC CEs now as well). I don't think there's anything more that can be done here, but we might want to give Matt M a chance to reply. In progress (15/9)

SHEFFIELD
107886 (26/8)
Sheffield's perfsonar box playing up. Elena has tried to get it back on its NICs, but no joy. My advice is a reinstall or a mail to perfsonar support (or at least TB-SUPPORT). In progress (9/9)

QMUL
108217 (3/9)
Duncan ticketed QM's IPv6 test perfsonar about not initiating any tests in the Test Mesh. Glancing at the link it looks to me like this is no longer the case, but if I'm mistaken and there will be no progress for a while it'll be nice to on hold this ticket. In progress (8/9)

ECDF
108353(10/9) Pheno have been having a sort out of their storage in the UK (108334). The ECDF version of this ticket seems to have created some confusion, as the Edinburgh chaps don't support pheno on their storage, whilst Pheno are under the impression that they aren't supported at ECDF at all. In Progress (10/9)

Tuesday 16th September

  • 23 open tickets this week.
  • The GGUS summary is available for review.
  • With reference to tickets discussed last week (OK. means done. Y. means continuing).


NO SITE IN PARTICULAR
Ok. 108182(3/9)
As seen on TB-SUPPORT, the NGI has a ticket telling it to get sites to have the new voms servers configured for the switch over. Jeremy has kindly offered to field the ticket. I think we all have this in hand, but as I type this I realise I may have forgotten to set things up for the ops VO. I encourage everyone to double check their readiness ahead of next Monday's switchover. Assigned (8/9)

Y. 106615(2/7)
The RAL FTS2 service has been shutdown for nearly a week now, so I suspect this ticket tracking the switch off can be closed. In progress (3/9)

RALPP
Ok. 108306(8/9)
CMS having trouble running a "locateall" AAA test at RALPP (TBH I don't know what that is) - Chris has let them know that this is due to their xrootd reverse proxy being down, and it should be up and running in a day or two after it's reinstalled. In progress (8/9)

OXFORD
Y. 107911(27/8)
As mentioned last week, Sno+ have been having trouble as they can't assign software tags on Arc CEs, and they use these tags to do stuff like black/white listing. There was some discussion on this in the ticket, but it fizzled out - I suspect due to the topic moving offline. Can it have an update please? In progress (27/8)

BRISTOL
Y. 106554(29/6)
CMS transfer problems to Bristol. Winnie put an update, where she mentioned she has applied a fix to their Storm that might have fixed the problem. Maybe. She's asked if the problem still persists, as the monitoring links provided have all gone stale. Lukasz is on leave, can anyone CMS savvy help her? Waiting for reply (8/9)

Y. 106325(18/6)
CMS Pilots losing contact with home base. No progress since Winnie noticed that the problem only seems to affect one of the Bristol clusters, but none expected due to leave. On Hold (8/9)

Ok. Update - Bristol have another, possibly related CMS ticket 108317

EDINBURGH
Ok. 108100(1/9)
Maarten ticketed ECDF about their CEs not having the new voms servers configured. Andy is working on it. There's a reminder that on top of adding the right configs, services do need restarting. In progress (5/9)

Y. 95303(1/7/2013)
glexec tarball ticket. There's a bit more movement on getting this done, but it's all on me to get the tarball glexec working still - naught the Edinburgh chaps can do.

DURHAM
Y. 108273(5/9)
Duncan noticed some interesting goings on on the Durham perfsonar page. The Durham chaps are talking to their networking team to figure out what the flip is going on. In progress (8/9)

SHEFFIELD
Y. 107886(26/8)
Duncan's unwavering gaze also noticed a problem on Sheffield's perfsonar. Elena was tweaking it when it broke, and it looks like it's still broken, any luck fixing it Elena? In progress (26/8)

LIVERPOOL
OK. 108288(8/9)
Liverpool got a ROD ticket when their CREAM CE got poorly. Steve worked his magic and things were fixed, but Gareth asks about the persisting BDII tests still failing. Solved (8/9) Update - the problem seems to have disappeared, so was probably just an artifact of BDII lag.

LANCASTER
Y. 100566(27/1)
My personal shame number 1. Lancaster's poor perfsonar performance. Despite a reinstall of the box and not showing any signs of a bottleneck in transfers or running manual tests we still have really poor perfsonar results. No problems with the network have been found. Duncan helped formulate a plan at GridPP, but I haven't had the time to test it out yet. On hold (8/9)

Y. 95299(1/7/13) My personal shame number 2 - Lancaster's glexec deployment ticket. Some news in that I have something I'd like to test now - I just need to find time to test it, then see if I can package it somehow. On hold (8/9)

UCL
Y. 95298(1/7/13)
UCL's glexec deployment ticket. This work was pushed back to the end of August - any news on it? On Hold (29/7)

OK. 107711(15/8)
A ROD ticket for UCL APEL publishing errors. The apel admins got involved and things are looking better now - although Gareth points out that there is some missing data in the Spring. In progress (8/9)

QMUL
OK. 107799(21/8)
Pointing VO_SNOPLUS_SNOLAB_CA_SW_DIR to /cvmfs/snoplus.gridpp.ac.uk. No news for a while on this after it was acknowledged - has the job fallen to the bottom of the stack? In progress (22/8) Solved now, issue was dealt with last week but the ticket wasn't updated.

Y. 108217(3/9)
Duncan ticketed QM about one of their perfsonar boxen - which Dan pointed out is their IPv6 perfsonar. So does that mean this ticket can be closed? In progress (4/9) Update - Duncan would like the ticket kept open to track this node's assimilation into the mesh.

EFDA-JET
Y. 97485(21/9/13)
Longstanding LHCB ticket with JET. No movement on this, but none was expected. Still if anyone wants to heroically interject with some ideas I'm sure it would be appreciated. On hold (29/7)

TIER 1
Y. 107880(26/8)
As mentioned last week, Matt M of Sno+ fame has a user who only has access to srm tools and is having trouble accessing files at RAL. Brian has suggested using the webfts, but Matt doesn't think this will work for the user's limited abilities. Any thoughts? In progress (8/9)

Y. 107935(27/8)
Inconsistency between BDII and SRM reported storage capacity...hang on, haven't we been here before (105571)? It's not quite the same problem, but it's close. Brian has confirmed the mismatch, Maria has asked for an explanation for it (and how it only really affects ATLASHOTDISK). In progress (3/9)

Y. 105405(14/5)
Checking the site firewall configuration for RAL's Vidyo router. Last update was in July, is the dialogue between the Vidyo team and the RAL networking chaps ongoing? On hold (1/7)

Y. 106324(18/6)
The Tier 1's version of 106325 - CMS pilots losing contact. This was waiting on the firewall expert getting back from hols to compare the settings between the Tier 1 and Tier 2 (who don't see this issue). Are they back yet? On Hold (14/8)

Monday 8th September 2014, 15.00 BST
25 Open UK tickets this week.

NO SITE IN PARTICULAR
108182(3/9)
As seen on TB-SUPPORT, the NGI has a ticket telling it to get sites to have the new voms servers configured for the switch over. Jeremy has kindly offered to field the ticket. I think we all have this in hand, but as I type this I realise I may have forgotten to set things up for the ops VO. I encourage everyone to double check their readiness ahead of next Monday's switchover. Assigned (8/9)
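
For anyone double-checking their own readiness, a hedged sketch of what "having the new voms servers configured" boils down to per VO - the hostname and DNs below are placeholders, not the real GridPP VOMS details:

  # /etc/grid-security/vomsdir/<vo>/new-voms.example.ac.uk.lsc
  /C=UK/O=eScience/OU=Example/CN=new-voms.example.ac.uk
  /C=UK/O=eScienceCA/OU=Authority/CN=Example CA
  # ...plus a matching "<vo>" "new-voms.example.ac.uk" "<port>" "<host DN>" "<vo>"
  # line in /etc/vomses, and then restart the services that cache the VOMS
  # configuration (as the ticket reminds everyone).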

106615(2/7)
The RAL FTS2 service has been shutdown for nearly a week now, so I suspect this ticket tracking the switch off can be closed. In progress (3/9)

RALPP
108306(8/9)
CMS having trouble running a "locateall" AAA test at RALPP (TBH I don't know what that is) - Chris has let them know that this is due to their xrootd reverse proxy being down, and it should be up and running in a day or two after it's reinstalled. In progress (8/9)

OXFORD
107911(27/8)
As mentioned last week, Sno+ have been having trouble as they can't assign software tags on Arc CEs, and they use these tags to do stuff like black/white listing. There was some discussion on this in the ticket, but it fizzled out - I suspect due to the topic moving offline. Can it have an update please? In progress (27/8)

BRISTOL
106554(29/6)
CMS transfer problems to Bristol. Winnie put an update, where she mentioned she has applied a fix to their Storm that might have fixed the problem. Maybe. She's asked if the problem still persists, as the monitoring links provided have all gone stale. Lukasz is on leave, can anyone CMS savvy help her? Waiting for reply (8/9)

106325(18/6)
CMS Pilots losing contact with home base. No progress since Winnie noticed that the problem only seems to affect one of the Bristol clusters, but none expected due to leave. On Hold (8/9)

Update - Bristol have another, possibly related CMS ticket 108317

EDINBURGH
108100(1/9)
Maarten ticketed ECDF about their CEs not having the new voms servers configured. Andy is working on it. There's a reminder that on top of adding the right configs, services do need restarting. In progress (5/9)

95303(1/7/2013)
glexec tarball ticket. There's a bit more movement on getting this done, but it's all on me to get the tarball glexec working still - naught the Edinburgh chaps can do.

DURHAM
108273(5/9)
Duncan noticed some interesting goings on on the Durham perfsonar page. The Durham chaps are talking to their networking team to figure out what the flip is going on. In progress (8/9)

SHEFFIELD
107886(26/8)
Duncan's unwavering gaze also noticed a problem on Sheffield's perfsonar. Elena was tweaking it when it broke, and it looks like it's still broken, any luck fixing it Elena? In progress (26/8)

LIVERPOOL
108288(8/9)
Liverpool got a ROD ticket when their CREAM CE got poorly. Steve worked his magic and things were fixed, but Gareth asks about the persisting BDII tests still failing. Solved (8/9) Update - the problem seems to have disappeared, so was probably just an artifact of BDII lag.

LANCASTER
100566(27/1)
My personal shame number 1. Lancaster's poor perfsonar performance. Despite a reinstall of the box and not showing any signs of a bottleneck in transfers or running manual tests we still have really poor perfsonar results. No problems with the network have been found. Duncan helped formulate a plan at GridPP, but I haven't had the time to test it out yet. On hold (8/9)

95299(1/7/13) My personal shame number 2 - Lancaster's glexec deployment ticket. Some news in that I have something I'd like to test now - I just need to find time to test it, then see if I can package it somehow. On hold (8/9)

UCL
95298(1/7/13)
UCL's glexec deployment ticket. This work was pushed back to the end of August - any news on it? On Hold (29/7)

107711(15/8)
A ROD ticket for UCL APEL publishing errors. The apel admins got involved and things are looking better now - although Gareth points out that there is some missing data in the Spring. In progress (8/9)

QMUL
107799(21/8)
Pointing VO_SNOPLUS_SNOLAB_CA_SW_DIR to /cvmfs/snoplus.gridpp.ac.uk. No news for a while on this after it was acknowledged - has the job fallen to the bottom of the stack? In progress (22/8) Solved now, issue was dealt with last week but the ticket wasn't updated.

108217(3/9)
Duncan ticketed QM about one of their perfsonar boxen - which Dan pointed out is their IPv6 perfsonar. So does that mean this ticket can be closed? In progress (4/9) Update - Duncan would like the ticket kept open to track this node's assimilation into the mesh.

EFDA-JET
97485(21/9/13)
Longstanding LHCB ticket with JET. No movement on this, but none was expected. Still if anyone wants to heroically interject with some ideas I'm sure it would be appreciated. On hold (29/7)

TIER 1
107880(26/8)
As mentioned last week, Matt M of Sno+ fame has a user who only has access to srm tools and is having trouble accessing files at RAL. Brian has suggested using the webfts, but Matt doesn't think this will work for the user's limited abilities. Any thoughts? In progress (8/9)

107935(27/8)
Inconsistency between BDII and SRM reported storage capacity...hang on, haven't we been here before (105571)? It's not quite the same problem, but it's close. Brian has confirmed the mismatch, Maria has asked for an explanation for it (and how it only really affects ATLASHOTDISK). In progress (3/9)

105405(14/5)
Checking the site firewall configuration for RAL's Vidyo router. Last update was in July, is the dialogue between the Vidyo team and the RAL networking chaps ongoing? On hold (1/7)

106324(18/6)
The Tier 1's version of 106325 - CMS pilots losing contact. This was waiting on the firewall expert getting back from hols to compare the settings between the Tier 1 and Tier 2 (who don't see this issue). Are they back yet? On Hold (14/8)

Monday 1st September 2014, 15.00 BST
29 Open UK tickets this week.

A Sno+ query
107880(26/8)
Matt M asks the Tier 1 if they can help with one of their user problems, where the SUSE bound user is trying to use srmcp to access files at the Tier 1 and failing. The errors look to me like the client is connecting to a port whilst setting up the transfer. My advice would be to try an uberftp-like client, or muck around with using passive/active settings. Of course there could also be something Tier 1 specific getting in the way. Any wisdom appreciated. In Progress (27/8)

Another Sno+ Issue
Oxford: 107911(27/8)
RALPP: 107910(27/8) - Still just in the "assigned" state. These two tickets concern, as far as I can see, more or less the same problem. Sno+ can't set software tags on ARC CEs, and therefore are running into a spot of bother due to how they're using the tag's functionality (as a blacklisting mechanism). Ewan's been helping out on the Oxford ticket, but I don't think the RALPP one has been noticed yet.

QMUL
107799(21/8)
A friendly poke to remind Dan about this Sno+ ticket (moving VO_SW_DIR variable to point to the cvmfs area). In Progress (22/8)

ECDF
107884(26/8)
Atlas ticketed Edinburgh over webdav instabilities at the site - although it's mentioned in the ticket that they're seeing problems across the UK. Wahid mentions that this is due to the problems with webdav in DPM 1.8.7, and not all their pools are upgraded. Naught wrong with the ticket handling, but this link to some proto-webdav monitoring is interesting: http://sblunier.web.cern.ch/sblunier/webdavGridStatus/ In progress (probably will be closed soon) (28/8)

EFDA-JET
107551(7/8)
The JET CSIRT e-mail check ticket. After some confusion over the GGUS e-mailing (which I believe was broken for a while when these tickets went out) this has stalled. In progress (18/8)

Monday 25th August 2014, 22.30 BST
29 Open UK tickets this week.

SUSSEX
107814(22/8)
Ops failures on the Sussex Cream. Matt sent an e-mail to TB-SUPPORT about this, so if anyone could chime in that would be appreciated - atlas are running fine, but Ops tests (and possibly other jobs coming in via WMS) are hitting a spot of bother. In my experience delegation errors like this often pass in time, but the errors have been going on for over 4 days. Any help appreciated.

107801(21/8)
Perhaps this problem is also affecting Sno+? In Progress (22/8)

RALPP
107844(24/8)
Just a heads up that this atlas "no free space" ticket has been reopened with (possibly unrelated) srm errors. Reopened tickets often sneak past our sentries. Reopened (24/8)

SNOPLUS SOFTWARE DIR to CVMFS (21/8)
LIVERPOOL: 107796
SHEFFIELD: 107798
QMUL: 107799
Sno+ has asked sites to have their VO SW DIR environmental variable point to their cvmfs directory. All three sites are on it - something to note for anyone rolling out Sno+ support.

BRISTOL
106325(18/6)
Winnie has spotted that the CMS pilots losing contact to the submission host problem is only (at least recently) affecting their ARC CE. Whilst the CE flavour isn't the only difference between the clusters, this strongly suggests that the problem isn't with the site firewall. On Hold (19/8)

UCL
107711(15/8)
UCL received an Apel-Pub Ops ticket nearly a fortnight ago which has yet to be even acknowledged. I suspect Ben is on holiday, can someone (looking at the Londoners) poke through other channels? Assigned (15/8)

TIER 1
107815(22/8)
DirectJobSubmit Ops failures at the Tier 1. Catalin asks if the Ops jobs can be tuned and have their registration timeouts increased - as it appears only the test jobs are suffering failures of this kind. Waiting for reply (22/8)

Monday 18th August 2014, 14.15 BST
Still only 20 open UK tickets.

Site CSIRT e-mail address checks
107551(7/8)
EFDA-JET are the only site with a ticket still open - and they replied straight away. Probably hit by the GGUS mail parser problems seen last week. I suspect this ticket (and the master 107538) can be closed. In progress (13/8)

(see also 107648, where Jeremy was re-testing the GGUS email functionality).

SUSSEX
107710(15/8)
Sussex triggered a low availability warning, probably due to being accidentally left in a "Warning" downtime. Could this week's dashboard sentinel (aka ROD) please have a peek to see if the Sussex alarms are disappearing? Poking around the portal for Sussex [1] it doesn't look to my layman's eyes like the numbers are improving. In progress (15/8)

[1] UKI-SOUTHGRID-SUSX


EDINBURGH
107526(6/8)
LHCB noticed a cvmfs problem at ECDF. Andy noted that it looked like a local rpm update had whacked cvmfs, and gave it a whack back into shape. At last word the site was set back into production to see if the fix fixed things, but no news since. In Progress (10/8)

107633(11/8)
ECDF's perfsonar box has stopped producing results. Andy mentioned some border firewall troubles that knocked out the ECDF middleware, is it still causing a problem? Looking at the perfsonar dashboard [2] it looks like there's still no data being produced from the ECDF perfsonar. In Progress (12/8)

[2] Dashboard

BRISTOL
106554(29/6)
CMS transfer problems from the US to Bristol. After the slightly terse response from CMS last week it would be a good idea for there to be a response to the ticket - and if things are going to take some time, for the ticket to be On Held (On Holded?). In progress (6/8)

106325(18/6)
CMS confirm that the problem (pilots losing connection to their submission hosts) still persists - and have included some good links to their pilot monitoring. The corresponding RAL ticket (106324) has stalled due to holiday. On Hold (5/8)

UCL
101285(16/2)
UCL getting their perfsonar back on its NICs (as servers don't have feet). Ben asked Duncan to put UCL back into the UK mesh, which, judging by the very healthy-looking UCL entries in the dashboard, I think has been done. Can this ticket be closed now? Waiting for reply (29/7)

Monday 11th August 2014, 15.30 BST
20 Open UK tickets this week.

CSIRT Site Security Checks
Master ticket: 107538(7/8)

Jeremy has been submitting tickets to site security e-mail addresses to test that they're all working. Most seem to be present and correct, but on the not-yet-replied pile we have:
SUSSEX (107545)
EFDA-JET (107551)
CAMBRIDGE (107558)

It could be that the sites in question have everyone on holiday, or that the site security contact address is a generic one for the University and someone at the other end is wondering what in the name of Odin's beard a GGUS ticket is.

In the event that any of these three saw the tickets here first rather than through the proper channels: please, please can you check your lines of communication rather than treating this like any other ticket.

SHEFFIELD
107217(24/7)
The information publishing on one of Elena's CEs seems to have spontaneously broken (as I've found Torque and Maui are wont to do). If anyone has had their CEs suddenly start publishing all the 4s for their job numbers recently then I'm sure any help would be appreciated. In progress (6/8)
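
If you want to check your own site for the same symptom, below is a rough sketch of the usual ldapsearch-against-the-BDII check. The BDII hostname is a placeholder and the parsing is deliberately crude - it just flags any CE publishing the tell-tale 444444 waiting jobs:

 #!/usr/bin/env python
 # Minimal sketch: spot the "all the 4s" symptom (444444 waiting jobs), which
 # usually means a CE's dynamic info provider has stopped working.
 # The BDII URL is a placeholder - point it at your own site/resource BDII.
 import subprocess
 
 BDII = "ldap://your-site-bdii.example.ac.uk:2170"   # placeholder
 cmd = ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
        "(objectClass=GlueCE)", "GlueCEUniqueID", "GlueCEStateWaitingJobs"]
 
 out = subprocess.check_output(cmd).decode()
 ce, suspect = None, []
 for line in out.splitlines():
     if line.startswith("GlueCEUniqueID:"):
         ce = line.split(":", 1)[1].strip()
     elif line.startswith("GlueCEStateWaitingJobs:"):
         if line.split(":", 1)[1].strip() == "444444":
             suspect.append(ce)
 
 print("Queues publishing all the 4s: %s" % (suspect or "none"))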

BRISTOL
106554(29/6)
As seen on TB-SUPPORT, Bristol are having networking problems - especially with regards to the US. Lukasz and Winnie are working on it, but the submitter (and maybe CMS) seems to be losing patience (unfairly). In progress (6/8)

TIER 1
106615(2/7)
Decommissioning of the FTS2 service. Gareth sent out a broadcast - https://operations-portal.in2p3.fr/broadcast/archive/id/1187 Preparations are well underway. In progress (11/8)

106324(18/6)
The RAL counterpart to Bristol's 106325 (cms pilots losing contact with their submission hosts); it looks like this problem still persists as well. A mixup between RALPP and Bristol on the CMS end (and not for the first time) meant that CMS didn't answer RAL's question (which was: do you see the same problem at RALPP too?). Still Waiting for reply (7/8)


Monday 4th August 2014, 14.30 BST
20 Open UK tickets this week.

NGI/Other
107369(30/7)
NGIs are being asked to ask Cloud sites to fill in a questionnaire about the security of grid-deployed cloud resources. This ticket was meant for 100IT - although shouldn't there be one for UKI-GridPP-Cloud-IC too? (I couldn't see such a ticket in the solved pile). I've assigned it to uk-ngi ops and notified the site of the ticket. Assigned (4/8)

106615(2/7) Decommissioning ticket for the FTS2 service at RAL on the 2/9/14. Nothing else to do really, on hold until closer to the time. On hold (14/7)

BRISTOL
106325(18/6)
CMS pilots losing their network connections. The Bristol admins are waiting to see how things pan out for a similar RAL ticket (106324) but I'm not sure if waiting for this is the right thing to do - it could be that the RAL problems are very RAL specific. On hold (14/7)

106554(29/6)
Another CMS ticket, about FTS backlogs between FNAL and Bristol. Although the original transfer has finished a connectivity problem still seems to persist and Lukasz has offered some suggestions and asked cms how they'd like to proceed. Waiting for reply (29/7)

Could these two issues be somehow related?

GLASGOW
107435(1/8)
CMS glideins were getting held up at Glasgow. Dave and the gang tracked down missing /cms/Role=pilot explicit mappings in their argus, and have added them in. Things are looking better, with the ticket now in the customary "How's it look on your end?" state. Waiting for reply (4/8)

ECDF
95303(1/7/13)
glexec deployment ticket. There's been some movement on the glexec tarball development front at last. Jeremy tweaked the reminder date and assigned person. On hold (29/7)

SHEFFIELD
107217(24/7)
Sheffield failing site-bdii checks due to the old "all the 4s" published waiting jobs problem (usually due to broken dynamic publishing). Ticket's acknowledged, and the expiry has been extended, but could do with a proper update soon. In progress (1/8)

LANCASTER
100566(27/1)
Poor perfsonar performance. After a reinstall of the node and establishing that there is no hard bottleneck we're stuck. Currently waiting on some network engineer time at Lancaster, whilst scratching our heads over why the perfsonar isn't working right for us. On hold (4/8)

95299(1/7/13)
Tarball glexec ticket. I've opened up a line with the glexec devs, who have been very helpful. They've given me some build tips, but then I went on holiday before I could use their advice. On hold (29/7)

UCL
101285(16/2)
Ben got the UCL perfsonar box back on its NICs, and is just waiting on getting it back into the WLCG mesh. Jeremy is on the case. Waiting for reply (29/7)

95298(1/7/13)
UCL's glexec ticket. Ben has stated this will have to wait until he is back from leave at the end of August. On hold (29/7)

RHUL
107436(1/8)
Atlas having transfer problems to RHUL. Govind has tracked down a gridftp mapping problem that solves some of the errors; the rest seem to be due to his new pool nodes misbehaving. Perhaps a problem with the configuration of the latest version of DPM? In progress (3/8)

QMUL
107440(2/8)
LHCB seem to be having problems getting files from the input sandbox on what appears to be all QM cream CEs. Chris is on the case. In progress (4/8)

107402(31/7)
Site BDII test failures. It looks like this problem has evaporated; Gareth has suggested that somebody at QM close the ticket if they're happy. Assigned (can be closed) (1/8)

Cloud-IC
106347(19/6)
The new cloud site was noted as hogging 12% of the cern stratum one cvmfs connections. There was some discussion about this, and the Shoal installation at Oxford might well prevent this from happening again, but the site was down for maintenance so no confirmation could be made. When is the cloud site likely to be back in action? On Hold (14/7)

EFDA-JET
97485(21/9/13)
LHCB authentication errors at JET, which survived OS and EMI upgrades. I've been all talk and no trousers about getting round to helping JET out - has anyone seen anything like this? Or have any ideas? On Hold (29/7)

TIER 1
107416 (31/7)
The RAL FTS has been accused of hammering the US MWT2 srm. Andrew has suggested a course of action that might soothe things. Waiting for Reply (4/8)

106655 (4/7)
Castor failing ops tests. This was due to reasons that are understood by clever people. A fix was delayed, but hopefully should roll out this week. In progress (31/7)

105405 (14/5)
Vidyo router firewall checking ticket. This ticket has been left fallow for a while, with some offline discussion between the Vidyo devs and RAL networking. Any news? On hold (1/7)

106324(18/6)
CMS pilots losing connection to their submission hosts. Firewall tweaks haven't fixed the problem, but a suggestion of changing the pilot "keepalive" parameter was put forward. There seems to be some confusion on the cms end about the current state of this issue, but last word says it persists. In progress (30/7)


Monday 28th July 2014, 14.15 BST
I'm on holiday again, but the UK tickets can be checked here:
http://tinyurl.com/p37ey64

There were 19 UK tickets at time of writing, none of which looked particularly troubling.

If for some reason you ever want to look up past ticket reviews they can be found at:
https://www.gridpp.ac.uk/wiki/Past_Ticket_Bulletins

Cheers!
Matt

Tuesday 29th July. As Matt is away, here is my review of the tickets - Jeremy

BRUNEL

Remove mapping to PIC FTS2. GGUS 107282. (Created: 01/07? Only assigned 28/07).

SHEFFIELD

org.bdii.GLUE2-Validate issue. GGUS 107217. (24/07: In progress 25/07). Site is publishing SW tags for CMS and is unable to delete them. GGUS 106820. Looks stuck, but RHUL resolved a similar issue by editing tags (GGUS 106819). (11/07: In progress 14/07).

UCL

Publishing with EMI-2 APEL. GGUS 106876. Currently reinstalling APEL node, after backing up database. (14/07: Last updated 24/07).

RAL Tier-1

Two alarms from srm-dteam. GGUS 106655. Cross-contamination of information due to the GEN-CASTOR SRMs sharing a database, and some VOs sharing service classes. In progress. (04/07: Last update 24/07)

NGI

Decommissioning FTS2 Service at RAL Tier1. GGUS 106615. It is a master ticket following the decommissioning process. (02/07: On hold 14/07)

RED TICKETS!

BRISTOL

6 day backlog transferring from FNAL to Bristol. GGUS 106554. Several things tried. Luke looking for some suggestions. (29/06: Last update 22/07)

LANCASTER

CVMFS problem. Pilots aborting. Various things tried. GGUS 106406 (23/06: on hold 28/07).

IMPERIAL

Biomed unable to list files with gsiftp. GGUS 106369 They want gsiftp rather than SRM for performance but it is not supported for this use in dCache (20/06: Last update 22/07).

IMPERIAL-CLOUD

Cloud site responsible for 12% of all WLCG traffic to the service cvmfs-stratum-one.cern.ch. GGUS 106347. Thought Shoal may help. Site in maintenance. (19/06: on hold 14/07).

BRISTOL

CMS pilots losing network connections. GGUS 106325. Tier-1 sees something similar so waiting on GGUS 106324. (18/06: on hold 14/07)

RAL T1

CMS pilots losing network connections. GGUS 106324. Network settings suggestion made… no response yet from the site. Needs a response. (18/06: 23/07)

GGUS 105405. Check your vidyo router config. Was being followed up and a question remained about connections for clients (14/05: on hold 01/07).

UCL

Problem with perfsonar host. GGUS 101285. Reinstalled and as of 24/07 waiting to be added to WLCG mesh before closure of ticket. (16/02: 24/07)

LANCASTER

perfSONAR poor performance. GGUS 100566. Host reinstalled but issue remains. Matt looking for ideas. (27/01: 01/07). On hold since 23/06.

EFDA-JET

LHCb jobs failed. GGUS 97485. (21/09/13!: 12/05). On hold in May. Site needs help to resolve.

ECDF

Glexec deployment. GGUS 95303. There is no tarball. (01/07/13: on hold 29/11/13).

LANCASTER

Glexec deployment. GGUS 95299. There is no tarball. (01/07/13: on hold 07/07/13).

UCL

Glexec deployment. GGUS 95298. Was awaiting a site update… and possibly new staff member. (01/07/13: on hold 29/08/13).


Monday 21st July 2014, 14.15 BST.
26 Open UK tickets this week.

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106770 (10/7)
This enmr.eu ticket concerning not being able to write software tags at RAL seems to have stalled somewhat. The RAL team did what was asked and Stephen Burke used his wisdom to help out the enmr chaps with matching the new tag. According to the last user post this ticket can be closed, although the user asks if any other services at RAL should host this new tag as well. In progress (can be closed) (15/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106802 (10/7)
The RAL version of the ILC/CMS contacting the wrong voms server ticket. Ticket has been acknowledged but no update since. In progress (11/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106610 (2/7)
HyperK support at the Tier 1. No word since Chris confirmed that HyperK would be alright being arc-ce only. In Progress (2/7)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar box being out of commission after a hardware fault. Ben has got things back on track and is in a position to close the ticket - he's just waiting to see if UCL need adding back into relevant meshes first. In Progress (15/7)

ECDF
https://ggus.eu/index.php?mode=ticket_info&ticket_id=107070 (21/7)
Wahid pounced on this atlas deletion ticket (which was against their test dpm anyhoo). If you think the problem's fixed I'd go ahead and close it. In progress (21/7)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106820 (11/7)
CMS unable to delete software tags at Sheffield - there have been no updates on this ticket for a while. In progress (14/7)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106733 (8/7)
Atlas had transfer problems looking to be the fault of svr018. Some investigation dug up that for some reason there were a lot of rfio connections going on - which stressed the server. The problems seemed to pass on their own - but they're back now (or svr018 has hit a different reef) and the ticket has been reopened. The shifters have put the ticket into Waiting for reply after asking the site to take a look - which we all know isn't the way to go. (21/7)

BRUNEL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106864 (14/7)
This is a CMS ticket about the performance of Brunel's DPM. Sam and Wahid have both lent a hand (possibly at the Storage meeting? I missed it last week), and Raul took steps to improve the DPM performance and upgrade it (I was particularly intrigued by the memcached and dmlite memcache plugin bits). This looks very much like one of those tickets that could very well do with being documented. In progress (18/7)

Monday 14th July 2014, 14.30 BST.
29 Open UK tickets today. I might have to send my apologies to this week's meeting as Lancaster is receiving a delivery Tuesday morning.

FNAL VOMS TICKETS
As seen on TB-SUPPORT - a number of sites got tickets concerning jobs still contacting the FNAL voms server for CMS/ILC. Birmingham, RHUL, Liverpool and the Tier 1's tickets are still being worked on - RHUL's ticket might not have been spotted yet (still assigned).

DECOMMISSIONING THE FTS2 SERVICE
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106615 (2/7)
Gareth opened a ticket to document the retirement, in accordance with ancient grid laws. As naught is happening until the 2nd of September I've put it on hold till nearer the time. On Hold (14/7)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106770 (10/7)
enmr.eu wanted to add tags to one of the Tier 1's arc ces, which of course didn't work. There was an interesting exchange about why a VO would still want to have a site publish tags in the age of cvmfs (essentially so they can minimise changes to the submission gubbins). Andrew offered to add in the tag "VO-enmr.eu-CVMFS" by hand to his CE, it's likely that other sites might be asked to do the same - and it's a solution worth noting for other VOs. In progress (14/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106610 (2/7)
Enabling HyperK at the Tier 1. Ticket looks a little stalled after Chris commented that it was wise for Hyper K to be enabled on only Arc-CEs (in light of RAL going dairy free). In progress (2/7)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (4/7)
UCL are still having trouble with nagios tests after a pool node died. Ben is having trouble getting the new disk server set up - I tried to give him some tips and advised shouting out for help. In progress (8/7)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Bristol having trouble with CMS transfers - Lukasz noticed Storm was being odd (believing there to be no free space when there was). The SE was kicked but the problem (or a similar one) showed up again. Anyone seen similar? (Looking at Chris Walker, Storm Sage, again here). In Progress (9/7)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
cf TIER 1 ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324
CMS pilots losing contact with their home base. Looks similar to the issue at RAL, where they seem to have had some success (still waiting to see if it was complete). If the RAL chaps could elaborate on the firewall tweaks that brought about this improvement it would be greatly appreciated (The RAL ticket could do with an update too)! In Progress (14/7)

Monday 30th June 2014, 14.30 BST
Full Review this week, a little earlier than usual. 28 Open UK Tickets

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937 (2/6)
Low availability ticket, due to EMI3 upgrade woes. Most issues have been solved, but Apel publishing problems have been rolled into the ticket. Matt RB seems to be digging his way out in the right direction though. In progress (30/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
Sno+ CVMFS unavailable at Sussex. On Hold whilst the other issues are dealt with. On Hold (23/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106492 (25/6)
A request from atlas to resize Space Tokens. Matt also asked if atlashostdisk and atlasgroupdisk could be deleted - Brian gave the nod yes. Probably all done with here? In Progress (27/6)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106438 (23/6)
CMS having some trouble running jobs at Bristol (especially having lots of "held" jobs - but reading the ticket this means held on the cms queue, not in the local batch system). Winnie notes that for at least one of their queues they have over a hundred waiting cms jobs on a 72 slot shared queue. But it looks like the problem may have evaporated. At last word the cms submitter said he'd close the ticket if things stayed clear - but this was last Thursday. In Progress (26/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
A different CMS ticket, about pilot jobs losing connection to their submission hosts. After another round of nomenclature confusion, it was found that the problem seems to be between Bristol and hosts cmssrv119.fnal.gov and vocms97.cern.ch. Lukasz suggests using perfsonar to investigate. Also the dates on this ticket are well off (creation date 1/6, but first update 18/6) In progress (27/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Again the dates on this ticket are very off (creation date was the 1/6, but the first update is the 29/6) - so the issue may have disappeared. This is another cms ticket about a heavy transfer backlog between Bristol and FNAL - if it's still a problem it's possibly linked to the above issue. Waiting on Lukasz to get back. In progress (30/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106058 (9/6)
CMS xrootd problems at Bristol. Also waiting on Lukasz's return (which I think has happened). On Hold (16/6)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
glexec ticket. No news; the early review meant I couldn't soothe my shame on this matter. On Hold (27/1)

MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 (2/6)
Manchester publishing to EMI2 APEL. It's being worked on, but one piece is missing - on hold until this detail is sorted. On Hold (25/6)

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106406 (23/6)
LHCB having trouble on Lancaster's older cluster. First issue was cvmfs timeouts - linked to older WNs being overloaded. Second issue is the cream CE losing track of jobs in the batch system. Being worked on, but like a case of old age, tuning can only fix so much. In progress (26/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
glexec ticket. As with ECDF. On Hold (4/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Persistent Poor Perfsonar Performance Problems Plaguing Plymouth-born Postdoc... nope, that's as many Ps as I can get (and I'm not sure I still count as a Postdoc). A reinstall of the box hasn't helped. If anyone has a normal 10G iperf endpoint I could test against that would be great. Other than that we're waiting on some networking rejigging at Lancaster to shake things up and give the network engineers another chance to go over things. On Hold (23/6)
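
If anyone does have an endpoint to offer, the test itself is nothing fancy - just a plain iperf run between the two hosts. A minimal sketch is below; the hostname is a placeholder and it assumes an iperf server ("iperf -s") is already listening at the far end:

 #!/usr/bin/env python
 # Sketch of a plain (non-perfsonar) iperf throughput check between two hosts.
 # Assumes "iperf -s" is already running on the remote host (placeholder name).
 import subprocess
 
 ENDPOINT = "iperf-endpoint.example.ac.uk"   # placeholder remote host
 
 # 30 second TCP test, reporting in Mbits/sec
 result = subprocess.check_output(["iperf", "-c", ENDPOINT, "-t", "30", "-f", "m"])
 print(result.decode())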

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (23/6)
UCL failing ops tests that are using their SE. Ben noticed a problem with one of their pools, but fixing it didn't seem to solve the problem. Gareth has asked for an update pending being forced to escalate. In progress (30/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/2013)
UCL's glexec ticket. Last word was this would be the first job of a newer staff member, who was due to start within a few months (so about nowish?). On Hold (16/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar not working after suffering a hardware failure. Bits have been replaced and the machine was due a reinstall a while ago. On Hold (28/4)

RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106437 (23/6)
Atlas have inaccessible file(s) at RHUL due to a pool node in distress. Govind hopes to install a new motherboard tomorrow and will update after. Good luck with the repair! In progress (30/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105943 (2/6)
Biomed asking for gsiftp access on the RHUL headnode so that they can read the namespace with gsiftp. Govind tried to enable this but biomed report that it didn't work. Not much word since - but I expect Govind's been busy. In progress (23/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105923 (2/6)
RHUL still publishing to EMI2 APEL too. On Govind's to do list, but low priority. No word for a while. On Hold (17/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106495 (25/6)
Inconsistent storage capacity publishing at RHUL. Govind reckons (quite rightly) that this is due to having a pool node out of commission and will look at it once that's fixed. In Progress (26/6)

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105771 (27/5)
Biomed having problems accessing files via https at QM. Chris explains that they've had to switch off https access and are waiting for 105361 to be fixed and storm to be updated. On Hold (12/6)

IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106369 (20/6)
Biomed ticket, similar to 105943 for RHUL, but with some added history (106369). Biomed are being a little insistent, and asked a question that I don't fully understand about path publishing. In Progress (30/6)

IMPERIAL CLOUD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106347 (19/6)
The new cloud site needed to tune things as VMs weren't using proxies but hitting the cern stratum 0 directly. Adam is working on how to get around this - Ewan has mentioned that Oxford have shoal running and have seen accesses from the Imperial Cloud machines - so the problem may have a no-work-required workaround (the best kind!). In Progress (29/6)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/2013)
LHCB jobs having openssl like problems at Jet. No progress on this for a while but none was expected - the problem survived the move to EMI3, and the jet admins are stuck. On Hold (12/5)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
Vidyo router firewall ticket. I suspect this ticket can be closed, as other issues are being followed up elsewhere - or it at least needs an update/being set on hold. In Progress (10/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
Inconsistent BDII and SRM storage numbers for lhcb. This has been worked on, and seems almost fixed. There's some debate over the tape figures, Brian points out that the 'online' values are correct. In progress (30/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324 (18/6)
CMS pilots losing connection to their submission hosts at RAL. It looks like this has been going on silently for a while, the RAL team are taking it up with their networking chaps to see if it's a firewall issue.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106480 (25/6)
The information publishing police have pointed out that the RAL Castor isn't publishing a sane version. Brian suspects a rogue ":" is causing the problems.


Monday 23rd of June 2014, 15.00 BST
27 Open UK tickets today.

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
RAL is publishing inconsistent storage numbers for lhcb. No word on this for a while - but the problem persists. In progress (17/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
Vidyo firewall ticket - it looks like it's heading to ticket limbo, can it be saved from this fate and given an update (or closure)? In progress (10/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
Sno+ cvmfs stratum-0 ticket. Some interesting conversation in this ticket about cvmfs mirroring on the other side of the Atlantic. In progress (17/6)

BIRMINGHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106020 (6/6)
cern@school jobs stuck at Birmingham. Did the investigation yield any results? Or perhaps the problem has evaporated? Silence isn't golden when it comes to tickets! Well, unless they're on hold. In progress (23/6)

IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106369 (20/6)
Biomed have submitted a second ticket (first one was 105942) asking IC to get gsiftp read access to their dcache namespace (think I've got that right). Simon has replied saying that he doesn't want to circumvent what he sees as a security feature (fair enough), so I suspect this one might have to go Unsolved. Waiting for reply (20/6)

IC-Cloud
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106347 (19/6)
Naught wrong with the ticket handling here, but I thought it was interesting - the new Cloud site has been hammering the cvmfs stratum zero - this looks to be a problem with atlas jobs/images trying something new with proxy discovery. An installation of Shoal should fix things. Interesting that not too long ago we had a similar VAC problem. In Progress (20/6)

Monday 16th June 2014, 15.00 BST
28 Open UK Tickets today.
Please can everyone check to make sure they don't have tickets going stale.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106057 (9/6)
The creation of a new UK Cloud site UKI-LT2-IC-HEP-Cloud. Jeremy has created the site and I see Adam has signed himself up as an Admin to it. Does anything else need doing? In progress (11/6)

EMI 2 APEL tickets for RHUL and MANCHESTER:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105923 https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 Not much noise from either of these tickets.

MANCHESTER/VOMS
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106243 (16/6)
Sno+ ran into a spot of bother with some of the UK vomses, Robert replied with a good explanation of what was going on at Manchester. Something for the other voms sites to watch out for (although you probably know all about it). In progress (is it solved?) (16/6)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106060 (9/6)
Matt RB fixed one atlas problem with the Sussex Storm SE, but another has come along - looking like bad checksums. Wahid suggests asking Chris Walker (the Storm Whisperer) for his advice. In progress (16/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's EMI ticket - almost done now, I believe the alarms are disappearing and now the problems are with services not working (but as we discussed last week, upgraded and broken is still upgraded!). In progress (13/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
Sorry to be picking on Sussex. I suspect this Sno+ cvmfs ticket has been put on the back burner - can it be put on hold until you get round to it? In progress (9/6)

Same for https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937

Monday 9th June 2014, 15.00 BST
26 Open Tickets this week.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC ticket. Things got a bit muddled but ILC would like to know the state of Durham's CE. My impression is that they're submitting to a now defunct one - could you please let us know what's up? In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105989 (4/6)
Technically I think this is a Glasgow ticket - I was going to give this a home there, but noticed that the ticket looked solved (it concerned enabling the cern@school cvmfs at Glasgow - which the Glasgow lads had done alongside the other gridpp repos). In progress (can be solved) (6/6)

STOP PRESS
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106057 (9/6)
A ticket from Adam concerning the creation of a new UK Cloud site (UKI-LT2-IC-HEP-Cloud). I'm not sure who this needs to be bounced to (NGI-OPS, Imperial?), it could be that it's all in hand. Assigned (9/6)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937 (2/6)
Sussex got a low availability nagios ticket - Matt RB replied that the trouble is with the EMI3 upgrade and hopes to have dug his way out of that pit shortly. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's EMI3 upgrade ticket. The deadline has passed, and anything not upgraded is in downtime. How go things? In progress (2/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
Sno+ were/are having cvmfs problems at Sussex. Related to 105989 above, has /cvmfs/snoplus.snolab.ca been replaced by /cvmfs/snoplus.gridpp.ac.uk? (The latter of which I can see at my site). In progress (29/5).
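
If other sites want to check which Sno+ repository their worker nodes can actually see, something along these lines should do it - "cvmfs_config probe" is the standard client-side check, and the two repository names are the ones mentioned in the ticket:

 #!/usr/bin/env python
 # Quick check of which Sno+ cvmfs repositories a node can actually mount,
 # using the standard "cvmfs_config probe" client check on each repo name.
 import subprocess
 
 for repo in ["snoplus.snolab.ca", "snoplus.gridpp.ac.uk"]:
     rc = subprocess.call(["cvmfs_config", "probe", repo])
     print("%s: %s" % (repo, "OK" if rc == 0 else "not available"))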

BIRMINGHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106020 (6/6)
Some little lost cern@school jobs at Birmingham, sitting in an odd state. Matt W is having a look, suspecting argus. In progress (6/6)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106011 (5/6)
Atlas deletion errors at Glasgow. Sam and the lads suspect a dodgy disk pool, and are working on it. In progress (6/6)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105996 (5/6)
Duncan spotted that the ECDF perfsonar box had fallen over. Andy and Wahid are prodding it with their remote stick. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105839 (28/5)
Glue Validator failures at ECDF. Andy's reckoning is that the CEs are misconfigured, and he's digging into the guts of the matter. In progress (3/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
My shame, the tarball glexec tickets. Sorry to say nothing to see here again. On hold (27/1)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105617 (21/5)
A Sno+ cvmfs ticket, similar to the Sussex one (105618). Not much news on it. In progress (21/5)

MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 (2/6)
Manchester are still publishing using the EMI2 apel. The work is scheduled to be done next (this) week. In the meantime has publishing been turned off? On hold (2/6)

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105939 (2/6)
Biomed ticketed Lancaster over gridftp not being open on our dpm headnode. After advice from Sam we decided that opening up the firewall ports would be okay, but also told biomed that restricting gfal to just one protocol was a bit silly. Waiting to hear if all's well for them. Waiting for reply (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Poor perfsonar bandwidth performance at Lancaster. Following Duncan's advice a downtime has been declared to try a reinstall of the node on Wednesday. In progress (9/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
glexec tarball ticket. On hold (4/4)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar hit a spot of hardware trouble. Disks and RAID controller have been replaced, last word was that the OS was hoped to be reinstalled at the end of April. I suspect then the EMI3 upgrade storm hit. Any news since? On hold (28/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/2013)
UCL's glexec ticket. At last word waiting on a new staff member to take the reins. On hold (16/4)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
LHCB problems at JET. The last update was from me in May, saying that I'd ask for help on JET's behalf (which I did... but failed to push on it. Sorry JET). On Hold (12/5)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/2013)
Sno+ CVMFS ticket. After looking like it was almost done this ticket has become a bit more murky in recent weeks, with talk of desire for an OSG "mirror" which Catalin points out breaks the cvmfs model. I think some more planning in Sno+ and discussion with the experts is needed. Waiting for reply (2/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
A Vidyo router firewall ticket. Not really sure it's that interesting to anyone outside the Tier 1 - although there are a lot of Vidyo documentation links that might be useful. Not much news on the ticket for a while. In progress (27/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
Mismatch between bdii and srm storage numbers - which has happened before (101310). In progress but no news. In progress (3/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105100 (2/5)
CMS are doing a round of their Storage Consistency Checks. There's been some back and forth between CMS and RAL with clean up being done. Not entirely sure what the next step is for this ticket - it doesn't seem to be a problem yet though. In progress (6/6)


Matt's takin' it easy for the rest of this week and sends the following postcard: "not much else [other than the EMI tickets] that stands out at the moment. cvmfs-enabling request tickets keep coming in (we've had a few for biomed at various sites, Sno+ have 3 tickets open on this, and of course there's the ILC ticket), there has already been talk about recommending that sites get ahead of the game on this and preempt the requests, so maybe this needs to be revisited".

Monday 27th May 2014

  • EMI-2/UMD-2 decommissioning now reaching a critical point. So this week we need another check on:

Any others to be declared!?

And a quick look at the overall ticket situation.


Monday 19th of May 2014, 16.45 BST
30 Open tickets this week.

Big news at the moment are the EMI upgrade tickets (still). The Durham and Sussex tickets have stalled a little. UCL are working on it but have hit problems with their DPM upgrade. Edinburgh are waiting on nagios jobs to start running on their upgraded kit, but otherwise look almost done, and Bristol have vanquished their ticket. The balloon is going up on this one - we have 11 days left till the deadline.

Ye Olde ILC ticket is almost done, Durham have made the move and are just waiting on ILC to test (who in turn are waiting for Durham to come back online).

The only other ticket that really catches my eye is this atlas one:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105308

It concerns multicore atlas jobs failing at RAL. Alastair did a good bit of sleuthing and looks to have tracked the problem down to an issue with multiple multicore jobs running on the same node - which is worrying. Watch this space (where space == ticket)!

Monday 12th May 2014, 14.30 BST

A mere 27 open tickets for the UK today.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC cvmfs ticket. Only Durham is left (I actually missed out Durham the last few times I looked at this ticket). So it's all on you Durham chaps now. No pressure (except, there is a little bit). In Progress (7/5)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's EMI3 upgrade ticket. Matt's fighting the good fight, and hopes to have it all sorted soon. Let us know if you need a hand Matt! In Progress (8/5)

RALPP
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105290 (9/5)
The ROD has spotted Glue2 Validation errors on the RALPP bdii. Chris B spotted the ticket, but no news. In progress (9/5)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102205 (14/3)
Bristol's EMI3 ticket. Winnie has beaten the site-BDII into EMI3 shape and is visiting the same fate on their cream CEs and WNs, with one CE already converted and two more about to fall. Make sure you get the WNs too! On Hold (should really be In Progress) (12/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105189 (6/5)
LHCB jobs having some trouble at Bristol, Winnie thinks it's some dodgy nodes at fault and is working on it. Waiting to see if failures continue. Waiting for Reply (7/5)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
Publishing Max CPU time for LHCB. I believe that we've left it with LHCB asking that it be set to "a value that is obviously made up but isn't the default value" (although I could have the wrong end of the mace here). Been on hold for a while, so we probably want to make some kind of ruling. On Hold (8/4)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
glexec ticket. No news here - sorry. On Hold (27/1)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102201 (14/3)
The ECDF EMI3 upgrade ticket. Had some problems with a lingering ghost of their previous site-BDII, but hopefully time has exorcised that gremlin and the new EMI3 CE will be seen too, which just leaves one straggler to be dealt with. On Hold (should probably be In Progress) (9/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105267 (8/5)
The other ECDF EMI3 upgrade ticket. Actually this only got submitted by Daniela to satisfy the dashboard demons, probably as you can't physically lift the ROD dashboard to throw it out of the window and shut it up that way. On Hold (12/5)

DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103722 (14/4)
Durham's EMI3 upgrade ticket. Daniela has extended the ticket to the zeroth hour. Let us know if you chaps get stuck on anything, but it looks like you have the upper hand. In Progress (2/5)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105090 (2/5)
Sheffield had some CE nagios failures, but it looks like that storm has passed, with nothing but green as far as the eye can see on the nagios pages. Elena asks if she can close the ticket (i.e. has the alarm disappeared from the dashboard?). Waiting for reply (12/5)

LIVERPOOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105299 (9/5)
Liverpool also have received a ROD ticket, this time of the Glue2 validation variety. Steve has set it in progress. In Progress (9/5)

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
Lancaster's glexec ticket. No news I'm afraid. On Hold (4/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Lancaster's PerfSonar sucking. Duncan has suggested a reinstall, and noticed spikes of goodness. A reinstall has been put on the todo list. On Hold (12/5)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102193 (14/3)
UCL's EMI3 upgrade ticket. Quiet, but Ben had scheduled the date for the upgrade as the 13th. Hopefully we'll hear positive news from him shortly. On Hold (30/4).

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar host carking it. At last word Ben had brought it back from the great beyond and hoped to have a reinstall done on the 30/4. No word since though. On Hold (28/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)
UCL's glexec ticket. Ben mentions a new chap being deputised, and that this will likely have to wait until then. On Hold (16/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=104824 (22/4)
Nagios ticket due to low site availability, caused by a period of outdated CA RPMs. Just waiting for the numbers to pick up again. In progress (6/5)

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103028 (6/4)
A much talked about (and rightly so) atlas ticket, about job failures at QM essentially due to atlas jobs not requesting the right amount of RAM. There's a question from atlas asking "if all the questions have been answered". Have they? In Progress (8/5)

BRUNEL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105324 (12/5)
Brunel are having some bother with their APEL publishing, it looks like there's a lot of missing data. In progress (12/5)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105161 (5/5)
Hone noticed their jobs sitting in the ready status for a long time when submitted through the RAL WMSes. Catalin has been engaging with Alexander to debug the issue. Waiting for reply (12/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105100 (2/5)
CMS have embarked on their next Storage Consistency Check. Andrew closed the ticket after providing the desired information, but CMS have reopened (wanting to keep the ticket to track the SCC). Reopened (needs to be put In Progress or On Hold) (12/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
cvmfs for Sno+. Things have picked up pace on this ticket, with Matt M ready to kick off uploading the Sno+ tarball. Catalin has tweaked the web access to allow him to do so. Waiting for reply (12/5)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105308 (11/5)
Atlas MCORE jobs failing with "Failed to open shared memory object: Permission denied". RAL team are looking at it. In progress (12/5)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/2013)
Longstanding LHCB authentication problem at JET. The Jet admins have exhausted all their ideas, and have asked for any help. As the problem survived the upgrades to SL6 and EMI3 it's probably something specific with their setup. On Hold (25/4)

Tuesday 6th May

IOU one full ticket review - Matt.

EMI3 upgrade: Down to four EMI upgrade tickets - well done everyone. ECDF, Sussex, Bristol and UCL left to go. Things look good on the UCL and Edinburgh fronts. Bristol and Sussex aren't progressing as well, but are adamant that they'll make the deadline. The ECDF ticket triggered a ticket to the ROD, but that's being sorted.

CVMFS for ILC: How are things going at Oxford and Glasgow for rolling out cvmfs for ilc (https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502)? IIRC Glasgow were just waiting for the changes to gently percolate, Oxford were waiting for Kashif "The Puppet Master" to return to work his magic. That just leaves Bristol unaccounted for.

QMUL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=103028
This atlas ticket, essentially regarding atlas jobs using more resources than they said they would (and thus being killed), has seen a lot of discussion at the Thursday atlas meeting. I thought I'd mention it in case anyone outside that meeting wants to weigh in.

Afraid that's all folks!

Monday 28th April 2014, 16.30 BST
I'm afraid the ticket roundup is incredibly light and not in the usual (or any) format.

EMI upgrade tickets: ECDF, Bristol, RHUL, Durham, EFDA-JET, Glasgow, Sussex, UCL and RALPP all have open EMI upgrade tickets. Can everyone with an open ticket please update it this week (preferably by the first) if they haven't done so in the last 7 days (or if you have but have made progress since then). It's a lot easier for the Person on Duty to extend tickets when there are site updates to validate their actions.

(RALPP have submitted https://ggus.eu/index.php?mode=ticket_info&ticket_id=104839 in response to an argus problem they were seeing post upgrade).

UCL have another Nagios error ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=104824

Interesting One: https://ggus.eu/index.php?mode=ticket_info&ticket_id=104937 Manchester received a ticket from Steve Traylen regarding a lot of connections to the CVMFS stratum 1. Andrew confirms these are VAC machines (unless I've misread something). It looks like the local squid cache was being ignored; Andrew is on the case.
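
If anyone wants to make sure their own VMs or WNs aren't doing the same, the usual first thing to look at is the CVMFS_HTTP_PROXY setting in the cvmfs client config. A minimal sketch is below, assuming the standard client layout with local overrides in /etc/cvmfs/default.local:

 #!/usr/bin/env python
 # Minimal sketch: warn if a node's cvmfs client is set to bypass the local
 # squid (CVMFS_HTTP_PROXY=DIRECT) and so hit the stratum 1s directly.
 # /etc/cvmfs/default.local is the usual place for local overrides.
 CONFIG = "/etc/cvmfs/default.local"
 
 proxy = None
 try:
     with open(CONFIG) as f:
         for line in f:
             line = line.strip()
             if line.startswith("CVMFS_HTTP_PROXY="):
                 proxy = line.split("=", 1)[1].strip('"')
 except IOError:
     pass
 
 if proxy is None:
     print("No CVMFS_HTTP_PROXY set in %s - clients may go direct" % CONFIG)
 elif "DIRECT" in proxy and ";" not in proxy:
     print("CVMFS_HTTP_PROXY is DIRECT only - stratum 1s will be hit directly")
 else:
     print("CVMFS_HTTP_PROXY = %s" % proxy)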

Afraid that's it from me. Next week's will be better (because it's the first Monday of the month... that came around quickly!).

Monday 14th April 2014, 15.30 BST
No ticket update from Matt next week.

33 Open UK tickets today.

NGI (No Geezers In-particular in this case)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
ILC cvmfs ticket. No change since last week really; after tomorrow's meeting I'll On Hold this ticket until I'm back next week. In progress (3/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103043 (7/4)
Tom's ticket requesting cern@school access to the IC Dirac server. It's all done, the ticket just needs closing (and whilst I'm happy to stick my nose into tickets I won't close or reopen them). Assigned(!) (7/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103197 (9/4)
Chris W has spotted several instances where the old myproxy server shows up in the online documentation. Andrew has tried to edit https://www.gridpp.ac.uk/deployment/users/myproxy.html but can't get access - Daniela suggested asking the hosting site but maybe Tom has access? Waiting for Reply (9/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/2013)
The Sno+ CVMFS ticket. Could some of the progress mentioned last week please be put into the ticket? In progress (26/3)

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103028 (6/4)
Chris ran these atlas job failures down and discovered they were due to the jobs going over their memory quotas. What I didn't like the look of was that it was the jobs themselves requesting these amounts of memory. Atlas say it can be solved, but it's something to watch out for. In progress (11/4)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
As mentioned last week, LHCB have got back to Glasgow deciding that MaxCPUTime needs to be set to something, Sam respectfully maintains his stance. Steve B links an interesting ticket to the cream devs: https://ggus.eu/index.php?mode=ticket_info&ticket_id=97721 On Hold (8/4)

"EMI UPGRADE" tickets.

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102611
Kashif points out that the NGI argus isn't in the site bdii, which is the probable cause of the test failures. The other two problem servers are due to be decommissioned, so all good here. In progress (14/4)

DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103722 (14/4)
A very fresh alarm ticket for Durham's CE and SE. Sorry you guys have to do this dance again! Assigned (14/4)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102201 (14/3)
Andy notes that the links to the alarms given in the ticket appear to be broken. How goes the upgrade in general? On Hold (7/4)

RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102189 (14/3)
I think RHUL just has some CEs to upgrade, have you done the site BDII? The list of services that need to be upgraded isn't exhaustive. On hold (21/3)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
You guys put in a good plan, did it survive contact with the enemy? In progress (1/4)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102202 (14/3)
The Glasgow list of services to upgrade was long, but that's just a reflection of how much stuff they run. Gareth gave a good update last week, so there's naught to worry about here (hopefully I didn't just curse you...). In Progress (8/4)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102205 (14/3)
Winnie sounded confident that the upgrade will be done by the end of April (and we aren't halfway through the month yet). In progress (4/4)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102193 (14/3)
Ben set a reminder date for the 31st of March, no news since then. On hold (14/3)

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102166 (14/3)
It's just the Jet DPM that looks like it needs upgrading. If they've kept it up to date then this upgrade is trivial. Hope to be done by the end of April. On hold (24/3)

Monday 7th April 2014, 13.30 BST

32 Open UK tickets this week.

No site in particular.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC cvmfs rollout ticket. Glasgow, Oxford, Durham and Bristol were missing at last head count - although Glasgow are mid-rollout and should be fully deployed any day now (if not already). I think Oxford are in a similar boat? As JK points out, we've got to the point where we probably need to on hold the ticket whilst I harass the last few stragglers. In progress (3/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=103043 (7/4)
Squire Whyntie has asked for cern@school registration on the Imperial Dirac. Janusz has done so and Tom confirmed it works and can be solved. If only all things were solved so quickly! Assigned (7/4)

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
The new Sussex EMI2 upgrade ticket. Matt RB copied the Sussex plan over from the original ticket. Daniela cleared up the mystery of what happened to the original ticket (dashboard shenanigans) and posted some useful instructions for the BDII upgrade. In progress (1/4)

RALPP
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102990 (3/4)
Duncan's unending perfsonar vigilance discovered a problem with the RALPP latency box. Ian reports firewall problems that have been solved, so it looks like this one can be closed (if all is well). In progress (can be closed) (4/4) Not quite out of the woods yet after all - Ian spotted and fixed a few more problems, and Duncan has spotted something else straight away.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102953 (24/3)
CMS glidein hammercloud jobs not running at the site (specifically their defunct cream CEs)- Chris points out another ticket (https://ggus.eu/index.php?mode=ticket_info&ticket_id=102915) essentially detailing the same problem (just for different job types). Probably worth on holding this one whilst waiting on the other, as it looks like the problems are CMS side. In progress (2/4)

OXFORD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103027 (5/4)
LHCB pilots aborting, Kashif asks if the problem persists, the ticket fairy set the ticket to Waiting for Reply (5/4) SOLVED

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102469 (19/3)
cvmfs for t2k. I think this has fallen through some cracks, no word for a while. In progress (21/3)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102205 (14/3)
Bristol's EMI2 upgrade ticket. Not much news, although there was a positive update from Winnie that looks like the April deadline will be made. In progress (4/4)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102914 (1/4)
An atlas ticket, detailing some odd transfer behaviour for some files, likely attributed to some off tcp window settings on a disk server. There was a similar looking (although possibly not identical) problem at RHUL (https://ggus.eu/index.php?mode=ticket_info&ticket_id=102311). Some interesting stuff. In progress (4/4) Sam updated the ticket, with no more "sub-optimally tuned" disk pools. I think it should be set to waiting for reply though

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102202 (14/3)
Not as interesting, Glasgow's EMI upgrade ticket. Chugging along; last word was from David a little while back about watching some atlas canary jobs running on the EMI3 worker nodes. How did these pan out? In progress (27/3) Gareth updates that progress is slow but steady, draining nodes is taking a while.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)
LHCB asked Glasgow to publish their max CPU time. Not wanting to be made liars of, Sam pointed out why they didn't (shouldn't) do this. This seems to have sent LHCB back to the drawing board, so the ticket is on hold. On Hold (12/3)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102201 (14/3)
The ECDF EMI upgrade ticket. Not much to report here, although the apel box threw a wobbly as well; Andy's on it. In progress (2/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/13)
glexec ticket. Word on that later. On Hold (27/1)

DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102199 (14/3)
Another EMI upgrade deadline ticket. A plan is in place and the work is underway. On Hold (24/3)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)
Sheffield's perfsonar having trouble. Elena upgraded and got the Sheffield IT guys to open port 8086 - it looks like she's nailed the problem and has asked for confirmation. Waiting for reply (7/4) (And before I even finished the review, the ticket was solved).

LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/13)
GLEXEC ticket. The tarball glexec isn't going well (no thanks to EMI3 taking up the last 6 weeks of tarball time). I might have to admit defeat (but will ask the devs for help before I do). On hold (4/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1/13)
Lancaster's Poor Perfsonar Performance (I said I wouldn't use that alliteration again, I lied). Using "normal" iperf to probe the boxes I see no 1Gb bottlenecks in my network - could the problem be software? On hold (7/4)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar also having difficulty, although their difficulty is caused by the hardware going kaput on them. Ben is chasing up Dell for new bits. In progress (3/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102193 (14/3)
EMI upgrade ticket. Ben put in a brief plan, but the reminder date has passed. How goes it? The bdii and DPM are fairly straightforward to upgrade. On hold (14/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)
Glexec ticket. No news for a while, is this work to be rolled into the EMI3 upgrade? On hold (27/1)

RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102189 (14/3)
RHUL's EMI upgrade ticket. Not much news here. On hold (21/3)

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=103028 (6/4)
Atlas seeing production jobs failing due to pilot errors. Chris asked if production job options have changed recently. The ticket fairy struck again, setting the ticket to Waiting for reply (although he's less sure if that was the intention of Chris' reply). In progress (7/4) Atlas replied saying that they don't think there have been any job changes. A full prod disk is making things even cloudier, but Dan has asked for clarification on what an error message actually means - "!!FAILED!!1999!! Job killed by signal 24: Signal handler has set job result to FAILED, ec = 1204"

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101639 (26/2)
RFC3820 proxy problems at QM (and elsewhere). JK has asked the submitter for his ticket intentions. Set to Waiting for reply by our friend, the ticket fairy. (1/4)

(Please remember to set your tickets to Waiting for Reply after asking a question to the submitter. Don't make me spend yet another Monday afternoon referring to myself as the ticket fairy.)

IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102888 (1/4)
Biomed asked for access to their cvmfs repo to be rolled out at IC. Daniela has said fine but asked that they completely migrate to it within 3 months (nfs or cvmfs). Daniela has completed the rollout and asked biomed to test. Waiting for reply (7/4) Biomed have got back saying that they've launched some test jobs, but expect it might take a while for them to run. I think they also were kinda asking if Imperial would give them some leeway on moving wholly to cvmfs.

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102166 (14/3)
The JET EMI upgrade ticket. There was a hope to upgrade before the end of April. On hold (24/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
SSL type errors for LHCB at JET. No progress on this for a while, the problem somehow survived the move to SL6/EMI3. On hold (11/2)

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102611 (24/3)
The Tier 1 EMI upgrade ticket. There seem to be some false positives on the list, which could do with clarification as to which these are (especially given the dashboard noise on the ticket). In progress (27/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
CVMFS for SNO+. Matt reported that the collaboration has given permission to have their software on cvmfs, and hoped to have tarballs ready for last week. Has there been any progress offline? In progress (26/3) Update - Squire Whyntie informed me that this is being actively worked on offline, with Tom kindly providing assistance.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)
ARC CEs publishing the wrong DefaultSE. Andrew has hacking at this on his todo list, but has bumped the issue down the list (which is fine, as it's low priority). In progress (1/4)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=99556 (6/12/13)
The NGI argus ticket. I'm pretty sure that this can be closed, as argusngi.gridpp.rl.ac.uk is set up and has been tested by several sites - so all looks well here. On Hold (21/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101968 (11/3)
Atlas deletion errors at the Tier 1. The problem is known, but not well understood, and sadly persists (last set of errors reported on the 4th). Alastair has put in a good explanation of the symptoms. On hold (4/4)

Monday 31st March 2014, 15.00 BST
34 Open UK tickets this week.

TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101968 (11/3)
Atlas deletion errors at the Tier 1. Alastair posted a good explanation of the problem and some mitigation details, but atlas would like an update. On hold (12/3) Update - Problem persists, reminder set for 7/4

https://ggus.eu/index.php?mode=ticket_info&ticket_id=102611 (24/3)
The Tier 1's EMI upgrade ticket. Some false positives on this list, Kashif asks if the NGI argus is also a false alarm? In progress (28/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)
Tweaking the ARCCE DefaultSE publishing. As a bit of bookkeeping can the priority be tweaked to less urgent (seeing as the issue isn't causing great woe). On hold (17/3)

As an aside tickets often are submitted using the default priority of "urgent" and category of "Incident" - if you catch these in your tickets then you should feel free to change them.

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102810 (28/3)
Sussex's original EMI upgrade ticket (102212) was closed "automatically" ("broken ticket - close by Operations Portal"), leaving this one in its stead. I'm not sure if the information Matt RB carefully posted in the previous ticket needs to be cut and pasted over to here. All seems a bit weird. In progress (28/3)

OXFORD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102469 (19/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102544 (21/3) Solved
A couple of Oxford tickets look a bit neglected (one about cvmfs for T2K, t'other an lhcb/torque problem). I suspect these got overlooked with the excitement of Pitlochry last week. In progress (21/3)

(Also there's ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=102740, which could be seen as either an annoyingly finicky request or the epitome of a low hanging fruit, for when you *really* need a win that day!). Also Solved.

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102489 (20/3)
Similarly at Sheffield, maybe this biomed "invalid publishing" ticket got forgotten about on the trip to sunny Scotland. In progress (20/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)
The Sheffield perfsonar ticket. Things just needed finishing off by the looks of it - let us know if any advice is needed. On hold (11/3)

BIRMINGHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102404 (18/3)
Birmingham's perfsonar "being weird" (ignoring Bristol), although Matt fixed it. Just doing the post-game roundup to figure out what magic actually fixed things, but could do with an update in the ticket. In progress (20/3) Update-Solved

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101639 (26/2)
RFC3820 proxy problems. The problem is spread wider than QM, and likely needs a middleware patch or three to solve. Dan and Chris have asked for a master ticket to be created (failing that, some more information would be nice). Nothing forthcoming from the submitter yet. I think this ticket has mutated to include the issues from RAL as well as QM. A bit of a mess. In progress (18/3)


Monday 17th March 2014, 14.00 GMT
47 Open UK tickets this week; a dozen of them are EMI2 retirement tickets, so they'll get the lion's share of our attention.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)
The ILC software area move ticket. IC, RAL, QMUL, Cambridge and Liverpool have moved. Lancaster moved but is (hopefully was) broken for ILC (pardon my noise). Assuming that anyone not mentioned on the ticket hasn't migrated the ILC SW_DIR yet, that leaves the following list of UK sites to migrate their ILC software area:
OXFORD
GLASGOW
BRISTOL
RALPP
BIRMINGHAM
BRUNEL - Moved but hadn't updated the ticket
RHUL
DURHAM
MANCHESTER
(I might have missed some of you out, this list is from lcg-infosites and grepped using my admittedly poor eyeballs. A braver man would have crafted his own ldapsearch to glean this info).
If you don't want to make the change then removing support for ILC is a viable course of action, if you have made the change please update the ticket to let ILC know. In progress (17/3)
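
For the brave, here's a rough sketch of both approaches - the top BDII hostname below is only an example, point it at whichever one you normally use:

 # GLUE 1.3 query: which CEs advertise support for the ilc VO?
 ldapsearch -x -LLL -H ldap://lcg-bdii.gridpp.ac.uk:2170 -b o=grid \
   '(GlueCEAccessControlBaseRule=VO:ilc)' GlueCEUniqueID
 # Or the lazier route I actually took:
 lcg-infosites --vo ilc ce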

EMI ALARMS
Remember that you need to have *at least* your upgrade plans in these tickets within a fortnight of the ticket's submission - so by the 28th of March.

SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102212 (14/3)
BDII is the culprit here, ticket acknowledged but no other news (or plan). In Progress (17/3)

RALPP
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102207 (14/3)
ARC CE, a few CREAMs, site BDII and WNs - Chris reports that this is probably a false alarm, but is looking into it (in case the publishing is off). Chris has included a plan for the other components (if I'm reading right, is RALPP ditching all CREAMs?). In progress (14/3)

BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102205 (14/3)
site BDII (seems to be a common one), some CEs and the WNs. Winnie has posted an assurance that the upgrade will be done in time, but I'm not sure if that'll count as a plan to the powers that be. In progress (17/3)

GLASGOW
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102202 (14/3)
Lots of services, but Dave has given a detailed upgrade battleplan. In progress (17/3)

EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102201 (14/3)
I think site-BDII and WNs. Wahid has given assurances, but (sorry to be a pedantic patsy) I'm not sure if that'll count as an "upgrade plan". In other news the testing for the SL6 WN tarball is going well. In progress (14/3)

DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102199 (14/3)
The Durham DPM, CE and site BDII are on the list. Assigned (14/3)

SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102197 (14/3)
Some CEs and the APEL box (that's a guess). Elena has given a good plan, there's some hassle as their APEL and BDII box are shared. In progress (14/3)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102193 (14/3)
DPM, BDII, CE and WNs. Ben has said he will upgrade in the next few weeks. On hold (14/3)

RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102189 (14/3)
A CREAM and the site BDII. Govind is drawing up his upgrade plans. In progress (14/3)

IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102185 (14/3)
Just some WNs. Engaged in testing and plan to upgrade the last cluster this week. In progress (14/3)

BRUNEL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102184 (14/3)
Just the DPM I think. Henry remarks that they're just about to embark on a physical server move and doesn't want to change anything significant before the move. I gave my recipe for the EMI3 dpm move in case it helps. In progress (14/3)
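
For anyone else in the same boat, the rough shape of that recipe went something like this - a sketch only, as the exact repo RPM and site-info paths are site-specific:

 # DPM head node: pull in the EMI-3 repos, update, then reconfigure with YAIM
 yum install <emi-3-release-rpm>    # repo RPM name/URL is site-specific
 yum update
 /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n se_dpm_mysql
 # Disk servers: same idea, different YAIM node type
 yum update
 /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n se_dpm_disk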

EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102166 (14/3)
Again just the DPM. Acknowledged, but no plan. In progress (17/3)

I'm actually pretty sure Lancaster should have got a ticket as we still have one cluster on the EMI2 tarball. I'm not going to complain though.


NORMAL TICKETS

QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101639 (26/2)
This ticket about jobs using RFC3820 style proxies not working at QM is in an odd state. The user seemed to be confused as to what feedback he should give. In progress (17/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101916 (8/3)
Sorry to be picking on QM, but this 444444 publishing jobs ticket is looking neglected. I suspect you've been frying bigger fish, but can you please show it (or even better, the underlying issue!) some love. Is this linked to your other information publishing problems? In progress (10/7)

UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL Perfsonar ticket. After having his perfsonar box whacked by a power outage Ben is reinstalling, but is seeing some odd hardware issues. Has anyone else seen their R610s PERCs play up like this (only showing 3G partitions)? In progress (12/3)

MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=102394 (18/3)
Just in, but similar to the ILC ticket - Catalin has asked Manchester to deploy cvmfs for t2k. In progress (18/3)

As always please pipe up if I've missed anything or if there's any other ticket related issues you want to bring up.


Monday 10th March, 13.00 GMT</br> Only 28 Open UK tickets this week.

NGI
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)</br> ILC moving to cvmfs for their software area. As Jeremy mentioned, after tomorrow we're going to start chasing sites that support ILC but haven't rolled out these changes. 4 sites have implemented the move and passed muster. A tip from me is to remember to update the software area entry in your CE's info system for ILC as well as on the nodes. In progress (10/3)
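
In YAIM terms that boils down to pointing the ilc vo.d file at the cvmfs path and reconfiguring - a minimal sketch, assuming the usual vo.d location and that the ILC repo is the DESY-hosted one (do check against the VO card):

 # siteinfo/vo.d/ilc (location varies by site)
 SW_DIR=/cvmfs/ilc.desy.de
 # then rerun YAIM on the CEs (and WNs) so the new path is picked up, e.g. on a CREAM CE:
 /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n creamCE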

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101820 (5/3)</br> This goc db ticket ended up assigned to the UK. I've punted it in the direction of the GOC DB support unit. Assigned (10/3)

EDINBURGH</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=100569 (28/1)</br> Wahid has got stuck trying to reinstall his perfsonar box; if I'm reading it right the reinstall from the netimage isn't "taking". Has anyone seen this before or have any tips? Waiting for reply (10/3)

GLASGOW</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)</br> LHCB wanting MaxCPUTime to be published. Sam has eloquently explained why he doesn't want to set this; I fear that some kind of impasse has been reached, and I'm not sure where to go on this issue. In progress (4/3)

PERFSONAR</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101136 (RALPP)</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (SHEFFIELD)</br> Any news on upgrading the perfsonar instances at RALPP or SHEFFIELD? Reminder dates on these tickets have passed by a week now.


That's all my addled brain can process I'm afraid, can sites please check the link below (oh, and yippie for GGUS search bringing back ordering by site again):</br> http://tinyurl.com/p37ey64

Monday 3rd March 2014, 14.30 GMT</br> 44 Open UK NGI tickets this week.

NGI</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502 (24/2)</br> ILC moving to cvmfs, so those of us seeking to continue support will need to enable it. IC and Cambridge have already moved and been confirmed working. It might be easier if we collate any other sites who have moved into a single list to give to ILC. The working plan is to open tickets against sites who haven't moved after giving them a suitable grace period. In progress (26/2)

TIER 1</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=99556 (6/12/13)</br> The NGI Argus ticket. There's been great progress on this, can we reflect some of this in the ticket? Or perhaps close it if we're satisfied. In progress (13/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101491 (23/2)</br> The RAL perfsonar latency box is being troublesome. It crashed and was brought back up again, but has crashed again so Duncan has reopened the ticket. Reopened (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101716 (28/2)</br> This cms transfer ticket has INFN as the "notified site", surely it should be RAL-LCG2 instead? I didn't change it myself in case I missed some nuance. Transfer problems appear to be linked to the virtualisation problems RAL have been experiencing affecting FTS3. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101729 (1/3)</br> LHCB pilots failing on a RAL CE. Being looked into. In Progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101701 (28/2)</br> ILC having troubles with the RAL ARC CEs. Looks to be a user group for ilc (production) missing. In progress (28/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101052 (6/2)</br> Biomed having trouble retrieving results from RAL cream CEs. Tracked down to the RAL EMI2 argus not handling Rfc proxies. An update to EMI3 is hoped to fix this, although Dan reports that this isn't the case at QM (see 101639). In progress (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101532 (25/2)</br> LHCB noting that RAL is publishing the default MaxCPUtime. Fixed, but Orlin notes some caching behaviour. Maria AP chimed in that you might have a buggy bdii version in the chain. In progress (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100114 (8/1)</br> Chris W's ticket concerning jobs failing to get from RAL to Imperial. Catalin asked for some testing, but Chris has been busy. The ticket hit its second reminder though. Waiting for reply (11/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=97025 (3/9/13)</br> Longstanding myproxy issue. Andrew reports that the new myproxy service is up and running, so I assume this ticket can be closed soon? Or at least put back in progress. On hold (25/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101079 (9/2)</br> ARC CEs having a default SE of 0 and not being able to tune this per VO. Andrew is figuring out a fix to this. In progress (25/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)</br> cvmfs for Sno+. Ticket on hold whilst tarballs are created. Been that way for a while. On hold (29/1).

EDINBURGH</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=100569 (28/1)</br> ECDF's perfsonar box refusing MA connections. Wahid has rebooted the box but no joy, Duncan linked some instructions as requested. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=99794 (16/12/13)</br> Access to the ECDF perfsonar pages. There's a big ACL overhaul going on at the moment, Andy apologises and will chase the central IT chaps about it. On hold (28/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101659 (27/2)</br> 44444 jobs publishing on some ECDF CEs (as part of information system cleanup campaign). These CEs are due for retirement (replicant style) today, so this and the related tickets will be done with soon. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100840 (29/1)</br> Apel-Pub nagios test failures at ECDF. The guys are working on it, but sadly the ticket is escalating. Daniela posted a note that if you have a support ticket with APEL open (which I think is advisable) to link that into this ticket. In progress (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/13)</br> glexec deployment ticket. The ECDF lads are waiting on the tarball (i.e. me). Still. On hold (27/1)

RALPP</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101726 (1/3)</br> LHCB ticket about the default CPU time (999999) being published at RALPP. I thought that RALPP had solved something like this recently, but maybe I dreamt it? Assigned (1/3) Update - Solved, something was being published that shouldn't be any more.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101727 (1/3)</br> Info system cleanup campaign, 4444444 job at RALPP. Assigned (1/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101398 (19/2)</br> LHCB would like xrootd holes poked in the RALPP firewall. As mentioned last week I believe this requires holes poked in the RAL firewall, which is undergoing an overhaul. This ticket could do with some attention mentioning these problems, and possible on holding. In progress (19/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101136 (11/2)</br> Request to upgrade the RALPP perfsonar to the latest version. Due to a lack of hands on deck Chris postponed this work, with a reminder date of today. On hold (21/2)

IMPERIAL</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101367 (18/2)</br> A cms user having trouble srmcping in his jobs at IC. Looks to be a java 1.7 mismatch problem. Simon has asked some questions, no answer yet (user has set notify to "on solution" so might not have got the update). Waiting for reply (24/2)

DURHAM</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101752 (3/3)</br> LHCB jobs having problems at Durham. Ewan S. has asked if the problems persist. Waiting for reply (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101763 (3/3)</br> Part of the campaign to clean up the information system, Durham have been asked to update their BDIIs (site and resource) to not-buggy versions. Assigned (3/3)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101177 (12/2)</br> Durham trying to wash the biomed out of their SE's information system. No joy yet. I advise asking at the storage meeting if stuck. In progress (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=99621 (10/12/13)</br> enmr noticed a bad WN, which was promptly quarantined. It hasn't been fixed, but I maintain that the problem itself is contained and solved if you want to close the ticket... On hold (28/1)

GLASGOW</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101710 (28/2)</br> Nagios SRM-Put test failures. The problem is known (it's DPM being odd with its space reporting whilst a pool is readonly - Sam describes it better). In progress (28/2)
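
If you want to see what DPM itself believes about the read-only pool while the tests moan, the quickest look is from the head node (this is just the standard query tool, nothing Glasgow-specific):

 # List pools with the capacity/free space the DPM is currently reporting
 dpm-qryconf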

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101565 (26/2)</br> LHCB sees that Glasgow is also publishing default max CPU time for some (all? one?) of their queues. Sam points out that this is on purpose (due in part to multicore jobs, jobs are limited by Wall time only), and asks if LHCB can't make educated guesses. Stefen replies with a point about the difference in "MaxCPUTime" and "MaxTotalCPUTime", but I'm not sure that covers the Glasgow concerns. Worth discussing to get a UK stance on this. In progress (3/3)
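
For anyone wanting to check what their queues actually publish before LHCB does it for them, a quick BDII query shows the GLUE values in question - the hostname is a placeholder for your own site BDII:

 # GLUE 1.3: max CPU and wallclock times (minutes) published per CE queue
 ldapsearch -x -LLL -H ldap://<your-site-bdii>:2170 -b o=grid \
   '(objectClass=GlueCE)' GlueCEUniqueID GlueCEPolicyMaxCPUTime GlueCEPolicyMaxWallClockTime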

BRUNEL</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=100568 (28/1)</br> Perfsonar MA problem. Raul has been working steadily at this and it looks to be progressing nicely. In progress (28/2)

QMUL</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101676 (27/2)</br> One of QM's perfsonar boxes is having problems, missing services. Likely to be caused by running a bleeding edge version of perfsonar. In progress (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101682 (27/2)</br> Brian has asked for a SE dump of QM atlas files. Assigned (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101557 (25/2)</br> Matt from SNO+ having trouble on a QM UI, delegating proxies to the FTS. The same works on lxplus though. This ticket needs a home, but there's an argument that it isn't a site problem (as a UI isn't necessarily part of a site). Assigned (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=94746 (10/6/13)</br> Biomed haunting the QM SE's info system. I believe Chris is waiting on his changes to seep into the StoRM release (100290). On hold (14/1)

BRISTOL</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101669 (27/2)</br> lhcb ticketed Bristol, but the CE in question is in scheduled downtime. Possibly worth keeping this open whilst downtime is on to avoid a duplicate. In progress (27/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101516 (24/2)</br> Bristol's perfsonar ticket. Bristol upgraded which seems to have solved some of their problems, but their other server is having trouble now. Maybe the same again will fix it? In progress (25/2)

UCL</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)</br> glexec at UCL. No news for a while from Ben. Daniela reminds him that the EMI3 upgrade is also imminent. On hold (26/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)</br> A perfsonar ticket for UCL. A power outage looks to have brutalised their box. No word yet on if Ben has been able to save it. On hold (22/2)

SHEFFIELD</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=101374 (19/2)</br> Sheffield's LHCB maxcputime ticket. Elena has set in progress but no news. In progress (25/2)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100037 (3/1)</br> A perfsonar ticket for Sheffield, whose perfsonar needs updating. No news for a while. On hold (3/2)

LANCASTER</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/13)</br> Lancaster's glexec ticket. Whilst there's been some progress in the glexec tarball (not as much as there should be, as tarball time keeps being redirected, particularly with EMI3), no movement on the ticket. On hold (31/1)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)</br> Lancaster suffering Poo Perfsonar Performance (I couldn't resist the childish alliteration). It doesn't seem to be an artificial carp (the rate has peeped over the 1Gb/s mark now and again. Looking for bottlenecks, but not had anytime to investigate. On hold (17/2)

EFDA-JET</br> https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)</br> LHCB jobs failing at JET due to openssl problems. No progress for a while, after the JET guys exhausted everything. On hold (11/2)

Monday 24th February 2014, 15.00 GMT</br>

36 Open UK tickets this week, but the majority are progressing nicely (only a third of them haven't had an update in the last week, and of these all of them are "On Hold").

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=101502 (24/2)</br> ILC have ticketed the UK to inform us of their move to using cvmfs for their software area. They've included extensive instructions (and updated their VO card). The best forum to ask questions of the VO seems to be this ticket. In progress (24/2)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/13)</br> NGI Argus ticket. As seen on TB-Support, good progress here but the ticket could do with some love. In progress (13/2)

https://ggus.eu/ws/ticket_info.php?ticket=101015 (5/2)</br> This CMS phedex problem looks like it can be bounced to Minnesota. I advise being proactive with the bouncing - either reassign it yourselves or solve it with a big "not a problem in our power to fix". In progress (24/2)

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=101398 (19/2)</br> LHCB want holes poked in the RAL firewall to allow direct xrootd access to the RALPP SE - more a heads up for everyone than a ticket nag. In progress (19/2)

EDINBURGH</br> https://ggus.eu/ws/ticket_info.php?ticket=100840 (29/1)</br> Daniela has given some tips on how to tackle this APEL nagios ticket. In progress (20/2)

PERFSONAR TICKETS:</br> A quick round up of these as there are a lot of them.

Lancaster: https://ggus.eu/ws/ticket_info.php?ticket=100566</br> RHUL: https://ggus.eu/ws/ticket_info.php?ticket=101135</br> ECDF: https://ggus.eu/ws/ticket_info.php?ticket=100569</br> RALPP: https://ggus.eu/ws/ticket_info.php?ticket=101136</br> Brunel: https://ggus.eu/ws/ticket_info.php?ticket=100568</br> UCL: https://ggus.eu/ws/ticket_info.php?ticket=101285</br> Sussex: https://ggus.eu/ws/ticket_info.php?ticket=101517</br> Durham: https://ggus.eu/ws/ticket_info.php?ticket=100968</br> Bristol: https://ggus.eu/ws/ticket_info.php?ticket=101516</br>

There's a lot of them, but none are looking very neglected (yet). The one with the biggest risk of neglect is actually the Lancaster ticket! Others are soldiering on or have firm reminder dates set for their upgrade.

Tickets from the UK:</br> I had my dreams of easily searching for tickets submitted by UKers smashed: https://ggus.eu/ws/ticket_info.php?ticket=101362 So it looks like it's back to my old method of searching for "Walker", "Bauer" or "Jones" :-D

Monday 17th February 2014, 14.30 GMT</br> 35 Open UK tickets this week - the number is creeping up, I think largely due to the build up of perfsonar tickets. I plan to look at these in detail next week (or maybe bring them up in the Storage meeting if that's a more appropriate forum?).

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/2013)</br> The NGI Argus ticket. Ewan has helped out with some successful testing, there's a general call for others to get involved if they fancy it. In progress (13/2)

https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)</br> Jobs failing on the RAL WMS, due to the gridsite/openssl/proxy size debacle. Chris successfully tested lcgwms06 after it was updated. Now lcgwms04 and 05 have been updated and Chris has once again been asked to work his testing magic (my apologies if this is already on your to do list Chris). Waiting for reply (11/2)

https://ggus.eu/ws/ticket_info.php?ticket=101052 (6/2)</br> Biomed having trouble with one of the RAL CEs. What really caught my eye here was that Biomed are using JSaga for their job submission - do we have any other user groups using this? (This also leads me to once again question what I find interesting!). No problems with how the ticket itself is being handled. In Progress (14/2)

https://ggus.eu/ws/ticket_info.php?ticket=101015 (5/2)</br> This CMS transfer problem (between Minnesota and RAL) ticket is looking a bit ropey. Last word on Friday was that the transfers were still failing. Of course, there are two sides to every transfer failure. In progress (14/2)

https://ggus.eu/ws/ticket_info.php?ticket=101079 (9/2)</br> I don't mean to pick on the Tier 1, but you keep getting thrown the interesting problems. Another "Idiosyncrasies of the ARC CE" ticket, here we see its oddness with publishing different default SEs for different VOs. Again, naught actually wrong with the ticket. In progress (17/2)

RHUL</br> https://ggus.eu/ws/ticket_info.php?ticket=101135 (11/2)</br> I lied earlier, and I am bringing up one of the perfsonar tickets. Any luck with getting your perfsonar updated Govind? In progress (11/2)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/2013)</br> The getting-CMS-to-work-at-Glasgow epic (or would you prefer saga?). CMS have pointed out that the original problem is solved, so from their point of view the ticket can be closed when the Glasgow guys feel satisfied. The ticket is in "waiting for reply", but I'm not sure that anyone you'd like to have input from is paying attention (the second reminder went out today). Waiting for reply (17/2)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=101177 (12/2)</br> Durham's SE is publishing biomed support when Durham no longer support them. Here's wishing you good luck with purging biomed from your system! In progress (17/2)

"Submitted from the UK"</br> I've been very lax about tracking tickets submitted by us NGI_UKers (partly as I never found a good way of doing it), but Steve's submission of the dteam voms server problem ticket (101177) whilst I was writing this up has prompted me to retackle that one. Watch this space! Monday 10th February 2014, 15.00 GMT</br> 32 tickets for the UK this week.

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=100849 (29/1)</br> This perfsonar ticket is still just in the "assigned" state; don't make Duncan feel spurned, take a look at his ticket. Assigned (29/1)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/13)</br> NGI argus setup. argusngi.gridpp.rl.ac.uk is setup and in the GOCDB, but what next with the ticket? In progress (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)</br> A ticket from Chris W concerning job failures due to the 512-bit proxy problem. Catalin asked for the update to be tested, but is this testing covered in https://ggus.eu/ws/ticket_info.php?ticket=100343? Waiting for reply (6/2)

Talking of which, can:</br> https://ggus.eu/ws/ticket_info.php?ticket=100343</br> and</br> https://ggus.eu/ws/ticket_info.php?ticket=100887 (gridsite version on the webdav LFC)</br> be closed?

And that's it really. A scan through the solved ticket pile doesn't show anything exciting. But on the second Monday of a month I tend to overcompensate for going over all the tickets the week before, so let me know if I missed ought.

Monday 3rd February 2014, 14.30 GMT</br> Only 29 open tickets in the UK at the moment. To split it further, only 4 of these are "green", three are "yellow", the rest are "red". 7 are perfsonar related tickets, the only really big group of tickets we have.

RALPP</br> https://ggus.eu/ws/ticket_info.php?ticket=100480 (23/1)</br> Some obsolete entries were being published at RALPP, Chris thinks he has fixed it though (a problem on the cluster BDII), awaiting confirmation. Waiting for reply (31/1) Update-Solved

https://ggus.eu/ws/ticket_info.php?ticket=100849 (29/1)</br> Duncan has ticketed RALPP over their perfsonar latency box, he reckons a full log partition. Looks like this ticket hasn't been noticed yet though. Assigned (30/1)

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=99642 (10/12)</br> Backup Voms server testing for GridPP and Southgrid VOs at Oxford. On hold (30/1)

BRISTOL</br> https://ggus.eu/ws/ticket_info.php?ticket=99910 (20/12/2013)</br> LHCB having problems with the environment at Bristol, tracked to ARC being an odd duck. The problem has been forwarded to the ARC devs. On hold (21/1)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/2013)</br> Getting CMS working at Glasgow - the ticket. Gareth has updated a magic CMS xml file using one given to him by Daniela and notes that they're still failing CMS xrootd tests. Gareth asks if the tests are critical, and if they are he pleads for help. The lack of CMS credentials is really nobbling their efforts to get this sorted, or even to dig up docs. Waiting for reply (3/2) Update - Daniela provided an update containing what I can only assume is an invocation of dark forces; Gareth has risked his immortal soul and applied it.
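
Lacking CMS credentials makes this grim, but a plain read test with whatever proxy the SE will accept at least tells you whether the local xrootd door is answering - everything in angle brackets below is a placeholder:

 # Pull a known file back through the SE's xrootd door (DPM-style namespace shown)
 xrdcp -f root://<dpm-head-node>//dpm/<your.domain>/home/<vo>/<testfile> /dev/null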

EDINBURGH</br> I'll probably be better off coming back to these in a few weeks' time!

https://ggus.eu/ws/ticket_info.php?ticket=100840 (29/1)</br> ECDF have an APEL-Pub nagios error going on. Looks like this has flown under the radar, probably due to both Andy and Wahid having more important things on their mind right now. Assigned (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=99179 (25/11/2013)</br> Glue2 obsolete entries. Plans to retire the CEs have been slowed down due to waiting on networking changes. Andy reported that he'll fix the publishing if they're not in a position to decommission soon. On hold (24/1)

https://ggus.eu/ws/ticket_info.php?ticket=99180 (25/11/2013)</br> Similar to above, but publishing default values. It's the same CEs at fault, so this ticket is in the same boat. On hold (4/12/2013)

https://ggus.eu/ws/ticket_info.php?ticket=99794 (16/12/2013)</br> ECDF's perfsonar boxen blocking access to their webpages. Was held up by Christmas, but no news since - probably won't be for a few weeks. On hold (16/12/2013)

https://ggus.eu/ws/ticket_info.php?ticket=100569 (28/1)</br> The perfsonar latency box has started refusing connections. On hold whilst Andy's off. On hold (28/1)

https://ggus.eu/ws/ticket_info.php?ticket=95303 (1/7/2013)</br> glexec ticket. Sadly the same story as last time (or the last times).

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=99621 (10/12/2013)</br> Durham have a bad worker node, spotted by enmr.eu. Whilst the guys haven't had a chance to fix it, one could argue that an offlined problem is a solved problem, as it can't hurt the jobs anymore. On hold (28/1)

SHEFFIELD</br> https://ggus.eu/ws/ticket_info.php?ticket=100037 (3/1)</br> Sheffield's perfsonar box needed some site firewall holes poking for it. On the to do list is an upgrade and assimilation into the mesh due to only testing against 6 sites currently. On hold (27/1)
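
For reference, the web-facing part of the hole-poking is small - on the box itself (or the site firewall equivalent) it's roughly the below, with the usual caveat that local policy may want source restrictions rather than a blanket ACCEPT:

 # Allow access to the perfSONAR toolkit web pages
 iptables -I INPUT -p tcp --dport 80 -j ACCEPT
 iptables -I INPUT -p tcp --dport 443 -j ACCEPT
 service iptables save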

MANCHESTER</br> https://ggus.eu/ws/ticket_info.php?ticket=100867 (30/1)</br> Teething problems for Manchester's new perfsonar boxes. Alessandra asks Duncan if it can be closed. In progress (3/2) Update- Solved, and wasn't a site problem to begin with.

LANCASTER</br> https://ggus.eu/ws/ticket_info.php?ticket=100566 (27/1)</br> Lancaster isn't getting 10G performance out of its perfsonar boxen. My suspicion is that the NICs themselves are running slow, not the switches. Maybe I'm using the wrong drivers? In progress (3/2)
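
The quick checks I've been running to rule out the obvious, in case anyone else is hunting 10G gremlins (the interface name is whatever your perfsonar NIC actually is):

 # Has the NIC actually negotiated 10000Mb/s?
 ethtool eth0 | grep -i speed
 # Which driver/firmware is it using, in case I really am on the wrong one
 ethtool -i eth0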

https://ggus.eu/ws/ticket_info.php?ticket=95299 (1/7/2013)</br> Lancaster's GLEXEC ticket, waiting on me getting a tarball one working. I'm currently trying out another tarball one on my test bed, but it's early days yet (it's more an exercise in documenting the errors at the mo). On hold (31/1)

https://ggus.eu/ws/ticket_info.php?ticket=100011 (31/12/2013)</br> Biomed stopped working for one of the Lancaster CEs. The ticket suffered from lack of priority (sorry biomed!). On hold (24/1)

UCL</br> https://ggus.eu/ws/ticket_info.php?ticket=95298 (1/7/2013)</br> The UCL glexec ticket. SL6 and DPM upgrades are done, Ben is just getting things settled before he starts tackling this. On hold (27/1)

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=94746 (10/6/2013)</br> QM having trouble scrubbing the biomed out of their SE's information system. Chris submitted https://ggus.eu/ws/ticket_info.php?ticket=100290 and has put a lot of hours into this. On hold (14/1)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=100568 (28/1)</br> Brunel's perfsonar boxes have problems. Raul plans to upgrade, and has made known his distaste that an upgrade requires a reinstall. In progress (29/1)

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9/2013)</br> LHCB job problems still haunting JET. I think this ticket should be in "Waiting for reply", but I also think that I know the answer to the question (that the error message they're seeing is a red herring). In progress, should be in some other status (29/1)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)</br> Chris has spotted jobs failing to get from the RAL WMS to Imperial. Looked to be SSL problems. On hold awaiting the RAL upgrade to the next WMS release. On hold (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=100343 (16/1)</br> RAL WMS producing 512-bit proxies (occasionally). Waiting on the same release. Waiting for reply (?) (27/1)

https://ggus.eu/ws/ticket_info.php?ticket=100887 (31/1)</br> Due to the same underlying issue as the above tickets, Chris asks for the gridsite package on the webdav LFC to be updated. In progress (31/1)

https://ggus.eu/ws/ticket_info.php?ticket=100507 (23/1)</br> CMS transfers failed between Caltech and RAL. The problem has eased itself, so the ticket only needs to be kept open if further investigation is warranted (as Brian pointed out). In progress (3/2)

https://ggus.eu/ws/ticket_info.php?ticket=98249 (21/10/2013)</br> CVMFS for SNO+. Almost there, but creating the Sno+ tarballs to test with is taking longer than expected. On hold (29/1)

https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12/2013)</br> The new NGI Argus server (argusngi.gridpp.rl.ac.uk) has been set up in the gocdb and is online. In progress (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=97025 (3/9/2013)</br> Ye olde RAL myproxy server name confusion issue. No news on this for a while; the hope is to have this dealt with soon. But then the last update was nearly a month ago, so soon isn't as soon as we'd like it to be! On hold (6/1)

That's all folks. I noticed a few longstanding tickets have been solved over the course of January, so thanks for that!

Monday 27th January 2014, 15.00 GMT</br> 33 Open UK Tickets this week.

Courtesy of John Kewley's Posse of Ticket Wranglers we have:

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=99642 (10/12/2013)</br> Southgrid Backup Voms server testing. I suspect other, squeakier wheels have been getting the Oxford grease (where the heck am I going with this analogy?). Unless you're going to get stuck into it right now, it's probably best to On Hold it until you're actually sat down actively poking it. In progress (8/1)

SHEFFIELD</br> https://ggus.eu/ws/ticket_info.php?ticket=100037 (3/1)</br> Problems with the Sheffield Perfsonar host. Looks like the Sheffield host might need an upgrade (or at least implementation of the mesh). Again, if it doesn't look like you'll get to this soon can you On Hold it. In progress (13/1)

Spotted with my own eyes:

RHUL</br> https://ggus.eu/ws/ticket_info.php?ticket=100527 (24/1)</br> An atlas ticket concerning the RHUL storage. Looks like it might have snuck in amongst the Monday morning e-mail pile. Assigned (24/1)

That's all really. We're down to 33 tickets (from 42 last week), as usual I'll be going over all of them next week, but feel free to bring any up that are particularly close to your heart in the meeting or online.

Please check your site tickets here:</br> http://tinyurl.com/cblj3ab

Monday 20th January 2014, 14.30 GMT</br> There are 42 Open UK tickets this week. Where did they all come from? Let's take a look.

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9/2013)</br> LHCB jobs failing at Jet. The Jet chaps have just fixed an SSL problem at their site, so would like to see if this has fixed the LHCB problems. Waiting for reply (20/1) Update - things are still failing; reading the error, perhaps JET have picked up some weird rpms somewhere?

(This also possibly solves the Jet gLeXeC ticket https://ggus.eu/ws/ticket_info.php?ticket=95295 UPDATE-SOLVED, the Jet guys put in a fix to JAVA to solve the keysize problem and things work now )

UCL</br> https://ggus.eu/ws/ticket_info.php?ticket=100342 (16/1)</br> Atlas are seeing transfer failures to/from UCL's dpm. Looks like an authentication problem, Ben might need a hand. In progress (20/1)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=100333 (16/1)</br> Looks like this problem Tom and Chris spotted with one of the RAL WMSii has been solved, so the case can be closed. In progress (17/1) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=100343 (16/1)</br> But the WMSes still bring us pain; here Chris documents that the RAL ones are still producing 512-bit proxies. Chris also helpfully links two other WMS tickets. In progress (17/1)
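
If you want to check whether a proxy you've been handed is one of the runty 512-bit ones, either of these will tell you (the path is just the usual default proxy location):

 # Key strength of a delegated proxy
 grid-proxy-info -file /tmp/x509up_u$(id -u) | grep -i strength
 # Or ask openssl directly
 openssl x509 -in /tmp/x509up_u$(id -u) -noout -text | grep -iE "public.?key"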

https://ggus.eu/ws/ticket_info.php?ticket=98122 (17/10/2013)</br> But Tom provides another win, this time with the cern@school cvmfs repo. He's managed to get it working, able to put data into it, so this ticket can probably be closed too. In progress (17/1) SOLVED

https://ggus.eu/ws/ticket_info.php?ticket=100114 (8/1)</br> But then the WMS try to spoil our buzz again with another ticket. Although I believe this is the forerunner to 100343 above. In progress (16/1)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=100188 (10/1)</br> Raul has provided Brian with the database dump from his SE (it should have landed in Brian's inbox), I think this ticket can be closed if the dump looks alright. In progress (16/1)

BRISTOL</br> https://ggus.eu/ws/ticket_info.php?ticket=99910 (20/12/2013)</br> LHCB problems at Bristol, due to ARC doing strange things to the environment. A few brave fixes have been attempted, but no joy. Waiting on feedback from the ARC developers - if that takes a while this ticket will need to be On Holded. In progress (14/1)

ECDF</br> https://ggus.eu/ws/ticket_info.php?ticket=99794 (16/12/2013)</br> Poking holes in the Edinburgh firewall for the perfsonar box. Any news from the IT overlords? I understand that there's a pending Edinburgh baby boom, so I'm not sure if anyone's still about? On hold (13/1)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/2013)</br> The "getting CMS working at Glasgow" ticket. It's looking almost as neglected as my gym membership. On hold (16/12/2013)

MANCHESTER</br> https://ggus.eu/ws/ticket_info.php?ticket=97066 (5/9/13)</br> Getting the Manchester perfsonar boxes back up and running. How goes it? On hold (7/1)

SHEFFIELD</br> https://ggus.eu/ws/ticket_info.php?ticket=98594 (4/11/2013)</br> The LHCB job uploading problem at Sheffield. It seems all parties have gotten stuck, so we need to decide where to go with this. On hold (8/1)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=99621 (10/12/13)</br> Just making sure this ticket, with a bad node needing offlining, isn't forgotten about. On hold (19/12)

Similar with the Durham GLEXEC ticket https://ggus.eu/ws/ticket_info.php?ticket=95302 - it was On Holded over Christmas, but Christmas was a while ago now. In fact, with Creme eggs out, it must be nearly Easter already... right?

EXTRA EXTRA</br> RALPP https://ggus.eu/ws/ticket_info.php?ticket=100401 (20/1) This nagios glexec alarm ticket which Chris quickly jumped on has been reopened on you guys. Just bringing it up as reopened tickets have a habit of sneaking under the radar. Reopened (21/1)

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=100348 (17/1) Atlas are getting a little antsy for some news on this ticket. And also don't seem to understand what the waiting for reply state is for... Waiting for reply (21/1)


Monday 6th January, 14.30 GMT</br> Happy New Year Everybody!

38 Open UK tickets this year.

NGI</br> https://ggus.eu/ws/ticket_info.php?ticket=99854 (18/12/13)</br> The NGI ROD has a ticket open against it, Jeremy has asked for clarification but no word back yet. Waiting for reply (26/12/13)

SUSSEX</br> https://ggus.eu/ws/ticket_info.php?ticket=95165 (28/6/13)</br> Sussex's Perfsonar ticket. There's been a lot of progress thanks to new Sussex admin Matt (Hi Matt!). Duncan suggests leaving it a few days to collect data so we can see where we are with this. In progress (3/1)

https://ggus.eu/ws/ticket_info.php?ticket=99198 (26/11/13)</br> glexec ops nagios test failures at Sussex. The new Matt has gone great guns over other tickets at the site, although this problem still haunts them. If you can't see the solution maybe a mail to TB-SUPPORT is in order? In progress (31/12)

OXFORD</br> https://ggus.eu/ws/ticket_info.php?ticket=99642 (10/12/13)</br> Backup VOMS server testing ticket for Oxford. Testing was going well but I think something else came along! Needs some love. In progress (10/12/13)

BRISTOL</br> https://ggus.eu/ws/ticket_info.php?ticket=99796 (16/12/13)</br> A ticket about Bristol's perfsonar. Winnie is having the relevant holes poked into their firewalls, things are looking good (from the ticket) - actually not sure if it should be in "Waiting for Reply". In Progress (3/1)

https://ggus.eu/ws/ticket_info.php?ticket=99910 (20/12/13)</br> LHCB have spotted a CVMFS problem at Bristol. After a surprise power outage it looks like LHCB jobs aren't getting their SW_DIR set right, even though the infrastructure to set it up appears to be in place. In progress (6/1)

GLASGOW</br> https://ggus.eu/ws/ticket_info.php?ticket=99639 (10/12/13)</br> The Glasgow VOMS Backup Server testing ticket. Some progress was made, but Dave mentions that it would have to wait until the New Year before it can be finished off. On Hold (19/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=100012 (31/12/13)</br> Biomed test jobs were failing at Glasgow - Dave thinks he snuffed out the problem and it looks like tests are being passed again. You might want to solve this one yourselves or at least Waiting for Reply it. In progress (6/1)

(As you can see over the holiday period GGUS tickets broke the 6-figure mark).

https://ggus.eu/ws/ticket_info.php?ticket=98253 (21/10/13)</br> A CMS ticket that evolved to "getting CMS working at Glasgow". Not much news for a while, last word was that Sam was looking at the CMS DPM redirector. On hold (3/12/13)

EDINBURGH</br> https://ggus.eu/ws/ticket_info.php?ticket=99794 (16/12/13)</br> ECDF's ticket regarding access to their Perfsonar Webpages. Andy submitted a request for the ports to be opened, but no progress was expected until nowish. On hold (16/12)

https://ggus.eu/ws/ticket_info.php?ticket=99180 (25/11/13)</br> Some of Edinburgh's CEs are publishing default values. This seems to be only affecting older CEs pointing at SL5 resources; as these will be decommissioned soon the strategy is to not bother fixing this issue. On hold (4/12)

https://ggus.eu/ws/ticket_info.php?ticket=99179 (25/11/13)</br> In a similar vein, some of the ECDF services are publishing obsolete GLUE2 entries. This appears to be the same problem as above, with the same solution. On hold (10/12)

https://ggus.eu/ws/ticket_info.php?ticket=95303 (1/7/13)</br> GleXEC ticket. No news as ECDF are a tarball site, although I see that Wahid assigned the ticket to Mark Mitchell. What did Mark do to deserve that? On hold (23/12)

DURHAM</br> https://ggus.eu/ws/ticket_info.php?ticket=99621 (10/12/13)</br> Durham had a bad WN eating enmr.eu jobs (as with Bristol, the problem seemed to be a bad environment). Ewan has flagged it to be fixed after Christmas; the bad node is offline though, so it shouldn't be a bother. On hold (19/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=95302 (1/7/13)</br> Durham's GlexEC ticket. Work paused for Chrimbo, but Ewan mentioned the lack of documentation on how to test this yourself. On hold (19/12)

SHEFFIELD</br> https://ggus.eu/ws/ticket_info.php?ticket=99955 (26/12/13)</br> Atlas jobs were failing with stage-in problems. Elena switched back to using rfio from xroot and suddenly the error rate dropped right off. Something for us to discuss in the storage/atlas meetings? In progress (6/1)

https://ggus.eu/ws/ticket_info.php?ticket=98594 (4/11/13)</br> LHCB file uploading problems. Despite a lot of effort and retuning the NAT the problem persists. Any suggestions? In progress (16/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=95301 (1/7/13)</br> glexec ticket. There was a request for an estimated deployment date from the GGUS ticket guys. On hold (29/10/13)

https://ggus.eu/ws/ticket_info.php?ticket=99793 (16/12/13)</br> Access to the Sheffield perfsonar web servers. At last word Elena was checking the iptables on her nodes. No news since. In progress (17/12)

https://ggus.eu/ws/ticket_info.php?ticket=100037 (3/1)</br> Perfsonar problem at Sheffield. In progress (5/1)

MANCHESTER</br> https://ggus.eu/ws/ticket_info.php?ticket=100038 (3/1)</br> Manchester's perfsonar hosts have hit a spot of bother. In progress (6/1)

https://ggus.eu/ws/ticket_info.php?ticket=97066 (5/9/13)</br> A ticket about Manchester's perfsonar hosts, where at last word their nodes were to be reinstalled. Not sure how this relates to 100038. On hold (5/12/13)

LANCASTER</br> https://ggus.eu/ws/ticket_info.php?ticket=95299 (1/7/13)</br> Lancaster's GlexeC ticket. Ahem. On hold (16/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=100011 (31/12/13)</br> Biomed tests aren't working on one of Lancaster's CEs. Being poked. In progress (6/1)

UCL</br> https://ggus.eu/ws/ticket_info.php?ticket=95298 (1/7/13)</br> Glexec ticket. On the to do list, after the DPM upgrade is done with. On hold (18/12)

https://ggus.eu/ws/ticket_info.php?ticket=98125 (17/10/13)</br> Atlas transfer failures. The DPM is upgraded, but there may be some space issues. Paused for the holidays. On hold (20/12/13)

QMUL</br> https://ggus.eu/ws/ticket_info.php?ticket=94746 (10/6/13)</br> The Ghost of publishing past is haunting QM's SE, where biomed support is published where it shouldn't be. Chris will still get to it when he has the time. On hold (19/12/13)

BRUNEL</br> https://ggus.eu/ws/ticket_info.php?ticket=99996 (30/12/13)</br> Nagios APEL-Pub failures. Raul has run the publisher, but it didn't seem to work. EMI3 Apel woes? In progress (6/1)

EFDA-JET</br> https://ggus.eu/ws/ticket_info.php?ticket=95295 (1/7/13)</br> glexeC ticket. Jet are nearly there, just needing to iron out some problems. On hold (11/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=100045 (3/1)</br> Nagios glexec-ops test failures. One of those bugs that need ironing out. In progress (6/1)

https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9/13)</br> LHCB job failures at EFDA-JET, with an odd authentication-like error. At last word the problem persisted. On hold (9/12)

TIER 1</br> https://ggus.eu/ws/ticket_info.php?ticket=98249 (21/10/13)</br> CVMFS for Sno+. Waiting on SW tarballs from the VO. Waiting for reply (6/1)

(In other news T2K and HyperK have had their CVMFS tickets successfully closed).

https://ggus.eu/ws/ticket_info.php?ticket=99647 (10/12/13)</br> Sno+ lcg-cp timeouts at the Tier 1. There was a request for more information from the VO; it just had its second reminder last week. Waiting for reply (17/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=99556 (6/12)</br> NGI Argus ticket. A server has been deployed for testing, work was paused for the holidays. In progress (30/1)

https://ggus.eu/ws/ticket_info.php?ticket=97025 (3/9)</br> The RAL Myproxy server's certificate problem; this ticket is serving as an open reminder of the issue. No recent progress, but hopefully it'll be solved this month. On hold (6/1)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9/12)</br> "correlated packet-loss on perfsonar host". The last 2012 ticket. There was a plan to reinstall this on new hardware, but that was in October. On hold (18/10/13)

https://ggus.eu/ws/ticket_info.php?ticket=99768 (13/12/13)</br> Atlas source file errors. Thought to be a renaming problem, but the errors have reoccurred. The ticket is in "waiting for reply" and I'm not sure it should be any more. Waiting for reply (29/12/13)

https://ggus.eu/ws/ticket_info.php?ticket=98122 (17/10/13)</br> cern@school's cvmfs-of-their-own ticket. Good progress on testing, Tom reports successfully uploading a tarball. Waiting for reply (6/1)