Difference between revisions of "Past Ticket Bulletins 2016"

From GridPP Wiki

Revision as of 14:51, 25 July 2016

Monday 19th July 2016, 14.00 BST
39 Open UK tickets this week!

SUSSEX
120735 (11/4)
Availability ticket - things appear to be looking up at Sussex with their Storm up and running again - hope it will stay up. In Progress (12/7)
Good news for the NGI ticket about Sussex's poor figures: 122614

LSST at GLASGOW and the TIER 1?
120351 (Glasgow)
120350 (Tier 1)
A pincer movement on you guys as Alessandra also asks in the tickets, any news? Update - thanks to Gareth for an honest appraisal of the situation at Glasgow.

Other Availability Tickets - RHUL and BRISTOL (13/7/16)
122854 (RHUL)
122853 (Bristol)
These two tickets seem to have snuck past the sentinels at their respective sites - they probably just need putting On Hold, as they are currently just Assigned. Update - both On Hold now, thanks!

(also for RHUL is the Ops ticket 122851 - still just Assigned as well).

LHCB at DURHAM
122662 (7/7)
LHCB jobs at Durham are running into difficulties. Oliver asks if LHCB can take a look, as the jobs are consistently hitting batch system limits and wasting CPU resources because of this. Waiting for reply (18/7)

MANCHESTER
122379 (28/6)
Robert tracked down the reason behind the perfsonar latency issues seen - the owamp disk limits were set too low (which is presumably the default). Robert asks Duncan what they have at Imperial in terms of both configs and amount of space used (for what it's worth Lancaster has 10GB in the configs too, but only 21MB in /var/lib/owamp ...). Waiting for reply (18/7)
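
The comparison being asked for here (configured limit versus space actually used under /var/lib/owamp) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the 10GB figure is taken to mean 10 * 1024**3 bytes (an assumption), and the total is the sum of apparent file sizes, so it may differ a little from what du reports.

```python
import os


def directory_usage_bytes(path):
    """Sum the apparent sizes of all regular files found below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                # lstat so that symlinks are counted as links, not followed
                total += os.lstat(file_path).st_size
            except OSError:
                # file vanished or is unreadable; skip it
                continue
    return total


def usage_fraction(path, limit_bytes):
    """Return used space as a fraction of a (hypothetical) configured limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return directory_usage_bytes(path) / limit_bytes
```

Calling `usage_fraction("/var/lib/owamp", 10 * 1024**3)` on a host with about 21MB of owamp data would give a value of roughly 0.002, i.e. nowhere near the limit.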

Decommissioning EFDA-JET
122198 (17/6)
Just an FYI, JET are in their final downtime. In Progress (12/7)

A spot of ticket buildup at BIRMINGHAM
122771 - Webdav & xroot ticket, Assigned.
122416 - Pilot role ticket, On Hold.
121125 - Missing atlas dump ticket, On Hold.
Tickets seem to be ganging up at the site - let us know if you need a hand. The atlas dump ticket is looking quite crusty, but all are important.

TIER 1
122827 (12/7)
Sno+ are having regrets after saying a while ago they'd be okay with all tape and no proper disk. Matt M has requested a disk area that isn't scratch. This is being looked at. In Progress (13/7)

122818 (12/7)
Alessandra noticed that ATLAS Event Service jobs were failing because the RAL Object Store was down. Alastair replied that the service was being configured, but there was no way to create a downtime for it and failover was not working as expected. Interesting stuff. In progress (12/7)

121258 (6/5)
Decommissioning one of the RAL WMSes. The service is in downtime and stopped. Are you waiting to delete it from the gocdb and close the ticket? In progress (28/6)


Monday 11th July 2016, 15.30 BST
36 Open UK tickets this week!

NGI
122614 (6/7)
The hammer is threatening to come down on Sussex (with reference to 120735) - I've let Jeremy M know offline about this- hopefully we can get things looking better this week. In progress (11/7)

ATLAS WEBDAV/XROOT TICKETS
122695 - Tier 1
122772 - SUSSEX
122771 - BIRMINGHAM
122770 - SHEFFIELD
Tickets from atlas trying to get the last few sites that don't seem to have xroot or webdav working or advertised (the Sheffield one looks like just a misconfiguration).

A polite nudge to Mark and Matt to check the Birmingham tickets in general, I think you have a few just assigned or need shimmying along...

ECDF
122653 (7/7)
Edinburgh having a CE that shouldn't be monitored being monitored has bitten them again (duplicate/expansion on 120004). Andy once again wishes that there was a way of unlinking the CE from the monitoring in gocdb. Waiting for reply (8/7)

GLASGOW
122498 (3/7)
It looks like this MICE ticket can be closed - it all looks OK. In progress (6/7)

Monday 4th July 2016, 13.00 BST
29 Open Tickets this month, arranged by site.

SUSSEX
121797 (26/5)
Sno+ dirac jobs failing at Sussex - looks to be a separate issue from the now closed pilot ticket (118289). In progress (13/6)

120735 (11/4)
A Low-availability ROD ticket. Hopefully this will "resolve itself" soon. On hold (6/6) Update - see Daniela's observation on TB-SUPPORT - looks like things are weird at Sussex.

RALPP
122463 (1/7)
Atlas were seeing a "stalled" xrootd connection on the xroot door to RALPP's dcache, but it turned out that they were using the wrong url. The details are being changed in AGIS, and the submitter asks if this xroot door's hostname is planned on being kept as is. In progress (1/7)

OXFORD
121924 (2/6)
Duncan noticed a drop in perfsonar performance for Oxford- this is being brought up with the Oxford networking team. Any news from them? In progress (7/6)

BRISTOL
120455 (29/3)
CMS validation of the HTCondor CE at Bristol. Still waiting on the accounting getting sorted before things can kick off again properly. On hold (29/6)

BIRMINGHAM
122416 (29/6)
Daniela spotted that a number of VOs were missing pilot roles at Birmingham. On hold as Matt is away on holiday and Mark is unsure if he'll be able to get round to rolling out the changes. On hold (4/7)

121125 (28/4)
Missing atlas dumps, but Matt is on holiday so no news. On hold (4/7)

GLASGOW
120351 (22/3)
Enabling LSST at Glasgow. Gareth reports that things are picking up with their new Identity Management System almost in production, once this is satisfactorily sorted LSST will be enabled at the site. On hold (1/7)

122378 (28/6)
Duncan spotted that the Glasgow test results were all orange, like they had had too much fake tan applied to them. David and Co have decided to put them into scheduled downtime pending a reinstall. On hold (28/6)

121929 (2/6)
Biomed still "accidentally" on the Glasgow SE. A date for purging biomed from the SE has been set for the 30th of July. In progress (24/6)

122498 (3/7)
A rare mice ticket - they had trouble getting their data. Sam reports that it was a config error on some of the mice-containing disk servers, which should be fixed now. Waiting for reply (4/7)

ECDF
120004 (7/3)
ROD ticket for the archer facing arc CE that will be perpetually failing nagios tests, it doesn't look like much can be done. On Hold (23/6)

SHEFFIELD
122517 (4/7)
Low availability ROD ticket. Elena is unsure of the reasons for the low metrics, and is having trouble digging up useful information. In progress (4/7)

MANCHESTER
122379 (28/6)
Duncan spotted that the Perfsonar latency results weren't looking right - after a bit of discussion of the right places to look (with the useful link) Robert restarted the services and things look better - perhaps the ticket can be closed? In progress (29/6)

LIVERPOOL
122414 (29/6)
Nagios failures after Liverpool's abrupt power "cut" last week - some services still seem unhappy. In progress (4/7)

122514 (4/7)
Another nagios ticket, for the arc CEs but likely the same underlying reason. Assigned (4/7)

RHUL
122417 (29/6)
A ticket to the CMS factory admins, asking for the entry for some RHUL resources into the system. There were some teething issues but they were fixed, and the factory overseers have asked if these endpoints are ready to be put into the production glidein factories? Assigned (should be something else) (1/7) Update - Govind has replied so looks like this ticket will be done soon.

QMUL
120204 (15/3)
LHCB trouble submitting to QM's dual stack CE, due to problems outside the site's control. Looking at the related ticket (120586), the ETA on a patch that should fix this behavior is the 8th of this month. On hold (25/4)

IMPERIAL
122515 (4/7)
A very fresh ticket - CMS have complained that the permissions on a file folder are wrong. Daniela replies that this was on purpose - Imperial hasn't supported Heavy Ion data and asks why the change? Waiting for reply (4/7)

EFDA-JET
122198 (17/6)
The decommissioning of the EFDA-JET grid site ticket. Full switch off is scheduled for the 25th of August, downtime commences from the 12th of July. In progress (21/6)

121899 (1/6)
EFDA-JET ROD availability ticket. A bit of a moot point! On hold (28/6)

TIER 1
121258 (6/5)
Decommissioning of one of the RAL WMSes. Access stopped last week as announced. In Progress (28/6)

119841 (1/3)
HTTP support for lcgcadm04.gridpp.rl.ac.uk, currently being referred to the developers. No news for a while, any movement behind the scenes? On hold (26/4)

120350 (22/3)
Enabling LSST at RAL. At last check the remaining (large) hurdle was deploying the users to the worker nodes across the site. Any joy? In progress (6/5)

121687 (20/5)
Packet loss for the RAL perfsonar, investigation is awaiting a known network intervention which will replace a router. Do we know when this work will take place? On Hold (23/5)

122364 (27/6)
Getting cvmfs support for the solidexperiment.org VO. Catalin has set up /cvmfs/solidexperiment.egi.eu as the egi.eu namespace is the most suitable. Daniela gives the thumbs up for this and has set the ticket on hold for a fortnight pending other work. On Hold (29/6)

120810 (13/4)
The neat decommissioning of srm-biomed.gridpp.rl.ac.uk, reopened pending removal from the bdii. In progress (24/6)

117683 (18/11/15)
Getting CASTOR to publish GLUE2 information. No news for a while on this, could do with an update (even if it's a null update). On Hold (5/4)

NGI
119995 (7/3)
NGI uncertified site ticket, Jeremy is on it. In progress (28/6)

Monday 27th June 2016, 11.00 BST
25 Open UK Tickets this week - down a lot from last week.

VOMS admins doing us a Solid
122263 (Manchester)
122337 (Imperial)
122336 (Oxford)

FYI - setting up the VOMS servers for solidexperiment.org, so it will soon be ready for us to support.

EFDA-JET
122198 (17/6)
Decommissioning EFDA-JET - the broadcasts have been sent out. Assigned (should be in progress) (21/6)

NGI
119995 (7/3)
Uncertified NGI sites for the chopping - this could do with at least an update. (17/5)

SUSSEX
118289 (10/12/15)
Pilots at Sussex. We managed to have a little look at this last week, but it was only a little one. There was a recent re-yaiming which might have changed the landscape somewhat, and some pilots seemed to be getting through. Waiting for reply (23/6)

(linked to the Sno+ version 121797)

BRISTOL
120455 (29/3)
CMS validation of the HTCondor CE - CMS have asked for an update. In progress (17/6)

LSST Tickets
120350 (Tier 1)
120351 (Glasgow)
Any news?

BIRMINGHAM
121125 (28/4)
Missing ATLAS dumps at Birmingham - no news for a while. Have you tried uploading them with rfcp? Has anyone got xrdcp to work for uploading their dumps automatically? In progress (1/6)

Monday 20th of June 2016, 13.00 BST
32 Open UK tickets this week- just doing the highlights due to HEPSYSMAN.

So long JET
122198 (17/6)
EFDA-JET are decommissioning, this is the ticket tracking that process. We should make sure that things keep on track. Assigned (17/6)

LANCASTER
122188 (16/6)
It might not be news to others, but Lancaster ran afoul of LHCB's newish 4GB job memory requirements, causing some job failures when they ran out of memory (we were only allocating 3GB per job). Should be okay now though with an increased memory allocation per job. Solved (20/6)

RHUL
121575 (16/5)
Happy news, this ROD availability ticket looks like it can be closed. In progress (17/6)

BRISTOL
122172 (15/6)
Bristol have hit the "classic" problem of nagios tests timing out before the corresponding job is scheduled. Lukasz wonders if artificial job reservation is going to be needed to stop this from happening. In progress (16/6)

QMUL
122193 (16/6)
Multiple DNS entries for the QM (dual stacked) perfsonar hosts are causing intermittent test failures, Duncan points to documentation that asks that perfsonar hosts only have one hostname.
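
As a rough way of spotting this kind of thing, the sketch below collects every name that a host's IPv4 and IPv6 addresses reverse-resolve to. It is only an illustration under assumptions: the function names are invented for this example, the resolver calls are passed in as parameters purely so they can be swapped out, and it says nothing about how the perfsonar tests themselves do their lookups.

```python
import socket


def names_for_host(hostname,
                   getaddrinfo=socket.getaddrinfo,
                   gethostbyaddr=socket.gethostbyaddr):
    """Map each address of hostname to the sorted names it reverse-resolves to."""
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # the address string is the first element of sockaddr for IPv4 and IPv6.
    addresses = {info[4][0] for info in getaddrinfo(hostname, None)}
    result = {}
    for address in sorted(addresses):
        try:
            primary, aliases, _addresses = gethostbyaddr(address)
        except OSError:
            # no reverse record (socket.herror and gaierror are OSErrors)
            result[address] = []
            continue
        result[address] = sorted({primary, *aliases})
    return result


def has_single_name(hostname, **resolvers):
    """True if all addresses of hostname reverse-resolve to exactly one name."""
    names = set()
    for found in names_for_host(hostname, **resolvers).values():
        names.update(found)
    return len(names) == 1
```

A dual-stacked host whose IPv4 and IPv6 addresses point back at different names would make `has_single_name` return False, which is the situation the documentation asks sites to avoid.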

Monday 13th June 2016, 16.00 BST
29 Open UK Tickets this week (down from 35 last week).

Not much exciting going on on the ticket front, and I have to send my apologies for today's meeting.

Here's the link to the UK tickets: http://tinyurl.com/nwgrnys

There are some cries for help on the ticket front, Kashif has already asked for help with the Oxford httpd ticket 122069, and a forewarning that Jeremy M will be seeking guidance with the Sussex issues at next week's HEPSYSMAN (118289 & 121797).


Monday 6th June 2016, 14.00 BST
35 Open UK Tickets this month.

NGI
121987
The NGI's very urgent response times for May weren't up to par - I dug into the reason why and updated the ticket. In progress (6/6) Update - solved once explanation given

119995 (7/3)
Uncertified NGS sites to clear up - Jeremy has been on it. In Progress (17/5)

SUSSEX
118289 (10/12/15)
Pilot ticket - Jeremy M thought he had got it, but Daniela's tests say otherwise (although the errors look like the CE playing up). In progress (26/5)

121797 (26/5)
Sno+ dirac jobs failing at Sussex - looks like the same problem as above. No word from the site - I'll poke Jeremy M offline. Assigned (26/5)

120735 (11/4)
Low availability ROD ticket. Hopefully Sussex will have a clear month. On Hold (6/6)

OXFORD
121641 (18/5)
Wrong capacities reported in REBUS - this ball has landed in Oxford's court, with Pete G looking at the SE publishing. Assigned (I set it In Progress) (3/6)

121924 (2/6)
An interesting ticket from Duncan, concerning a drop in perfsonar throughput performance at Oxford. Still just Assigned (2/6)

BRISTOL
120455 (29/3)
Validation of a new HTCondor CE at Bristol by CMS. At last check Bristol were testing the CERN accounting daemon, but that was a few weeks ago. Any news? In progress (9/5)

121989 (6/6)
Super-fresh ROD ticket (.glexec.CREAMCE-JobSubmit tests). Assigned (6/6) Update - solved 10 minutes after I wrote this.

BIRMINGHAM
121125 (28/4)
Missing ATLAS SE dumps. At last check Matt W was having troubles with xrdcp-ing the dumps into his DPM, Alessandra suggested that others succeeded using rfcp (with the caveat that rfcp might not be around much longer). In progress (1/6)

GLASGOW
120135 (11/3)
HTTP support ticket. Any news? An update would be nice, no matter how vacuous. On holding the ticket would be even better. In progress (7/4) Update - Solved, tests are green.

121929 (2/6)
Glasgow's SE "not working" for Biomed- which it isn't meant to - but biomed support was still being published. Gareth is sorting that out, and will close the ticket once the Biomed references are purged. All good. In progress (3/3)

120351 (22/3)
Enabling LSST at Glasgow. I'll repeat Alessandra's question in the ticket - any news? On hold (5/5)

EDINBURGH
121465 (11/5)
ROD Availability ticket, just waiting for time to pass. On hold (31/5)

121990 (6/6)
BDII issues caused a few ROD test failures - Marcus fixed things around lunchtime, hopefully they heal up soon (to answer Marcus's question, I believe BDII changes typically take 2-3 hours to fully propagate). In progress (6/6) Update - can likely be closed, tests are green again.

120004 (7/3)
ROD test failures for the ARCHER test CE. The last update has Andy asking if the ticket could be put on "long term hold"? That's if someone can't manually edit the gocdb to set this service "Monitored=N, Production=Y". On hold (24/5)

SHEFFIELD
121991 (6/6)
A fresh ROD ticket, srm tests were failing. Elena freed up some space and things should be good now. Waiting for reply (6/6) Update - tests are passing, another for the solved pile?

LIVERPOOL
121759 (25/5)
Another Availability ROD ticket. John identified the cause as likely due to the DPM publishing problems after a cert upgrade (that's got me too before). Just needs time to heal these wounds now. On hold (27/5) Update- actually the ticket isn't on hold, but it should be...hint...

RHUL
121575 (16/5)
Yet another availability ticket (May was not kind to the UK). Likely needs On Holding whilst the metrics "fix" themselves - if the underlying problems have passed. In progress (16/5)

QMUL
120352 (22/3)
LSST support at QM. Dan reports today that LSST should be enabled on three CEs. Nice one. Waiting for testing. (6/6)

120204 (15/3)
LHCB problems, due to an issue submitting jobs to dual stack CEs from CERN. The referenced issue (120586) has had a priority bump and a few extra parties cc'd in, so hopefully there will be some movement on it. On Hold (25/4)

BRUNEL
121573 (16/5)
ROD BDII issue ticket - possibly due to multiple site BDIIs (although I was under the same impression as Daniela). Kashif has opened a related ticket (121760), which Raul commented on today to ask for confirmation that the issues are related. On hold (27/5)

121813 (27/5)
Brunel failing CMS validation - likely due to cvmfs playing up on two nodes. The ticket has turned into a conversation on CMS site settings, and seems to be chugging along fine. In progress (6/6)

EFDA-JET
121899 (1/6)
Low availability JET ticket. Assigned (1/6)

121837 (30/5)
JET SE not working for biomed. I thought that JET stopped supporting Biomed a while ago, I'll need to check my notes. Assigned (30/5)

100IT
121189 (2/5)
A ticket I don't understand! Waiting for reply (16/5)

TIER 1
119841 (1/3)
HTTP support at the Tier 1 - on hold awaiting dev support. On Hold (26/4)

121687 (20/5)
Another perfsonar performance ticket from Duncan. A router that could be the cause is due to be replaced, things will be looked into in more detail after. On hold (23/5)

121894 (1/6)
A request for the Tier 1's plans to deploy a "LHCOPN IPv6 Peering, incl. dualstack Perfsonar". The upcoming router replacement is a blocker for this. In progress (1/6)

120810 (13/4)
Biomed requiring a bit of extra reassurance during the decommissioning of their volume. In progress (20/6)

120350 (22/3)
Enabling LSST at RAL. Things were looking good, but it looks like progress stalled rolling out the VO to the workers (aka the hard bit). Any news? In progress (6/5)

121322 (10/5)
A Sno+ user having trouble accessing files at the Tier 1. Whilst the issue appears to be fixed for the example file, the user lists a few more files, a subset of which they have trouble downloading. Reopened (3/6)

117683 (18/11)
Castor not publishing glue 2. Awaiting some background dev work. Any news? On hold (5/4)

DECOMMISSIONING TICKETS
120664- GenScratch Disk Pool at the Tier 1.
121258- WMSes & LB at the Tier 1 (I previously misadvertised this as being a Glasgow decommissioning ticket, thus revealing to everyone my secret - that I don't actually properly read every ticket).
All handled perfectly.

Friday 27th May 2016

Matt's on leave w/c 30th May, he'll be back the week after for a full ticket review. Hopefully by then there will be fewer tickets for him to report on. That would make Matt a happy chappy!

In the meantime here's a link to all the UK tickets.
And here's the link to the Other VO Nagios.

Monday 23rd May 2016, 15.00 BST
37 Open UK Tickets this week.

Concentrating on tickets that look like they can be closed (if not now then soon):

TIER 1: 120954
This LHCB ticket to clean up DNS aliases looked to have the hard parts done.

TIER 1: 121698
CMS failures over the weekend, solved by increasing the max file limit by a factor of 10. Looks like this sorted the problem.

RALPP: 118628
Daniela reports that (after their voms change) LZ jobs submitted to RALPP okay - so maybe this LZ ticket can be wrapped up?

SUSSEX: 120714
This ROD ticket looks sorted for Sussex, Gareth has asked the site to set it to solved. Update - solved

RHUL: 121231
An LHCB ticket, problems were found and solved, pilots are flowing once again. Mark gives the thumbs up to solve the ticket.

GLASGOW: 120973
Ticket tracking the retirement of the WMSii and LB. I suspect the Glasgow chaps know to (and are looking forward to) closing this ticket once they've completed the last few steps.

QMUL: 121574
It looks like the alarm triggering this ROD BDII ticket disappeared on its own, so feel free to close the ticket (as Gareth suggested).

LHCB VOFEED tickets
ECDF: 121360
BRUNEL: 121388
Both these VOFEED tickets have asked for feedback from lhcb on what way to proceed.

TIER 1 SNO+ TICKETS
120920
121322
Snoplus have two open tickets with the Tier 1 regarding file access - the first is regarding xrootd problems, the second accessing files from tape. Both tickets could do with an update, I believe both tickets have their root cause in Castor not playing ball.

Update - Birmingham
121125
Atlas dumps ticket for Birmingham - Matt reports he's trying to get the xroot to upload the dumps locally into the DPM. Did anyone have success with this?

Monday 16th May 2016, 15.00 BST
42 Open UK Tickets this week.

GOCDB/VOFEED mismatch tickets
There are 7 open tickets left from last week's campaign to clean up the VO tags featured in the gocdb. Only the Birmingham ticket is still in the "assigned" state, the rest are undergoing discussion or requesting feedback/clarification.
BIRMINGHAM 121450
RALPP 121464
LIVERPOOL 121394
BRUNEL 121388
BRISTOL 121386 Update - closed, thanks Winnie!
ECDF 121360
RHUL 121421

QUESTIONING ROD
121465 (11/5)
This ECDF availability ticket is "on the mend", but Andy has asked how the numbers are calculated. Waiting for reply (16/5) (This goes in hand with Andy's question in ECDF's other ROD ticket 120004)

120714 (9/4)
I think this Sussex ROD ticket is solved, the link to the tests looks green (in a good way). I think it can be closed? In progress (28/4)

OXFORD 120019 (7/3)
Talking of tickets that probably can be closed, I think this CMS subscription change request issue is solved? Either way it could do with an update. In progress (29/4)

RHUL 121516 (12/5)
A biomed ticket, possibly the same networking problems affecting them that affected atlas jobs (121540). It looks like this ticket snuck past your sentries, and could do with acknowledgment. Assigned (12/5) Update- updated and in progress, hope the networking problems go away.

BIRMINGHAM
121125 (28/4)
Did you chaps have any luck getting your dumps working? Taking a peek myself I see that your dumps directories are still empty. Let us know if you need a hand. In progress (4/5)

Are there any other tickets or issues people want bringing up?

And finally, the Other VO Nagios...

Monday 9th May 2016, 13.00 BST
39 Open UK Tickets this month

So long and thanks for all the jobs - decommissioning tickets.
120973 (Glasgow, 2 WMSes and an LB).
121258 (Tier 1, just one WMS).
120664 (Tier 1, GenScratch disk pool).
Not much else to say, nothing to see here. Move along...

NGI
119995 (7/3)
Cleaning up old uncertified NGS sites. Any joy Jeremy? In Progress (18/4)

NEUGRID CVMFS STRATUM PROBLEMS
121179 (2/5)
The neugrid stratum at the Tier 1 isn't behaving - no site was notified with this ticket so it likely dodged people's notice. I sent it RAL's way- feel free to bounce elsewhere if it isn't a problem at the Tier 1. Assigned (9/5) Update - the submitter confirms things are fixed, it looks like the ticket can be closed.

SUSSEX
Ops tests woes:
121028 (25/4) -cream CE
120735 (11/4) -Availability
120714 (9/4) -CA distro.
Being handled as best Jeremy M can - it looks like the last two issues are on the mend. Not sure about the first one.

118289 (10/12/15)
gridpp pilot role ticket. No news for a while, but hopefully a familiar face will sweep in and save the day soon. On Hold (25/1)

RALPP
120282 (18/3)
Atlas-centric HTTP support ticket. Chris is putting the site in downtime next week to upgrade the dcache hardware and version, and we'll see how this looks after. On hold (6/5)

118628 (5/1)
LZ pilot ticket. No news since testing the test version of Arc didn't go so well, after which Chris decided to wait until they have a newer umd4 CE to try it out on, or at least until the fix makes it into the proper repos. The reminder date has passed, any news? On Hold (22/3)

OXFORD
120019 (7/3)
CMS federation subscription change for Oxford. Kashif has worked on this and it looks like it might be fixed. Any news? In progress (29/4)

121139 (22/4)
Enabling skatelescope.eu on the Oxford VOMS. Kashif kicked it but Robert's tests didn't work, so debugging is ongoing. In progress (6/5)

BRISTOL
121024 (25/4)
CMS transfer problems. Phedex was upgraded, but a few more problems with some dodgy datasets came up - Lukasz seems to have it all in hand though. In progress (6/5)

120455 (29/3)
A spot of self-ticketing, here Lukasz asked CMS to validate their new HTCondor CE. A lot of conversation in the ticket (some regarding CMS multicore), the last entry has Lukasz looking at the CERN Condor accounting daemon. Assigned (could do with being changed to a different status) (9/5)

BIRMINGHAM
121125 (28/4)
The atlas storage dump is missing at Birmingham - Matt is looking for it (I had more trouble than I should have setting up this cron job at Lancaster - I forgot my 'nix-admining basics! The shame!). In progress (4/5)

120948 (20/4)
Ops availability ticket, on hold whilst things recover - naught to see here. On Hold (20/4)

GLASGOW
120135 (11/3)
Another atlas-centric http TF ticket. The ticket could do with an update/on holding. In progress (7/4)

120351 (22/3)
Enabling LSST at Glasgow, on hold awaiting the new identity management system[1]. Alessandra posted a helpful link here - how goes things? (5/5) Update - I noticed that 117706 (enabling pilots for pheno and friends) is done so hopefully this is just a roundtuit?

[1] Robin's started working on a CentOS7 argus server build with ansible at Lancaster if that's relevant to your, or anyone else's, interests.

ECDF
121227 (4/5)
A crusty cream CE is causing ROD Ops test failures at ECDF - Andy and Marcus are deciding its fate. In progress (5/5) Update - the immediate issue was solved, and the ticket closed.

120004 (7/3)
The ARCHER facing test CE suffering ROD failures. Was a decision reached about whether or not to put the service in downtime or similar? I see the CE is in a short downtime at the moment. On Hold (25/4) Update - Andy is unsure what to do and has asked for some advice, or if perhaps a special case can be made for this CE in the monitoring/gocdb.

121285 (8/5)
Fleeting atlas transfer problems, caused by a network blip. The blip has passed, and Marcus asks if there are any more problems seen? Waiting for reply (9/5)

SHEFFIELD
121279 (7/5)
Atlas transfer failures - Elena noticed that the files don't actually exist at Sheffield and will declare them lost forthwith. In progress (8/5)

MANCHESTER

120998 (22/4)
skatelescope.eu VO creation ticket, nearly done. On Hold (4/5)

120430 (24/3)
Enabling Icecube VO at Manchester. It seems quite involved (gpu jobs sound quite exciting!), things look to be moving along nicely. In progress (5/5)

RHUL
121257 (6/5)
ROD ticket for multiple problems - a CE fell over and is being looked at (the CE problems might explain the BDII failures). In progress (6/5)

121231 (5/5)
LHCB pilots dying at RHUL. After finding a few problems and fixing them Govind wonders if problems persist. Waiting for reply (8/5)

QMUL
121245 (5/5)
Friday ROD issues - looks like multiple CEs were/are having a bad time of it. Assigned (5/5)

120352 (22/3)
Enabling LSST at QM. Alessandra posted the link to the information that Dan asked for. In Progress (5/5)

120204 (15/3)
The well-understood problem with lhcb jobs submitting to QM's dual-stack CEs. Waiting on 120586, where there has been no news for a month, although the last entry seemed positive. On Hold (25/4)

100IT (for 100% completeness)
121189 (2/5) - Being handled.
121271 (6/5) - Assigned
(interestingly this ticket asks for support for dteam as a child of 121262).

And Finally...

THE TIER 1
120810 (13/4)
Biomed asked that their castor storage pool that's being decommissioned (see 120664) be set to read-only prior to the decommissioning date. Gareth pointed out that this request is redundant, as the disk pool is set to be made read only as detailed in the decommissioning announcement. On Hold (27/4)

120350 (22/3)
Enabling LSST at RAL. Andrew L reports good progress, still some work to go through. In progress (6/5)

https://ggus.eu/?mode=ticket_info&ticket_id=120920 (19/4)
Sno+ having xrootd problems at RAL. A lot of back and forth going on, the issue is being worked on. In progress (6/5)

117683 (18/11/15)
Castor not publishing glue2. This is being worked on slowly in the background, requires no small amount of dev work. On Hold (5/4)

119841 (1/3)
HTTP support ticket from the HTTP TF. On Hold whilst the developers are consulted. On Hold (26/4)

120954 (21/4)
SRM endpoint simplification for LHCB. At last check it looked good to remove the old alias, with a thumbs up from LHCB. Waiting for reply (should be "In progress" I think) (3/5)

121147 (29/4)
CMS file reading failures at the Tier 1. Andrew L checked things and they looked okay, and asked for some clarification and extra information but no word back. Waiting for reply (29/4)


Tuesday 3rd May 2016, 10.00 BST
36 Open UK tickets this week.

The bank holiday threw me off, but here's what a brief dredge of the tickets this morning dragged up:

NGI (TIER 1?)
121179
I think this ticket about the neugrid.egi.eu cvmfs is meant for the Tier 1 Stratum-1 admins (citing a problem with cvmfs-egi.gridpp.rl.ac.uk).

GET YOUR SKATELESCOPE.eu ON
120998
skatelescope.eu was the name settled on for this VO, IC and OXFORD have got child tickets to roll out the new VO to the backup VOMSESeses.


RALPP
121155
CMS noted that the RALPP PheDex agents decided to take the bank holiday off too. Assigned (29/4)

OXFORD
121175
Oxford got a ticket due to ATLAS using up all their space - as discussed many a time this is not a site problem - thanks to Elena for defending the site's honour.

LIVERPOOL
121092
As seen on TB-SUPPORT when Steve put a call out for advice, Liverpool were/are seeing multicore atlas jobs fail due to a lost heartbeat. Alessandra's digging revealed batch system memory restrictions as the likely culprit, but we can chat about it if it doesn't get brought up elsewhere.

QMUL
120352
Enabling LSST at QMUL - Dan has asked for some LSST details: "What's the software directory? Is it available via cvmfs? Typically how many accounts have you set up at other sites (10 / 50 100) ? No production role needed?". Waiting for reply (29/4)

similarly:
TIER 1
120350
The Tier 1 LSST ticket, this may contain the answers that Dan seeks - as Alessandra notes some VO information seems to have once again disappeared from the Ops Portal.

ECDF
120004
ROD ticket for the Archer-fronting CE, which doesn't really work but needs to look like it's in production for atlas to send tests. How long before this becomes a problem for the ROD Dashboard? Could ATLAS jobs be easily forced to a service in downtime?

GLASGOW
120973
WMS and L&B decommissioning ticket. The Ticket Pedant is saddened by the unchanged default status of this ticket...


Monday 25th of April 2016, 15.30 BST
31 Open UK Tickets.

A NEW CHALLENGER APPEARS
120998 (22/4)
Squire McNab has ticketed himself (which always feels like a weird thing to do) to set up the skatelescope.eu VO on the Manchester VOMS. No doubt many of us will be interested in enabling a SKA VO. Assigned (22/4)

A FEW FEWER WMSes
120973 (21/4)
Glasgow have announced the retirement of their WMSes and Logging and Bookkeeping server at the end of next month, with the Downtime starting in a fortnight (9/5). Assigned (On Hold or In Progress it?) (21/4)

The Tier 1 has a few tickets that piqued my interest:
120954 (21/4)
LHCB would like to amalgamate their endpoints at the Tier 1 - bringing the tape and the disk behind the same name. Brian rounded it out with a question- I think for LHCB. In progress (should be waiting for reply?) (25/4)

119841 (1/3)
This HTTP support ticket almost certainly looks like it should be On Hold, possibly awaiting some development work. In progress (22/3)

Talking of On Hold:
120204 (15/3)
This LHCB ticket for QMUL looks like it should be put On Hold, as it is awaiting an external fix that's outside the site's control (see ticket https://ggus.eu/?mode=ticket_info&ticket_id=120586). In progress (14/4)

And finally:
120019 (7/3)
A CMS ticket asking for a change of federation subscription for Oxford. I know Kashif and Pete are looking at it, but do you need a hand from someone who knows the arcane CMS ways? In progress (5/4).


Monday 18th April 2016, 15.30 BST
33 Open UK Tickets this week.

RALPP having a bad time?
120872 (cms)
120879 (lhcb)
I hope everything's not too (or at all) melty at RALPP. Both tickets still just assigned.

Update - there were bad times, caused by the condor collector filling up its filesystem, but things should be sorted now and both tickets are solved.

BIRMINGHAM
120860 (15/4)
Biomed are once again finding that they're running out of room at Birmingham. It seems like they either are very unsure of what data their users may or may not be producing, or have (possibly unrealistic) views on what other user groups can and should be doing with their data. Assigned (15/4)

MANCHESTER
120706 (8/4)
This Biomed ticket looks like it took the Low Road whilst you were taking the High Road, missing each other along the way. Assigned (13/4) Update - In progress, Biomed have been purged from the Manchester information system and Alessandra has asked for the site to be removed from any static lists.

TIER 1
120664 (7/4)
The ticket tracking the retirement of one of the RAL disk volumes (this one supporting biomed, na62 and mice). All above board, but it could do with being set in progress or on hold. Assigned (7/4)

120810 (13/4)
I think related to the above ticket, Biomed have asked that write access be removed to their volume. In Progress (13/4)

120624 (5/4)
Atlas Consistency Checking Ticket - I don't think this should be in "waiting for reply" any more. Waiting for reply (13/4)

119841 (1/3)
HTTP Task force ticket. No news for a while, but it looked like the situation might be a complicated one to fix - perhaps the ticket needs on holding whilst it's sorted out? In Progress (22/3)

Monday 4th April 2016, 14.00 BST
26 Open UK Tickets this month.

NGI
119995 (7/3)
Uncertified site ticket for the UK - Jeremy is on the case, and there appears to be no need to rush. In progress (4/4)

120588 (4/4)
A fresh ticket, saying we have achieved insufficient "Quality of Support performance" - we had an average of a 1.4 day response time for very urgent tickets during March.

I've looked into this using the ggus report viewer and I believe we're being accused of a crime we only technically committed (if I'm looking at things right). We only had 2 "very urgent" tickets in this period, and one of them the site forgot to put In Progress, so had an erroneous response time of two and a half days. When averaged with the single other very urgent ticket this gave us an average response time > 1. Poor statistics is a right blimmer. I've updated the ticket - which was solved whilst I wrote the report.
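The arithmetic behind that verdict can be sketched in a few lines of Python. Only the 2.5 day figure and the 1.4 day average come from the ticket; the 0.3 day value for the second ticket is inferred so that the mean works out, and the 1 day target is an assumption read off the "> 1" remark above.

```python
# Response times in days for the two "very urgent" tickets in March.
# 2.5 is the erroneous figure quoted above; 0.3 is an assumption, chosen so
# that the mean matches the reported 1.4 days.
response_days = [2.5, 0.3]

# Assumed target, taken from the "> 1" remark above.
threshold_days = 1.0


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


average = mean(response_days)
print(f"average response: {average:.1f} days")
print("over target" if average > threshold_days else "within target")

# With the forgotten ticket left out, the same metric is under the target.
print(f"without the outlier: {mean(response_days[1:]):.1f} days")
```

With a sample of two, a single ticket left in "Assigned" is enough to push the whole NGI over the line.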

The take home from this - please remember to set your tickets In Progress! It does actually matter (kinda).

SUSSEX
118337 (14/12/15)
Sussex Storage down for Sno+ - I assume this is still the case? Jeremy M replied a while ago but no news since. On Hold (15/2)

117894 (23/11/15)
One of the last Atlas Consistency Checking tickets - in a similar state to the former. On Hold (25/1) Update - Solved by Alessandra, can make do without for Sussex

118289 (10/12/15)
gridpp pilots at Sussex- again no news. On Hold (25/1)

I was supposed to poke the Sussex tickets before Easter but local things came up - I will prod them after tomorrow's meeting if we don't get a chance to discuss them during.

RALPP
118628 (5/1)
LZ support at RALPP. Chris tried to roll out the LZ-friendly test version of ARC to a production server but hit a roadblock and had to rollback. Chris is waiting on the fix to go out into the proper repositories, and is interested to see how things fare on a test centos7/umd4 ArcCE he has brewing (no pun intended). On hold (22/3)

120282 (18/3) Atlas HTTP taskforce ticket. Chris has asked that the tests be re-aimed at another, less-loaded server. Waiting for reply (1/4)

OXFORD
120019 (7/3)
A CMS ticket asking the Oxford T3 to change its xrootd federation subscription. Ewan was the chap who first-responded to this ticket, quiet since - it needs some attention. In progress (7/3)

117892 (23/11/15)
The other holdout of the Atlas Storage Consistency Checking tickets, and again in a similar state. In progress (24/3)

120345 (22/3)
An atlas ticket asking Oxford to update their xroot monitoring settings. Kashif battled this issue with Ilija's help, and with luck it can be closed. In progress (31/3)

BIRMINGHAM
119957 (4/3)
A ROD availability ticket after their SE DB crisis, just waiting for the alarms to go green. On hold (31/3)

GLASGOW
117706 (19/11/15)
Pheno (and other?) pilots at Glasgow. Gareth reports that they should have their new identity management system up and running soon (if it arrived on time). On Hold (23/3)

118052 (30/11/15)
ATLAS HTTP Taskforce ticket. Reopened just before Easter after tests started failing with TLS issues. Reopened (24/3)

120351 (22/3)
The first of a few enable LSST tickets - On Hold until the new identity management system is up and running. On hold (23/3)

120135 (11/3)
I'm not entirely sure why you chaps got a second http TF ticket, but you have (for a slightly different issue). In progress (1/4)

EDINBURGH
120004 (7/3)
ROD ticket for the test ARC CE fronting ARCHER, where tests fail as expected. I remember years ago being among many who couldn't think of a good reason to keep the "Production=yes, Monitoring=no" option, so they got rid of it - but it would perfectly apply here. How long can the ROD keep this ticket on hold before the dashboard self-destructs? On hold (29/3)

SHEFFIELD
118764 (12/1)
Another HTTP TF ticket. Elena kicked the services a while ago, but no news since (and the tests are still not passing by the looks of things). In progress (24/2)

114460 (18/6/15)
gridpp pilots at Sheffield. Did you get round to having a look at this? In progress (29/2)

MANCHESTER
120430 (24/3)
Ticket tracking setting up Manchester for Icecube glideins (the coolest of VOs...). It opens with a request to the Manchester site admins to enable their user (looks like just the one pilot DN), but no reply (as the Mancunians might have missed that the ticket has turned on them). Assigned (24/3)

LANCASTER
120412 (24/3)
Atlas deletion errors at Lancaster - caused by a few files badly drained back in 2014. I'm trying to figure out a clever, database-y way of listing all the files on these long gone servers (the best I've got so far is `select * from Cns_file_replica where host like 'fal-pygrid-%';`, but of course the dpns mapping isn't that straightforward). Expect a cry for help soon! In progress (4/4)
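One way to picture the missing step is below: a minimal Python sketch of turning replica rows into full namespace paths by walking up the directory tree. It assumes the namespace table carries `fileid`, `parent_fileid` and `name` columns and that the root entry is named "/"; those names are from memory of the DPNS schema and should be checked against the real database. All hostnames, file names and ids here are made up, and nothing touches a live database.

```python
# Hypothetical namespace rows: fileid -> (parent_fileid, name).
# Real data would come from the DPNS database; these values are invented.
metadata = {
    1: (0, "/"),
    2: (1, "dpm"),
    3: (2, "example.ac.uk"),
    4: (3, "home"),
    5: (4, "atlas"),
    6: (5, "file_a.root"),
    7: (5, "file_b.root"),
}

# Hypothetical replica rows: (fileid, host).
replicas = [
    (6, "fal-pygrid-01.example.ac.uk"),
    (7, "other-server.example.ac.uk"),
]


def full_path(fileid, metadata):
    """Rebuild a path by following parent_fileid links up to the root."""
    parts = []
    seen = set()  # guards against a loop in corrupted data
    while fileid in metadata and fileid not in seen:
        seen.add(fileid)
        parent, name = metadata[fileid]
        if name != "/":
            parts.append(name)
        fileid = parent
    return "/" + "/".join(reversed(parts))


def files_on_hosts(replicas, metadata, prefix):
    """Paths of all replicas whose host name starts with the given prefix."""
    return sorted(
        full_path(fileid, metadata)
        for fileid, host in replicas
        if host.startswith(prefix)
    )


lost = files_on_hosts(replicas, metadata, "fal-pygrid-")
print(lost)
```

The resulting list is the sort of thing that could be handed to atlas when declaring files lost, but only after checking it against what the head node really holds.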

RHUL
119509 (12/2)
Sno+ job directories being cleaned up prematurely. It looks like this problem could have been transient - Matt M submitted some test jobs and didn't see the problem, and is re-testing with some proper work. Hopefully those tests completed okay. In progress (22/3)

QMUL
120352 (22/3)
Request to enable LSST at QM. Dan has asked for a reminder after/during GRIDPP36. On hold (24/3)

120204 (15/3)
LHCB having issues with some of the QM CEs. The reasons for this are unclear - pilots stopped around the start of March and the problem persisted at last check. In progress (17/3)

THE TIER 1
117683 (18/11/15)
CASTOR not publishing GLUE2. It's being worked on in people's spare time - any recent news? If not, maybe progress is slow enough to warrant on-holding the ticket. In progress (17/2)

119841 (1/3)
HTTP TF ticket, this time for LHCB. Proxy functionality isn't working (although regular cert/key pair access is okay) - this functionality was never turned on and is being looked into. In progress (22/3)

120350 (22/3)
Request to enable LSST at the Tier 1. Daniela notes that the Tier 1 will likely hit the same problem as RALPP for LZ (118628), Andrew L concurs. Pool accounts have been requested, things chug along nicely. In progress (22/3)

Monday 21st March 2016, 15.15 GMT
29 Open UK Tickets this week.

After Ewan
Now that Ewan's living it up at his new job the Oxford tickets might need extra shepherding - let us know if you need help Kashif. The tickets are:

117892 (23/11/15)
Atlas consistency checking ticket. On Hold (16/3)

120019 (7/3)
CMS federation subscription change request. In progress (7/3)

120052 (8/3)
HTTP TF ticket. It appears to be looking hopeful though. In progress (14/3)

Whilst we're talking HTTP TF:
GLASGOW 120135 (11/3)
Looks like this ticket has snuck by, or maybe you chaps just never got roundtuit. Assigned (11/3)

SHEFFIELD
117886 (23/11/15) Atlas consistency check ticket - Elena's working on it, but the dump script fails as her DPM has run out of connections. Odd. In progress (21/3) - Update already - Elena ramped up the number of connections in my.cnf and things started working - just having trouble uploading the dumps now.

And I don't like to nag but the other two Sheffield tickets could do with an update:
118764 (http tf) and 114460 (pilot rollout)

QMUL
120204 (15/3)
A dearth of LHCB pilots at QM. Dan spotted that *something* broke at the start of March, and handily gave a list of suspects. Not sure which one is spoiling things though... In progress (17/3)

And that's all from me. The SUSSEX tickets will need chasing up again, I'll do that - plus the NGI ticket 119995 is a bit quiet. Finally, thanks to Alessandra for wrangling the Atlas Consistency Checking tickets.

Update - the RHUL Atlas Consistency Checking ticket looks on the verge of closure: 117881

Other VO Nagios looks clean. Nice one!

Monday 14th March 2016, 14.00 GMT

27 Open UK Tickets.

The Highlight(s):
The HTTP TF Tickets to DPM sites have mostly been reborn, seemingly changing tack from "http ain't working on your DPM" to "this ain't working all that well on your DPM - probably due to https".

The take home message from these tickets is:

"The DPM team strongly recommends disabling https on the disk servers. It is frequently a source of problems and has a significant performance penalty. Access is still authenticated and authorised on the head node which passes a token to the disk, so the setup is secure."

An example of one of these tickets (Manchester, by virtue of being the most recently updated): 120139

And um, that's it for interesting tickets AFAICS (over 50% of our tickets fall under atlas consistency checks, http TF tickets or rolling out pilot accounts). Let me know if I'm missing some excitement somewhere.

Looking at the other VO nagios... nope, that looks fine too (at time of writing). How peaceful...

Monday 7th March 2016, 14.30 GMT

28 Open UK Tickets this month.

NGI
119995 (7/3)
In some kind of clean up operation 5 old NGS sites that are uncertified have been identified for the "chopping block". Assigned (7/3)

ATLAS CONSISTENCY CHECKING SCRIPTS
SUSSEX 117894 On Hold (25/1)
OXFORD 117892 On Hold (12/1)
SHEFFIELD 117886 On Hold (29/1)
MANCHESTER 117885 On Hold (10/1)
RHUL 117881 On Hold (1/2)
QMUL 117880 Waiting for reply (25/2)

SUSSEX
119383 (5/2)
Low availability ticket - site recovering. On Hold (25/2)

118289 (10/12/15)
gridpp pilots, grounded after Matt RB left. Daniela has reiterated the need for this (as banning the site for the gridpp VO will ban it for snoplus too). On Hold (3/3)

118337 (14/12/15)
Sno+ having problems with the Sussex SE. The Sussex SE has been replaced, which will require some work with the Sno+ LFC (or aliasing magic). On Hold (15/2)

RALPP
118628 (5/1)
Getting LZ pilots working at RALPP. After trying out a patched version of ARC on a test CE there still appear to be a few problems with submission - no update for a few weeks though. In progress (15/2)

120006 (7/3)
A freshly squeezed ROD ticket. In progress (7/3) Update - dcache ws restarted just in case, but not sure what's going wrong. Nagios error messages aren't helpful.

BRISTOL
119930 (3/3)
A CMS user having trouble getting a file - it appears GFAL worked where xrdcp didn't. I suspect this ticket can be closed, the user seemed happy (and very polite!). Assigned (can be closed) (4/3) Update - solved

BIRMINGHAM
118155 (4/12/15)
Biomed problems with the Birmingham SE, ending with them greenlighting the removal of all their dark data (which I believe is all the biomed data still left on the SE). Matt's started the purge. In progress (7/3)

GLASGOW
118052 (30/11/15)
HTTP TF ticket - things seem to be intermittently working, Georgios spotted some interesting issues - but at least right now the SE looks all green. In progress (16/2)

117706 (19/11/15)
A pilot ticket, this one pheno-centric. Waiting on some infrastructure work at Glasgow. On hold (15/1)

ECDF
120004 (7/3)
A ROD ticket to the ARCHER facing ARC CE. Andy knows this will be a problem child, and has asked if there's a way to pull it from the ROD monitoring in a way that will still allow it to look in-production to ATLAS? Waiting for reply (7/3)

SHEFFIELD
118764 (12/1)
HTTP TF ticket. Things look a little odd on the probe page, but there's a fair amount of green. Any news? In progress (25/1)

114460 (18/6/15)
Pilot ticket - Elena rolled out the pilots but things didn't seem to work as intended. Any luck with this last week? In progress (29/2)

LIVERPOOL
119983 (4/3)
Some hardware (RAID) faults on a few pool nodes have been causing problems for some atlas users, but the Liver-lads are fighting the good fight. In progress (7/3) Update - solved. But I personally would like to hear about what hardware was failing in the Storage meeting.

RHUL
119795 (28/2)
Atlas transfer error ticket - fallout from the files lost during RHUL's draining troubles. Being declared lost. In progress (28/2) Update - spawned a ticket to track the cleanup: 120009

119509 (12/2)
Sno+ jobs are occasionally failing at RHUL with what looks to be premature sandbox cleanup problems. Govind is back in the saddle, and asked that some jobs be sent his way for testing. In progress (3/3)

QMUL
119013 (21/1)
CMS enabling QM and Glasgow as T3s - although the buck seems to have stopped at QM. After a lot of work it looks like we're waiting on the production team to greenlight the two sites. We might want to chase them up sooner rather than later. Waiting for reply (29/2).

IMPERIAL
119617 (19/2)
The CMS multicore adventure at Imperial. The jobs have run, so that looks good - CMS have asked if there is any form of reservation at the site, to which Simon replied with a resounding "kind of". Waiting for reply (7/3)

100IT
116358 (22/9/15)
Ongoing problems with missing images - work is still continuing on this, but I won't go into it. In progress (2/3)

TIER 1
116864 (12/10/15)
CMS AAA test problems. CMS report that things seem to look better this week (EU redirector open and read tests are OK), and wonder if anything has changed? Has it? In progress (23/2) Update - Andrew L reports nothing changed. Maybe it was the nice Grid Pixies? We don't see them very often!

117683 (18/11/15)
CASTOR not publishing GLUE2. Jens reports that there's been slow progress due to lack of time and ongoing CASTOR upgrade work, but slow progress is better than no progress! In progress (17/2)

Monday 29th February 2016, 15.00 GMT
Link to the 31 Open UK Tickets

A light review this week, some notes:
Still nothing from atlas on the Storage Consistency Check tickets-the ball is firmly in atlas' court.

Sheffield has two tickets that need some love:
118764 (http support)
114460 (pilot rollout)

Plus this Birmingham Biomed ticket has been left hanging (after Biomed gave the go ahead for purging their dark data at the site):118155.
(although I appreciate that Matt has had bigger fish to fry recently! I don't envy having to restore your DPM DB).

Helios is expiring: The Helios VO has hit a spot of bother and asked the Manchester VOMS admins to do...something. Robert has asked for clarification: 119363

And that's all I'll go into.

Looking at the other VO nagios

I see some persistent failures for pheno and t2k with the Imperial SE - a getTURLS failure (failing on the http protocol). I saw something like this at Lancaster but for the life of me can't remember what we fixed. Still I don't think this is a functional functional test!

Monday 22nd February 2016, 15.30 GMT
37 Open UK Tickets this week.

NGI
118930 (18/1)
This information system ticket really needs some attention. Assigned (19/1)

CMS Multicore
Brunel: 119618
Imperial: 119617
RALPP: 119616
CMS are to be rolling out multicore pilots soonish and requested some information to set up their test queues with. Brunel might have missed the ticket, the other two are chugging along nicely. Update - Brunel's updated their ticket, so all's good.

Whilst we're talking CMS
119013 (21/1)
This ticket (wrongly assigned to just QMUL at the moment) seems to have become an odd catchall for enabling Glasgow and QM as Tier 3s. The CMS guys seem to think jobs should be flowing/trickling now, so maybe this can be closed? Assigned (18/2)

RHUL
119509
Govind is away and when the admin isn't looking things start breaking - in the case of this ticket Sno+ have disabled submission to RHUL so the ticket should be On Holded (I didn't want to On Hold the ticket myself, as that's a recipe for the ticket getting forgotten about). Or perhaps someone has a suggestion to tackle the problem? Assigned (12/2)

100IT
119534 (15/2)
ROD ticket for 100IT, where they're accused of failing a test that they shouldn't be failing. David opened a ticket about this (https://ggus.eu/index.php?mode=ticket_info&ticket_id=119513) but it has not received any attention at all - was it submitted to the right group? In progress (22/2)

GLASGOW
118052 (30/11)
HTTP support on the Glasgow SE. You seem to have been "upgraded" to "failing intermittently" (a possible title for my autobiography). Did you change anything to upgrade your status? In progress (16/2)

TIER 1
119389 (5/2)
This LHCB data transfer ticket to the Tier 1 has been waiting for a reply for a while now. Any news from lhcb? Waiting for reply (15/2)

Those 8 Atlas Storage Consistency Check Tickets
A chat about this at the Thursday atlas UK cloud meeting revealed that the chap handling these has gone to Argentina. It was unclear whether this was business, pleasure or as a GGUS fugitive escaping the grumpiness of dozens of site admins.

Updates:

Unsolved but not Forgotten, the tarball glexec tickets
ECDF: 95303
Lancaster: 95299

Can be solved
Brunel: 119682 This ROD ticket looks like it's sorted now. Good stuff!


Monday 15th February 2016, 13.30 GMT

37 Open UK Tickets.
Link to them all: http://tinyurl.com/nwgrnys

A few highlights:

BRUNEL
118740 (10/1)
Atlas MCORE problems at Brunel. Raul has experimented with restricting MC jobs to nodes where the Condor Memory Checking is disabled, with promising results. Waiting for reply (13/2)

QMUL
119013 (21/1)
Enabling CMS T3 - this ticket has been reopened for QM. Dan has asked for some clarification and information with respect to xroot settings for CMS. The status could do with a tweak... Reopened (12/2)

RALPP
118628 (5/1)
The deployment of LZ pilots hitting an arc bug. Chris has managed to get ahold of and deploy the updated packages on his test CE (impressive turnaround!), and wonders if it works now. Waiting for reply (11/2)

And I think that's it - still a lot of atlas consistency checking tickets that I will mention in the Thursday atlas meeting - although I think Alastair and Brian are aware of them.

Other VO Nagios
I haven't looked at this for a while, the Imperial SE seems to have been seeing problems for pheno and t2k.org for nearly a fortnight.

Monday 8th February 2016, 13.30 GMT
43 Open UK Tickets this month. Going over all of them, in kinda-alphabetical order.

NGI
118930 (18/1)
That NGI information ticket, linked to the "wrong" (according to some) information being published by the UK arc CEs. This has haunted us for a while, the consensus was the ticket is a load of B-word and not really worth worrying over - but it does warrant a response (from someone other than Steve J). Assigned (19/1)

SUSSEX
With Matt RB off to pastures green Sussex is in limbo - I'll contact Jeremy M concerning this last week's fresh tickets.

117894 (23/11)
Atlas Consistency Checking. On hold (25/1)

118289 (10/12)
Gridpp Pilots. On hold (25/1)

118337 (14/12)
The Sussex SE was not working for Sno+ - the most serious of these older issues. On hold (25/1)

119383 (5/2)
ROD Availability ticket. Assigned (5/2)

119384 (5/2)
ROD CA distribution ticket. Maybe the two ROD tickets are correlated (i.e. if we fix this one the previous one will soothe itself?) Assigned (5/2)

RALPP
118945 (19/1)
Poor CMS SAM results for RALPP due to digi-reco work pummeling the RALPP storage - Chris has asked for the digi-reco workload to stop at RALPP, then asked for clarification as to why the site was still in unknown state. Waiting for reply (25/1) Solved - it was them, not RALPP - a restart of the SAM services looks to have cleared the issue.

118628 (5/1)
LZ Pilot deployment at RALPP. Chris has submitted a bug report to nordugrid to fix the issue (http://bugzilla.nordugrid.org/show_bug.cgi?id=3529), which was fixed and should be available in the next release. On Hold (26/1) Update - Chris is trying to get hold of a pre-release to test things.

OXFORD
119197 (29/1)
CMS has asked to change some CRAB site configs at T3s - Daniela has asked Chris B if he's the one looking after this for Oxford. Assigned (3/2)

117892 (23/11)
Atlas consistency checks. Ewan has firmly and clearly put this on the backburner. On hold (12/1)

BIRMINGHAM
118155 (4/12)
Biomed having a clear up of their stuff on the Brummie SE. Franck has given the nod for deleting the dark data left in the DPM after their cleanup efforts. It's on their heads now! In progress (2/2)

117890 (23/11)
Another Atlas Storage Consistency Checking ticket. Any chance to have a look at this again? On hold (15/12)

GLASGOW
117706 (19/11)
Another pilot ticket, this time for pheno. Glasgow were going to roll this into their overhaul of their identity management gubbins, but the Universe messed with their plans. How goes things? On hold (15/1)

118052 (30/11)
HTTP support on the Glasgow SE. I suspect progress here took a similar shoeing to the identity management plan - but the ticket could do with an update (and maybe on holding). In Progress (4/1)

ECDF
118787 (12/1)
Another HTTP ticket. Let us know if you need a hand Marcus and Andy. Or if you're too busy to make this a priority consider on-holding it. In progress (12/1)

95303 (1/7)
Tarball glexec ticket. On hold for a very long time.

An update on this - I managed to put in some good hours on trying to build a relocatable glexec last week, successfully building from source glexec and the lcas/lcmaps stack. *But* I still have rpath problems - short of attacking every lib file with patchelf I'm not sure how to proceed, and the process is such a mess that I'm not sure if I'll ever manage to make it into a proper recipe (much like my cocoa-butter shortbread).
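For the record, the "attack every lib file with patchelf" option might look something like the Python sketch below. It only builds and prints the commands rather than running them. `patchelf --set-rpath` is a real option, but the library paths are hypothetical and the choice of `$ORIGIN` as the rpath is an assumption about what a relocatable layout would want, not a tested recipe.

```python
import shlex


def patchelf_commands(lib_paths, rpath="$ORIGIN"):
    """Build one `patchelf --set-rpath` command per shared library."""
    return [["patchelf", "--set-rpath", rpath, path] for path in lib_paths]


# Hypothetical library paths under a relocatable install prefix.
libs = ["lib64/liblcmaps.so.0", "lib64/liblcas.so.0"]

commands = patchelf_commands(libs)
for cmd in commands:
    # shlex.join quotes the $ORIGIN argument so the shell leaves it alone.
    print(shlex.join(cmd))
```

Printing first makes it easy to eyeball the list before letting anything rewrite binaries. Whether the dynamic linker would honour such an rpath for glexec itself is a separate question that this does not answer.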

SHEFFIELD
119374 (5/2)
A fresh ticket from Biomed, about incorrect/no dynamic information being published at Sheffield. In progress (5/2) Update - see Steve B's post to TB-SUPPORT for clues, Elena is retackling these problems today.

118789 (12/1)
ROD Information system ticket, almost certainly caused by the same underlying issue. Is the bdii service on your CEs silently dying or failing to update?

114460 (18/6)
Gridpp Pilots. Changes were implemented but at last check things weren't working right. How goes it now? In progress (20/1)

117886 (23/11)
Atlas Storage Consistency Check ticket - any luck with this? On hold (29/1)

118764 (12/1)
HTTP support ticket for the Sheffield SE. Have you had a chance to have a look at this? In progress (25/1)

The Storage list can lend a hand fixing either of these issues (which goes for everyone of course).

MANCHESTER
118679 (7/1)
HTTP support (atlas edition). Hit a problem due to there being no outside-a-space-token space at Manchester. On Hold (12/1)

118674 (7/1)
HTTP Support (lhcb edition). As above. On Hold (12/1)

117885 (23/11)
Atlas Storage Consistency Checks - hit the same problem as the previous 2 tickets. On hold (10/1)

118603 (4/1)
A VOMS ticket rather than a site ticket, removal of the nsccs.ac.uk VO. The VO has been removed from the other UK voms servers. In progress (5/2) Update-solved

LANCASTER
95299 (1/7)
Lancaster's glexec tarball ticket. See the entry above - although I really need to update the ticket properly! Practice what you preach, Matt! On hold.

RHUL
119380 (5/2)
ROD Low availability ticket - the site is in the green now, so it's the usual 30-day wait. On hold (8/2)

117881 (23/11)
Atlas SCC ticket. On hold until March. On hold (1/2)

QMUL
117723 (19/11)
Pilots at QM. Dan's been working on this, and asked Daniela for a picture of what should be enabled[1] - Any joy? In progress (27/1)

[1] http://www.hep.ph.ic.ac.uk/~dbauer/dirac/site_pilot_status.html

117880 (23/11)
Atlas SCC ticket (wish I had started using that acronym sooner). Just waiting for the nod from atlas that all is well. Dan included the script he uses that may be useful for other STORM sites. Waiting for reply (4/2)

118985 (21/1)
QM has banished biomed from their queues until QM have a cgroupy solution to the ill-behaved biomed user jobs. Biomed have asked that the ban be reconsidered and problem users be dealt with by the VO. QM are perfectly right to say no to this, but it'll be nice to not leave them hanging. On hold (1/2)

119348 (4/2)
LHCB have noticed cvmfs issues on some nodes, which Dan couldn't replicate. Dan ponders that perhaps this is caused by ephemeral memory issues on the nodes, noting more swap being used recently. Waiting for reply (4/2)

119409 (8/2)
Fresh ROD emi glexec ticket - things exploded at the weekend but the QM admins are fighting the good fight. In progress (8/2)

IMPERIAL
119294 - but this got solved by the time I got to it (it concerned a java update breaking md5).

BRUNEL
117878 (23/11)
Atlas SCC - Raul provided an example and is waiting on atlas to give a yay or nay before deploying. Waiting for reply (18/1)

118740 (10/1)
Atlas MCORE problems at Brunel, looks to be caused by some extreme Condor oddness, Raul reconfigured Condor to give a better view. Any joy? In progress (25/1)

100IT
119002 (Reopened)
116358 (In Progress)
Not going into detail with these as I'm not sure what the crack is with 100IT.

AND FINALLY...

THE TIER 1
118809 (12/1)
The Tier 1 provided feedback on configuring memory limits for batch jobs, the ticket left open for follow up. On hold (13/1)

116864 (12/10)
CMS AAA tests failing. Andrew L reports that the CASTOR headnode has received what sounds like a big fix which will hopefully improve things. In progress (29/1)

119389 (5/2)
LHCB data transfer problem to RAL. Being looked at. In progress (5/2)

117683 (18/11)
Another publishing ticket. How we love those! This one about CASTOR not publishing GLUE 2. Code was written by Jens and Rob but not integrated, something that works might be a long way off. That was a month ago, any news since? In progress (5/2)

109358 (15/10) or (5/2)
This ticket is weird - it started in a "waiting for reply" state and was apparently issued in 2014! I can't find a ticket with this number in my records though. Sno+ are unable to use the RAL WMS - it's being looked at. In progress (5/2)


Monday 1st February 2016, 14.30 GMT
50 Open UK Tickets this week, no Ops meeting scheduled so postponing a full review.

org.bdii.GLUE2-Validate tickets
We have 8 sites with these tickets (7 as Bristol have slain theirs), these are being discussed on TB-SUPPORT. A lot of these are still just assigned though - even if the issue is not really our fault we still need to handle the ticket proper. Rising above it all and all that.

If someone has submitted or knows of a counter-ticket for this issue please let me know.

NGI
Talking about a pain in the Information System, the UK still has this ticket to close (which has a similar root problem): 118930

CMS Siteconf problems.
GLASGOW 119196
EDINBURGH 119195
OXFORD 119197

CMS have spotted a number of misconfigured T3s across the globe (on a Friday afternoon)- the fix seems to be straightforward enough and Glasgow look like they're done already. Proper job!

ATLAS CONSISTENCY CHECKS
We still have 8 tickets open on this issue, although a couple are waiting for feedback from atlas. I'll bring this up in the Thursday UK atlas meeting to see if we can't shimmy along the tickets waiting for atlas feedback.

PILOTS
117723
Whilst investigating pilot issues at QM Daniela reminds us of this page that tells us what Dirac things should be going on at your site. Might be handy to preempt problems:
http://www.hep.ph.ic.ac.uk/~dbauer/dirac/site_pilot_status.html

118628
Whilst rolling out similar changes for LZ at RALPP Chris stumbled upon a problem, for which he submitted a bug report to nordugrid: http://bugzilla.nordugrid.org/show_bug.cgi?id=3529

AND FINALLY

QMUL
118985 (21/1)
Biomed have got back to Dan asking whether, rather than banning them altogether until he has a cgroup-corral to put their jobs in, he would be willing and able to supply a list of the problem users. Of course this requires that there be any non-problem users in the VO... On hold (1/2)

Monday 25th January 2016, 14.30 GMT

"OTHER VO" NAGIOS
Looks like hepgrid2.ph.liv.ac.uk at Liverpool is playing up for all VOs, and the Sheffield SE is misbehaving for the gridpp VO. Other than that it looks clean.

43 Open UK Tickets this week.

That ticket to the NGI...
118930 (18/1)
Steve J put in a comprehensive reply about what Liverpool do to get their publishing kinda right. The view on this ticket from last week was to close it with a <carefully|harshly> worded statement about why this is a bit of a pointless request. Who was formulating the reply? If it was me I dropped that ball! Assigned (19/1)

Pilots Problems.
BRUNEL: 117710 Pheno. On Hold (19/11/15)
QMUL: 117723 Pheno - hopefully sorted. Waiting for reply (25/1)
SHEFFIELD: 114460 gridpp et al. In Progress (20/1)
RALPP: 118628 LZ (and maybe LSST?). In progress (14/1)

We have a few pilot rollout tickets, the last two being worked on but proving problematic.

RHUL
119027 (22/1)
As seen on the gridpp-storage list, Sno+ have asked RHUL (and will no doubt ask others) for storage space (~20TB). In progress (22/1)

(for the interest of others, Govind's other thread on gridpp-storage was likely triggered by https://ggus.eu/?mode=ticket_info&ticket_id=118553)

QMUL
118985 (21/1)
QM have banished biomed from their cluster until they have a batch system that can put Biomed jobs in a c-group cage (looking at slurm). On Hold (21/1)

BIRMINGHAM
118155 (4/12)
Talking of Biomed, they've asked if they've successfully cleaned up all their files on the Birmingham SE - a cheeky uberftp onto your SE suggests the biomed directory is still full of cra.. I mean, files. In Progress (20/1)

HTTP TF Tickets
118787 (ECDF)
118764 (SHEFFIELD)
Feel free to poke the gridpp storage group for help with these. (I left out the 2 Manchester tickets as their immediate showstopper isn't their configs - but they can ask for help too!).

ATLAS CONSISTENCY CHECKS
Manchester, Oxford, Birmingham, Sussex, RHUL, Sheffield, Brunel and QMUL still open - a mix of chugging along nicely and being very much "On Hold".

Monday 18th January 2016, 14.00 GMT
49(!!) Open UK Tickets this week

NGI
118930 (18/1)
The NGI received a ticket concerning incorrect or missing glue information for the Tier 1, Brunel, Imperial, Liverpool, Durham, Glasgow, Bristol, Oxford and RALPP. The variables in question are GlueSubClusterPhysicalCPUs, GlueSubClusterLogicalCPUs and GlueHostProcessorOtherDescription. There are some extra instructions in the ticket - it would be nice if we didn't have to create child tickets (hint hint...).
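
For anyone wanting to check their own publishing before a child ticket turns up, here is a rough sketch of the sort of check involved. It assumes you have already saved the LDIF output of a query against your site BDII to a string; the parsing is deliberately simple and the rule that an empty or zero value counts as "incorrect" is my assumption, not something stated in the ticket. The function names are made up for illustration.

```python
from typing import Dict, List

# The three attributes named in the ticket.
REQUIRED = (
    "GlueSubClusterPhysicalCPUs",
    "GlueSubClusterLogicalCPUs",
    "GlueHostProcessorOtherDescription",
)


def parse_ldif(text: str) -> List[Dict[str, str]]:
    """Split LDIF text into entries (separated by blank lines) of attribute -> value."""
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            # A blank line ends the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        # Skip comments, folded continuation lines and anything without "key: value".
        if line.startswith(("#", " ")) or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        current.setdefault(key, value)  # keep the first value if an attribute repeats
    if current:
        entries.append(current)
    return entries


def find_problems(entries: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {dn: [attributes]} for subcluster entries with missing, empty or zero values."""
    problems: Dict[str, List[str]] = {}
    for entry in entries:
        if "GlueSubClusterUniqueID" not in entry:
            continue  # not a subcluster entry
        bad = [name for name in REQUIRED if entry.get(name, "").strip() in ("", "0")]
        if bad:
            problems[entry.get("dn", "<unknown>")] = bad
    return problems
```

Feeding it the saved query output, e.g. `find_problems(parse_ldif(ldif_text))`, gives a per-subcluster list of the attributes that need attention; an empty result suggests nothing obvious is wrong for these three variables.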

ATLAS CONSISTENCY CHECKS (10 tickets)
Progress, or at least non-exciting but reassuring updates, on these. Birmingham and Glasgow tickets could do with an update (even if it's a "nothing to see here").

The QMUL ticket had an update providing feedback that might be useful to others too:
https://ggus.eu/?mode=ticket_info&ticket_id=117880

HTTP TF (5 tickets)
ECDF, Manchester, Sheffield and Glasgow are on the HTTP TF list - although no tickets are stale at the moment.

TIER 1 RECOMMENDATIONS
118809 (12/1) An interesting ticket asking T0 and T1s to fill in a questionnaire on configuring batch job memory limits - the Tier 1 have done their bit and the ticket has been put On Hold for feedback.

GLASGOW
118732 (9/1)
This ticket has got confusing - atlas want a dump for files "lost" at Glasgow that by the looks of it actually never made it to the site in the first place... Waiting for reply (15/1)

TIER 1 DUPLICATES
Are these three CMS tickets the same (or similar or related) issue - or am I just getting my wires crossed?
118494 (23/12/15)
116864 (12/10/15)
118722 (8/1)

CAN BE CLOSED (I THINK)
IC - 118162 (lfc ticket)
QM - 118839 (atlas mcore job failures - doesn't look like the problem persists).

NEARLY THERE:
Lancaster - 118637 (squid misconfiguration hammering stratum-0)
Birmingham - 118155 (biomed SE use - biomed now think they deleted all data at Birmingham).

Monday 11th January 2016, 14.30 GMT
48(!) Open UK Tickets this week

  • VOMS TWEAK

118603: nsccs.ac.uk has been requested to be removed from the gridpp voms servers. Just "Assigned" to the UK as a whole at the moment.

  • THE HTTP TASK FORCE STRIKES

Lancaster, RHUL and Manchester all had http TF tickets alongside Glasgow. Your site might be next! It'll be worth checking the monitoring pages and reviewing the documentation if you are:
atlas: http://cern.ch/go/h8Rr
lhcb: http://cern.ch/go/Bk8J
https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFSAMProbe

  • TRANSFER ODDITIES

118494: The Tier-1 have a CMS ticket where xrootd is expecting a file which phedex and DAS don't think is at RAL. Is this even a site problem?

118728: In a similar vein, QMUL have an atlas ticket where a single file is refusing to be transferred - Dan has noticed a number of write attempts followed by immediate deletion. Checksumming causing a problem?
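
If checksumming is the suspect, one way to rule it in or out is to compute the checksum of a local copy by hand and compare it with what the catalogue expects. The sketch below assumes the transfer is verified with Adler-32 (that is my assumption, the ticket summary doesn't say) and uses only Python's standard zlib module; the function name is made up.

```python
import zlib


def adler32_hex(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are never read whole.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")
```

Note the zero padding to 8 digits: a checksum with leading zeros compared as an unpadded string is a classic source of false mismatches.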

  • LOW HANGING FRUIT- tickets that can probably be closed, or are close to it.

IMPERIAL 118162
A ticket for the Imperial LFC, which appeared to be working (for Janusz at least).

RALPP 117740
Atlas datadisk cleanup ticket. Elena confirmed that the step09 directory can go for the chop. Not sure if Brian has had a chance at looking at the users directory contents yet.

BRISTOL 118311
I suspect that this CMS SAM ticket can be closed as the CEs were all green.

  • ATLAS CONSISTENCY CHECKS

As requested at the Thursday atlas meeting here are the outstanding consistency check tickets.

IMPERIAL: 117879
Not much news, (understandably) low priority for the site.

SUSSEX: 117894
It doesn't look like Matt got round to this before he left.

SHEFFIELD: 117886
Set in progress but no news since.

OXFORD: 117892
A similar case here - I assume it's on Ewan's to-do list before he heads off to pastures green.

BIRMINGHAM: 117890
Matt was going to look at this again in the New Year. Any joy?

RHUL: 117881
Govind was going to try to get to this before Christmas. Any luck?

GLASGOW: 117889
Back in 2015 the dumps were run and Sam asked for some clarification. Considering Glasgow's current state any dump made using these tools might be full of lies, but I know that you chaps are working on this problem.

BRUNEL 117878
Raul asked some questions in his ticket, to which atlas only replied last week.

QMUL: 117880
Dan has created dumps and has asked for the all clear before he sets up the monthly cron.

TIER 1: 117846
Dumps have been created, but gfal and castor issues have slowed down the checking process (gfal-cat doesn't seem to work with castor).

MANCHESTER: 117885
This ticket was recently On-Holded, as currently Manchester has 0 free space outside of tokens whilst a few disk servers are down.
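
For sites still getting to grips with what these tickets are after, the core of a consistency check is just a comparison of two file lists: what the storage dump says is on disk and what the experiment catalogue thinks should be there. The sketch below is only an illustration of that idea, not the atlas tooling; the function name and the "dark"/"lost" labels are my own.

```python
from typing import Dict, Iterable, List


def compare_dump(storage_paths: Iterable[str],
                 catalogue_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Compare a storage dump against a catalogue listing.

    "dark" files are on disk but unknown to the catalogue;
    "lost" files are in the catalogue but missing from disk.
    """
    # Strip whitespace and drop blank lines so trailing newlines don't cause mismatches.
    on_disk = {p.strip() for p in storage_paths if p.strip()}
    in_catalogue = {p.strip() for p in catalogue_paths if p.strip()}
    return {
        "dark": sorted(on_disk - in_catalogue),
        "lost": sorted(in_catalogue - on_disk),
    }
```

Both arguments can be open file handles, one path per line. A dump taken while disk servers are down will inflate the "lost" list, which is a fair reason to leave a ticket on hold until the storage is healthy again.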

Monday 4th January 2016, 14.30 GMT
HAPPY NEW YEAR EVERYONE!

38 Open UK Tickets this year.

All-the-UK-tickets URL: http://tinyurl.com/nwgrnys

As Jeremy spotted, with Matt RB off to pastures new the Sussex tickets are looking a bit neglected, especially as one was reopened after his departure:
118337
118289

Finally in this Glasgow ticket the submitter gave two new links for the http taskforce monitoring: 118052

The links to the http tf monitoring pages are:
atlas: http://cern.ch/go/h8Rr
lhcb: http://cern.ch/go/Bk8J