Difference between revisions of "Past Ticket Bulletins 2017"

From GridPP Wiki

Revision as of 12:08, 20 February 2017

Monday 13th February 2017, 16.00 GMT
25 Open UK Tickets this Week

ATLAS want your INPUT
126184 (26/1)
Atlas request for input on sites monitoring. In last week's cloud meeting Alastair asked if anyone had any input for this. If you do, feel free to add to the Google doc linked in the ticket or email your points to the cloud support mailing lists. In progress (7/2)

TOKEN AFFECTION
126554 (10/2)
Sno+ jobs failed at Liverpool, and once again John B had to educate a user group that space tokens are a thing (thanks John!). Would everyone who supports Sno+ be willing to roll out a space token for them? We don't know at this stage how much space would be needed; at this point it seems to be mainly for job stage-back. In progress (13/2)

UNRELIABLE AVAILABILITY
126349 - ECDF
125743 - RALPP

Both of these availability tickets are confusing the sites and myself (although the latter is still quite easy to do). ECDF are getting negative results again (and a lot of unknowns) and RALPP seem not to be updating results very often at all, suffering a several-day lag by the looks of it.


Monday 6th February 2017, 14.30 GMT
23 Open UK tickets this month (originally listed as 21)

FRESH IN THIS MORNING - BRISTOL
https://ggus.eu/?mode=ticket_info&ticket_id=126454 (7/6) As seen on TB-SUPPORT, CMS are having test failures at Bristol and Winnie is left without CMS site support at the moment. I see some replies already on the list; I'll leave this slot here for hopefully helpful discussion. On Hold (7/6)

SUSSEX
125503 (9/12/16)
Sno+ file download failure ticket, due to the wrong SE name in the LFC for the files. Jeremy M reports that he is looking into creating a DNS alias and asking the CA sage (aka Jens) to shape the necessary certificate. In progress (30/1)

122772 (11/7/16)
Webdav/xroot deployment ticket from atlas. Jeremy M reports the appointment of their new admin, which is great stuff. This is one of the first things on his todo list. I'll repeat the usual "we're here to help" message. No point suffering in silence! On hold (26/1)

Fresh in last night - 126438 - atlas seeing srmPut failures, but the error is 'file already exists'. A problem with rucio?

RALPP
125743 (27/12/16)
An availability ticket. A few blips on the nagios page, but I don't think there's anything to see here really. On Hold (29/1)

125815 (5/1)
Atlas ticket regarding space not being released after deletion. Chris has beaten his dcache into shape, and asked for the deletions to be re-attempted. Waiting for reply (30/1)

OXFORD
126371 (4/2)
Atlas transfer failures. Kashif spotted that the dpm-gsiftp daemon had failed, and got it back up. I suspect this ticket can be closed if the daemon is stable? In progress (4/2)

121924 (2/6/16)
Perfsonar rate ticket? Any news? If not, is there likely to be any? On Hold (5/12/16)

125822 (5/1)
The Oxford edition of the "Space not released after deletion" issue. Kashif too has been tinkering with his SE, tweaking and (re-)starting httpd daemons, and asks for a fresh list of files to check. Waiting for reply (27/1)

BIRMINGHAM
126131 (24/1)
Availability ticket. The numbers are on the mend so the ticket is On Hold (30/1)

GLASGOW
125867 (9/1)
LHCB seeing cvmfs-related job failures on WNs at Glasgow. Gareth has updated cvmfs across the Glasgow nodes and asks if the issue has calmed down. Waiting for reply (31/1)

124052 (25/9/16)
Another LHCB ticket, about the arc publishing incorrect job numbers. Gareth provided an update regarding the Glasgow plans, rolling the fix for this into the CentOS 7 migration. Thanks Gareth! On Hold (31/1)

EDINBURGH
126349 (3/2)
Another availability ticket, although today's numbers look to be okay so hopefully the cause of the troubles has passed. Looks like this ticket hasn't been noticed yet though. Assigned (3/2) Andy noted that the argo numbers seem nonsensical with negative availability for a few days! But things are on the mend now. Looks like a simple case of On Holding the ticket for the next 26 days.
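A quick way to sanity-check figures like these is sketched below. It assumes availability is computed as uptime divided by the time with known status (total minus unknown), which is my understanding of the ARGO definition rather than something stated in the ticket; the function names and the sample hours are hypothetical. Under that definition a negative value cannot come from real test results, so it points at a reporting problem rather than a site problem.

```python
from typing import Optional


def availability(up_hours: float, unknown_hours: float, total_hours: float) -> Optional[float]:
    """Availability as a fraction, assuming uptime / (total - unknown).

    Returns None when no time has a known status, since the ratio is undefined.
    """
    known_hours = total_hours - unknown_hours
    if known_hours <= 0:
        return None
    return up_hours / known_hours


def is_plausible(value: Optional[float]) -> bool:
    """A real availability figure must lie between 0 and 1 inclusive."""
    return value is not None and 0.0 <= value <= 1.0


# Hypothetical day: 20 hours up, 2 hours unknown, 24 hours in total.
sample = availability(up_hours=20, unknown_hours=2, total_hours=24)
print(sample, is_plausible(sample))
print(is_plausible(-0.3))  # a negative figure fails the check
```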

LIVERPOOL
124819 (3/11/16)
The last AFS ticket. John B reports that the university has stopped firewalling UDP port 7001 and asks if things are better now. Waiting for reply (3/2)

126167 (25/1)
Decommissioning ticket for the last CREAM CE at Liverpool (which will also see the end of torque at the site). Downtime for the service will be on the 14th (Happy Valentine's Day?) and the service will be switched off properly come the 28th. In progress (30/1)

QMUL
125627 (19/12/16)
Atlas transfers failing to the QM test SE. Dan increased the space to 10TB to soothe the last batch of failures, just waiting to hear if that worked. Waiting for reply (26/1)

126261 (30/1)
Biomed nagios tests not working for ce4 at QM. The problem persists. In progress (2/2)

126312 (1/2)
Atlas spotted QM's squid had fallen over. Dan has noticed problems since upgrading to v3 of frontier-squid, although the issues could also be related to IPv6 on the hosts (of the two squids at QM the one that fell over was also the one that has an IPv6 address in DNS). Keeping the ticket open to see if things stay up. In progress (1/2)
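For sites wanting to check which of their own squids publish an IPv6 address in DNS, a minimal sketch using only the Python standard library follows. The function name and the `resolver` parameter are illustrative choices of mine (the parameter exists only so the lookup can be swapped out when testing); nothing here is taken from the ticket.

```python
import socket


def ipv6_addresses(host, resolver=socket.getaddrinfo):
    """Return the sorted IPv6 addresses that DNS publishes for host.

    An empty list means no AAAA record was found (or the lookup failed).
    """
    try:
        results = resolver(host, None, socket.AF_INET6)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # for IPv6 the sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in results})
```

Calling `ipv6_addresses()` with each squid's host name gives a quick view of which hosts are dual-stacked, which may help when correlating failures with IPv6.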

TIER 1
126296 (1/2)
CMS SAM tests failing against srm-cms-disk.gridpp.rl.ac.uk. All transfers "by hand" pass without trouble, and Gareth points out that this service is not in production in the GOCDB, so tests shouldn't even be running against it! Waiting for reply (6/2) Update - CMS replied that this is the endpoint specified in PhEDEx, which is why it was tested. If this is wrong it will need to be changed.

126376 (5/2)
Another batch of CMS SAM test failures. This includes the srm-cms-disk issue again. John K restarted the CMS xroot directors to try to clear the CE test errors that were being seen - things were looking up. In progress (6/2)

126184 (26/1)
Request from atlas for input on the new site monitoring schemes, linked in the ticket. The appropriate people were being chased. In progress (26/1)

124876 (7/11/16)
echo instance at RAL failing nagios tests due to the tests not using the right path. The ticket addressing this (125026) has had no progress since just before Christmas and so could do with a shake up. On Hold (1/1)

117683 (18/11/15)
Glue 2 publishing for Castor ticket. Did Jens and Rob have any luck tackling this in the pre-Christmas get together? On Hold (7/12/16)

Monday 30th January 2017, 15.15 GMT
24 Open UK Tickets this week

QMUL
126156 (25/1)
A quite interesting ticket from John Gordon regarding QM having >100% efficiency. Within the ticket Dan debugs his homegrown slurm accounting scripts. Possibly of interest to others - some good stuff in this ticket. In progress (26/1)
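For anyone checking their own accounting, a minimal sketch of the efficiency arithmetic follows. It assumes efficiency is defined as CPU time divided by wall-clock time multiplied by allocated cores; the function name and the sample numbers are hypothetical and are not taken from Dan's scripts or the ticket. It shows one way a figure above 100% can arise: a multicore job accounted as if it had used a single core.

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float, cores: int = 1) -> float:
    """Return CPU efficiency as a fraction (1.0 means 100%)."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)


# Hypothetical 8-core job: 7 hours of CPU time used in 1 hour of wall time.
cpu, wall = 7 * 3600, 3600

correct = cpu_efficiency(cpu, wall, cores=8)  # core count taken into account
wrong = cpu_efficiency(cpu, wall, cores=1)    # core count ignored

print(f"with cores: {correct:.1%}, without cores: {wrong:.1%}")
```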

A few other tickets at QM could do with a poke though:
126012 (17/1)
Nagios BDII ticket, problem keeps cropping up.

126234 (28/1)
LHCB pilots failing and jobs not returning output, the ticket likely has snuck by you. Assigned (28/1)

RALPP
126240 (29/1)
Whilst this CMS SAM test failure ticket filled me with righteous indignation with its brevity and lack of reference links, it still could do with acknowledging. Assigned (29/1)

(In fairness given my current coffee consumption it doesn't take much to send me off on one.)

GLASGOW
125867 (9/1)
This LHCB cvmfs ticket threatens to go stale - any word on extra failures (or lack thereof)? In progress (16/1)

Talking of Glasgow tickets looking a bit stale: ticket 124052 (arc publishing ticket last updated in September).

TIER 1
126184 (26/1)
Possibly not intended for general consumption, this is an atlas request for feedback concerning the atlas site monitors. In progress (26/1)

Monday 23rd January 2017, 15.30 GMT
21 Open UK Tickets this week

RALPP
126053 (19/1)
This one piqued my interest - CMS users in Florida are having trouble getting at files, seemingly due to their MTU settings - with their default of 9000 things time out, with 1500 things work. Bristol transfers okay. Chris is investigating. In progress (20/1) Update - solved, the problem mysteriously fixed itself.
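As background on the two MTU values, here is a small sketch of the packet-size arithmetic. It assumes IPv4 and TCP headers without options (20 bytes each) and an 8-byte ICMP header; the function names are illustrative only. The ping payload figure is the largest that fits in a single unfragmented packet, which is a common way to probe whether a path really carries jumbo frames end to end.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload per packet for a given MTU (IPv4, no options)."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


for mtu in (1500, 9000):
    print(f"MTU {mtu}: TCP MSS {tcp_mss(mtu)}, max ping payload {max_ping_payload(mtu)}")
```

If any hop between the two ends only carries 1500-byte frames and fragmentation is not permitted, packets sized for an MTU of 9000 will not get through, which would be consistent with the timeouts described.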

(Also at RALPP is Biomed ticket 126065, which may not have been noticed yet.) Update - in progress

OXFORD
125822 (5/1)
Oxford deletions not working. An observational question - is http working as expected on the Oxford nodes? I ask because when poking my nose around pointing my browser at the Oxford SE got me nothing. The file in question I could access using my dteam credentials (and xroot), so it still exists on disk. In progress (23/1)

121924 (2/6/16)
Perfsonar ticket - a polite reminder: if you (or anyone else) would like help debugging perfsonar transfer problems with some independent "standard" iperf tests, I'm happy to try to help out with them. On hold (5/12)


BIRMINGHAM
Good luck to Mark with his DPM headnode this week! Let us know if you need a hand.

AFS TICKETS (LIVERPOOL and GLASGOW, but mainly Glasgow)
Can you please throw in a soothing update to your AFS tickets when you have a few spare minutes:
124821 - GLASGOW
124819 - LIVERPOOL

SUSSEX SNOPLUS FILES
125503 (9/12/16)
And finally, no news is not good news on this Sno+ ticket for Sussex. It threatens to turn into a game of pass the buck, as the options available to the VO put the responsibility in three very different places. In progress (23/1) Update - Jeremy will look at the DNS alias solution, which requires some certificate magic to be done.

TIER 1
124876
The ticket Daniela mentioned, regarding nagios tests for the echo instance. To quote Daniela "The requirement that machines in production should pass basic tests is really not that onerous."

Monday 16th January 2017, 15.00 GMT
21 Open Tickets this week

Bounced back to Bristol
125558 (13/12/16)
This ticket from Lukasz to CMS, concerning decommissioning a queue in the glidein factories, has been reassigned back to Bristol. Assigned (12/1) Update - solved by the site, the initial query sorted.

ANYONE SEEN SOMETHING LIKE THIS BEFORE?

DURHAM
125845 (6/1)
Durham are having intermittent, hard to explain nagios test failures on their arc CE - seeing a few failures a day. Fishing on the site's behalf, has anyone any suggestions about where to look? In progress (13/1) Update - Thanks to Kashif for his input.

GLASGOW
125867 (9/1)
Another piece of unasked-for meddling by myself: Glasgow are seeing some greedy behaviour from cvmfs on some nodes running lhcb jobs - has anyone seen something similar? In progress (16/1)

AND FINALLY...

SUSSEX
125503 (9/12/16)
As seen on TB-SUPPORT, I stuck my 2-yen's worth in to this Sno+ ticket and got a little out of my depth. Either Sussex will need to alias their new SE to the old one or there will need to be some heavy LFC operations for Sno+ (either by them or the LFC admins). Thanks to Simon, Catalin and Henry for their input. In progress (16/1)

Monday 9th January 2017, 14.30 GMT
HAPPY NEW YEAR!

22 Open UK Tickets this year.

SUSSEX
124614 (24/10/16)
An availability/reliability ticket. The New Year is looking greener on the argo pages for Sussex, so hopefully there will be plain sailing until the alarm clears. On Hold (6/1)

125503 (9/12/16)
Snoplus file download failures. Doing a spot of investigation myself, it looks like the Sno+ guys didn't convert their lfns when Sussex did an SE migration last year; I've informed them thusly. Waiting for reply (9/1)

122772 (11/7/16)
Webdav and xroot frontend ticket. Hopefully the new admin at Sussex will start wrangling this soon. On Hold (21/11/16)

RALPP
125815 (5/1)
A CMS ticket regarding space not being released after deletion. It is likely a dcache problem, but a similar issue was seen at Oxford for atlas (125822). Chris has asked for some problem surls. In progress (5/1)

125743 (27/12/16)
Another availability ticket - I had to dig deep into argo to convince myself tests were running but things are looking okay. On hold (6/1)

OXFORD
125822 (5/1)
Atlas deletion problems at Oxford - probably unrelated to the RALPP issue. There's mention of a similar issue seen at Liverpool, but no specifics - Kashif has asked for more information and supplied a dark data dump. In progress (9/1)

121924 (2/6/16)
Perfsonar throughput drop ticket. Suspected to be a problem with just the perfsonar tests, it likely warrants a spot of further investigation - perhaps someone with a "regular" iperf endpoint could help? On hold (5/12/16)

BIRMINGHAM
122771 (11/7/16)
xroot/webdav ticket from atlas. Mark finished off 2016 with some good progress - looks like permission issues to my eyes. On Hold (22/12/16)

GLASGOW
125867 (9/1)
lhcb seeing cvmfs problems on some Glasgow nodes. Gareth has his prodding stick out and removed the nodes from production just to be safe. In progress (9/1)

124821 (3/11)
AFS ticket. Not very exciting. On hold (16/11/16)

124052 (25/9)
LHCB arc job number publishing ticket. I believe tackling this is on the to-do list. On hold (26/9/16)

DURHAM
125845 (6/1)
ROD arc ce test ticket - I think this snuck by the Durham admins, understandable on the first Friday of the year. Assigned (6/1)

SHEFFIELD
125853 (6/1)
Apel publishing ROD ticket. Elena has fixed things, but it will take some time to trickle through. This ticket will want on holding until then I reckon. Waiting for reply (9/1) Update - solved, tests all green now.

MANCHESTER
125664 (20/12)
This is a ticket to Andrew with his VAC dev hat on, asking for a way to keep VAC and dirac versions in sync. Some good discussion going on. In progress (6/1)

LIVERPOOL
124819 (3/11/16)
Another AFS ticket - John provided an update before on holding it. On hold (16/12/16)

RHUL
125855 (6/1)
Biomed have asked if they're being purposely excluded from accessing ce3. I'm not sure if Raul is back yet, the ticket could do with some fielding. Assigned (6/1/17) Update - solved, biomed enabled on the queues.

QMUL
125627 (19/12/16)
Atlas noticing problems on a test SE at QM, which Dan was trying out a UMD4 install on. On hold (19/12)

TIER 1
125856 (6/1)
LHCB file access ticket, this has been investigated and the Tier 1 team have come back with a few questions. Waiting for reply (9/1)

125157 (24/11/16)
Creation of extras-fp7.eu cvmfs repo - chugging along nicely in spite of the holidays, with most stratum-1 replications in place. In progress (3/1)

124876 (7/11/16)
Ticket following getting nagios tests working for the RAL echo instance. Alastair provided a summary of the issue to start the new year off, with a reference to ticket 125026. On Hold (1/1)

125480 (9/12/16)
Physical/logical core publishing mismatch. After some discussion the ticket was held for the holidays. On Hold (21/12/16)

117683 (18/11/15)
Glue 2 publishing for Castor - Jens and Rob hopefully had a chance to have a bit of a bash at this before Christmas. Hope that went well! On hold (7/12/16)