Difference between revisions of "Past Ticket Bulletins 2017"

Revision as of 12:07, 10 April 2017

Monday 3rd April 2017, 14.30 BST
30 UK tickets this month

SUSSEX
122772 (11/7/16)
Atlas webdav/xroot ticket. Any luck, or would you like a hand at GridPP this week? On hold (26/1)

125503 (9/12/16)
Sno+ ticket about file access problems due to a wrong SE name in the LFC. Any word on this too? I think a plan was put in place. In progress (30/1)

RALPP
126902 (2/3)
CMS ticket, I got a bit lost trying to follow it but a moot point as CMS indicate it can be closed. In progress (3/4)

BRISTOL
126864 (28/2)
Request to enable LZ, Daniela has provided the requested information. In progress (31/3) Update - solved

126865 (28/2)
A CMS ticket from Daniela, concerning ipv6 transfer failures to/from Bristol. Things were looking better, although there is an outstanding question that Winnie highlighted about the CERN setup that perhaps Duncan or someone could answer? In progress (31/3)

BIRMINGHAM
127319 (27/3)
A low-availability ticket. Whilst these are boring, it needs to be tended to (i.e. put In Progress or On Hold). Assigned (27/3) In progress - Mark cites a misbehaving DHCP server causing hassle.

GLASGOW
124052 (25/9)
LHCB ticket concerning incorrect job publishing, to be fixed in the next generation of ARC CEs deployed at Glasgow. Sadly the time has come for another update, even if it's a totally dry one. On Hold (31/1)

127160 (16/3)
An availability ticket. Nothing more to say than that. On hold (16/3)

SHEFFIELD
127210 (19/3)
Atlas transfer timeout failures. After coming out of downtime, failures persist. Perhaps a similar problem to what we saw at Lancaster last week? As per the post to the storage list, those issues were apparently soothed by increasing the DPM threads. In progress (3/4)

MANCHESTER
127464 (3/4)
A very fresh atlas deletion error ticket. In progress (3/4)

127384 (29/3)
LSST authorisation failure ticket. Alessandra has tracked down hopefully all the config errors that crept in during the move from svn to git. Hopefully this is nearly sorted. In progress (31/3)

LIVERPOOL
124819 (3/11/16)
AFS ticket. After the firewall ports were opened the submitter provided some feedback, but no news back from the site. Perhaps just put this ticket out of its misery (like what will soonish happen for AFS itself)? In progress (13/2)

127353 (28/3)
Steve bravely rolled out a small Centos7 test cluster and a Sno+ job accidentally landed on it. They kept it that way to test things out, but sadly it looks like their tests failed and they have asked for their jobs not to land on the test cluster any more. In progress (2/4)

126956 (6/3)
Availability ticket due to the annoying ARC monitoring issues. On hold (27/3)

QMUL
127352 (28/3)
Icecube jobs failing on a QM GPU node - the likely cause has been spotted (old AMD libs sitting on the system with a new nvidia card in it) but it might be a little while till this is fixed. Dan has proposed using this as an opportunity to roll out a Centos7 test node which Icecube were okay with. In progress (31/3)

127144 (15/3)
LHCB saw problems with ce04, which Dan reckons were caused by load and has asked if there are still problems. Waiting for reply (31/3)

126261 (30/1)
A biomed ticket for ce04, although they rechecked if this was still a problem during the aforementioned load problems. There seem to be other errors too though - maybe related to the biomed infrastructure? In progress (31/3)

126650 (15/2)
cern@school errors due to a misconfig in the VO usernames (slurm only does lowercase usernames!). Dan has rolled out the new users and Daniela has rolled out some test jobs. In progress (31/3)

127445 (1/4)
Another biomed submission error ticket, I'm not sure if this is a duplicate of 126261. It looks like a similar error (on ce5 this time though). Assigned (1/4)

BRUNEL
127117 (13/3)
A request from CMS to upgrade the spacemon client. Raul was on it. Any luck with this? Although I've just remembered that Raul is in a different hemisphere so that question might fall on a deaf inbox. In progress (14/3)

127126 (14/3)
Availability ticket, again by the looks of it due to the ARC monitoring playing up. On hold (27/3)

TIER 1
127251 (21/3)
A ticket from an atlas user concerning trouble with transfers into castor and some errors the user is seeing. John has requested more information as the files themselves seem present and correct, but someone who has some idea as to what the error messages listed by the submitter mean would be handy. Waiting for reply (27/3) Update - closed as likely a problem with the user's code.

127449 (2/4)
One of the RAL ARCs wasn't working well for LHCB - but the problems appear to have passed and the ticket can be closed now. In progress (3/4)

126905 (2/3)
CVMFS commissioning for the SOLID experiment. With effort from Daniela and Catalin things all look to be working for solid now with /cvmfs/solidexperiment.egi.eu exported nicely and uploadable to by the VO. Looks like another ticket can be closed. Waiting for reply (29/3) Update - before it gets closed there has been a request for some extra information from Catalin.

127388 (29/3)
LHCB troubles accessing some files at RAL. Have these issues passed with the other castor problems from the weekend? In progress (3/4)

127240 (21/3)
CMS request to run staging tests in prep for Run 2. There was a request from CMS for access to some monitoring plots, I assume for the transfer rates between buffers, but it wasn't very clear. In progress (27/3)

126184 (26/1)
Atlas request for site monitoring input. Alessandra went over this in last week's atlas uk meeting. It's not too late to have your say in the google docs. In progress (7/2)

124876 (7/11)
ROD ticket concerning tests to the RAL echo instance. Alastair's counter ticket (ticket 125026) hasn't had an update since last year - I think it needs a kick. On Hold (1/1)

117683 (18/11/15)
Castor Glue 2 publishing. Rob reported some good progress. On Hold (2/3)

NGI
126808 (24/2)
WMS usage ticket - mainly involving Imperial and the Tier 1. There was some worry from Daniela regarding the closure of old WMS tickets due to it being "no longer supported", but there were reassurances that security bugs would be fixed. Are you feeling reassured? In progress (20/3)

Monday 27th March 2017, 15.15 BST
26 Open UK Tickets this week.

STALE INFORMATION
Nearly a third of the UK tickets are not On Hold but have not received an update in over 10 days. Before I go over all the tickets next week please could everyone check their older tickets. In light of that I'll have a delicate review of the tickets this week.
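
The staleness rule above (not On Hold, and no update in over 10 days) can be sketched roughly as below. This is only an illustration: the ticket records, ids and dates are made up, and it assumes the status and last-update date for each ticket have already been collected by some other means.

```python
from datetime import date

STALE_AFTER_DAYS = 10  # threshold taken from the bulletin text


def is_stale(status, last_update, today):
    """True if a ticket is not On Hold and has gone over 10 days without an update."""
    if status.strip().lower() == "on hold":
        return False
    return (today - last_update).days > STALE_AFTER_DAYS


# Hypothetical records: (ticket id, status, date of last update)
tickets = [
    (1, "In progress", date(2017, 3, 10)),
    (2, "On Hold", date(2017, 1, 1)),
    (3, "Assigned", date(2017, 3, 25)),
]

today = date(2017, 3, 27)
stale = [tid for tid, status, updated in tickets if is_stale(status, updated, today)]
print(stale)
```

On Hold tickets are skipped whatever their age, since being parked is exactly what that status is for.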

ATLAS MONITORING
126184 (26/1)
Has everyone who had input on the atlas site monitoring survey said their piece? If yes then this ticket has done its job. In progress (22/3)

BAG OF ON-HOLDING
126956 (6/3)
I cheekily set this Liverpool availability ticket on hold as per the S.O.P. On Hold (27/3)

I did similar to this just-as-unjust Brunel ticket: 127126

CAN BE CLOSED
126976 (6/3)
This Sno+ ticket looks to be solved (it was fixed by adjusting the acls and directory permissions after implementing the new spacetoken) - so the ticket can probably be closed. In Progress (25/3)

Monday 20th March 2017, 14.30 GMT
25 Open UK Tickets this week.

But first, a ticket from the UK
127224 (20/3) Thanks to Daniela for filing a ticket about the ARC monitoring weirdness seen across sites the last few months. Let's hope the monitoring team can get to the bottom of this. Update - some ticket confusion mixing this up with 126724 from Steve himself.

NGI
126808 (24/2) The "WMS usage" ticket - Daniela notes that after some disheartening closure of the WMS tickets due to lack of support, withdrawal of the service might need to happen sooner rather than later. There is an attempt to reassure us that security bugs would be fixed in spite of the lack of dev effort. In Progress (20/3)

Not ECDF
127223 (20/3) Daniela's quest to sort the dashboard continues with this request to get someone to look at anomalous alarms at ECDF. Sadly the TPM routed it back to the UK. I've tried to bounce it back. Assigned (20/3) In fairness this ticket was re-routed and completed before I finished writing this up.

TIER 1 IPv6
127185 (17/3) Of interest - a ticket from WLCG requesting that the Tier 1 completes a survey about its IPv6 readiness and plans. In Progress (17/3)

OXFORD
126928 (3/3)
Are the transfer failures that prompted this atlas ticket still plaguing the site after Kashif's fix last week? Looking at the DDM plots myself I don't think they are, so it looks like this ticket can be closed. In progress (15/3)

And finally, SUSSEX and BRISTOL have a few tickets that could do with an update. Although the two Bristol tickets are only a few weeks out of date, and I have not followed through with providing support and encouragement to Sussex, so my bad there.

Thursday 9th March 2017
Matt's on leave so no ticket update from him, but console yourselves with a link to all the UK tickets!

Monday 6th March 2017, 15.00 GMT
21 Open UK Tickets this month

TIER 1 and IMPERIAL
126808 (24/2)
WMS "Usage Survey" ticket. Both WMS sites have replied and it has been noted that the two big WMS users in the UK (mice and t2k.org) can be encouraged to use dirac. In progress (28/1)

SUSSEX
122772 (11/7/16)
Atlas webdav/xroot ticket. It's a bit of a baptism of fire for the new admin, but any luck with this? The storage group are always happy to help. On Hold (26/1)

125503 (9/12)
Sno+ file access problems due to what appears to be the SE headnode moving. Any movement here? I think a plan was made at least. In progress (30/1)

RALPP
126902 (2/3)
CMS ticketing RALPP essentially because their multicore jobs aren't getting slots. Chris notes this problem may be compounded by these CMS jobs being particularly RAM-hungry in their resource requirements. It doesn't look like this is a site problem really (as is noted in the ticket). In progress (2/3) Update - CMS have confirmed that this ticket can be closed.

OXFORD
126928 (3/3)
Atlas transfer failures ticket - Kashif spotted that the gridftp service wasn't listening and restarted it - I suspect this ticket can be closed but is it me or has the gridftp service been a bit flakey recently (particularly for Oxford?). In progress (3/3)

BRISTOL
126864 (28/2)
A ticket to track LZ deployment at Bristol. Ticking along. In progress (1/2)

126865 (28/2)
Investigating the cause of IPv6 CMS Phedex transfer failures at Bristol. I think things are looking good after IPv6-ing the GridFTP server; the ticket was discussing how best to debug transfers. In progress (2/3)

GLASGOW
124052 (25/9/16)
LHCB ticket concerning incorrect CPU publishing from the Glasgow ARCs - Gareth was hoping to have things sorted when they rolled out the next generation of Centos7 ARC CEs. Any joy? On hold (31/1)

ECDF
126349 (3/2)
Availability ticket with some very odd figures - luckily the argo team are looking into what's going on. Waiting for reply (1/3)

126957 (6/3)
Nagios SRM-put test failure ticket - but checking the link all seems okay. Waiting for reply (6/3)

LIVERPOOL
126956 (6/3)
A fresh availability ticket, ripe for On Holding for 30 days. Assigned (6/3) Steve notes that the bad figures are another example of the ARC testing problems.

126936 (3/3)
A confused atlas deletion ticket, as they seem to have Liverpool and Lancaster confused. I suspect that the ticket can be closed as things seem okay at Liverpool. I took some steps to soothe Lancaster in case deletion errors persist and the DDM plots looked okay. In progress (6/3) Update - closed after no issues at either site.

124819 (3/11/16)
AFS ticket - in reply to the University opening the requested port it's noted that some hosts are still having problems (firewall on the machines themselves?) and others look to be behind a NAT. Waiting for reply (should be In Progress) (13/2)

QMUL
126650 (15/2)
cern@school pilots failing at QM (submission command failed type errors). The ticket could do with an update (even a null one). In progress (15/2) Dan found the problem, the c@s user accounts had capitals in them but slurm doesn't like that. He's recreating the account.
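
As a rough illustration of the check that would have caught this, the sketch below flags account names containing capitals. The pool account names are invented for the example, and it takes the bulletin's description (slurm not coping with capitals in usernames) at face value rather than asserting anything about slurm itself.

```python
def accounts_with_capitals(usernames):
    """Return the usernames that are not entirely lowercase."""
    return [name for name in usernames if name != name.lower()]


# Hypothetical pool accounts, not the real QM ones
pool = ["cernatschool001", "CernAtSchool002", "cernatschool003"]
print(accounts_with_capitals(pool))
```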

126261 (30/1)
QM CEs not working for biomed. Duncan spots that ce04 and 05 might not be working for CMS either. In progress (3/3) Update - Dan sees biomed jobs running on their cluster. Maybe this is no longer an issue?

126838 (27/2)
Atlas "space reporting issue" ticket. Brian is aiding the investigation, spotting a large amount of possible dark data creation via botched deletions during the switch to webdav. The discussion moves on to problems inherent in using webdav for deletions on storm. This has been talked about in the atlas uk meeting (although I'm afraid I phased out during the conversation). In progress (6/3)

TIER 1
126184 (26/1)
Atlas site monitoring survey ticket. Possibly closing soon? Has feedback been provided? In progress (7/2)

126889 (1/3)
Atlas deletion error ticket for the Tier 1. Again it looks like the problem has gone away, although Tomas kindly provided the error message that was being seen for investigation. In Progress (6/3) Update - Brian closed the ticket with a good explanation of what went on.

126905 (2/3)
Finishing up with deployment of the cvmfs support for solidexperiment.org - focusing in part on accepted upload proxies. This looks to have been done, and Dan the new solidexperiment chap has success in uploading software. Daniela has updated the ticket with a few new questions. In progress (6/3)

117683 (18/11/15)
Castor Glue2 publishing ticket. Rob reports that much of the code has been written. On hold (2/3)

124876 (7/11/16)
The echo instance not working for nagios tests due to the wrong path being used. No movement on the child ticket 125026 - it looks like some chasing is needed to be done by someone. On hold (1/1)

Monday 27th February 2017, 16.00 GMT
20 Open UK Tickets this week - just doing the highlights ahead of a full review next week.

NGI (well the Tier 1 and Imperial)
126808 (24/2)
With reference to the last OMB meeting, this is a request for a gathering of statistics for the remaining UK WMS - with an eye to using these statistics to plan WMS decommissioning. Assigned (24/2)

ECDF asks WTF? (where the last F obviously stands for Flip)
126349 (3/2)
A low availability ticket that has left Andy scratching his head a bit and asking for clarification on what is going on - both at his site and across the UK. The picture is muddled by yet another example of the tests not running regularly on ARC CEs. Waiting for reply (24/2)

Monday 20th February 2017, 16.15 GMT
20 Open UK Tickets this week.

Link to all the UK Tickets.
Whilst the number of tickets for the UK is low, a good few of them are looking a bit neglected.

Atlas Pilots at RALPP
126632 (14/2) This is likely not a site problem, but I directed this atlas ticket RALPP's way to see if Chris can shed a little light onto what's up. Assigned (15/2)

Sno Space at Liverpool
126554 (10/2) After discussion last week Liverpool have rolled a Sno+ spacetoken. John is waiting on news from David to see how things work out, and if it all goes well the rest of us that support Sno+ will be asked to follow suit. In progress (20/2)

Atlas Monitoring Survey
126184 Yet another reminder that atlas are collecting feedback concerning site monitoring - feel free to add to the google doc yourself or forward your thoughts to atlas uk cloud support. In progress (7/2)

Monday 13th February 2017, 16.00 GMT
25 Open UK Tickets this Week

ATLAS want your INPUT
126184 (26/1)
Atlas request for input on site monitoring. In last week's cloud meeting Alastair asked if anyone had any input for this. If you do, feel free to add to the google doc linked in the ticket or email your points to the cloud support mailing lists. In progress (7/2)

TOKEN AFFECTION
126554 (10/2)
Sno+ jobs failed at Liverpool, and once again John B had to educate a user group that space tokens are a thing (thanks John!). Would everyone who supports Sno+ be willing to roll out a space token for them? We don't know at this stage how much space would be needed, at this point it mainly seems for job stage back. In progress (13/2)

UNRELIABLE AVAILABILITY
126349 - ECDF
125743 - RALPP

Both of these availability tickets are confusing the sites and myself (although the latter is still quite easy to do). ECDF are getting negative results again (and a lot of unknowns) and RALPP seem to be not updating results very often at all, suffering a several day lag by the looks of it.


Monday 6th February 2017, 14.30 GMT
23 Open UK tickets this month

FRESH IN THIS MORNING - BRISTOL
https://ggus.eu/?mode=ticket_info&ticket_id=126454 (7/6) As seen on TB-SUPPORT, CMS are having test failures at Bristol and Winnie is left without CMS site support at the moment. I see some replies already on the list; I'll leave this slot here for hopefully helpful discussion. On Hold (7/6)

SUSSEX
125503 (9/12/16)
Sno+ file download failure ticket, due to the wrong SE name in the LFC for the files. Jeremy M reports that he is looking into creating a DNS alias and asking the CA sage (aka Jens) to shape the necessary certificate. In progress (30/1)

122772 (11/7/16)
Webdav/xroot deployment ticket from atlas. Jeremy M reports the appointment of their new admin, which is great stuff. This is one of the first things on his todo list. I'll repeat the usual "we're here to help" message. No point suffering in silence! On hold (26/1)

Fresh in last night - 126438 - atlas seeing srmPut failures, but the error is 'file already exists'. A problem with rucio?

RALPP
125743 (27/12/16)
An availability ticket. A few blips on the nagios page, but I don't think there's anything to see here really. On Hold (29/1)

125815 (5/1)
Atlas ticket regarding space not being released after deletion. Chris has beaten his dcache into shape, and asked for the deletions to be re-attempted. Waiting for reply (30/1)

OXFORD
126371 (4/2)
Atlas transfer failures. Kashif spotted that the dpm-gsiftp daemon had failed, and got it back up. I suspect this ticket can be closed if the daemon is stable? In progress (4/2)

121924 (2/6/16)
Perfsonar rate ticket? Any news? If not, is there likely to be any? On Hold (5/12/16)

125822 (5/1)
The Oxford edition of the "Space not released after deletion" issue. Kashif too has been tinkering with his SE, tweaking and (re-)starting httpd daemons, and asks for a fresh list of files to check. Waiting for reply (27/1)

BIRMINGHAM
126131 (24/1)
Availability ticket. The numbers are on the mend so the ticket is On Hold (30/1)

GLASGOW
125867 (9/1)
LHCB seeing cvmfs-related job failures on WNs at Glasgow. Gareth has updated cvmfs across the Glasgow nodes and asks if the issue has calmed down. Waiting for reply (31/1)

124052 (25/9/16)
Another LHCB ticket, about the arc publishing incorrect job numbers. Gareth provided an update regarding the Glasgow plans, rolling the fix for this into the Centos7 migration. Thanks Gareth! On Hold (31/1)

EDINBURGH
126349 (3/2)
Another availability ticket, although today's numbers look to be okay so hopefully the cause of the troubles has passed. Looks like this ticket hasn't been noticed yet though. Assigned (3/2) Andy noted that the argo numbers seem nonsensical with negative availability for a few days! But things are on the mend now. Looks like a simple case of On Holding the ticket for the next 26 days.

LIVERPOOL
124819 (3/11/16)
The last AFS ticket, John B reports that the university has stopped firewalling UDP port 7001 and asks if things are better now. Waiting for reply (3/2)

126167 (25/1)
Decommissioning ticket for the last CREAM CE at Liverpool (which will also see the end of torque at the site). Downtime for the service will be on the 14th (Happy Valentine's Day?) and the service will be switched off properly come the 28th. In progress (30/1)

QMUL
125627 (19/12/16)
Atlas transfers failing to the QM test SE. Dan increased the space to 10TB to soothe the last batch of failures, just waiting to hear if that worked. Waiting for reply (26/1)

126261 (30/1)
Biomed nagios tests not working for ce4 at QM. The problem persists. In progress (2/2)

126312 (1/2)
Atlas spotted QM's squid had fallen over. Dan has noticed problems since upgrading to v3 of frontier-squid, although the issues could also be related to IPv6 on the hosts (of the two squids at QM the one that fell over was also the one that has an IPv6 address in DNS). Keeping the ticket open to see if things stay up. In progress (1/2)

TIER 1
126296 (1/2)
CMS SAM tests failing against srm-cms-disk.gridpp.rl.ac.uk. All transfers "by hand" pass without trouble, and Gareth points out that this service is not in production in the GOCDB, so tests shouldn't even be running against it! Waiting for reply (6/2) Update - CMS got back that this is the endpoint specified in PhEDEx so this is why it was tested. If this is wrong it will need to be changed.

126376 (5/2)
Another batch of CMS SAM test failures. This includes the srm-cms-disk issue again. John K restarted the CMS xroot directors to try to clear the CE test errors that were being seen - things were looking up. In progress (6/2)

126184 (26/1)
Request from atlas for input on the new site monitoring schemes, linked in the ticket. The appropriate people were being chased. In progress (26/1)

124876 (7/11/16)
echo instance at RAL failing nagios tests due to the tests not using the right path. The ticket addressing this (125026) has had no progress since just before Christmas and so could do with a shake up. On Hold (1/1)

117683 (18/11/15)
Glue 2 publishing for Castor ticket. Did Jens and Rob have any luck tackling this in the pre-Christmas get together? On Hold (7/12/16)

Monday 30th January 2017, 15.15 GMT
24 Open UK Tickets this week

QMUL
126156 (25/1)
A quite interesting ticket from John Gordon regarding QM having >100% efficiency. Within the ticket Dan debugs his homegrown slurm accounting scripts. Possibly of interest to others - some good stuff in this ticket. In progress (26/1)

A few other tickets at QM could do with a poke though:
126012 (17/1)
Nagios BDII ticket, problem keeps cropping up.

126234 (28/1)
LHCB pilots failing and jobs not returning output, the ticket likely has snuck by you. Assigned (28/1)

RALPP
126240 (29/1)
Whilst this CMS SAM test failure ticket filled me with righteous indignation with its brevity and lack of reference links, it still could do with acknowledging. Assigned (29/1)

(In fairness given my current coffee consumption it doesn't take much to send me off on one.)

GLASGOW
125867 (9/1)
This LHCB cvmfs ticket threatens to go stale - any word on extra failures (or lack thereof)? In progress (16/1)

Talking of Glasgow tickets looking a bit stale: ticket 124052 (arc publishing ticket last updated in September).

TIER 1
126184 (26/1)
Possibly not intended for general consumption, this is an atlas request for feedback concerning the atlas site monitors. In progress (26/1)

Monday 23rd January 2017, 15.30 GMT
21 Open UK Tickets this week

RALPP
126053 (19/1)
This one piqued my interest - CMS users in Florida are having trouble getting at files, seemingly due to their MTU settings - with their default of 9000 things time out, with 1500 things work. Bristol transfers okay. Chris is investigating. In progress (20/1) Update - solved, the problem mysteriously fixed itself.

(Also at RALPP is Biomed ticket 126065, which may not have been noticed yet.) Update - in progress

OXFORD
125822 (5/1)
Oxford deletions not working. An observational question - is http working as expected on the Oxford nodes? I ask because, when poking my nose around, pointing my browser at the Oxford SE got me nothing. The file in question I could access using my dteam credentials (and xroot), so it still exists on disk. In progress (23/1)

121924 (2/6/16)
Perfsonar ticket - a polite reminder that if you (or anyone else) would like help debugging perfsonar transfer problems with some independent "standard" iperf tests, I'm happy to try to help out with them. On hold (5/12)


BIRMINGHAM
Good luck to Mark with his DPM headnode this week! Let us know if you need a hand.

AFS TICKETS (LIVERPOOL and GLASGOW, but mainly Glasgow)
Can you please throw in a soothing update to your AFS tickets when you have a few spare minutes:
124821 - GLASGOW
124819 - LIVERPOOL

SUSSEX SNOPLUS FILES
125503 (9/12/16)
And finally, no news is not good news on this Sno+ ticket for Sussex. It threatens to turn into a game of pass the buck, as the options available to the VO put the responsibility in three very different places. In progress (23/1) Update - Jeremy will look at the dns alias solution, which requires some certificate magic to be done.

TIER 1
124876
The ticket Daniela mentioned, regarding nagios tests for the echo instance. To quote Daniela: "The requirement that machines in production should pass basic tests is really not that onerous."

Monday 16th January 2017, 15.00 GMT
21 Open Tickets this week

Bounced back to Bristol
125558 (13/12/16)
This ticket from Lukasz to CMS, concerning decommissioning a queue in the glidein factories, has been reassigned back to Bristol. Assigned (12/1) Update - solved by the site, the initial query sorted.

ANYONE SEEN SOMETHING LIKE THIS BEFORE?

DURHAM
125845 (6/1)
Durham are having intermittent, hard to explain nagios test failures on their arc CE - seeing a few failures a day. Fishing on the site's behalf, has anyone any suggestions about where to look? In progress (13/1) Update - Thanks to Kashif for his input.

GLASGOW
125867 (9/1)
Another piece of unasked-for meddling by myself. Glasgow are seeing some greedy behaviour from cvmfs on some nodes running lhcb jobs - has anyone seen something similar? In progress (16/1)

AND FINALLY...

SUSSEX
125503 (9/12/16)
As seen on TB-SUPPORT, I stuck my 2-yen's worth in to this Sno+ ticket and got a little out of my depth. Either Sussex will need to alias their new SE to the old one or there will need to be some heavy LFC operations for Sno+ (either by them or the LFC admins). Thanks to Simon, Catalin and Henry for their input. In progress (16/1)

Monday 9th January 2017, 14.30 GMT
HAPPY NEW YEAR!

22 Open UK Tickets this year.

SUSSEX
124614 (24/10/16)
An availability/reliability ticket. The New Year is looking greener on the argo pages for Sussex, so hopefully there will be plain sailing until the alarm clears. On Hold (6/1)

125503 (9/12/16)
Snoplus file download failures. Doing a spot of investigation myself, it looks like the Sno+ guys didn't convert their lfns when Sussex did an SE migration last year; I've informed them accordingly. Waiting for reply (9/1)

122772 (11/7/16)
Webdav and xroot frontend ticket. Hopefully the new admin at Sussex will start wrangling this soon. On Hold (21/11/16)

RALPP
125815 (5/1)
A CMS ticket regarding space not being released after deletion. It is likely a dcache problem, but a similar issue was seen at Oxford for atlas (125822). Chris has asked for some problem surls. In progress (5/1)

125743 (27/12/16)
Another availability ticket - I had to dig deep into argo to convince myself tests were running but things are looking okay. On hold (6/1)

OXFORD
125822 (5/1)
Atlas deletion problems at Oxford - probably unrelated to the RALPP issue. There's mention of a similar issue seen at Liverpool, but no specifics - Kashif has asked for more information and supplied a dark data dump. In progress (9/1)

121924 (2/6/16)
Perfsonar throughput drop ticket. Suspected to be a problem with just the perfsonar tests, it likely warrants a spot of further investigation - perhaps someone with a "regular" iperf endpoint could help? On hold (5/12/16)

BIRMINGHAM
122771 (11/7/16)
xroot/webdav ticket from atlas. Mark finished off 2016 with some good progress - looks like permission issues to my eyes. On Hold (22/12/16)

GLASGOW
125867 (9/1)
lhcb seeing cvmfs problems on some Glasgow nodes. Gareth has his prodding stick out and removed the nodes from production just to be safe. In progress (9/1)

124821 (3/11)
AFS ticket. Not very exciting. On hold (16/11/16)

124052 (25/9)
LHCB arc job number publishing ticket. I believe tackling this is on the to-do list. On hold (26/9/16)

DURHAM
125845 (6/1)
ROD arc ce test ticket - I think this snuck by the Durham admins, understandable on the first Friday of the year. Assigned (6/1)

SHEFFIELD
125853 (6/1)
Apel publishing ROD ticket. Elena has fixed things, but it will take some time to trickle through. This ticket will want putting on hold until then, I reckon. Waiting for reply (9/1) Update - solved, tests all green now.

MANCHESTER
125664 (20/12)
This is a ticket to Andrew with his VAC dev hat on, asking for a way to keep VAC and dirac versions in sync. Some good discussion going on. In progress (6/1)

LIVERPOOL
124819 (3/11/16)
Another AFS ticket - John provided an update before putting it on hold. On hold (16/12/16)

RHUL
125855 (6/1)
Biomed have asked if they're being purposely excluded from accessing ce3. I'm not sure if Raul is back yet, the ticket could do with some fielding. Assigned (6/1/17) Update - solved, biomed enabled on the queues.

QMUL
125627 (19/12/16)
Atlas noticing problems on a test SE at QM, on which Dan was trying out a UMD4 install. On hold (19/12)

TIER 1
125856 (6/1)
LHCB file access ticket, this has been investigated and the Tier 1 team have come back with a few questions. Waiting for reply (9/1)

125157 (24/11/16)
Creation of extras-fp7.eu cvmfs repo - chugging along nicely in spite of the holidays, with most stratum-1 replications in place. In progress (3/1)

124876 (7/11/16)
Ticket tracking the effort to get nagios tests working for the RAL echo instance. Alastair provided a summary of the issue to start the new year off, with a reference to ticket 125026. On Hold (1/1)

125480 (9/12/16)
Physical/logical core publishing mismatch. After some discussion the ticket was held for the holidays. On Hold (21/12/16)

117683 (18/11/15)
Glue 2 publishing for Castor - Jens and Rob hopefully had a chance to have a bit of a bash at this before Christmas. Hope that went well! On hold (7/12/16)