Difference between revisions of "Past Ticket Bulletins 2018"

From GridPP Wiki

Revision as of 14:25, 19 March 2018

Monday 12th March 2018, 14.30 GMT
42 Open UK Tickets this week.

SUSSEX
133325 (6/2)
This Availability ticket looks like it can be closed, with the alarms having gone green. In progress (8/3)

DURHAM
133338 (7/2)
Is the subject of this Atlas ticket still causing problems? Lots of things were done at the last update - did they fix the issue? In progress (21/2)

TIER 1
133719 (27/2)
This ECHO ticket hasn't had an update since its acknowledgment, any news? In progress (27/2)

133717 (27/2)
Possibly related, this CMS FTS ticket hasn't had an update this month either. In progress (27/2)

Both of these issues look like they're related to this atlas ticket, which has been getting updates: 133752

133619 (21/2)
I have a feeling that this CMS unmerged file ticket can be closed, but I could be misreading the last updates. It's definitely worth checking to see if it is solved. In progress (12/3)

133764 (1/3)
Finally, this Sno+ BDII ticket can be closed, the problem appears to have been at the source. In progress (8/3)


Monday 5th March 2018, 14.30 GMT
44 Open Tickets this month.

IPv6 Deployment Tickets
Sussex: 131617
Possibly on hold until mid-2018.
RALPP: 131616
Chris had an encouraging update back in January, but hit some snags with a new Perfsonar install. Any joy?
OXFORD: 131615
No update since stating you had dual-stacked Perfsonar boxes back in November. Anything to add? Thanks for the update.
CAMBRIDGE: 131614
No progress expected until the Summer of this year. Is this still the case?
BRISTOL: 131613
Last update hoped progress could happen by February, any news? No recent news
BIRMINGHAM: 131612
Some progress on the v6 infrastructure news, hopefully the bugs Mark described a few weeks back can be ironed out.
GLASGOW: 131611
Gareth provided a recent, if not totally positive, update.
ECDF: 131610
There were some interesting times last week when taking the first steps in dual-stacking the ECDF DPM broke things. Keeping to dual-stacking their test DPM for now.
DURHAM: 131609
Last update at the end of January had no positive movement from central IT on v6 deployment.
SHEFFIELD: 131608
This ticket really could do with an update - even an unexciting one.
MANCHESTER: 131607
IIRC I think reverse lookup works only for the Perfsonar boxes - the ticket could do with an update about this.
LIVERPOOL: 131606
Another ticket that could do with an update, even if it's a boring one. John provided a brief update.
UCL: 131604
No news from central IT at last check back in January.
RHUL: 131603
Perfsonar dual-stacked, but DNS lookup not supported yet.

Common or Garden Tickets

SUSSEX
122772 (11/7/16)
Webdav/Xroot ticket. Some good looking progress on getting this to work, although at last check Leo hit some more problems. In progress (7/2)

133325 (6/2)
Availability ticket. Hopefully given another week of smooth running this can be closed. In progress (12/2)

RALPP
133819 (4/3)
LHCB asked RALPP to provide details of nodes without any SSE4.2 support. As Chris instructed, the ticket was reopened by LHCB to request that lhcb jobs do not land on these nodes. Reopened (4/3) Update - solved, the nodes are being decommissioned very soon.
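
For anyone wanting to answer the SSE4.2 question for their own nodes, here is a minimal sketch. It assumes a Linux worker node, where the kernel lists CPU feature flags in /proc/cpuinfo and SSE4.2 shows up as the flag sse4_2. The function names are made up for illustration and are not part of any LHCB or site tooling.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # All cores on a node normally report the same flags,
            # so the first "flags" line is enough for this sketch.
            return set(value.split())
    return set()


def has_sse42(cpuinfo_text):
    """True if the sse4_2 flag is present in the given cpuinfo text."""
    return "sse4_2" in cpu_flags(cpuinfo_text)


def node_has_sse42(path="/proc/cpuinfo"):
    """Check the local node (assumes a Linux /proc filesystem)."""
    with open(path) as handle:
        return has_sse42(handle.read())
```

Running node_has_sse42() on each WN (for example via whatever remote execution tool the site already uses) would give the list of nodes to report in the ticket.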

OXFORD
133809 (3/3)
Availability ticket, caused by the AC troubles. On hold (5/3)

BRISTOL
133762 (1/3)
CMS Transfer problems, on hold until Friday. On Hold (5/3)

133806 (2/3)
CMS asked sites to deploy Singularity by March 2018, this ticket is the follow up. On hold (5/3)

BIRMINGHAM
129930 (4/8/17)
Atlas http SAM tests failing. Any luck with the puppet scripts Kashif shared with you? On hold (13/2)

GLASGOW
133667 (23/2)
LHCB data access problems at Glasgow. The ticket tailed off a bit, Andrew McNab has offered to help compare Glasgow and Manchester settings. In progress (5/3) Update - everything looks good now after Sam updated xroot across the Glasgow storage. Maarten noted in the xroot changelog the likely fix. I should imagine this ticket can be closed now.

DURHAM
133338 (7/2)
Atlas jobs failing at Durham, with the problems likely to be related to the Arc Control Tower handling of pilots. Adam rolled out some changes, have these fixed things? In progress (21/2)

SHEFFIELD
133019 (24/1)
Availability ticket. Ticking along. On hold (1/3)

133810 (3/3)
Sno+ jobs failing due to cvmfs errors on a node, which Elena has offline. I suspect that that's this ticket done with. In progress (4/3)

133770 (2/3)
LHCB jobs failing due to problems on some WNs, Elena has been fixing them, hopefully it's all sorted now. In progress (3/3)

MANCHESTER
133716 (27/2)
Atlas deletion errors - it looks like this ticket has been missed. Assigned (27/2)

QMUL
133402 (9/2)
A good portion of Sno+ jobs failing at QM, due to stage in/out errors. This is likely caused by the reduced network bandwidth being hogged by atlas. Hopefully this will be fixed soon (by restoring the 20GB/s site connection). In progress (22/2)

132713 (4/1)
hyperk.org support ticket. Any news? In progress (6/2)

132929 (18/1)
CMS having problems due to APEL's problem parsing slurm logs (or something like that). APEL support have been called in, but no news yet. In progress (29/1)

IMPERIAL
133683 (24/2)
Atlas seeing a high job failure rate at Imperial, due to problems with their AGIS configs that they have no control over. Elena proposes closing the ticket and moving the conversation to JIRA. In progress (5/3) Update - atlas are waiting on seeing some running jobs before closing the ticket.

133818 (4/3)
Another LHCB ticket asking how many nodes do not have SSE4.2 support. Simon reports there are no plans to decommission these nodes yet. Waiting for reply (5/3)

133723 (27/2)
This is a ticket for the Cloud site, Sno+ saw problems. Simon was investigating, and has offlined the cloud site in Dirac to prevent further failures. In progress (27/2) Update - Simon hasn't managed to reproduce any errors, and has suggested closing the ticket for now, reopening if needed.

132688 (3/1)
Another not really an Imperial ticket, I think this lost Pheno file ticket can be closed soon. In progress (29/1) Update - ticket closed

TIER 1
133719 (27/2)
Atlas spotted transfers failing into Echo. It was being investigated, any news? In progress (27/2)

133752 (1/3)
Atlas noticed the FTS was broken. Whilst investigating, Alastair noted that it appears to be an IPv6 issue. In progress (1/3)

133717 (27/2)
Likely related, a similar sounding CMS ticket. Any news? In progress (27/2)

133619 (21/2)
Missing unmerged CMS files at RAL. Chris has been helping a lot, but has asked CMS to double check his working. Waiting for reply (5/3)

133764 (1/3)
Sno+ ticket about the RAL BDII not having SFU information. It looks like the bdii information has recently changed (for the worse). Any news? In progress (2/3) Update - Karin has updated the ticket saying that things have got a lot worse for Sno+, upping the ticket's priority.

132589 (21/12/17)
LHCB killed pilots ticket. Some more investigations into this show that the problem is getting worse. Any luck with your investigation? In progress (23/2)

132708 (4/1)
WMS decommissioning ticket. Nothing to do here until next month I don't think. In progress (18/1)

127597 (7/4/17)
CMS network performance ticket. No news since Chris' comprehensive update in January. On hold (29/1)

124876 (7/11/16)
ECHO gridftp ROD tests not working, due to problems with the tests. No news on the counter ticket, still. On hold (13/11/17)

117683 (18/11/15)
GLUE2 publishing for Castor. A quick update in January reports a prototype version is being tested. On hold (3/1)

Monday 26th February 2018, 14.30 GMT
37 Open UK Tickets this week.

It still seems like a stagnant time on the ticket front. A few tickets that need a poke include this RALPP ticket: 133390, which has been in waiting for reply for a few weeks, and this QMUL ticket: 132929, waiting for some input (or acknowledgement) from the APEL devs.

Glasgow have a few tickets related to some issues with xrootd playing up in various ways at their site (causing errors for lhcb in 133667 and a return of the classic xroot overload problems in 133690). The tickets are being handled with the usual Glasgow panache, but I thought I'd give an opportunity to talk about them.

For the first time in a while (that I can remember at least) a ticket has been (re-)assigned to atlas-adc-cloud-UK - the IC ticket 133683. The root cause of the problems is likely the move to using QM as IC's DATADISK. It could be interesting to watch (hopefully it won't be though!).

Related to the previous tickets, for the Sussex xroot ticket 122772 it is worth atlas re-engaging with this. Plus perhaps the errors seen could be related to xroot playing up rather than a misconfig?


Monday 19th February 2018, 15.30 GMT
35 Open UK Tickets this week.

IPv6 Tickets.
A quick skim over these - does anyone have anything they want to add?

Bristol
133508 (14/2) CMS sites have been asked to set up Rucio test areas - this one hasn't been spotted yet. The Brunel equivalent (133506) contains possibly useful information. Assigned (14/2)

Tier 1
133421 (12/2) This Sno+ transfer ticket looks like it can be closed, the VO reports that things are fixed. In progress (14/2)

QMUL
132713 (4/1) One of the last hyperk support tickets, Daniela had a suggestion but no news on the ticket since. In progress (6/2)

DURHAM
133338 (7/2) This atlas jobs failure ticket has been reopened, with atlas still seeing issues but not sure about the cause (the jobs complain with "cat: output.list: No such file or directory"). Reopened tickets can often sneak by us so I thought I'd bring this one up. Reopened (17/2)

Monday 12th February 2018, 17.00 GMT
46 Open UK Tickets this week.

Link to all the UK Tickets.

It doesn't feel like a very exciting week for tickets - although it's worth noting that Sno+ seem to be having a ticket drive, cleaning up problems that they're seeing.

There's a RHUL ticket (133409) that needs acknowledging, and there's a few tickets from CMS regarding data transfers that just seem confusing to me (133390 and 133389 at RALPP, 133344 at Imperial) - although sites aren't to blame for this confusion!

Completely anecdotally (citing 133424), is it me or does CVMFS feel less robust recently? It of course could just be me.

Finally I'll take this opportunity to do my bi-annual reminder to sites to please check the status of their tickets - when you start working on a ticket please make sure to set it 'In Progress', when you ask a question please mark the ticket 'Waiting for reply' and when you're not going to make any progress for a while please set the ticket 'On Hold'. Finally finally, it's not really worth leaving tickets for too long before closing them - a day or two is usually more than enough.
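
The status guidance above boils down to a small decision rule, sketched here in Python. The helper and its parameter names are purely hypothetical (this is not a GGUS API), and the ordering of the checks is an assumption: an outstanding question is taken to win over everything else.

```python
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "In Progress"
    WAITING_FOR_REPLY = "Waiting for reply"
    ON_HOLD = "On Hold"


def suggested_status(asked_question=False, progress_expected_soon=True):
    """Suggest a ticket status following the reminder in this bulletin.

    asked_question: the site has asked the submitter something and is
        waiting to hear back.
    progress_expected_soon: the site expects to make progress shortly.
    """
    if asked_question:
        return Status.WAITING_FOR_REPLY
    if not progress_expected_soon:
        return Status.ON_HOLD
    return Status.IN_PROGRESS
```

For example, a ticket blocked on central IT for the next few months would come out as 'On Hold', whilst one being actively debugged stays 'In Progress'.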

Monday 5th February 2018, 15.30 GMT
38 Open UK Tickets this month

IPv6 Tickets
Sussex: 131617 On Hold (15/11/17)
RALPP: 131616 Chris put in a nice update a fortnight ago, citing some perfsonar problems. In progress (31/1)
Oxford: 131615 No recent news on the ticket but I think there's v6 progress at Oxford? On hold (7/11/17)
Cambridge: 131614 On hold (15/11/17)
Bristol: 131613 Early February was the estimated time to get the perfsonar boxes dual stacked, how's that looking? On hold (7/11)
Birmingham: 131612 Duncan poked the ticket last month. On hold (11/11/17)
Glasgow: 131611 I think any further news awaits you chaps moving into your new digs (once they're built). On hold (6/11)
ECDF: 131610 Planning is underway, Raul has kindly offered to help. In progress (5/2)
Durham: 131609 The v6 reverse DNS at Durham is still not working, Adam has provided an update on this. In progress (31/1)
Sheffield: 131608 Is there any way we can help encourage the University to enable v6 for you? On hold (6/11/17)
Manchester: 131607 Duncan reckons you now have v6 reverse DNS lookup, so that's good news. On hold (1/2)
Liverpool: 131606 As further progress here is reliant on some upstream routers getting upgraded maybe this ticket should be put on hold? In progress (14/11/17)
Lancaster: 131605 Lancaster is just waiting on some testing from a v6 only endpoint. I'm working on setting up a v6 only UI to see if that helps. In progress (5/2)
UCL: 131604 Waiting on central IT to get back. On hold (15/1)
RHUL: 131603 RHUL's perfsonar boxen are now dualstacked - nice. On hold (31/1)
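
Several of the tickets above hinge on v6 reverse DNS. As a rough sketch of how a site might check this from a dual-stacked host, the Python below builds the ip6.arpa name for an address using the standard ipaddress module and then attempts a reverse lookup with socket.gethostbyaddr. The helper names are illustrative, and 2001:db8::1 is a documentation-only address, not a real Perfsonar box.

```python
import ipaddress
import socket


def ptr_name(address):
    """Return the reverse-DNS (PTR) name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).reverse_pointer


def reverse_lookup(address):
    """Return the hostname the resolver gives for address, or None.

    Needs a working resolver, so results depend on the host it runs on.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return None


if __name__ == "__main__":
    print(ptr_name("2001:db8::1"))
```

The ptr_name output is the record name that the university or central IT DNS would need to serve for reverse lookup to work; reverse_lookup returning None for a box's v6 address suggests that record is still missing.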

Regular Tickets:

SUSSEX
122772 (11/7/16)
Atlas xroot/webdav ticket. At last word just before Christmas Leo was waiting on some ports being opened up in the external firewall. Any joy? In progress (19/12/17)

RALPP
133250 (5/2/1042)
A ROD ticket - the date looks a bit suspect (I don't think GGUS has been around for that long). The test (ch.cern.WebDAV) and the server failing it (mover.pp.rl.ac.uk) all sound a bit weird too. Assigned (2/2/2018)

133274 (5/2)
CMS xroot failures. Things were fixed by a trusty restart script, but Chris has asked about the state of the AAA network. Waiting for reply (5/2)

OXFORD
133215 (31/1)
Atlas deletion errors on the newly reinstalled Oxford SE. After consulting on the dpm list Kashif tweaked his mysql settings and is in the "wait and see" phase. In progress (5/2)

BRISTOL
133220 (1/2)
CMS hammercloud jobs hitting their wall clock limit - the reason for which is proving a bit of a mystery. Luke has looked into this very closely so far, but it might be some weird emergent property. In progress (2/2)

BIRMINGHAM
132569 (19/12/17)
Dirac pilots not being able to be submitted to Birmingham. I think the problem is well understood, have the affected VOs been removed from the bdii? Assigned (22/1)

129930 (4/8/17)
Atlas http tests failing at Birmingham. Perhaps Kashif might have some insight into this after his recent DPM adventure? Although maybe this ticket will become moot. On hold (16/11/17)

GLASGOW
133115 (29/1)
Checking if the new lhcb conddb cvmfs mount is mounted. For some odd reason some of Glasgow's CEs are failing/not running the tests, despite all the tests running across the same WNs. In progress (5/2) Update - LHCB seem to think this is a problem with the tests, and so the ticket can be closed.

ECDF
133222 (5/2/3164)
A ROD ticket from the distant future! The tests look okay now, so I suspect this ticket can be closed. Waiting for reply (5/2/2018)

SHEFFIELD
133019 (24/1)
Low availability ticket, all good. On hold (30/1)

133260 (3/2)
Atlas transfers failing. Any luck debugging this? In progress (3/2)

MANCHESTER
131526 (1/11/17)
Storage accounting deployment. Were there some roadblocks for this? On hold (12/1)

LIVERPOOL
133114 (29/1)
New LHCB mountpoint ticket. It looks like this ticket was missed. Assigned (29/1)

RHUL
132715 (4/1)
Supporting hyperk.org. Any word on this? In progress (22/1)

QMUL
132713 (4/1)
Support for hyperk.org. Sadly, despite some fixing, errors persist. In progress (5/2)

132929 (18/1)
CMS APEL problem for QM jobs. Due to a problem with SLURM, Dan originally "unsolved" this ticket. Reopened with some useful tips, but the apel team has been involved to check on this, which was the right call. In progress (29/1)

BRUNEL
132876 (16/1)
CMS seeing reading issues at Brunel. After some expert debugging from Raul I think we're waiting on the CERN ticket 133010. In progress (5/2)

IMPERIAL (kinda)
132688 (3/1)
A lost pheno files ticket that bounced back to IC. Just waiting for word back from users (which may take a while). In progress (25/1)

TIER 1
132589 (21/12/17)
Killed LHCB pilots at the Tier 1. There's a proposal to mark the ticket "unsolved", but Vladimir seems reluctant to do this. In progress (31/1)

117683 (18/11/15)
The old Glue 2 publishing for Castor ticket. Last news is that a prototype version is in testing. On hold (3/1)

127597 (7/4/17)
CMS ticket checking xroot and network performance. Chris provided a good news update - new firewall hardware is on its way. However this might not fix things, Chris warns more work might be needed. On hold (29/1)

124876 (7/11/16)
Echo failing gridftp nagios tests - due to the tests being broken. Absolutely no movement on the linked ticket to fix the tests (125026). On hold (13/11/17)

132708 (4/1)
The ticket tracking the decommissioning for the RAL WMSseses. It's going well. In progress (18/1)

Monday 29th January 2018, 15.30 GMT
43 Open UK Tickets this week.

New LHCB mountpoint tickets
LHCB have ticketed a bunch of sites to make sure that they have "/cvmfs/lhcb-condb.cern.ch" accessible on their WNs. It's a simple case of check and close, LHCB will do the verification their end afterwards.
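
For the "check" half of check and close, a minimal sketch of what a site could run on a WN is below. It assumes the usual autofs-managed /cvmfs setup, where listing the directory is what triggers the mount; the function name is hypothetical and this is not the test LHCB themselves run.

```python
import os


def cvmfs_repo_accessible(path="/cvmfs/lhcb-condb.cern.ch"):
    """Return True if the repository path can be listed and is not empty.

    Listing the directory (rather than just testing that it exists) is
    deliberate: with autofs the repository is only mounted on access.
    """
    try:
        return len(os.listdir(path)) > 0
    except OSError:
        # Covers a missing path, a failed mount or a permissions problem.
        return False


if __name__ == "__main__":
    print("lhcb-condb accessible:", cvmfs_repo_accessible())
```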

BIRMINGHAM
132569 (19/12/17)
I'm not sure if some solid actions were planned out that week for this ticket, but it could do with an update. I think the decision was simply to remove the dirac supported VOs from the CREAM CE bdii? Assigned (should be a different status) (22/1)

BRUNEL
132876 (16/1)
I'm not sure what's going on in this CMS xroot ticket, but I'm wondering if the original issue either still exists or was not a Brunel problem after all. This ticket either can be closed, or perhaps put on hold whilst the related CERN ticket is sorted. In progress (23/1)

ECDF
132446 (11/12/17)
It looks like this ticket tracking dirac jobs having batch system problems can be closed after some tweaking of the argus servers. In progress (26/1)

Also I think the corresponding hyperk support ticket 132716 can be closed too.

RHUL
132715 (4/1)
It might well be that you're still in the middle of network maintenance, but a polite reminder of this hyperk support ticket. In progress (22/1)

TIER 1
132712 (4/1)
Still on the hyperk support ticket, this ticket was just waiting on the hyperk configs to get into quattor. Has that happened yet? In progress (23/1) Update - solved

132589 (21/12/17)
Raja has updated the ticket to sadly report that they are still seeing LHCB job deaths at RAL. In progress (29/1) A further update this morning from Vladimir asks to check on a bunch of jobs' statuses.

132708 (4/1)
Just for information, this is the ticket tracking the decommissioning of the RAL WMSses. In progress (18/1)

Monday 22nd January 2018, 15.00 GMT
54 Open UK Tickets this year.

Start with the good news - these tickets look like they can be closed:

BRISTOL
132880 (16/1)
It looks like transfers are working after the firewall fix. In progress (19/1) Solved, but CMS have hit Bristol with another xroot ticket: 132990

QMUL
132615 (26/12/17)
After changing the working directory LHCB jobs don't seem to be running out of space anymore, so the ticket can be closed. In progress (20/1)

TIER 1
132712 (4/1)
There seems to be positive news getting hyperK jobs working at the Tier 1, so maybe this ticket is sorted? In progress (22/1)

RALPP
132830 (12/1)
This complex CMS xroot ticket looks likely to be solved (in fact Chris might be closing the ticket as I type). In progress (19/1) Solved

Now onto the bad:

RHUL
132715 (4/1)
This ticket from Daniela about supporting the hyperK VO seems to have gone un-noticed. Can you please notice it? Assigned (4/1)

RALPP
132851 (15/1)
This CMS xroot ticket might be related to the one above, hence why it's not been tended to (indeed it might be able to be closed too). There's a request for some verbose output of an xrdcp from different CMS peeps, so the conversation is out of the site's hands for now. In progress (17/1)

QMUL
132713 (4/1)
Fixing hyperk jobs at QM on a couple of CEs. Dan had a kick at things a while back, how did that work out? In progress (4/1)

BIRMINGHAM
132569 (19/12)
Daniela spotted Dirac problems at Birmingham. Ultimately this is fallout from the Birmingham move to VAC, Daniela has suggested that Mark remove the VOs from the BDII to stop dirac sending jobs to an almost dead CE. Assigned (should be something else) (22/1)

MANCHESTER
132121 (28/11/17)
Any news or progress with this ticket to the VOMS service? There's been no updates with words in them from any site admins. In progress (1/12/17)

TIER 1
132589 (21/12/17)
LHCB pilots are still failing at the Tier 1 at Raja's last post, this ticket could do with an update from the Tier 1's side. In progress (10/1)

And the Ugly are a few tickets that need updates from the VOs:

MANCHESTER
132468 (14/12/17)
Alessandra updated this atlas transfer ticket with news that she has informed atlas of many lost files that were causing the errors. No news from anyone since. Perhaps someone from cloud support could update things? In progress (4/1)

IMPERIAL
132688 (3/1)
Daniela tried to poke Pheno over some lost files, but has had nothing but silence from them. Must have not been important files. Assigned (19/1)

132692 (3/1)
This LHCB ticket is in the same state as the Pheno one- waiting for someone from the VO to acknowledge the lost files. Assigned (3/1)

132683 (3/1)
The atlas equivalent of the previous two, Brian jumped on it when poked through another channel - so maybe these lines of communication aren't getting to where they should? In progress (22/1)

Extra extra...

Raul pointed out on tb-support this Brunel ticket 132876, which points to an IPv6 config issue and has been thrown back towards the T0 to fix things (132993).