Monday 30th June 2014, 14.30 BST
Full Review this week, a little earlier than usual.
28 Open UK Tickets
SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105937 (2/6)
Low availability ticket, due to EMI3 upgrade woes. Most issues have been solved, but APEL publishing problems have been rolled into the ticket. Matt RB seems to be digging his way out in the right direction though. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105618 (21/5)
SNO+ CVMFS unavailable at Sussex. On Hold whilst the other issues are dealt with. On Hold (23/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106492 (25/6)
A request from ATLAS to resize space tokens. Matt also asked if atlashotdisk and atlasgroupdisk could be deleted - Brian gave the nod yes. Probably all done with here? In Progress (27/6)
BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106438 (23/6)
CMS having some trouble running jobs at Bristol (especially having lots of "held" jobs - but reading the ticket this means held on the CMS queue, not in the local batch system). Winnie notes that for at least one of their queues they have over a hundred waiting CMS jobs on a 72-slot shared queue. But it looks like the problem may have evaporated. At last word the CMS submitter said he'd close the ticket if things stayed clear - but that was last Thursday. In Progress (26/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (1/6)
A different CMS ticket, about pilot jobs losing connection to their submission hosts. After another round of nomenclature confusion, it was found that the problem seems to lie between Bristol and the hosts cmssrv119.fnal.gov and vocms97.cern.ch. Lukasz suggests using perfsonar to investigate. Also the dates on this ticket are well off (creation date 1/6, but first update 18/6). In progress (27/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106554 (1/6)
Again the dates on this ticket are very off (creation date was 1/6, but the first update is 29/6) - so the issue may have disappeared. This is another CMS ticket, about a heavy transfer backlog between Bristol and FNAL - if it's still a problem it's possibly linked to the above issue. Waiting on Lukasz to get back. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106058 (9/6)
CMS xrootd problems at Bristol. Also waiting on Lukasz's return (which I think has happened). On Hold (16/6)
EDINBURGH
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/2013)
glexec ticket. No news; the early review meant I couldn't soothe my shame on this matter. On Hold (27/1)
MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105922 (2/6)
Manchester publishing to EMI2 APEL. It's being worked on, but one piece is missing - on hold until this detail is sorted. On Hold (25/6)
LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106406 (23/6)
LHCb having trouble on Lancaster's older cluster. The first issue was CVMFS timeouts, linked to the older WNs being overloaded. The second issue is the CREAM CE losing track of jobs in the batch system. Being worked on, but, as with a case of old age, tuning can only fix so much. In progress (26/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/2013)
glexec ticket. As with ECDF. On Hold (4/4)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Persistent Poor Perfsonar Performance Problems Plaguing Plymouth-born Postdoc... nope, that's as many Ps as I can get (and I'm not sure I still count as a Postdoc). A reinstall of the box hasn't helped. If anyone has a normal 10G iperf endpoint I could test against, that would be great. Other than that, waiting on some networking rejigging at Lancaster to shake things up and give the network engineers another chance to go over things. On Hold (23/6)
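For anyone willing to volunteer an endpoint, the sort of test I have in mind is a plain iperf3 run, something like the sketch below (hostnames and the parallel-stream count are placeholders, not a real offer of a server):

```shell
# On the volunteering 10G endpoint, start a plain iperf3 server:
iperf3 -s -p 5201

# From the Lancaster perfsonar box, run a 30-second test towards it,
# with 4 parallel streams to help fill a 10G pipe
# (far-end.example.ac.uk is a placeholder hostname):
iperf3 -c far-end.example.ac.uk -p 5201 -t 30 -P 4
```

Testing against a vanilla endpoint like this would help separate a perfsonar toolkit problem from a genuine network path problem.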
UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106425 (23/6)
UCL failing ops tests that use their SE. Ben noticed a problem with one of their pools, but fixing it didn't seem to solve the problem. Gareth has asked for an update before being forced to escalate. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/2013)
UCL's glexec ticket. Last word was this would be the first job of a newer staff member, who was due to start within a few months (so about nowish?). On Hold (16/4)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=101285 (16/2)
UCL's perfsonar not working after suffering a hardware failure. Bits have been replaced and the machine was due a reinstall a while ago. On Hold (28/4)
RHUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106437 (23/6)
Atlas have inaccessible file(s) at RHUL due to a pool node in distress. Govind hopes to install a new motherboard tomorrow and will update after. Good luck with the repair! In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105943 (2/6)
Biomed asking for gsiftp access on the RHUL headnode so that they can read the namespace. Govind tried to enable this, but biomed report that it didn't work. Not much word since - but I expect Govind's been busy. In progress (23/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105923 (2/6)
RHUL still publishing to EMI2 APEL too. On Govind's to do list, but low priority. No word for a while. On Hold (17/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106495 (25/6)
Inconsistent storage capacity publishing at RHUL. Govind reckons (quite rightly) that this is due to having a pool node out of commission, and will look at it once that's fixed. In Progress (26/6)
QMUL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105771 (27/5)
Biomed having problems accessing files via https at QM. Chris explains that they've had to switch off https access and are waiting for ticket 105361 to be fixed and StoRM to be updated. On Hold (12/6)
IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106369 (20/6)
Biomed ticket, similar to 105943 for RHUL, but with some added history. Biomed are being a little insistent, and have asked a question that I don't fully understand about path publishing. In Progress (30/6)
IMPERIAL CLOUD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106347 (19/6)
The new cloud site needed some tuning, as VMs weren't using proxies but were hitting the CERN stratum 0 directly. Adam is working on how to get around this - Ewan has mentioned that Oxford have shoal running and have seen accesses from the Imperial Cloud machines - so the problem may have a no-work-required workaround (the best kind!). In Progress (29/6)
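For context, pointing CVMFS clients at a proxy rather than straight at CERN is a client-config matter - a minimal sketch, assuming a squid is available (the squid hostname below is a placeholder, and shoal-discovered squids would replace the static entry):

```shell
# /etc/cvmfs/default.local on the cloud VMs -- illustrative values only.
# Send all CVMFS HTTP traffic via a local squid; DIRECT as last resort.
CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128;DIRECT"
# Fetch from a stratum 1 mirror, never the stratum 0 itself.
CVMFS_SERVER_URL="http://cvmfs-stratum-one.cern.ch/cvmfs/@fqrn@"
```

With a bare `DIRECT` (or no proxy set at all) every VM hammers CERN individually, which is presumably what was being seen.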
EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/2013)
LHCb jobs having openssl-like problems at Jet. No progress on this for a while, but none was expected - the problem survived the move to EMI3, and the Jet admins are stuck. On Hold (12/5)
TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
Vidyo router firewall ticket. I suspect this ticket can be closed, as other issues are being followed up elsewhere - or at the least it needs an update/being set on hold. In Progress (10/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
Inconsistent BDII and SRM storage numbers for LHCb. This has been worked on, and seems almost fixed. There's some debate over the tape figures; Brian points out that the 'online' values are correct. In progress (30/6)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324 (18/6)
CMS pilots losing connection to their submission hosts at RAL. It looks like this has been going on silently for a while; the RAL team are taking it up with their networking chaps to see if it's a firewall issue.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106480 (25/6)
The information publishing police have pointed out that the RAL Castor isn't publishing a sane version number. Brian suspects a rogue ":" is causing the problems.
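If it is a stray colon, the fix in the info provider is just sanitising the string before publishing it - a hedged sketch, with a made-up version value rather than the actual published string:

```shell
# Illustrative only: strip any rogue colons from a version string
# and check the result looks like a sane x.y.z-n version.
raw="2.1.14-15:"          # example of a bad published value (made up)
clean="${raw//:/}"        # bash parameter expansion: drop all ":"
echo "$clean"
if [[ "$clean" =~ ^[0-9]+(\.[0-9]+)*(-[0-9]+)?$ ]]; then
  echo "sane version"
fi
```

Something along these lines in the GIP plugin would stop the validators complaining, whatever the colon's origin.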