Monday 18th May 2015, 14.30 BST
Full review this week.
Other VO Nagios
At time of writing I see problems with test jobs at Brunel for pheno and Liverpool for a number of VOs (see Sno+ ticket for probable cause and fix at Liverpool).
22 Open UK Tickets this week. Going site-by-site:
APEL/NGI
113473 (4/5)
Missing accounting data for April for some sites. Raul is discussing things for Brunel in the ticket, although they have republished. I think it's only ECDF left to republish their April data. In progress (16/5)
OXFORD
113482 (26/4)
Loss of accounting data for Oxford needing an APEL republish. The Oxford guys republished, but there is some confusion over the resulting numbers. Discussion is ongoing; John G is currently looking at the records. In progress (14/5)
113650 (11/5)
CMS glideins failing at Oxford. The original problem was with a config tweak being left out of the cvmfs setup, but the ticket has been reopened citing problems persisting on the ARC CE (the CREAM appears to be fixed). Reopened (16/5)
GLASGOW
113095 (17/4)
ROD ticket about batch system BDII failures, left open to avoid unnecessary ticket filing. Gareth noted that the full migration to ARC and HTCondor, which should see the end of these issues, will hopefully be completed by the end of June. On Hold (12/5)
ECDF
95303 (31/7/13)
Somehow left this one out of the e-mail update. Edinburgh's glexec ticket, dependent on the tarball. I put in my tuppence worth today with my tarball hat on. On hold (18/5)
SHEFFIELD
113769 (18/5)
LHCb see a cvmfs problem at Sheffield. Elena has probably fixed the problem (restarted sssd); just waiting to see if it all pans out. In progress (18/5)
MANCHESTER
113744 (15/5)
For the VOMS rather than the site, Jens' request for the creation of the dirac VO, vo.dirac.ac.uk. In progress (18/5)
113692 (13/5)
A request from pheno to add support for their new cvmfs area at Manchester, and, as I understand it, to support them in a new "form" (pheno.egi.eu). In progress (13/5)
LIVERPOOL
113742 (15/5)
Sno+ noticed their nagios failures at Liverpool. Rob reckons this was a problem with the DPM BDII service certificate not being updated (that's bitten me too), and fixed things this morning. Let's see how that goes. In progress (18/5)
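For anyone bitten by the same thing, openssl's -checkend option is a quick way to spot a service certificate that has expired or is about to. A minimal sketch (the default path is an assumption - pass whatever path your DPM/BDII service certificate actually lives at):

```shell
# check_cert: warn if a service certificate is expired or expiring soon.
# First argument: path to the certificate (default path is an assumption,
# the usual spot on a grid service node). Second argument: warning window
# in days (default 14).
check_cert() {
    cert=${1:-/etc/grid-security/hostcert.pem}
    days=${2:-14}
    # -checkend takes seconds; exit status 0 means still valid then.
    if openssl x509 -noout -checkend $((days * 24 * 3600)) -in "$cert"; then
        echo "OK: $cert valid for at least $days more days"
    else
        echo "RENEW: $cert is expired or expires within $days days"
    fi
    # Print the exact expiry date for the ticket/logbook.
    openssl x509 -noout -enddate -in "$cert"
}
```

Dropping that into a cron job that mails on the RENEW branch would catch this class of failure before Nagios (or Sno+) does.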
LANCASTER
95299 (1/7/13!)
Lancaster's vintage glexec ticket. An update on this - after having a roundtuit session last week I was building glexec for different paths. It still needs some testing to make sure it works properly. However, there definitely won't be a one-size-fits-all tarball solution. On hold (15/5)
100566 (27/1/14)
Only the crustiest old tickets for us at Lancaster! Poor perfsonar performance. Sadly I didn't get a roundtuit on this one - we're pushing to get these nodes dual-stacked, as Ewan had pointed out that it would be interesting to see if IPv6 tests also saw this issue. On hold (18/5)
UCL
113721 (14/5)
The only UCL ticket, this is an EGI "low availability" ticket. However Daniela notes that the plots are on the rise, so things are looking alright. Probably want to "On Hold" it, but otherwise not much to be done. In progress (14/5)
IMPERIAL
113743 (15/5)
A ticket from Durham concerning their site's settings in the Dirac instance at Imperial. Daniela hopes to get it fixed soon. In progress (15/5)
100IT
112948 (10/4)
CA certificate update at 100IT leading to a discussion of other authentication based failures. David has asked for voms information after posting his configs. In progress (13/5)
TIER 1
113035 (14/4)
Ticket tracking the decommissioning of the Tier 1 CREAM CEs. I think things are just about done now, so this ticket can soon be closed. In progress (11/5)
109694 (28/10/14)
Sno+ gfal-copy ticket. Brian reports that the Tier 1 is upgrading gfal2 on their WNs, and notes that there's a lot of active debugging work going on in the area. As he eloquently puts it "situation is quite fluid". In progress (13/5)
108944 (1/10/14)
CMS AAA tests failing at the Tier 1. There's been a lot of work on this, deploying and then trying to get the new xrootd redirector configured. New problems have cropped up, and are under investigation. In progress (11/5)
112721 (28/3)
Atlas transfer failures ("failed to get source file size"). Tracked to an odd double transfer error, possibly introduced in one of the recent "upgrades". Brian has been declaring these files as bad, and a workaround or solution is being thought about. In progress (14/5)
113705 (13/5)
Atlas transfer failures from RAL tape. Checksum failures, which Brian tracked down to the checksums not being of a type Castor supports. Brian has asked if this can be changed at the CERN FTS or in rucio. Waiting for reply (14/5)
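For context: Castor stores adler32 checksums, so a transfer arriving with a different checksum *type* (e.g. md5) fails verification even when the data is fine. A minimal sketch of computing the adler32 value a storage element would compare, using only the Python standard library (the chunked read is just so large files don't need to fit in memory):

```python
import zlib

def adler32_hex(path, chunk=1 << 20):
    """Compute the adler32 of a file as the zero-padded 8-digit hex
    string grid transfer tools typically compare."""
    value = 1  # adler32's defined starting seed
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            # zlib.adler32 accepts a running value, so we can stream.
            value = zlib.adler32(data, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

Handy for a quick sanity check of whether a "checksum mismatch" in the FTS logs is a corrupt file or just the two ends speaking different checksum dialects.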
113748 (16/5)
Another atlas transfer ticket, but as the error indicates no space left at the Brunel space token being transferred to, Elena has noted that this isn't a site problem and told the submitter to put in a JIRA ticket instead. Waiting for reply, but probably can just be closed (16/5)
112866 (7/4)
Lots of cms job failures at RAL. This has been traced to some super-hot files; mitigation is being looked into. A candidate for perhaps On Holding, depending on the time frame of a workaround. In progress (13/5)
113320 (27/4)
CMS data transfer issues. I'm not actually too sure what's going on. There are files that need invalidating, which seems to be the root of the evil befalling transfers. The issue is being actively worked on though. In progress (18/5)