Monday 26th January 2015, 14.15 GMT
Back after I forgot about it for a while:
Other VO Nagios Status:
At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old. An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old.
And of course the srm-$VONAME failures at the Tier 1, which AIUI are caused by an incompatibility between the tests and Castor. Things are generally looking good.
22 Open UK Tickets this week.
The NGI has been asked to upgrade the cloud accounting probe, and then notify our cloud site (the only one at the moment) to republish their accounting. I'm not entirely sure what this entails or who this falls on, so I assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the "notify site" list - odd). Assigned (22/1)
CMS AAA test failures. Andrew Lahiff reported last week that the Tier 1 is preparing a replacement xrootd box. If that will take a while, can the ticket be put on hold? In progress (19/1)
An atlas ticket, asking for httpd access at QMUL. The QM chaps were waiting on a production-ready Storm that could handle this, and are preparing to test one out. This is another ticket that looks like it might need to be put On Hold (will leave that up to you chaps - there's a big difference between "slow and steady" progress and "no progress for a while"). In progress (21/1)
A dteam ticket - concerning http access to RHUL's SE. Although the initial observation about the SE certificate being expired was incorrect (the expiry date was reported as 5/1/15, which to be fair I would read as the 5th of January and not the 1st of May!) there is still some underlying problem here with intermittent test failures. This ticket also raises the question of the context in which these tests are being conducted. Anyone know, or shall we ask the submitter? In progress (26/1)
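For anyone wanting to see the date mix-up spelled out, here is a minimal sketch using only the Python standard library. It parses the reported string 5/1/15 both ways and compares each reading against the ticket's last update; the reference date of 26 January 2015 is an assumption taken from the "(26/1)" status above, and the variable names are purely illustrative.

```python
from datetime import date, datetime

# The expiry string as reported in the ticket.
raw = "5/1/15"

# Assumed reference date: the ticket's last update, 26 January 2015.
ticket_update = date(2015, 1, 26)

# Day-first reading (how most of us in the UK would read it).
day_first = datetime.strptime(raw, "%d/%m/%y").date()

# Month-first reading (what the certificate expiry actually meant).
month_first = datetime.strptime(raw, "%m/%d/%y").date()

for label, parsed in (("day-first", day_first), ("month-first", month_first)):
    expired = parsed < ticket_update
    print(f"{label}: {parsed.isoformat()} expired={expired}")
```

Writing dates in ISO form (2015-05-01) avoids the ambiguity altogether.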
Biomed are having job problems, which look to be caused by using crusty old WMSes to communicate with these sites' shiny up-to-date CEs. According to ticket 110635 a CREAM-side fix should be out by the end of January (CREAM 1.16.5), although Alessandra suggests that Biomed should try to use newer, working WMSes - or Dirac instead!