Past Ticket Bulletins 2015
Monday 11th May 2015, 14.10 BST
22 Open UK Tickets this week.
TIER 1
There are a few tickets at the Tier 1 that are set "In Progress" but haven't received an update yet this month:
108944 (CMS AAA Tests, 30/4)
112721 (Atlas Transfer problems, 16/4)
109694 (SNO+ gfal copy trouble, 15/4)
112866 (CMS job failures, 7/4)
112819 (SNO+ arcsync troubles, 20/4)
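For anyone wanting to repeat this check, here is a rough sketch of how the list above could be produced. It assumes the ticket numbers and last-update dates have been copied by hand (it does not query GGUS), and the function name stale_tickets is made up for illustration.

```python
from datetime import date

# (ticket id, day, month) of the last update, copied by hand from the list above.
LAST_UPDATES = [
    (108944, 30, 4),
    (112721, 16, 4),
    (109694, 15, 4),
    (112866, 7, 4),
    (112819, 20, 4),
]

def stale_tickets(updates, today, year=2015):
    """Return (ticket_id, days_since_update) for tickets whose last update
    falls before the current month, longest-neglected first."""
    stale = []
    for ticket_id, day, month in updates:
        last = date(year, month, day)
        if (last.year, last.month) < (today.year, today.month):
            stale.append((ticket_id, (today - last).days))
    return sorted(stale, key=lambda item: item[1], reverse=True)

# The bulletin date (11th May 2015) is used as "today".
for ticket_id, age in stale_tickets(LAST_UPDATES, date(2015, 5, 11)):
    print(ticket_id, age, "days since last update")
```

Sorting by age puts the longest-quiet ticket at the top, which is probably the one to prod first.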
Other Tier 1 Tickets (sorry to be picking on you guys!)
111699 (10/2)
Atlas glexec hammercloud test jobs at the Tier 1. It appears to be working, but a batch of test jobs failed because they couldn't find the "mkgltempdir" utility on some nodes ("slot1_5@lcg1742.gridpp.rl.ac.uk" and "slot1_4@lcg1739.gridpp.rl.ac.uk"). In progress (4/5)
113320 (27/4)
Maybe repeating what Daniela is going to say in the CMS update - trouble with CMS data transfers within RAL. It's under investigation, but it looks like the files in question will need to be invalidated - even if it's just to paint a clearer picture. In progress (10/5)
APEL REPUBLISHING
113473
At the last update Brunel, Liverpool, Edinburgh, Birmingham and Oxford still needed to republish. Oxford have their own ticket about it due to complications (113482).
UCL Tickets - Ben is starting to move to close these, some are going to be "unsolved".
GLASGOW
113095 (17/4)
Andrew asks if the timeframe for the move to Condor could be added to this ticket, for the ROD team's information. On Hold (7/4)
100IT
112948 (10/4)
No news on this 100IT ticket for a while. In progress (27/4)
Friday 1st May
The Bank Holiday weekend might muck up plans for a Ticket review this week. Just in case, some links!
Hope you all have a nice weekend!
A quick check of the Other VO Nagios page.
26 Open UK tickets this week.
ITWO Decommissioning
Three of the tickets are to the VOMS sites (Manchester, Oxford, IC), concerning the decommissioning of the ITWO VO. Just an FYI to y'all.
ECDF
113293 (26/4)
There was an APEL problem last month where a lot of sites needed to republish their data for the month. I think Edinburgh are the only UK site that suffered this problem, but another FYI ticket. Assigned (26/4) And solved
113181(21/4)
Atlas production jobs not running at ECDF. Andy noticed that analysis jobs were running fine, and believes that this might be a problem scheduling pilots in time. Perhaps a multicore issue if this is only affecting production jobs. In progress (22/4) Update - solved
ATLAS GLEXEC HAMMERCLOUD PROBLEMS
111699 (TIER 1)
111703 (RALPP)
It was discovered that there was a problem in the test code, so the ball is very much in atlas' court for this one. The problem has been fixed and the tests are being rebuilt and resubmitted.
TIER 1
112721 (28/3)
ATLAS FTS failures to RAL. A rucio issue causing double-transfers has been discovered (here), which would explain the behaviour seen. No news since this revelation. In progress (16/4)
There are a number of other Tier 1 tickets that could do with either an update or being put On Hold.
Monday 20th April 2015, 14.30 BST
24 Open UK tickets this week, only a light review. Update - down to 20 open tickets as of this morning
NGI/TIER 1
113150 (20/4)
Fresh in - the NGI has been ticketed to change the regional VO from emi.argus to ngi.argus in the gocdb. Seems a bit pedantic, but hey! I assigned it to the NGI ops, and notified RAL as keepers of the regional argus. Assigned (20/4) Update - solved
TIER 1
113035 (14/4)
Just for people's interest, the ticket tracking the decommissioning of the last of the RAL CREAM CEs. In progress (14/4)
112819 (2/4)
A SNO+ ticket I must have somehow missed last week, concerning SNO+'s manual renewing of proxies on ARC machines. Matt M has noticed that ArcSync occasionally hangs rather than timing out smoothly (although he later notes that he doesn't see the initial problems working from a different network). I'm thinking that this should be redirected at the arc devs, but I don't think they have a GGUS support group (I could be wrong, I'm well behind on the ARC curve). In progress (7/4)
EDINBURGH
113110 (17/4)
Looks like this atlas low transfer efficiency ticket can be closed. Waiting for reply (20/4) Update - solved
GLASGOW
113095 (17/4)
ROD ticket for some BDII misreporting at Glasgow. The botheration seems to be ephemeral in nature, the blunders passing with the abating of their batch system's burden. This ticket can probably be solved. In progress (17/4)
Monday 13th April 2015, 14.00 BST
24 Open tickets this week - going over all of them this week, site by site.
Fresh in this morning - 113010 and 113011 - Sno+ tickets concerning the RAL and Glasgow WMSes not updating job statuses.
RALPP
111703(11/2)
Atlas glexec hammercloud tests failing. There's been a lot of waiting on atlas to build new HC jobs. The most recent exchange (delayed due to Easter), was asking about SELinux - but no news since the first. In progress (1/4)
BIRMINGHAM
112875(7/4)
Low availability ROD ticket. Availability is crawling back up, just need it to go green. On hold (13/4)
GLASGOW
112967(10/4)
Another ROD ticket for bdii errors at Glasgow. Gareth has been doing everything right investigating this. Kashif recommended ticketing the midmon unit, but Gareth has spotted that the errors correspond to high load on their ARC CE - so it might be a site problem after all - Gareth asks for clarification. Waiting for reply (13/4)
EDINBURGH
95303 (1/7/13)
Tarball glexec ticket. No news (sorry). End of April I believe was the "deadline" I set for having this made. On Hold (9/3)
LANCASTER
100566 (27/1/14)
Lancaster's poor perfsonar performance. I'm not believing quite what I was seeing with the tests I performed so I'm aiming to rerun them. On hold (13/4)
95299 (1/7/13)
Lancaster's tarball glexec ticket. Same as ECDF. On hold (9/3)
BRUNEL
112966 (13/3)
A ROD cream job submit ticket, freshly assigned this afternoon. It's a bit mean of me to bring notice to it. Assigned (13/4) And POW, Raul closed this after kicking torque into shape - solved
100IT
112948 (10/4)
100IT needed to upgrade to the latest CA release. They've done this, but there are still authentication problems. In progress (13/4)
108356 (10/9/14)
Deploying vmcatcher at 100IT. After David's questions falling on deaf ears for a while it has been advised that the ticket be closed as this issue will be dealt with elsewhere. Whether or not it is to be "solved" or "unsolved" is open to debate! In progress (can possibly be closed) (13/4)
TIER 1
108944 (1/10/14)
CMS AAA tests failing at RAL. After a lot of work and new xrootd redirectors, problems persist. It's looking to be a problem that needs the CASTOR and/or xrootd devs to look at. In progress (30/3)
112713 (27/3)
CMS asking to clean up the "unmerged area". Andrew conjured up a list of files and asked if they could be deleted - CMS responded with a "yes please then close the ticket". Has the deed been done? In progress (31/3)
109694 (28/10/14)
The Sno+ gfal copy ticket. Matt M still sees gfal-copy hang for files at RAL when he uses the GUID (SURL works). A Castor oddity perhaps? Matt asks a question about what problems like this (coupled with the move away from lcg tools) will mean for VOs that rely on the LFC. In progress (31/3)
112977 (10/3)
CMS high job failure rate at RAL. Related to 112896 (below) - the jobs all want that file! In progress (13/3)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=112896 (9/4)
CMS Dataset access problems - caused by over a million access attempts on a single file over an 18 hour period. Andrew L comments that CMS needs to have a think about how they access pileup datasets. In progress (9/4)
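To put that "hot file" figure in perspective, a back-of-the-envelope sketch follows. It takes "over a million" as a lower bound of exactly one million and assumes the attempts were spread evenly over the 18 hours, which the ticket does not actually say.

```python
# Lower bound on the access rate for the hot file described above.
# Assumptions: exactly 1,000,000 attempts (the ticket says "over a million")
# and an even spread across the whole 18 hour window.
ACCESS_ATTEMPTS = 1_000_000
WINDOW_HOURS = 18

window_seconds = WINDOW_HOURS * 3600
attempts_per_second = ACCESS_ATTEMPTS / window_seconds

print(f"At least {attempts_per_second:.1f} access attempts per second on one file")
```

So even with the even-spread assumption that is a sustained rate of roughly fifteen opens a second on a single file, and any burstiness would make the peaks worse.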
111699 (10/2)
Tier 1 counterpart to 111703. A new HC stress test was submitted near the end of March, but no news on how it did. In progress (23/3)
112866 (7/4)
A different "lots of CMS job failures" ticket. Again a "hot file" seems to be the root cause. In progress (7/4)
112721 (28/3)
An atlas file access ticket, seemingly caused by some odd FTS behaviour. No answers to Shaun's question about this odd occurrence or much noise at all till today. Waiting for reply (13/4)
UCL
UCL has 6 tickets - 4 just "assigned". I'll just list them in the interests of brevity.
112371 (ROD low availability, On Hold)
112841 (atlas 0% transfer efficiency, assigned)
112873 (ROD srm put failures, assigned)
95298 (glexec ticket, on hold)
112722 (atlas checksum timeouts, in progress)
112966 (ROD job submit failures, assigned)
Tuesday 7th April
- 20 open tickets. Link to GGUS.
Monday 23rd March 2015, 15.30 GMT
19 Open tickets this week.
BIRMINGHAM
https://ggus.eu/?mode=ticket_info&ticket_id=112550 (23/3)
A ticket fresh off the ROD dashboard - the Birmingham CREAMs aren't being matched ("BrokerHelper: no compatible resources"). Matt W has double checked their setup and can't spot anything wrong - they've been running "normal" atlas/lhcb etc jobs fine over the last few weeks. Any advice appreciated. In progress (23/3)
TIER 1
https://ggus.eu/?mode=ticket_info&ticket_id=112495 (21/3)
MICE problems running jobs at RAL, which Andrew L discovered coincided with WMS problems that he fixed. Probably should be in "Waiting for Reply/Seeing if the problem's evaporated". In progress (23/3)
https://ggus.eu/?mode=ticket_info&ticket_id=112350 (14/3)
The cause of this Sno+ ticket, about a recent user not being able to access files due to not being in the gridmap, has been discovered. As Robert F sagely pointed out the latest version of the mkgridmap rpm is required to talk to the voms server. Just waiting on the time for it to be updated now. In progress (17/3)
100IT
https://ggus.eu/?mode=ticket_info&ticket_id=108356 (10/9/14)
This 100IT ticket is still waiting for a reply (since mid-January). The question needs to be answered by someone familiar with the technologies and terminologies. Is anyone up on vmcatcher? Anyone know what other channel to pass David's query onto? Waiting for reply (15/1)
Monday 16th March 2015, 15.30 GMT
16 Open UK tickets this week. Half Red, Half Green.
RALPP and TIER 1 glexec HC tickets
111703 (RALPP)
111699 (TIER 1)
No news on these tickets since it was expected that a new stable HC job would be released last Tuesday. All very quiet.
OXFORD
112011(25/2)
Similar for this CMS glexec ticket - no news after Kashif asked for some more information way back. Waiting for reply (27/2)
TIER 1
109694(28/10/14)
The Sno+ gfal copying ticket. A lot of people are working on this, and attempts to recreate the problems seem to occasionally be devolving into "did we get this complicated command right?". At some point it might be necessary to get the gfal devs involved (are there gfal devs?). Waiting for reply (11/3)
Also there's the poor 100IT ticket, still waiting for a reply. The JET ticket also needs wrapping up, I put a reminder in to the end of it.
Monday 9th March 2015, 15.00 GMT
From last week's crusty ticket round up:
Tier 1
109694- 28/10/14
Matt M has managed to reclaim his tickets after a certificate change orphaned his old ones. Progress has resumed. Duncan has asked Matt to retry his failing tests with a simple copy to local disk example.
108944- 1/10/14
CMS AAA access at RAL. Andrew posed a question to the CMS xroot experts last week - if you have their details it might be a good idea to involve them in the ticket.
100IT
108356- 10/9/14
This VMCatcher ticket is still stuck waiting for a reply. Deafening silence for our 100IT colleagues.
EFDA-JET
97485 - 21/9/13
The Jet LHCB ticket is in the state of being wrapped up. LHCB have been removed from the local configs, and the site has been removed from LHCB's. I believe that this ticket can be terminated.
QMUL
110353 - 25/11/14
Dan's managed to get webdav working on one SE, but not t'other. Very strange, but Dan is investigating (see also 111942).
No movement on the three glexec tickets (none expected on the two tarball ones in the last week though), the Lancaster perfsonar ticket is still waiting on another batch of local tweaks (and I still need to make sense of what I'm seeing). Matt RB closed Sussex's perfsonar ticket though - nice one.
The "Normal" tickets:
Atlas gLexec Hammercloud failures (RALPP and Tier 1)
111699 (Tier 1)
111703 (RALPP)
These tests were waiting on a new stable job release being made - this has been delayed (hopefully out tomorrow).
TIER 1
111856(19/2)
This LHCB ticket about stalled jobs looks like it can be closed (LHCB no longer see a problem). Update - set to solved, the jobs were being killed for using too much memory.
OXFORD
112011(25/2)
A CMS user saw glexec failures on some nodes - Kashif asked for some more information but there has been no reply. I'd consider giving the user till the end of the week then closing the ticket if there's still no word. Waiting for reply (27/2)
Tuesday 3rd March
- A link to GGUS open tickets for checking status directly.
Concentrating on pre-2015 tickets this week in an attempt to Spring Clean the UK ggus presence. I will review these again next week - can people please take a look at these tickets if they're owned by them (or if they think they can help!).
SUSSEX - 26/11/14
110389
This is a perfsonar ticket - the initial request (reinstalling the perfsonar node) was done a while ago but things weren't quite right. Matt RB did some soothing to this last week and asks if he's missed anything - I put it to waiting for reply this morning. Waiting for reply (24/2)
QMUL - 25/11/14
110353
Atlas wanting https access on QM's SE. Dan's been working on this nicely, carefully testing each stage of his rollout. The end is in sight here. In progress (17/2)
TIER 1 - 28/10/14
109694
This is a SNO+ ticket about getting gfal tools working for the Tier 1 - with the new version out Brian has tested it correctly (and I saw a related thread on lcg-rollout) - but no word from Sno+. Who is wrangling the other VOs in these post-Walker times? Waiting for reply (24/2)
TIER 1 - 1/10/14
108944
A CMS ticket about AAA tests at the Tier 1. This ticket is being actively worked on, with a new xrootd redirector at RAL and problems with the EU redirectors mucking things up. No problems that I can see. Waiting for reply (2/3)
100IT - 10/9/14
108356
Getting VMcatcher stuff to work at 100IT. This ticket seems to keep stalling due to lack of documentation or replies from the submitters. Waiting for reply (19/1)
LANCASTER - 27/1/14
100566
Lancaster's poor perfsonar performance. Being poked and prodded on and off over the last year, but the problems remain a mystery - Ewan's lending of an iperf endpoint has helped out greatly though, waiting on yet another network tweak. On Hold (23/2)
EFDA-JET - 21/9/13
97485
LHCB job failures at EFDA-JET. The causes of this remain a mystery, and this is the first ticket on my "to be set to unsolved" list.
UCL - 1/7/13
95298
UCL's glexec ticket. Ben's been working hard at this recently, but keeps hitting show stoppers - the latest being a performance problem possibly due to the VM he's running argus on. On hold (19/2)
ECDF and LANCASTER - 1/7/13
95303 (ECDF)
95299 (Lancaster)
glexec for the tarball. This is *still* waiting on the tarball glexec, which is again waiting on me, which is waiting on me magicking some extra tarball development time. Will be reviewed by the end of March.
Monday 23rd February 2015, 15.00 GMT
Only 15 Open UK tickets this week. Feel free to bring up any ticket-based issues of your own on this quiet week.
RALPP
111703(11/2)
This Atlas glexec hammercloud ticket is looking a little quiet - no update for a while. If it's continuing offline or waiting on input could it at least be put On Hold? In progress (11/2)
(The Tier-1 version of this ticket, 111699, seems to be chugging along fine - there might be useful snippets in there).
100IT
111333(22/1)
The Cloud accounting probe ticket was reopened, asking if 100IT ticketed apel support (I assume contacting them via other means would work too) otherwise the new cloud accounting won't be properly republished. Reopened (20/2)
IMPERIAL (but not really their issue)
111872(20/2)
Tom opened a ticket after another cern@school user had troubles using the IC SE - there have been some problems with the newer versions of the dirac UI. Sometimes it's better to go Vintage! Although after trying Simon and Daniela haven't been able to reproduce the failure - perhaps something's up with the user's UI? Waiting for reply (23/2)
SUSSEX
110389(26/11/14)
Sussex's Perfsonar ticket. I know Matt RB has put the ticket On Hold and is very busy - is there any news/anything we can do to help? On Hold (21/1)
TIER 1
109694(28/10/14)
Sno+ gfal copy problems at RAL. Brian informs us that the latest version of the gfal tools works for him and has asked if they work for Matt M. and Co. Did you get these packages out of epel/epel-testing or somewhere else Brian? Waiting for reply (18/2)
Monday 16th February 2015, 14.30 GMT
Only 19 open UK tickets today.
TIER 1
111120(12/1)
An atlas ticket concerning transfer failures between RAL and BNL. Brian mentioned last week that the lack of recent failures is due to atlas not attempting to transfer any older data recently. Perhaps this could do with being put into the ticket (and the ticket being put On Hold, or prodded some more)? Waiting for reply (29/1)
Also
111800 (17/2)
ARC CE issues at RAL detected.
RALPP
111703(11/2)
Atlas running glexec hammerclouds - having trouble at RALPP (and RAL, see 111699). The glexec experts have gotten involved on this one, and asked to take a peek at a proxy - not sure about anyone else, but I'd feel a tad uncomfortable sharing proxies, even with known and trusted experts as in this case. Am I being overly paranoid? Either way the ticket has gone a bit quiet. In progress (11/2)
100IT
108356 & 111333
Both the 100IT tickets are in Waiting for reply - the oldest one for quite a while - David asked a question a while back and no answer. The newest one asks if the 100IT logs made it to apel safely - I think what David has to do is submit a ticket with this question to the apel support team - have I got the right end of the stick?
Biomed Tickets at Manchester and Imperial
111356 & 111357
FYI There's a note at the bottom of both of these tickets that the version of CREAM that should fix this has been delayed until the end of February(ish).
Update - I read those updates wrong - the cream update has been released and these tickets have been (perhaps erroneously) closed.
Monday 9th February 2015, 15.00 GMT
Other VO Nagios Results
At the time of writing the only site showing red that wasn't suffering an understood problem was RALPP, with org.nordugrid.ARC-CE-submit and SRM-submit test failures for gridpp, pheno, t2k and southgrid for both its CEs and its SE. The failures are between 1 and 12 hours old, so it doesn't seem to be a persistent failure, but it seems to be quite consistent. They all seem to be failing with "Job submission failed... arcsub exited with code 256: ...ERROR: Failed to connect to XXXX(IPv4):443 .... Job submission failed, no more possible targets". Anyone seen something like this before?
Only 20 Open UK tickets this week.
Biomed tickets:
111356 (Manchester)
111357 (Imperial)
Biomed have linked both these tickets as children of 110636, being worked on by the cream blah team. AFAIKS no sign of Cream 1.16.5 just yet.
TIER 1
111347 (22/1)
CMS consistency checks for January 2015. It looks like everything that was asked of RAL has been done by RAL, so hopefully this can be successfully closed. In progress (3/2)
111120 (12/1)
Another ticket, this time concerning a period of Atlas transfer failures between RAL and BNL, that looks like it can be closed as the failures seem to have stopped (and might well have been at the BNL end). Waiting for reply (22/1)
108944 (1/10/14)
CMS AAA test failures at RAL. Federica can't connect to the new xrootd service according to the error messages. No news for a while. In progress (29/1)
100IT
108356
111333
Both of these 100IT tickets are looking a bit crusty - the first is waiting for advice, the second was just put "In progress".
QMUL
110353 (25/11/14)
Dan has set up se02.esc.qmul.ac.uk to test out the latest https-accessible version of storm for dteam and atlas. As a cherry on top this node is also IPv6 enabled. I'm not sure if Dan wants others in the UK to "give it a go"? In progress (6/2)
LANCASTER
100566 (27/1/14)
(Blatantly scrounging for advice) Trying to figure out why Lancaster's perfsonar is under-performing. Ewan kindly gave us access to an iperf endpoint and it's been very useful in characterising some of the weirdness - although I'm still confused. Ewan also gave us a bunch of suggestions for testing that have been useful - next stop, window sizes. If anyone else wants to throw advice to me all wisdom donations are thankfully accepted. My advice for others is to be careful trying to connect to the default iperf port on a working DPM pool node.... In Progress (9/2)
Monday 2nd February 2015, 14.00 GMT
22 Open UK tickets this month.
SUSSEX
110389 (26/11/14)
A perfsonar ticket for Sussex. Their perfsonar has been reinstalled, but needs soothing. Matt has informed us that this might have to wait a few weeks due to other issues. On Hold (21/1)
RALPP
110536 (2/12/14)
MICE job failures at RALPP - it looked like they were dying due to running out of memory. The queues have been tweaked to give MICE more, but no word from MICE if this has solved the problem. Waiting for reply (12/1)
BRISTOL
110365 (25/11/14)
Another perfsonar ticket. Again the node is reinstalled, just not quite working right. Winnie is waiting for news from the other sites in a similar boat. In progress (maybe On Hold it?) (20/1)
EDINBURGH
111118 (12/1)
ECDF "low availability" ticket - just waiting for the silly alarm to clear. Daniela submitted a ticket about this foolish alarm a while ago - 107689. On Hold (19/1)
95303 (1/7/13)
glexec tarball ticket. With my tarball hat on - still no positive news on this front - it's beginning to look like this can't be done but we're having one last go. Sorry! On Hold (19/12)
MANCHESTER
110225 (18/11/14)
Change of VO Manager for helios-vo.eu. It looks like this ticket is being held up at the user end a lot. I'm not sure there's anything we can do as it involves outside CAs. On Hold (20/1)
111356 (23/1)
One of Manchester's CEs not working for biomed, due to problems with the new CREAM/old WMS communication. Alessandra gave biomed some sage advice, but I suspect this ticket will need to be prodded soon to get a response from biomed (who I agree should use a newer WMS and close it). On Hold (26/1)
LANCASTER
111547 (2/2)
I'm reporting on a ticket that I submitted to myself today. I'm not sure what that says about the world. Anyway - a ticket to track the decommissioning of one of Lancaster's CEs, as we try to do it all proper like. On Hold (2/2)
100566 (27/1/14)
Lancaster's perfsonar ticket, which I sadly let reach its first birthday. I've been prodding this offline, does anyone have the address for a regular, open iperf endpoint I could borrow? On Hold (9/1)
95299 (1/7/13)
Lancaster's tarball glexec ticket, as the ECDF one. On hold (26/1)
UCL
95298 (1/7/13)
UCL's glexec ticket. They've been having trouble getting it to behave, and at last check Ben was off ill - probably due to dealing with glexec :-) On Hold (20/1)
QMUL
110353 (25/11/14)
Atlas asking for QM's storage to be made available via https. Waiting on a production ready STORM that can provide this - Dan is trying it out on his testbed se02.esc.qmul.ac.uk, which still needs tweaking. In progress (28/1)
IMPERIAL
111357(23/1)
One of the IC CEs not working for biomed. Similar to the Manchester ticket, Daniela points to ticket 110635 and is waiting on an EMI release to fix it (due out imminently AIUI). On Hold (28/1)
EFDA-JET
97485 (21/9/13)
Jet's LHCB job failure ticket. I'm afraid I haven't been able to chase this up (partly due to only ever remembering on the first Monday of the month) - there's been no news for a while. On Hold (1/10/14)
100IT
111333 (22/1)
A ticket to 100IT and the NGI to get the cloud accounting probe upgraded. I notified 100IT, but forgot to reassign the ticket - thanks to Jeremy for doing it. Assigned (2/2)
108356(10/9/14)
Getting VMcatcher working at 100IT. David from 100IT has asked for some answers on which "glancepush" to use, but no reply for a while. Waiting for reply (19/1)
TIER 1
111477(29/1)
CMS would like to run some staging tests to warm up for Run2. The Tier 1 warned CMS of today's outage and they're happy to proceed tomorrow (the 3rd) - I think they'd like a response. In progress (30/1)
107935(27/8/14)
A ticket regarding inconsistent BDII and SRM storage numbers. Waiting on a fix from the developers regarding read-only disk accounting (I think), Brian is still on the case. Stephen B let us know that Maria the ticket submitter is on maternity leave, and asks in her stead if the numbers are expected to align now. On hold (28/1)
111120(12/1)
An atlas ticket about a large number of data transfer errors seen between RAL and BNL. Brian reckoned that this was due to shallow checksums on the old data being transferred, but had trouble looking at the BNL FTS. Regardless, the ADCoS shifter hadn't seen any errors for a week and suggests the ticket can be closed. Waiting for reply (29/1)
108944(1/10/14)
CMS AAA test problems at RAL. After setting up a new xrootd box the test failures have changed in nature, but sadly they're still failures. In progress (29/1)
111347(22/1)
CMS Consistency Check for RAL, January 2015 edition. Filelists were generated, orphan files were identified, then purged. Just need to know what CMS want to do next. Waiting for reply (26/1)
109694(28/10/14)
Sno+ ticket concerning gfal tool problems, waiting on the new release to come out (middle of this month I believe). If you don't want to wait that long then I believe the 2.8 gfal2 tools can be found in the fts3 repo at last check. On hold (20/1)
Monday 26th January 2015, 14.15 GMT
Back after being forgotten about by me:
Other VO Nagios Status:
At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old. An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old.
And of course the srm-$VONAME failures at the Tier 1, which are caused by incompatibility between the tests and Castor AIUI. Things are generally looking good.
22 Open UK Tickets this week.
NGI/100IT
111333(22/1)
The NGI has been asked to upgrade the cloud accounting probe, and then notify our (only at the moment) cloud site to republish their accounting. Not entirely sure what this entails or who this falls on, I assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the "notify site" list - odd). Assigned (22/1)
TIER 1
108944(1/10/14)
CMS AAA test failures. Andrew Lahiff reported last week that the Tier 1 is building a replacement xrootd box which is currently being prepared. If that will take a while can the ticket be put on hold? In progress (19/1)
QMUL
110353(25/11/14)
An atlas ticket, asking for httpd access at QMUL. The QM chaps were waiting on a production ready Storm that could handle this, and are preparing to test one out. This is another ticket that looks like it might need to be put On Hold (will leave that up to you chaps - there's a big difference between "slow and steady" progress and "no progress for a while"). In progress (21/1)
RHUL
111355(23/1)
A dteam ticket - concerning http access to RHUL's SE. Although the initial observation about the SE certificate being expired was incorrect (the expiry date was reported as 5/1/15, which to be fair I would read as the 5th of January and not the 1st of May!) there is still some underlying problem here with intermittent test failures. Also this ticket raises the question of in what context these tests are being conducted. Anyone know, or shall we ask the submitter? In progress (26/1)
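The date mix-up is easy to reproduce. The sketch below just parses the string from the ticket both ways with Python's standard library; the helper name read_both_ways is made up for illustration, and nothing here reflects how the actual test reports dates.

```python
from datetime import datetime

def read_both_ways(text):
    """Parse a slash-separated date as day-first and as month-first."""
    day_first = datetime.strptime(text, "%d/%m/%y").date()
    month_first = datetime.strptime(text, "%m/%d/%y").date()
    return day_first, month_first

uk_reading, us_reading = read_both_ways("5/1/15")
print("Day-first reading:  ", uk_reading.isoformat())
print("Month-first reading:", us_reading.isoformat())
```

Printing dates in ISO form (year-month-day), as isoformat() does here, sidesteps the ambiguity entirely.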
BIOMED PROBLEMS:
Manchester: 111356(23/1)
Imperial: 111357(23/1)
Biomed are having job problems, looking to be caused by using crusty old WMSes to communicate with these sites' shiny up-to-date CEs. According to ticket 110635 a cream side fix should be out by the end of January (CREAM 1.16.5), although Alessandra suggests that Biomed should try to use newer, working WMSes - or Dirac instead!
Monday 19th January 2015, 14.30 GMT
23 Open UK Tickets this week.
TIER 1
108944(1/10/14)
CMS seeing AAA test failures at RAL. The tests have been restarted recently and now seem to be having some suspicious looking authentication failures. In Progress (13/1)
SHEFFIELD
111162(14/1)
Atlas complaining about httpd doors not working on Sheffield's SE. After schooling the submitter in how to submit more useful information Elena is working on it. I bring this up as in the last few days I've had quite a few of my pool nodes have their httpd daemons crash on them (they're up to date, but still SL5), which may or may not be related. In Progress (19/1)
ECDF
111118(12/1)
ECDF "low availability" ticket after a few days of argus trouble, which Wahid fixed. Now the ticket will languish for a few weeks as the alarm clears. Daniela has reminded us of her ticket against these fairly silly alarms: https://ggus.eu/?mode=ticket_info&ticket_id=107689 . In the mean time this ticket could do with being put On Hold whilst the alarm clears. In progress (19/1)
100IT
108356(10/9/2014)
Setting up VMCatcher at 100IT. After some troubles things seem to be looking up, although there are still some questions that the 100IT chaps have about the configurations and what they should be using that aren't getting answers. I set the ticket to "Waiting for Reply" hoping that this will help get the attention of those in the know. Waiting for reply (15/1)
Perfsonar Tickets
110389(Sussex)
110382(TIER 1)
108273(Durham)
100566(Lancaster)
110365(Bristol)
Everyone seems to have updated their perfsonar hosts, so we're all good on that front, but a number of sites are either having trouble with their reinstalled hosts, or are having problems that they had pre-reinstall still haunt them. I'm afraid I have no suggestions of what to do about the growing number of these tickets though!