Past Ticket Bulletins 2015

From GridPP Wiki
Jump to: navigation, search

Tuesday 3rd March

Concentrating on pre-2015 tickets this week in an attempt to Spring Clean the UK ggus presence. I will review these again next week - can people please take a look at these tickets if they're owned by them (or if they think they can help!).

SUSSEX - 26/11/14
110389
This is a perfsonar ticket - the initial request (reinstalling the perfsonar node) has been done a while ago but things weren't quite right. Matt RB did some soothing to this last week and asks if he's missed anything - I put it to waiting for reply this morning. Waiting for reply (24/2)

QMUL - 25/11/14
110353
Atlas wanting https access on QM's SE. Dan's been working on this nicely, carefully testing each stage of his rollout. The end is in sight here. In progress (17/2)

TIER 1 - 28/10/14
109694
This is a SNO+ ticket about getting gfal tools working for the Tier 1 - with the new version out Brian has tested it correctly (and I saw a related thread on lcg-rollout) - but no word from Sno+. Who is wrangling the other VOs in these post-Walker times? Waiting for reply (24/2)

TIER 1 - 1/10/14
108944
A CMS access about AAA tests at the Tier 1. This ticket is being actively worked on, with a new xrootd redirector at RAL and problems with the EU redirectors mucking things up. No problems that I can see. Waiting for reply (2/3)

100IT - 10/9/14
108356
Getting VMcatcher stuff to work at 100IT. This ticket seems to keep stalling due to lack of documentation or replies from the submitters. Waiting for reply (19/1)

LANCASTER - 27/1/14
100566
Lancaster's poor perfsonar performance. Being poked and prodded on and off over the last year, but the problems remain a mystery - Ewan's lending of a iperf endpoint has helped out greatly though, waiting on yet another network tweak. On Hold (23/2)

EFDA-JET - 21/9/13
97485
LHCB job failures at EFDA-JET. The causes of this remain a mystery, and is the first ticket on my "to be set to unsolved" list.

UCL - 1/7/13
95298
UCL's glexec ticket. Ben's been working hard at this recently, but keeps hitting show stoppers - the latest being a performance problem possibly due to the VM he's running argus on. On hold (19/2)

ECDF and LANCASTER - 1/7/13
95303 (ECDF)
95299 (Lancaster)
glexec for the tarball. This is *still* waiting on the tarball glexec, which is again waiting on me, which is waiting on me magicking some extra tarball development time. Will be reviewed by the end of March.


Monday 23rd February 2015, 15.00 GMT
Only 15 Open UK tickets this week. Feel free to bring up any ticket-based issues of your own on this quiet week.

RALPP
111703(11/2)
This CMS glexec hammercloud ticket is looking a little quiet - no update for a while. If it's continuing offline or waiting on input could it at least be put On Hold? In progress (11/2) (The Tier-1 version of this ticket, 111699, seems to be chugging along fine - there might be useful snippets in there).

100IT
111333(22/1)
The Cloud accounting probe ticket was reopened, asking if 100IT ticketed apel support (I assume contacting them via other means would work too) otherwise the new cloud accounting won't be properly republished. Reopened (20/2)


IMPERIAL (but not really their issue)
111872(20/2)
Tom opened a ticket after another cern@school user had troubles using the IC SE - there has been some problems with the newer versions of the dirac UI. Sometimes it's better to go Vintage! Although after trying Simon and Daniela haven't been able to reproduce the failure - perhaps something's up with the user's UI? Waiting for reply (23/2)

SUSSEX
110389(26/11/14)
Sussex's Perfsonar ticket. I know Matt RB has put the ticket On Hold and is very busy - is there any news/anything we can do to help? On Hold (21/1

TIER 1
109694(28/10/14)
Sno+ gfal copy problems at RAL. Brian informs us that the latest version of the gfal tools works for him and has asked if they work for Matt M. and Co. Did you get these packages out of epel/epel-testing or somewhere else Brian? Waiting for reply (18/2)

Monday 16th February 2015, 14.30 GMT
Only 19 open UK tickets today.

TIER 1
111120(12/1)
An atlas ticket concerning transfer failures between RAL and BNL. Brian mentioned last week that the lack of recent failures is due to atlas not attempting to transfer any older data recently. Perhaps this could do with being put into the ticket (and the ticket being put On Hold, or prodded some more)? Waiting for reply (29/1)

Also 111800 (17/2)
ARC CE issues at RAL detected.

RALPP
111703(11/2)
Atlas running glexec hammerclouds - having trouble at RALPP (and RAL, see 111699). The glexec experts have gotten involved on this one, and asked to take a peek at a proxy - not sure about anyone else, but I'd feel a tad uncomfortable sharing proxies, even with known and trusted experts as in this case. Am I being overly paranoid? Either way the ticket has gone a bit quiet. In progress (11/2)

100IT
108356 & 111333
Both the 100IT tickets are in Waiting for reply - the oldest one for quite a while - David asked a question a while back and no answer. The newest one asks if the 100IT logs made it to apel safely - I think what David has to do is submit a ticket with this question to the apel support team - have I got the right end of the stick?

Biomed Tickets at Manchester and Imperial
111356 & 111357
FYI There's a note at the bottom of both of these tickets that the version of CREAM that should fix this has been delayed until the end of February(ish).
Update - I read those updates wrong - the cream update has been released and these tickets have been (perhaps erroneously) closed.


Monday 9th February 2015, 15.00 GMT

Other VO Nagios Results
At the time of writing the only site showing red that aren't suffering an understood problem was RALPP with org.nordugrid.ARC-CE-submit and SRM-submit test failures for gridpp, pheno, t2k and southgrid for both its CEs and its SE. The failures are between 1 and 12 hours old, so it doesn't seem to be a persistent failure, but it seems to be quite consistent. They all seem to be failing with "Job submission failed... arcsub exited with code 256: ...ERROR: Failed to connect to XXXX(IPv4):443 .... Job submission failed, no more possible targets". Anyone seen something like this before?

Only 20 Open UK tickets this week.

Biomed tickets:
111356 (Manchester)
111357 (Imperial)
Biomed have linked both these tickets as children of 110636, being worked on by the cream blah team. AFAIKS no sign of Cream 1.16.5 just yet.

TIER 1
111347 (22/1)
CMS consistency checks for January 2015. It looks like everything that was asked of RAL has been done by RAL, so hopefully this can be successfully closed. In progress (3/2)

111120 (12/1)
Another ticket, this time concerning a period of Atlas transfer failures between RAL and BNL, that looks like it can be closed as the failures seem to have stopped (and might well have been at the BNL end). Waiting for reply (22/1)

108944 (1/10/14)
CMS AAA test failures at RAL. Federica can't connect to the new xrootd service according to the error messages. No news for a while. In progress (29/1)

100IT 108356
111333
Both of these 100IT tickets are looking a bit crusty - the first is waiting for advice, the second was just put "In progress".

QMUL
110353 (25/11/14) Dan has set up se02.esc.qmul.ac.uk to test out the latest https-accessible version of storm for dteam and atlas. As a cherry on top this node is also IPv6 enabled. I'm not sure if Dan wants others in the UK to "give it a go"? In progress (6/2)

LANCASTER
100566 (27/1/14)
(Blatantly scounging for advice) Trying to figure out why Lancaster's perfsonar is under-performing. Ewan kindly gave us access to a iperf endpoint and it's been very useful in characterising some of the weirdness - although I'm still confused. Ewan also gave us a bunch of suggestions for testing that have been useful - next stop, window sizes. If anyone else wants to throw advice to me all wisdom donations are thankfully accepted. My advice for others in be careful trying to connect to the default iperf port on a working DPM pool node.... In Progress (9/2)

Monday 2nd February 2015, 14.00 GMT
22 Open UK tickets this month.

SUSSEX
110389 (26/11/14)
A perfsonar ticket for Sussex. Their perfsonar has been reinstalled, but needs soothing. Matt has informed us that this might have to wait a few weeks due to other issues. On Hold (21/1)

RALPP
110536 (2/12/14)
MICE job failures at RALPP - it looked like they were dying due to running of of memory. The queues have been tweaked to give MICE more, but no word from the MICE if this has solved the problem. Waiting for reply (12/1)

BRISTOL
110365 (25/11/14)
Another perfsonar ticket. Again the node is reinstalled, just not quite working right. Winnie is waiting for news from the other sites in a similar boat. In progress (maybe On Hold it?) (20/1)

EDINBURGH
111118 (12/1)
ECDF "low availability" ticket - just waiting for the silly alarm to clear. Daniela submitted a ticket about this foolish alarm a while ago - 107689. On Hold (19/1)

95303 (1/7/13)
glexec tarball ticket. With my tarball hat on - still no positive news on this front - it's beginning to look like this can't be done but we're having one last go. Sorry! On Hold (19/12)

MANCHESTER
110225 (18/11/14)
Change of VO Manager for helios-vo.eu. It looks like this ticket is being held up at the user end a lot. I'm not sure there's anything we can do as it involves outside CAs. On Hold (20/1)

111356 (23/1)
One of Manchester's CEs not working for biomed, due to problems with the new CREAM/old WMS communication. Alessandra gave biomed some sagely advice, but I suspect this ticket will need to be prodded soon to get a reponse from biomed (who I agree should use a newer WMS and close it). On Hold (26/1)

LANCASTER
111547 (2/2)
I'm reporting on a ticket that I submitted to myself today. I'm not sure what that says about the world. Anyway - a ticket to track the decommissioning of one of Lancaster's CEs, as we try to do it all proper like. On Hold (2/2)

100566 (21/1/14)
Lancaster's perfsonar ticket, which I sadly let reach its first birthday. I've been prodding this offline, does anyone have the address for a regular, open iperf endpoint I could borrow? On Hold (9/1)

95299 (1/7/13)
Lancaster's tarball glexec ticket, as the ECDF one. On hold (26/1)

UCL
95299 (1/7/13)
UCL's glexec ticket. They've been having trouble getting it to behave, and at last check Ben was off ill - probably due to dealing with glexec :-) On Hold (20/1)

QMUL
110353 (25/11/14)
Atlas asking for QM's storage to be made available via https. Waiting on a production ready STORM that can provide this - Dan is trying it out on his testbed se02.esc.qmul.ac.uk, which still needs tweaking. In progress (28/1)

IMPERIAL
111357(23/1)
One of the IC CEs not working for biomed. Similar to the Manchester ticket, Daniela points to ticket 110635 and is waiting on an EMI release to fix it (due out imminently AIUI). On Hold (28/1)

EFDA-JET
97485 (21/9/13)
Jet's LCHB job failure tickets. I'm afraid I haven't been able to chase this up (partly due to only ever remembering on the first Monday of the month) - there's been no news for a while. On Hold (1/10/14)

100IT
111333 (22/1)
A ticket to 100IT and the NGI to get the cloud accounting probe upgraded. I notified 100IT, but forgot to reassign the ticket - thanks to Jeremy for doing it. Assigned (2/2)

108356(10/9/14)
Getting VMcatcher working at 100IT. David from 100IT has asked for some answers on which "glancepush" to use, but no reply for a while. Waiting for reply (19/1)

TIER 1
111477(29/1)
CMS would like to run some staging tests to warm up for Run2. The Tier 1 warned CMS of today's outage and they're happy to proceed tomorrow (the 3rd) - I think they'd like a response. In progress (30/1)

107935(27/8/14)
A ticket regarding inconsistent BDII and SRM storage numbers. Waiting on a fix from the developers regarding read-only disk accounting (I think), Brian is still on the case. Stephen B let us know that Maria the ticket submitter is on maternity leave, and asks in her stead if the numbers are expected to align now. On hold (28/1)

111120(12/1)
An atlas ticket about a large number of data transfer errors seen between RAL and BNL. Brian reckoned that this was due to shallow checksums on the old data being transferred, but had trouble looking at the BNL FTS. Regardless, the ADCoS shifter hadn't seen any errors for a week and suggests the ticket can be closed. Waiting for reply (29/1)

108944(1/10/14)
CMS AAA test problems at RAL. After setting up a new xrootd box the test failures have changed in nature, but sadly they're still failures. In progress (29/1)

111347(22/1)
CMS Consistency Check for RAL, January 2015 edition. Filelists were generated, orphan files were identified, then purged. Just need to know what CMS want to do next. Waiting for reply (26/1)

109694(28/10/14)
Sno+ ticket concerning gfal tool problems, waiting on the new release to come out (middle of this month I believe). If you don't want to wait that long then I believe the 2.8 gfal2 tools can be found in the fts3 repo at last check. On hold (20/1)


Monday 26th January 2015, 14.15 GMT
Back after being forgotten about by me:
Other VO Nagios Status:

At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old. An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old. And of course the srm-$VONAME failures at the Tier 1, which are caused by incompatibility between the tests and Castor AIUI. Things are generally looking good.

22 Open UK Tickets this week.
NGI/100IT
111333(22/1)
The NGI has been asked to upgrade the cloud accounting probe, and then notify our (only at the moment) cloud site to republish their accounting. Not entirely sure what this entails or who this falls on, I assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the "notify site" list - odd). Assigned (22/1)

TIER 1
108944(1/10/14)
CMS AAA test failures. Andrew Lahiff reported last week that the Tier 1 is building a replacement xrootd box which is currently being prepared. If that will take a while can the ticket be put on hold? In progress (19/1)

QMUL
110353(25/11/14)
An atlas ticket, asking for httpd access to at QMUL. The QM chaps were waiting on a production ready Storm that could handle this, and are preparing to test one out. This is another ticket that looks like it might need to be put On Hold (will leave that up to you chaps - there's a big difference between "slow and steady" progress and "no progress for a while"). In progress (21/1)

RHUL
111355(23/1)
A dteam ticket - concerning http access to RHUL's SE. Although the initial observation about the SE certificate being expired was incorrect (the expiry date was reported as 5/1/15, which to be fair I would read as the 5th of January and not the 1st of May!) there still is some underlying problem here with intermittent test failures. Also this ticket raises the question of under what context are these tests being conducted? Anyone know, or shall we ask the submitter? In progress (26/1)

BIOMED PROBLEMS:
Manchester: 111356(23/1)
Imperial: 111357(23/1)
Biomed are having job problems, looking to be caused by using crusty old WMSes to communicate with these site's shiny up-to-date CEs. According to ticket 110635 a cream side fix should be out by the end of January (CREAM 1.16.5), although Alessandra suggests that Biomed should try to use newer, working WMSes - or Dirac instead!


Monday 19th January 2015, 14.30 GMT
23 Open UK Tickets this week.

TIER 1
108944(1/10/14)
CMS seeing AAA test failures at RAL. The tests have been restarted recently and now seem to be having some suspicious looking authentication failures. In Progress (13/1)

SHEFFIELD
111162(14/1)
Atlas complaining about httpd doors not working on Sheffield's SE. After schooling the submitter in how to submit more useful information Elena is working on it. I bring this up as in the last few days I've had quite a few of my pool nodes have their httpd daemons crash on them (they're up to date, but still SL5), which may or may not be related. In Progress (19/1)

ECDF
111118(12/1)
ECDF "low availability" ticket after a few days of argus trouble, which Wahid fixed. Now the ticket will languish for a few weeks as the alarm clears. Daniela has reminded us of her ticket against these fairly silly alarms: https://ggus.eu/?mode=ticket_info&ticket_id=107689 . In the mean time this ticket could do with being put On Hold whilst the alarm clears. In progress (19/1)

100IT
108356(10/9/2014)
Setting up VMCatcher at 100IT. After some troubles things seem to have be looking up, although there are still some questions that the 100IT chaps have for the configurations and what they should be using that aren't getting answers. I set the ticket to "Waiting for Reply" hoping that this will help get those in the know's attention. Waiting for reply (15/1)

Perfsonar Tickets
110389(Sussex)
110382(TIER 1)
108273(Durham)
100566(Lancaster)
110365(Bristol)
Everyone seems to have updated their perfsonar hosts, so we're all good on that front, but a number of sites are either having trouble with their reinstalled hosts, or are having problems that they had pre-reinstall still haunt them. I'm afraid I have no suggestions of what to do about the growing number of these tickets though!