Monday 13th April 2015, 14.00 BST
24 open tickets this week - going over all of them, site by site.
Fresh in this morning - 113010 and 113011 - Sno+ tickets concerning the RAL and Glasgow WMSes not updating job statuses.
RALPP
111703 (11/2)
Atlas glexec hammercloud tests failing. There's been a lot of waiting on atlas to build new HC jobs. The most recent exchange (delayed due to Easter) was asking about SELinux, but there has been no news since the 1st. In progress (1/4)
BIRMINGHAM
112875 (7/4)
Low availability ROD ticket. Availability is crawling back up, just need it to go green. On hold (13/4)
GLASGOW
112967 (10/4)
Another ROD ticket for bdii errors at Glasgow. Gareth has been doing everything right investigating this. Kashif recommended ticketing the midmon unit, but Gareth has spotted that the errors correspond to high load on their ARC CE, so it might be a site problem after all. Gareth asks for clarification. Waiting for reply (13/4)
EDINBURGH
95303 (1/7/13)
Tarball glexec ticket. No news (sorry). I believe the end of April was the "deadline" I set for having this made. On Hold (9/3)
LANCASTER
100566 (27/1/14)
Lancaster's poor perfsonar performance. I don't quite believe what I saw in the tests I performed, so I'm aiming to rerun them. On hold (13/4)
95299 (1/7/13)
Lancaster's tarball glexec ticket. Same as ECDF. On hold (9/3)
BRUNEL
112966 (13/3)
A ROD cream job submit ticket, freshly assigned this afternoon. It's a bit mean of me to bring notice to it. Assigned (13/4). And POW, Raul closed this after kicking torque into shape - solved.
100IT
112948 (10/4)
100IT needed to upgrade to the latest CA release. They've done this, but there are still authentication problems. In progress (13/4)
108356 (10/9/14)
Deploying vmcatcher at 100IT. After David's questions fell on deaf ears for a while, it has been advised that the ticket be closed, as this issue will be dealt with elsewhere. Whether it is to be "solved" or "unsolved" is open to debate! In progress (can possibly be closed) (13/4)
TIER 1
108944 (1/10/14)
CMS AAA tests failing at RAL. After a lot of work and new xrootd redirectors, problems persist. It looks like a problem that needs the CASTOR and/or xrootd devs to look at. In progress (30/3)
112713 (27/3)
CMS asking to clean up the "unmerged area". Andrew conjured up a list of files and asked if they could be deleted - CMS responded with a "yes please then close the ticket". Has the deed been done? In progress (31/3)
109694 (28/10/14)
The Sno+ gfal copy ticket. Matt M still sees gfal-copy hang for files at RAL when he uses the GUID (SURL works). A Castor oddity perhaps? Matt asks a question about what problems like this (coupled with the move away from lcg tools) will mean for VOs that rely on the LFC. In progress (31/3)
112977 (10/3)
CMS high job failure rate at RAL. Related to 112896 (below) - the jobs all want that file! In progress (13/3)
112896 (9/4)
CMS Dataset access problems - caused by over a million access attempts on a single file over an 18-hour period. Andrew L comments that CMS needs to have a think about how they access pileup datasets. In progress (9/4)
111699 (10/2)
Tier 1 counterpart to 111703. A new HC stress test was submitted near the end of March, but no news on how it did. In progress (23/3)
112866 (7/4)
A different "lots of CMS job failures" ticket. Again a "hot file" seems to be the root cause. In progress (7/4)
112721 (28/3)
An atlas file access ticket, seemingly caused by some odd FTS behaviour. No answers to Shaun's question about this odd occurrence, nor much noise at all, until today. Waiting for reply (13/4)
UCL
UCL has 6 tickets - 4 just "assigned". I'll just list them in the interests of brevity.
112371 (ROD low availability, on hold)
112841 (atlas 0% transfer efficiency, assigned)
112873 (ROD srm put failures, assigned)
95298 (glexec ticket, on hold)
112722 (atlas checksum timeouts, in progress)
112966 (ROD job submit failures, assigned)