MONDAY 8th APRIL 15.00 BST
27 open UK tickets this week, and as it's the first working day of the month, we have the joy of looking at all of them.
The UK ROD is being hauled over the coals for not handling recent tickets "according to escalation procedure". I suspect the tickets referred to are all EMI 1 upgrade ones, so justifying ourselves should be straightforward. Assigned to ngi-ops. (8/4)
Rolling out VOMS support for the new, Glasgow-based earthsci VO. After some discussion on domain naming it was decided to go with the VO name earthsci.vo.gridpp.ac.uk. It has been deployed at Manchester, Oxford and IC, so I assume the next step is testing it. In progress (4/4)
EMI 1 UPGRADE TICKETS:
RALPP https://ggus.eu/ws/ticket_info.php?ticket=91997 (On hold, extended 5/4)
Chris has pushed the dCache upgrade back a little, but it all seems in order. The last remaining EMI 1 holdout was being drained for upgrade last week.
GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=91992 (In progress, extended 5/4)
Not much word from the Glasgow lads in a while (since 11/3), but they only had a few holdouts left.
https://ggus.eu/ws/ticket_info.php?ticket=92805 (On hold)
Glasgow's DPM ticket (despite their DPM technically being up to date) - Sam hopes to "update" when DPM 1.8.7 comes out, but if that looks unlikely in the time frame Sam will reinstall the DPM rpms to simulate an upgrade (a rough sketch of what that might look like is below).
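In practice that "fake upgrade" would presumably amount to something along these lines on the head node (just a sketch - the package glob and service names would want checking against what's actually installed before anyone runs it in anger):

    # see which DPM rpms are actually on the box before touching anything
    rpm -qa | grep -i dpm
    # reinstall them in place, so the packages register as freshly deployed
    yum reinstall "dpm*"
    # bounce the DPM daemons afterwards
    service dpm restart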
SHEFFIELD https://ggus.eu/ws/ticket_info.php?ticket=91990 (On hold, extended 5/4)
Just some worker nodes left at Sheffield. Looking good. (But Elena has some publishing issues - see TB-SUPPORT).
BRUNEL https://ggus.eu/ws/ticket_info.php?ticket=91975 (On hold)
Raul upgraded his CE, only to find that the nagios tests haven't picked up the upgrade! Daniela suggests a site BDII restart. Update - Raul seems to have figured out an arcane way of getting the publishing to work: running yaim twice and then restarting the site BDII (roughly along the lines sketched below).
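For anyone wanting to reproduce the trick, it would presumably look something like this (a sketch only - the site-info.def path and node types are illustrative guesses, not what Brunel actually uses):

    # on the CE: re-run yaim against the site config (twice, apparently)
    /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n creamCE
    /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n creamCE
    # then on the site BDII box: restart it so the refreshed values get published
    service bdii restart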
DURHAM https://ggus.eu/ws/ticket_info.php?ticket=92804 (In progress, extended 5/4)
Not much news from Mike about this in the last few weeks - I think that he's in the same boat as Sam: technically up to date (just from the "wrong" repo).
COMMON OR GARDEN TICKETS:
Brian asked for a data dump, Ewan provided two! Ewan has left the ticket open whilst atlas decide what to do with the information. Waiting for reply (2/4)
Moving atlas data from the groupdisk token. Last word was from Stephene on the 3/3, asking for a dump of what remains. I think that the conversation has moved offline to expedite things. How goes it? On hold (3/3)
Glasgow supplied Brian with a list of all the files on the SE; Brian has given back a list of all the "dark data" files that couldn't be deleted remotely. In progress (8/4)
Glasgow were being bitten by stage-in failures after disk server stress killed the xrootd service on a node. Measures have been put in place to stop this happening again, and Sam has said some wise words on the issue (as it was data-hungry production jobs that caused the deadly stress). Sam suggests it would be beneficial to have these data-hungry production jobs flagged in some way, so that they can be treated similarly to analysis jobs (staggered starts, limiting the maximum number running, etc.). In progress (5/4)
This raises the question: is it likely that suggestions put in a ticket like this would work their way up the chain to someone who could act on them?
LHCb were having what look like authorisation problems at Durham. Not much news on the ticket since then; does the problem persist? On hold (2/4)
atlas would like 5TB shuffled from localgroupdisk to datadisk. Assigned (8/4)
Atlas were suffering transfer failures, which puzzled the Liver lads as their logs showed the transfers succeeding. It could have been a problem with the University firewalls - the timing of the problems coincided with a change in the Uni firewall. These changes have been reverted, so let's see if things go back to normal. In progress (8/4)
LHCb jobs were running in the tidgy home partition on the Lancaster shared cluster. I've tried to put in place a job wrapper that cds to $TMPDIR (sketched below), but no joy - not sure what I'm doing wrong. On hold (27/3)
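For the record, the sort of wrapper I'm attempting looks roughly like this (names and paths are illustrative, and clearly something in my actual version isn't quite right):

    #!/bin/bash
    # wrapper: hop into the per-job scratch area before the payload starts,
    # so the job doesn't do its work in the (tiny) shared home partition
    cd "${TMPDIR:-/tmp}" || exit 1
    # hand over to the real job with its original arguments
    exec "$@"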
Path MTU discovery problems for RHUL. Passed to the networking chaps and Janet, this may be a long time in the solving. On hold (28/1)
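(For anyone wanting to poke at this sort of thing themselves, the usual quick checks are along these lines - the hostname here is just a placeholder, not a real RHUL box:)

    # does a full-sized, don't-fragment ping make it through? (1472 + 28 bytes of headers = 1500)
    ping -M do -s 1472 -c 5 remote-host.example.ac.uk
    # what path MTU does the kernel actually discover along the route?
    tracepath remote-host.example.ac.uk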
Biomed are reporting seeing negative space on the RHUL SE - an old bugbear resurrected. In progress (1/4)
QM got a nagios ticket for the recent APEL troubles; Dan rightfully cited the APEL ticket. In progress (8/4)
Atlas transfer failures, caused by a crash in a disk storage node. Reopened after the initial fix; it looks like a Lustre bug is plaguing the QM chaps. Currently they're hoping for a bug fix, or else they'll need to roll back. In progress (8/4)
Chris is requesting WebDAV support on the RAL LFC. The RAL team are waiting for the next LFC version, with better WebDAV support, to come out in production. On hold (3/4)
Long-standing ticket concerning the SRM troubles with certain robot DNs. No fix is likely in the near future. On hold (27/2)
Correlated packet loss on the RAL perfsonar. The picture looks improved after last month's intervention, but it still needs understanding. It's been proposed to wait until after the May intervention before looking hard at this again. On hold (27/3)
The epic VO are having trouble downloading output from the RAL WMS. Most likely related to the known problem https://ggus.eu/ws/ticket_info.php?ticket=92288 (submitted by Jon from t2k). In progress (5/4)
Obviously Friday was the day of tickets. Atlas were seeing a large number of cvmfs-related cmtside failures. The nodes affected were testing the latest cvmfs 2.1.8, and have been rolled back. Waiting for reply (8/4)
RAL were having problems with their myproxy aliases not matching up with their myproxy certificates. After trying a few fixes the RAL guys are setting up a new machine on which the hostname and certificate match. They aim to have this done within a fortnight. In progress (28/3)
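(If anyone wants to see for themselves which names a myproxy endpoint's certificate actually presents, something like this should do the job - the hostname below is just a placeholder, and 7512 is the standard myproxy port:)

    # connect to the (example) myproxy endpoint and pick out the cert's subject and SAN entries
    echo | openssl s_client -connect myproxy.example.ac.uk:7512 2>/dev/null \
      | openssl x509 -noout -text | grep -E 'Subject:|DNS:'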
Just in case you guys haven't been reading TB-SUPPORT, the ticket tracking the current APEL problems: