Operations Bulletin 150413


Bulletin archive


Week commencing 8th April 2013
Task Areas
General updates

Monday 8th April

  • As pointed out in Alessandra's email last week, aspects of the experiment computing model evolution are addressed in this DM presentation for ATLAS and CMS and this one on LHCONE.
  • Due to the Easter networking outages at RAL (4 days in total), APEL is still catching up. Some sites are publishing several months of data and the server has gone from processing 6 million to 16 million events per day. Sites have been seen to time out when the consumer table is optimised (a new issue) and this is being investigated. Please be patient as the service catches up. WLCG and GridPP are aware that accounting data is not yet up-to-date for March (for the monthly and quarterly reports).
  • On Thursday at the EGI Community Forum there is a talk on EMI-3 APEL - if you are a sysadmin and at the EGI CF that day please consider attending to give feedback on the approach. We can also get Will along to an ops meeting.


Tuesday 2nd April

  • Any remaining certificate problems?
  • Support for EMI-1 dCache was extended. See this broadcast. Report any tickets that have not been updated.
  • Jens has produced a page on Key Tokens. How do we want to use this now?
  • GGUS have released a new page on using the system.
  • There was an OMB meeting last Tuesday. (To be reviewed)


WLCG Operations Coordination - Agendas

Monday 8th April

  • The dates of the next WLCG Operations Coordination meetings are: Thursday 11th and 25th April, 15:30 CEST.
  • The agenda for Thursday is currently based on the standing items. Let Alessandra or Jeremy know if you have items you would like raised/discussed.

Tuesday 2nd April

  • A new task force on http proxy discovery is being formed (read more). They are looking for members.
  • Minutes of the 21st March planning meeting are now available.
Tier-1 - Status Page

Tuesday 9th April

  • The planned network intervention this morning (9th April) ran into problems. There was a break in connectivity to RAL from around 07:45 to 09:25 local time. Internally, services carried on running OK.
  • Investigations are ongoing into problems with batch job set-up.
Storage & Data Management - Agendas/Minutes

Tuesday 9th April

Monday 1st April

  • DDN report - see slides circulated by Pete G.

Wed 20 March 2013

  • Ruminated over the agenda items from last week's GDB
    • EMI roadmap (dCache, and other things)
    • FTS support for HTTP - we knew this, but how do we make use of it now?
    • Storage accounting records need an updated APEL.
    • Work of storage group(s) on interfaces and protocols, and future furlongpebbles.
  • RAL D1T0 evaluation.
    • Seems to be settling on HDFS and Ceph, which will be run anyway.
    • What about Lustre?
    • Presentation to PMB next Monday, but no decision yet.



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th March

  • APEL publishing stopped for Lancaster, QMUL and ECDF

Tuesday 12th February

  • The SL HS06 page shows some odd ratios. Steve says he now takes "HS06 CPU numbers direct from ATLAS" and his page does get stuck every now and then.
  • An update of the metrics page has been requested.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 9th April

  • Please could those responsible for key documents start addressing the completeness of documents for which they are responsible? Thank you.

Tuesday 26th February

KeyDocs monitoring status: Grid Storage(7/0) Documentation(3/0) On-duty coordination(3/0) Staged rollout(3/0) Ticket follow-up(3/0) Regional tools(3/0) Security(3/0) Monitoring(3/0) Accounting(3/0) Core Grid services(3/0) Wider VO issues(3/0) Grid interoperation(3/0) Cluster Management(1/0) (brackets show total/missing)

Monday, 4th March

  • New draft document for putting a CE in downtime. It discusses the pros and cons of 3 approaches. Needs discussion to finalise.
  • EPIC integrated into Approved VOs, pending full acceptance.
  • Process commissioned to delete stale documents.

Thursday, 29th November

The Approved VOs document has been updated to automatically contain a table that lays out the resource requirements for each VO, as well as the maximum values. We need to discuss whether this is useful - it seems that the majority of WN software requirements are passed around by word of mouth etc. Should this be formalised? Please see

   https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs#VO_Resource_Requirements

This table will be kept up to date with a regular process that syncs it with the CIC Portal, should it prove to be useful.
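For reference, below is a minimal sketch (in Python) of the kind of regular sync job described above. The portal endpoint, the XML element and attribute names and the table columns are assumptions for illustration only, not the actual GridPP process or the real Operations/CIC Portal schema.

   # Hypothetical sketch: pull VO ID cards from the operations/CIC portal and
   # regenerate a MediaWiki table of per-VO resource requirements.
   # The URL and XML element/attribute names below are assumptions, not the real feed.
   import urllib2                                # 2013-era (Python 2) tooling assumed
   import xml.etree.ElementTree as ET

   PORTAL_URL = "https://operations-portal.egi.eu/xml/voIDCard/public/all/true"  # assumed

   def fetch_vo_cards(url=PORTAL_URL):
       """Download the VO ID card XML and return the parsed tree."""
       return ET.parse(urllib2.urlopen(url))

   def wiki_table(tree):
       """Render a MediaWiki table of VO name against resource requirements."""
       rows = ['{| class="wikitable"',
               "! VO !! Max CPU time !! Max RAM !! Scratch"]
       for vo in tree.findall(".//VO"):          # element/attribute names assumed
           res = vo.find("Resources")
           get = (lambda k: res.get(k, "?")) if res is not None else (lambda k: "?")
           rows.append("|-\n| %s || %s || %s || %s"
                       % (vo.get("Name", "unknown"), get("MaxCPUTime"),
                          get("MaxRAM"), get("Scratch")))
       rows.append("|}")
       return "\n".join(rows)

   if __name__ == "__main__":
       print(wiki_table(fetch_vo_cards()))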

Interoperation - EGI ops agendas

Tuesday 9th April

  • There was an EGI ops meeting on 3rd April.
  • UMD/SR - note issues with CREAM in UMD-2 - also there's a new CREAM in EMI-2, with security updates. Does anyone in the UK run CREAM from UMD-2 at the moment?
  • EMI-2 WN tarball has passed SR. Expect a deadline for the upgrade soon. gLite 3.2 WN tarballs should be updated ASAP.
  • EMI-3 WMS on SL6 doesn't work with Argus (GGUS 92773)
  • EMI-3 VOMS has a critical issue; a fix is scheduled for 18th April.
  • Only APEL and VOMS appear to have stopped supporting YAIM core in the early EMI-3 release.

Tuesday 2nd April

  • Minutes of the 20th March EGI ops meeting are available.


Monitoring - Links MyWLCG

Tuesday 9th April

  • David C has material to present (Glasgow solutions to monitoring) but cannot make our Tuesday ops meeting. Looking at options.

Tuesday 5th February

  • The task will focus on probes and sharing of useful tools - suggestions and comments welcome.
  • Glasgow dashboard now packaged and can be downloaded here.
On-duty - Dashboard ROD rota

Monday 1st April

  • A new GOCDB field related to the ROD email address was not populated. Emails should now reach the team.

Tuesday 5th March

  • Handling tickets related to EMI-1 probes - what to expect.
  • Recommendation with respect to upgrading CE (drain first)

Tuesday 12th February

  • Need all ROD members to complete availability survey for the rota.
Rollout Status WLCG Baseline

Tuesday 2nd April

  • EMI-1 components should be out of production. Nagios probes will report them as critical this month. Services remaining (without an agreed exception) beyond 30th April will need to be placed in downtime.

Monday 4th March

  • EMI early adopters list by component.
  • Do we have a Staged Rollout list for EMI3?

Tuesday 5th February

References


Security - Incident Procedure Policies Rota

Monday 8th April

  • We have a number of site notifications from Pakiti. Please check your site summary.

Tuesday 2nd April

  • Reminder about ptrace kernel issue (CVE-2013-0871)
  • Thanks to all those sites that took part in the security challenge

Tuesday 5th March

  • Two openafs vulnerabilities announced (CVE-2013-1794 and CVE-2013-1795). Further details available at http://www.openafs.org/security. Updated RPMS for SL5/6 available.



Services - PerfSonar dashboard | GridPP VOMS

Tuesday 9th April

  • It is now becoming urgent to configure and enable the backup VOMS instances at Oxford and Imperial. Please can we arrange a follow-up meeting (postponed last week as Daniela was out)?

Tuesday 2nd April

  • Impending electrical work at Manchester - we need to commission the backup VOMS arrangement as soon as possible.

Monday 18th February

  • PerfSonar tests to BNL reveal poor rates for several sites since the upgrade.

Tuesday 5th February

  • NGS VOMS to be switched off this week
Tickets

MONDAY 8th APRIL 15.00 BST
27 open UK tickets this week, and as it's the first working day of the month, we have the joy of looking at all of them.

NGI
https://ggus.eu/ws/ticket_info.php?ticket=93142 (5/4)
The UK ROD is being hauled over the coals for not handling recent tickets "according to escalation procedure". I suspect all the tickets referred to are EMI-1 upgrade ones, so justifying ourselves should be straightforward. Assigned to ngi-ops. (8/4)

VOMS
https://ggus.eu/ws/ticket_info.php?ticket=92306 (7/3)
Rolling out VOMS support for the new, Glasgow-based earthsci VO. After some discussion on domain naming it was decided to go with the VO name earthsci.vo.gridpp.ac.uk. It has been deployed at Manchester, Oxford and IC, so I assume the next step is testing it. In progress (4/4)
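A very rough smoke test along these lines could confirm the new VO is usable from a UI once the vomses entries are in place (a hedged sketch in Python; the exact checks a site or the VO would want may well differ):

   # Sketch of a smoke test for the new earthsci.vo.gridpp.ac.uk VO: try to
   # obtain a VOMS proxy and then check that the proxy carries the VO's FQAN.
   # Assumes a UI with a valid user certificate and vomses entries for the VO.
   import subprocess

   VO = "earthsci.vo.gridpp.ac.uk"

   def check_voms_proxy(vo=VO):
       """Return True if voms-proxy-init succeeds and the proxy lists the VO."""
       if subprocess.call(["voms-proxy-init", "--voms", vo]) != 0:
           return False
       info = subprocess.Popen(["voms-proxy-info", "--all"],
                               stdout=subprocess.PIPE).communicate()[0]
       return ("/" + vo) in info.decode("utf-8", "ignore")

   if __name__ == "__main__":
       print("earthsci proxy OK" if check_voms_proxy() else "earthsci proxy FAILED")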

EMI-1 UPGRADE TICKETS:
RALPP https://ggus.eu/ws/ticket_info.php?ticket=91997 (On hold, extended 5/4)
Chris has put back the dCache upgrade a bit, but it seems in order. The last other EMI-1 holdout was being drained for upgrade last week.

GLASGOW https://ggus.eu/ws/ticket_info.php?ticket=91992 (In progress, extended 5/4)
Not much word from the Glasgow lads in a while (since 11/3), but they only had a few holdouts left.
https://ggus.eu/ws/ticket_info.php?ticket=92805 (On hold)
Glasgow's DPM ticket (despite their DPM technically being up to date). Sam hopes to "update" when DPM 1.8.7 comes out, but if that looks unlikely in the time frame Sam will reinstall the DPM RPMs to simulate an upgrade.

SHEFFIELD https://ggus.eu/ws/ticket_info.php?ticket=91990 (On hold, extended 5/4)
Just some worker nodes left at Sheffield. Looking good. (But Elena has some publishing issues - see TB-SUPPORT.)

BRUNEL https://ggus.eu/ws/ticket_info.php?ticket=91975 (On hold)
Raul upgraded his CE, only to find that the Nagios tests haven't picked up the upgrade! Daniela suggests a site BDII restart. Update - Raul seems to have figured out an arcane way of getting the publishing to work by yaiming twice then restarting the site BDII.

DURHAM https://ggus.eu/ws/ticket_info.php?ticket=92804 (In progress, extended 5/4)
Not much news from Mike about this in the last few weeks - I think that he's in the same boat as Sam - technically up to date (just from the "wrong" repo).

COMMON OR GARDEN TICKETS:

OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=92688 (20/3)
Brian asked for a data dump, Ewan provided two! Ewan has left the ticket open whilst ATLAS decide what to do with the information. Waiting for reply (2/4)

GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=89804 (18/12/2012)
Moving ATLAS data from the groupdisk token. Last word was from Stephene on 3/3, asking for a dump of what remains. I think that the conversation has moved offline to expedite things. How goes it? On hold (3/3)

https://ggus.eu/ws/ticket_info.php?ticket=92691 (20/3)
Glasgow supplied Brian with a list of all the files on the SE, and Brian has given back a list of all the "dark data" files that they couldn't delete remotely. In progress (8/4)
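For anyone following along, the comparison behind that exchange is essentially a set difference between an SE dump and a catalogue dump. A minimal sketch in Python (the input file names and the one-path-per-line format are assumptions for illustration):

   # Minimal "dark data" check: files present in the SE dump but absent from
   # the catalogue dump are dark (unaccounted-for) data; files catalogued but
   # missing from the SE dump are lost/ghost entries.
   # Inputs are assumed to be plain text, one path per line.

   def load_paths(filename):
       """Read one path per line, ignoring blank lines."""
       with open(filename) as f:
           return set(line.strip() for line in f if line.strip())

   def compare(se_dump="se_files.txt", catalogue_dump="catalogue_files.txt"):
       on_se = load_paths(se_dump)
       in_cat = load_paths(catalogue_dump)
       dark = sorted(on_se - in_cat)      # on disk, unknown to the catalogue
       lost = sorted(in_cat - on_se)      # catalogued, but not on disk
       return dark, lost

   if __name__ == "__main__":
       dark, lost = compare()
       print("%d dark files, %d lost files" % (len(dark), len(lost)))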

https://ggus.eu/ws/ticket_info.php?ticket=93036 (2/4)
Glasgow were being bitten by stage-in failures after disk server stress killed the xrootd service on a node. Measures have been put in place to stop this happening again, and Sam has said some wise words on this issue (as it was data-hungry production jobs that caused the deadly stress). Sam suggests that it would be beneficial to have these data-hungry production jobs flagged in some way, so that they can be treated similarly to how analysis jobs are (staggered starts, limiting the maximum number running etc.). In progress (5/4)

This raises the question, is it likely that suggestions put in a ticket like this would work their way up the chain to someone who could act on them?
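Purely as an illustration of what Sam is suggesting, here is a toy sketch in Python of throttling flagged data-hungry jobs. The cap, the stagger interval and the job-flagging mechanism are all assumptions; a real implementation would sit in the batch system or pilot framework configuration, not a standalone script.

   # Toy sketch: release flagged "data-hungry" jobs only while under a
   # concurrency cap, with a staggered start so many stage-ins don't hit the
   # storage at once. Parameters and the flagging mechanism are assumptions.
   import time

   MAX_DATA_HUNGRY = 20      # assumed cap on concurrently running flagged jobs
   STAGGER_SECONDS = 30      # assumed gap between successive starts

   def release_flagged_jobs(queued_jobs, running_count, start_job):
       """Start queued flagged jobs one at a time while under the cap."""
       for job in queued_jobs:
           if running_count >= MAX_DATA_HUNGRY:
               break                      # leave the rest queued for now
           start_job(job)
           running_count += 1
           time.sleep(STAGGER_SECONDS)    # staggered start
       return running_count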

DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=92590 (18/3)
LHCb were having what look like authorisation problems at Durham. Not much news on the ticket since then, does the problem persist? On hold (2/4)

MANCHESTER
https://ggus.eu/ws/ticket_info.php?ticket=93179 (8/4)
ATLAS would like 5 TB shuffled from localgroupdisk to datadisk. Assigned (8/4)

LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=93160 (7/4)
ATLAS were suffering transfer failures, which puzzled the Liver lads as their logs showed the transfers succeeding. It could have been a problem with the University firewalls - the timing of the problems coincided with a change in the Uni firewall. These have been reverted, so let's see if things go back to normal. In progress (8/4)

LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=91304 (8/2)
LHCb jobs were running in the tidgey home partition on the Lancaster shared cluster. I've tried to put in place a job wrapper that cds to $TMPDIR, but no joy - not sure what I'm doing wrong. On hold (27/3)
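For comparison, a minimal sketch of the intended wrapper logic in Python (the payload invocation and the fallback directory are placeholders, not the actual Lancaster configuration):

   #!/usr/bin/env python
   # Sketch: move into the per-job scratch area ($TMPDIR as set by the batch
   # system) before launching the payload, so jobs don't run in the small
   # shared home partition. Payload command and fallback dir are placeholders.
   import os
   import subprocess
   import sys

   def main(payload_argv):
       scratch = os.environ.get("TMPDIR") or "/tmp"   # fall back if TMPDIR unset
       os.chdir(scratch)                              # run payload from scratch space
       return subprocess.call(payload_argv, cwd=scratch)

   if __name__ == "__main__":
       sys.exit(main(sys.argv[1:]))                   # usage: wrapper.py <payload> [args...]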

RHUL
https://ggus.eu/ws/ticket_info.php?ticket=89751 (17/12/12)
Path MTU discovery problems for RHUL. Passed to the networking chaps and Janet, this may be a long time in the solving. On hold (28/1)

https://ggus.eu/ws/ticket_info.php?ticket=92969 (29/3)
Biomed are reporting seeing negative space on the RHUL SE - an old bugbear resurrected. In progress (1/4)

QMUL
https://ggus.eu/ws/ticket_info.php?ticket=93180 (8/4)
QM got a Nagios ticket for the recent APEL troubles; Dan rightly cited the APEL ticket. In progress (8/4)

https://ggus.eu/ws/ticket_info.php?ticket=92951 (29/3)
ATLAS transfer failures, caused by a crash in a disk storage node. Reopened after the initial fix; it looks like a Lustre bug is plaguing the QM chaps. Currently they're hoping for a bug fix, otherwise they'll need to roll back. In progress (8/4)

TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=91658 (20/2)
Chris is requesting WebDAV support on the RAL LFC. The RAL team are waiting on the next LFC version, with better WebDAV support, to come out in production. On hold (3/4)

https://ggus.eu/ws/ticket_info.php?ticket=91029 (30/1)
Long-standing ticket concerning the SRM troubles with certain robot DNs. No fix is likely in the near future. On hold (27/2)

https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/12/12)
Correlated packet loss on the RAL perfSONAR. The picture looks improved after last month's intervention, but it still needs understanding. Proposed to wait until after the May intervention before looking at this hard again. On hold (27/3)

https://ggus.eu/ws/ticket_info.php?ticket=93136 (5/4)
The epic VO is having trouble downloading output from the RAL WMS. Most likely related to known problem https://ggus.eu/ws/ticket_info.php?ticket=92288 (submitted by Jon from T2K). In progress (5/4)

https://ggus.eu/ws/ticket_info.php?ticket=93149 (5/4)
Obviously Friday was the day of tickets. ATLAS were seeing a large number of CVMFS-related cmtside failures. These nodes were testing the latest CVMFS 2.1.8, and have been rolled back. Waiting for reply (8/4)

https://ggus.eu/ws/ticket_info.php?ticket=92266 (6/3)
RAL were having problems with their MyProxy aliases not matching their MyProxy certificates. After trying a few fixes the RAL guys are setting up a new machine where the hostname and certificate match. They aim to have this done within a fortnight. In progress (28/3)

APEL
Just in case you guys haven't been reading TB-SUPPORT, the ticket tracking the current APEL problems:
https://ggus.eu/ws/ticket_info.php?ticket=93183

Tools - MyEGI Nagios

Tuesday 13th November

  • Noticed two issues during the Tier-1 power cut. The SRM and direct CREAM submission tests use the top BDII defined in the Nagios configuration to query for the resource. These tests started to fail because the RAL top BDII was not accessible. The configuration doesn't use BDII_LIST, so I cannot define more than one BDII. I am looking into how to make this more robust.
  • The Nagios web interface was not accessible to a few users because GOCDB was down. This is a bug in SAM-Nagios and I have opened a ticket.

Site availability has not been affected by this issue because Nagios sends a warning alert when it is unable to find a resource through the BDII.
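A hedged sketch of the sort of fallback being looked into: try each top BDII from a list in turn rather than relying on the single host in the Nagios configuration. The host list is illustrative, and a real fix would of course go into the SAM/Nagios probe configuration rather than a standalone script.

   # Illustrative only: query a list of top-level BDIIs in order and use the
   # first one that answers, instead of a single hard-coded host. The host
   # list is an assumption; port 2170 and base "o=grid" are the usual BDII
   # conventions.
   import subprocess

   TOP_BDIIS = ["lcg-bdii.gridpp.rl.ac.uk", "topbdii.example.org"]  # assumed list

   def first_working_bdii(hosts=TOP_BDIIS, base="o=grid"):
       """Return the first BDII host that answers a trivial LDAP query, else None."""
       devnull = open("/dev/null", "w")
       for host in hosts:
           rc = subprocess.call(
               ["ldapsearch", "-x", "-LLL",
                "-H", "ldap://%s:2170" % host,
                "-b", base, "-s", "base", "(objectClass=*)"],
               stdout=devnull, stderr=subprocess.STDOUT)
           if rc == 0:
               return host
       return None

   if __name__ == "__main__":
       print("Using top BDII: %s" % first_working_bdii())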


VOs - GridPP VOMS VO IDs Approved VO table

Monday 8th April

  • Please note Chris W is away this week.
  • Information is being gathered for the Q1 2013 quarterly report.

Tuesday 2 April 2013

Monday 4th March 2013

Monday 26th February 2013

  • NGS VOMS server. Durham fixed. Last site is Glasgow, and I'm running tests now. Hopefully this should now be fixed https://ggus.eu/ws/ticket_info.php?ticket=90356 - note that this has taken 3 months to complete.
  • SNO+ reports lcg-cp timeouts for large files. I suspect this is a problem with the UI.
  • Issues with Proxy renewal.
    • Certificate for RAL myproxy server doesn't match advertised hostname (how does this work at all?).
    • Other myproxy issues as well. GGUS#99105 GGUS#9172

SNO+ Questions

  • Jobs appear to fail, but they have uploaded their output and it is in the LFC.
  • MC production
    • Want 2-3 people managing this
    • Shifters monitoring sites and filing tickets
    • How best to manage certificates - currently upload two proxies to myproxy - one for jobs to renew and one for the UI to renew.
    • How best to do this - should they use a robot cert?


Site Updates

Actions


Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda EVO meeting

Wednesday 10th April

  • Operations report
  • The Tier-1 was disconnected for around 100 minutes yesterday (Tuesday) morning following problems during a scheduled networking upgrade.
  • New disk servers deployed in production (540 TB to AtlasDataDisk; 720 TB to CMSDisk). One batch of the new worker nodes is also in production.
  • The post-mortem review of the February failure of disk server GDSS594 (GenTape), which led to the loss of 68 T2K files, has been completed. See here
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events

Empty

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

  • N/A
To note

  • N/A