Operations Bulletin 151214

From GridPP Wiki

Bulletin archive

Week commencing 8th December 2014
Task Areas
General updates

Monday 8th December

  • WLCG Operations Coordination Team requests that all sites now register their Squid services in GOCDB or OIM (broadcast on 4th December). Instructions are available.
  • This week 8th-9th December there is an HTCondor workshop taking place - Vidyo is available and most talks are being uploaded. We may have a summary next week.
  • There is a GDB this week. Let Jeremy know if you want issues raised or discussed.
  • The WLCG survey is now in its second week. All GridPP sites are expected to respond as the results will help not just WLCG but GridPP too! As of last week only a small number had responded.
  • Steve has noticed an issue with the ops portal VOMS information being slow to change following the change of CERN VOMS endpoints last week.
  • SAM3 results for November have been circulated. Re-computations are to be requested before 14th December.
  • There is an EGI OMB meeting on 18th December. Are there any UK matters we want raised? Also, we have been asked to share with other NGIs activities taking place in the UK that may be of benefit and interest to the other NGIs - does anyone have suggestions to put forward?
  • Are there any outstanding issues with the UK CA certificate reminders/renewals? Is there any uptake of John Kewley's Nagios scripts for checking the cert expiry status?

Tuesday 2nd December

  • WLCG Overview Board (OB) met on Friday 28th November. Ian Bird's status report gives a summary of resource usage, projections and current project directions (data preservation, RUN-2 preparations etc.).
  • There is an ATLAS jamboree 3rd-5th December 2014.
  • Certificate renewal email reminders were not working from 3rd November to 1st December. Nagios may have reminded you... but if not, contact John Kewley for the Nagios scripts.
  • The old VOMS servers 'voms.cern.ch' and 'lcg-voms.cern.ch' _were?_ switched off for good and replaced by 'voms2.cern.ch' and 'lcg-voms2.cern.ch' on Wednesday 26th November at 15:00 CET.
WLCG Operations Coordination - Agendas

Monday 8th December

    • Baselines: FTS 3.2.30 baseline.
    • M/W issues: DPM 1.8.9 logging fixed.
    • T0&T1 services: Various upgrades to dCache 2.10.13
    • Oracle: Continued progress on upgrades & migration.
    • T0 news: voms2.cern.ch and lcg-voms2.cern.ch in use since 26th November. Still looking at AFS UI statistics (run till 2nd Feb) – still a lot of use, e.g. from CMS VO boxes. Issues with users editing voms-admin emails (these need to match the HR DB).
    • T1: NTR
    • T2: NTR
    • ALICE: High activity. Progressing ARC CE SAM tests.
    • ATLAS: NTR – jamboree
    • CMS: DIGI-RECO at T1s. Various MC T2s. VOMS migration – smooth, some sites had to update Phedex machines.
    • LHCb: Final checks for stripping 21. VOMS migration – users referring to many UI, afs, cvmfs places… some were not ready.
    • gLExec: testing campaign for PanDA.
    • Machine/Job features: Agreed protocol for virtualized environments. Need all implementations in repository.
    • MW readiness: dmlite 0.7.2 on EPEL stable. dCache 2.11.0 verified for ATLAS. New jira dashboard for following progress. WLCG package reporter rebranded as Pakiti v3 (in EPEL).
    • Multicore: CMS testing PromptReco multithreaded jobs; tests on CMS T2s awaiting testbed (pilot factory).
    • SHA-2: New VOMS – ATLAS 24th, others 26th. AFS UI and CVMFS UI config quickly fixed on 26th. Some PhEDEx’s and LHCb user private scripts needed fix. EGI & WLCG broadcast made on 5th. Some VO cards need updating.
    • IPv6: NTR
    • Squid & HTTP proxy: Monitoring page (auto-gen from GOCDB) now supports multiple squid services. All sites about to be asked to register.
    • Network & transfer metrics: Waiting on experiment use-cases and other inputs. Strawman for early 2015. perfSONAR deadline 8th January.

Monday 1st December

  • WLCG ops coordination team have launched a survey. Please could all GridPP sites respond to it by 19th December.


Tier-1 - Status Page

Tuesday 2nd December

  • The headnodes of the LHCb Castor instance are being upgraded to SL6 today, 10:00-14:00.
  • Investigating problems on the CMS Castor instance.
Storage & Data Management - Agendas/Minutes

Wedn 10 Dec

  • An audience with NA62

Wedn 03 Dec

  • Should we support DIRAC data management?

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting

Wedn 19 Nov

  • Logs: chatty DPM 1.8.9, and elasticsearching logs.
  • Reports and other interesting things from workshops: cloud data transfer and sync, HEPiX, hepsysman.
  • RAID controllers for 36 bay nodes

Wedn 12 Nov

  • Update on CEPH with xroot. It works...

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Monday 8th December

  • Sheffield given heads-up about APEL issue. Now fixed?

Tuesday 25th November

  • All sites approximately up-to-date.

Tuesday 11th November

  • A reminder to please update HEPSPEC06 figures with new equipment benchmarks.
  • Please check your GridPP metrics lines in Steve's tables and report any issues.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM is to move it to the Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.

Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Monday 8th December

On-duty - Dashboard ROD rota

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.

Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month through a summary broadcast combined with other communications, to reduce the number of broadcasts sent to sites.


Security - Incident Procedure Policies Rota

Monday 8th December

  • Note ADVISORY [EGI-SVG-2014-7696]

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Monday 8th December

  • Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)

Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.

Monday 8th December 2014, 14.45 GMT.

38 Open UK Tickets this week.
A few of the perfsonar upgrade tickets are still open (although most don't seem to be stalled per se, I see that a lot of them are of the form "we didn't quite get time to finish it this week"). We have a number of nagios tickets open - two belonging to Lancaster (our premature DPM upgrade is biting us) - Gareth opened a few of them this morning. I also see a couple of LHCB cvmfs tickets - it looks like the lhcb cvmfs areas might be "clogging up" at sites and are probably worth a preemptive health check.

Matt M's ticket to the Tier 1 concerns not being able to get the gfal commands to work when accessing Castor. Duncan has posted to the ticket that things are working for him now, along with the details of his setup. On Hold (4/12)

Duncan ticketed the Tier 1 about not being able to access the LFC via webdav. Catalin fixed a few misconfigurations on the LFC, but notes the limitations concerning VOMS proxies and browsers (i.e. they don't work together), and proposes to close the ticket. Waiting for reply (4/12)

This "http at Liverpool not quite working" ticket raised some valid points about what Atlas wants/expects from its http access. I think the original problem is "fixed", which leaves this ticket in danger of limbo-ing (like me after a few too many Pina Coladas). Sadly there was no atlas cloud meeting last week to bring this up at. In Progress (2/12)

ECDF got an atlas transfer ticket, but as Andy correctly pointed out the rest of the UK cloud isn't looking pretty at all on the DDM matrix. Why did poor old Edinburgh get singled out for a ticket? Waiting for reply (8/12)

That's about it for the tickets that really caught my eye. Feel free to bring up any tickets that you think I should have picked up on in the meeting.

Tools - MyEGI Nagios

Tuesday 25th November

The backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and the probes have moved from the SAM repository to the UMD3 repository.

Tests added:


Tests removed:


The release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

The VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.

Blog about VO Nagios


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Monday 8th December 2014

  • Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
  • For each VO, in any certificate entry whose CA_DN field was /DC=ch/DC=cern/CN=CERN Trusted Certification Authority, replace that value with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
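
As a rough illustration of the CA_DN change described above, here is a minimal Python sketch. The record layout (a list of dicts with "VO" and "CA_DN" keys), the function name and the sample records are assumptions made for this example; they are not the Ops Portal's actual data format. Only the two DN strings come from the bulletin.

```python
# The two DN strings are taken from the bulletin; everything else is hypothetical.
OLD_CA_DN = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA_DN = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(records):
    """Return copies of the records with the old CA_DN swapped for the new one.

    Records whose CA_DN is anything else are copied through unchanged.
    """
    updated = []
    for rec in records:
        rec = dict(rec)  # copy, so the caller's data is not modified
        if rec.get("CA_DN") == OLD_CA_DN:
            rec["CA_DN"] = NEW_CA_DN
        updated.append(rec)
    return updated


if __name__ == "__main__":
    # Hypothetical sample entries, for illustration only.
    sample = [
        {"VO": "ATLAS", "CA_DN": OLD_CA_DN},
        {"VO": "OPS", "CA_DN": "/DC=example/CN=Some Other CA"},
    ]
    for rec in update_ca_dn(sample):
        print(rec["VO"], rec["CA_DN"])
```

Matching on the exact string keeps entries signed by any other CA untouched.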

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data
Site Updates

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
      • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
      • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
    • Dual stack nodes - 21%
      • YES: Brunel; IC; QMUL; Oxford (4)
      • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
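
The percentages quoted in this list appear to be the YES count divided by the 19 sites listed, rounded to the nearest whole number. A small Python sketch of that arithmetic follows; the function and variable names are made up for the example, and the total of 19 is inferred from the YES and NO counts rather than stated in the bulletin.

```python
def percent_yes(yes_count, total_sites):
    """Share of sites answering YES, as a whole-number percentage."""
    return round(100 * yes_count / total_sites)


# Assumption: 19 sites in total, inferred from the YES + NO counts in the list.
TOTAL_SITES = 19

# YES counts as given in the bulletin.
yes_counts = {
    "Multicore queues": 12,
    "Cloud/VMs": 5,
    "GridPP DIRAC jobs": 11,
    "IPv6 allocation": 8,
    "Dual stack nodes": 4,
}

if __name__ == "__main__":
    for name, yes in yes_counts.items():
        print(f"{name}: {percent_yes(yes, TOTAL_SITES)}%")
```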

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 26th November 2014

  • Operations report
  • Some problems on Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the numbers of reads using xroot and has led to some SAM test failures.
  • Provisional dates for safety testing of circuits in the machine room are Tues-Thu in the weeks of 13-15 & 20-22 January '15. Services will be 'at risk' during this time.
  • Provisional dates announced for upgrades of Castor headnodes to SL6, starting with LHCb next Tuesday (2nd Dec).
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A