Operations Bulletin 171114

From GridPP Wiki

Bulletin archive

Week commencing 10th November 2014
Task Areas
General updates

Tuesday 11th November

  • RAL PPD experience of perfSONAR 3.4 installation and MySQL issues.
  • There is the November GDB and pre-GDB on volunteer computing this week.
  • Meeting minutes for the GridPP technical meeting last Friday are available. We are looking for a better meeting time and at including other topics such as (GridPP) DIRAC progress.
  • Minutes from October's EGI OMB meeting now released (agenda).
    • Resource provider and centre operations level agreements (OLAs) updated
    • EGI.eu OLA covers: Site Availability/Reliability, top-BDII, SAM, support quality, ROD performance. NGI gets a ticket if non-conformant.
    • GOCDB EGI group for NGI Argus instances setup. Inform operations at egi.eu of any changes.
    • New federated cloud procedures have been released.
    • SAM update 23 will include new FTS probes and gstat probes replaced by GLUE2 validator.
    • Fedcloud resources request from: DCH-RP, Astronomy and Genome communities.
    • Do we need the "Monitoring=N and production=Y" combination of flags in GOC DB?
    • EGI circulated a request to pull together interest in SME engagement. We ought to record any involvement and interest.
  • Ian circulated information about an HTCondor workshop 8th/9th December. Vidyo will be available. If you have strong reasons to attend in person and expect GridPP funding please contact Jeremy.
  • Please register for the November 17th HEPSYSMAN meeting at QMUL.

Monday 3rd November

  • The OPS VOMS server setting for gridppnagios.physics.ox.ac.uk was updated on 29th October. It now uses the new VOMS servers (lcg-voms2.cern.ch and voms2.cern.ch). Watch out for failing Nagios tests!
  • The draft October WLCG T2 A/R figures have been released. ALICE. ATLAS. CMS. LHCb.
  • EGI operations will be trying a new approach with broadcasts - perhaps one summary broadcast each month.
  • Are there any outstanding VOMS (e.g. dteam or gridpp) or GOCDB role requests?
  • The agendas for November's GDB and the pre-GDB on volunteer computing are available.
WLCG Operations Coordination - Agendas

Tuesday 11th November

  • There was a WLCG ops coordination meeting last Thursday. (Agenda:Minutes). Highlights from the meeting...
  • The site operations survey will be final and go public after the November GDB.
  • Middleware: WLCG repository to be signed from this week (backwards compatible and new files available with public key).
  • Baselines: gfal/lcgutil unsupported from 1st November. perfSONAR 3.4 docs ready.
  • T0 & T1 services: list of storage and FTS upgrades.
  • Oracle: New IT-DB hardware. Testing now. Move by year end.
  • T0 news: 18 October no proxies VOMS issue. CMS CVMFS briefly down due to machine crash. Looking at AFS UI usage. CERN major release of Service Now (SNOW) on 8th November.
  • T1 & T2 feedback: NTR
  • ALICE: High usage. Fixed pilot issue. RAL ARC CEs now usable. Using new VOMS.
  • ATLAS: Low job load. MC15 expected end of year. Stability issues with Panda and Rucio servers due to package upgrades. Not all sites yet with multi-core.
  • CMS: Last week offline computing week. New MINIAOD campaign. Reviewing critical services. Testing new VOMS. xrootd reconfig ongoing. T0->T1 transfer tests planned for Nov/Dec.
  • LHCb: "legacy Run1 data set" stripping proceeding. DIRAC fixed for IPv6. Tested new VOMS.
  • glexec: PanDA usage rewritten. Deployment campaign and testing (new queues with glexec enabled) ongoing.
  • Machine/job features: NTR
  • Multicore: Passing parameters to batch system discussion started. Limited tests. ATLAS 40% resources now MC. Still 37 sites to move. CMS tested multithreaded reconstruction at PIC, moving to all T1s. T2 candidates to be tested. OSG looking at Glue2 needs.
  • MW readiness: Various verifications for mid-to-end of November. Dashboard prototype expected this month.
  • SHA-2: Waiting on CMS and ATLAS news (re. new VOMS). OSG release with new VOMS on 11th November.
  • IPv6: NTR
  • Squid monitoring/HTTP proxy: Squid registration instructions ready. Automated MRTG monitor almost ready. Campaign starts soon.
  • Network & transfer metrics: Request perfSONAR 3.4 upgrades by 8th Dec. Various PS checks being done and configurations checked.
  • Next meeting 20th November.

Tuesday 28th October

Tuesday 14th October

Tier-1 - Status Page

Tuesday 11th November

  • Reboot of our Tier1 Router (that connects us to the rest of the RAL site) took place successfully last Wednesday morning (5th Nov).
  • Nothing else to report.
Storage & Data Management - Agendas/Minutes

Wedn 12 Nov

  • Update on CEPH with xroot. It works...

Wedn 05 Nov

  • T2K as a "case study" of how to do things right? Good example of a non-LHC VO that knows where its towel is.

Wedn 29 Oct

Wedn 22 Oct

  • Martin Bly (T1) - HEPiX report

Wedn 15 Oct

Wedn 01 Oct

  • Summary of all the exciting events in Amsterdam last week - EUDAT, EGI big data, RDA
  • DPM 1.8.9 early testing, and (separately) xroot4 early-ish testing. Supporting multiple VOs in one xroot server.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 11th November

  • A reminder to please update HEPSPEC06 figures with new equipment benchmarks.
  • Please check your GridPP metrics lines in Steve's tables and report any issues.

Tuesday 4th November

  • Sussex and Sheffield publishing issues still apparent.

Monday 27th October

  • Sites considering moving to HTCondor should be aware that prototype APEL parsers for HTCondor are in use, so if you continue using CREAM as your CE you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM should move it to the Historical Archive or otherwise dispose of the table.

Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.

Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.

Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Tuesday 4/11

  • Next meeting 14th November: Status of SAM3 rollout (proposed topic)
On-duty - Dashboard ROD rota

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.

Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast, together with other communications, to reduce the number of broadcasts sent to sites.


Security - Incident Procedure Policies Rota

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now a set of SHA-2 intermediate CAs in addition to the old ones.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 11th November

  • Target date for perfSONAR 3.4 upgrades is 8th December.

Tuesday 4th November

  • perfSONAR 3.4+ install/update instructions are ready. More details will be included in the WLCG broadcast to all sites planned for later today.

Tuesday 28th October

  • Have the perfSONAR 3.4 instructions/documentation been updated yet? Last week volunteers were sought at the...
  • perfSONAR operations meeting took place on 22nd October.
    • There is a recommendation for sites supporting IPv6 to deploy perfSONAR dual-stack.
    • Concerned about Tier-3s requesting to join Tier-2 meshes.
    • A network transfer metrics wiki page is available.


Friday 14th of November 2014

With HEPSYSMAN this week there won't be a proper look at the tickets on the 18th, so please can you check yourselves:




Tools - MyEGI Nagios

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 11th November 2014

  • Status of CERN@School data

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate
  • Northgrid support on RAL WMS being checked
  • Gfal-copy and castor issue

Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
  • Impact
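
The repository rename and variable update noted above can be sketched as follows. This is only an illustrative sketch, not a site configuration procedure: the VO name "myvo" is hypothetical, the /cvmfs mount root is an assumption, and so is the rule used to derive the VO_<VONAME>_SW_DIR variable name (upper case, with non-alphanumeric characters replaced by underscores). Check the values against your own site configuration.

```python
import re

# Old and new repository domains, as given in the bulletin.
OLD_DOMAIN = ".gridpp.ac.uk"
NEW_DOMAIN = ".egi.eu"


def migrate_repo_name(repo):
    """Map <voname>.gridpp.ac.uk to <voname>.egi.eu; leave other names unchanged."""
    if repo.endswith(OLD_DOMAIN):
        return repo[:-len(OLD_DOMAIN)] + NEW_DOMAIN
    return repo


def sw_dir_variable(voname):
    """Build the VO_<VONAME>_SW_DIR variable name.

    Assumption: the VO name is upper-cased and every non-alphanumeric
    character becomes an underscore.
    """
    return "VO_" + re.sub(r"[^A-Za-z0-9]", "_", voname).upper() + "_SW_DIR"


def sw_dir_setting(voname, old_repo, mount_root="/cvmfs"):
    """Return a NAME=value line pointing at the renamed repository.

    Assumption: repositories are mounted under mount_root/<repository name>.
    """
    return f"{sw_dir_variable(voname)}={mount_root}/{migrate_repo_name(old_repo)}"


if __name__ == "__main__":
    # "myvo" is a hypothetical VO used only for illustration.
    print(sw_dir_setting("myvo", "myvo.gridpp.ac.uk"))
```

Names that do not end in .gridpp.ac.uk are returned unchanged, so the helper is safe to run over a mixed list of repositories.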
Site Updates

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 12th November 2014

  • Operations report
  • Some problems on Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the numbers of reads using xroot and has led to some SAM test failures.
  • The provisional date for safety testing of circuits in the machine room is the week of 12-16 January '15. Services will be 'at risk' during this time.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A