Operations Bulletin 260115

From GridPP Wiki

Bulletin archive

Week commencing 19th January 2015
Task Areas
General updates

Tuesday 20th January

  • A GridPP technical meeting took place on Friday. Minutes are available.
  • There has been agreement on CVMFS repository naming for the regional VOs e.g. /cvmfs/londongrid.gridpp.ac.uk
  • Glasgow WNs and gcc-gfortran issue (masking user setup problems?)
  • EGI-Engage project evaluation was very positive (15/15).
  • Minutes of last Wednesday's GDB are now available. The GDB action list has been updated.
    • Still looking for feedback on ginfo
    • PerfSonar to be reinstalled/enabled at sites
    • Squid registration in GOCDB
    • Test Machine Job Features approach
    • Participate in MW readiness working group
    • Looking for more contributors for ARGUS testing/support.

Thursday 8th January

Tuesday 30th December

  • Issues with Tier-1 switch 24th/25th December. GOCDB switchover.

Tuesday 23rd December

  • Accounting MC:
    • NGIs to check MC reporting (ncores=0 if parallel flag not set). CREAM requires EMI3 APEL (Others/ARC CEs use SSM2 or JURA).
    • Sites need to check all CEs. VOs check their sites.
    • Once parallel=true is set, sites will need to republish to backdate. Portal Multicore View
    • sites view
    • APEL client uses SSLv3 – new version of SSM released and in testing. Sites will soon be asked to upgrade SSM rpm.
    • Alerts/Linux-2014-12-17: Critical vulnerability – kernel update required.
    • Two recent incidents: 7782 and 7765.
    • Cloud security: evaluating questionnaires. EGI-CSIRT collaborating with FedCloud. (F2F in Jan).
    • EGI-CSIRT/WLCG collaboration Pakiti and MW readiness WG.
    • Central suspension: NGI ARGUS services deployed. Info in GOCDB. Monitoring to check ban updates.
    • Update-23: Probes to UMD. GridMon removed. New documents. Tested. Released 9th December.
    • New tests: FTS3-Service & StalledTransfers and GLUE2-validate (but failed on ARC-CEs). Various WN tests removed.
    • Issues: xrootd protocol on DPM 1.8.9 and NCGLogFiles. SSLv3 move after APEL testing.
  • OLA/SLA framework
    • 3 phases: 1. RC+RP+TP SLAs. 2. +EGI.eu+EGI.eu fed SLA. 3. +TP UA+E-Grant (User SLA+User OLA).
    • EGI User SLA just sets expectations between EGI and VO: scope; service hours; services and support; communication; security and responsibilities.
  • Operational Tools
    • Release 3.1.1: remove VO ID card approval. Ticket management. Bug fixes.
    • Release every 2 months. There is now an Advisory and Testing Board.
    • GOCDB – next release authentication change allows anon access to selected pages.
    • GGUS – 5 releases in PY5. Failover improved. SSLv3 disabled. SAM/ARGO: ARGO will replace SAM in 2015 (for the start of EGI-Engage): enhanced customisation; better metric store. Will work standalone.
    • Application database: 9 major releases in PY5: Cloud marketplace; support for eduGAIN; OpenAIRE’s API.
    • E-Grant

WLCG Operations Coordination - Agendas

Tuesday 20th January

Thursday 8th January

Thursday 18th December

  • There was a WLCG operations meeting today. Agenda: Minutes.
    • News: Sites are requested to enable multicore accounting (details): for EMI3 CREAM, edit /etc/apel/parser.cfg and set the attribute parallel=true. Site managers are requested to complete the WLCG survey (http://cern.ch/wlcg-survey). The MW Readiness WG will participate in the ARGUS testing. No immediate issue with WLCG support for sites belonging to NGIs leaving EGI.
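
As a quick way to check the multicore setting described above, the following is a minimal Python sketch that reads an INI-style configuration and reports whether parallel is enabled. The [batch] section name and the other keys in the sample text are assumptions for illustration; only the parallel attribute and the /etc/apel/parser.cfg path come from the bulletin, so check your own file before relying on this.

```python
import configparser

# Hypothetical sample of a parser.cfg. The [batch] section name and the
# "enabled"/"type" keys are assumptions; only "parallel" is from the bulletin.
SAMPLE_CFG = """
[batch]
enabled = true
type = PBS
parallel = true
"""


def multicore_enabled(cfg_text, section="batch"):
    """Return True if `parallel` is set to a true value in the given section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback=False covers a missing section or a missing option
    return parser.getboolean(section, "parallel", fallback=False)


print(multicore_enabled(SAMPLE_CFG))
```

On a real CE the text would be read from /etc/apel/parser.cfg rather than from the sample string.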

Monday 15th December

  • The next WLCG ops coordination meeting is this Thursday: Agenda. Input notes.
  • Are there any UK issues that we want to raise?
  • There was a WLCG critical services meeting on 12th. Draft minutes have been linked from the agenda.


Tier-1 - Status Page

Tuesday 20th January

  • Safety testing of electrical circuits in the R89 machine room is underway. This started last week and will continue through to this Thursday (22nd Jan). We have declared an At Risk for the Tier1 site in the GOC DB for this work. Some problems have been encountered, but so far no critical services have been affected.
  • The delayed upgrade of the Castor Nameservers to SL6 took place successfully yesterday, Monday (19th Jan).
  • The problems with our primary network router (Christmas Eve, Christmas Day) are being followed up with the vendor. Some testing of the unit to reproduce the fault is anticipated and will be announced in advance.
Storage & Data Management - Agendas/Minutes

Wedn 21 Jan

  • Wahid's report on "protocol zoo" - can the set of protocols be simplified (and how long would it take)
  • Update from RAL's CEPH team.

Wedn 07 Jan

  • HNY. Most sites ticked over with relatively few glitches.
  • Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
  • Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

  • Update from T1 CEPH team.
  • Last meeting of year. No updating and configuring or touching! Merry and Happy!

Wedn 10 Dec

  • An audience with NA62

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 20th January

  • Some APEL delays seen for ECDF and Sheffield.

Tuesday 13th January

  • Some glitches over holiday period when RAL network lost, but site publishing now looks fine across all sites.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 12th January

  • There was an EGI ops meeting on Monday 12th January. Minutes to follow.
    • STORM 1.11.5
    • SR: Another reminder to check your contacts
    • Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
    • Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the APEL config where appropriate: instructions given to test publishing in accounting-devel.
    • EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).

Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Tuesday 20th January

  • Last week was quiet.
  • Kashif + Gordon (shadowing) next week.
  • The ROD work is to be included in the GridPP5 D/O/S activity.

Monday 13th January

  • A couple of tickets for systems with low availability.
  • There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
  • Tier-1 aware of long standing ticket.
Rollout Status WLCG Baseline

Tuesday 20th January

  • From Cristina's GDB talk last week, note that EMI repositories will be frozen and the product team releases will become UMD-preview (repository content and webpages similar but now managed).

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


Security - Incident Procedure Policies Rota

Tuesday 20th January

  • DPM configuration instructions.
  • Status of CVE-2014-9322.
  • Steve's old perf package false positive.

Tuesday 13th January

  • EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
  • Any issues over holiday period?

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 20th January

Tuesday 16th December

  • For the PS dashboard for the time being use this link.
  • As of today the following have no data: RHUL; Sheffield; ECDF; Bristol and Sussex.

Monday 8th December

  • Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)

Monday 19th January 2015, 14.30 GMT
23 Open UK Tickets this week.

CMS seeing AAA test failures at RAL. The tests have been restarted recently and now seem to be having some suspicious looking authentication failures. In Progress (13/1)

Atlas complaining about httpd doors not working on Sheffield's SE. After schooling the submitter in how to submit more useful information Elena is working on it. I bring this up as in the last few days I've had quite a few of my pool nodes have their httpd daemons crash on them (they're up to date, but still SL5), which may or may not be related. In Progress (19/1)

ECDF "low availability" ticket after a few days of argus trouble, which Wahid fixed. Now the ticket will languish for a few weeks as the alarm clears. Daniela has reminded us of her ticket against these fairly silly alarms: https://ggus.eu/?mode=ticket_info&ticket_id=107689 . In the meantime this ticket could do with being put On Hold whilst the alarm clears. In progress (19/1)

Setting up VMCatcher at 100IT. After some troubles things seem to be looking up, although the 100IT chaps still have some questions about the configurations and what they should be using that aren't getting answers. I set the ticket to "Waiting for Reply" hoping that this will help get the attention of those in the know. Waiting for reply (15/1)

Perfsonar Tickets
110382(TIER 1)
Everyone seems to have updated their perfsonar hosts, so we're all good on that front, but a number of sites are either having trouble with their reinstalled hosts, or are having problems that they had pre-reinstall still haunt them. I'm afraid I have no suggestions of what to do about the growing number of these tickets though!

Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:


Tests removed:


Release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015

Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Monday 8th December 2014

  • Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
  • For each VO, in any certificate whose CA_DN field was /DC=ch/DC=cern/CN=CERN Trusted Certification Authority, replace it with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
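
The CA_DN change above amounts to a simple exact-match substitution. The following is a minimal Python sketch of it, assuming the certificate entries are held as a list of dictionaries with a CA_DN key; that record layout and the hostnames are hypothetical and are not the Ops Portal's actual format. Only the two DN strings come from the bulletin.

```python
OLD_CA_DN = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA_DN = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(certificates):
    """Return copies of the records with the old CA_DN swapped for the new one."""
    updated = []
    for cert in certificates:
        cert = dict(cert)  # copy, so the caller's records are left untouched
        if cert.get("CA_DN") == OLD_CA_DN:
            cert["CA_DN"] = NEW_CA_DN
        updated.append(cert)
    return updated


# Hypothetical records for illustration only.
example = [
    {"hostname": "voms.example.org", "CA_DN": OLD_CA_DN},
    {"hostname": "other.example.org", "CA_DN": "/DC=org/CN=Some Other CA"},
]

for cert in update_ca_dn(example):
    print(cert["hostname"], "->", cert["CA_DN"])
```

Matching on the full DN string, rather than a substring, avoids touching entries issued by any other CA.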

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data
Site Updates

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
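
The percentages quoted in this status list follow from the YES counts over the 19 sites tracked. A minimal Python sketch of that arithmetic, using the counts from the lists above (the dictionary labels are shorthand chosen here, not names from the wiki tables):

```python
TOTAL_SITES = 19  # every YES/NO pair in the list above sums to 19

# YES counts taken from the bulletin; labels are illustrative shorthand.
yes_counts = {
    "Multicore queues": 12,
    "Cloud/VMs": 5,
    "GridPP DIRAC jobs": 11,
    "IPv6 allocation": 8,
    "Dual stack nodes": 4,
}


def percent(yes, total=TOTAL_SITES):
    """Share of sites answering YES, rounded to a whole percentage."""
    return round(100 * yes / total)


for label, yes in yes_counts.items():
    print(f"{label}: {yes}/{TOTAL_SITES} = {percent(yes)}%")
```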

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 21st January 2015

  • Operations report
  • Only ALICE left using CREAM CEs now and they should be finished with them by 16th February. Will decommission then.
  • Electrical power circuit testing largely done (completes tomorrow). So far no effect on critical services.
  • Some Castor updates done (SL6 updates on Nameservers; addition of a third SRM node for LHCb).
  • Planning some testing of the primary Tier1 router to investigate why it has given problems.
  • Rolling update of OS on Castor disk servers next week.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A