Operations Bulletin 050115

From GridPP Wiki
Revision as of 15:24, 8 January 2015 by Jeremy Coles 4cb4ce56a7


Bulletin archive


Week commencing 5th January 2015
Task Areas
General updates

  • Issues with Tier-1 switch 24th/25th December. GOCDB switchover.

Tuesday 23rd December

  • Accounting MC:
    • NGIs to check MC reporting (ncores=0 if parallel flag not set). CREAM requires EMI3 APEL (Others/ARC CEs use SSM2 or JURA).
    • Sites need to check all CEs. VOs check their sites.
    • Once parallel=true is set, sites will need to republish to backdate. Portal Multicore View
    • sites view
    • APEL client uses SSLv3 – new version of SSM released and in testing. Sites will soon be asked to upgrade SSM rpm.
  • EGI CSIRT
    • Alerts/Linux-2014-12-17: critical vulnerability – kernel update required.
    • Two recent incidents: 7782 and 7765.
    • Cloud security: evaluating questionnaires. EGI-CSIRT collaborating with FedCloud. (F2F in Jan).
    • EGI-CSIRT/WLCG collaboration Pakiti and MW readiness WG.
    • Central suspension: NGI ARGUS services deployed. Info in GOCDB. Monitoring to check ban updates.
  • SAM OPS
    • Update-23: Probes to UMD. GridMon removed. New documents. Tested. Released 9th December.
    • New tests: FTS3-Service & StalledTransfers and GLUE2-validate (but failed on ARC-CEs). Various WN tests removed.
    • Issues: xrootd protocol on DPM 1.8.9 and NCGLogFiles. SSLv3 move after APEL testing.
  • OLA/SLA framework
    • 3 phases: 1. RC+RP+TP SLAs. 2. +EGI.eu+EGI.eu fed SLA. 3. +TP UA+E-Grant (User SLA+User OLA).
    • EGI User SLA just sets expectations between EGI and VO: scope; service hours; services and support; communication; security and responsibilities.
  • Operational Tools
    • Release 3.1.1: remove VO ID card approval. Ticket management. Bug fixes.
    • Release every 2 months. There is now an Advisory and Testing Board.
    • GOCDB – next release authentication change allows anon access to selected pages.
    • GGUS – 5 releases in PY5. Failover improved. SSLv3 disabled. SAM/ARGO: ARGO will replace SAM in 2015 (for start EGI-Engage): enhance customisation; better metric store. Will work standalone.
    • Application database: 9 major releases in PY5: Cloud marketplace; support for eduGAIN; OpenAIRE’s API.
    • E-Grant


Monday 15th December

  • Note Jens's list of sites with certificates at risk of expiring over the holiday period: RAL; Durham; IC; Sussex and UCL.
  • All sites are encouraged to complete the WLCG operations survey.
  • The top-BDII at Imperial was unavailable for a period on 12th December due to a water leak in the building. This impacted the SAM tests.
  • Summary notes from the December GDB are available.
  • A Condor Workshop (pre-GDB) took place 8th and 9th December. Again there are summary notes available.
  • An ARGUS Futures and Support workshop (post-GDB) was held on 11th December. A summary of the meeting is now online. The overall conclusion was that there is sufficient community effort to maintain ARGUS.
  • MJ has circulated a reminder of the HEP S/W workshop taking place 20th-21st January. It is not just for HEP (e.g. also astro and nuclear physics). There are now general and s/w mailing lists.
  • A reminder to register your Squid services in GOCDB (instructions linked below). (We had some interest in being UMD early adopters for Squid - thanks).
WLCG Operations Coordination - Agendas

Thursday 18th December

  • There was a WLCG operations meeting today. Agenda: Minutes. A summary follows:
    • News: Sites requested to enable Multicore accounting (details) - for EMI3 CREAM you edit /etc/apel/parser.cfg and set the attribute parallel=true. Site managers requested to complete the WLCG survey (http://cern.ch/wlcg-survey). MW Readiness WG will participate in the ARGUS testing. No immediate issue with WLCG support for sites belonging to NGIs leaving EGI.
    • Baselines:
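
As a rough illustration of the multicore accounting check in the news item above, the following Python sketch reports whether a parallel attribute is set to true in an APEL parser configuration. It assumes the file is INI-style and readable by configparser. The function name and the sample text (including its section name) are hypothetical; only the path /etc/apel/parser.cfg and the attribute parallel=true come from the notes.

```python
import configparser


def parallel_flags(cfg_text):
    """Return {section: bool} for every section that defines 'parallel'."""
    # Interpolation is disabled so stray '%' characters in other values are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    flags = {}
    for section in parser.sections():
        if parser.has_option(section, "parallel"):
            flags[section] = parser.getboolean(section, "parallel")
    return flags


# Hypothetical sample text; a real check would read /etc/apel/parser.cfg instead.
SAMPLE = """
[batch]
enabled = true
parallel = false
"""

if __name__ == "__main__":
    for section, enabled in parallel_flags(SAMPLE).items():
        state = "enabled" if enabled else "NOT enabled"
        print(f"[{section}] multicore reporting {state}")
```

Scanning every section rather than naming one avoids relying on a particular layout of the file, which may differ between APEL versions.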


Monday 15th December

  • The next WLCG ops coordination meeting is this Thursday: Agenda. Input notes.
  • Are there any UK issues that we want to raise?
  • There was a WLCG critical services meeting on 12th. Draft minutes have been linked from the agenda.

Monday 8th December

    • Baselines: FTS 3.2.30 baseline.
    • M/W issues: DPM 1.8.9 logging fixed.
    • T0&T1 services: Various upgrades to dCache 2.10.13
    • Oracle: Continued progress on upgrades & migration.
    • T0 news: voms2.cern.ch and lcg-voms2.cern.ch in use since 26th November. Still looking at AFS UI statistics (run till 2nd Feb) – still a lot of use e.g. from CMS VO boxes. Issues with users editing voms-admin emails (needs to match HR DB).
    • T1: NTR
    • T2: NTR
    • ALICE: High activity. Progressing ARC CE SAM tests.
    • ATLAS: NTR – jamboree
    • CMS: DIGI-RECO at T1s. Various MC T2s. VOMS migration – smooth, some sites had to update Phedex machines.
    • LHCb: Final checks for stripping 21. VOMS migration – users referring to many UI, afs, cvmfs places… some were not ready.
    • gLExec: testing campaign for PanDA.
    • Machine/Job features: Agreed protocol for virtualized environments. Need all implementations in repository.
    • MW readiness: dmlite 0.7.2 on EPEL stable. dCache 2.11.0 verified for ATLAS. New jira dashboard for following progress. WLCG package reporter rebranded as Pakiti v3 (in EPEL).
    • Multicore: CMS testing PromptReco multithreaded jobs; tests on CMS T2s awaiting testbed (pilot factory).
    • SHA-2: New VOMS – ATLAS 24th, others 26th. AFS UI and CVMFS UI config quickly fixed on 26th. Some PhEDEx’s and LHCb user private scripts needed fix. EGI & WLCG broadcast made on 5th. Some VO cards need updating.
    • IPv6: NTR
    • Squid & HTTP proxy: Monitoring page (auto-gen from GOCDB) now supports multiple squid services. All sites about to be asked to register.
    • Network & transfer metrics: Waiting on experiment use-cases and other inputs. Strawman for early 2015. perfSONAR deadline 8th January.



Monday 1st December

  • WLCG ops coordination team have launched a survey. Please could all GridPP sites respond to it by 19th December.


Tier-1 - Status Page

Tuesday 6th January

  • On Christmas Eve there was a networking problem that affected the Tier1. The primary of our router pair stopped working. However, the problem was not seen by the secondary router and no automated switchover took place. The primary router was manually restarted, triggering the failover, and connectivity was restored. An outage of 70 minutes was declared. Following some discussion the active router was flipped back to the primary during the afternoon to leave us in a resilient configuration for the holidays.
  • During Christmas Day evening, at around 21:30, the problem recurred. Staff (Martin Bly) attended on site and connectivity was restored at around 01:00 on Boxing Day morning. This time the primary router would not restart. Connectivity has been through the secondary router since this incident with no resilience. The problems with the primary router are being followed up and there may be a hardware fault.
  • There is an outage of the Castor GEN instance (Alice and non-LHC VOs) for an upgrade (Castor headnodes to SL6) this morning. There is a 'warning' on the Tier1 Castor tomorrow (Wednesday 7th) for the Castor "nameserver" component to be likewise updated.
  • 10:00 - 11:00 on Wednesday 7th Jan. Quarterly UPS/Generator load test. This is a regular test, but services should be regarded as 'at risk'.
  • Safety testing of electrical circuits in the R89 machine room will take place during working hours on the following days:
    • Tuesday - Thursday 13-15 January
    • Tuesday - Thursday 20-22 January
  • During these days room power supplies will be switched off and circuits tested sequentially. As the racks are all powered from two or more separate power supplies, systems and services should stay up.
Storage & Data Management - Agendas/Minutes

Wedn 07 Jan

  • HNY. Most sites ticked over with relatively few glitches.
  • Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
  • Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

  • Update from T1 CEPH team.
  • Last meeting of year. No updating and configuring or touching! Merry and Happy!

Wedn 10 Dec

  • An audience with NA62

Wedn 03 Dec

  • Should we support DIRAC data management?

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 16th December

  • Over the weekend there has been an error in the APEL repository during the preparation of data to send to the Accounting Portal. This is under investigation.

Monday 8th December

  • Sheffield given heads-up about APEL issue. Now fixed?

Tuesday 25th November

  • All sites approximately up-to-date.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.


Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Monday 15th December

  • Top BDII at Imperial (as well as other services) became unavailable following a water leak at IC leading to many alarms.
  • Three sites have low availability.

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.


Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?

Monday 8th December

  • Note ADVISORY [EGI-SVG-2014-7696]

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 16th December

  • For the PS dashboard for the time being use this link.
  • As of today the following have no data: RHUL; Sheffield; ECDF; Bristol and Sussex.

Monday 8th December

  • Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)


Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.
Tickets

Monday 15th December 2014, 14.15 GMT

Last Christmas - you sent me a ticket,
And the very next day, you escalated it anyway.
This year, to save me from tears,
I'm going to put them On Hold (on hold).


It's the last ticket update from me for 2014, and as is my Christmas ticket tradition I won't go into too much detail as I suspect people will be winding down this week rather than rolling out changes. Still it would be good if sites take the time to tidy up their tickets before we all go and enjoy our Winter festivities, and any ones that will be left open could sites please make sure to update and put them On Hold if they're not going to be looked at for a few weeks.

36 Open UK tickets this winter's day.

Obligatory link to all the UK tickets:
http://tinyurl.com/p37ey64

Here's a few that really, really could do with a Solstice Update or at least On Holding:
110570 (lhcb cvmfs problems at Durham - looks like it can be closed).

110570 (cms AAA tests at the TIER 1)
109712 (cms glexec errors at the TIER 1)

110389 (reinstalling perfsonar at Sussex)

108356 (fedcloud and vmcatcher rollout at 100IT)

110384 (perfsonar reinstall at UCL)

110608 (Sheffield low availability ticket due to accidental 1.8.9 upgrade - always worth On Holding here as they take so long to clear).

110606 (a similar story for UCL).

110482 (Lancaster still suffering in SAM tests after upgrading to 1.8.9 too soon).

A bit of Christmas cheer - the related CMS tickets at Bristol and the Tier 1 look to have a solution in sight, thanks to the Condor Masters:
106324
106325

Let me know if I've missed out any tickets you would like brought up.

I'll leave it there - it would be nice if everyone could have a look at all their tickets this week, just in case. But more importantly, everyone have a good Festive Season!

Merry Christmas, and a Happy New Year!

Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade: some tests were removed from the CE, and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 

The release note is available here: https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015


Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Monday 8th December 2014

  • Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
  • For each VO, any certificate with a CA_DN field of /DC=ch/DC=cern/CN=CERN Trusted Certification Authority has it replaced with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
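
The CA_DN substitution described above could be sketched in Python roughly as follows. This is only an illustration: the record layout (a list of dictionaries with "DN" and "CA_DN" keys), the function name and the example DNs are hypothetical, and only the two CA strings are taken from the note.

```python
OLD_CA = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(certificates):
    """Return copies of the records with the old CA_DN swapped for the new one."""
    updated = []
    for cert in certificates:
        cert = dict(cert)  # copy so the caller's records are left untouched
        if cert.get("CA_DN") == OLD_CA:
            cert["CA_DN"] = NEW_CA
        updated.append(cert)
    return updated


# Hypothetical records for illustration only.
records = [
    {"DN": "/DC=ch/DC=cern/OU=computers/CN=example-a", "CA_DN": OLD_CA},
    {"DN": "/C=UK/O=eScience/CN=example-b", "CA_DN": "/C=UK/O=eScienceCA/CN=Example CA"},
]

for record in update_ca_dn(records):
    print(record["DN"], "->", record["CA_DN"])
```

Records signed by any other CA pass through unchanged, so the same routine can be applied to every VO in the list.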

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data
Site Updates

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
      • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
      • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
    • Dual stack nodes - 21%
      • YES: Brunel; IC; QMUL; Oxford (4)
      • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
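
The percentages quoted in the lists above appear to be the YES count divided by the 19 sites tracked, rounded to the nearest whole number. A small Python sketch of that arithmetic, with the counts copied from the lists and the helper name being a hypothetical choice:

```python
TOTAL_SITES = 19  # e.g. 12 YES + 7 NO in the multicore list

# YES counts as given in the bracketed totals above.
yes_counts = {
    "Multicore queues": 12,
    "Cloud/VMs": 5,
    "GridPP DIRAC jobs": 11,
    "IPv6 allocation": 8,
    "IPv6 dual stack": 4,
}


def percent_yes(yes, total=TOTAL_SITES):
    """Share of sites answering YES, rounded to a whole percent."""
    return round(100 * yes / total)


for name, yes in yes_counts.items():
    print(f"{name}: {yes}/{TOTAL_SITES} = {percent_yes(yes)}%")
```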


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 7th January 2015

  • Operations report
  • Some break in access to the Tier1 caused by the failure of the main Tier1 router on the 24th December (duration around 70 minutes) and again on the 25th - this latter being fixed at around 01:00 on 26/12 (duration around 3.5 hours).
  • Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room: Tue-Thu 13-15 January & Tue-Thu 20-22 January. There are some systems that need to be re-powered in preparation for this work.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A