Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 15th December 2014
Task Areas
General updates

Monday 15th December

  • Note Jens's list of sites with certificates at risk of expiring over the holiday period: RAL; Durham; IC; Sussex and UCL.
  • All sites are encouraged to complete the WLCG operations survey.
  • The top-BDII at Imperial was unavailable for a period on 12th December due to a water leak in the building. This impacted the SAM tests.
  • Summary notes from the December GDB are available.
  • A Condor Workshop (pre-GDB) took place 8th and 9th December. Again there are summary notes available.
  • An ARGUS Futures and Support workshop (post-GDB) was held on 11th December. A summary of the meeting is now online. The overall conclusion was that there is sufficient community effort to maintain ARGUS.
  • MJ has circulated a reminder of the HEP S/W workshop taking place 20th-21st January. It is not just for HEP (e.g. also astro and nuclear physics). There are now general and s/w mailing lists.
  • A reminder to register your Squid services in GOCDB (instructions linked below). (We had some interest in being UMD early adopters for Squid - thanks).

Monday 8th December

  • WLCG Operations Coordination Team requests that all sites now register their Squid services in GOCDB or OIM (broadcast on 4th December). Instructions are available.
  • This week 8th-9th December there is an HTCondor workshop taking place - Vidyo is available and most talks are being uploaded. We may have a summary next week.
  • There is a GDB this week. Let Jeremy know if you want issues raised or discussed.
  • The WLCG survey is now in its second week. All GridPP sites are expected to respond as the results will help not just WLCG but GridPP too! As of last week only a small number had responded.
  • Steve has noticed an issue with the ops portal VOMS information being slow to change following the change of CERN VOMS endpoints last week.
  • SAM3 results for November have been circulated. Re-computations are to be requested before 14th December.
  • There is an EGI OMB meeting on 18th December. Are there any UK matters we want raised? Also, we have been asked to share with other NGIs activities taking place in the UK that may be of benefit and interest to the other NGIs - does anyone have suggestions to put forward?
  • Are there any outstanding issues with the UK CA certificate reminders/renewals? Is there any uptake of John Kewley's Nagios scripts for checking the cert expiry status?
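
For sites without a certificate expiry check in place, the idea can be sketched in a few lines of Python. This is not John Kewley's script, only a minimal illustration. It assumes the expiry date has already been obtained as a line of the form `notAfter=Dec 25 00:00:00 2014 GMT` (the format printed by `openssl x509 -enddate -noout`), and the 30-day and 7-day thresholds and the function names are hypothetical choices. The return values follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import ssl
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def days_remaining(enddate_line, now=None):
    """Return days until expiry, given a 'notAfter=<date> GMT' line.

    `now` is seconds since the epoch; it defaults to the current time
    and is only a parameter so the function can be checked offline.
    """
    not_after = enddate_line.strip().split("=", 1)[1]
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expiry - now) / 86400.0


def nagios_status(days_left, warn_days=30, crit_days=7):
    """Map days remaining to a Nagios status (thresholds are assumptions)."""
    if days_left < crit_days:
        return CRITICAL
    if days_left < warn_days:
        return WARNING
    return OK
```

A wrapper would call `sys.exit(nagios_status(days_remaining(line)))` after printing a one-line status message, so that Nagios picks up both the text and the exit code.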


WLCG Operations Coordination - Agendas

Monday 15th December

  • The next WLCG ops coordination meeting is this Thursday: Agenda. Input notes.
  • Are there any UK issues that we want to raise?
  • There was a WLCG critical services meeting on 12th. Draft minutes have been linked from the agenda.

Monday 8th December

    • Baselines: FTS 3.2.30 baseline.
    • M/W issues: DPM 1.8.9 logging fixed.
    • T0&T1 services: Various upgrades to dCache 2.10.13
    • Oracle: Continued progress on upgrades & migration.
    • T0 news: voms2.cern.ch and lcg-voms2.cern.ch in use since 26th November. Still looking at AFS UI statistics (run till 2nd Feb) – still a lot of use, e.g. from CMS VO boxes. Issues with users editing voms-admin emails (needs to match HR DB).
    • T1: NTR
    • T2: NTR
    • ALICE: High activity. Progressing ARC CE SAM tests.
    • ATLAS: NTR – jamboree
    • CMS: DIGI-RECO at T1s. Various MC T2s. VOMS migration – smooth, some sites had to update Phedex machines.
    • LHCb: Final checks for stripping 21. VOMS migration – users referring to many UI, afs, cvmfs places… some were not ready.
    • gLExec: testing campaign for PanDA.
    • Machine/Job features: Agreed protocol for virtualized environments. Need all implementations in repository.
    • MW readiness: dmlite 0.7.2 on EPEL stable. dCache 2.11.0 verified for ATLAS. New jira dashboard for following progress. WLCG package reporter rebranded as Pakiti v3 (in EPEL).
    • Multicore: CMS testing PromptReco multithreaded jobs; tests on CMS T2s awaiting testbed (pilot factory).
    • SHA-2: New VOMS – ATLAS 24th, others 26th. AFS UI and CVMFS UI config quickly fixed on 26th. Some PhEDEx’s and LHCb user private scripts needed fix. EGI & WLCG broadcast made on 5th. Some VO cards need updating.
    • IPv6: NTR
    • Squid & HTTP proxy: Monitoring page (auto-gen from GOCDB) now supports multiple squid services. All sites about to be asked to register.
    • Network & transfer metrics: Waiting on experiment use-cases and other inputs. Strawman for early 2015. perfSONAR deadline 8th January.



Monday 1st December

  • WLCG ops coordination team have launched a survey. Please could all GridPP sites respond to it by 19th December.


Tier-1 - Status Page

Tuesday 16th December

  • There were problems with Tier1 services on Saturday (13th Dec), mainly triggered by a problematic DNS server.
  • There was a transparent update of our Tier1 router firmware at 08:30 this morning.
  • There are a number of 'At Risk' periods coming up in the new year:
    • At 08:30 on Tuesday 6th Jan, a patch to fix the RIP problem will be applied to the firmware in the routers.
    • 10:00 - 11:00 on Wednesday 7th Jan. Quarterly UPS/Generator load test. This is a regular test, but services should be regarded as 'at risk'.
    • Safety testing of electrical circuits in the R89 machine room will take place during working hours on the following days:
      • Tuesday - Thursday 13-15 January
      • Tuesday - Thursday 20-22 January
    • During these days room power supplies will be switched off and circuits tested sequentially. As the racks are all powered from two or more separate power supplies, systems and services should stay up.
Storage & Data Management - Agendas/Minutes

Wedn 10 Dec

  • An audience with NA62

Wedn 03 Dec

  • Should we support DIRAC data management?

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting

Wedn 19 Nov

  • Logs: chatty DPM 1.8.9, and elasticsearching logs.
  • Reports and other interesting things from workshops: cloud data transfer and sync, HEPiX, hepsysman.
  • RAID controllers for 36 bay nodes

Wedn 12 Nov

  • Update on CEPH with xroot. It works...



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 16th December

  • Over the weekend there has been an error in the APEL repository during the preparation of data to send to the Accounting Portal. This is under investigation.

Monday 8th December

  • Sheffield given heads-up about APEL issue. Now fixed?

Tuesday 25th November

  • All sites approximately up-to-date.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.


Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Monday 15th December

  • Top BDII at Imperial (as well as other services) became unavailable following a water leak at IC leading to many alarms.
  • Three sites have low availability.

Tuesday 11th November

  • Some minor issues with ROD Dashboard - quickly fixed.
  • Two unavailability tickets still open - issues dealt with.

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.


Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?

Monday 8th December

  • Note ADVISORY [EGI-SVG-2014-7696]

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 16th December

  • For the PS dashboard for the time being use this link.
  • As of today the following have no data: RHUL; Sheffield; ECDF; Bristol and Sussex.

Monday 8th December

  • Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)


Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.
Tickets

Monday 15th December 2014, 14.15 GMT

Last Christmas - you sent me a ticket,
And the very next day, you escalated it anyway.
This year, to save me from tears,
I'm going to put them On Hold (on hold).


It's the last ticket update from me for 2014, and as is my Christmas ticket tradition I won't go into too much detail, as I suspect people will be winding down this week rather than rolling out changes. Still, it would be good if sites took the time to tidy up their tickets before we all go and enjoy our Winter festivities. For any tickets that will be left open, could sites please make sure to update them and put them On Hold if they're not going to be looked at for a few weeks.

36 Open UK tickets this winter's day.

Obligatory link to all the UK tickets:
http://tinyurl.com/p37ey64

Here's a few that really, really could do with a Solstice Update or at least On Holding:
110570 (lhcb cvmfs problems at Durham - looks like it can be closed).

110570 (cms AAA tests at the TIER 1)
109712 (cms glexec errors at the TIER 1)

110389 (reinstalling perfsonar at Sussex)

108356 (fedcloud and vmcatcher rollout at 100IT)

110384 (perfsonar reinstall at UCL)

110608 (Sheffield low availability ticket due to accidental 1.8.9 upgrade- always worth On Holding here as they take so long to clear).

110606 (a similar story for UCL).

110482 (Lancaster still suffering in SAM tests after upgrading to 1.8.9 too soon).

A bit of Christmas cheer - the related CMS tickets at Bristol and the Tier 1 look to have a solution in sight, thanks to the Condor Masters:
106324
106325

Let me know if I've missed out any tickets you would like brought up.

I'll leave it there - it would be nice if everyone could have a look at all their tickets this week, just in case. But more importantly, everyone have a good Festive Season!

Merry Christmas, and a Happy New Year!

Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade: some tests were removed from CE, and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 

The release notes are available here: https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Monday 8th December 2014

  • Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
  • For each VO, any certificate CA_DN field that was /DC=ch/DC=cern/CN=CERN Trusted Certification Authority is replaced with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
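
Sites that keep their own copy of the VO certificate details could apply the same CA_DN substitution along these lines. This is a rough sketch only: the two DN strings are taken from the note above, but the record layout (a list of dicts with a `CA_DN` key) and the function name are assumptions made for illustration, not the Ops Portal format.

```python
OLD_CA_DN = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA_DN = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(records):
    """Return copies of the records with the old CERN CA DN swapped out.

    `records` is a hypothetical list of dicts, one per VOMS server
    certificate; entries with any other CA_DN are left unchanged.
    """
    updated = []
    for record in records:
        new_record = dict(record)  # copy, so the input is not modified
        if new_record.get("CA_DN") == OLD_CA_DN:
            new_record["CA_DN"] = NEW_CA_DN
        updated.append(new_record)
    return updated
```

Matching on the full DN string, rather than on a substring such as "Trusted", avoids touching certificates issued by any other authority.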

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data
Site Updates

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
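
The percentages quoted above are the YES count as a share of all 19 sites, rounded to the nearest whole number. A small sketch for recomputing them from the YES/NO lists is given below; the helper names are hypothetical, and the parser simply assumes site names are separated by semicolons or commas as in the lists above.

```python
import re


def parse_sites(text):
    """Split a 'Site A; Site B, Site C' string into a list of site names."""
    return [name.strip() for name in re.split(r"[;,]", text) if name.strip()]


def percent_yes(yes_sites, no_sites):
    """Percentage of sites in the YES list, rounded to a whole number."""
    total = len(yes_sites) + len(no_sites)
    return round(100 * len(yes_sites) / total)


# Dual stack lists as given in the 2nd December update.
dual_yes = parse_sites("Brunel; IC; QMUL; Oxford")
dual_no = parse_sites(
    "RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; "
    "Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex, RAL T1"
)
dual_stack_percent = percent_yes(dual_yes, dual_no)
```

Keeping the lists as the source of truth and deriving both the counts and the percentage from them avoids the three drifting out of step when a site changes status.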


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 17th December 2014

  • Operations report
  • Christmas Plans
  • There were service problems on Saturday 13th December eventually traced to a problematic DNS server.
  • Some At Risk periods in the New Year:
    • The rollout of the RIP protocol to the Tier1 routers still has to be completed. A software patch from the vendors, which should fix this problem, will be applied to the Tier1 Routers on Tuesday 6th January.
    • The next quarterly UPS/Generator load test will take place on Wednesday 7th January.
    • Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room: Tue-Thu 13-15 January & Tue-Thu 20-22 January. There are some systems that need to be re-powered in preparation for this work.
  • In addition upgrading the Castor Headnodes to SL6 needs to be completed:
    • Tuesday 6th Jan - GEN
    • Wednesday 7th Jan - Nameserver (transparent - at risk)
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A