Operations Bulletin 190115

From GridPP Wiki
Revision as of 22:20, 18 January 2015 by Jeremy Coles


Bulletin archive


Week commencing 11th January 2015
Task Areas
General updates

Thursday 8th January


Tuesday 30th December

  • Issues with Tier-1 switch 24th/25th December. GOCDB switchover.

Tuesday 23rd December

  • Accounting MC:
    • NGIs to check MC reporting (ncores=0 if parallel flag not set). CREAM requires EMI3 APEL (Others/ARC CEs use SSM2 or JURA).
    • Sites need to check all CEs. VOs check their sites.
    • Once parallel=true is set, sites will need to republish to backdate. Portal Multicore View
    • sites view
    • APEL client uses SSLv3 – new version of SSM released and in testing. Sites will soon be asked to upgrade SSM rpm.
  • EGI CSIRT
    • Alerts/Linux-2014-12-17: Critical vulnerability – kernel update required.
    • Two recent incidents: 7782 and 7765.
    • Cloud security: evaluating questionnaires. EGI-CSIRT collaborating with FedCloud. (F2F in Jan).
    • EGI-CSIRT/WLCG collaboration Pakiti and MW readiness WG.
    • Central suspension: NGI ARGUS services deployed. Info in GOCDB. Monitoring to check ban updates.
  • SAM OPS
    • Update-23: Probes to UMD. GridMon removed. New documents. Tested. Released 9th December.
    • New tests: FTS3-Service & StalledTransfers and GLUE2-validate (but failed on ARC-CEs). Various WN tests removed.
    • Issues: xrootd protocol on DPM 1.8.9 and NCGLogFiles. SSLv3 move after APEL testing.
  • OLA/SLA framework
    • 3 phases: 1. RC+RP+TP SLAs. 2. +EGI.eu+EGI.eu fed SLA. 3. +TP UA+E-Grant (User SLA+User OLA).
    • EGI User SLA just sets expectations between EGI and VO: scope; service hours; services and support; communication; security and responsibilities.
  • Operational Tools
    • Release 3.1.1: remove VO ID card approval. Ticket management. Bug fixes.
    • Release every 2 months. There is now an Advisory and Testing Board.
    • GOCDB – next release authentication change allows anon access to selected pages.
    • GGUS – 5 releases in PY5. Failover improved. SSLv3 disabled. SAM/ARGO: ARGO will replace SAM in 2015 (for start EGI-Engage): enhance customisation; better metric store. Will work standalone.
    • Application database: 9 major releases in PY5: Cloud marketplace; support for eduGAIN; OpenAIRE’s API.
    • E-Grant


Monday 15th December

  • Note Jens's list of sites with certificates at risk of expiring over the holiday period: RAL; Durham; IC; Sussex and UCL.
  • All sites are encouraged to complete the WLCG operations survey.
  • The top-BDII at Imperial was unavailable for a period on 12th December due to a water leak in the building. This impacted the SAM tests.
  • Summary notes from the December GDB are available.
  • A Condor Workshop (pre-GDB) took place 8th and 9th December. Again there are summary notes available.
  • An ARGUS Futures and Support workshop (post-GDB) was held on 11th December. A summary of the meeting is now online. The overall conclusion was that there is sufficient community effort to maintain ARGUS.
  • MJ has circulated a reminder of the HEP S/W workshop taking place 20th-21st January. It is not just for HEP (e.g. also astro and nuclear physics). There are now general and s/w mailing lists.
  • A reminder to register your Squid services in GOCDB (instructions linked below). (We had some interest in being UMD early adopters for Squid - thanks).
WLCG Operations Coordination - Agendas

Thursday 8th January

Thursday 18th December

  • There was a WLCG operations meeting today. Agenda: Minutes.
    • News: Sites requested to enable multicore accounting (details): for EMI3 CREAM, edit /etc/apel/parser.cfg and set the attribute parallel=true. Site managers requested to complete the WLCG survey (http://cern.ch/wlcg-survey). MW Readiness WG will participate in the ARGUS testing. No immediate issue with WLCG support for sites belonging to NGIs leaving EGI.
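
A quick way to check whether the parallel flag is set is sketched below. This is only an illustration: it assumes parser.cfg is a standard INI-style file, and the [batch] section name and the other keys in the sample text are assumptions, not taken from the APEL documentation. The helper name parallel_enabled is hypothetical.

```python
import configparser


def parallel_enabled(cfg_text: str) -> bool:
    """Return True if any section of the config text sets parallel to a true value.

    The section holding the flag is not assumed; every section is scanned
    for an option named 'parallel'.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    for section in parser.sections():
        if parser.has_option(section, "parallel"):
            return parser.getboolean(section, "parallel")
    # Flag absent: treat as not enabled.
    return False


# Hypothetical sample content; a real site would read /etc/apel/parser.cfg.
SAMPLE = """
[batch]
enabled = true
parallel = false
"""

if __name__ == "__main__":
    print("multicore accounting enabled:", parallel_enabled(SAMPLE))
```

On a real CE the text would come from reading /etc/apel/parser.cfg; a False result suggests the site is still reporting without core counts.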


Monday 15th December

  • The next WLCG ops coordination meeting is this Thursday: Agenda. Input notes.
  • Are there any UK issues that we want to raise?
  • There was a WLCG critical services meeting on 12th. Draft minutes have been linked from the agenda.

Monday 8th December

    • Baselines: FTS 3.2.30 baseline.
    • M/W issues: DPM 1.8.9 logging fixed.
    • T0&T1 services: Various upgrades to dCache 2.10.13
    • Oracle: Continued progress on upgrades & migration.
    • T0 news: voms2.cern.ch and lcg-voms2.cern.ch in use since 26th November. Still looking at AFS UI statistics (run till 2nd Feb) – still a lot of use e.g. from CMS VO boxes. Issues with users editing voms-admin emails (needs to match HR DB).
    • T1: NTR
    • T2: NTR
    • ALICE: High activity. Progressing ARC CE SAM tests.
    • ATLAS: NTR – jamboree
    • CMS: DIGI-RECO at T1s. Various MC T2s. VOMS migration – smooth, some sites had to update Phedex machines.
    • LHCb: Final checks for stripping 21. VOMS migration – users referring to many UI, afs, cvmfs places… some were not ready.
    • gLExec: testing campaign for PanDA.
    • Machine/Job features: Agreed protocol for virtualized environments. Need all implementations in repository.
    • MW readiness: dmlite 0.7.2 on EPEL stable. dCache 2.11.0 verified for ATLAS. New jira dashboard for following progress. WLCG package reporter rebranded as Pakiti v3 (in EPEL).
    • Multicore: CMS testing PromptReco multithreaded jobs; tests on CMS T2s awaiting testbed (pilot factory).
    • SHA-2: New VOMS – ATLAS 24th, others 26th. AFS UI and CVMFS UI config quickly fixed on 26th. Some PhEDEx’s and LHCb user private scripts needed fix. EGI & WLCG broadcast made on 5th. Some VO cards need updating.
    • IPv6: NTR
    • Squid & HTTP proxy: Monitoring page (auto-gen from GOCDB) now supports multiple squid services. All sites about to be asked to register.
    • Network & transfer metrics: Waiting on experiment use-cases and other inputs. Strawman for early 2015. perfSONAR deadline 8th January.



Monday 1st December

  • WLCG ops coordination team have launched a survey. Please could all GridPP sites respond to it by 19th December.


Tier-1 - Status Page

Tuesday 12th January

  • We have had some load issues on the SRMs during the last week for LHCb during their stripping campaign.
  • Safety testing of electrical circuits in the R89 machine room is underway and will take place during working hours on the following days:
    • Tuesday - Thursday 13-15 January
    • Tuesday - Thursday 20-22 January
    • During these days room power supplies will be switched off and circuits tested sequentially. As the racks are all powered from two or more separate power supplies, systems and services should stay up but are at risk.
  • Owing to staff absence the upgrade of the Castor Nameservers to SL6 did not take place last week. This is being re-scheduled for next Monday (19th Jan). The Castor service will be At Risk during this work.
  • The problems with our primary network router (Christmas Eve, Christmas Day) are being followed up with the vendor. Some testing of the unit to reproduce the fault is anticipated and will be announced in advance.
Storage & Data Management - Agendas/Minutes

Wedn 07 Jan

  • HNY. Most sites ticked over with relatively few glitches.
  • Next week's pre-GDB on preservation and protocols: harmonising for LHC VOs(?) but we also need to support non-LHC VOs
  • Apparently no major outstanding issues to sort out prior to run 2.

Wedn 17 Dec

  • Update from T1 CEPH team.
  • Last meeting of year. No updating and configuring or touching! Merry and Happy!

Wedn 10 Dec

  • An audience with NA62

Wedn 03 Dec

  • Should we support DIRAC data management?

Wedn 26 Nov

  • DIRAC: we probably need to understand DIRAC storage and data management better than "tried the tutorial and got the T shirt" - more next week - but then we need access to DIRAC resources!
  • Learning from non-LHC VOs: not just their data problems, but also success stories
  • WebDAV getting more widely supported - need to start testing more widely
  • Deletion rates revisited: old target no longer sufficient, needs revisiting



Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 13th January

  • Some glitches over the holiday period when the RAL network was lost, but publishing now looks fine across all sites.

Tuesday 16th December

  • Over the weekend there has been an error in the APEL repository during the preparation of data to send to the Accounting Portal. This is under investigation.

Monday 8th December

  • Sheffield given heads-up about APEL issue. Now fixed?

Tuesday 25th November

  • All sites approximately up-to-date.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 8th December

  • Chris Walker is handing over several 'other VO' documents. Some aspects of the role are being taken on by a combination of Duncan and Daniela... but the documents need reviewing in a core-ops meeting (next Thursday @ 11am being likely).

Tuesday 4th Nov

  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, PM to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 12th January

  • There was an EGI ops meeting on Monday 12th January. Minutes to follow.
    • STORM 1.11.5
    • SR: Another reminder to check your contacts
    • Call for sites to test FTS3 and SQUID (Glasgow is testing the new squid)
    • Multicore accounting: As per broadcast, sites have been asked to publish the number of cores used by jobs by updating the APEL config where appropriate: instructions given to test publishing in accounting-devel.
    • EGI forum in May: slightly delayed because they were waiting for approval to start fine-grained definition of programme. They will circulate the programme soon (end of the month).


Monitoring - Links MyWLCG

Monday 7th December

On-duty - Dashboard ROD rota

Monday 13th January

  • A couple of tickets for systems with low availability.
  • There are a couple of tests for QMUL that keep being flagged up although the service is in a "not production but monitored" state.
  • Tier-1 aware of long standing ticket.
Rollout Status WLCG Baseline

Tuesday 11th November

  • UMD v.3.9.0 was released Monday 10th November. It supports Scientific Linux 5 and 6 and also Debian 6 (Squeeze).
  • As proposed during the October OMB, production EGI resource centres will be notified later in the month with a summary broadcast together with other communications, to reduce the number of broadcasts sent to sites.


References


Security - Incident Procedure Policies Rota

Tuesday 13th January

  • EGI CSIRT alert 'High' Risk - CVE-2014-9295 - Remote code execution in NTP.
  • Any issues over holiday period?

Tuesday 16th December

  • Any update on the FTS3/GFAL bug?

Monday 8th December

  • Note ADVISORY [EGI-SVG-2014-7696]

Tuesday 4th November

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 16th December

  • For the PS dashboard for the time being use this link.
  • As of today the following have no data: RHUL; Sheffield; ECDF; Bristol and Sussex.

Monday 8th December

  • Today was the soft deadline for moving to perfSONAR 3.4. The following sites now appear in the dashboard as live: ... well the dashboard did not load! The hard deadline is 8th January.

Tuesday 2nd December

  • perfSONAR 3.4 available (63%)
    • YES: Imperial; QMUL; RHUL; Lancaster; Liverpool; Manchester; Durham; Glasgow; Bristol; Cambridge; Oxford; RALPP (12)
    • NO: RAL T1; Brunel; UCL; Sheffield; ECDF; Birmingham; Sussex (7)


Tuesday 25th November

  • Check on perfSONAR instances upgraded to 3.4...
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 9th and Tuesday 10th of February 2015 in Cambridge (UK), kindly hosted by Dante.
Tickets

Monday 12th January 2015, 16.30 GMT

Urk, today has really run away from me and I haven't managed a ticket update! I'm afraid we'll have to make do with a link to the 27 (at time of writing) Open UK tickets:

http://tinyurl.com/p37ey64

I'll do better next week!

Tools - MyEGI Nagios

Tuesday 25th November

Backup SAM Nagios at Lancaster was upgraded to update-23 as part of the staged rollout process. It is a major upgrade, as some tests were removed from the CE and probes have moved from the SAM repository to the UMD3 repository.

Tests added:

   ch.cern.FTS3-Service
   ch.cern.FTS3-StalledTransfers
   org.bdii.GLUE2-Validate 

Tests removed:

   org.nordugrid.ARC-CE-LFC-result
   org.nordugrid.ARC-CE-lfc
   org.nordugrid.ARC-CE-LFC-submit
   org.sam.WN-RepDel
   org.sam.WN-RepISenv
   org.sam.WN-RepFree
   org.sam.WN-RepCr
   org.sam.WN-RepGet
   org.sam.WN-RepRep
   org.sam.WN-Rep 

The release notes are available at https://wiki.egi.eu/wiki/SAMUpdate23


Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected but the outage will be considered as unknown.

Blog about VO Nagios

http://southgrid.blogspot.co.uk/2014/10/nagios-monitoring-for-non-lhc-vos.html


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 6th January 2015


Tuesday 16th December 2014

  • Discussion of setting up CVMFS for 'other VOs'.
  • LondonGrid VO now established in CVMFS - decision on top level needed together with who has write access.

Monday 8th December 2014

  • Some changes from the Ops Portal to these VOs: ALICE, ATLAS, CMS, GEANT4, LHCB, OPS, VO_SIXT.
  • For each VO, any certificate with a CA_DN field that was: /DC=ch/DC=cern/CN=CERN Trusted Certification Authority replace it with /DC=ch/DC=cern/CN=CERN Grid Certification Authority
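
The CA_DN substitution above can be sketched as follows. This is a rough illustration only: the entry layout (a list of dicts with "DN" and "CA_DN" keys), the example host DN and the function name update_ca_dn are hypothetical and are not the Ops Portal's or any VOMS tool's actual format.

```python
OLD_CA_DN = "/DC=ch/DC=cern/CN=CERN Trusted Certification Authority"
NEW_CA_DN = "/DC=ch/DC=cern/CN=CERN Grid Certification Authority"


def update_ca_dn(entries):
    """Return copies of the entries with the old CERN CA DN swapped for the new one.

    Entries whose CA_DN is anything else are copied unchanged.
    """
    updated = []
    for entry in entries:
        copy = dict(entry)
        if copy.get("CA_DN") == OLD_CA_DN:
            copy["CA_DN"] = NEW_CA_DN
        updated.append(copy)
    return updated


if __name__ == "__main__":
    # Hypothetical records for one VO.
    records = [
        {"DN": "/DC=ch/DC=cern/OU=computers/CN=example-host", "CA_DN": OLD_CA_DN},
        {"DN": "/C=XX/O=Example/CN=other-host", "CA_DN": "/C=XX/O=Example/CN=Example CA"},
    ]
    for record in update_ca_dn(records):
        print(record["DN"], "->", record["CA_DN"])
```

Matching on the full DN string, rather than a substring, avoids touching entries issued by unrelated CAs.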

Monday 24th November 2014

Tuesday 11th November 2014

  • Status of CERN@School data
Site Updates

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
      • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
      • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
    • Dual stack nodes - 21%
      • YES: Brunel; IC; QMUL; Oxford (4)
      • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
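
The percentages quoted above appear to be the YES count divided by the 19 sites listed (Tier-1 plus Tier-2s), rounded to the nearest whole number. A small sketch to reproduce them, with the dictionary keys chosen here purely as labels:

```python
TOTAL_SITES = 19  # YES + NO counts in each category above sum to 19

# YES counts taken from the lists above.
YES_COUNTS = {
    "multicore queues": 12,
    "cloud/VMs": 5,
    "GridPP DIRAC jobs": 11,
    "IPv6 allocation": 8,
    "IPv6 dual stack": 4,
}


def percent(yes: int, total: int = TOTAL_SITES) -> int:
    """Return the share of sites answering YES, rounded to a whole percent."""
    return round(100 * yes / total)


if __name__ == "__main__":
    for label, yes in YES_COUNTS.items():
        print(f"{label}: {yes}/{TOTAL_SITES} = {percent(yes)}%")
```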


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RAL T1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 14th January 2015

  • Operations report
  • Only ALICE left using CREAM CEs now and they should be finished with them in February. Will decommission then.
  • Electrical power circuit testing underway. Warning (At Risk) declared for Tues/Wed/Thurs this week and same next week.
  • SL6 updates on Castor Nameservers announced for next Monday (19th Jan). (Castor services At Risk during this upgrade).
  • Planning some testing of the Primary Tier1 router to investigate why it has given problems.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A