Operations Bulletin 250515

From GridPP Wiki
Revision as of 18:32, 23 May 2015 by Jeremy Coles


Bulletin archive


Week commencing 18th May 2015
Task Areas
General updates

Tuesday 19th May

  • There was a GDB last week. The summary is available.
  • The summary of the pre-GDB about batch systems is available.
  • GridPP contacts for other VOs established (these are a current priority). Contacts expected to provide weekly updates on progress and status.
    • DIRAC: Jens Jensen (-> Brian Davies?) – vo being created
    • LIGO: Catalin Condurache – vo being created
    • LOFAR: George Ryan
    • LSST: Alessandra Forti
    • LZ: David Colling
    • UKQCD: Jeremy Coles
  • Glexec: Matt is redirecting efforts from coming up with a relocatable glexec tarball, to a recipe that sites could follow. He comments that this would be a lot more involved than he would like for a tarball install, but thinks that it's the only way to proceed with any confidence.
  • gstat is not supported. Note this ticket.
  • The network issues resolution process/procedure...
  • Assessment of the impact on User Communities/NGIs of the EGI core activities 2015 (results uploaded to the meeting page)
  • The EGI conference is taking place this week - link to the detailed agenda.
  • A reminder that the HEPSYSMAN & security training meeting is taking place 1st-3rd June.
  • STFC (through Catalin Condurache) are interested in investigating joining EGI Fed Cloud



Tuesday 12th May

Monday 11th May

  • There was an EGI Operations Management Board (OMB) meeting on 30th April.
  • Operations updates:
    • 12 service types will be removed from GOC DB due to not being used. They are defined in GGUS 113432
    • A list Tools-admins at mailman.egi.eu has been created for ops tools administrator discussion.
    • EGI OLA period 1 May 2015 - 30 April 2016
    • Security coordination moves to CERN after SNIC.
    • Only NGI-Argus servers should accept Nagios probes
    • What HPC facilities are available in NGIs for federating?
    • Suggestion for common RC suspension process.
    • EGI conference in Lisbon 18-22 May.
  • FedCloud
    • No stable monitoring tests. Proposal to create a new CLOUD-MON_CRITICAL (inc. eu.egi.cloud.APEL-Pub; eu.egi.cloud.OCCI-VM ...).
    • New sites IN2P3-IRES (FR) and NCG-INGRID-PT (PT). 2 others in process.
    • EGI to provide capacity to instantiate virtual machines to run the computational tasks (on earth observation datasets) generated by the users of the ESA funded Terradue for the development of the e-Collaboration for Earth Observation (e-CEO) platform.
    • Auger moving to production on FedCloud.
  • EGI CSIRT
    • Concern about effort going into perfSONAR issues (cacti; web interface; shellshocked...)
    • CRITICAL CVE handling. Want EGI CSIRT hook into site re-certification by NGIs.
    • Have no way to probe specific WNs. Proposed pakiti client run manually. (More UK feedback given).
    • EGI-CSIRT got reviewed by TI and certified according to maturity parameters. Looking to run review on sites/NGIs.
  • UMD support for SL5/SL6
    • Torque 4.2 is not backward compatible with 2.5.7. Update not recommended. Move to Torque 2.5.13 (patched by SVG) using the AppDB repository with highest priority.
    • SL5 support aligned with RHEL5. In "Maintenance" until March 31, 2017 ... but >80% sites not using it anyway and some sites on SL7 + struggling with MW deployment.
    • Supporting CentOS7 in UMD requires scheduling the end of SL5 support in UMD.
    • EPEL7/CentOS7: 13 products are ready for EPEL7.
    • No move from SL5 campaign foreseen.
    • 60% of cloud sites base their cloud infrastructure on a non-RHEL-compatible distribution. Most of these are Ubuntu.
    • Proposal: UMD4: September 2015. Decommissioning of SL5: March 2016.
  • ARGO Central Monitoring
    • Deploy test central instance in May. Review results in June.
    • High availability instances deployment in July (Croatia and Greece). Monitor during August.
    • Switch A/R engine in September.
    • Decommission NGI instances October 2015 (they can still be run for local alarms).
  • EGI Strategy Summary
    • See document. Basically: Expand cloud. Push 'commons' and open platforms.
    • "Consider open science as a production and dissemination system that needs integrated, easy and fair access to several types of shared resources (physical, digital, intellectual), engaged communities that contribute to the process and collaborates in the management and stewardship of the resources, a suitable governance with rules to allow/exclude access, to resolve conflicts, and finally financial support for the long-term availability".


Tuesday 5th May

  • It is a CMS week this week.
  • A pre-GDB on batch systems is taking place next Tuesday 12th May. More T2 participation is sought. Still need to define T2 GDB rep.
  • CHEP'15 proceedings submissions due by May 17th.
  • April A/R figures circulated. No real issues this month except getting UCL (VAC/Cloud only) site correctly monitored.
WLCG Operations Coordination - Agendas

Thursday 7th May

  • The agenda. Minutes
  • News: Alessandra will present the WLCG workshop conclusions at next week's GDB.
  • Middleware news: UMD 3.12.0 released this week (fixes for ARGUS-PAP and dCache server)
  • Middleware baselines: dCache 2.6.x removed. New version 2.10.28/ 2.12.8 of dCache. Sites should avoid simultaneous updates.
  • Middleware issues: a major upgrade of torque arrived in EPEL (from torque-2.5.7 to torque-4.2.10), which is not compatible with the standard EMI torque installation. For sites that have already upgraded, the patched 2.5.13 version of torque has been pushed to the EMI third-party repo so that they can downgrade.
  • T0 & T1 upgrades: FTS 3.2.33 upgraded at CERN & RAL.
  • T0 news: batch HTCondor pilot is open for grid submission. Lower-than-usual WLCG availability figures in March for Atlas and CMS - possible overload.
  • T1 feedback: NTR
  • T2 feedback: NTR
  • OS support in UMD: Plans in EGI for CentOS7 support. 13 products are ready for EPEL7, but in general CentOS7 is not a viable option for sites. The release of UMD4 (supporting EPEL7 and Ubuntu) is foreseen for September 2015 and the decommissioning of SL5 for March 2016. It is likely that some products relevant for WLCG will not be ready for EPEL7 before 2016. The requirement for WLCG is to provide SL6 until the end of Run2, however, there are already offers for resources on CentOS7 and this is an incentive for experiments to validate their software on it.
  • ALICE: CASTOR at CERN - some re-reco job instabilities.
  • ATLAS: ~running full. Considering increasing job lengths for all MCORE jobs. Need sites to provide MCORE resources. Rucio/FTS issue was discovered - fix via update. Tier-0 data and computing workflow fully commissioned.
  • CMS: CMS production activities continue. Several sites reported network saturation. Evaluating the use of selected "strong" Tier-2 sites to add computing capacity for DIGI-RECO. Plan to drop support of CRC32 checksum in CMS data transfer systems.
  • LHCb: Various operational issues reported - CASTOR/CERN SRM access problems; other data access issues.
  • gLExec: ATLAS 61 out of 94 sites. RAL, RALPP and TW-FTT issue was due to a bug in the pilot code that showed up with ARC CE + Condor sites.
  • SHA-2: old VOMS server aliases (lcg-)voms.cern.ch were removed on Tue Apr 28.
  • RFC proxies: RFC proxy readiness to be followed up per experiment. SAM-Nagios proxy renewal code fix to support RFC proxies.
  • Machine/Job features: NTR
  • MW readiness: 10th meeting held on 6th May (agenda). WG is making a check-point of goals and priorities. ARGUS testbed at CERN is set up and ready to start. Pakiti client requested at other test sites.
  • MC deployment: NTR
  • IPv6: LHCb: DIRAC was made IPv6-compatible back in November, but testing only started in April. Issue found at CERN with a Python library (wrong IPv6 address returned).
  • Network/Transfers WG: NTR
  • HTTP deployment: perfSONAR - Security: NDT 3.7.0.1 was released. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS. Network performance incidents process put in place, as agreed at the last meeting. OSG/Datastore validation progressing well. Publishing results to the message bus is progressing; development of the esmond2mq prototype has been finalised. Recent meeting focussed on FTS performance. Next meeting 3rd June. Plan is to focus it on latency ramp-up and proximity service.
Tier-1 - Status Page

Monday 18th May

  • A reminder that there is a weekly Tier-1 experiment liaison meeting.
  • The agenda follows this format:
    • 1. Summary of Operational Status and Issues
    • 2. Highlights/summary of the Tier1 Monday operations meeting (Grid Services; Fabric; CASTOR and Other)
    • 3. Experiment plans and operational issues (CMS; ATLAS; LHCb; ALICE and Others)
    • 4. Special presentations
    • 5. Actions
    • 6. Highlights for Operations Bulletin Latest
    • 7. AoB


Tuesday 12th May

  • Remaining CREAM CEs were turned off last week.
  • The problems with our primary network router are still being followed up - likely to be an intervention one morning next week (to be planned).
  • We are planning an update to the version of the Oracle database behind Castor. Dates to be finalised.
Storage & Data Management - Agendas/Minutes

Tuesday 18th May

Tuesday 21st April

  • Has there been any Tier-1 contact with DiRAC?
  • Proposal to setup an 'other VOs' users list. GridPP-Users is too tied with WLCG projects.

Wednesday 15 April

  • Backing up data from DiRAC to GridPP (tape)
  • More case studies on supporting non-LHC VOs on GridPP: we have a lot of great stuff that can do great stuff - non-LHC VOs tend to have less regimented data models so maybe we need more case studies.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 12th May

  • Issues noted with sync for Brunel, Liv, ECDF (see EGI ticket 113473). Message broker issues (memory related) are likely the underlying EGI problem.
  • Need to check on VAC sync publishing.

Tuesday 21st April

  • (Slight) Accounting delays seen for: UCL; Sheffield; QMUL & RALPP.

Tuesday 14th April

  • APEL delays for UCL; Sheffield; RALPP and Bristol
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 21st April

  • The Approved VOs document has been updated to take account of changes to the Ops Portal VOID cards. For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503. Sites that support SNOPLUS.SNOLAB.CA should ensure that their configuration conforms to these settings: Approved VOs
  • KeyDocs still need updating since agreements reached at last core ops meeting.
  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM will move it to the Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Tuesday 21st April

  • There was an EGI ops meeting on Monday 20th.
  • David updated the UK SL5 response.
  • Please review the agenda/minutes.

Monday 9th March

  • The agenda for February's EGI ops meeting is here. Minutes are here
    • APEL 1.4.0
      • Added Month and Year columns to primary key of CloudSummaries table in cloud schema.
    • DPM-Xrootd 3.5.2 is in EPEL stable - this is the first version of the component compatible with xrootd4
    • gLExec-wn - v. 1.2.3: lcmaps-plugins-c-pep 1.3.0-1 & mkgltempdir 0.0.5-1
      • "The lcmaps-plugins-c-pep-1.3.0-1 preferably needs the argus-pep-api-c-2.3.0. This version will be released into EMI & UMD repositories in a near future."
    • UMD 3.11.0 released on 16.02.2015, UMD 3.11.1 released on 4.03.2015
    • lcg-CA 1.62 noted with an intention to broadcast these as they occur as opposed to monthly.
    • EGI looking at the decommissioning of SL5, possibly by end of 2015, as a byproduct of adding CentOS 7 to UMD. NGIs to make a note if extended SL5 support is required.
    • Vincenzo Spinoso has joined EGI Ops team from NGI_IT. Vincenzo will chair EGI Ops.
    • Next meeting is April 20th.


Monitoring - Links MyWLCG

Tuesday 31st March

Monday 7th December

On-duty - Dashboard ROD rota

Monday 11th May

  • Rota responses awaited from Andrew and Daniela.
  • Handover summary should be uploaded to the bulletin please.

Tuesday 28th April

  • Glasgow: A GLUE2 problem is transient and doesn't have a short-term solution (if the service status was checked a little more frequently it would help). Currently on hold. IC sometimes see this too.
  • UCL: No change to the on-going situation. UCL has hopped from one downtime to another this week. Note – AM visiting UCL this week to setup VAC. Services will be decommissioned after this step.


Tuesday 21st April

  • UCL have put themselves into a downtime until the 21st April (start of next week). Noted this in their outstanding tickets.
  • Birmingham's availability has steadily recovered over the week - and the low availability ticket against them should be closable next week.


Rollout Status WLCG Baseline

Tuesday 12th May

  • MW Readiness WG meeting Wed May 6th at 4pm. Attended by Raul, Matt, Sam and Jeremy.

Tuesday 17th March

  • Daniela has updated the EMI-3 testing table (https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3). Please check it is correct for your site. We want a clear view of where we are contributing.
  • There is a middleware readiness meeting this Wednesday. Would be good if a few site representatives joined.
  • Machine job features solution testing. Fed back that we will only commence tests if more documentation is made available. This stops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

References


Security - Incident Procedure Policies Rota

Tuesday 18th May

  • EGI SVG and CSIRT Advisory "Critical/Low?": "VENOM: QEMU vulnerability (CVE-2015-3456)"
  • Issue with VM appliance - image ships with ...
  • EGI SVG Advisory 'High' Risk - Dirac SQL injection vulnerability [EGI-SVG-2014-7553]
  • IGTF is about to release an update to the trust anchor repository (1.64)

Tuesday 12th May


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 12th May

  • LHCOPN & LHCONE joint meeting at LBL June 1st & 2nd. Agenda taking shape.

Tuesday 31st March

Tuesday 10th March

  • From the recent WLCG meeting, two slides (1 & 2) give the direction of the network monitoring and metrics progress: integration of perfSONAR event types into experiment monitoring and an architecture for data to get from RSV probes to client. Components described on slide 3.
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 1st and Tuesday 2nd of June 2015 in Berkeley (US) (hosted by LBL and ESnet).
Tickets

Friday 22nd May 2015
Matt's on leave until the 8th of June. But he's replaceable with handy links:

Other VO Nagios

UK NGI GGUS tickets

Tools - MyEGI Nagios

Tuesday 17th February

  • Another period where message brokers were temporarily unavailable seen yesterday. Any news on the last follow-up?

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
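
The failover behaviour described above can be illustrated with a minimal Python sketch: try each broker endpoint in order and use the first that accepts a TCP connection. The broker URLs are those named above; the function names, the timeout and the injectable `connect` parameter are assumptions made for illustration, and this is not how the monitoring framework itself selects a broker.

```python
import socket
from urllib.parse import urlparse

# Broker endpoints as quoted in the bulletin entry above
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]

def tcp_connect(host, port, timeout=5.0):
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout):
        pass

def first_reachable(brokers, connect=tcp_connect):
    """Return the first broker URL whose host:port accepts a connection, else None."""
    for url in brokers:
        parsed = urlparse(url)
        try:
            connect(parsed.hostname, parsed.port)
            return url
        except OSError:
            continue  # broker unreachable, move on to the next one
    return None
```

Calling `first_reachable(BROKERS)` on every publish attempt, rather than relying on a cached endpoint, is one way to avoid the delayed failover suspected above.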


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 19th May

  • There is a current priority for enabling/supporting our joining communities.

Tuesday 5th May

  • We have a number of VOs to be removed. Dedicated follow-up meeting proposed.

Tuesday 28th April

  • For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503.
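
Sites can sanity-check a client-side vomses entry against the new port. The sketch below assumes the usual five-field vomses line format (alias, host, port, server DN, VO name, each in double quotes). The DN in the sample is a placeholder rather than the real server DN, and the function name is illustrative.

```python
import shlex

EXPECTED_PORT = 15503
HOSTS = {"voms02.gridpp.ac.uk", "voms03.gridpp.ac.uk"}

def stale_entries(vomses_text, vo="snoplus.snolab.ca"):
    """Return (host, port) pairs for the VO on the listed hosts that do not use EXPECTED_PORT."""
    stale = []
    for line in vomses_text.splitlines():
        fields = shlex.split(line)
        if len(fields) < 5:
            continue  # blank or malformed line
        alias, host, port, dn, vo_name = fields[:5]
        if vo_name.lower() == vo.lower() and host in HOSTS and int(port) != EXPECTED_PORT:
            stale.append((host, int(port)))
    return stale

# Sample input: the first entry still carries the old port, the second is up to date.
sample = '''
"snoplus.snolab.ca" "voms02.gridpp.ac.uk" "15003" "/PLACEHOLDER/DN" "snoplus.snolab.ca"
"snoplus.snolab.ca" "voms03.gridpp.ac.uk" "15503" "/PLACEHOLDER/DN" "snoplus.snolab.ca"
'''
print(stale_entries(sample))
```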

Tuesday 31st March

  • LIGO are in need of additional support for debugging some tests.
  • LSST now enabled on 3 sites. No 'own' CVMFS yet.
Site Updates

Tuesday 24th February

  • Next review of status today.

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
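
The percentages quoted above appear to be the YES count divided by the 19 sites listed, rounded to the nearest whole number. A minimal Python sketch of that arithmetic, using the counts from the lists above (the function name is illustrative):

```python
def yes_percentage(yes_count, no_count):
    """Return the share of YES sites as a whole-number percentage."""
    total = yes_count + no_count
    return round(100 * yes_count / total)

# (YES, NO) counts taken from the lists above
status = {
    "Multicore queues": (12, 7),
    "Cloud/VMs": (5, 14),
    "GridPP DIRAC jobs": (11, 8),
    "IPv6 allocation": (8, 11),
    "Dual stack nodes": (4, 15),
}

for name, (yes, no) in status.items():
    print(f"{name}: {yes_percentage(yes, no)}%")
```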


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 13th May 2015 Operations report

  • We are investigating some significant Castor performance issues for CMS.
  • Revised (still draft) dates for the upgrade of the Oracle database behind Castor (to version 11.2.0.4) were presented.
  • The decommissioning of the CREAM CEs was effectively done a week ago. They are now marked as being not in production in the GOC DB.
  • We are still working to fix the problematic Tier1 router.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Atlas S&C week 2-6 Feb 2015

Production

• Prodsys-2 in production since Dec 1st

• Deployment has not been transparent; many issues have been solved and the grid is filled again.

• MC15 is expected to start soon, waiting for physics validations; evgen testing is underway and close to finalised. Simulation expected to be broadly similar to MC14, no blockers expected.

Rucio

• Rucio in production since Dec 1st and is ready for LHC RUN-2. Some fields need improvements, including transfer and deletion agents, documentation and monitoring.

Rucio dumps available.

Dark data cleaning

Lost files declaration. Only DDM ops can issue a lost files declaration for now; cloud support needs to file a ticket.

• Webdav panda functional tests with Hammercloud are ongoing

Monitoring

Main page

DDM Accounting

space

Deletion

ASAP

• ASAP (ATLAS Site Availability Performance) in place. Every 3 months the T2 sites performing BELOW 80% are reported to the International Computing Board.


UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A