Operations Bulletin 180515

From GridPP Wiki

Bulletin archive


Week commencing 11th May 2015
Task Areas
General updates

Tuesday 12th May

Monday 11th May

  • There was an EGI Operations Management Board (OMB) meeting on 30th April.
  • Operations updates:
    • 12 service types will be removed from GOC DB due to not being used. They are defined in GGUS 113432
    • A mailing list, Tools-admins at mailman.egi.eu, has been created for discussion among ops tools administrators.
    • EGI OLA period 1 May 2015 - 30 April 2016
    • Security coordination moves to CERN after SNIC.
    • Only NGI-Argus servers should accept Nagios probes
    • What HPC facilities are available in NGIs for federating?
    • Suggestion for common RC suspension process.
    • EGI conference in Lisbon 18-22 May.
  • FedCloud
    • No stable monitoring tests. Proposal to create a new CLOUD-MON_CRITICAL (inc. eu.egi.cloud.APEL-Pub; eu.egi.cloud.OCCI-VM ...).
    • New sites IN2P3-IRES (FR) and NCG-INGRID-PT (PT). 2 others in process.
    • EGI to provide capacity to instantiate virtual machines to run the computational tasks (on earth observation datasets) generated by the users of the ESA funded Terradue for the development of the e-Collaboration for Earth Observation (e-CEO) platform.
    • Auger moving to production on FedCloud.
  • EGI CSIRT
    • Concern about effort going into perfSONAR issues (cacti; web interface; shellshocked...)
    • CRITICAL CVE handling. Want EGI CSIRT hook into site re-certification by NGIs.
    • Have no way to probe specific WNs. Proposed pakiti client run manually. (More UK feedback given).
    • EGI-CSIRT got reviewed by TI and certified according to maturity parameters. Looking to run review on sites/NGIs.
  • UMD support for SL5/SL6
    • Torque 4.2 is not backward compatible with 2.5.7. Update not recommended. Move to Torque 2.5.13 (patched by SVG) using the AppDB repository with highest priority.
    • SL5 support aligned with RHEL5. In "Maintenance" until March 31, 2017 ... but >80% sites not using it anyway and some sites on SL7 + struggling with MW deployment.
    • Supporting CentOS7 in UMD requires scheduling the end of SL5 support in UMD.
    • EPEL7/CentOS7: 13 products are ready for EPEL7.
    • No move from SL5 campaign foreseen.
    • 60% of cloud sites base their cloud infrastructure on RHEL-compat distribution. Most of these are Ubuntu.
    • Proposal: UMD4: September 2015. Decommissioning of SL5: March 2016.
  • ARGO Central Monitoring
    • Deploy test central instance in May. Review results in June.
    • High availability instances deployment in July (Croatia and Greece). Monitor during August.
    • Switch A/R engine in September.
    • Decommission NGI instances October 2015 (they can still be run for local alarms).
  • EGI Strategy Summary
    • See document. Basically: Expand cloud. Push 'commons' and open platforms.
    • "Consider open science as a production and dissemination system that needs integrated, easy and fair access to several types of shared resources (physical, digital, intellectual), engaged communities that contribute to the process and collaborates in the management and stewardship of the resources, a suitable governance with rules to allow/exclude access, to resolve conflicts, and finally financial support for the long-term availability".


Tuesday 5th May

  • It is a CMS week this week.
  • A pre-GDB on batch systems is taking place next Tuesday 12th May. More T2 participation is sought. Still need to define T2 GDB rep.
  • CHEP'15 proceedings submissions due by May 17th.
  • April A/R figures circulated. No real issues this month except getting UCL (VAC/Cloud only) site correctly monitored.
WLCG Operations Coordination - Agendas

Thursday 7th May

  • The agenda. Minutes
  • News: Alessandra will present the WLCG workshop conclusions at next week's GDB.
  • Middleware news: UMD 3.12.0 released this week (fixes for ARGUS-PAP and dCache server)
  • Middleware baselines: dCache 2.6.x removed. New version 2.10.28/ 2.12.8 of dCache. Sites should avoid simultaneous updates.
  • Middleware issues: a major upgrade of torque arrived in EPEL (from torque-2.5.7 to torque-4.2.10), which is not compatible with the standard EMI torque installation. For sites that have upgraded, the patched 2.5.13 version of torque has been pushed to the EMI third-party repo so that they can downgrade.
  • T0 & T1 upgrades: FTS 3.2.33 upgraded at CERN & RAL.
  • T0 news: batch HTCondor pilot is open for grid submission. Lower-than-usual WLCG availability figures in March for Atlas and CMS - possible overload.
  • T1 feedback: NTR
  • T2 feedback: NTR
  • OS support in UMD: Plans in EGI for CentOS7 support. 13 products are ready for EPEL7, but in general CentOS7 is not a viable option for sites. The release of UMD4 (supporting EPEL7 and Ubuntu) is foreseen for September 2015 and the decommissioning of SL5 for March 2016. It is likely that some products relevant for WLCG will not be ready for EPEL7 before 2016. The requirement for WLCG is to provide SL6 until the end of Run2. However, there are already offers for resources on CentOS7, and this is an incentive for experiments to validate their software on it.
  • ALICE: CASTOR at CERN - some re-reco job instabilities.
  • ATLAS: ~running full. Considering increasing job lengths for all MCORE jobs. Need sites to provide MCORE resources. Rucio/FTS issue was discovered - fix via update. Tier-0 data and computing workflow fully commissioned.
  • CMS: CMS production activities continue - Several sites reported network saturation. Evaluating the use of selected "strong" Tier-2 sites to add computing capacity for DIGI-RECO. Plan to drop support of CRC32 checksum in CMS data transfer systems.
  • LHCb: Various operational issues reported - CASTOR/CERN SRM access problems; other data access issues.
  • gLExec: ATLAS 61 out of 94 sites. RAL, RALPP and TW-FTT issue was due to a bug in the pilot code that showed up with ARC CE + Condor sites.
  • SHA-2: old VOMS server aliases (lcg-)voms.cern.ch were removed on Tue Apr 28.
  • RFC proxies: RFC proxy readiness to be followed up per experiment. SAM-Nagios proxy renewal code fix to support RFC proxies.
  • Machine/Job features: NTR
  • MW readiness: 10th meeting held on 6th May (agenda). WG is making a check-point of goals and priorities. ARGUS testbed at CERN is set up and ready to start. Pakiti client requested at other test sites.
  • MC deployment: NTR
  • IPv6: LHCb: DIRAC was made IPv6-compatible back in November, but testing only started in April. Issue found at CERN with python library (wrong IPv6 address returned).
  • Network/Transfers WG: NTR
  • HTTP deployment: perfSONAR - Security: NDT 3.7.0.1 was released. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS. Network performance incidents process put in place as was agreed at the last meeting. OSG/Datastore validation progressing well. Publishing results to message bus progressing, development has finalized for esmond2mq prototype. Recent meeting focussed on FTS performance. Next meeting 3rd June. Plan is to focus it on latency ramp up and proximity service.
Tier-1 - Status Page

Tuesday 12th May

  • Remaining CREAM CEs were turned off last week.
  • The problems with our primary network router are still being followed up.
  • We are planning an update to the version of the Oracle database behind Castor. Dates to be finalised.
Storage & Data Management - Agendas/Minutes

Tuesday 21st April

  • Has there been any Tier-1 contact with DiRAC?
  • Proposal to set up an 'other VOs' users list. GridPP-Users is too closely tied to WLCG projects.

Wednesday 15 April

  • Backing up data from DiRAC to GridPP (tape)
  • More case studies on supporting non-LHC VOs on GridPP: we have a lot of great stuff that can do great stuff - non-LHC VOs tend to have less regimented data models so maybe we need more case studies.

Wednesday 8 April

Tuesday 7th April

  • There was a DPM collaboration meeting last week. (Blog update)

Tuesday 31st March

Tuesday 17th March

  • An annual DPM collaboration board will take place in coming weeks. Are there issues that sites want raised (in relation to things like roadmap, concerns about DPM, requests etc.)?


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

One Brunel, Liv, ECDF. 113473. Message broker issues – mem problem. Site thinks sent - lost. APEL. Croatia/Greece. WLCG reports late. - Does VAC publish sync.


Tuesday 21st April

  • (Slight) Accounting delays seen for: UCL; Sheffield; QMUL & RALPP.

Tuesday 14th April

  • APEL delays for UCL; Sheffield; RALPP and Bristol
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 21st April

  • The Approved VOs document has been updated to take account of changes to the Ops Portal VOID cards. For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503. Sites that support SNOPLUS.SNOLAB.CA should ensure that their configuration conforms to these settings: Approved VOs
  • KeyDocs still need updating since agreements reached at last core ops meeting.
  • New section in Wiki called "Project Management Pages".
    The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM is to move it to Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Tuesday 21st April

  • There was an EGI ops meeting on Monday 20th.
  • David updated the UK SL5 response.
  • Please review the agenda/minutes.

Monday 9th March

  • The agenda for February's EGI ops meeting is here. Minutes are here
    • APEL 1.4.0
      • Added Month and Year columns to primary key of CloudSummaries table in cloud schema.
    • DPM-Xrootd 3.5.2 is in EPEL stable - this is the first version of the component compatible with xrootd4
    • gLExec-wn - v. 1.2.3: lcmaps-plugins-c-pep 1.3.0-1 & mkgltempdir 0.0.5-1
      • "The lcmaps-plugins-c-pep-1.3.0-1 preferably needs the argus-pep-api-c-2.3.0. This version will be released into EMI & UMD repositories in a near future."
    • UMD 3.11.0 released on 16.02.2014, UMD 3.11.1 released on 4.03.2014
    • lcg-CA 1.62 noted with an intention to broadcast these as they occur as opposed to monthly.
    • EGI looking at the decommissioning of SL5, possibly by end of 2015, as a byproduct of adding CentOS 7 to UMD. NGIs to make a note if extended SL5 support is required.
    • Vincenzo Spinoso has joined EGI Ops team from NGI_IT. Vincenzo will chair EGI Ops.
    • Next meeting is April 20th.


Monitoring - Links MyWLCG

Tuesday 31st March

Monday 7th December

On-duty - Dashboard ROD rota

Monday 11th May

  • Rota responses awaited from Andrew and Daniela.
  • Handover summary should be uploaded to the bulletin please.

Tuesday 28th April

  • Glasgow: A GLUE2 problem is transient and doesn't have a short-term solution (if the service status was checked a little more frequently it would help). Currently on hold. IC sometimes see this too.
  • UCL: No change to the on-going situation. UCL has hopped from one downtime to another this week. Note – AM visiting UCL this week to setup VAC. Services will be decommissioned after this step.


Tuesday 21st April

  • UCL have put themselves into a downtime until the 21st April. (Start of next week). Noted this in their outstanding tickets.
  • Birmingham's availability has steadily recovered over the week - and the low availability ticket against them should be closable next week.


Rollout Status WLCG Baseline

Tuesday 12th May

  • MW Readiness WG meeting Wed May 6th at 4pm. Attended by Raul, Matt, Sam and Jeremy.

Tuesday 17th March

  • Daniela has updated the EMI-3 testing table (https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3). Please check it is correct for your site. We want a clear view of where we are contributing.
  • There is a middleware readiness meeting this Wednesday. Would be good if a few site representatives joined.
  • Machine job features solution testing. Fed back that we will only commence tests if more documentation made available. This stops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.

References


Security - Incident Procedure Policies Rota

Tuesday 12th May

Tuesday 28th April

  • SSL version issues with GridFTP at Bristol? Turns out the server doesn't like proxies that have not been decorated correctly.

Tuesday 21st April

  • EGI Alert 'High' risk - Xen Vulnerability Hypervisor memory corruption due to x86 emulator flaw CVE-2015-2151 [EGI-ADV-20150415]


Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

LHCOPN & LHCONE joint meeting at LBL June 1st & 2nd. Agenda taking shape.

Tuesday 31st March

Tuesday 10th March

  • From the recent WLCG meeting, two slides (1 & 2) give the direction of the network monitoring and metrics progress: integration of perfSONAR event types into experiment monitoring and an architecture for data to get from RSV probes to client. Components described on slide 3.
  • The next LHCOPN and LHCONE joint meeting will take place on Monday 1st and Tuesday 2nd of June 2015 in Berkeley (US) (hosted by LBL and ESnet).
Tickets

Monday 11th May 2015, 14.10 BST
22 Open UK Tickets this week.

TIER 1
There are a few tickets at the Tier 1 that are set "In Progress" but haven't received an update yet this month:
108944 (CMS AAA Tests, 30/4)
112721 (Atlas Transfer problems, 16/4)
109694 (SNO+ gfal copy trouble, 15/4)
112866 (CMS job failures, 7/4)
112819 (SNO+ arcsync troubles, 20/4)
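
The selection above (tickets "In Progress" with no update yet this month) amounts to a simple date filter. A small Python sketch, with the ticket numbers and dates copied from the list, the dates taken here to be last-update dates, and the year 2015 assumed from the bulletin date. The data structure and function name are illustrative only and do not reflect any GGUS export format:

```python
from datetime import date

BULLETIN_DATE = date(2015, 5, 11)

# (ticket number, day, month) copied from the list above;
# the dates are assumed to be the last-update dates.
last_updates = [
    (108944, 30, 4),
    (112721, 16, 4),
    (109694, 15, 4),
    (112866, 7, 4),
    (112819, 20, 4),
]


def not_updated_this_month(tickets, today=BULLETIN_DATE):
    """Return ticket numbers whose last update predates the current month."""
    month_start = today.replace(day=1)
    return [
        number
        for number, day, month in tickets
        if date(today.year, month, day) < month_start
    ]


stale = not_updated_this_month(last_updates)
print(stale)
```

Every ticket in the list was last touched in April, so all five are selected for a bulletin dated 11th May.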

Other Tier 1 Tickets (sorry to be picking on you guys!)
111699 (10/2)
Atlas glexec hammercloud test jobs at the Tier 1. It appears to be working, but a batch of test jobs failed because they couldn't find the "mkgltempdir" utility on some nodes ("slot1_5@lcg1742.gridpp.rl.ac.uk" and "slot1_4@lcg1739.gridpp.rl.ac.uk"). In progress (4/5)

113320 (27/4)
Maybe repeating what Daniela is going to say in the CMS update - trouble with CMS data transfers within RAL. It's under investigation, but it looks like the files in question will need to be invalidated - even if it's just to paint a clearer picture. In progress (10/5)

APEL REPUBLISHING
113473
At last update Brunel, Liverpool, Edinburgh, Birmingham and Oxford need to republish still. Oxford have their own ticket about it due to complications (113482).

UCL Tickets - Ben is starting to move to close these, some are going to be "unsolved".

GLASGOW
113095 (17/4)
Andrew asks if the timeframe for the move to Condor could be added to this ticket, for the ROD team's information. On Hold (7/4)

100IT
112948 (10/4)
No news on this 100IT ticket for a while. In progress (27/4)

Tools - MyEGI Nagios

Tuesday 17th February

  • Another period where message brokers were temporarily unavailable seen yesterday. Any news on the last follow-up?

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
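
The failover expected in the incident above amounts to trying the broker endpoints in order and using the first one that responds. A minimal Python sketch, assuming a plain TCP connection test is an adequate health check. It says nothing about BDII caching or about how the real monitoring clients select a broker. The function names are hypothetical; only the two broker URLs come from the text:

```python
import socket
from urllib.parse import urlparse

# Broker endpoints quoted in the bulletin, in order of preference.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]


def tcp_reachable(url: str, timeout: float = 5.0) -> bool:
    """Hypothetical health check: can a TCP connection be opened?"""
    parts = urlparse(url)
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_broker(urls, is_up=tcp_reachable):
    """Return the first URL whose health check passes, or None."""
    for url in urls:
        if is_up(url):
            return url
    return None
```

Passing the health check as a parameter keeps the selection logic testable without network access, for example `pick_broker(BROKERS, is_up=lambda url: "cro-ngi" in url)`.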


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 5th May

  • We have a number of VOs to be removed. Dedicated follow-up meeting proposed.

Tuesday 28th April

  • For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503.
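
Sites wanting to confirm the change could compare the port in their local configuration against the new value. A rough Python sketch, assuming the configuration is available as vomses-style lines of five quoted fields (alias, host, port, server DN, VO name). The sample lines and the DN placeholder are invented for illustration and are not real site configuration:

```python
import shlex

EXPECTED_PORT = 15503  # new port for both servers, per the bulletin
SERVERS = {"voms02.gridpp.ac.uk", "voms03.gridpp.ac.uk"}


def stale_entries(vomses_lines, vo="snoplus.snolab.ca"):
    """Return (host, port) pairs for the VO that do not use EXPECTED_PORT."""
    stale = []
    for line in vomses_lines:
        fields = shlex.split(line)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        _alias, host, port, _dn, vo_name = fields[:5]
        if (
            vo_name.lower() == vo.lower()
            and host in SERVERS
            and int(port) != EXPECTED_PORT
        ):
            stale.append((host, int(port)))
    return stale


# Illustrative input only: the DN is a placeholder, not the real server DN.
sample = [
    '"snoplus.snolab.ca" "voms02.gridpp.ac.uk" "15003" "/PLACEHOLDER/DN" "snoplus.snolab.ca"',
    '"snoplus.snolab.ca" "voms03.gridpp.ac.uk" "15503" "/PLACEHOLDER/DN" "snoplus.snolab.ca"',
]
print(stale_entries(sample))
```

With the sample input, only the voms02 entry still on the old port 15003 is reported.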

Tuesday 31st March

  • LIGO are in need of additional support for debugging some tests.
  • LSST now enabled on 3 sites. No 'own' CVMFS yet.

Monday 9th March

  • SIXT VOMS parameters have changed.

Tuesday 17th February

Two changes to approved VOs (https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs):

  • LSST uses port 15003 (had been 15002, clashing with dzero)
  • t2k has included a note that its software is now distributed via CVMFS.
Site Updates

Tuesday 24th February

  • Next review of status today.

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
      • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
      • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
    • Dual stack nodes - 21%
      • YES: Brunel; IC; QMUL; Oxford (4)
      • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
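
The percentages quoted above follow from the YES counts over the 19 sites listed (the YES and NO counts in each category sum to 19). A minimal Python sketch of that arithmetic, with the counts copied from the lists above and the dictionary and function names chosen purely for illustration:

```python
# YES counts copied from the lists above; names are illustrative only.
TOTAL_SITES = 19  # YES + NO in each category, e.g. 12 + 7 for multicore

yes_counts = {
    "Multicore queues": 12,
    "Cloud/VMs": 5,
    "GridPP DIRAC jobs": 11,
    "IPv6 allocation": 8,
    "Dual stack nodes": 4,
}


def percentage(yes: int, total: int = TOTAL_SITES) -> int:
    """Return the share of sites answering YES, rounded to a whole percent."""
    return round(100 * yes / total)


summary = {name: percentage(count) for name, count in yes_counts.items()}

for name, pct in summary.items():
    print(f"{name}: {pct}%")
```

Rounding to the nearest whole percent reproduces the figures in the status list (63%, 26%, 58%, 42% and 21%).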


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 13th May 2015 Operations report

  • We are investigating some significant Castor performance issues for CMS.
  • Revised (still draft) dates for the upgrade of the Oracle database behind Castor (to version 11.2.0.4) were presented.
  • The decommissioning of the CREAM CEs was effectively done a week ago. They are now marked as being not in production in the GOC DB.
  • We are still working to fix the problematic Tier1 router.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Atlas S&C week 2-6 Feb 2015

Production

• Prodsys-2 in production since Dec 1st

• Deployment has not been transparent; many issues have been solved and the grid is full again.

• MC15 is expected to start soon, waiting for physics validations; evgen testing is underway and close to finalised. Simulation expected to be broadly similar to MC14, no blockers expected.

Rucio

• Rucio in production since Dec 1st and is ready for LHC RUN-2. Some fields need improvements, including transfer and deletion agents, documentation and monitoring.

Rucio dumps available.

Dark data cleaning

Lost files declaration. Only DDM ops can issue a lost files declaration for now; cloud support needs to file a ticket.

• Webdav panda functional tests with Hammercloud are ongoing

Monitoring

Main page

DDM Accounting

space

Deletion

ASAP

• ASAP (ATLAS Site Availability Performance) in place. Every 3 months the T2 sites performing below 80% are reported to the International Computing Board.


UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A