Operations Bulletin Latest

From GridPP Wiki

Bulletin archive


Week commencing 29th June 2015
Task Areas
General updates

Tuesday 30th June

  • Steve's GridPP tests relied on the CERN LFC (ATLAS), which was removed last week. We need to review which tests, if any, we still want to run in the UK. Do we?
  • A new mailing list has been set up for the discussion of batch and CE matters in WLCG: project-lcg-gdb-batch at cern.ch.
  • Some have noted that the GOCDB was not sending out downtime notifications. A ticket was raised.
  • A preliminary draft of the July GDB agenda is available.
  • In order to get the broadest input, EGI has prepared a survey (deadline July 7th).
  • There was an EGI OMB last week. Please take a look at the brief minutes and actions.
  • The SAM tests used for ATLAS Availability and Reliability of SEs are going to be changed to new gfal2 DDM-probes from 1st July. The new monitoring is already running at http://wlcg-sam-atlas-dev.cern.ch/ (the names of the tests are DDM-srm-Del, DDM-srm-Get and DDM-srm-Put). From 1st July these will move to http://wlcg-sam-atlas.cern.ch/.
  • The camont collaboration is no longer active, but the VO is enabled and useful. Previous contributors to the camont work would like to re-purpose the VO to support the Cambridge 'Geant Human Oncology Simulation Tool'. Provided we broadcast any change to sites supporting camont, and the AUP changes (and all users re-sign it), is there any objection to this pragmatic solution? It fits well with our current aim of getting varied communities using our resources successfully.

Tuesday 23rd June

  • The BDII discussion of last week continued into the WLCG ops coordination meeting (see email thread).
  • The IGTF is about to release an update to the trust anchor repository (1.65).
  • hone has notified us that they have completed their use of the grid.
  • CVMFS space for GridPP VO: /cvmfs/gridpp.gridpp.ac.uk or /cvmfs/gridpp.egi.eu
  • supernemo data in Liverpool DPM - can it be removed?
  • On July 17 a 4 hour long EGI Federated Cloud tutorial will be organised in London (at SAP near Heathrow). It's a free event, part of a 3-day long software carpentry workshop.
  • GridPP35 @ Liverpool in September.


WLCG Operations Coordination - Agendas

Tuesday 30th June

  • A new WLCG ops portal will be live soon. If you are particularly keen to give feedback please contact Jeremy

Tuesday 23rd June

  • There was a WLCG ops coordination meeting last Thursday 18th June: Agenda.
  • Highlights: Information System discussion started. Use cases and dependencies will be built up and reviewed. May have a pre-GDB on topic.
  • All sites should enable multicore accounting
  • News: No update.
  • Baselines: Removed WMS & L&B. LFC will be removed soon.
  • MW issues: New globus-gss api released to mitigate problem reported last time.
  • T0&1 services: T0 LHCb and shared LFC will be decommissioned 22nd June. Some dCache upgrades reported.
  • T0 news: Efficiency meeting held. Cloud team making I/O changes. LHC expts see some improvement but T0 still behind other sites.
  • T1 feedback: NTR
  • T2 feedback: UK response on Information System: Useful for service discovery; minor VO usage; contains too much information; cloud raises new questions; mixed data types; YAIM helpful to fill schemas.
  • OSG: Provide InfoSys as service to VOs. Best case deprecation early 2016 but depends on USATLAS.
  • InfoSys: HTC > GLUE in OSG. AGIS uses it (ATLAS seek merge of GOCDB, OIM and BDII). LHCb uses for CE discovery. CMS no clear usage. ALICE for SAM and CERN IT C5 reports.
  • ALICE: High activity. CASTOR issue with xrd3cp. Request sites to plan for Xrootd v4.1.
  • ATLAS: Good data taking. T0 some issues with batch/OpenStack improving. CERN network issue had impacts. CERN to BNL data backlog due to FTS not pushing hard enough.
  • CMS: Data taking but technical stops. MC going well. T1 CPU should be 90% production role and 10% pilot. File transfer FNAL-RAL - possible WAITIO on storage nodes due to many CMS jobs.
  • LHCb: Run2 offline processing workflows validated. Some issues with old files at RAL without checksums.
  • gLExec: NTR
  • RFC proxies: SAM - okay now for ALICE. CMS PhEDEx instances being switched.
  • Machine/Job features: NTR
  • Middleware readiness: Good work. Credit to ECDF and GRIF for DPM work. New pakiti-client imminent in EPEL stable. MW readiness App now available on a production instance. EL7 support for ARGUS urgent. Next meeting 16th September.
  • Multicore: Several sites still not publishing. APEL tickets on NGIs. Issues identified for CREAM and ARC MC publishing.
  • IPv6: NTR
  • Network and transfers WG: PS: proposed mesh for up to 100 sites. Potential bug noted. Next meeting 8th July.
  • HTTP: 2nd meeting on 3rd June. Draft conclusions. Next meeting 15th July.

Tuesday 16th June

  • The next WLCG ops coordination meeting is this Thursday 18th June: Agenda. There will be presentations and discussions on the Information System.
  • The next middleware readiness meeting is on Wednesday 17th June @ 3pm BST: Agenda.


Tier-1 - Status Page
  • A reminder that there is a weekly Tier-1 experiment liaison meeting.
  • The agenda follows this format:
    • 1. Summary of Operational Status and Issues
    • 2. Highlights/summary of the Tier1 Monday operations meeting (Grid Services; Fabric; CASTOR and Other)
    • 3. Experiment plans and operational issues (CMS; ATLAS; LHCb; ALICE and Others)
    • 4. Special presentations
    • 5. Actions
    • 6. Highlights for Operations Bulletin Latest
    • 7. AoB

Tuesday 30th June

  • There were two separate network-related problems on Tuesday afternoon last week. The first was a short (less than ten minute) break in connectivity when a router rebooted. The second, which lasted around 45 minutes, was a period of very high traffic caused by an operational/configuration problem on a hypervisor.
  • We have announced an 'at risk' for the next quarterly UPS/generator load test tomorrow (Wednesday) morning.
Storage & Data Management - Agendas/Minutes

Wednesday 24 June

  • Heard about the Indigo datacloud project, an H2020 project in which STFC is participating.
  • Data transfers, theory and practice
    • Somewhat clunky tools to set up but perform well when they run
    • Will continue to work on recommendations/overview document
    • Worth having recommendations/experiences for different audiences - (potential) users, decision makers, techies

Tuesday 23rd June

  • Good progress with DiRAC transfers from Durham - data flowing since Monday.

Wednesday 17 June

  • EU projects - SAGE: HSM for HPC
  • Progress on new VOs. Can test as members of 'gridpp' or similar until they get their own allocations.
    • We've talked about it before; should VOs have individual T2 allocations to avoid stepping on each other's toes?
    • Case for expanding back-up-into-T1 to other VOs?

Wednesday 27 May

  • Working on troubleshooting DIRAC data for/with LIGO (not to be confused with DiRAC or with any of the other things called DiRAC)
  • Working on setting up DiRAC at Tier 1 (not to be confused with DIRAC or Dirac or with any other thing called Dirac)
  • New secret user support list!

Tuesday 18th May

Tuesday 21st April

  • Has there been any Tier-1 contact with DiRAC?
  • Proposal to set up an 'other VOs' users list. GridPP-Users is too tied to WLCG projects.

Wednesday 15 April

  • Backing up data from DiRAC to GridPP (tape)
  • More case studies on supporting non-LHC VOs on GridPP: we have a lot of great stuff that can do great stuff - non-LHC VOs tend to have less regimented data models so maybe we need more case studies.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 16th June

  • Region not publishing accounting by number of cores.
    • "0" core submission hosts:
    • ce3.dur.scotgrid.ac.uk
    • ce4.dur.scotgrid.ac.uk
    • cetest02.grid.hep.ph.ic.ac.uk
    • hepgrid5.ph.liv.ac.uk
    • hepgrid6.ph.liv.ac.uk
    • hepgrid97.ph.liv.ac.uk
    • svr009.gla.scotgrid.ac.uk
    • t2ce06.physics.ox.ac.uk

Tuesday 9th June

  • Delay noted for Sheffield

Tuesday 26th May

  • Delay noted for Sheffield.

Tuesday 12th May

  • Issues noted with sync for Brunel, Liv, ECDF (see EGI ticket 113473). Message broker issues (memory related) are likely the underlying EGI problem.
  • Need to check on VAC sync publishing.


Documentation - KeyDocs

Tuesday 23rd June

  • Reminder that documents need reviewing!

Tuesday 9th June

LSST voms2 records are not present in VOID cards yet. As a workaround, a temporary note of the actual values has been added to the LSST section of Approved VOs.

https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

General note

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 21st April

  • The Approved VOs document has been updated to take account of changes to the Ops Portal VOID cards. For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503. Sites that support SNOPLUS.SNOLAB.CA should ensure that their configuration conforms to these settings: Approved VOs
  • KeyDocs still need updating since agreements reached at last core ops meeting.
  • New section in Wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in here. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM is to move it to the Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 15th June

  • There was an EGI operations meeting today: agenda.
  • New action for the NGIs: please start tracking which sites are still using SL5 services (how many services; for each service, whether it is still needed on SL5; and whether upgrades of SL5 services are expected). A wiki has been provided to record updates. It would also be interesting to understand who is using Debian.

Tuesday 21st April

  • There was an EGI ops meeting on Monday 20th.
  • David updated the UK SL5 response.
  • Please review the agenda/minutes.


Monitoring - Links MyWLCG

Tuesday 16th June

  • F Melaccio & David Crooks decided to add a FAQs section devoted to common monitoring issues under the monitoring page.
  • Feedback welcome.


Tuesday 31st March

Monday 7th December

On-duty - Dashboard ROD rota

Monday 22nd June

  • Generally quiet. There are some 'glue2' errors that were ticketed. Tried to let these go and see if they would clear. However, in some cases the amount of time the error was outstanding was building up. Unclear if Glue2 is used anywhere.


Monday 8th June

  • The eu.repository has now made a comeback, so the ARC alarms cleared, but the site availabilities (probably) need to be corrected.
  • Still getting on/off bdii alarms for a variety of sites.

Monday 11th May

  • Rota responses awaited from Andrew and Daniela.
  • Handover summary should be uploaded to the bulletin please.


Rollout Status WLCG Baseline

Tuesday 12th May

  • MW Readiness WG meeting Wed May 6th at 4pm. Attended by Raul, Matt, Sam and Jeremy.

Tuesday 17th March

  • Daniela has updated the EMI-3 testing table (https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3). Please check it is correct for your site. We want a clear view of where we are contributing.
  • There is a middleware readiness meeting this Wednesday. Would be good if a few site representatives joined.
  • Machine job features solution testing. Fed back that we will only commence tests if more documentation is made available. This stops the HTC solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM. There is also SGE and Torque.



Security - Incident Procedure Policies Rota

Monday 29th June

  • EUGridPMA have announced a new set of CA rpms. Based on this IGTF release, a new set of CA RPMs has been packaged for EGI. Please upgrade at your earliest convenience within the next seven days. Once this period is over, SAM will throw critical errors on CA tests if old CAs are still detected.
  • The next security team meeting is this Wednesday 1st July.

Tuesday 16th June

  • Security team meeting this Wednesday.
  • One topic for review concerns ES.

Tuesday 9th June



Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 23rd June

  • GridPP issued a position statement regarding LHCONE.
    • ...Concerning LHCONE for both T1 and T2. The high level summary is that the UK is not in favour, as within the UK we have no explicit need for LHCONE for any reason of T1 capacity planning, but to implement it involves additional complexity and possibly cost. The current system works fine and we therefore see no overriding reason to remove T1-T1 transit via LHCOPN. ...The UK is sensitive to the “collective” needs of the community, and as a general statement we would always seek to address any legitimate request agreed by the WLCG MB in order to play our role in meeting international expectations.

Tuesday 12th May

  • LHCOPN & LHCONE joint meeting at LBL June 1st & 2nd. Agenda taking shape.

Tuesday 31st March

Tickets

Monday 29th June 2015, 14.30 BST

Looking at the "Other VO" Nagios.
Things look generally alright - but Durham look like they need to update their CA rpms - but that might have to wait until Oliver is back from leave.

Tarballs:
I don't think this affects many, but there was a ticket to produce a new version of the WN tarball (which is done): 114574
Although AFAICS there is no urgent need to upgrade tarball WNs.

26 UK Tickets, although not many stand out.

Gridpp VO Pilot tickets:
Largely doing alright. With Oliver away the Durham ticket hasn't been looked at yet. Sheffield and Bristol's tickets could do with an update (or on-holding if there's going to be a delay). The RHUL ticket has been reopened as their deployment of the pilot roles hasn't quite worked out.

QMUL
114573 (23/6)
LHCB having trouble with two out of three QM CEs. Dan notes that the two "broken" CEs have been recently dual-stacked, and asks if this could be the problem. The answer is a resounding "maybe", and Raja asks if problems could be duplicated by others using lxplus. Waiting for reply (24/6)

IMPERIAL/DIRAC
114379 (16/6)
Sam's ticket trying to get SE support with Dirac...er, spruced up. Daniela has asked if the tests can be redone with the "new" dirac. Waiting for reply (22/6)

Let me know if I missed any tickets.

Tools - MyEGI Nagios

Tuesday 09 June 2015

  • ARC CEs were failing the Nagios test because the EGI repository was unavailable. The Nagios test compares the CA version with that in the EGI repo. The problem started on 5th June, when one of the IP addresses behind the webserver was not responding, and went away after approximately 3 hours. The same problem started again on 6th June and was finally fixed on 8th June. No reason was given in any of the tickets opened regarding this outage.

Tuesday 17th February

  • Another period where message brokers were temporarily unavailable seen yesterday. Any news on the last follow-up?

Tuesday 27th January

  • Unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. Suspect BDII caching meant no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/
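A minimal sketch of the failover idea discussed above: probe each broker endpoint in order and use the first one that answers, rather than relying on a cached lookup. The two broker URLs come from the entry above; the function names, the plain TCP probe and the timeout are assumptions made for illustration, and this is not how the monitoring framework itself selects a broker.

```python
import socket
from urllib.parse import urlsplit

# Broker endpoints quoted in the bulletin entry above, in preference order.
BROKERS = [
    "stomp://mq.afroditi.hellasgrid.gr:6163/",
    "stomp://mq.cro-ngi.hr:6163/",
]

def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened (assumed health check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def first_reachable(urls, probe=tcp_probe):
    """Return the first broker URL whose endpoint answers the probe, else None."""
    for url in urls:
        parts = urlsplit(url)
        if probe(parts.hostname, parts.port):
            return url
    return None
```

Calling `first_reachable(BROKERS)` attempts live connections; passing a different `probe` makes it possible to exercise the selection logic without touching the network.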


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 19th May

  • There is a current priority for enabling/supporting our joining communities.

Tuesday 5th May

  • We have a number of VOs to be removed. Dedicated follow-up meeting proposed.

Tuesday 28th April

  • For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503.
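Sites wanting to check their own configuration for the old port could use something like the sketch below. It assumes the usual five-field quoted vomses line layout (alias, host, port, server DN, VO name); the sample lines and the `/placeholder/DN` value are made up for illustration and are not the real server DNs.

```python
import shlex

EXPECTED_PORT = 15503  # new port stated in the bulletin
HOSTS = {"voms02.gridpp.ac.uk", "voms03.gridpp.ac.uk"}

def stale_entries(vomses_text: str, vo: str = "snoplus.snolab.ca"):
    """Return (host, port) pairs for the VO whose port differs from the expected one."""
    stale = []
    for line in vomses_text.splitlines():
        fields = shlex.split(line)
        # Assumed layout: alias, host, port, server DN, VO name.
        if len(fields) < 5 or fields[4].lower() != vo:
            continue
        host, port = fields[1], int(fields[2])
        if host in HOSTS and port != EXPECTED_PORT:
            stale.append((host, port))
    return stale

# Hypothetical sample: the first line still carries the old port.
sample = '''
"snoplus.snolab.ca" "voms02.gridpp.ac.uk" "15003" "/placeholder/DN" "snoplus.snolab.ca"
"snoplus.snolab.ca" "voms03.gridpp.ac.uk" "15503" "/placeholder/DN" "snoplus.snolab.ca"
'''
print(stale_entries(sample))  # → [('voms02.gridpp.ac.uk', 15003)]
```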

Tuesday 31st March

  • LIGO are in need of additional support for debugging some tests.
  • LSST now enabled on 3 sites. No 'own' CVMFS yet.
Site Updates

Tuesday 24th February

  • Next review of status today.

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex
  • Dual stack nodes - 21%
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex, RAL T1 (15)
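The percentages quoted above can be reproduced from the YES/NO site counts. The sketch below takes the counts from the lists in this entry (the NO count for IPv6 allocation, 11, is counted from the list since no total is given there) and assumes the percentage is YES divided by all sites listed, rounded to a whole number.

```python
# (YES, NO) site counts taken from the lists above.
status = {
    "Multicore queues": (12, 7),
    "Cloud/VMs": (5, 14),
    "GridPP DIRAC jobs": (11, 8),   # NO is listed as 4 + 4
    "IPv6 allocation": (8, 11),     # NO count tallied from the list
    "Dual stack nodes": (4, 15),
}

def percent_yes(yes: int, no: int) -> int:
    """Return the share of YES sites as a whole-number percentage."""
    return round(100 * yes / (yes + no))

summary = {name: percent_yes(yes, no) for name, (yes, no) in status.items()}

for name, pct in summary.items():
    print(f"{name}: {pct}%")
```

Each category totals 19 sites, so the five figures (63%, 26%, 58%, 42%, 21%) are directly comparable.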


Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 10th June 2015 Operations report

  • Ongoing investigation into Castor performance issues for CMS.
  • The second tranche of 2014 Worker Node purchases has been put into production.
  • There is a short outage announced for next Wednesday (17th June) to test the recent change in the network routing and confirm the problem with the Tier1 network router can still be reproduced.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

Atlas S&C week 2-6 Feb 2015

Production

• Prodsys-2 in production since Dec 1st

• Deployment has not been transparent; many issues have been solved and the grid is filled again

• MC15 is expected to start soon, waiting for physics validations; evgen testing is underway and close to finalised. Simulation expected to be broadly similar to MC14, no blockers expected.

Rucio

• Rucio in production since Dec 1st and is ready for LHC RUN-2. Some fields need improvements, including transfer and deletion agents, documentation and monitoring.

Rucio dumps available.

Dark data cleaning

Files declaration: only DDM ops can issue lost files declarations for now; cloud support needs to file a ticket.

• Webdav panda functional tests with Hammercloud are ongoing

Monitoring

Main page

DDM Accounting

space

Deletion

ASAP

• ASAP (ATLAS Site Availability Performance) in place. Every 3 months the T2 sites performing BELOW 80% are reported to the International Computing Board.


UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A