Operations Bulletin Latest

Bulletin archive


Week commencing 22nd June 2015
Task Areas
General updates

Tuesday 23rd June

  • The BDII discussion of last week continued into the WLCG ops coordination meeting (see email thread).
  • The IGTF is about to release an update to the trust anchor repository (1.65).
  • The 'hone' VO has notified us that it has completed its use of the grid.
  • CVMFS space for the GridPP VO is available at /cvmfs/gridpp.gridpp.ac.uk or /cvmfs/gridpp.egi.eu (a quick mount check is sketched after this list).
  • supernemo data in the Liverpool DPM - can it be removed?
  • On July 17 a 4-hour EGI Federated Cloud tutorial will be held in London (at SAP near Heathrow). It's a free event, part of a 3-day Software Carpentry workshop.
  • GridPP35 @ Liverpool in September.
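
A minimal sketch (not from the bulletin) of how a worker node or UI could verify those repositories are reachable, assuming a standard CVMFS client with autofs; only the repository paths come from the bullet above:

    import os

    # The two gridpp repository paths from the bullet above.
    REPOS = ["/cvmfs/gridpp.gridpp.ac.uk", "/cvmfs/gridpp.egi.eu"]

    for repo in REPOS:
        try:
            # Listing the top level triggers an autofs mount if needed.
            entries = os.listdir(repo)
            print("%s OK (%d top-level entries)" % (repo, len(entries)))
        except OSError as err:
            print("%s NOT AVAILABLE: %s" % (repo, err))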

Tuesday 16th June

  • The minutes of last week's GDB are available.
  • ATLAS has re-classified the following sites as “ATLAS Tier 3” in AGIS (this classification being based upon minimum resource requirements): UKI-LT2-UCL-HEP; UKI-LT2-IC-HEP; UKI-LT2-Brunel; UKI-SOUTHGRID-SUSX and UKI-SCOTGRID-DURHAM.
  • JANET is working with Imperial College to establish a VRF to perform tests of a connection to LHCONE.
  • Please could everyone who supports the gridpp VO ensure that their site accepts jobs with the pilot role? In addition, please also consider the other VOs mentioned in Daniela's email (17:30 on 15th June).
  • Glasgow: VO Shares and ARC CE config question.
  • WLCG workshop 2016: last call for volunteers.
WLCG Operations Coordination - Agendas

Tuesday 23rd June

  • There was a WLCG ops coordination meeting last Thursday 18th June: Agenda.
  • Highlights: the Information System discussion started. Use cases and dependencies will be built up and reviewed. There may be a pre-GDB on the topic.
  • All sites should enable multicore accounting.
  • News: No update.
  • Baselines: Removed WMS & L&B. LFC will be removed soon.
  • MW issues: A new globus-gssapi release mitigates the problem reported last time.
  • T0&1 services: T0 LHCb and shared LFC will be decommissioned 22nd June. Some dCache upgrades reported.
  • T0 news: Efficiency meeting held. Cloud team making I/O changes. LHC exits see some improvement but T0 still behind other sites.
  • T1 feedback: NTR
  • T2 feedback: UK response on Information System: Useful for service discovery; minor VO usage; contains too much information; cloud raises new questions; mixed data types; YAIM helpful to fill schemas.
  • OSG: Provide InfoSys as service to VOs. Best case deprecation early 2016 but depends on USATLAS.
  • InfoSys: HTC > GLUE in OSG. AGIS uses it (ATLAS seek merge of GOCDB, OIM and BDII). LHCb uses for CE discovery. CMS no clear usage. ALICE for SAM and CERN IT C5 reports.
  • ALICE: High activity. CASTOR issue with xrd3cp. Request sites to plan for Xrootd v4.1.
  • ATLAS: Good data taking. T0 some issues with batch/OpenStack improving. CERN network issue had impacts. CERN to BNL data backlog due to FTS not pushing hard enough.
  • CMS: Data taking but technical stops. MC going well. T1 CPU should be 90% production role and 10% pilot. File transfer FNAL-RAL - possible WAITIO on storage nodes due to many CMS jobs.
  • LHCb: Run2 offline processing workflows validated. Some issues with old files at RAL without checksums.
  • gLExec: NTR
  • RFC proxies: SAM - okay now for ALICE. CMS PhEDEx instances being switched.
  • Machine/Job features: NTR
  • Middleware readiness: Good work. Credit to ECDF and GRIF for DPM work. New pakiti-client imminent in EPEL stable. MW readiness App now available on a production instance. EL7 support for Argus urgent. Next meeting 16th September.
  • Multicore: Several sites still not publishing. APEL tickets on NGIs. Issues identified for CREAM and ARC MC publishing.
  • IPv6: NTR
  • Network and transfers WG: perfSONAR: proposed mesh for up to 100 sites. Potential bug noted. Next meeting 8th July.
  • HTTP: 2nd meeting on 3rd June. Draft conclusions. Next meeting 15th July.

Tuesday 16th June

  • The next WLCG ops coordination meeting is this Thursday 18th June: Agenda. There will be presentations and discussions on the Information System.
  • The next middleware readiness meeting is on Wednesday 17th June @ 3pm BST: Agenda.


Tier-1 - Status Page
  • A reminder that there is a weekly Tier-1 experiment liaison meeting.
  • The agenda follows this format:
    • 1. Summary of Operational Status and Issues
    • 2. Highlights/summary of the Tier1 Monday operations meeting (Grid Services; Fabric; CASTOR and Other)
    • 3. Experiment plans and operational issues (CMS; ATLAS; LHCb; ALICE and Others)
    • 4. Special presentations
    • 5. Actions
    • 6. Highlights for Operations Bulletin Latest
    • 7. AoB

Tuesday 23rd June

  • The short outage last Wednesday (17th June) took place as planned. As agreed at last week's meeting we stopped/started the top-level BDIIs around this intervention. Following that test we need to plan a longer (few-hour) outage for an intervention on our problematic router.
  • DiRAC file transfers from Durham now working.
Storage & Data Management - Agendas/Minutes

Wednesday 24 June (minutes: http://storage.esc.rl.ac.uk/weekly/20150624-minutes.txt)

  • Heard about the Indigo datacloud project, a H2020 project in which STFC is participating
  • Data transfers, theory and practice
    • Somewhat clunky tools to set up but perform well when they run
    • Will continue to work on recommendations/overview document
    • Worth having recommendations/experiences for different audiences - (potential) users, decision makers, techies

Tuesday 23rd June

  • Good progress with DiRAC transfers from Durham - data flowing since Monday.

Wednesday 17 June

  • EU projects - SAGE: HSM for HPC
  • Progress on new VOs. Can test as members of 'gridpp' or similar until they get their own allocations.
    • We've talked about it before; should VOs have individual T2 allocations to avoid stepping on each other's toes?
    • Case for expanding back-up-into-T1 to other VOs?

Wednesday 27 May

  • Working on troubleshooting DIRAC data for/with LIGO (not to be confused with DiRAC or with any of the other things called DiRAC)
  • Working on setting up DiRAC at Tier 1 (not to be confused with DIRAC or Dirac or with any other thing called Dirac)
  • New secret user support list!

Tuesday 18th May

Tuesday 21st April

  • Has there been any Tier-1 contact with DiRAC?
  • Proposal to set up an 'other VOs' users list. GridPP-Users is too tied to WLCG projects.

Wednesday 15 April

  • Backing up data from DiRAC to GridPP (tape)
  • More case studies on supporting non-LHC VOs on GridPP: we have powerful infrastructure that can do a great deal, but non-LHC VOs tend to have less regimented data models, so we may need more case studies.


Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 16th June

  • The region is not publishing accounting by number of cores. Submission hosts publishing "0" cores:
    • ce3.dur.scotgrid.ac.uk
    • ce4.dur.scotgrid.ac.uk
    • cetest02.grid.hep.ph.ic.ac.uk
    • hepgrid5.ph.liv.ac.uk
    • hepgrid6.ph.liv.ac.uk
    • hepgrid97.ph.liv.ac.uk
    • svr009.gla.scotgrid.ac.uk
    • t2ce06.physics.ox.ac.uk

Tuesday 9th June

  • Delay noted for Sheffield

Tuesday 26th May

  • Delay noted for Sheffield.

Tuesday 12th May

  • Issues noted with sync for Brunel, Liv, ECDF (see EGI ticket 113473). Message broker issues (memory related) are likely the underlying EGI problem.
  • Need to check on VAC sync publishing.


Documentation - KeyDocs

Tuesday 23rd June

  • Reminder that documents need reviewing!

Tuesday 9th June

LSST voms2 records are not present in VOID cards yet. As a workaround, a temporary note of the actual values has been added to the LSST section of Approved VOs.

https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

General note

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 21st April

  • The Approved VOs document has been updated to take account of changes to the Ops Portal VOID cards. For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503. Sites that support SNOPLUS.SNOLAB.CA should ensure that their configuration conforms to these settings: Approved VOs
  • KeyDocs still need updating since agreements reached at last core ops meeting.
  • New section in the wiki called "Project Management Pages". The idea is to cluster all Self-Edited Site Tracking Tables in there. Sites should keep entries in Current Activities up to date. Once a Self-Edited Site Tracking Table has served its purpose, the PM will move it to the Historical Archive or otherwise dispose of the table.
Interoperation - EGI ops agendas

Monday 15th June

  • There was an EGI operations meeting today: agenda.
  • New action for the NGIs: please start tracking which sites are still using SL5 services: how many services, whether each is still needed on SL5, and whether upgrades of SL5 services are expected. A wiki has been provided to record updates. It is also interesting to understand who is using Debian.

Tuesday 21st April

  • There was an EGI ops meeting on Monday 20th.
  • David updated the UK SL5 response.
  • Please review the agenda/minutes.


Monitoring - Links MyWLCG

Tuesday 16th June

  • FN & DC decided to add an FAQ section devoted to common monitoring issues under the monitoring page.
  • Feedback welcome.


Tuesday 31st March

Monday 7th December

On-duty - Dashboard ROD rota

Monday 22nd June

  • Generally quiet. There were some 'glue2' errors that were ticketed. We tried letting these go to see if they would clear, but in some cases the time the error had been outstanding kept building up. It is unclear whether GLUE 2 is used anywhere.


Monday 8th June

  • The EGI repository has now made a comeback, so the ARC alarms cleared, but the site availabilities (probably) need to be corrected.
  • Still getting intermittent BDII alarms for a variety of sites.

Monday 11th May

  • Rota responses awaited from Andrew and Daniela.
  • Please upload the handover summary to the bulletin.


Rollout Status WLCG Baseline

Tuesday 12th May

  • MW Readiness WG meeting Wed May 6th at 4pm. Attended by Raul, Matt, Sam and Jeremy.

Tuesday 17th March

  • Daniela has updated the EMI-3 testing table (https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3). Please check it is correct for your site. We want a clear view of where we are contributing.
  • There is a middleware readiness meeting this Wednesday. Would be good if a few site representatives joined.
  • Machine/job features solution testing: we fed back that we will only commence tests if more documentation is made available. This stops the HTCondor solution until after CHEP. Is there interest in testing other batch systems? Raul mentioned SLURM; there are also SGE and Torque. (A sketch of reading machine/job features follows below.)
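
For context, a minimal sketch of how a payload might read machine/job features, assuming the draft convention of MACHINEFEATURES and JOBFEATURES environment variables pointing at directories of one-value-per-file keys; the specific key names used here (hs06, jobslots) are illustrative assumptions, not a confirmed schema:

    import os

    def read_feature(env_var, key):
        # Each feature is a small file, named after the key, holding one value.
        base = os.environ.get(env_var)
        if base is None:
            return None  # feature directory not advertised on this node
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except OSError:
            return None  # this key not provided by the site

    # Key names are illustrative assumptions.
    print("HS06:", read_feature("MACHINEFEATURES", "hs06"))
    print("Job slots:", read_feature("MACHINEFEATURES", "jobslots"))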


Security - Incident Procedure Policies Rota

Tuesday 16th June

  • Security team meeting this Wednesday.
  • One topic for review concerns ES.

Tuesday 9th June



Services - PerfSonar dashboard | GridPP VOMS

- This includes notice of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update.)

Tuesday 23rd June

  • GridPP issued a position statement regarding LHCONE.
    • ...Concerning LHCONE for both T1 and T2. The high level summary is that the UK is not in favour, as within the UK we have no explicit need for LHCONE for any reason of T1 capacity planning, but to implement it involves additional complexity and possibly cost. The current system works fine and we therefore see no overriding reason to remove T1-T1 transit via LHCOPN. ...The UK is sensitive to the “collective” needs of the community, and as a general statement we would always seek to address any legitimate request agreed by the WLCG MB in order to play our role in meeting international expectations.

Tuesday 12th May

  • LHCOPN & LHCONE joint meeting at LBL June 1st & 2nd. Agenda taking shape.

Tuesday 31st March

Tickets

Monday 22nd June 2015, 14.00 BST

35 Open UK Tickets this week (!!!)

GridPP Pilot Role
A dozen of them are from Daniela (who painstakingly submitted them all), concerning getting the gridpp (and other VOs') pilot role enabled on each site's CEs.

An example of one of these tickets is:
114440 (Lancaster's smug solved ticket).

Ticketed sites are: Durham, Bristol, Cambridge, Glasgow, ECDF (who are also having general gridpp VO support problems), EFDA-JET (looking solved), Oxford, Liverpool, Sheffield, Brunel, RALPP and RHUL. Most tickets are being worked on fine, but the Bristol and Liverpool ones were still just in an "assigned" state at time of writing.
Update - good progress on this, just one ticket left "assigned". Cambridge are done, as are JET (ticket needs to be closed). Oxford and Manchester are ready to have their new setups tried out, with Oxford kindly road-testing glexec for the pilot roles. Good stuff.

Core Count Publishing
114233 (10/6)
Of the sites mentioned in this ticket (Durham[1], IC, Liverpool, Glasgow, Oxford) who *hasn't* had a go at changing their core count publishing? I know Oxford have. Daniela had a pertinent question about publishing for VMs, which John answered. In progress (17/6)

[1] Durham have another ticket on this which may explain their lack of core count publishing: 114381 (16/6)

DIRAC
114379 (16/6)
Sam S opened this ticket after having trouble accessing the majority of SEs via DIRAC, following some discussion around this last week. Sam acknowledges that this could be a site problem, not a DIRAC problem, but you gotta start somewhere (he worded that point more eloquently). Daniela has posted her latest and greatest DIRAC setup gubbins for Sam to try out. Another, unrelated, point is the names missing from Sam's list - for example, I'm pretty sure Lancaster should support gridpp VO storage but I've forgotten to roll it out! Waiting for reply (22/6)

LIVERPOOL
114248 (10/6)
Final ticket today, and another one discussed last week in the storage meeting. Steve's explanation of why (and how) Sno+ would need to start using space tokens was fantastically well worded, in a way that wouldn't spook easily startled users. David is digesting the information, but it will likely need to wait for Matt M's return before we'll see progress. In progress (16/6)

MANCHESTER
114444(18/6)
I told a pork pie when I said that was the last ticket - this one caught my eye. A ticket from LHCb over files not having their checksums stored on Manchester's DPM. A link was given to another ticket at CBPF for a similar issue which got the DPM devs involved (111403) - although Andrew McNab was already subscribed to the ticket. In progress (19/6)
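
For background, DPM-style storage typically records an Adler32 checksum per file; a minimal sketch of computing one locally for comparison against the catalogue entry (the file path is illustrative):

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        # Stream the file so large files need not fit in memory.
        value = 1  # Adler32 starting seed
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        # Checksums are commonly quoted as 8 lowercase hex digits.
        return "%08x" % (value & 0xFFFFFFFF)

    print(adler32_of("/tmp/somefile"))  # illustrative path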

Tools - MyEGI Nagios

Tuesday 09 June 2015

  • ARC CEs were failing a Nagios test because of the non-availability of the EGI repository; the test compares the installed CA version against the one in the EGI repo. It started on 5th June, when one of the IP addresses behind the web server stopped responding. The problem went away in approximately 3 hours, then started again on 6th June, and was finally fixed on 8th June. No reason was given in any of the tickets opened regarding this outage. (A sketch of a simple repository availability probe follows.)
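
A minimal sketch of the kind of simple availability probe that would have flagged this earlier; the URL is an assumption, not necessarily the endpoint the Nagios test actually polls:

    import urllib.request

    # Illustrative endpoint; not necessarily what the Nagios probe checks.
    REPO_URL = "http://repository.egi.eu/"

    try:
        with urllib.request.urlopen(REPO_URL, timeout=10) as response:
            print("repository reachable, HTTP status", response.status)
    except OSError as err:
        print("repository NOT reachable:", err)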

Tuesday 17th February

  • Another period where message brokers were temporarily unavailable was seen yesterday. Any news on the last follow-up?

Tuesday 27th January

  • An unscheduled outage of the EGI message broker (GRNET) caused a short-lived disruption to GridPP site monitoring (jobs failed) last Thursday 22nd January. We suspect BDII caching meant there was no immediate failover to stomp://mq.cro-ngi.hr:6163/ from stomp://mq.afroditi.hellasgrid.gr:6163/ (see the sketch below).
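
A minimal sketch of the failover behaviour that was expected but did not happen: try the primary broker first, then the alternative. The endpoints are taken from the stomp:// URLs above; a bare TCP check stands in for a real STOMP connection:

    import socket

    # Broker endpoints from the stomp:// URLs above.
    BROKERS = [
        ("mq.afroditi.hellasgrid.gr", 6163),  # GRNET broker that went down
        ("mq.cro-ngi.hr", 6163),              # alternative broker
    ]

    def first_reachable(brokers, timeout=5):
        # Return the first broker accepting TCP connections, or None.
        for host, port in brokers:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return (host, port)
            except OSError:
                continue  # fall through to the next broker
        return None

    print("usable broker:", first_reachable(BROKERS))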


VOs - GridPP VOMS VO IDs Approved VO table

Tuesday 19th May

  • There is a current priority for enabling/supporting our joining communities.

Tuesday 5th May

  • We have a number of VOs to be removed. Dedicated follow-up meeting proposed.

Tuesday 28th April

  • For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503 (a quick connectivity check is sketched below).
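
A quick way for a site to confirm the new port is reachable from its nodes (a minimal sketch; a TCP connect stands in for a full VOMS handshake):

    import socket

    # VOMS servers and the updated port from the bullet above.
    for host in ("voms02.gridpp.ac.uk", "voms03.gridpp.ac.uk"):
        try:
            with socket.create_connection((host, 15503), timeout=5):
                print("%s accepts connections on 15503" % host)
        except OSError as err:
            print("%s unreachable on 15503: %s" % (host, err))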

Tuesday 31st March

  • LIGO are in need of additional support for debugging some tests.
  • LSST now enabled on 3 sites. No dedicated CVMFS repository yet.
Site Updates

Tuesday 24th February

  • Next review of status today.

Tuesday 27th January

  • Squids not in GOCDB for: UCL; ECDF; Birmingham; Durham; RHUL; IC; Sussex; Lancaster
  • Squids in GOCDB for: EFDA-JET; Manchester; Liverpool; Cambridge; Sheffield; Bristol; Brunel; QMUL; T1; Oxford; Glasgow; RALPPD.

Tuesday 2nd December

  • Multicore status. Queues available (63%)
    • YES: RAL T1; Brunel; Imperial; QMUL; Lancaster; Liverpool; Manchester; Glasgow; Cambridge; Oxford; RALPP; Sussex (12)
    • NO: RHUL (testing); UCL; Sheffield (testing); Durham; ECDF (testing); Birmingham; Bristol (7)
  • According to our table for cloud/VMs (26%)
    • YES: RAL T1; Brunel; Imperial; Manchester; Oxford (5)
    • NO: QMUL; RHUL; UCL; Lancaster; Liverpool; Sheffield; Durham; ECDF; Glasgow; Birmingham; Bristol; Cambridge; RALPP; Sussex (14)
  • GridPP DIRAC jobs successful (58%)
    • YES: Bristol; Glasgow; Lancaster; Liverpool; Manchester; Oxford; Sheffield; Brunel; IC; QMUL; RHUL (11)
    • NO: Cambridge; Durham; RALPP; RAL T1 (4) + ECDF; Sussex; UCL; Birmingham (4)
  • IPv6 status
    • Allocation - 42%
    • YES: RAL T1; Brunel; IC; QMUL; Manchester; Sheffield; Cambridge; Oxford (8)
    • NO: RHUL; UCL; Lancaster; Liverpool; Durham; ECDF; Glasgow; Birmingham; Bristol; RALPP; Sussex (11)
  • Dual stack nodes - 21% (a dual-stack check is sketched after this list)
    • YES: Brunel; IC; QMUL; Oxford (4)
    • NO: RHUL; UCL; Lancaster; Glasgow; Liverpool; Manchester; Sheffield; Durham; ECDF; Birmingham; Bristol; Cambridge; RALPP; Sussex; RAL T1 (15)
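
A minimal sketch of checking whether a host is dual stack, i.e. whether DNS advertises both an IPv4 and an IPv6 address; the hostname is illustrative:

    import socket

    def stacks_for(host):
        # Collect the address families DNS advertises for this host.
        families = set()
        for family, _type, _proto, _name, _addr in socket.getaddrinfo(host, None):
            if family == socket.AF_INET:
                families.add("IPv4")
            elif family == socket.AF_INET6:
                families.add("IPv6")
        return families

    # A dual-stack node should report both families.
    print(stacks_for("www.gridpp.ac.uk"))  # illustrative host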


Tuesday 21st October

  • High loads seen in xrootd by several sites: Liverpool and RAL T1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.



Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) - Agendas. The meeting takes place on Vidyo.

Wednesday 10th June 2015 Operations report

  • Ongoing investigation into Castor performance issues for CMS.
  • The second tranche of 2014 Worker Node purchases has been put into production.
  • There is a short outage announced for next Wednesday (17th June) to test the recent change in the network routing and confirm the problem with the Tier1 network router can still be reproduced.
WLCG Grid Deployment Board - Agendas MB agendas

Empty



NGI UK - Homepage CA

Empty

Events
UK ATLAS - Shifter view News & Links

ATLAS S&C week, 2-6 Feb 2015

Production

• Prodsys-2 in production since Dec 1st

• Deployment has not been transparent; many issues have been solved, and the grid is filled again

• MC15 is expected to start soon, waiting for physics validations; evgen testing is underway and close to finalised. Simulation expected to be broadly similar to MC14, no blockers expected.

Rucio

• Rucio in production since Dec 1st and is ready for LHC Run 2. Some areas need improvement, including the transfer and deletion agents, documentation and monitoring.

Rucio dumps available.

Dark data cleaning

• Lost-file declaration: only DDM ops can issue lost-file declarations for now; cloud support needs to file a ticket.

• Webdav panda functional tests with Hammercloud are ongoing

Monitoring

• Monitoring links available: main page, DDM accounting, space, deletion.

ASAP

• ASAP (ATLAS Site Availability Performance) is in place. Every 3 months, T2 sites performing below 80% are reported to the International Computing Board.


UK CMS

Empty

UK LHCb

Empty

UK OTHER
  • N/A
To note

  • N/A