Operations Bulletin 150914


Bulletin archive

Week commencing 8th September 2014
Task Areas
General updates

Monday 8th September

  • Be ready for the new CERN and ops VOMS. Compare the prod and preprod instances for:
  • An EMI3 WN tarball update has been done by Matt (see also GGUS 107869).
  • There is an LHCONE/LHCOPN meeting next week on 16th and 17th (agenda). It would be good to have some remote participation.
  • Website redesign - please complete this survey.
  • For multicore - a reminder for sites running multicore and CREAM that there is an option in APEL to account multicore/multicpu. By default it is off.
  • There is a pre-GDB this afternoon on Clouds.
  • There is a GDB this week. Any input?
  • Storage placement - survey TBC.
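For the multicore accounting reminder above: on CREAM sites this is understood to be the `parallel` option in the APEL parser configuration, which defaults to off. A sketch of the change (treat the file path, section and key names as assumptions and check the APEL documentation before applying):

```ini
# /etc/apel/parser.cfg (fragment - assumed layout)
[batch]
# Publish per-job core/processor counts so that multicore/multicpu
# usage is accounted. Defaults to false.
parallel = true
```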

Monday 1st September

  • A/R results for August have been released.
    • ALICE: All good.
    • ATLAS: Durham (89%:98%) - very close! Sussex (45%:86%) - downtime for various updates. Problems with CE for WMS jobs only, so fine for ATLAS.
    • CMS: All good.
    • LHCb: All good. Northgrid and London perfect!
  • EGI A/R results have also been uploaded to this table. July's results show the UK at 96% overall. UCL, Durham and Birmingham had a couple of issues that affected them.
  • There is a UK CA TAG on 3rd September. Please let Jeremy know if you have any CA related issues or comments.
  • There has been discussion about lock-up problems with 2.6.32-431 kernels on supermicro kit. Any conclusions?
  • VOMS update checks (mixed amongst pre-prod critical alarms):
    • CMS: Bristol, RALPP, RHUL.
    • LHCb: ECDF, EFDA.
    • ATLAS: UCL, ECDF, Oxford, RALPP.
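The A/R percentages above follow the standard EGI/WLCG definitions: availability counts all downtime against a site, while reliability excuses scheduled downtime. A minimal sketch (the hours below are invented, chosen only to reproduce an 89%:98% split like Durham's):

```python
# EGI-style availability/reliability, as used for the monthly figures
# above. All hours are invented for illustration.

def availability_reliability(up, unscheduled_down, scheduled_down, unknown):
    """Availability = uptime / (total - unknown);
    reliability additionally excuses scheduled downtime."""
    total = up + unscheduled_down + scheduled_down + unknown
    known = total - unknown
    avail = up / known
    rel = up / (known - scheduled_down)
    return avail, rel

# A site up 600 h, with 12 h of unscheduled outage and 60 h of
# scheduled downtime in a 672 h month:
a, r = availability_reliability(600, 12, 60, 0)
print(f"A = {a:.0%}, R = {r:.0%}")  # A = 89%, R = 98%
```

This is why a site in a long scheduled downtime can still post a high reliability figure while its availability drops.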
WLCG Operations Coordination - Agendas

Monday 8th September

  • There will be a multi-core meeting on Tuesday 9th at 14:30 (CERN time), covering reviews of the UGE setup for multicore jobs at CCIN2P3 and of the method of passing job requirement arguments to batch systems via the CE. (Agenda)
  • A review of last week's ops meeting (minutes) follows:
  • No operations news
  • The WLCG repository will become signed soon.
  • Baselines: No new EMI/UMD releases since last meeting.
  • MW issues: Missing key usage extension in delegated proxy. Fix for CREAM UI in October. Impacts ATLAS-Rucio integration.
  • T1: FTS2 decommissioning done at 3 sites and in progress at another 3. NDGF-T1 is testing FAX using native xrootd and an nfs4 mount from dCache.
  • OSG following up on how to discover HTCondor CEs in the information system.
  • Oracle: GoldenGate migration fine for IN2P3.
  • T0: AFS UI still used; lxplus5 closure targeted for 14th Sept 2014. ARGUS - the unresponsive-CAs problem is believed to have been seen again.
  • T2: NTR
  • ALICE: Low activity. Job efficiencies issue still open.
  • ATLAS: Rucio test and normal DQ2 production activity are producing a slightly higher load on the storage of the sites.
  • CMS: Reminders - Target for CVMFS 2.1.19; update xrootd fallback configuration; add "Phedex Node Name" to site configuration.
  • LHCb: Mainly simulation work. SHA2 certificate testing started.
  • Network & transfer metrics: Meeting Monday 8th Sept. Slides. Pythia Network Diagnosis Infrastructure funded by NSF - perfSONAR-PS data to identify and localize network problems using the Pythia algorithms.
  • Tracking tools: NTR
  • FTS3 deployment TF: Done - FTS3 now in production. New releases every 3-4 months. There are lists for feature requests and also support. Some improvements to FTS dashboard.
  • glexec TF: NTR.
  • Machine/job features: New lead for the HTCondor part: Marian Zvada.
  • MW readiness: T0 pre-prod to install the package reporter. The latest CREAM-CE and BDII updates have been installed at LNL-T2. Next meeting 1st October.
  • "MW software for the verification activity" uses the package reporter results to aggregate per software component; it is used to tag good/bad versions and publishes the results in a dashboard.
  • Multicore: ATLAS 11 T1s and 35 T2s; CMS at T1s and some US T2s. Decided the TF would take on board the standardization of the blah scripts (and other CE scripts if needed) for the scheduling parameters.
  • SHA-2: Compliance being tested. Broadcast sent. Deadline 15th September. Switch SAM. Then expt. job and data systems. 88->55 tickets.
  • WMS decommissioning: Condor-based SAM probes due 1st October.
  • IPv6: NTR.
  • Squid mon & HTTP proxy discovery: Working on automated MRTG monitor. Working on documentation.

Tuesday 2nd September

  • The next WLCG ops coordination meeting is this Thursday 4th September.
  • There will be a Tier-1/2 feedback section in the agenda IF there is feedback/input. Do we have any items to raise?

Tier-1 - Status Page

Tuesday 9th September

  • There was a brief network interruption yesterday (Monday 8th Sep) to the Tier1 network at around 5pm local time. This lasted a few minutes and the cause is being investigated.
  • We are planning to stop access for all VOs apart from ALICE to our CREAM CEs. The proposed date is 23rd September.
Storage & Data Management - Agendas/Minutes

Wednesday 10th September

  • High load at Liverpool causing low throughput - how to throttle xrootd transfers? (And is the load necessary, or a bug?)
  • Still testing WebFTS
  • Prep for DPM workshop

Monday 1st September

  • FAX sites to update the C++ N2N rpms.
  • There is interest regarding issues/performance when placing storage outside firewalls. JC will shortly start a (closed) discussion/survey.

Monday 11th August

  • Pool nodes at RHUL have received test errors.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 9th September

Tuesday 2nd September

Tuesday 26th August

  • Sheffield has stopped publishing.
Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 9th September

  • Looking a bit better. Will review in more detail at the core ops meeting (next Thursday 18th @ 11:30am unless there is a clash).

Tuesday 2nd September

  • This work needs a kick-start! Reminders should now be arriving.
  • Tom/Andrew in discussion about options for main site - main considerations are Wordpress and Drupal.

Interoperation - EGI ops agendas

Tuesday 9th September

  • Mostly a short meeting to give updates on products over the summer.
  • Please read the agenda/minutes for the full set; a couple of things to pull out:
  • Note that as per http://dmc.web.cern.ch, gfal and lcg-util are in end-of-life mode and support for both will end on 1st November.
  • FTS3, SQUID and CVMFS will soon be included in UMD; early adopters are requested.
  • Next meeting planned for October 6th.

Monday 8th September

Monitoring - Links MyWLCG

Tuesday 2nd September

  • Monitoring consolidation meeting last Friday.
  • Squid monitoring TF meeting last Thursday
On-duty - Dashboard ROD rota

Tuesday 2nd September

  • Sussex is back in business - we kept closing their low-availability alarms, referencing the GGUS ticket.
  • The UCL ticket is now finally receiving some attention.
  • Ongoing problems at RAL.

Tuesday 26th August

  • RAL : Nagios jobs staying in the queue for a long time - to be investigated.
  • Sussex : Matt needs help probably from some SGE experts.
  • UCL : No acknowledgement from the site (ticket escalated to second level).
  • 100IT : There is an alarm from EGI federated cloud - this needs discussion.
  • Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.

Tuesday 12th August

  • Last week was quiet.
  • Still one or two responses needed for the next rota allocations.

Rollout Status WLCG Baseline

Tuesday 26th August

Monday 28th July


Security - Incident Procedure Policies Rota
  • FAX update

Monday 8th September

  • There was a security team meeting last Wednesday.
  • There was a CA TAG meeting also last Wednesday.

Monday 11th August

  • Topics as mentioned during the last GridPP technical meeting.
  • There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 9th September

  • RIPE probes now hosted: Cambridge, Sheffield, Liverpool, Lancaster (& Oxford and QMUL). Glasgow connected but no data.
  • RIPE probes not yet hosted: 6 sites.

Tuesday 2nd September

  • Only a few of the RIPE probes went live last week - any issues at the other sites to be discussed?
  • JANET is going to deploy a perfSONAR instance on one of the exchange points in London. They hope it will help raise awareness of issues with local systems affecting their transfer performance.

Tuesday 12th August

  • A reminder to update site status information in the IPv6 pages.
  • There is a new version (v3.4rc2) of perfSONAR being tested at QMUL [1]. Details here [2].
  • We will shortly review issues being picked up by perfSONAR and the steps to take when investigating.

Monday 8th September 2014, 15.00 BST
25 Open UK tickets this week.

As seen on TB-SUPPORT, the NGI has a ticket telling it to get sites to have the new voms servers configured for the switch over. Jeremy has kindly offered to field the ticket. I think we all have this in hand, but as I type this I realise I may have forgotten to set things up for the ops VO. I encourage everyone to double check their readiness ahead of next Monday's switchover. Assigned (8/9)

The RAL FTS2 service has been shut down for nearly a week now, so I suspect this ticket tracking the switch-off can be closed. In progress (3/9)

CMS having trouble running a "locateall" AAA test at RALPP (TBH I don't know what that is) - Chris has let them know that this is due to their xrootd reverse proxy being down, and it should be up and running in a day or two after it's reinstalled. In progress (8/9)

As mentioned last week, Sno+ have been having trouble as they can't assign software tags on Arc CEs, and they use these tags for things like black/white listing. There was some discussion on this in the ticket, but it fizzled out - I suspect due to the topic moving offline. Can it have an update please? In progress (27/8)

CMS transfer problems to Bristol. Winnie posted an update, mentioning she has applied a fix to their StoRM that might have fixed the problem. Maybe. She's asked if the problem still persists, as the monitoring links provided have all gone stale. Lukasz is on leave; can anyone CMS-savvy help her? Waiting for reply (8/9)

CMS Pilots losing contact with home base. No progress since Winnie noticed that the problem only seems to affect one of the Bristol clusters, but none expected due to leave. On Hold (8/9)

Update - Bristol have another, possibly related CMS ticket 108317

Maarten ticketed ECDF about their CEs not having the new VOMS servers configured. Andy is working on it. There's a reminder that, on top of adding the right configs, services do need restarting. In progress (5/9)
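As a reminder of what the new-VOMS configuration involves: a site's vomses file gains entries of the shape below, with matching .lsc files needed under /etc/grid-security/vomsdir/. These are the commonly circulated values for the new CERN servers; verify the hostnames, ports and DNs against the official broadcast before use.

```
# /etc/vomses fragment for the new CERN VOMS servers (example: atlas)
"atlas" "voms2.cern.ch" "15001" "/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch" "atlas"
"atlas" "lcg-voms2.cern.ch" "15001" "/DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch" "atlas"
```

And, as the ticket notes, services need restarting after the configuration change.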

glexec tarball ticket. There's a bit more movement on getting this done, but it's all on me to get the tarball glexec working still - naught the Edinburgh chaps can do.

Duncan noticed some interesting goings on on the Durham perfsonar page. The Durham chaps are talking to their networking team to figure out what the flip is going on. In progress (8/9)

Duncan's unwavering gaze also noticed a problem on Sheffield's perfsonar. Elena was tweaking it when it broke, and it looks like it's still broken, any luck fixing it Elena? In progress (26/8)

Liverpool got a ROD ticket when their CREAM CE got poorly. Steve worked his magic and things were fixed, but Gareth asks about the BDII tests still failing. Solved (8/9) Update - the problem seems to have disappeared, so was probably just an artifact of BDII lag.

My personal shame number 1: Lancaster's poor perfsonar performance. Despite a reinstall of the box, and no sign of a bottleneck in transfers or in manual tests, we still have really poor perfsonar results. No problems with the network have been found. Duncan helped formulate a plan at GridPP, but I haven't had the time to test it out yet. On hold (8/9)

95299(1/7/13) My personal shame number 2 - Lancaster's glexec deployment ticket. Some news in that I have something I'd like to test now - I just need to find time to test it, then see if I can package it somehow. On hold (8/9)

UCL's glexec deployment ticket. This work was pushed back to the end of August - any news on it? On Hold (29/7)

A ROD ticket for UCL APEL publishing errors. The APEL admins got involved and things are looking better now - although Gareth points out that there is some missing data from the spring. In progress (8/9)

Pointing VO_SNOPLUS_SNOLAB_CA_SW_DIR to /cvmfs/snoplus.gridpp.ac.uk. No news for a while on this after it was acknowledged - has the job fallen to the bottom of the stack? In progress (22/8) Solved now, issue was dealt with last week but the ticket wasn't updated.
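For reference, on a YAIM-configured site the requested change amounts to a one-line setting in site-info.def (the variable name and path come from the ticket; the file location varies by site):

```shell
# site-info.def fragment
VO_SNOPLUS_SNOLAB_CA_SW_DIR=/cvmfs/snoplus.gridpp.ac.uk
```

followed by re-running YAIM ("re-yaiming") on the affected services.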

Duncan ticketed QM about one of their perfsonar boxen - which Dan pointed out is their IPv6 perfsonar. So does that mean this ticket can be closed? In progress (4/9) Update - Duncan would like the ticket kept open to track this node's assimilation into the mesh.

Longstanding LHCb ticket with JET. No movement on this, but none was expected. Still, if anyone wants to heroically interject with some ideas I'm sure it would be appreciated. On hold (29/7)

As mentioned last week, Matt M of Sno+ fame has a user who only has access to srm tools and is having trouble accessing files at RAL. Brian has suggested using the webfts, but Matt doesn't think this will work for the user's limited abilities. Any thoughts? In progress (8/9)

Inconsistency between BDII- and SRM-reported storage capacity... hang on, haven't we been here before (105571)? It's not quite the same problem, but it's close. Brian has confirmed the mismatch; Maria has asked for an explanation for it (and how it only really affects ATLASHOTDISK). In progress (3/9)
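A check for this kind of mismatch can be sketched as follows; the function, token names and sizes are all hypothetical, purely to illustrate comparing the two views per space token:

```python
# Hypothetical consistency check (function, tokens and sizes invented):
# flag space tokens whose BDII-published capacity disagrees with what
# the SRM reports by more than a relative tolerance.

def find_mismatches(bdii, srm, tolerance=0.05):
    """Return tokens where the two capacity views differ by > tolerance."""
    bad = []
    for token, bdii_tb in bdii.items():
        srm_tb = srm.get(token)
        if srm_tb is None:
            continue  # token only known to one side; handle separately
        if abs(bdii_tb - srm_tb) / max(bdii_tb, srm_tb) > tolerance:
            bad.append(token)
    return bad

bdii_view = {"ATLASHOTDISK": 10.0, "ATLASDATADISK": 400.0}  # TB, invented
srm_view  = {"ATLASHOTDISK": 14.0, "ATLASDATADISK": 401.0}
print(find_mismatches(bdii_view, srm_view))  # ['ATLASHOTDISK']
```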

Checking the site firewall configuration for RAL's Vidyo router. Last update was in July, is the dialogue between the Vidyo team and the RAL networking chaps ongoing? On hold (1/7)

The Tier 1's version of 106325 - CMS pilots losing contact. This was waiting on the firewall expert getting back from hols to compare the settings between the Tier 1 and Tier 2 (who don't see this issue). Are they back yet? On Hold (14/8)

Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut around 2pm on 11th July at the Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr). The broker was put in downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in a TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected; the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now come back to normal.

Emir reported this on tools-admin mailing list "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
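The caching behaviour at the heart of this incident is easy to illustrate: with a long TTL, a top-level cache keeps serving an entry for a service long after the service has died. A toy sketch (not the actual BDII code):

```python
# Toy TTL cache (not BDII code) illustrating the incident above:
# a 4-day cache keeps publishing a dead service until the entry expires.

DAY = 86400  # seconds

class TTLCache:
    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # name -> (value, time it was cached)

    def put(self, name, value, now):
        self.store[name] = (value, now)

    def get(self, name, now):
        """Return the cached value unless it is older than the TTL."""
        entry = self.store.get(name)
        if entry is None:
            return None
        value, cached_at = entry
        if now - cached_at > self.ttl:
            del self.store[name]  # expired - drop the stale entry
            return None
        return value

cache = TTLCache(ttl=4 * DAY)
cache.put("mq.afroditi.hellasgrid.gr", "OK", now=0)

# The broker dies at t=0, but a lookup a day later still reports it up:
print(cache.get("mq.afroditi.hellasgrid.gr", now=1 * DAY))  # OK
# Only once the 4-day TTL has passed does the stale entry vanish:
print(cache.get("mq.afroditi.hellasgrid.gr", now=5 * DAY))  # None
```

A shorter TTL (as at Imperial) trades extra load on the information sources for much faster convergence after a failure.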

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.physics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, and was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in DIRAC which affected the gridpp VO and storage. It is now fixed and I was able to successfully upload a test file to Liverpool and register it with the DFC.
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync the user database from a VOMS server; haven't looked into this in detail yet.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30th June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • Ideally will configure federated access with the DFC as the LFC allows.

Monday 16 June 2014

  • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in the existing software area.
  • VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to the update.
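The symlink approach mentioned for SNO+ can be sketched as below: the legacy software directory becomes a symlink into CVMFS, so jobs still using the old path transparently pick up the CVMFS-hosted software. All paths here are invented for illustration:

```python
# Illustrative sketch (all paths invented): replace the legacy VO
# software directory with a symlink into the CVMFS repository, so jobs
# still using the old path find the CVMFS-hosted software.
import os
import tempfile

root = tempfile.mkdtemp()
cvmfs_dir = os.path.join(root, "cvmfs", "snoplus.gridpp.ac.uk")
legacy_dir = os.path.join(root, "opt", "exp_soft", "snoplus")

os.makedirs(cvmfs_dir)                    # stands in for the CVMFS mount
os.makedirs(os.path.dirname(legacy_dir))  # parent of the old software dir

os.symlink(cvmfs_dir, legacy_dir)         # the actual migration step

print(os.path.realpath(legacy_dir) == os.path.realpath(cvmfs_dir))  # True
```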

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Site Updates

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - MembersMinutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 10th September 2014

  • Operations report
  • CMS are now writing to the newer T10KD tapes and migration of CMS data from 'B' to 'D' tapes is underway.
  • Access to the Cream CEs will be withdrawn apart from leaving access for ALICE. This has been announced for Tuesday 30th September.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A