Operations Bulletin 061014

From GridPP Wiki

Bulletin archive

Week commencing 29th September 2014
Task Areas
General updates

Tuesday 30th September

Monday 22nd September

  • Operations Portal v3.0.5 is now online.
  • There were some reports of CMS overloading WNs... discussion needed?
  • There was an EGI OMB last week (agenda: minutes):
    • we have a few more days to feed back on the revised OLAs;
    • gstat has now got some support again;
    • feedback is requested on the EGI.eu SLA;
    • site certification (tagging and process) within the EGI federated cloud is being re-evaluated;
    • ARGUS central banning is being piloted, after which rollout across EGI is expected in early 2015;
    • SAM probe changes for ops have been agreed (e.g. removing the FTS2 tests);
    • Some products are still seeking early adopter sites (e.g. CREAM-LSF and CVMFS).
  • The next EGI OMB is on 30th October.
  • There is an EGI conference this week: "Challenges and Solutions for Big Data processing on Cloud".

Monday 15th September

Monday 8th September

  • Storage placement - survey TBC.
WLCG Operations Coordination - Agendas

Tuesday 23rd September

  • There was a WLCG ops coordination meeting last Thursday. (Agenda: Minutes).
  • Some highlights....
  • News: At the MB it was decided that WLCG Operations will investigate efficiency ideas in more detail and carry out a survey among sites to understand where effort is currently used.
  • Baselines: New BDII version (1.6.0) released in EMI-3, containing small fixes for GLUE2; new versions of the EMI-3 UI and WN are available, with new dependencies such as gfal2-util and ginfo included and a new bouncycastle-mail version dependency for SHA-2 VOMS; Frontier/Squid 2.7.STABLE9-19, with enhancements and bug fixes.
  • MW issues: There is an issue in the info-provider for Storm.
  • Oracle: migration to GoldenGate is progressing well at CERN.
  • T0: Investigating use cases for AFS-UI; from October will use EGI WMS for SAM tests; plus5:batch5 still used by users to build binaries under SLC5; KB article describing how to set up a private VM under Openstack with all the relevant RPMs.
  • T2: gfal packages will not be removed from the repositories. WN version 3.1.0 including gfal2 will be declared baseline as soon as it is released also in UMD.
  • ALICE: investigation of job failure rates and inefficiencies.
  • CMS: Bad incident with xrootd fallback test on Sep 16. Want to close 'unknowns' in the storage accounting. Check out the AAA page.
  • LHCb: NTR
  • glexec: Minor site developments. ATLAS: the pilot is now able to use glExec if it is found to be working; if not, it falls back to traditional mode. LHCb implemented glExec support a long time ago but hasn't started pushing to use it on the sites yet.
  • Machine/job features: NTR
  • Middleware readiness: Next meeting on 6th October.
  • Multicore: Publishing multicore accounting to APEL works.
  • SHA-2: AM prod machines use the new servers since Tue Sep 16. Checking experiment systems.
  • WMS decomm: Deployment of the Condor-based SAM probes planned on Wed 1st of October 2014
  • IPv6: NTR
  • Squid/HTTP: NTR
  • Network & transfer metrics: Good participation in first meeting. An initial overview of the current status in the network and transfer metrics was presented and a list of topics and tasks to work on in the short-term was proposed.

Tuesday 16th September

  • The next core ops meeting is on 18th September.
  • The next multi-core meeting is today at 14:30 CERN time. It is on dynamic partitioning with LSF at CNAF.

Tier-1 - Status Page

Tuesday 30th September

  • We are planning to stop access for all VOs apart from ALICE to our CREAM CEs today.
  • There were some problems with the switch of the Atlas Frontier system to the new database last week - and that change was reverted.
  • We have declared an 'At Risk' on the site for a couple of hours tomorrow morning for the next quarterly UPS/generator load test.
Storage & Data Management - Agendas/Minutes

Wedn 01 Oct

  • Summary of all the exciting events in Amsterdam last week - EUDAT, EGI big data, RDA
  • DPM 1.8.9 early testing, and (separately) xroot4 early-ish testing. Supporting multiple VOs in one xroot server.

Wedn 17 Sept

  • iRODS - what it is and why it should choose to collapse on Betelgeuse 7.
  • Technical problems with Vidyo

Wedn 10 Sept.

  • High load at L'pool causing low throughput - how to throttle xroot transfers (and is the load necessary or a bug?)
  • Still testing WebFTS
  • Prep for DPM workshop

Monday 1st September

  • FAX sites to update the C++ N2N rpms.
  • There is interest regarding issues/performance when placing storage outside firewalls. JC will shortly start a (closed) discussion/survey.

Monday 11th August

  • Pool nodes at RHUL have received test errors.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 30th September

  • Slight delay for Birmingham and Sussex.

Tuesday 23rd September

  • Slight APEL delay for Birmingham.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 9th September

  • Looking a bit better. Will review in more detail at the core ops meeting (next Thursday 18th at 11:30am unless there is a clash).

Tuesday 2nd September

  • This work needs a kick-start! Reminders should now be being received.
  • Tom/Andrew in discussion about options for main site - main considerations are Wordpress and Drupal.

Interoperation - EGI ops agendas

Tuesday 9th September

    • Mostly a short meeting to give updates on product updates over the summer.
    • Please read the agenda/minutes for a full set but to pull out a couple of things:
    • Note that as per http://dmc.web.cern.ch, gfal and lcg-util are in end-of-life mode and support will end for both on 1st November.
    • FTS3, SQUID and CVMFS will soon be included in UMD; early adopters are requested.
  • Next meeting planned for October 6th.

Monday 8th September

Monitoring - Links MyWLCG

Tuesday 30th September

  • Monitoring meeting last Friday, link: https://indico.cern.ch/event/341748/, minutes: https://indico.cern.ch/event/341748/material/minutes/minutes.html
    • Of note, we had identified a couple of differences between SAM2/3 where entries were appearing in SAM3 which had not been in SAM2. This is because they were being picked up from the vofeed - from the minutes, "Pablo and Maarten confirm that the VOfeed is the only authoritative source of topology; agreed that all services in the VOfeed will be tested and their availability will be calculated; agreed to add a new attribute to VOfeed to flag which services should be excluded from the official reports."

Tuesday 23rd September

  • Next (replacement) meeting this Friday to continue discussions.

Tuesday 16 September

  • SAM Nagios probes refactoring TF meeting

We had the first SAM Nagios probe refactoring TF meeting on 12 September. Some of the identified issues are listed in the TF wiki.

We need the operations team's opinion on some of the issues:

    • Removing LFC tests: everyone agreed in the meeting that they should be removed.
    • Use of WMS for SAM test
    • Testing SRM client tools from WN

Tuesday 2nd September

  • Monitoring consolidation meeting last Friday
  • Squid monitoring TF meeting last Thursday
On-duty - Dashboard ROD rota

Monday 29th September

  • Rota being updated.

Monday 22nd September

  • Quiet week - little to report.
  • EGI is looking for people to join the ops portal review and testing TF.

Tuesday 2nd September

  • Sussex is back in business - kept closing their low availability alarm wrt the GGUS ticket.
  • The UCL ticket is now finally receiving some attention.
  • Ongoing problems at RAL.

Tuesday 26th August

  • RAL : Nagios jobs staying in the queue for a long time - to be investigated.
  • Sussex : Matt needs help, probably from some SGE experts.
  • UCL : No acknowledgement from the site (ticket escalated to second level).
  • 100IT : There is an alarm from EGI federated cloud - this needs discussion.
  • Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.

Tuesday 12th August

  • Last week was quiet.
  • Still one or two responses needed for next rota allocations.

Rollout Status WLCG Baseline

Tuesday 26th August

Monday 28th July


Security - Incident Procedure Policies Rota

Tuesday 30th September

  • Shellshock - advice and follow-up.
  • Note particularly the advisories from WLCG/EGI.

Tuesday 23rd September

  • High priority vs critical tests in pakiti.

  • FAX update

Monday 8th September

  • There was a security team meeting last Wednesday.
  • There was a CA TAG meeting also last Wednesday.

Monday 11th August

  • Topics as mentioned during the last GridPP technical meeting.
  • There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 23rd September

Tuesday 16th September

Tuesday 9th September

  • RIPE probes now hosted: Cambridge, Sheffield, Liverpool, Lancaster (& Oxford and QMUL). Glasgow connected but no data.
  • RIPE probes not yet hosted: 6 sites.

Monday 29th of September 2014, 15.00 BST
24 Open UK tickets this week.

Inconsistent BDII/SRM reported storage numbers for ATLASHOTDISK. No news on this ticket for quite a while. In Progress (3/9)

Not quite a RAL ticket: Sno+ are asking how to accommodate users who only have access to srm tools to access data. Nothing since Henry's helpful input on the 10th. Anyone with any ideas? I've mentioned the UI tarball but I think that's clutching at the weediest of straws. In progress (29/9)

Sno+ having trouble running jobs.
Sheffield: 108716(23/9)
Lancaster: 108715(23/9)
Matt M reports that Sno+ are having troubles running at Lancaster and Sheffield - it looks like we could be seeing the same problem, I'll let you (Elena) know what I find out. Both In Progress.

100IT not running VMCatcher at their site. There was some trouble creating an AppDB profile, but this was solved (108548). No news on this ticket since then. In progress (17/9)

Sussex has a BDII nagios check ticket that has escalated - can you please give it an update and if you're stuck let us know - Lancaster got a similar ticket on Friday and these issues are annoying to tackle. In Progress (29/9)

Just a warning that this ATLAS data transfer ticket has been re-opened on you, with a new set of transfers detected. Reopened (29/9) (It could well be that 108856 is a duplicate of this ticket.)

Tools - MyEGI Nagios

Tuesday 16th Sep

The multi-VO Nagios maintained at Oxford has been upgraded to add ARC CE tests.

It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca and vo.southgrid.ac.uk.

Should we start monitoring it more actively and open tickets for sites failing tests?

Monday 14th July

Winnie reported on Saturday 12th July that most of the UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting the EGI message broker (mq.afroditi.hellasgrid.gr) at around 2PM on 11th July. The message broker was put in downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB support thread that the default caching time is now 4 days. When I checked on Monday morning, only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected, and the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be taken into account when calculating availability/reliability. All Nagios tests have now come back to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
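
As a rough illustration of why the cache time mattered in this incident, the following Python sketch estimates how long a service that disappeared at about 2PM on 11th July could still be published under different cache times. It is a simplification under stated assumptions: the function name `stale_until` is hypothetical, the 6-hour value is an invented example of a "much shorter" cache, and it assumes the entry was last refreshed at the moment the outage began. It does not model how a top BDII actually manages its cache.

```python
from datetime import datetime, timedelta


def stale_until(last_seen: datetime, cache_time: timedelta) -> datetime:
    """Return the time until which a cached entry would still be published.

    Simplified model: the entry is kept for exactly `cache_time` after it
    was last seen by the top-level information system.
    """
    return last_seen + cache_time


# Approximate start of the outage, taken from the report above.
outage_start = datetime(2014, 7, 11, 14, 0)

# Cache times to compare; only the first two figures come from the report.
cache_settings = {
    "4 days (reported default)": timedelta(days=4),
    "24 hours (rule quoted by Emir)": timedelta(hours=24),
    "6 hours (hypothetical short cache)": timedelta(hours=6),
}

expiry_times = {
    label: stale_until(outage_start, cache_time)
    for label, cache_time in cache_settings.items()
}

for label, expiry in expiry_times.items():
    print(f"{label}: could be published until about {expiry:%d %b %Y %H:%M}")
```

Under this model, a site with the 4-day default could still be publishing the broker on the following Monday morning, which matches the observation that a manual restart was needed at one site.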

VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.

Monday 16 June 2014

    • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software.
  • VOMS server: Snoplus has problems with some of the VOMS servers - see ggus 106243 - may be related to update.

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Site Updates

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 1st October 2014

  • Operations report
  • It was planned to withdraw access to the CREAM CEs for all VOs apart from ALICE yesterday (30th September). However, this has been delayed.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A