Operations Bulletin Latest

From GridPP Wiki

Bulletin archive

Week commencing 3rd November 2014
Task Areas
General updates

Tuesday 28th October

Tuesday 21st October

  • There is now a schedule of pre-GDB meetings to be aware of in coming months:
    • November 11: volunteer computing
    • December 8-9: a 2-day pre-GDB on Condor, looking at feedback from sites that recently migrated to Condor, with some OSG Condor experts and developers. This is an outcome of HEPiX discussions, which prompted wishes to share experiences and discuss advanced features and needs such as multi-core job support, containers and cloud support.
    • January 13: pre-GDB on data management focused on data preservation impacts on sites and reducing the protocol zoo.
    • A post-GDB ARGUS developer meeting will take place on 11 December.
  • The following ROC profile SAM tests have been removed: ch.cern.FTS-ChannelList; ch.cern.FTS-InfoSites and org.gstat.SanityCheck.
  • ATLAS FAX monitoring configurations need updating for EU privacy law compliance (see TB-S email on 30th Sept).
  • A power cut at CERN on 16th October has had ongoing intermittent impacts on computer services.
  • David has pruned a version of the blogs aggregator with a focus on our technical blog posts.
  • Please take a look through the uploaded 55 HEPiX talks for items of interest!
  • Alex Owen mentioned a call for FLOSSUK papers for the Spring 2015 conference in York.

Monday 13th October

WLCG Operations Coordination - Agendas

Tuesday 28th October

Tuesday 21st October

  • There was a WLCG ops coordination meeting last Thursday 16th October. Minutes are available. The brief summary:
  • News: A survey to investigate operational efficiency areas is being prepared - it covers managing changes, communication channels, monitoring, grid service administration. Expect it to be ready for public input soon.
  • MW news: dCache 2.6.35 and 2.10.7 have been released - update is a priority. perfSONAR 3.4 has many fixes - SSLv3.0 vulnerability TBC - but baseline stays at 3.3.4 until docs are ready. CESNET will co-maintain the classads package needed by CREAM, WMS, L&B, UI, WN (it was removed from EPEL).
  • T0&T1 services: FTS and dCache upgrades.
  • Oracle: DB migrations timetable is work in progress.
  • T0: VOMS-ADMIN test cluster available again. Here is a link. lxplus5 now off. Job efficiency differences remain unexplained. High Availability FTS3 being discussed. Quattor End of Service (EoS) now end November.
  • T1: NTR
  • T2: NTR
  • ALICE: Improved job eff at CERN. Operational instability noted on 15th Oct.
  • ATLAS: Full Chain test considered successful. Reco campaign finished. Had 2 FTS3 problems.
  • CMS: PromptRECO will need Tier-1 sites in addition to CERN. Still need SLS and AFS-UI after Oct. Encourage dCache update.
  • LHCb: Proposes to re-visit the list of critical services before Run2. Setting up a prototype installation for http access and access federation where 70 % of the storage endpoints are accessible.
  • gLExec: NTR
  • Machine/Job features: Concluded on a single architecture for cloud and batch implementations.
  • MW readiness: Last met 1st Oct. Verification process now well documented. More volunteer sites needed. DB designed to hold verification results. HTCondor testing request from ATLAS. MW Package Reporter designs discussed. Next meeting 19th November.
  • Multicore: CMS started testing.
  • SHA-2: Ops and ALICE VOMS opened last Monday. LHCb TBC. CMS reminded. ATLAS tests progressing.
  • IPv6: All T1s should join HEPiX IPv6 WG. Dual-stack SAM to be put in place. Deploy PS in dualstack in 2015. OSG infrastructure ready.
  • Squid and HTTP proxy: Squid registration campaign pushed to November.
  • Network & transfer metrics WG: Request for sites to await 3.4 instructions. Internal security audit performed. Recommend all sites running 3.3 to temporarily disable SSL3 - (as at 16th Oct) patches coming.
  • Other: Vidyo ticket opened.

Tuesday 14th October

Tier-1 - Status Page

Tuesday 28th October

  • There were problems last week with the ARC-CEs. Three ARC-CEs have been re-installed. An additional ARC-CE, arc-ce05 was also installed and put online, though we may remove this machine once the current crisis is resolved.
  • Oracle patches were applied to the pluto standby database yesterday and the production pluto database will be patched tomorrow.

Tuesday 21st October

  • The intervention to put our Castor Oracle database configuration to its 'normal' state was completed OK last Tuesday. This was necessary following problems with one of the disk arrays hosting the primary database.
  • We are having problems with our ARC CEs at the moment.
  • The FTS3 service was updated yesterday, Monday 20th Oct. (To version 3.2.29-1)
Storage & Data Management - Agendas/Minutes

Wedn 29 Oct

Wedn 22 Oct

  • Martin Bly (T1) - HEPiX report

Wedn 15 Oct

Wedn 01 Oct

  • Summary of all the exciting events in Amsterdam last week - EUDAT, EGI big data, RDA
  • DPM 1.8.9 early testing, and (separately) xroot4 early-ish testing. Supporting multiple VOs in one xroot server.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Monday 27th October

  • Sites considering moving to HTCondor should be aware that prototype APEL parsers for HTCondor are in use, so if you continue using CREAM as your CE you can continue to use APEL accounting. The previous Condor parser for APEL was retired in EMI3 as there was no demand.

Tuesday 21st October

  • Sussex publishing of accounting records looks to have stopped a few weeks ago.

Tuesday 7th October

  • GridPP metrics need updating for CMS. Any comments on the metrics page at the moment?
  • APEL issues for Birmingham and Sussex, and the portal appears to stop at 1st October (being followed-up).

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 28th October

Tuesday 7th October

  • Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
  • Main GridPP website expected to use Wordpress with a plug-in to cover the gridsite aspects.

Interoperation - EGI ops agendas

Tuesday 21st October

    • URT:
    • dCache server v. 2.6.35 verified by WLCG as baseline
    • DPM 1.8.9 in EPEL-testing
    • SR: If sites have been using/testing EMI-WN 3.1.0 please get in touch to help with verification. They seem keen for people to test this.
    • New VOMS servers rollout: NGI SAMs being notified for reconfiguration as of yesterday.
    • MySQL 5.0 EOL campaign: note progress in agenda.

Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

  • EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
  • MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
    • See agenda for guidance on middleware consequences
  • classads "retired" from EPEL repos
  • SL/SLC/CentOS 5 Support Lifetime
    • This was highlighted, though not suggested to be urgent
Monitoring - Links MyWLCG

Tuesday 4/11

  • Next meeting 14th November: Status of SAM3 rollout (proposed topic)
On-duty - Dashboard ROD rota

Tuesday 28th October

  • AM reports a quiet shift. Dashboard not catching up earlier in the week but ok later on.

Tuesday 21st October

  • Possible issue with slow updates to GGUS tickets as viewed via ROD Dashboard (or is it a RAL cache issue?)
  • EFDA - Recurring alarms owing to site availability. Just waiting for the bad period to age out of the 30-day window.
  • SUSSEX - Ongoing problem. Matt has a ticket open with middleware developers.
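The EFDA item above relies on a bad period "ageing out" of a 30-day window. A minimal sketch of that arithmetic follows, assuming a simple trailing window of daily availability fractions; the function names and the sample data are hypothetical, and the dashboard's real algorithm may differ.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # window length quoted in the bulletin item


def window_availability(daily, end_day, window_days=WINDOW_DAYS):
    """Mean of the daily availability fractions in the window ending on end_day.

    `daily` maps a date to a fraction between 0.0 and 1.0 (hypothetical format).
    Days with no entry are skipped rather than counted as down.
    """
    days = [end_day - timedelta(days=i) for i in range(window_days)]
    values = [daily[d] for d in days if d in daily]
    return sum(values) / len(values) if values else None


def first_clear_day(last_bad_day, window_days=WINDOW_DAYS):
    """First day whose trailing window no longer contains last_bad_day."""
    return last_bad_day + timedelta(days=window_days)


# Hypothetical data: 60 fully available days with three bad days in early October.
daily = {date(2014, 9, 20) + timedelta(days=i): 1.0 for i in range(60)}
for bad in (date(2014, 10, 1), date(2014, 10, 2), date(2014, 10, 3)):
    daily[bad] = 0.0

clear_day = first_clear_day(date(2014, 10, 3))
```

With this model, the window figure only returns to its clean value on `clear_day`, which is why nothing can be done in the meantime except wait.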

Rollout Status WLCG Baseline

Tuesday 26th August

Monday 28th July


Security - Incident Procedure Policies Rota

Tuesday 28th October

  • Note EGI-ADV-2014-10-28.

Tuesday 21st October

  • The IGTF has an update which introduced rather unexpected changes in the trust anchors used by Comodo for the TCS. There is now an additional set of SHA-2 intermediate CAs in addition to the old ones.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 28th October

  • Have the perfSONAR 3.4 instructions/documentation been updated yet? Last week volunteers were sought at the...
  • perfSONAR operations meeting took place on 22nd October.
    • There is a recommendation for sites supporting IPv6 to deploy perfSONAR dual-stack.
    • Concerned about Tier-3s requesting to join Tier-2 meshes.
    • A network transfer metrics wiki page is available.

Tuesday 21st October

Monday 13th October

  • perfSONAR 3.4 has been released. Clear documentation on what to do (clean reinstall) coming this week together with information on mesh updates. See the GDB presentation slides 13 and 14.
  • RIPE have sent a reminder to connect probes that have been handed out (some weeks ago now). Please could the following sites check their status: Lancaster; Brunel; Sussex; and ECDF. 20599 at RAL has never properly connected (DHCP issue?).

Monday 3rd of November 2014, 14.45 GMT
26 Open UK tickets this month.

Sussex publishing "all the 4s" (bdii bingo!) for their waiting jobs. Matt RB has a ticket in with the developers over these problems (109263), although he has bravely said that he might try to tackle the problem himself...and it looks like lcg-infosites returns a sensible number now. On Hold (can be closed?) (23/10)

Cross-referenced with the above ticket, looking at the last few updates it looks like Matt RB released a spooky Hallow'een patch, and now they look to be green. Another ticket that can be closed? On hold (31/10)

CMS pilots losing connection at Bristol. No news for a while, it looks to me like Bristol are still in downtime though? This has been a tough issue to debug. On hold (14/10)

Someone at ATLAS was trying to raise the dead at Glasgow over Hallow'een, although rather than zombies it was long lost files. It appears that despite these files being declared lost last summer the deletion/recovery ritual hadn't been completed. UK cloud support are on the case. In progress (3/11)

Tarball glexec ticket. On Hold (29/8)

Durham's perfsonar results going "proper weird" suddenly. The local networking team were on the case, but the perfsonar got offlined from fear of shellshock and there has been no news since (is it alright to reinstall perfsonar yet?). On hold (6/10)

Sno+ asking for their VO_SW_DIR to point to cvmfs. Elena rolled this out, but sadly the ticket was reopened due to some job failures accessing cvmfs, and a few holdouts still with the wrong environment variable (Matt M threw in some CE errors he was seeing too, but he was very apologetic about it). Elena's investigating. In progress (30/10) Update - Catalin posted a reminder of the new cvmfs-keys release (1.5-1), and suggested moving snoplus' cvmfs area to the egi.eu domain - /cvmfs/snoplus.egi.eu

Atlas have been seeing transfer problems, although it looks like these failures have mutated since the ticket was opened (checksum errors to srm type errors by the looks of it). Alessandra is on the case. In progress (3/11)

Getting Sno+ jobs running at Lancaster. It looks like everything is in place, just waiting for Sno+ to confirm (or give us a list of errors!). Waiting for reply (30/10)

Tarball glexec ticket... no news other than my last attempt a few weeks ago failed (not as simple as I hoped). On hold (8/9)

Poor Perfsonar Performance. Has hit a bit of a roadblock with both perfsonar boxes being switched off for the last month... have I missed an announcement saying that the latest perfsonar release is ready? On hold (31/10)

UCL's glexec ticket. Ben hit a snag installing this mid-October, no news since then after some feedback from Maarten. In progress (14/10)

LHCB having cvmfs trouble at IC, which was likely caused by a batch of naughty CMS jobs ruining it for everyone else. LHCB re-enabled IC to see if things were back on track, no news since. Waiting for reply (24/10)

Ops "availability" test failures at Jet. The cause of the alarms is known (Jet had a certificate problem on a few hosts). Just waiting for alarm to clear now. On Hold (28/10)

The case of the mysterious lhcb failures at Jet. No progress, none expected really though. On hold (1/10)

AFAICS this ticket now distills down to "Getting vmcatcher working at 100IT". Things seem to be progressing well, although the 100IT chaps aren't very good at setting their ticket statuses correctly! In progress (28/10)

Ticket listing the requirements for a cloud site. All three actions have been completed, but there is a question over the state of the 100IT site BDII. In progress (30/10)

La Grada Uno
CMS are seeing glexec errors ("status 203") at the Tier 1. Looks to be caused by a lack of wildcard mapping, only just coming to light with the recent CMS analysis jobs coming into the site. Andrew L is on it like a scotch bonnet. Or just on it. (29/10)

Matt M from Sno+ has noticed gfal-copy errors when trying to access the Tier 1 using those tools. He's not sure if this is a problem with the Tier 1 or the tools themselves (or even his setup), Duncan is already helping him out. In progress (3/11)

(possibly related) Sno+ "srmcp failures" for a bunch of SUSY users. Some great input on how to get the tools working from Duncan and Chris, but no word since. My suspicion is Matt is waiting to hear back from this user group. Maybe their mail clients don't work under SUSE either? In progress (21/10)

The Tier 1 version of the Bristol CMS pilots losing connection ticket. On hold after exhausting all ideas. On hold (13/10)

Submissions to the RAL FTS3 "REST" interface failing for some reason - AIUI thought to be a problem with the CRLs and apache. After some advice the system has been tweaked, and is in the waiting-to-see-if-that-fixed-it stage. On hold (3/11)

CMS AAA access tests failing at RAL. Reading down the ticket it looks to be a cms redirector problem at RAL... or something... Andrew has been working to fix things, adding another redirector and other tweaks. Andrew has asked the xrootd experts (cc'd?) why the behaviour they are seeing is occurring (and also notes some references to RALPP slipping into the Tier 1 discussion). Waiting for reply (27/10)

T2K noticed the LFC denying the existence of the new user. The problem seems to have gone away from the T2K side, but Catalin has spotted a potential problem and asked for some voms-proxy-info output. Waiting for reply (28/10)

Atlas have noticed a lot of lost job heartbeats over the last day, the Tier One guys are on it. In progress (3/11)

Inconsistent BDII/SRM numbers. Looks to be a problem with how castor reports read-only disk servers, Brian has put in a request to the Castor team for information on this. On hold (3/11)

Tools - MyEGI Nagios

Tuesday 21st October

VOMS servers for OPS based at CERN were down on Saturday 18th October for around 12 hours. Nagios tests started failing after the existing proxy expired. Availability figures will be slightly affected, but the outage will be considered as unknown.
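To illustrate why classifying the outage as "unknown" matters, here is a minimal sketch of the arithmetic. It assumes availability is computed as up time divided by known time (total minus unknown); that formula and the figures below are assumptions for illustration, not taken from the bulletin.

```python
def availability(total_hours, up_hours, unknown_hours=0.0):
    """Fraction of *known* time during which the service was up.

    Assumption: hours classed as unknown are removed from the denominator
    instead of being counted as down time.
    """
    known_hours = total_hours - unknown_hours
    if known_hours <= 0:
        return None  # nothing measurable in this period
    return up_hours / known_hours


# Hypothetical 30-day month (720 hours) containing one 12-hour gap in test results.
counted_as_down = availability(720, 708)
counted_as_unknown = availability(720, 708, unknown_hours=12)
```

Under this assumption the 12 hours lower the figure only when they are counted as down time; marked as unknown they drop out of the calculation altogether.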

Blog about VO Nagios


Tuesday 16th Sep

  • Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
  • https://vo-nagios.physics.ox.ac.uk/nagios/
  • It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
  • Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Monday 3rd November 2014

  • Please update cvmfs-keys and VO_<VONAME>_SW_DIR
  • Working with SNO+ and Mark Slater on ganga job submission rate

Thursday 23 October 2014

  • CVMFS keys - new cvmfs-keys package cvmfs-keys-1.5
    • Part of decoupling of CVMFS from CERN - support for keys from various repositories
    • <voname>.gridpp.ac.uk -> <voname>.egi.eu
    • Please update and change VO_<VONAME>_SW_DIR to point to new directory
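The rename above can be sketched as follows. This is a minimal illustration, assuming the usual convention that the variable name is the VO name upper-cased with non-alphanumeric characters replaced by underscores; the helper names are hypothetical, and the repository name should always be confirmed with the VO rather than derived automatically.

```python
import re

OLD_SUFFIX = ".gridpp.ac.uk"
NEW_SUFFIX = ".egi.eu"


def sw_dir_variable(vo_name):
    """Build the VO_<VONAME>_SW_DIR variable name for a VO.

    Assumed convention: upper-case the VO name and replace every
    non-alphanumeric character with an underscore.
    """
    return "VO_" + re.sub(r"[^A-Za-z0-9]", "_", vo_name).upper() + "_SW_DIR"


def migrate_repo(repo_name):
    """Rewrite <voname>.gridpp.ac.uk to <voname>.egi.eu; leave other names alone."""
    if repo_name.endswith(OLD_SUFFIX):
        return repo_name[: -len(OLD_SUFFIX)] + NEW_SUFFIX
    return repo_name


def sw_dir_path(repo_name):
    """CVMFS mount point for a repository after migration."""
    return "/cvmfs/" + migrate_repo(repo_name)


# Example using the SNO+ names that appear elsewhere in this bulletin.
variable = sw_dir_variable("snoplus.snolab.ca")
path = sw_dir_path("snoplus.gridpp.ac.uk")
```

A site could use something like this to generate the per-VO environment settings in one place, so that no worker node is left exporting the old directory.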

Monday 11th August

  • Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30th June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.

Site Updates

Tuesday 21st October

  • High loads seen in xroot by several sites: Liverpool and RALT1... and also Bristol (see Luke's TB-S email on 16/10 for questions about changes to help).

Tuesday 9th September

  • Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 22nd October 2014

  • Operations report
  • There are ongoing problems with the ARC CEs.
  • The production FTS3 service was upgraded to version 3.2.29-1 on Monday (20th Oct).
  • The OGMA database system (Atlas3D/Frontier) has been updated and switched to using Oracle GoldenGate for updates from CERN.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A