Operations Bulletin 280714

Bulletin archive

Week commencing 21st July 2014
Task Areas
General updates

Tuesday 22nd July

  • The RAL FTS2 is due to be switched off on 2nd September.
  • HyperK enablement request still stands.
  • The biweekly WLCG status report from yesterday's ops meeting is available here.
  • There was a general reminder last week about putting too much detail into messages that go to our public email lists. Please remember the Traffic Light Protocol!
  • There will be an IPv6 session at GridPP33. JANET will participate. Pete C and G are taking ideas for talks.
  • There was a GridPP technical meeting on Friday.
  • The final WLCG T2 availability and reliability figures for June 2014 are now available.
  • There was a first meeting last Wednesday of the HEP Software Foundation. The first step is to prepare a "call for volunteers" who can devote the time in the coming months to lead the work that has to be done.

Monday 14th July

  • Workshop - CVMFS monitoring feedback
  • ATLAS DC14 13TeV simulation starting - note Alessandra's recommendation regarding Nikhef scripts and multicore running (for Torque/Maui sites).
  • topBDII caching and errors
  • ILC VOMS changes
  • Sites with ARC CEs who want to support LHCb need to make a few configuration changes. This is to ensure that there is an environment variable available to jobs which specifies the name of the queue.
  • EGI A/R report for June
  • Did anyone else see kernel problems like Liverpool (see blog)?
  • Large numbers of biomed jobs have been impacting various sites. Is setting MaxTotalJobs the answer? Do we need follow-up with the VO?
  • HyperK can now make use of additional resources and a general request for enablement was circulated. It has been confirmed that they only need disk at QMUL.
WLCG Operations Coordination - Agendas

Tuesday 21st July

  • The next coordination meeting takes place this Thursday at 14:30 UK time. There is now a standing item for sites to raise issues of concern. Is there anything we would like to mention this week? We are invited to update the twiki up until 1 hour before the meeting.

Tuesday 14th July

Tuesday 1st July

Monday 23rd June

  • Minutes from last Thursday's meeting. Highlights....
  • A page is available listing current known middleware issues affecting WLCG.
  • Baselines: Storm 1.11.4 released in EMI containing several bug fixes. Baseline update with UMD release.
  • 3 issues affected some sites after the latest EMI update of CREAM and LB. The problems are under investigation by the PTs.
  • CVMFS: Starting from July, sites not compliant with the 2.1.19 version will be notified with a GGUS ticket (noted that the upgrade just requires an update of the RPM and a restart of CVMFS).
  • T0: The OPS VO now runs in voms-admin instead of VOMRS, after the migration done on June 17th
  • Tier-1/Tier-2 feedback: NTR!
  • ALICE: successful campaign for users to move away from old ROOT versions. T0 job efficiency issues ongoing.
  • ATLAS: DC14 expected to start in approximately 2 weeks from now. Panda/Jedi is now fully ready for user analysis.
  • CMS: Started to remove individual release tags from CEs. After the introduction of disk/tape separation at the T1 sites, CMS now must site readiness measures for T1 sites
  • LHCb: Recommend CVMFS 2.1.19. General request: ensure that downtimes, including unscheduled outages, accurately reflect the specific services which are unavailable.
  • FTS3: Monitoring the auto-tuning algorithm closely and adjusting various monitoring tools of FTS3.
  • glexec: 10 sites have yet to enable it. ARGUS instabilities being investigated.
  • Machine/job features: PBS/torque and LSF implemented. SLURM pending. SGE and HTCondor in progress.
  • MW readiness: ATLAS and CMS DPM setups in progress. Monitoring prototype being deployed at test sites.
  • Multicore: CMS stable flow. Gathering reports for July workshop. ATLAS MC jobs on-hold pending new software release.
  • SHA-2: New VOMS fix for CERN instances requires sites to update ARGUS, UI, CREAM and WN instances.
  • WMS decommissioning: Progress with SAM Condor validation. ARC-CE WN tests failing for some CMS sites (incl. Imperial).
  • IPv6: NTR
  • HTTP proxy discovery: Task overview table updated.
  • Network and transfers metrics: Mesh leaders developed. Kick off in July.
  • AOB: OSG plan to migrate to HTCondor CEs by October.

Tier-1 - Status Page

Tuesday 22nd July

  • All Castor instances have been upgraded to version 2.1.14. The upgrade is complete apart from turning off a compatibility mode on the nameserver component, which will be done shortly.
  • We have announced that we will shut down the FTS2 service on the 2nd September.
  • The software server used by the small VOs will be withdrawn from service. We are planning to do this on the 2nd September. Moving VOs to use CVMFS has been progressing well.
  • There was a 'warning' (At Risk) on the Tier1 for a network routing change in the core RAL network.
Storage & Data Management - Agendas/Minutes

Wednesday 23 July 2014

  • We really should try to document our VO policy: the stuff in people's heads, experiences with "small" VOs (that tend to grow bigger), best practices. Also, much of the wiki needs reviewing. "Boring" old documentation - hey ho!

Tuesday 22nd July

  • Alert today advising to update DPM, mainly to get the new (bug fixed)


Wednesday 2 July

  • Guidance and policies for "small" VOs: how to get them started with stuff, without preventing them later growing bigger.

Tuesday 1st July

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 22nd July

  • EGI accounting portal does not show any significant outages in publishing in recent days.

Tuesday 1st July

  • There are no SL6 HS06 entries in our wiki for UCL and EFDA.
  • Are there any observations from the latest GridPP metrics tables? (Does anything need addressing or correcting?).
  • APEL is not up-to-date for: RHUL, Manchester and Durham.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 22nd July

  • Starting on revisions this week.
  • Is the alert system now working?

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

Interoperation - EGI ops agendas

Tuesday 14th July

  • Last meeting yesterday.
  • URT: see agenda for details
  • SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1 cream v. 1.16.3; ready to be released: storm v. 1.11.4 lb v. 11.1 wms v. 3.6.5 dcache v. 2.6.28
  • DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
  • Migration of Central SAM services: Note that if the services are being reinstalled, make sure that patches are applied.
  • EMI-2/APEL-2 - Looks like UCL is still publishing with APEL-2 publisher
  • Hoped that gr.net issues resolved on Monday. Summary of discussion to be in minutes.
  • Next meeting placeholder 28th July, but may not happen (OMD depending)
  • Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

  • Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
  • EMI-2 decommissioning: The situation is being followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today - Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points" (source PROC16).
  • There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients.

Monitoring - Links MyWLCG

Tuesday 22nd July

  • Chris noted yesterday that gstat reports most sites as critical. SB thought the underlying problem is that a value is supposed to be "Production" in GLUE 1 and "production" in GLUE 2; at some point they changed both to lower case.
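
The suspected failure mode can be illustrated with a minimal sketch. It assumes the monitoring check does an exact, case-sensitive string comparison; how gstat actually implements the check has not been confirmed here, and the function name `status_ok` is hypothetical.

```python
def status_ok(published: str, expected: str, case_sensitive: bool = True) -> bool:
    """Return True if the published status value matches the expected one.

    Hypothetical helper: models a check that compares the published
    value against the spelling the schema is supposed to use.
    """
    if case_sensitive:
        return published == expected
    # Case-insensitive variant: tolerates "Production" vs "production".
    return published.lower() == expected.lower()


# A site publishing the lower-case form against a check expecting the
# capitalised GLUE 1 spelling fails the strict comparison...
strict = status_ok("production", "Production")
# ...but passes once the comparison ignores case.
relaxed = status_ok("production", "Production", case_sensitive=False)
```

If this is indeed the cause, a case-insensitive comparison on the monitoring side would stop sites being flagged, without requiring each site to change what it publishes.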

Tuesday 15th July

On-duty - Dashboard ROD rota

Tuesday 22nd July

  • Another quiet week. Bham availability alarm ticket created.

Tuesday 1st July

  • Quiet week. Sussex emi2 ticket is still open. UCL also has an open ticket regarding some problem with storage.

Tuesday 24th June

  • Very quiet shift. Dashboard downtime on Tuesday seemed to go ok.

Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).


Security - Incident Procedure Policies Rota

Tuesday 22nd July

Monday 14th July


Tuesday 1st July

  • There was a very useful security challenge debrief last week. Thanks to Heiko.
  • There may be a site contacts challenge in the coming months. Please could every site review their site security contact details and ensure that the GOCDB entry is up-to-date and working.
  • EGI indicates that site ARGUS instances can now be hooked up with the regional instances.
  • There was one EGI amber final report last week.
  • Next team meeting 16th July.

Monday 23rd June

  • CVE-2014-3153 - but no public exploit.
    • This kernel vulnerability has been patched in errata released last week.
  • PerfSonar/Cacti updates.
  • New IGTF CA release 1.58 - the EGI release is due on 30th June.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 22nd July

  • There was a problem with VOMS admin that was noticed last Thursday: VO page requests resulted in the WEB-INF directory of the jetty app being displayed. A server restart fixed the problem.

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.

Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting the EGI message broker (mq.afroditi.hellasgrid.gr) at around 2pm on 11th July. The message broker was put into downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in the TB support thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected; the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be considered when calculating availability/reliability. All Nagios tests have now returned to normal.

Emir reported this on tools-admin mailing list "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
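
To see why the cache length matters, here is a rough sketch of the timing. It is a simplified model, not the top BDII's actual logic: it assumes an endpoint stays published for a fixed retention period after it was last seen, and the function name and the 09:00 check time are illustrative assumptions.

```python
from datetime import datetime, timedelta


def still_published(last_seen: datetime, now: datetime, cache_hours: float) -> bool:
    """True while a vanished endpoint is still held in the cache.

    Simplified model: the entry is dropped once `cache_hours` have
    elapsed since the endpoint was last seen alive.
    """
    return now < last_seen + timedelta(hours=cache_hours)


# Broker lost at around 2pm on 11th July 2014 (from the report above).
outage = datetime(2014, 7, 11, 14, 0)
# "Monday morning" check; 09:00 is an assumed time.
monday_morning = datetime(2014, 7, 14, 9, 0)

with_4_day_cache = still_published(outage, monday_morning, cache_hours=96)
with_24h_cache = still_published(outage, monday_morning, cache_hours=24)
```

Under this model a 4-day cache would still be publishing the dead broker on Monday morning, while a 24-hour cache would have dropped it by Saturday afternoon, which is consistent with the difference seen between sites.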

Tuesday 1st July

  • There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.phyics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, so it was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 14th July 2014

  • HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

  • HyperK.org request for support from other sites
    • 2TB storage requested.
    • CVMFS required
  • Cernatschool.org
    • WebDAV access to storage - world read works at QMUL.
    • ideally will configure federated access with DFC as LFC allows.

Monday 16 June 2014

  • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software.
  • VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to update.

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 23rd July 2014

  • Operations report
  • The termination of the FTS2 service has been announced for the 2nd September.
  • The software server used by the smaller VOs will be turned off - also on 2nd September.
  • We are planning to turn off access to the cream CEs (possibly keeping them open for Alice) - although no date yet decided for this.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A