Operations Bulletin 300614

From GridPP Wiki

Bulletin archive

Week commencing 23rd June 2014
Task Areas
General updates

Monday 23rd June

Monday 16th June

  • There was a GDB last week and pre-GDB on IPv6. Of particular interest is Helge's trip report from HEPIX. The official GDB summary notes are now available and the actions updated.
  • GGUS ran into problems over the weekend and this affected ticket processing. The system did not recover from a DB connection problem and required a server restart.
  • The CVMFS installation servers hosted at CERN are to be upgraded from v2.0 to v2.1, starting on 5th August. Before the server upgrade (by the end of July), sites are asked to make sure that the CVMFS clients deployed on the WLCG infrastructure are upgraded to version 2.1.19, released on 28th May.
  • Well done to Brunel for being the first UK site to offer EGI resources via the eGrant system!
WLCG Operations Coordination - Agendas

Monday 23rd June

  • Minutes from last Thursday's meeting. Highlights....
  • A page is available listing current known middleware issues affecting WLCG.
  • Baselines: StoRM 1.11.4 released in EMI containing several bug fixes. Baseline update with UMD release.
  • 3 issues affected some sites after the latest EMI update of CREAM and LB. The problems are under investigation by the PTs.
  • CVMFS: Starting from July, sites not compliant with the 2.1.19 version will be notified with a GGUS ticket (noted that the upgrade just requires an update of the RPM and a restart of CVMFS).
  • T0: The OPS VO now runs in voms-admin instead of VOMRS, after the migration done on June 17th
  • Tier-1/Tier-2 feedback: NTR!
  • ALICE: successful campaign for users to move away from old ROOT versions. T0 job efficiency issues ongoing.
  • ATLAS: DC14 expected to start in approximately 2 weeks from now. Panda/Jedi is now fully ready for user analysis.
  • CMS: Started to remove individual release tags from CEs. After the introduction of disk/tape separation at the T1 sites, CMS now must site readiness measures for T1 sites
  • LHCb: Recommend CVMFS 2.1.19. General request: ensure that downtimes, including unscheduled outages, accurately reflect the specific services which are unavailable.
  • FTS3: Monitoring the auto-tuning algorithm closely and adjusting various monitoring tools of FTS3.
  • glexec: 10 sites have yet to enable it. ARGUS instabilities being investigated.
  • Machine/job features: PBS/torque and LSF implemented. SLURM pending. SGE and HTCondor in progress.
  • MW readiness: ATLAS and CMS DPM setups in progress. Monitoring prototype being deployed at test sites.
  • Multicore: CMS stable flow. Gathering reports for July workshop. ATLAS MC jobs on-hold pending new software release.
  • SHA-2: New VOMS fix for CERN instances requires sites to update ARGUS, UI, CREAM and WN instances.
  • WMS decommissioning: Progress with SAM Condor validation. ARC-CE WN tests failing for some CMS sites (incl. Imperial).
  • IPv6: NTR
  • HTTP proxy discovery: Task overview table updated.
  • Network and transfers metrics: Mesh leaders developed. Kick off in July.
  • AOB: OSG plan to migrate to HTCondor CEs by October.
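
The CVMFS item above sets 2.1.19 as the client version sites are expected to run. As a minimal sketch of how a site might flag nodes below that baseline, the Python below compares dotted version strings numerically. The node names and the way the version strings are collected are assumptions made for illustration; only the 2.1.19 baseline comes from this bulletin.

```python
import re

# CVMFS client baseline quoted in this bulletin.
BASELINE = (2, 1, 19)


def parse_version(text):
    """Turn a dotted version such as '2.1.19' into a tuple of ints.

    Anything after the third number (e.g. an RPM release suffix) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_compliant(version_text, baseline=BASELINE):
    """True if the version is at or above the baseline."""
    return parse_version(version_text) >= baseline


def non_compliant_nodes(reported):
    """Return the sorted names of nodes whose client is below the baseline.

    `reported` maps a (hypothetical) node name to its version string.
    """
    return sorted(node for node, ver in reported.items() if not is_compliant(ver))
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where "2.1.9" sorts after "2.1.19" as text.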

Tier-1 - Status Page

Tuesday 24th June

  • Castor Stager Upgrade was carried out last week. 'GEN' stager update this morning. Remaining updates: LHCb - Thu 26th; Atlas - Tue 1st July.
  • We are looking at how to end the FTS2 service, now that FTS3 is becoming widely used.
  • The software server used by the small VOs will be withdrawn from service. Its use as a software server is very limited (possibly only SNO+) although a few VOs use it for uploading files to the CVMFS repository.
Storage & Data Management - Agendas/Minutes

Tuesday 17th June

  • Advances with CEPH at RAL will be reported to the Storage Meeting. It is hoped to set up a regular update contribution.

Tuesday 10th June

  • The DPM Collaboration agreement has been updated.

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 24th June

  • APEL not up-to-date for: RHUL, Manchester, Durham and Sussex.

Tuesday 10th June

  • APEL not up-to-date for: Brunel, Sheffield, QMUL, Durham and Sussex? EMI-2 service downtime related in some cases?

Tuesday 20th May

  • Sites with APEL 'delays': IC, Liverpool, Sheffield, Durham, ECDF and Glasgow.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Monday 16th June

  • A review is starting of old and obsolete pages within the GridPP website - there are many! Please review sections that you have created and update them if necessary.

Tuesday 6th April

  • KeyDocs are going to be reviewed (in next 4 weeks) as the system is not working (or not adding anything) in some areas.

Interoperation - EGI ops agendas

Tuesday 10th June

  • Next meeting June 16th.

Monitoring - Links MyWLCG

Tuesday 24th June

On-duty - Dashboard ROD rota

Tuesday 24th June

  • Very quiet shift. Dashboard downtime on Tuesday seemed to go ok.

Rollout Status WLCG Baseline

Tuesday 18th March

Tuesday 11th February

  • 31st May has been set as the deadline for EMI-2 decommissioning. There may be an issue for dCache (related to 3rd party/enstore component).


Security - Incident Procedure Policies Rota

Monday 23rd June

  • CVE-2014-3153 - but no public exploit.
    • This kernel vulnerability has been patched in errata released last week.
  • PerfSonar/Cacti updates.
  • New IGTF CA release 1.58 - the EGI release is due on 30th June.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 17th June

  • The GridPP VOMS server was updated on 11/06/2014 - no issues reported.


Monday 23rd of June 2014, 15.00 BST
27 Open UK tickets today.

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105571 (20/5)
RAL is publishing inconsistent storage numbers for lhcb. No word on this for a while - but the problem persists. In progress (17/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=105405 (14/5)
Vidyo firewall ticket - it looks like it's heading to ticket limbo, can it be saved from this fate and given an update (or closure). In progress (10/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=98249 (21/10/13)
Sno+ cvmfs stratum-0 ticket. Some interesting conversation in this ticket about cvmfs mirroring on the other side of the Atlantic. In progress (17/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106020 (6/6)
cern@school jobs stuck at Birmingham. Did the investigation yield any results? Or perhaps the problem has evaporated? Silence isn't golden when it comes to tickets! Well, unless they're on hold. In progress (23/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106369 (20/6)
Biomed have submitted a second ticket (first one was 105942) asking IC to get gsiftp read access to their dcache namespace (think I've got that right). Simon has replied saying that he doesn't want to circumvent what he sees as a security feature (fair enough), so I suspect this one might have to go Unsolved. Waiting for reply (20/6)

https://ggus.eu/index.php?mode=ticket_info&ticket_id=106347 (19/6)
Naught wrong with the ticket handling here, but I thought it was interesting - the new Cloud site has been hammering the cvmfs stratum zero - this looks to be a problem with atlas jobs/images trying something new with proxy discovery. An installation of Shoal should fix things. Interesting that not too long ago we had a similar VAC problem. In Progress (20/6)

Tools - MyEGI Nagios

Tuesday 24th June

  • An update from Janusz on DIRAC:
  • We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC.
  • The async FTS is still under study; there are some issues with this.
  • I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.

Tuesday 20th May

Between May 1st and May 12th, SAM-CENTRAL and the Message Broker Network experienced a set of chained failures that resulted in the loss of a large portion of the metric results published by the SAM NGI instances. The loss of these messages will result in an unusually high number of UNKNOWNS in the May A/R reports, but the actual A/R numbers will not be affected as UNKNOWNS are not taken into account. No other services have been affected.
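
To illustrate why a surge of UNKNOWNS leaves the headline figure unchanged, here is a minimal sketch. It assumes a simplified availability definition (samples in state OK divided by samples with a known state) and invented sample data; the real A/R reports are computed by the SAM infrastructure and may differ in detail.

```python
from collections import Counter


def availability(samples):
    """Fraction of known-state samples that are OK.

    UNKNOWN samples are excluded from the denominator, so they neither
    raise nor lower the result. Returns None if no sample has a known state.
    """
    counts = Counter(samples)
    known = len(samples) - counts["UNKNOWN"]
    if known == 0:
        return None
    return counts["OK"] / known


# Hypothetical test results for one service.
normal = ["OK"] * 9 + ["CRITICAL"]
with_lost_messages = normal + ["UNKNOWN"] * 5

print(availability(normal))
print(availability(with_lost_messages))
```

Both calls give the same value because the five UNKNOWN samples drop out of the calculation entirely.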

Tuesday 13th May

  • From last week's discussion, DIRAC now supports: NA62, vo.landslides.mossaic.org, t2k.org, snoplus, gridpp, CERN@school and northgrid. NA62 are moving from LFC to DFC and plan to use DIRAC in place of the WMS.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 16 June 2014

  • Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software.
  • VOMS server: Snoplus has problems with some of the VOMS servers - see ggus 106243 - may be related to update.

Tuesday 15th April

  • Is there interest in an FTS3 web front end? (more details)

Site Updates

Tuesday 20th May

  • Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries
Project Management Board - Members Minutes Quarterly Reports


GridPP ops meeting - Agendas Actions Core Tasks


RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 25th June 2014

  • Operations report
  • Castor GEN Stager 2.1.14-13 updated yesterday (24th June). Some problems with xroot for ALICE were not resolved until the following morning. Remaining stager dates are as follows: LHCb - Thu 26th June; Atlas - Tue 8th July.
WLCG Grid Deployment Board - Agendas MB agendas


NGI UK - Homepage CA


UK ATLAS - Shifter view News & Links






  • N/A
To note

  • N/A