GridPP PMB Meeting 576 (12.10.15)
=================================
Present: David Britton (Chair), Tony Doyle, Roger Jones, Tony Cass, Andrew McNab, Andrew Sansum, Jeremy Coles, Dave Colling, Steve Lloyd, Claire Devereux, Pete Clarke, Gareth Smith (Minutes).

Apologies: Dave Kelsey, Pete Gronbech

3. AOCB
=======

DB: No news from STFC regarding the GridPP bid.

GridPP Strapline: Tom Whyntie suggested that the current strapline on the new GridPP web pages (“UK Computing for Particle Physics… and Beyond”) could be improved. After discussion the strapline: “Distributed Computing for Data Intensive Research” was agreed. This was fed back to Tom.

Benchmarking Group: A Cloud Benchmarking Group is being set up at CERN and Helga Meinhard has asked for participation. It was proposed that Martin Bly be approached for this.

ACTION 576.1
AS to ask whether Martin Bly will represent GridPP on the benchmarking group and feed back to the PMB.

CPU Efficiencies: The CMS efficiency at the Tier1 has risen to 68% (report from Andrew Lahiff). DC responded that the CMS algorithm for filling multi-processor slots is tuned for the case of lots of incoming data. The algorithm works less well when there is less work coming in, or when work arrives in bursts. The PMB will maintain visibility of this issue. There was a discussion about APEL reporting efficiencies greater than 100%.
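To make the efficiency arithmetic concrete, the following is a minimal illustrative sketch (not CMS's or APEL's actual accounting code; the figures are invented) of how a multicore pilot that cannot keep all of its cores busy drags down the reported CPU/wallclock efficiency, and how mis-counting the allocated cores can push the reported figure above 100%:

    # Illustrative only: how a multicore pilot slot that is not kept full
    # lowers the reported CPU efficiency. The numbers are invented.

    def cpu_efficiency(cpu_seconds, wall_seconds, cores):
        """Total CPU time divided by wallclock time summed over allocated cores."""
        return cpu_seconds / (wall_seconds * cores)

    cores = 8                 # an 8-core pilot
    wall = 10 * 3600          # running for 10 hours of wallclock

    # Payload jobs keep all 8 cores busy for the whole pilot lifetime:
    print(f"full pilot:   {cpu_efficiency(8 * wall, wall, cores):.0%}")    # 100%

    # Work arrives in bursts and on average only 5 of the 8 cores are busy;
    # the idle cores still count as allocated wallclock:
    print(f"bursty pilot: {cpu_efficiency(5 * wall, wall, cores):.0%}")    # ~62%

    # If the accounting records fewer cores than the payload actually used,
    # the ratio can exceed 100% (cf. the APEL discussion above):
    print(f"miscounted:   {cpu_efficiency(8 * wall, wall, 6):.0%}")        # ~133%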

Visit Notice System: RJ noted that all the experiments use Steve Lloyd’s visit notice system and proposed that we could use this for GridPP as well. This would be a simplification, and would be one less thing to move to the new web site. The imposition of limits by the web site was discussed: these are not enforced in Steve’s system, although there is a place for guidance notes. The £300 limit above which authorization is needed was also discussed, with the possibility of raising it to £500 or of only requiring authorization for overseas visits.

ACTION 576.2
DB to talk with DK about the policy for when visit notices are required.

ACTION 576.3
SL to implement authorization for GridPP funded visits on the system already in use by the experiments.

EGI User Community engagement request:
CD had circulated an email asking for interest in supporting the MoBrain project. This is part of a Horizon 2020 call covering four areas of research, each with “virtual access and transnational access”. EGI will gather the resources and then offer them to these new communities. One of these areas is MoBrain, in which Durham and STFC already have an interest. The initiative does offer some funding for support (i.e. staff), depending on the resources offered, and this may be attractive. The four initiatives are: MoBrain (Structural Biology); Pan-Cancer Analysis of Whole Genomes (PCAWG); Distributed Research Infrastructure for Hydro-Meteorology (DRIHM); and Bioinformatics Infrastructure for Life Sciences (BILS). DB noted the importance of having local engagement in the research project(s). CD is completing a survey on behalf of the UK. She is looking at existing institutional involvement in these areas and will send round an e-mail for members of the PMB to respond to.

EU-T0 planning meeting: PC will be attending a meeting in Barcelona, along with David Corney, about EU-T0 planning for H2020. This will cover a number of topics including the Open Science Cloud.

Tier2 Evolution: AMcN reported that a set of tasks has been put in JIRA and links circulated. Those interested can drill down into the tasks. AMcN did not want to be restricted to detailed timescales for the tasks but proposed that more general, overall timescales would do. This point was accepted. DC proposed a report on this work at the bi-weekly technical meeting. However, DB pointed out that the PMB also needs a high level view of this. This work means that action 574.1 (AMcN to undertake various tests on Jira and discuss at a future PMB soon.) has been completed.

Workshop on Clouds: DC reported that there will be a workshop on clouds as part of an initiative to restart the Cloud SIG. Details are still being worked out. The likely date is the 1st December, with a venue somewhere in London. DC will send round details once they have been finalised.

4. Standing Items
==================
SI-0 Monthly report from Cloud Group.
————————————-
AMcN reported that Sam Skipsey & Ewan MacMahon have picked up the work on the storage side of the project.
PC reported that, following Wahid’s departure, a new person with relevant previous experience has been recruited at Edinburgh.

SI-1 Dissemination Report.
————————–
SL reports: Tom Whyntie is acting on feedback received about the new website. AMcN is obtaining a standard (commercial) certificate for it.

SI-2 ATLAS Weekly Review and Plans.
———————————–
RJ reports: Torre Wenaus of BNL has been elected as the new ATLAS Deputy Computing Coordinator. The next Computing Coordinator is Simone Campana.

SI-3 CMS Weekly Review and Plans.
———————————-
DC reports: There is a CMS Offline & Computing Week taking place at the moment. CPU efficiencies are being followed up.

SI-4 LHCb Weekly Review and Plans
———————————-
AMcN reports: LHCb have suffered a number of technical problems in the last couple of weeks, including problems with the run database. There was a recent occasion when three (LHCb) Tier1s had downtimes at the same time. The WLCG twice-weekly meeting is working on coordinating downtimes with the aim of avoiding these multiple simultaneous downtimes.

SI-5 Production Manager’s report
———————————-
JC Reports:

1. Some GridPP sites have been affected by an automatic BDII SL6/CentOS6 update (openldap-servers-2.4.40-6) that was released as a security update. Unfortunately, from openldap-servers-2.4.40-5 onwards an issue was introduced that causes the slapd process to crash under certain conditions.
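As an aside, an illustrative sketch only (not an agreed GridPP recommendation): a site wanting to keep the broken build out until a fixed package appears could exclude it from automatic yum updates, for example:

    # /etc/yum.conf ([main] section): hold back the problematic package
    # until a fixed openldap-servers build is released. Illustrative only.
    exclude=openldap-servers*

    # Alternatively, the yum-plugin-versionlock plugin can be used to pin
    # the last known-good version (i.e. one earlier than 2.4.40-5).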

2. The WLCG Tier-2 availability:reliability figures for September have been circulated:

• ALICE: All okay.
• ATLAS:
   o Lancaster: 38% : 42%
   o Liverpool: 81% : 100%
• CMS: All okay.
• LHCb:
   o QMUL: 47% : 47%
   o Lancaster: 75% : 80%
   o ECDF: 89% : 89%

The site explanations are:

Lancaster – On 1st September the site reinstalled the local cluster, moved to a new scheduler and queues (still SGE) and updated infrastructure. This was a 2-day scheduled downtime. Nearly a week was then lost to a host of teething problems (including a publishing error). The CE then became unstable after the move, which took some fixing. In the last week of September the site suffered first a catastrophic NFS failure (bad for a tarball site!) and then the networking “imploded” (this is still being fixed).

Liverpool – For ATLAS, they had 100% reliability but only 81% availability (see the note on these definitions after the site explanations below). This is because a large amount of deferred maintenance was done on the electrical supply system over the last month. Even though the power outages were of short duration, the clusters had to be drained each time, which significantly amplified the effect.

ECDF – Had a problem with CVMFS last month for LHCb which was ultimately not site related. There were also a few minor general site issues which may have impacted availability.

QMUL – One of the site’s three CEs fails about 70% of LHCb jobs, while the other two have no problem. This is being investigated.
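As a reminder of why the two figures can differ (a rough paraphrase of the standard WLCG definitions, ignoring periods of ‘unknown’ status):

    Availability ≈ T_up / T_total
    Reliability  ≈ T_up / (T_total - T_scheduled_downtime)

so a site that is only unavailable during properly declared scheduled downtimes, as in the Liverpool case above, can report 100% reliability while its availability drops.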

Following the meeting, Andrew McNab mentioned that “Stefan Roiser has just pointed out that for LHCb, the lack of SAM tests for Vac sites shows up in the federation level reporting even though the individual sites look good. This is really an artefact of LHCb classing the Vac virtual “CEs” as part of a separate site, with no measurements. I don’t know if this will be flagged up anyway. I’ve suggested to Stefan that we just take the VAC.* sites out of the LHCb A/R numbers for now.”

It has not been an issue so far, but removing the VAC sites for now makes sense.

3. UK sites reported a problem with the CVMFS at RAL. Apparently this was due to a misconfiguration at CERN. Oddly, the UK was affected but other regions did not report any issues. This is to be investigated further.

4. There has been progress with the GridPP DIRAC install at Imperial, with the aim of enabling pilot jobs for LSST in conjunction with ECDF and Oxford.

5. We have finally switched off UCL storage for ATLAS.

6. There have been some security incidents affecting sites in China. No direct impact on GridPP seen from current investigations.

SI-6 Tier-1 Manager’s Report
—————————–
GS reports:

Castor:
– The next steps in the upgrade of the Castor Oracle databases to version 11.2.0.4 took place last Tuesday and Thursday. The main change was the upgrade of the “Pluto” database. (Pluto hosts the nameserver as well as the CMS and LHCb stagers). The final step in this change is scheduled for tomorrow (Tuesday 13th October). This will be the swapping over of the main and standby copies of the Pluto database so that we are in the final configuration with the main database running in building R89 and the standby in the Atlas building.
– We have seen very high load on the Atlas tape instance. The instance is running well, delivering around 500 Mbyte/sec combined in and out; however, there are delays to the ‘bring online’ requests. We are preparing to install some additional disk servers into the tape cache.

Networking:
– On Wednesday 30th September, as scheduled, the link from our main router pair into the RAL core was upgraded from a resilient pair of 20Gbit connections to a resilient pair of 40Gbit connections.
– We are keeping a close watch on some low-level packet loss within our network. We are continuing with the changes needed to remove the old ‘core’ switch from the network.

Batch:
We have been progressing with the upgrade of our worker nodes to their new build and configuration. The final batch of worker nodes started to be drained on Friday 2nd October. However, a significant problem arose with glexec that had not been apparent during testing and the long roll-out. The problem was not spotted until Monday, when it was fixed. It affected the CMS CE SUM tests all weekend, resulting in a large loss of availability for CMS. (Other VOs do not include glexec tests in their availability calculations.) We currently do not call out on the VO CE tests (only on the OPS ones), and furthermore subsequent checks show that our Nagios test for the VO CE SUM tests had not been running. At the moment we do not understand how this glexec problem was not plainly visible during the testing and roll-out of the new worker node configuration.

Infrastructure:
– The quarterly UPS/generator load test took place successfully last Wednesday (7th October).

Action 576.4: GS to respond to the PMB with an explanation of why the glexec test failures were not seen during testing and roll-out of the Tier1 worker node configuration, and to provide a list of resulting actions to mitigate this type of problem in future.

SI-7 LCG Management Board Report of Issues
——————————————-
Nothing to report (DB).

AOB: AS reported that Philip Garrad of RAL Networking has had discussions with JANET and there is a reduction in the price of the OPN link. DK has the information and will pass it on to the PMB for discussion.

Next meeting: 19 October. Some people are unable to attend (DB, PC). It is proposed that PG chairs, but if there are too few attendees for the meeting to be useful then it should be cancelled.

Review of ACTIONS
=================
571.6 Any PMB members who have not already done so must now submit their quarterly reports. Done. (Tier1 reports submitted).
574.1 AMcN to undertake various tests on Jira and discuss at a future PMB soon. Done. See report in this meeting.
574.2 On CMS T1 efficiency discrepancies – DC reports CMS are running multicore pilots on single-core jobs, whereas ATLAS are doing this correctly and hence show higher efficiency. Ongoing.
574.8 DB to obtain information from PC about conclusion of MB discussion on Memory Items for the Future and share with PMB members. Ongoing.
575.1 SL should retain more metrics information so that anomalies can be checked, and the page should be modified to record on Elapsed Time rather than CPU, with 50/50 weighting. This should take effect from 11 April but should be monitored from now until then as it is easier to access and share. Done.
575.2 DB will circulate a note to all GridPP group members advising of the change in dates for GridPP36 to 11-13th April 2016. Done.
575.3 All members should consider content and structure for the new website and feed suggestions to Tom. Done.

ACTIONS as at 12-10-2015.
=========================
574.2 On CMS T1 efficiency discrepancies – DC reports CMS are running multicore pilots on single-core jobs, whereas ATLAS are doing this correctly and hence show higher efficiency. Ongoing.
574.8 DB to obtain information from PC about conclusion of MB discussion on Memory Items for the Future and share with PMB members. Ongoing.
ACTION 576.1
AS to ask whether Martin Bly will represent GridPP on the benchmarking group and feed back to the PMB.

ACTION 576.2
DB to talk with DK about the policy for requiring visit notices.

ACTION 576.3
SL to implement authorization for GridPP funded visits on the system already in use by the experiments.

Action 576.4: GS to respond to the PMB with an explanation of why the glexec test failures were not seen during testing and roll-out of the Tier1 worker node configuration, and to provide a list of resulting actions to mitigate this type of problem in future.