GridPP PMB Meeting 623

GridPP PMB Meeting 623 (06.02.17)
=================================
Present: Dave Britton(Chair), Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Cass, Andrew Sansum.

1. ResearchFish
===============
Some members have received emails from ResearchFish. SL advised that his email confirmed all his entries were Category 5. DB's email differed and has now been circulated to the PMB. PC summarised an additional meeting he had with ResearchFish (Ian Fuller & Research people) on 03.02.17. He has seen the new functionality, which should allow PG to add outputs to a central GridPP project; members can then go via institutional links and inherit links to their accounts. It should be ensured that none of the HW grants ask for outputs, only staff grants. SL confirmed his staff grants are noted as Category 5 and included; this category states that items can be uploaded but this is not required. All PMB members will sign in and check whether previous issues have been resolved. PG and PC will contact Ian Fuller to confirm the functionality for GridPP, and PG will confirm whether papers uploaded from projects, e.g. Atlas, can be imported into GridPP. Additionally, ResearchFish have made all links between different grants, e.g. Atlas to CGs, so individuals can see their own grants linked to wider GridPP grants. DB noted the ResearchFish input period runs from today until 16.03.17. SL's email lists all the grants, but other members' emails do not. Some issues experienced last year still remain, including grants duplicated on his list.

ACTION 623.1: PG will test ResearchFish and upload the latest papers for other members to inherit into their CG.

ACTION 623.2: PC will email Ian Fuller to mention ongoing issues on ResearchFish from last year.

2. Tier1 Review
===============
There was some discussion on requirements for output from the review and how that is best framed and summarised for the OSC. The review needs to be concluded and a summary of the issues outlined. PG provided a set of notes that can be combined with the agenda and documents; PG will provide links to the agenda.

ACTION 623.3: DB and AS will discuss how best to summarise the Tier1 review.
ACTION 623.4: GS will upload talks from the Tier1 review to the Agenda.

3. OSC Meeting Date/Schedule
============================
Date confirmed [NOTE ADDED: THE MEETING IS NOW TO BE RESCHEDULED] – 18.05.17 in London. PG confirmed the schedule for document preparation:
As with last time, PG has backtracked the deadlines: the deadline for handing in documents is 03.05.17, and in each preceding week a different draft version should be submitted. PG will put this onto the F2F agenda.

ACTION 623.5: PG will put the Deadlines for OSC reports onto the F2F agenda.

4. RAL CPU Efficiencies January
===============================
AS circulated these; Atlas efficiency is 5-6% lower than normal and on par with CPU. RJ will investigate.
ACTION 623.6: RJ will conduct an investigation on Atlas efficiency being 5-6% lower than usual.

5. HSF Analysis Ecosystem Retreat
=================================
A Doodle poll has been set up to arrange a meeting in April/May. DB noted the lack of UK representatives. It was confirmed the meeting is mainly an opportunity to discuss Root; this has organically come out of the white papers and provides opportunities to engage further. It is not immediate core business for Atlas UK or GridPP involvement, but we should be involved in HSF at a project level. This may be more relevant if we were funding Ganga. DB suggests it may be a good opportunity for someone to attend if possible (possibly RJ or James Catmore).

6. AOCB
=======
None

7. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
AM confirmed it was productive. Discussions were positive around ends and containers and issues Andrew Lahiff has been looking at. They agreed to use experience with Ends to make containers with similar interiors – Andrew will make for Atlas and LHCb and we will supply interiors. This provides lightweight ways for LHCb and Tier1s on more technical matters.

SI-1 Dissemination Report (SL)
——————————
No report.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
Nothing of significance to report.

SI-5 Production Manager’s report (JC)
————————————-
A few items this week:

1. There is a Benchmarking Working Group F2F in the pre-GDB slot this Tuesday: https://indico.cern.ch/event/578967/. It covers experiment views (including benchmarking with production jobs), cloud benchmarking, issues with Haswell processors (magic boost investigation).

2. The GDB for February is this Wednesday: https://indico.cern.ch/event/578983/. Topics include AFS phase-out, benchmarking, feedback on the EOS workshop and a baseline for WLCG T1 operations.

3. January R/A figures for the Tier-2 sites have been circulated:

* ALICE http://wlcg-sam.cern.ch/reports/2017/201701/wlcg/WLCG_All_Sites_ALICE_Jan2017.pdf. All OK

* ATLAS http://wlcg-sam.cern.ch/reports/2017/201701/wlcg/WLCG_All_Sites_ATLAS_Jan2017.pdf

– Sheffield 80%:82%
– ECDF 87%:98%
– RALPP 80%:80%

* CMS http://wlcg-sam.cern.ch/reports/2017/201701/wlcg/WLCG_All_Sites_CMS_Jan2017.pdf. All OK

* LHCb http://wlcg-sam.cern.ch/reports/2017/201701/wlcg/WLCG_All_Sites_LHCB_Jan2017.pdf

– Sheffield 71%:73%
– ECDF 88%:100%
– RALPP 83%:84%

Site responses so far:

– RALPP reports that the load on their dCache is causing intermittent SAM failures.

4. There was a short downtime to undertake updates of the GridPP website and the Vac websites on Friday 3rd February. No issues reported.

5. A round of updates on our “wider VOs” last week suggests in most cases we are waiting for the VOs (either to submit jobs, comment on plans etc.).

SI-6 Tier-1 Manager’s Report (GS)
———————————
General: We have been applying patches for CVE-2016-7117.

Castor:
– As reported at the Tier1 review the Castor 2.1.15 update has been completed. There were some issues following the upgrades. These were:
— Needing to adjust some parameters after the first (LHCb) stager update – as in my report two weeks ago.
— There was also a problem with ALICE after the ‘GEN’ upgrade. ALICE require a special version of the xroot component for Castor. Checks that the xroot component would install under 2.1.15 had been made – but a newer version was needed. Once this had been provided there was a further ALICE specific configuration error that had to be tracked down. This caused a significant loss of availability for ALICE (failed between the 26th and 30th January).
– Since the upgrade we have seen a couple of further problems.
— There has been a problem with the LHCb instance – we see a database resource (number of cursors) exhausted – and have had to restart the service to clear stuck transfers (on 1st Feb). A similar operation was carried out for Atlas (on 31st Jan).
— We are also failing CMS tests for an SRM endpoint defined in the GOC DB but not in production (“srm-cms-disk”). This should not have tests running against it and needs following up with CMS. Even though this test should not matter we would like to understand why it has stopped working after the Castor upgrade.
– I note that we have completed the move of all LHCb data to the T10KD tapes, so all Tier1 data is now on T10KDs.

Disk Servers:
– I note that there was a disk server failure for LHCb on 3rd Feb. We had to declare 5 files lost from that failure.

Tier1 Availabilities for January 2017:
Mainly affected by Castor. There were planned interventions on the 5th Jan (patching); 10th (Nameserver to 2.1.15) plus each instance had one other outage.
ALICE: 87% – Badly affected by the ALICE specific Castor problem following the 2.1.15 upgrade.
ATLAS: 98%
CMS: 90% – Background failure rate (timeouts) high. However, there is also a monitoring/reporting problem and I will request this figure to be checked.
LHCb: 98%
OPS: 99%

Andrew Lahiff circulated the January batch job efficiencies. I have copied the LHC VO ones in here for the record along with his comment:

Global CPU efficiency (CPU time / wall time) was up in January at 83.9%, compared with 83.5% in December. Of 228085 HEP-SPEC06 months available wall time, 220617 HEP-SPEC06 months were used (96.7% occupancy). Experiment summary:

Experiment     CPU Time   Wall Time       Wait   % Efficiency
            (times in HEP-SPEC06 months)
ALICE          25099.18    27472.58    2373.41          91.36
ATLAS          65378.09    85910.61   20532.52          76.10
CMS            20607.98    30328.85    9720.87          67.95
LHCb           66246.12    68719.93    2473.81          96.40

LHC Total     177331.37   212431.97   35100.60          83.48
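
For the record, the figures above can be cross-checked with a short calculation. The following is a minimal sketch only and is not part of Andrew Lahiff's report: it assumes efficiency is CPU time divided by wall time (as stated in his comment), that the Wait column is wall time minus CPU time, and that occupancy is used wall time divided by available wall time. The names usage, efficiency and summarise are illustrative; only the numbers are taken from the report.

```python
# Figures copied from the January report above, in HEP-SPEC06 months:
# experiment -> (CPU time, wall time)
usage = {
    "ALICE": (25099.18, 27472.58),
    "ATLAS": (65378.09, 85910.61),
    "CMS": (20607.98, 30328.85),
    "LHCb": (66246.12, 68719.93),
}


def efficiency(cpu, wall):
    """Return CPU efficiency as a percentage (CPU time / wall time)."""
    return 100.0 * cpu / wall


def summarise(usage):
    """Recompute wait and efficiency per experiment, plus an LHC total row."""
    rows = {}
    for name, (cpu, wall) in usage.items():
        rows[name] = {
            "cpu": cpu,
            "wall": wall,
            "wait": wall - cpu,  # assumed definition of the Wait column
            "efficiency": round(efficiency(cpu, wall), 2),
        }
    total_cpu = sum(cpu for cpu, _ in usage.values())
    total_wall = sum(wall for _, wall in usage.values())
    rows["LHC Total"] = {
        "cpu": total_cpu,
        "wall": total_wall,
        "wait": total_wall - total_cpu,
        "efficiency": round(efficiency(total_cpu, total_wall), 2),
    }
    return rows


rows = summarise(usage)

# Occupancy: used wall time / available wall time (all VOs, from the report).
occupancy = round(100.0 * 220617 / 228085, 1)

for name, row in rows.items():
    print(f"{name:<10} {row['cpu']:>10.2f} {row['wall']:>10.2f} "
          f"{row['wait']:>9.2f} {row['efficiency']:>6.2f}")
print(f"Occupancy: {occupancy}%")
```

Under these assumptions the recomputed efficiencies and occupancy agree with the reported values, and the Wait figures agree to within rounding of the published inputs.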

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
No meeting.

SI-8 External Contexts (PC)
———————————
Nothing to report.

REVIEW OF ACTIONS
=================
616.3: DB and SL will discuss how best to progress replacement of TW’s role. (Update: SL has sent DB a job description and he will review and send back) Ongoing.
620.1: DB to contact DK re the procedure to deal with a security incident and the media. (Update: DK had devised an interim statement which involved TW as dissemination officer and he is no longer in post – there is no prescriptive full response as this would be dependent on circumstances and probably involve an emergency PMB and communication with relevant PR representatives). DK will send the statement to PMB in case required in future – spokesman SL as head of board or DB as project leader. Ongoing.
622.1: DB and PG will work on an agenda for GridPP38 and run this past DC for comment/input. Ongoing.

ACTIONS AS OF 06.02.17
======================
616.3: DB and SL will discuss how best to progress replacement of TW’s role. (Update: SL has sent DB a job description and he will review and send back) Ongoing.
620.1: DB to contact DK re the procedure to deal with a security incident and the media. (Update: DK had devised an interim statement which involved TW as dissemination officer and he is no longer in post – there is no prescriptive full response as this would be dependent on circumstances and probably involve an emergency PMB and communication with relevant PR representatives). DK will send the statement to PMB in case required in future – spokesman SL as head of board or DB as project leader. Ongoing.
622.1: DB and PG will work on an agenda for GridPP38 and run this past DC for comment/input. Ongoing.
623.1: PG will test ResearchFish and upload the latest papers for other members to inherit into their CG.
623.2: PC will email Ian Fuller to mention ongoing issues on ResearchFish from last year.
623.3: DB and AS will discuss how best to summarise the Tier1 review.
623.4: GS will upload talks from the Tier1 review to the Agenda.
623.5: PG will put the Deadlines for OSC reports onto the F2F agenda.
623.6: RJ will conduct an investigation on Atlas efficiency being 5-6% lower than usual.

Other business
1) Cloud working group last Friday – task force on computing from the previous meeting in 2016. DC has been contacting people to achieve some tasks and will report back to PMB as necessary.
2) Due to other commitments for several members, the next PMB will take place on 20.02.17.