GridPP PMB Meeting 612 (07.11.16)
=================================
Present: Dave Britton(Chair), Tony Cass, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Pete Clarke.

1. Status of OSC Document (PG)
==============================
PG has circulated up-to-date documents for perusal; these do not include financials. PG will complete the finance table, which monitors the project against the Plan based upon the original figures; this should be relatively straightforward at this early point in the project.
The main report is almost complete and includes comments from PMB members; AM will make final comments. Risks have been migrated into the new format. The milestones spreadsheet has been added to the Grid Map. The Project Management Plan will soon be complete, including final comments. Section 8 (Procurement Plan) does not relate to monies; it is a general statement of the procurement processes, frameworks and dates for the life of the project. The project report includes a note on LHCb's increased requirements, which will be covered separately. PG will submit tomorrow. Section 3 needs replacing with a high-level statement of the objectives and deliverables of the project, without detail – DB will draft.
DB and AM have made various additional comments on the Project Status for inclusion. DC should make a comment explaining the 8%-5% disparity between text and graphs referring to Tier-1 resources.
Action 612.1: DB will draft text for Section 3 of the OSC project management plan by tomorrow morning.
Action 612.2: PG will finalise and submit OSC documents by tomorrow.

2. Spend Plan (AS)
==================
AS compared the Tier-1 spend plan with monies available. In summary, we have received a request from Tony Medland on how we might partially meet the increased requests and prepare a budget request for RAL for 2017. In addition, an extra £390K of capital has been offered, to be brought forward to this year. AS re-ran the modelling on that basis and believes we can meet 60% in 2016 to meet the 2017 uplift, per the original plan for 2016. Planning for 2018 is more challenging because of exchange rate shifts. The MOU request should be capable of being met but will potentially impact other capital requirements. More capital and resource is required in FY17.

3. Tier-2 HW grants – status and question of which sites to spend this FY?
======================================================================
PG noted the distribution of funding between sites and made some adjustments as required. He emailed figures to PIs last week; 13 out of 16 sites already have JES forms created, with one fully submitted to STFC last week. Brunel, Glasgow and Durham are in the process of creating JES forms. Delivery date is arbitrary throughout the period (some have stated early 2017 and others 31 March 2018). We need to ensure monies are spent in the early part of this year; grants need to be started, with £10K spent within 3 months of the start date. DB stated procurement will be attempted at Glasgow this year. It would be helpful if Imperial can also spend before end March – DC will investigate but believes this should be achievable. Imperial may spend c. £150-£200k this financial year. DB has begun the procurement with a procurement officer assigned – this is further complicated because Glasgow has additional funds to spend and had infrastructure issues with system shut-downs last week. RJ will assess. It would be helpful if smaller sites can be encouraged to undertake their procurement soon – PG will determine which sites are in a position to do that this year.
Action 612.3: PG will determine which small sites can undertake procurement this FY.

4. What H/W do we recommend sites to purchase? Strategic direction re disk/CPU balance
====================================================================
Metrics will be impacted by the H/W procurement discussed in Item 3, and guidance should be provided to small or medium-sized sites. GridPP should be positioned to meet the experiments’ computing models as we move forward to 2020, and discussions should begin soon in this regard. This may need to be addressed at the F2F in GridPP38. H/W distribution needs to be driven by need, but not spread as it is now relating to volume – i.e. performance needs to be disentangled from volume. An operating model for smaller sites should be established to determine where disk goes: CPU can follow performance, but disk must follow workload. CMS has a limited number of sites, LHCb has a well-established system now, ATLAS is not as certain and RJ will consider this – RJ and DB will have preliminary discussions in Geneva. Some sites have decreased manpower, which will have an impact. The present grants need to proceed algorithmically; future grants will be determined after discussion of sites’ requirements.

5. AOCB
=======
Tier-1 Review date set for 1 February 2017. High-level issues for discussion, including networking, CEPH, evolving the Tier1 with staffing levels, cloud, etc., will need to be agreed. AS and DB will discuss after the OSC meeting.

6. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
Meeting took place last week – DC will circulate minutes.
Action 612.4: DC will circulate minutes from latest Technical Group meeting.

SI-1 Dissemination Report (SL)
——————————
## GridPP Engagement Officer Notes for PMB

### GridPP/SKA meeting

TW gave an introduction to GridPP and the New User Engagement Programme for the SKA/GridPP meeting organised by Jeremy C at Manchester, Weds/Thurs 2/3 November 2016:

https://indico.cern.ch/event/570594/

(Also many other GridPP members in attendance.)

Hopefully the start of a mutually-beneficial collaboration!

### Ganga integrated into the GridPP UserGuide

The UserGuide has now been fully re-written to incorporate Ganga as the User Interface of choice. Points of note:

* The local running capabilities of Ganga mean that grid-like jobs and workflows can be tested without the need for a Grid certificate, so new users can get stuck in straight away. The “Hello, World(s)!” example script demonstrates multiple, parameter-driven job submission with Ganga and Python scripting with a twist on the time-honoured example 🙂

* The way Ganga allows for simple switching between backends (local, cluster, DIRAC) means that the fully-worked example workflow from CERN@school can be moved from local to grid-running in a step-by-step fashion, which is much better from a pedagogical perspective. The new sections and structure reflect this.

* Thanks to CVMFS, the user no longer has to install anything (the GridPP DIRAC UI is provided in and sourced from the Ganga CVMFS repository).

* Ganga makes interacting with the DIRAC File Catalog (DFC) in Grid jobs trivial, and the data sections have been updated accordingly.

* Caveat: the UserGuide now assumes the user has access to the CERN and RAL CVMFS repositories (i.e. /cvmfs/ganga.cern.ch and /cvmfs/cernatschool.egi.eu), either via a local grid-enabled cluster for university-based users or a GridPP CernVM (instructions provided in an appendix). These are available by default on the latter, but you may wish to ensure that they are available on your cluster for new users.

#### Useful links:
* The UserGuide: https://www.gridpp.ac.uk/userguide/
* News item: https://www.gridpp.ac.uk/news-2016-11-04-userguide-ganga/

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
ATLAS is moving to a new site; very occasional misconfigurations need to be addressed, but at a low level.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
AM was unable to fully connect to the meeting because of technical issues so nothing to report.

SI-5 Production Manager’s report (JC)
————————————-
1. Operationally, several sites have continued to focus on actions to address the DirtyCOW vulnerability (CVE-2016-5195). We have occasional straggler nodes being picked up along with some false-positives in the dashboard.

2. The WLCG T2 A/R results for October have arrived (figures below are availability:reliability):

ALICE: http://wlcg-sam.cern.ch/reports/2016/201610/wlcg/WLCG_All_Sites_ALICE_Oct2016.html
All okay

ATLAS: http://wlcg-sam.cern.ch/reports/2016/201610/wlcg/WLCG_All_Sites_ATLAS_Oct2016.html
QMUL 78%:84%
Birmingham 77%:77%

CMS: http://wlcg-sam.cern.ch/reports/2016/201610/wlcg/WLCG_All_Sites_CMS_Oct2016.html
Bristol 85%:95%

LHCb: http://wlcg-sam.cern.ch/reports/2016/201610/wlcg/WLCG_All_Sites_LHCB_Oct2016.html
All okay.

Site explanations:

Birmingham: ongoing issues with the older storage nodes which are now mitigated.
QMUL: ATLAS’s own figures suggest site is at 95%! We need further investigation.
Bristol: TBC.
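
For reference, the gap between the two figures in each pair can be illustrated with a minimal sketch. It assumes the usual WLCG convention that availability is measured against all known time while reliability also excludes scheduled downtime. The function names and the hours used are illustrative only and are not taken from the SAM reports.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of known time during which the site passed its tests."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """As availability, but scheduled downtime is not counted against the site."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the period")
    return up_hours / expected


if __name__ == "__main__":
    # Made-up period: 100 hours, 85 up, 10 of the 15 down hours were scheduled.
    a = availability(85, 100)
    r = reliability(85, 100, scheduled_down_hours=10)
    print(f"{a:.0%}:{r:.0%}")
```

Under these assumptions, a site whose two figures are equal had no scheduled downtime in the period, so all of its lost time counts against both measures.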

3. With support from (and thanks due to) Alessandra several sites continued to investigate their accounting discrepancies with ATLAS figures last week:
– Liverpool: now within 0.3% following use of revised ARC/Condor and VAC scripts.
– ECDF: Their REBUS was ‘wrong’. The HEPSPEC06 per core figure is known to be consistently too low (mixed hardware) and ATLAS use the lowest value. ECDF could republish back to April, apart from August.
– QMUL: issues arose due to rebalancing of their nodes – the types of node changed. There is little we can do to correct the figures at this stage.
– RHUL: New benchmarks now work. The site may need to republish with these new benchmarks but this will need to be scheduled with APEL and will take a few days at least.
– Brunel: There is nothing the site can do to fix this at the moment. The suggestion is to use the ATLAS figures.

The lesson here is that we need to regularly check the figures and we will aim to do this now on a monthly basis.
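
A monthly check of this kind could be sketched as below. This is only an illustration: the record layout, the 5% tolerance and the site names are hypothetical, and the figures are made up rather than taken from APEL or ATLAS.

```python
from dataclasses import dataclass


@dataclass
class SiteAccounting:
    site: str                      # hypothetical site label
    published_hs06_hours: float    # what the site published
    experiment_hs06_hours: float   # what the experiment recorded


def relative_discrepancy(rec):
    """Signed fractional difference of published vs experiment figures."""
    if rec.experiment_hs06_hours == 0:
        raise ValueError(f"{rec.site}: experiment figure is zero")
    diff = rec.published_hs06_hours - rec.experiment_hs06_hours
    return diff / rec.experiment_hs06_hours


def flag_sites(records, tolerance=0.05):
    """Return (site, discrepancy) for sites outside the tolerance."""
    flagged = []
    for rec in records:
        d = relative_discrepancy(rec)
        if abs(d) > tolerance:
            flagged.append((rec.site, d))
    return flagged


if __name__ == "__main__":
    # Made-up monthly figures for two placeholder sites.
    month = [
        SiteAccounting("SITE-A", 1003.0, 1000.0),
        SiteAccounting("SITE-B", 800.0, 1000.0),
    ]
    for site, d in flag_sites(month):
        print(f"{site}: published figure differs from experiment by {d:+.1%}")
```

The tolerance would need to be agreed; the sign of the discrepancy shows whether a site is under- or over-publishing relative to the experiment.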

4. Andrew noted last week that the Cherenkov Telescope Array (CTA) https://web.cta-observatory.org (a multisite gamma ray observatory that works by detecting atmospheric Cherenkov light) would be a good VO to explore supporting in the UK with 12 UK universities plus RAL involved and PB storage requirements.

5. Good GridPP participation in the SKA/SRC-GridPP workshop in Manchester last Wednesday/Thursday: https://indico.cern.ch/event/570594/. Talks have been uploaded (thanks to everyone who contributed) and summary notes will follow.

SI-6 Tier-1 Manager’s Report (GS)
———————————
General:
Bulk patching for security bug CVE-2016-5195 (or “DirtyCOW”) carried on last week. Castor was taken down on Tuesday (1st Nov) for the patching and reboots. Still some mopping up to do but large majority of systems done.

Castor: (No change from last week’s report):
– As reported before, the testing of Castor 2.1.15 is largely complete. Owing to staff availability this update will be carried out in the New Year, with the intention of completing it by the end of January.
– We have seen problems on the “AtlasScratch” instance in Castor. This is a disk-only pool with a small number (only 5) of old disk servers. A plan to merge this disk pool into the larger AtlasDataDisk has been developed. This will alleviate the bottleneck of this being served by a small number of old disk servers. A similar merger is also proposed for LHCb (merging the smaller LHCbuser disk pool into the larger LHCbDst one).

Tape System:
– The intervention (by Oracle) to replace the fixings for the rails used by the handbots within the Tier1 tape library took place successfully last Wednesday (2nd November).
– We have started the migration of LHCb data from ‘C’ to ‘D’ tapes.

Networks:
– There was a break in RAL’s main connectivity to Janet on Wednesday morning (2nd Nov). The site link failed over transparently to the backup connection via Bristol. At the same time, and for the same reason (a fibre cut – I believe in Reading), one of the Tier1’s OPN links also failed. We ran with a single 10Gbit connection until the repair was made. In both cases this was fixed during the afternoon, when our OPN link reverted to operating at 20Gbit.

Services:
– On Friday there was a problem with the database systems behind the “tests” FTS service which is used by Atlas. We requested that Atlas flip over to our “production” FTS service.
– Our CVMFS Stratum0 server has been replaced with new hardware.

Availabilities for October:
– OPS: 92.8% (effect of outages for security patch)
– Alice: 100%
– Atlas: 100%
– CMS: 99%
– LHCb: 100%

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
No meeting and nothing to report.

SI-8 External Contexts (PG)
———————————
Nothing to report.

REVIEW OF ACTIONS
=================
605.1: DK will investigate costs and timescales of upgrading the OPN Link to 30 and report back to PMB. Ongoing.
606.3: AS will propose a convenient date for Tier1 review and circulate to PMB for consideration. Ongoing.
607.2: PG will produce a spreadsheet containing explicit detail on Capital and Resource for Tier1, as well as Tier1 and Tier2 pledges, to include LHC requirements. Ongoing.
607.4: ALL to contribute to the OSC Project Status Report. (Almost complete) Ongoing.
607.8: JC to contribute Deployment Status for OSC Report. Ongoing.
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier1 network availability/performance.
610.2: All – review Pete’s comments in the metrics spreadsheet and act accordingly.
610.3: AS Attempt to get tape media re-classified from resource to capital.
610.4: AS/DB Contact Tony Medland to get new budget allocation (regarding extra capital) in writing so we can start procurement.
610.5: AS Provide numbers/details for H2020 bids. DB will contextualize them.
610.6: GS Produce report on how Tier1 missed that a very low number of CMS jobs were running and therefore fell significantly behind running the CMS re-reco jobs.

ACTIONS AS OF 07.11.16
======================
605.1: DK will investigate costs and timescales of upgrading the OPN Link to 30 and report back to PMB. Ongoing.
606.3: AS will propose a convenient date for Tier1 review and circulate to PMB for consideration. Ongoing.
607.2: PG will produce a spreadsheet containing explicit detail on Capital and Resource for Tier1, as well as Tier1 and Tier2 pledges, to include LHC requirements. Ongoing.
607.4: ALL to contribute to the OSC Project Status Report. (Almost complete) Ongoing.
607.8: JC to contribute Deployment Status for OSC Report. Ongoing.
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier1 network availability/performance.
610.2: All – review Pete’s comments in the metrics spreadsheet and act accordingly.
610.3: AS Attempt to get tape media re-classified from resource to capital.
610.4: AS/DB Contact Tony Medland to get new budget allocation (regarding extra capital) in writing so we can start procurement.
610.5: AS Provide numbers/details for H2020 bids. DB will contextualize them.
610.6: GS Produce report on how Tier1 missed that a very low number of CMS jobs were running and therefore fell significantly behind running the CMS re-reco jobs.
612.1: DB will draft text for Section 3 of the OSC project management plan by tomorrow morning.
612.2: PG will finalise and submit OSC documents by tomorrow.
612.3: PG will determine which small sites can undertake procurement this FY.
612.4: DC will circulate minutes from latest Technical Group meeting.