GridPP PMB Meeting 679

GridPP PMB Meeting 679 (17.09.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Alastair Dewhurst, Roger Jones, Dave Kelsey, Andrew McNab, Gareth Roy, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Tony Doyle, Pete Gronbech, Steve Lloyd.

1. Tier-1 Review Summary
========================
The Tier-1 review took place last week and is summarised below; the PMB agreed this was a very useful meeting for all concerned. Full minutes will be circulated in due course, AD will send out a resource spend plan for this year, and there will be a few follow-up meetings. DB confirmed the PMB will need to agree and formalise any actions that come out of the review.
– Euclid is one of a number of PPAN science programmes beyond the LHC that we explicitly support. This support is strategically important to GridPP and we reaffirm that we approve and encourage appropriate effort in these areas.
– The 2018 Tier-1 procurement is urgent and should be prioritised (particularly in light of external circumstances).
– It is essential to fully understand issues with exposing services to LHCOPN and LHCONE (security & other).
– It is important to maintain good communication with the RAL networking team.
– For Tape Storage, we encourage an SCD strategy formed with other archival storage stakeholders.
– The feeling of “chronic staff shortage” in the Fabric Team may suggest a need for more active management of the workload and responsibilities. We wondered if some kind of shadowing by apprentices etc. could help.
– We support the move of Database from ORACLE to MySQL and Postgres for LHC technology where appropriate.
– We expect the number of callouts to be reduced as ECHO is better understood and better monitoring & policies are put in place.
– The new patching strategy where patching is classed as a security issue is a good idea and should be promoted.
– We congratulate the ECHO Team for taking a green-field project to a production service.
– CASTOR: a plan is needed with contingency to end the CASTOR Service before the start of Run3.
– We note the need for succession planning for good people: improve capture of necessary information.
– Improve ATLAS efficiency and batch utilisation, and improve usage monitoring.

2. GridPP6
==========
DB received an email from Tony Medland, who mentioned discussions with Sarah. He reiterated a commitment to WLCG, but funding is a challenge in supporting the LHC through Run3 and beyond. UKRI will bring in new funding opportunities, but access is not yet clear. The eInfrastructure landscape in the UK is changing and there are various reviews ongoing. He also mentioned the Balance of Programmes review, which will set STFC priorities within available funds. Tony has suggested some dates in mid-October for DB to visit Swindon for a meeting; PC will also attend to manage expectations relating to IRIS. DB will update the PMB thereafter.

3. CERN Rucio Project
=====================
AD circulated information on this over the weekend and summarised that CERN have submitted a proposal for 1 year of funding from Attract, an EU programme for scientific programmes or outputs that can be used by others. The aim is to demonstrate that Rucio would work with SKA, and they would like to undertake international transfers into Europe to follow the model. From the RAL perspective, AD has advised this is a good idea and the Rucio set-up at RAL could be used. This would be beneficial for us as it ties in with the Rucio project and makes production quality more robust. Ian Johnston is spending c. 0.5 FTE setting this up; RAL have some data nodes and the front-ends have been set up to run on virtual machines. The structure is therefore in place for this at RAL, and AD asked for comments from the PMB. IRIS is paying for this to be established and it is running on GridPP h/w.
DB commented that we should not be concerned over IRIS and GridPP distinctions at this stage, as this is a very good idea that we should support, though the timescales need to align with CERN's expectations for set-up and capabilities. PC strongly encouraged taking this type of initiative; involvement in the SKA side of things is going to contribute greatly to our resource for the UK. He also suggested Edinburgh should be involved and that we should look to extend this into GridPP6. There are staff in Edinburgh with spare effort who could contribute to the program side of this; he has emailed AD and Anna in this regard. AD confirmed this could be administered by non-RAL-based people and Edinburgh staff could assist. It was agreed that more discussion is needed on this going forward.

4. AOCB
=======
AD noted that HPSS comments could be perceived as unfair when going to tender. He has deleted any references to HPSS in talks made publicly available, and the PMB minutes since AD joined the PMB do not mention HPSS.

5. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing significant to report.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
There has been progress on running ATLAS jobs, with various configuration tweaks on the ATLAS side etc. We are now up to a stable 3000 concurrent jobs running and a total in the HD Condor queue of over 5000, but we should be at 8000. The beneficiaries are ALICE and LHCb, so the capacity is being used, but not by ATLAS. RAL-LCG2 Echo etc. are ongoing. AD commented that RAL are seeing improvements after the changes, but there remain some areas to review; see the Tier-1 report.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing significant to report.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
1. There was a WLCG GDB last week: https://indico.cern.ch/event/651357/ . Topics included an update on IPv6 (see next item) and one on efficiency improvements from Markus (the latter very similar to the MB talk previously given and summarised by Gareth at GridPP41).

2. On IPv6, the goal is for Tier-2s to have deployed dual stack on production storage (and perfSONAR if installed) by the end of Run2 (end 2018). Across WLCG the share on IPv6 is about 40% and the UK mirrors that. The status at GridPP sites is seen here https://www.gridpp.ac.uk/wiki/IPv6_site_status . The sites that are “on-hold” are generally waiting for University networking progress. The current status visible to WLCG is captured here https://twiki.cern.ch/twiki/bin/view/LCG/WlcgIpv6#IPv6Depl .

Budget and priorities have an impact, but DK noted JISC would like to see the PP roll-out succeed so that it could also be available to others. This is probably useful when we have a clear reason for using it: the WLCG Management Board has stated it wants it.

3. The final availability & reliability reports for August 2018 – with corrections applied by the experiments – can be found at: http://wlcg-docs.web.cern.ch/wlcg-docs/?dir=reporting/reliability-availability/2018/08-18
ATLAS: http://wlcg-sam.cern.ch/reports/2018/201808/wlcg/WLCG_All_Sites_ATLAS_Aug2018.pdf
QMUL: 76%:82%
Lancs: 80%:82%

ALICE: http://wlcg-sam.cern.ch/reports/2018/201808/wlcg/WLCG_All_Sites_ALICE_Aug2018.pdf
Bham: 60%:60%

CMS: http://wlcg-sam.cern.ch/reports/2018/201808/wlcg/WLCG_All_Sites_CMS_Aug2018.pdf
All okay

LHCb: http://wlcg-sam.cern.ch/reports/2018/201808/wlcg/WLCG_All_Sites_LHCB_Aug2018.pdf
QMUL: 77%:83%

The site explanations are as follows:

a. Lancaster: The site was migrating their DPM headnode to new hardware and OS which did not go smoothly; the nodes were out of action for nearly 3 days and running with poor efficiency for another 3. The site also had a power outage due to blowing a power distribution board late on Saturday the 19th, and the site did not recover fully until Monday afternoon. Technically the outage only affected half the kit, but it was the half running all the vital services so it had a bigger impact. A downtime is planned in the coming weeks to rebalance the phases with the aim of reducing the risk of further power outages.

b. Birmingham: “Real” problems with EOS with pool nodes crashing and files being corrupted. The nodes were taken offline to understand the cause.

c. QMUL: TBC

4. Alessandra and Alastair started an ops discussion last week about the future of the BDII and it was agreed to move towards dropping it, which was communicated to the WLCG Ops Coordination meeting on Thursday. An immediate question was about the impact on ARGO (EGI) tests which we need to consider. The LHC VOs do not rely on the BDII anymore. Most other VOs in the UK are now supported with GridPP DIRAC which can be made to read the proposed JSON formatted data. The UK plan is to aim for BDII removal by April 2019 when T1 effort drops. More background and details can be found in Alastair’s slides: https://tinyurl.com/y75ktdlc .

5. There have been observed issues with perfSONAR affecting the UK mesh. It is likely this is a configuration issue related to IPv6 routing following a recent upgrade of perfSONAR. There was some discussion on this and AM will write to Frederico to enquire if there is a UK strategy in this regard and to understand the timescales involved.

6. EGI is running a user satisfaction survey for Federation Services. I will complete this during the week and welcome wider input. Areas: services used; overall satisfaction; quality of service rating; quality of documentation and support; and suggestions for improving the services.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
We are still waiting on the ClusterVision delivery. They have confirmed they have received the order but have no delivery date yet.

– XMA storage nodes have been weighted up to 50% (aim to weight them to 100% in the next week and a bit).

– There have been several tickets regarding the LFC. Users are getting ’null’ entries. We are comparing with the previous database to see if it is a new problem.

– Work is ongoing to identify the reasons behind ATLAS's low usage of the batch farm. ATLAS made several changes last week and the Tier-1 is looking at the impact. Over the weekend ATLAS was at 66% of fair share. Multi-core ATLAS jobs are running more stably but there are lots of small drops and increases (these do not happen for CMS).

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
The MB is tomorrow.

SI-7 External Contexts (PC)
---------------------------
a) PC mentioned the Future European Strategy and related UK activities led by high-profile physicists; there is a planned meeting in Birmingham next week. There was some discussion over what will be written about computing, and PC has reached out to various participants offering a contribution on computing if required. He cannot attend the Birmingham meeting and has asked for sight of what has been written and offered again to contribute our expertise. RJ plans to attend, but noted the agenda does not state an intention to discuss computing. There was discussion of what the best strategy should be regarding CAP and whether they will be contributing something or require our input. DC confirmed that CAP have not been approached to contribute, so it is important to establish whether a statement on required computing will be incorporated; DC will follow up.
b) PC received a notice from STFC (Strategy, Policy and Comms) asking for comments on a draft document on the EOSC position to send to BEIS. DB did not receive the email. This is an STFC paper on the EOSC hub and working with BEIS to develop a UK position. It was suggested this may be something that Claire Deveraux and her team could be developing, but this is not yet clear. Ian Collier or Juan may have some comments to make on this, and PC has sent some strong comments since many universities in the UK have not heard of EOSC. This should be supported to harmonise software; however, the paper appears to imply that EOSC will provide physical access to research resources, which is not the case. After PC has undertaken some further enquiries, AS will discuss with Ian Collier and Juan to try to clarify the situation.

c) DB and PC are at the CERN SCF on Thursday. One thing that may arise is widening the remit of WLCG for use in the Neutrino platform and the potential consequences. AS enquired whether CERN would broaden the governance of WLCG; Fermilab have stated there should be some WLCG collaboration for DUNE. These discussions are at an early stage.

d) ESCAPE project (to bring together the EOSC infrastructure): JC asked if there is any connection or overlap with us; there does not appear to be any.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
673.2: AD will provide the PMB with an overview of strategy for tapes and drives for the remainder of GridPP5 and GridPP6. (Update: elements covered in Tier-1 Review, others will be addressed in the background document by end September). Done.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
675.2: RJ to sign off report on Tier-1 LHC usage. Ongoing.
678.1: RJ to finalise the Experiment Support background document by end September.
678.2: DK to finalise the Security, Trust and Identity background document by mid October.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September.
678.4: SL to finalise the Tier2 background document by end September.

678.5: JC to finalise the Storage background document by end September.
678.6: DB will send an email to the Collaboration Board. Ongoing.
678.7: DB, PG and GR will discuss how GR can take forward Pledges. Ongoing.

ACTIONS AS OF 17.09.18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
675.2: RJ to sign off report on Tier-1 LHC usage. Ongoing.
678.1: RJ to finalise the Experiment Support background document by end September.
678.2: DK to finalise the Security, Trust and Identity background document by mid October.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September.
678.4: SL to finalise the Tier2 background document by end September.

678.5: JC to finalise the Storage background document by end September.
678.6: DB will send an email to the Collaboration Board. Ongoing.
678.7: DB, PG and GR will discuss how GR can take forward Pledges. Ongoing.