GridPP PMB Meeting 671 (18/06/18)
=================================
Present: Dave Britton (Chair), Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Andrew McNab, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Tony Cass, Pete Clarke, Alastair Dewhurst, Roger Jones, Dave Kelsey, Steve Lloyd.

1. IRIS h/w allocation
======================
Some of the PMB are also members of the IRIS PB. IRIS has capital funding for four years starting this FY and there is a shortened bidding process for h/w; £500K will soon be agreed for GridPP to provide resources for IRIS from our sites. The PMB must decide at which site(s) to deploy the h/w and how to underwrite it. As there is a storage component, it was agreed this should remain with the large sites. Manchester has already received IRIS funding and the preference is to engage other sites, so it can be eliminated from this round; Glasgow can also be eliminated on the basis that the new Kelvin Data Centre has been badly delayed and will not be ready to receive new hardware this FY. QMUL, Imperial and Lancaster confirmed they are interested. A commitment to OpenStack is not a mandatory requirement, as the existing Grid and Cloud interfaces can accommodate IRIS, though there is an understanding that OpenStack may also be provided (there is now a baseline IRIS interface document). There was a discussion on the best spread of h/w across sites. DB will discuss with the sites involved and make recommendations to the PMB for consideration before approaching IRIS, making clear that new communities will see significant benefits as a result of this investment.
Action 671.1: DB will discuss IRIS h/w allocation with the sites involved and make recommendations for the PMB to consider.

2. Dell Framework Agreement?
============================
After sponsoring GridPP40, Dell have suggested replicating the previous Dell/ATLAS agreement within the UK to develop a pricing plan for GridPP for specific, specialised kit. The logistics would be challenging, but the cost effectiveness of practical, common specs and configurations that fit the various site requirements and potential IRIS requirements was discussed. It was agreed that both IRIS and GridPP specs should be explored, though the timeframe is tight and this should be completed in the next few weeks. PG and AM will work with Ian Collier to check whether there is a common requirement across the Grid that can be negotiated with Dell (e.g. Storage, Compute, Configurations).
Action 671.2: PG and AM to check if there is a common requirement across the Grid that can be negotiated with Dell for a framework agreement (e.g. Storage, Compute, Configurations).

3. Atlas Delivery
=================
This week’s Tier-1 Manager Report highlighted under-delivery to ATLAS, caused by a problem at the Tier-1 that prevented the resources being used rather than by a lack of demand for them. Concern was raised that this was not picked up at the Resources Meeting, which was not well attended. It was suggested that at future Resources Meetings the experiments should sign off on delivery figures and that PMB members should attend where possible. Discrepancies should in future be highlighted and signed off by the attendees, who can agree the reasons for any discrepancies, with a summary provided to the PMB.

4. AOCB
=======

a) PMB dates for summer were agreed as:
25th Jun
2nd Jul
16th Jul
30th Jul
13th Aug
28th Aug F2F

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
There is nothing to report: no meetings have been held because no topics were suggested for discussion. It was suggested that h/w could be discussed, and this highlights a potential flaw with the meeting model, though other commitments at this time of year are also impacting attendance.

SI-1 ATLAS Weekly Review and Plans (RJ)
—————————————
RJ was not in attendance and no report was submitted.

SI-2 CMS Weekly Review and Plans (DC)
————————————-
Nothing significant to report. One positive point is that opportunistic use of the HLT farm (during inter-fill periods and shut-downs) has been successful and is deemed comparable in size to a Tier-1.

SI-3 LHCb Weekly Review and Plans (PC)
————————————–
Nothing significant to report.

SI-4 Production Manager’s report (JC)
————————————-
No report submitted.

SI-5 Tier-1 Manager’s Report (AD)
———————————
– On 14th June CMS completed their migration from Castor to Echo for disk. If you are looking at any CMS monitoring, T1_UK_RAL_ECHO was renamed to T1_UK_RAL, and the old T1_UK_RAL (which was Castor) was renamed to XT1_UK_RAL; it is no longer reporting anything but is still viewable for historical reporting.

– On Tuesday 12th June there was an internal network problem, which took around 3 hours to resolve. This seemed mostly to affect facilities rather than Tier-1 services. The concern is not that a piece of hardware failed, but that traffic does not always switch seamlessly to the backup links (especially for IPv6).

– ATLAS have spotted that they have not been running near their pledged amount (see attached screenshot). We have started to investigate this and found multiple reasons (not all were the Tier-1’s fault):
• August (with impact into September) Echo data loss
• October – December, incorrect ATLAS configuration of the UCORE queue (sending single-core jobs to multi-core queues and vice versa).
• Christmas period, lack of work submitted by ATLAS.
• January – March, we can’t see a problem other than the fact that our deployed capacity was 99% of the WLCG pledged amount. General inefficiencies (e.g. broken WNs, not being able to completely fill every node) pushed the total amount ATLAS got lower. This might be a relevant question for future procurements, especially if money is tight: do we need to deploy just the required HEPSpec, or do we need to ensure that the wall clock * HEPSpec for each experiment meets (approximately) their pledge? (See the illustrative sketch after these bullets.)
• April – now, this year’s procurement has not been deployed, so we are significantly down on what we should be running. The majority of this year’s CPU procurement should be deployed into production this week, which should fix the capacity problems.
• Note that Tim Adye is also looking at the changes in the workflows ATLAS have been sending to us, in case there are some types of jobs that aren’t being sent (there is still a significant unexplained fraction).
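
The capacity question raised in the January – March bullet can be made concrete with a simple back-of-the-envelope calculation. The sketch below (plain Python; all numbers are hypothetical illustrations, not RAL or ATLAS figures) shows that if only a fraction of the deployed HEPSpec is converted into delivered wall clock * HEPSpec, the deployment has to be scaled up by the inverse of that fraction to meet the pledge.

    # Minimal sketch of the pledge vs delivery question above.
    # All numbers are hypothetical, for illustration only.
    # Over an accounting period, delivered work is roughly
    #   deployed_HS06 * wall-clock utilisation,
    # so deploying exactly the pledged HEPSpec under-delivers
    # whenever utilisation is below 100%.

    pledge_hs06 = 100_000    # hypothetical WLCG CPU pledge (HS06)
    deployed_hs06 = 99_000   # deployed at 99% of the pledge (cf. Jan-Mar)
    utilisation = 0.95       # hypothetical wall-clock utilisation (broken WNs,
                             # nodes not completely filled, draining, etc.)

    delivered_hs06 = deployed_hs06 * utilisation
    print(f"Delivered: {delivered_hs06:,.0f} HS06 "
          f"({delivered_hs06 / pledge_hs06:.1%} of pledge)")

    # Capacity that would have to be deployed so that the delivered
    # wall clock * HEPSpec still (approximately) meets the pledge:
    required_hs06 = pledge_hs06 / utilisation
    print(f"Required deployment: {required_hs06:,.0f} HS06 "
          f"({required_hs06 / pledge_hs06:.1%} of pledge)")

With these assumed numbers the site delivers about 94% of the pledge, and would need to deploy roughly 105% of the pledged HEPSpec to compensate for the inefficiencies.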

– The relevant Castor port is being reported in the BDII again (resolving problems that some small VOs had). With Stephen Burke’s help we found we were able to place a static text file in a directory on the site BDII.

– We are running 24-core SKA jobs at RAL. Our batch system would allow them to run 64-core jobs, although there seems to be a 24-core limit imposed by DIRAC at the moment. The batch farm is also being (re-)enabled to allow bursting into the new OpenStack cloud infrastructure.

– LHCb have now migrated over 500TB (and increasing) to Echo. Hardware continues to be deployed.

SI-6 LCG Management Board Report of Issues (DB)
———————————————–
Nothing to report.

SI-7 External Contexts (PC)
———————————
Nothing to report.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). (Update: DB will write to DK with DC in copy with proposed way forward). Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. (UPDATE: JC will provide a table with information for discussion at June F2F). Done.
665.2: AD will produce a procurement schedule for the coming FY that builds in an additional month to buffer any delays in the future. Ongoing.
667.1: PG to clarify with STFC what exactly is required for the OC feedback wrt the Capital reporting. Ongoing.
667.2: H/w planning to be done before the next OC to provide the OC with details of the shortfall in funds. Ongoing.
669.1: DB to respond to a request for resources from DUNE. Done.
670.1: Discussion papers to be written for GridPP41 (Storage, Expt support, others?). Ongoing.
670.2: DB and PG will consider percentage splits of CPU/Disk. Ongoing.
670.3: DB, PG and PC will undertake a high-level discussion of planning manpower for GridPP6. Ongoing.
670.4: PG to invite Tim to give a talk on the ATLAS Liaison role at GridPP41. Ongoing.

ACTIONS AS OF 18/06/18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). (Update: DB will write to DK with DC in copy with proposed way forward). Ongoing.
665.2: AD will produce a procurement schedule for the coming FY that builds in an additional month to buffer any delays in the future. Ongoing.
667.1: PG to clarify with STFC what exactly is required for the OC feedback wrt the Capital reporting. Ongoing.
667.2: H/w planning to be done before the next OC to provide the OC with details of the shortfall in funds. Ongoing.
670.1: Discussion papers to be written for GridPP41 (Storage, Expt support, others?). Ongoing.
670.2: DB and PG will consider percentage splits of CPU/Disk. Ongoing.
670.3: DB, PG and PC will undertake a high-level discussion of planning manpower for GridPP6. Ongoing.
670.4: PG to invite Tim to give a talk on the ATLAS Liaison role at GridPP41. Ongoing.
671.1: DB will discuss IRIS h/w allocation with the sites involved and make recommendations for the PMB to consider.
671.2: PG and AM to check if there is a common requirement across the Grid that can be negotiated with Dell for a framework agreement (e.g. Storage, Compute, Configurations).