GridPP PMB Meeting 698

GridPP PMB Meeting 698 (18.02.19)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Steve Lloyd, Andrew McNab, Gareth Roy, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Tony Cass, Alastair Dewhurst, Jon Hays, Dave Kelsey.

1. GridPP6 Descope Scenarios
============================
DB circulated suggestions for descope scenarios that he can build into tables and text. Resource funding has a flat cash scenario that has been specified in the proposals based on the financial document sent to STFC at the last OSC meeting. This includes non-capital h/w at the Tier-1: it requires the amount of non-capital h/w funding the Tier-1 needs in GridPP6, plus travel funding (which DB will discuss with DK) at slightly less than current levels. Network costs are also required. All of this would be subtracted, and the remainder is staff, providing a baseline of c. 40 FTE with descope scenarios +/- 50%, 90% and 70%. 44 FTE (110%) is currently the upper scenario. DB discussed various options and strategies for descoping the proposal.
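As a rough illustration of the staff-effort arithmetic in this item, the short Python sketch below converts scenario percentages into FTE counts. The baseline of 40 FTE and the 50%, 70%, 90% and 110% levels are the figures quoted in these minutes; the variable and function names are hypothetical, and the baseline is itself only approximate ("c. 40 FTE").

```python
# Baseline staff effort quoted in the minutes (approximate: "c. 40 FTE").
BASELINE_FTE = 40

# Scenario levels as percentages of the baseline, as quoted in the minutes.
SCENARIO_PERCENT = [50, 70, 90, 110]


def scenario_fte(baseline_fte, percent):
    """Return the FTE count for a scenario given as a percentage of baseline."""
    return baseline_fte * percent / 100


# Map each scenario percentage to its FTE count.
fte_by_scenario = {p: scenario_fte(BASELINE_FTE, p) for p in SCENARIO_PERCENT}

for percent, fte in fte_by_scenario.items():
    print(f"{percent:>3}% of baseline: {fte:.0f} FTE")
```

The 110% entry reproduces the 44 FTE upper scenario mentioned above.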
DB shared an overview spreadsheet summarising how the spread of FTEs looks under the baseline and the above scenarios, and indicating where discussions with various PIs have taken place, which DB summarised. DB reminded the meeting that the WPs are in priority order and explained that effort at all sites is attributed against WPs on one spreadsheet, with an additional master spreadsheet containing each post with its effort broken down. There was some discussion on how this could be amended and the impacts. The effort to experiments is to complement the effort embedded at core sites.
GR circulated a spreadsheet compiling the requirements of non-LHC experiments. There was some discussion on suggested changes and GR explained the actions taken. DB confirmed that, if the numbers are correct, the 6 other VOs require c. 13% of CPU, 7% of disk and 10% of tape, which is not unlike the figures for GridPP5. There are some additional risks, and GR estimates c. 1% CPU for others and none in the form of storage; we should make the baseline h/w request 15% CPU, 7% disk and 10% tape.

2. GridPP6 Status and Plans
===========================
DB stated there is a long list of actions that must be taken to process the proposal within the 2-week deadline. PC will take the current version (v10), save it as v11, make changes and re-circulate. Actions required include:

– JC needs to finish work on the task table as per the emails at the weekend and rationalise it for incorporation into the submission.
– JC, AD, RJ, PG and DB need to write/refine the Section-3 descriptions of the tasks based on the table.
– GR needs to circulate collated resources requirements (done).
– PMB needs to decide on xx% resources for unspecified “others”.
– DB/GR need to create a big table of resource requirements and cost them for the proposal.
– PC needs to review the Network Requirement section 4.6 (4.f in the “plan”).
– AD needs URGENTLY to provide a draft of the Tier-1 service section so that DB can see what is going in there. Done.
– DB needs to create a Tier-1 effort matrix in the same form as his other tables.
– PMB need to respond to emails DB has sent to various people about institute effort.
– DB needs to finalise the Tier-2 effort matrix and inform CB members whom he has not already contacted.
– We need a description of stuff not covered in the currently unseen Tier-1 sections, such as:
  – Experiment Liaison Posts
  – GOCDB/APEL
  – Security (DB will discuss with DK)
  – WP4 development posts for Tier-1 (two posts) – PC will work on this.
– PC/AD: We need something about the WP4 posts (four posts) that are GridPP. This is our “response” to the tasks that we described earlier in Section-3. Here we propose specific posts to do the tasks.
– DB/GR/PG: We need the full management section written, covering:
  – Management and Administration overview (DB)
  – List of Project Milestones and Metrics (PG and GR)
  – List of Project Risks (PG and GR) – the Castor & Echo risks need to be re-worked and some specific risks regarding Brexit added. There were only 1-2 paragraphs in the GridPP5 proposal and the risk register was not included. DB and GR will look at the GridPP5 risk register, update it, and then write the section of the proposal linking to it.
– Impact (based on pathways document) JH/AS – AS is working on this in JH’s absence on annual leave and noted the guideline requirement for explicit points of deliverable impact, i.e. how it will be done (SL will consider and discuss with AS).
– DB: Baseline Capital Funding Requirements – once resource requirements are known.
– DB: Text to address scope reduction to 90%, 75% and 50% (requires some thought and discussion with PMB).
– DB: Resource Funding baseline and descope strategies – text once strategy is agreed (see earlier email).

3. AOCB
=======
PC & DB will be at the Scientific Computing Forum next Monday at 1pm so it may be preferable to move the PMB to 12. DB will consider this and send out an email.
The proposal must be submitted by 5 March (in 2 weeks). There may be very slight flexibility of 2-3 extra days, but anything longer than that is highly unlikely.

4. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing to report.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing to report.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing to report.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
1. The latest DPM release, DOME, is proving costly for those sites that have moved to it, because bugs are only being identified once the release is running in a production environment. This is particularly affecting Brunel and their CMS availability. Downgrading does not appear to be a straightforward option.

2. A useful security exercise has been carried out for our sites to test NGI Argus use and central banning configurations. Results will be shared in due course.

3. The January Tier-2 A/R results are below:
ALICE: http://wlcg-sam.cern.ch/reports/2019/201901/wlcg/WLCG_All_Sites_ALICE_Jan2019.pdf
All okay.

ATLAS: http://wlcg-sam.cern.ch/reports/2019/201901/wlcg/WLCG_All_Sites_ATLAS_Jan2019.pdf
QMUL: 87%:96%
Bham: N/A:N/A

CMS: http://wlcg-sam.cern.ch/reports/2019/201901/wlcg/WLCG_All_Sites_CMS_Jan2019.pdf
All okay.

LHCb: http://wlcg-sam.cern.ch/reports/2019/201901/wlcg/WLCG_All_Sites_LHCB_Jan2019.pdf
QMUL: 88%:100%
ECDF: 80%:80%

The site explanations are as follows:

QMUL: The issues are related to the planned 3-day power outage, which was prolonged by 24 hours due to a failure of the electronic lock on the server room.

Bham: The availability is 0 because the tests (as regards EGI I guess) are based on the SE, which has been essentially off for some time now. Mark has now removed the SE from the GOCDB. EOS has been up for ALICE all that time, so a re-computation could have been requested.

ECDF: 2 worker nodes became black holes for inbound LHCb jobs, possibly due to bad WN CVMFS caches. In addition, a power outage at the Edinburgh ACF facility led to unplanned downtime and a subsequent problem with NFS that took a while to understand, impacting cluster job submission and management for many days.

4. ATLAS is moving on with discussions about making additional sites diskless (there are 5 at present – excluding isolated sites). Details can be found here: https://indico.cern.ch/event/796418/contributions/3309170/attachments/1790565/2916945/ATLASDisklessSites.pdf. For GridPP this may affect Durham, Sheffield, Sussex, Brunel, Cambridge and Birmingham.

5. There was a WLCG meeting last week at CERN. Topics included an update on Information Systems Evolution; Benchmarking news; a report from the Container Technology WG; the latest on the WLCG Data Privacy position; a proposal for a new, more systematic approach to evaluating alternative implementations; and finally a report from a workshop on Cloud Storage Synchronisation and Sharing Services. All talks are linked from https://indico.cern.ch/event/739875/.

6. An ongoing background focus at the moment is moving to CentOS7. This is making gradual progress, with most sites having started, as recorded in the notes section of this table: https://www.gridpp.ac.uk/wiki/Batch_system_status. Glasgow is planning to migrate after their data centre move and Sussex is constrained by the shared nature of their resources.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
– Ongoing issues with ARC-CEs (no loss of availability but flaky performance). Catalin is slowly rolling out the updated ARC CE software on the production machines; this will be finished this week.

– Between Tuesday 12th and Thursday 14th February, we introduced a new data balancing algorithm into Echo. This is known as “upmap” and replaces the “re-weight by utilisation” method that we used before. I have attached plots showing how the spread of data across the disks changed with the switch. The amount of data on each disk is now a much more consistent percentage, which means we can fill more of the cluster before running into problems (it is also less effort to maintain).

– On Saturday 16th February at 01:30am, there were ~30 call-outs for Echo. The Echo on-call restarted the monitor machines and all the alarms cleared. There was ~1 hour of degraded performance.

– We have observed absolutely atrocious CPU efficiency on the farm in the past week for CMS and ATLAS. ATLAS has been between 40% and 50%, while CMS was down at 20% efficiency for some periods. These appear to be experiment problems (i.e. low efficiency was seen at many sites), but we are keeping a close eye on this as it could be related to either the ARC CE or Echo changes.

– Procurement
XMA CPU, delivery has been booked for 25th February.
DELL Storage, delivery has been booked in for 4th March.
XMA Storage, delivery expected mid March.
Extra disks for ClusterVision17 storage. The order with ClusterVision has been cancelled since they have filed for bankruptcy. We can source the individual drives from multiple vendors, so we can get delivery this year. Martin is exploring the best options.
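
To illustrate the Echo data balancing point in this report, the Python sketch below estimates how full a cluster can get before its fullest disk reaches a limit. The per-disk fill values and the 90% threshold are hypothetical and are not taken from the attached Echo plots; the model simply assumes that every disk keeps growing in proportion to its current fill.

```python
def usable_cluster_fill(disk_fill, full_threshold=0.90):
    """Average cluster fill reached when the fullest disk hits the threshold.

    Assumes (for illustration only) that all disks keep growing in
    proportion to their current fill level.
    """
    mean_fill = sum(disk_fill) / len(disk_fill)
    return full_threshold * mean_fill / max(disk_fill)


# Hypothetical per-disk fill fractions with the same average (60%).
uneven_fill = [0.40, 0.50, 0.60, 0.70, 0.80]    # wide spread across disks
balanced_fill = [0.58, 0.59, 0.60, 0.61, 0.62]  # narrow spread across disks

uneven_limit = usable_cluster_fill(uneven_fill)
balanced_limit = usable_cluster_fill(balanced_fill)

print(f"uneven spread:   cluster usable to about {uneven_limit:.0%}")
print(f"balanced spread: cluster usable to about {balanced_limit:.0%}")
```

With the same average fill, the narrower spread leaves more headroom before the fullest disk reaches the threshold, which is the effect described for the move to “upmap”.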

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

SI-7 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
696.2: RJ to provide ATLAS’ guidance for 2 FTE location at Tier-2 sites. Ongoing.
696.3: RJ to draft 4c(iii) in the Plan2 document: A description of WP2. Ongoing.
696.5: JC to draft 4c(iv) in the Plan2 document and work with DB on 4c(i). Ongoing.
696.7: JH to draft pathways-to-impact document and extract 1 page for proposal. (Update: JH is seeking clarification between provision in GridPP5 and requirements for GridPP6). Ongoing.
696.9: PC to coordinate development of 4c(v) WP4 description. Ongoing.
696.10: AD to provide draft of Tier-1 section 6b. Ongoing.
696.11: AD to contribute via PC to 4c(v). Ongoing.
696.13: DC to provide assistance to RJ with 4c(iii). Ongoing.
696.15: DB to draft 4c(ii) with help from JC. Ongoing.
696.16: DB to coordinate 4c(vi). Ongoing.
696.17: DB to continue to develop effort matrix once Experiment site preferences are known. Ongoing.
696.18: GR to continue to gather resource requirements. (Update – emails have gone out and responses awaited). Ongoing.
696.19: GR to liaise with PG on 4c(vi)2&3. Ongoing.

ACTIONS AS OF 18.02.19
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
696.2: RJ to provide ATLAS’ guidance for 2 FTE location at Tier-2 sites. Ongoing.
696.3: RJ to draft 4c(iii) in the Plan2 document: A description of WP2. Ongoing.
696.5: JC to draft 4c(iv) in the Plan2 document and work with DB on 4c(i). Ongoing.
696.7: JH to draft pathways-to-impact document and extract 1 page for proposal. (Update: JH is seeking clarification between provision in GridPP5 and requirements for GridPP6). Ongoing.
696.9: PC to coordinate development of 4c(v) WP4 description. Ongoing.
696.10: AD to provide draft of Tier-1 section 6b. Ongoing.
696.11: AD to contribute via PC to 4c(v). Ongoing.
696.13: DC to provide assistance to RJ with 4c(iii). Ongoing.
696.15: DB to draft 4c(ii) with help from JC. Ongoing.
696.16: DB to coordinate 4c(vi). Ongoing.
696.17: DB to continue to develop effort matrix once Experiment site preferences are known. Ongoing.
696.18: GR to continue to gather resource requirements. (Update – emails have gone out and responses awaited). Ongoing.
696.19: GR to liaise with PG on 4c(vi)2&3. Ongoing.