GridPP PMB Meeting 675 (30.07.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew McNab, Louisa Campbell (Minutes).

Apologies: Roger Jones, Dave Kelsey, Andrew Sansum.

1. PM Report
============
a) Q118 Report summary
PG summarised some aspects of the report (attached to the agenda), which will now be signed off as complete for submission. The Deployment and Operations section remains to be fully completed, for various reasons. DB thanked PG for the report and noted that efforts should be made to ensure future reports are submitted in a more timely manner. Someone should be identified to draw this information together in future; various names were suggested and will be approached to see if they may be interested in taking this forward for the remainder of GridPP5.
b) QC Finance sheet
PG summarised the content of the report (attached to the agenda), as requested at the last OSC in March 2018. There was some discussion on the content, on how the relevant figures were reached, and on the rationale behind them. The report is effectively a snapshot of the position in March 2018 and does not reflect the position at the end of FY18 or whether the late arrival of h/w has had an impact; AD is producing a financial forecast to include proposed spending and a defence of existing spend.

2. Tier-1 Resource Meeting LHC Sign-off
=======================================
The meeting took place last week and two spreadsheets were produced (attached to the agenda), including one on Tier-1 LHC usage. In the top left corner of the normal Tier-1 spreadsheet, column D has changed to show the average per month versus allocation, and column G outlines usage. PG summarised the position regarding allocations; the VOs will be required to sign off on their allocation and usage of, for example, Monte Carlo. There was some discussion on the rationale behind this and AM will look into it further. DB summarised that, because we are near capacity, if an experiment has a low-usage day as a result of its own workload it is challenging to make this up on a subsequent day. It was agreed that these types of discussions with the VOs are extremely useful. Non-LHC VOs are usually comfortably above their pledge, but the three larger VOs have liaisons and they will write a short summary paragraph on these aspects going forward. Much of the ATLAS and CMS change relates to the temporary impact of the migration from Castor to Echo. PG will send the spreadsheet to RJ, DC and AM to sign off at the next PMB so they can see, understand and agree the figures.
ACTION 675.1: DC to sign off report on Tier-1 LHC usage.
ACTION 675.2: RJ to sign off report on Tier-1 LHC usage.

3. Tier-2 h/w Allocation
========================
PG summarised another spreadsheet (attached to the agenda), LHC Vs Others 16-18. DB has drafted a brief paper and sent it to SL and PC for initial comments before circulation to other PMB members. PC and SL will send comments back to DB for discussion at a later date, and DB has since shared the draft with the rest of the PMB for comments.

4. GridPP41 Agenda
==================
A skeleton agenda was discussed at the last PMB. PG will develop this and attach names to talks. At this stage we should agree the grand narrative and seek feedback. PG, JC and AM will ask about potential contributors at the Technical Meetings. Each session could be based on the subject areas discussed, with one on Security from David Crooks and another on LHC at home (Beccy Parker).
Session 1 – after DB's introduction would be a good slot for an IRIS talk from PC, followed by Beccy.
Other sessions will be allocated to the different papers: Tier-1 on Thursday morning, and a Resource and Tier-2 session on Friday morning. Dan Protopopescu has offered an NA62 talk (NA62 have very high efficiency). The missing Tier-1 and other content needs to be developed ASAP. Slots will follow the same template as previous meetings, starting Wednesday PM with team activities on Thursday afternoon; there are five sessions still to complete, and PG will match these with the papers and see how this works out. There is an F2F PMB on the Tuesday, and an additional meeting room has been arranged for technical exchange meetings etc. in parallel with the PMB. AD will update the Tier-1 sessions on Indico.

5. AOCB
=======
AM asked if people were having positive experiences using Zoom instead of Vidyo. PC confirmed Zoom was much better, but if the web interface is used documents cannot be shared. It is not clear whether Zoom attracts a cost.

6. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
No technical meetings have taken place. The person due to take over from Alastair was due to start and was awaiting resolution of visa matters, but has now had to withdraw due to personal family-related issues. However, the post does not need to be re-advertised, as an alternative candidate from the current talent pool, who is above threshold, will be interviewed later this week.

SI-1 ATLAS Weekly Review and Plans (RJ)
—————————————
RJ was not in attendance and no report was submitted.

SI-2 CMS Weekly Review and Plans (DC)
————————————-
Nothing significant to report.

SI-3 LHCb Weekly Review and Plans (PC)
————————————–
AM confirmed we have started running low-priority jobs on disk servers; CERN is keen on this. This has been discussed: these servers are the same as the SCPU servers, which differs from practice here.

SI-4 Production Manager’s report (JC)
————————————-
1. LSST have run into job-flow issues using Ganga due to a bug in the CVMFS implementation; resolving this is a further step towards being ready for their DESC data challenge. A new approach is to be tested shortly. Reading between the thread responses, it looks like the user could benefit from Ganga support. DC suggested they should not be using Ganga, as this may not be the correct forum; JC will instigate a discussion on this.

2. GOCDB was unavailable for several hours on 25th July after power supply testing at RAL interrupted its database. The issues seen were timeouts after the landing page and inaccessibility over IPv6. The Tier-1 Manager's report provides more detail.
3. An additional VOMS server has now been deployed for LZ. The big LZ production run is now over and the VO has moved to running reconstruction jobs that require about 8GB of memory; these jobs mostly go to IC and RALPP.

4. There was an EGI Operations Management Board meeting 10 days ago: https://indico.egi.eu/indico/event/4155/. The main technical presentation was on udocker. Other updates:

– The EGI team are seeking NGI/ROC discussions with the various managers to obtain feedback, understand future plans and identify problems. Six meetings have been held so far, with the common themes being: managing distributed infrastructure with limited effort; lack of recognition of resource contribution to EGI; and lack of recognition of strongly performing Fedcloud Service Providers.

– EOSC-hub harmonisation of operations has agreed proposals in the main areas of operations governance, service monitoring and collaboration between member e-Infrastructures (an EGI-EUDAT MoU is in preparation).

5. DI4R is in October: https://www.digitalinfrastructures.eu/. DB attended the previous couple of meetings and may attend this one if appropriate.

6. Glasgow recently hosted an EGI CSIRT F2F meeting which was well received.

7. Several sites are now running queues with GPUs on the nodes. RALPP may join QMUL and Manchester. The main community using them at present is IceCube, with Pheno now also interested.

8. There was a July GDB, delayed by a week due to CHEP: https://indico.cern.ch/event/651355/. The main topics were: an update on the ARC technical workshop, an update on DPM, the direction of the authorisation WG, a report on the HEP Software Foundation and the latest news on Benchmarking WG activities.

9. The WLCG Tier-2 A&R (availability and reliability) figures for June are at http://wlcg-sam.cern.ch/reports/2018/201806/wlcg/.

ALICE (http://wlcg-sam.cern.ch/reports/2018/201806/wlcg/WLCG_All_Sites_ALICE_Jun2018.pdf): all okay.

ATLAS (http://wlcg-sam.cern.ch/reports/2018/201806/wlcg/WLCG_All_Sites_ATLAS_Jun2018.pdf): RHUL 85%:85%.

CMS (http://wlcg-sam.cern.ch/reports/2018/201806/wlcg/WLCG_All_Sites_CMS_Jun2018.pdf): Bristol 72%:72%.

LHCb (http://wlcg-sam.cern.ch/reports/2018/201806/wlcg/WLCG_All_Sites_LHCB_Jun2018.pdf): all okay.

Site explanations:

RHUL – Continuing SE headnode issues (time sync related). (*special note)

Bristol – Bristol's /hdfs kept "filling up" and a spate of local user jobs led to kernel panics/crashes on many WNs (around 10 at once), killing any grid jobs on them and making their /hdfs storage unavailable.

10. An ops meeting question arose last week about the GridPP41 meeting theme.

SI-5 Tier-1 Manager’s Report (AD)
———————————
– The 2017 XMA storage nodes are starting to enter production and provide allocatable storage. They have Mellanox cards and it has been observed that they have slightly higher packet loss than the other nodes. The packet loss is still considered "small" and would not prevent us continuing to put them into production; Martin believes there is probably a configuration setting that will fix it. It has also been observed that they have slightly higher memory usage than expected (44GB as opposed to 37GB for similarly specced machines). This is not a problem in itself but we are keeping an eye on it; it may be related to the packet loss issue. For completeness on the packet loss situation: during active periods we see 100–400k packets per second. A typical storage node may periodically drop 1 packet in 100,000 (see TypicalPacketLoss.png), whereas the XMA17 nodes have periods where 1 packet in 1,000 is dropped (see XMA17PacketLoss.png).
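
As a rough sense of scale, the quoted figures can be turned into drop rates per second. The sketch below is illustrative arithmetic only, using the packet rates and loss fractions stated above; it is not GridPP monitoring code.

    # Illustrative only: drop rates implied by the figures quoted in the minutes.
    def drops_per_second(packets_per_second, loss_fraction):
        """Expected dropped packets per second at a given loss fraction."""
        return packets_per_second * loss_fraction

    for rate in (100_000, 400_000):                    # active-period packet rates (pps)
        typical = drops_per_second(rate, 1 / 100_000)  # typical node: 1 in 100,000
        xma17 = drops_per_second(rate, 1 / 1_000)      # XMA17 node: 1 in 1,000
        print(f"{rate} pps: typical ~{typical:.0f} drops/s, XMA17 ~{xma17:.0f} drops/s")

At the upper packet rate this works out to roughly 4 drops per second on a typical node versus roughly 400 per second on an XMA17 node.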

– While testing is still ongoing for the majority of the ClusterVision 2017 storage nodes, a machine has been provided to the Ceph team, which they have put into the development Ceph cluster. We hope to put it into production in the next week or so. These machines have AMD CPUs, which we haven't used before. We will be in a position to see these in production before we submit next year's procurement.

– We had a meeting last week to go through the disk ITT. We have arranged an F2F meeting with SBS procurement specialists this week. (I know I need to circulate this year's procurement plan to the PMB.)

– CMS are having problems accessing data on Echo via XRootD, which turned the SAM tests red. This was tracked to an issue with their XRootD configuration via Singularity (containers within containers). They have the most complex setup and we are investing effort in better understanding it.

Summary of power testing
Every circuit breaker in R89 was tested from Tuesday 24th to Thursday 26th July, and the testing was completed successfully. Most racks were powered by dual PDUs on different circuit breakers, so in theory nothing should have gone down. 14 PDUs failed over the three-day period; this was within the expected number of failures, although at the highest end of our estimates. We had sufficient replacement PDUs.
Impact to Tier-1:
– 2 production Castor disk servers went down, with temporary loss of access to some files.
– The Castor transfer manager went down for ATLAS, with loss of the ability to schedule new transfers for ~20 minutes.
– A few SL5 machines also went down. No impact on service but a useful reminder to get them decommissioned properly.
– It is worth noting that two Echo storage nodes went down with no impact on service; this happened because their power cables weren't correctly fitted.
– There was an interruption to the GOCDB service. This is covered in the Production Manager's report, but in more detail: the GOCDB service was down for 45 minutes. It was not directly a result of the power testing but of a failed network switch (GOCDB is not on the Tier-1 network, so this failure did not impact the Tier-1). The read-only version located off site stayed up. There was an hour and a half between the EGI broadcasts (saying it was down and then resolved), and the broadcast to say it was down was sent out ~20 minutes after the service went down. Therefore there was about two hours between the RAL instance becoming unavailable and the broadcast to say it was fixed, but the actual impact was much less.
– Some WNs needed to be taken down; this simply caused a temporary reduction in batch farm capacity.

SI-6 LCG Management Board Report of Issues (DB)
———————————————–
There was an MB on 17 July. Two significant items arose:

A) A quantitative estimate of potential savings that could be made through efficiency gains in running the WLCG infrastructure. There is a working group with a couple of UK-based members (AS and Gareth Roy). They looked at the efficiency of various elements, examined potential improvements and assessed the impact of vectorisation, linker-loader strategies, organisation of storage, reducing the number of replicas, etc. They have attempted to classify the big and easy gains in a summary that Ian Collier has circulated to some members; the summary identifies the big gains (it is very abstracted and country/site specific, though it has some valuable information). They conclude that WLCG managing the storage of 15 sites, reducing data redundancy and reducing data replication all provide gains, along with smaller gains from scheduling and site inefficiencies. Exploiting modern CPU architectures etc. is challenging to quantify but provides gains.

B) Benchmarking – working group activity and plans. The new HEPSPEC 2017 test suite does not give results much different from HEPSPEC 06, so there is not much motivation to move to the new spec. The working group has been looking at fast benchmarks and the report covers some, e.g. the ATLAS KB and the Dirac benchmark DB12 used by ALICE and LHCb. These provide fast benchmarks but cannot be used to override the long-running benchmarks (used for performance and for understanding the h/w): the prediction from the short benchmarks is not as robust as that from the longer-running benchmarks, and the variance is too large to be useful for procurement. Because of various factors this is a work in progress.

SI-7 External Contexts (PC)
———————————
The IRIS call for capitalisable projects has closed and DB is on the evaluation committee. DC submitted one from IC for Dirac. AD submitted a proposal with a multi-VO submission to re-evaluate and support SKA; he used Dirac and VML in the context of GridPP Dirac as an example, and noted a slight overlap and dovetailing with DC's bid. Ian did a lot of work on this and has submitted one on Apple and possibly Cloud, and one on authentication with IAMS. PC noted their bid was good but was not submitted, though he noted the opportunity should be explicit about how they assist other users. There was some discussion on the detail of the various proposals. JC also submitted something for V-Cycle and VMs.

PC noted the Big Ideas meeting, which produced a few things he will write up for software and engineers if funds can be identified.

HL-LHC – PC is talking to Claire Sheppard and will bring up whether HL-LHC aspects should be progressed. DB mentioned a meeting with Charlotte Jamieson last week about identifying one large, two medium and two small possible future projects to be written up and kept in case short-notice funding comes up. Large project: super Dirac; medium projects: Dirac 3 and the college research software engineer proposal that PC will write up; other suggestions were discussed for the small projects. The meeting was challenging to hook into and HTC was not discussed, but it is relevant to other projects including and beyond GridPP6. After the meeting DB circulated an email suggesting that funding effort to support HTC software would benefit the High Luminosity LHC, so a small project of £5M could be put forward to develop useful outputs for community-approved projects. Jeremy Yates recognised there was an omission on HTC and there were possibilities for future changes.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). (Update: DB will write to DK with DC in copy with proposed way forward – almost complete). Ongoing.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Ongoing.
667.1: PG to clarify with STFC what exactly is required for the OC feedback wrt the capital reporting. (Update: PG, AS and AD to meet Friday to create a roadmap to monitor progress.) Done.
667.2: h/w planning needs to be done before the next OC to provide the OC with details of the shortfall in funds. Ongoing.
671.2: PG and AM to check if there is a common requirement across the Grid that can be negotiated with Dell for a framework agreement (e.g. Storage, Compute, Configurations). Ongoing.
672.3: RJ, DK and AM to draft the Experiment Support background document. Ongoing.
672.4: DK to draft the Security, Trust and Identity background document. Ongoing.
672.5: AD to draft the Tier1 background document. Ongoing.
672.6: JC, SL, AM and PG to draft the Tier2 background document. Ongoing.
672.7: PG will consider the agenda for GridPP41 incorporating the GridPP6 Background Documents. Ongoing.
673.2: AD will provide the PMB with an overview of strategy for tapes and drives for the remainder of GridPP5 and GridPP6. Ongoing.

ACTIONS AS OF 30.07.18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). (Update: DB will write to DK with DC in copy with proposed way forward – almost complete). Ongoing.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Ongoing.
667.2: h/w planning needs to be done before the next OC to provide the OC with details of the shortfall in funds. Ongoing.
671.2: PG and AM to check if there is a common requirement across the Grid that can be negotiated with Dell for a framework agreement (e.g. Storage, Compute, Configurations). Ongoing.
672.3: RJ, DK and AM to draft the Experiment Support background document. Ongoing.
672.4: DK to draft the Security, Trust and Identity background document. Ongoing.
672.5: AD to draft the Tier1 background document. Ongoing.
672.6: JC, SL, AM and PG to draft the Tier2 background document. Ongoing.
672.7: PG will consider the agenda for GridPP41 incorporating the GridPP6 Background Documents. Ongoing.
673.2: AD will provide the PMB with an overview of strategy for tapes and drives for the remainder of GridPP5 and GridPP6. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage.
675.2: RJ to sign off report on Tier-1 LHC usage.