GridPP PMB Meeting 598 (23.05.16)
=================================
Present: Dave Britton(Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Doyle.

1. Costings for Non-LHC VOs
=========================

PC, PG, DB and AS have been working this up; AS is considering staff costing. This relates to the attempt, referred to in the minutes of the 597th PMB, to set a threshold above which sites request funding. Consideration needs to be given to how this is best presented. The internal document needs to be completed this week and then circulated to the PMB for discussion in 2 weeks, to decide how it should be presented to groups. DB reminded members that the threshold is not a cap on the level, but the point at which funding needs to be sought.

The document is written from a GridPP perspective but is also intended as a UKTO costing guide, so the two may need to be integrated, as resources are primarily GridPP at present. The costing exercise demonstrates low marginal costs to customers compared with purchasing their own storage.

2. Comments on CRSG Experiment requests for 2017 & 18
=====================================================
These have been through the RRB. DB circulated the CRSG document and PG summarised it. The document begins by comparing resources delivered vs pledged and resources used, before covering the requests for 2017-18. PG circulated slides highlighting the key points for discussion.

ALICE: requests go far beyond the initial goal of keeping to a flat budget, but ALICE stated that the physics would be impacted without this resource. This has been accepted with the caveat that, should funding sources be unable to meet the demands, the physics programme would need to be redefined.

ATLAS: 75% of the resources used at the Tier-2s is beyond pledge; this supports our arguments about the UK structure, but concerns the CRSG.

CMS: slightly less CPU was requested at Tier-2s for 2017 due to increased usage of MiniAOD, but a little more disk was requested to support the opportunistic use of the resources for GEN-SIM and RAW event processing. There was some concern about a deficit of tape provided at Tier-1s compared to that pledged.

LHCb: applauded for their software efficiency, which has resulted in savings, and encouraged to continue optimisations to further reduce CPU and disk consumption.

PG summarised the resources requested by the LHC experiments for 2017 and 2018, starting with the 2017 requests for the Tier-1. Compared to the GridPP5 plan (which was based on 2015 REBUS figures), the CPU and disk requests are marginally increased, by ~4%, which should be OK.

The requests for 2018 for ALICE and ATLAS amount to increases of 8 and 5 percent compared to that planned for CPU and disk. The CMS request is as planned. (At the time of the meeting there were no estimates from LHCb but subsequently we were informed that for planning we could estimate an increase of 20% in CPU and disk and 30% for tape compared to the 2017 figures.)
Overall, the 2018 requirements for CPU and disk are okay, but tape is slightly concerning. The ALICE tape request has doubled and the ATLAS request is a substantial increase too. We note that DIRAC tape usage is not included in the GridPP5 plan; this could be used to support the case that DIRAC requires funding from 2018. PG will write to Concezio Bozzi to ask him to provide an estimate of LHCb requirements in 2018.

The Tier-2 requests for 2017 show changes from what is in REBUS: increases for ALICE on CPU and disk; a reasonable increase in ATLAS CPU but decreased disk; a reduction in CMS CPU and an increase in disk; and an increase in LHCb CPU. The 2018 requests showed a similar pattern, with increases again for the same experiments in the same places, but smaller changes compared to our planning (102% of what was planned on CPU and 90% on disk).
Overall we should be able to meet the Tier-2 requests. PG will share numbers with SL who is looking at capacity planning.

ACTION 598.1: PG will write to Concezio Bozzi and ask for an approximate number for LHCb 2018 requirements.

ACTION 598.2: DB to reconsider h/w planning.

ACTION 598.3: PG to provide SL with figures to build into planning.

3. AOCB
=======
GS to provide PG with a report on Tier 1 for first quarter of the year (preliminary one provided).

ACTION 598.4: GS to provide PG with a report on Tier 1 for first quarter of the year.

4. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No meeting and nothing to report.

SI-1 Dissemination Report (SL)
------------------------------
SL not present, no report submitted.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ not present, no report submitted.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing to report.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
PC not present, no report received.

SI-5 Production Manager’s report (JC)
-------------------------------------
1) There was an issue with the most recent CA rpms update. Confusion arose because the wrong release date was specified in the release guidelines (which are used by automated tests, so SAM was showing critical errors), there was no EGI Broadcast until quite late, and the release was staggered. Oddly, no problems were spotted during product verification or during Staged Rollout.

2) GridPP pilots are now enabled at all but one GridPP site. This helps take forward DIRAC based submission for our ‘other’ VOs.

3) CentOS7 readiness work is starting. Some UK sites are keen to move to this OS as soon as possible. The WLCG migration activity is being scheduled for LS2.

4) Questions are arising from Tier-2 sites about the next GridPP accounting period. When will it start and for how long will it run?

5) The next HEPSYSMAN meeting is 21st-23rd June: http://hepwww.rl.ac.uk/sysman/june2016/main.html.

ACTION 598.5: JC will contact SL for confirmation of when accounting period commences to pass on to experiments.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
The tape library has worked OK since the problems reported last week. We have a review meeting with Oracle on Wednesday (25th May).
There are a few minor issues. I would not normally mention detailed tape library issues as they are not usually service affecting.
However, given the recent outage I list these here:
– There is an occasional problem with discrepancies between the tapes seen by the library and the control system that is being followed up.
– Last Thursday the Tier1 tape library was powered down for a moment while some safety lighting within was reconnected. This happened uneventfully.
– We still have four of the Tier1 tape drives in the ‘wrong’ library. These are still accessible – the libraries pass the tapes between themselves. At some point we will put these back in their correct location.

Castor:
There have been some problems with CMS access. The maximum number of xroot connections to the disk servers (in CMSDisk) was reached – this has been increased. There have also been a couple of problems with particular disk servers. We are still checking there are no outstanding problems.

Grid Services:
There were problems with the RAL Stratum-1 server last week between Monday (~late afternoon) and Wednesday (~noon). These were at both the replication (cern.ch repositories) and client access levels, because of some routing issues. The cause was a disk space issue on the production Stratum-1 HA cluster; a standby Stratum-1, which took over for that period, did not have the correct routing in place. The disk space issue was resolved and the production HA Stratum-1 servers were back online last Wednesday afternoon.

New Capacity Hardware:
One batch of Worker Nodes (from XMA) is ready for use in production. Six of them have been running jobs for a few days. The remainder have been put into Condor ready to start doing production batch work.
HPE have completed their testing of the second batch of WNs. These are being readied for our set-up, testing and benchmarking.
There is a slight delay as we await some gigabit switches to connect up the management ports.
XMA have also completed their testing of the storage nodes. These will be prepared for our testing, benchmarking etc. We expect the systems to be passed to the CEPH (ECHO) team in a couple of weeks.

AS provided an update on last week’s air conditioning failure and its failure to restart. A network event appears to have taken place and the VMS dropped out, but the source is not clear. The failure to restart is being investigated further. It is expected that this will be followed up after a liaison meeting with the operations team.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
No meeting and no report.

SI-8 External Contexts (PG)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
595.9: JC to discuss instructions writing document workflows required by new users of the Grid. Ongoing.
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Ongoing.

596.2: DB & PG will digest CRSG paper and develop a set of figures. Done.

596.3: PG will rationalise figures from CRSG paper and consult with experimental reps on whether they are satisfied to proceed with the proposed numbers. Done.

597.1: AS will look out and circulate paperwork for a Tier-1 cooling incident experienced in 2010 to determine any similarities with a recent situation. Done.

597.2: DB will monitor tape storage usage by non-LHC VOs in line with new accounting portal which he has been given access to. Done.

597.3: PC will work up text for a policy on supporting non-LHC VOs and will circulate for discussion this week. Done.

ACTIONS AS OF 23.05.16
======================
595.9: JC to discuss instructions writing document workflows required by new users of the Grid. Ongoing.
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Ongoing.
598.1: PG will write to Concezio Bozzi and ask for an approximate number for LHCb 2018 requirements.

598.2: DB to reconsider h/w planning.

598.3: PG to provide SL with figures to build into planning.

598.4: GS to provide PG with a report on Tier 1 for first quarter of the year.
598.5: JC will contact SL for confirmation of when accounting period commences to pass on to experiments.

Next Monday is a Bank Holiday; the next PMB meeting will take place on 6 June.