GridPP PMB Meeting 614

GridPP PMB Meeting 614 (21.11.16)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Cass

1. HEPSYSMAN
============
PG sent an email about the HEPSYSMAN event at RAL planned for next June, which may have an IPv6 focus, including a tutorial and speaker. If relevant, it may revisit the security theme next year. PG requested approval for funding the event, and the PMB approved this.

AS mentioned that the information audit at STFC, which assesses how systems containing user data handle use, security and retention, commonly points to the HEPSYSMAN security training as an example of good practice.

2. OSC Talk
===========
DB thanked the PMB for their input and comments for inclusion. The talk is deliberately brief and focussed on big issues as opposed to lower or medium level issues. Project management is not included in detail as the OSC is more interested in risks etc. PG has been discussing the Tier-1 spend plan (resource and capital) with AS; a slide is included on this which alludes to the request submitted in November 2016, to be discussed further in the meeting. AS will process the request for capitalising tape today. Consideration needs to be given to how resource requirements for Tier-1 (e.g. robot maintenance) are presented, to determine which elements have minimum figures included; this will be clearer when AS has completed the figures. £390K is not yet included in the formal tables or project management plan but is included in the PowerPoint slides. DB will work up the remainder of the slides and send them to AS for a figures update, then circulate them later today.
The resource requirement for funding HNSciCloud out of GridPP is 75K Euros over 3 years in 2 tranches; AS has tracked this back and demonstrated this was the original intent. It may not be useful for LHC experiments but may count towards the pledge. DB reminded the PMB that GridPP made this commitment on the basis of 100% funding, and it may be affected by the final 90% funding. It may be challenging to determine the benefit in the long run, but the resources may count towards the MoU commitment or be used more effectively in UKT0; this should be considered, as it could allow a reduction in commitments elsewhere.
ACTION 614.1: DB will finalise OSC slides later today and forward to AS for figures update then circulate to PMB.

3. CMS issue at Tier-1
======================
DB advised there has been no update since last week. AS was asked to undertake an SIR and sent round a Google document to begin information gathering. The issue was identified at CHEP. To ensure this does not recur, it is worth undertaking forensics to identify where communication failed and which channels might have flagged the issue earlier. RAL has had an excellent record in CMS communication over the last 10 years; DB will include a slide on this for the OSC presentation, highlighting that this is more of a communication issue than a technical one and confirming that it came to light due to the success of the LHC. Initially it may have arisen from some specific CMS workflows which affected Castor, and the move to Ceph should remedy this. DC will liaise with various team members on the operational side for input and ascertain when they realised there was an issue at RAL and by what mechanism it was escalated. Also, it was noted that CMS were showing low usage in the graphs; future quarterly resource meetings should enquire into the reasons for any low usage that shows up.
ACTION 614.2: DC will liaise with various Operational team members on the CMS issue at Tier-1 to ascertain when the issue was identified and by what mechanism it was escalated.

4. Tier-2 HW grants – status
============================
PG confirmed all but 3 grants are with STFC – Durham, Brunel and Bristol remain outstanding. PG sent out reminders last week and this week and these are progressing well. PG will follow up with a further reminder to the 3 sites and email STFC to check if all is in order.
ACTION 614.3: PG will follow up with a further reminder to Durham, Brunel and Bristol on submission of Tier-2 HW grants and email STFC to check if all is in order.

5. Pledges
==========
Pledges are due by 30 November and should be submitted this week. PG summarised Tier-1 and Tier-2 levels in emails to DB and AS this morning, at old levels plus 60% of uplift. DB will enter these into REBUS, then check before formal submission. Tier-2s have not yet been consulted: PG has circulated pledges, but the four Tier-2 managers need to confirm they are satisfied with these levels and how they will be managed. The information should be sent to the sites, probably via the CB. PG uploaded the summary sheet onto the website and can advertise this to the Ops Team and ask them to highlight any issues with the proposed numbers for each site.

ACTION 614.4: DB will enter pledged amounts into REBUS then check before formal submission.

ACTION 614.5: PG will advertise the pledges summary on the website to the Ops team and ask them to highlight any issues with the proposed numbers for each site.

6. AOCB
=======
a) LSST DESC (Dark Energy Science Collaboration) request.
PC sent round an email to the PMB for information; no action is required. He summarised the commitment made to DESC some time ago (astronomers have an instrument called LSST, and different collaborations such as DESC make use of the data). We were asked to make a commitment (i.e. like a pledge) to DESC and this will be provided from LSST/DESC sites, e.g. Edinburgh, Lancaster etc. PC received a request from Joe, who has a new post at Edinburgh with Bob Mann (LSST head), noting that non-UK-based people are keen to do follow-up work. LSST UK consider this part of the UK’s commitment to DESC and we should regard it as such, but they will be running data slightly differently to our usual method. Bob Mann will provide information on requirements for the PMB to assess, and then Marcus will begin collating UK contributors and open a dialogue. PC also spoke with George Beckett, the UK project manager, and we await confirmation from LSST that they want to progress; PC will keep the PMB informed of any developments.

7. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
A short technical meeting took place 2 weeks ago; the technical group needs to define a new direction. In addition to regular items, interesting topics are raised, but it may be more beneficial to handle these via email. DB suggested the PMB focus the discussion by asking a few questions to be answered, e.g. what advice to give to a small site (Durham size) or a medium site (Oxford or Holloway) in terms of evolution. This will allow the technical group to work on answers now that a lot of groundwork has been done, e.g. Liverpool running large VAC and running multi-core jobs. The group should focus on practical elements, e.g. storage options, to produce a clear framework under which monies will be spent in GridPP5. Roger is looking into this for ATLAS. This can perhaps be addressed more effectively face to face. Diskless CMS sites: looking at making Oxford a diskless Tier-3 site, perhaps using RAL Tier-2 for diskless jobs depending on resulting traffic. Chris started a CMS-centric one and Bristol is almost diskless, though there is a subtle difference in how this site is set up, and networking is also relevant.

SI-1 Dissemination Report (SL)
------------------------------
##GridPP Engagement Officer Notes for PMB

###New User Engagement Programme – EUCLID

Thanks to Andrew L we’ve engaged a new user from EUCLID who is currently working through the UserGuide and providing very useful feedback on the new Ganga instructions.

###New VO policies

Following a discussion at last week’s Ops meeting, TW had a look at the “small VOs” that have been using GridPP resources in the past year (since January 2016) [1] and compared that with the supported VO list on the Wiki [2]. The former list consists of 30 active VOs (including the regional incubators and dteam) while the latter consists of around 40 (including “None”). A few questions arise:

1) Do we look at and actively pursue removing support for the ~10 VOs that have not done anything in the past year? Given the limited resources available at the GridPP DIRAC end, it might be worth removing the “bitty little” VOs (to paraphrase Daniela B) to reduce the burden on support.

2) Pheno, biomed and T2K are the biggest single experiment users, which is nice…

3) …but the gridpp incubator VO is now the second biggest user with nearly 1.5 M jobs submitted over the past year. Do we need to find who is using it now and, if they haven’t already, see if we can move them onto their own VO?

Advice or suggestions appreciated.

[1] https://accounting-next.egi.eu/egi/country/United%20Kingdom/njobs/VO/Year/2016/1/2016/11/custom-[object%20Object],biomed,calice,cernatschool.org,comet.j-parc.jp,dteam,dune,enmr.eu,esr,fermilab,geant4,gridpp,icecube,ilc,lsst,lz,mice,na62.vo.gridpp.ac.uk,nordugrid.org,pheno,skatelescope.eu,snoplus.snolab.ca,solidexperiment.org,t2k.org,vo.landslides.mossaic.org,vo.londongrid.ac.uk,vo.moedal.org,vo.northgrid.ac.uk,vo.scotgrid.ac.uk,vo.southgrid.ac.uk,zeus/onlyinfrajobs/

[2] https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs

JC suggests there are some ways to pull out the information required.
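
One way the comparison between the two lists could be automated is sketched below in Python. This is only an illustration, not the method TW or JC used: the active VO names are taken from the accounting URL in [1], while the function names and the short "approved" list in the usage note are hypothetical placeholders standing in for the Wiki list in [2].

```python
from urllib.parse import unquote

# VO segment copied from the accounting URL in reference [1]
# (the part following "custom-").
ACTIVE_SEGMENT = (
    "[object%20Object],biomed,calice,cernatschool.org,comet.j-parc.jp,dteam,"
    "dune,enmr.eu,esr,fermilab,geant4,gridpp,icecube,ilc,lsst,lz,mice,"
    "na62.vo.gridpp.ac.uk,nordugrid.org,pheno,skatelescope.eu,"
    "snoplus.snolab.ca,solidexperiment.org,t2k.org,vo.landslides.mossaic.org,"
    "vo.londongrid.ac.uk,vo.moedal.org,vo.northgrid.ac.uk,vo.scotgrid.ac.uk,"
    "vo.southgrid.ac.uk,zeus"
)


def parse_active_vos(segment):
    """Return the set of VO names found in a comma-separated URL segment."""
    names = (unquote(part).strip() for part in segment.split(","))
    # Drop empty entries and the "[object Object]" artefact present in the URL.
    return {name for name in names if name and not name.startswith("[object")}


def inactive_vos(approved, active):
    """Return, sorted, the approved VOs that do not appear in the active set."""
    return sorted(set(approved) - set(active))


if __name__ == "__main__":
    active = parse_active_vos(ACTIVE_SEGMENT)
    print(len(active), "active VOs")
    # Hypothetical approved list; the real one would come from the Wiki page [2].
    approved_example = ["pheno", "t2k.org", "unused.example.org"]
    print(inactive_vos(approved_example, active))
```

Feeding in the full approved list from the Wiki in place of the placeholder would give the candidates for question 1; identifying the users behind the gridpp incubator VO would still need per-user accounting data, which this sketch does not cover.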

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Disk server loss: very little loss last quarter, but approx. 80,000 files were lost recently and are not recoverable. At the end of last week an ATLAS user job triggered a number of tape recalls and other issues, which are causing delays in data recall. PG noted a high level of data running between Oxford and ATLAS during that time which may have had an impact; RJ confirmed this should not have had such an impact. Also, RALPP have fewer jobs than expected and this is under investigation.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
CMS is short of tape and there is a possibility of the UK providing more than strictly pledged. DC mentioned this to Liz, who has asked for more information, scales etc. AS suggested it may not be possible in FY17 as there is insufficient capital, and if tape becomes capital that will have an impact. In FY16 there should be unused tapes in the plan: approx. 320 tapes at 8 TB each, i.e. about 2.5 PB, which in principle could be provided up to this amount but not exceeding the FY17 pledge. CMS are currently at 8 PB and would be going to 10.8 PB, so it may be possible to provide most of the pledge early; PG will check the figures. AS will send an email, cc’ing DB and PG, that DC can pass on, probably by Wednesday.
ACTION 614.6: AS will email DC regarding pledges and availability.
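
The tape figures in this item can be checked with a few lines of Python. This is a rough sketch only, using the numbers quoted above and assuming decimal units (1 PB = 1000 TB) and that the quoted capacities are in petabytes; the function names are illustrative, and PG's check of the actual figures takes precedence.

```python
TB_PER_PB = 1000  # assumption: decimal units (1 PB = 1000 TB)


def tape_capacity_pb(tape_count, tb_per_tape):
    """Total capacity in PB of a number of tapes of equal size."""
    return tape_count * tb_per_tape / TB_PER_PB


def early_delivery_pb(spare_pb, current_pb, target_pb):
    """Capacity that could be provided early, capped at the pledge increase."""
    increase_pb = target_pb - current_pb
    return min(spare_pb, increase_pb)


# Figures quoted in the minutes: approx. 320 unused tapes at 8 TB each,
# with CMS going from 8 to 10.8.
spare = tape_capacity_pb(320, 8)
early = early_delivery_pb(spare, 8.0, 10.8)
share_of_increase = early / (10.8 - 8.0)

if __name__ == "__main__":
    print(f"Spare tape capacity: {spare:.2f} PB")
    print(f"Could be provided early: {early:.2f} PB")
    print(f"Share of the pledge increase: {share_of_increase:.0%}")
```

On these assumptions the unused tapes amount to 2.56 PB, a little over 90% of the 2.8 PB increase, which is consistent with the remark that most of the pledge could be provided early.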

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Some disk failures have been experienced at Tier-1s. GS confirmed this is being looked at; one batch appears to have had a blip (Sept/Oct) totalling 5 incidents, with some repeats on the same machines, indicating a potential slight concern which will be looked at. There were no disk failures in August.

SI-5 Production Manager’s report (JC)
-------------------------------------
1. Our APEL-ATLAS accounting comparison continues with scripts for ARC and Torque being developed to better extract records. As previously mentioned the comparisons need to be made regular.

2. There is a new NA62 computing lead and NA62 grid activities are expected to ramp up in the coming months. (DB confirmed a review is being undertaken and a document written up which he is feeding in to).

3. An illustrative example of issues sites face was raised in the storage group meeting last week. A 2GB limit was affecting xrdcp transfers from Bristol using DPM with HDFS. The problem was traced to a bug (identified at Brunel) that was localised in the dmlite-plugins-hdfs package.

4. The EGI Security Policy Group has produced a new revised and updated Top-Level Security Policy document. This brings the document up to date in terms of terminology and the current set of security policy documents, although one of the aims was not to change more than needed (as the old document has served well over many years!). This policy applies to ALL current and future EGI infrastructures and services and a draft can be found here https://wiki.egi.eu/wiki/SPG:Drafts:Security_Policy.

5. LSST plans to restart work on GridPP resources. Joe Zuntz reported that there are various groups in LSST that are interested in using the grid in general for analysis work, and following on from the work they did on shape measurement last year people are keen to extend to other areas and see if GridPP resources can be useful for this work. UK LSST sites are currently: Imperial, Liverpool, Oxford, Edinburgh, Manchester, Lancaster. The Tier-1 may also be used.

6. HEPSYSMAN 2017 is now planned for 13th to 15th June 2017 inclusive. (DB mentioned Ian Collier was proposing a workshop on WLCG on the same dates, so there may be a clash for people who want to attend both; JC will contact Ian Collier to discuss.)

7. A new release of the Accounting Portal is ready for testing (see https://accounting-pre.egi.cesga.es/). Interested parties are asked to test the new release for the next 2 weeks, and report via the tracking GGUS ticket any comment, bugs or suggestions for improvement: https://ggus.eu/index.php?mode=ticket_info&ticket_id=125040.

8. There will be a WLCG meeting (pre-GDB) on networking 10 Jan 2017 at CERN. It will be the opportunity for LHC experiments, Network operators, WLCG service and site managers to meet and discuss the network requirements for the 3rd run of the LHC, planned for 2021-2023: https://indico.cern.ch/event/571501/.

9. The Call for Expressions of Interest – EGI Activities and Services PHASE III (Jan 2018 – December 2020) – closed yesterday: https://documents.egi.eu/document/2945.

10. On 7th the Tier-2 A/R figures had a TBC against Bristol for CMS (http://wlcg-sam.cern.ch/reports/2016/201610/wlcg/WLCG_All_Sites_CMS_Oct2016.html) with 85%:95%. The drop was due to an outage at the beginning of October in order to upgrade their ARC CE from 4.2 to 5.1.3 on request of LHCb (submission failed with newer client software).

ACTION 614.7: JC will contact Ian Collier to discuss potential date clashes between HEPSYSMAN and WLCG meetings.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
General:
New DB Admin (Miguel Lopez) starts today.

Castor:
– Additional disk servers added to ALICE (5 servers, each 100TB). Eight of the 12 additional servers (each 120TB) added to LHCb. The remainder are to be added in the next few days.
– As reported before, the testing of Castor 2.1.15 is largely complete. Owing to staff availability this update will be carried out in the New Year, with the intention of completing it by the end of January.
– We are looking to merge smaller disk pools into larger ones for both LHCb and ATLAS. We expect to do this in December.

Tape:
– Migration of LHCb data from ‘C’ to ‘D’ tapes ongoing. Approaching the 20% mark with just over 800 out of the 1000 tapes still to do.

Ceph/ECHO:
– There was an intervention on Echo last week while a network reconfiguration was carried out.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
The meeting was cancelled last week – this has been rescheduled for tomorrow and PC will attend. There may be an issue regarding accounting being raised – Ian will cover this if it does arise.

SI-8 External Contexts (PC)
---------------------------
1) The BEIS bid from RCUK was submitted and feedback has been received: a £30M line for PPAN (UKT0 stuff). Timescales for monies to trickle through are not yet clear.
2) EUT0 has not been operating for some time; Tony Medland may step down from chairing the collaboration board, and there is a meeting on Wednesday in Amsterdam. One agenda item is for EGI and EUT0 to discuss bids, but this may not be effective. The other item is to forget the H2020 elements and return to roots, making resources available to projects which are shared (e.g. UKLID).

REVIEW OF ACTIONS
=================
605.1: DK will investigate costs and timescales of upgrading the OPN Link to 30 and report back to PMB. Ongoing.
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
610.3: AS Attempt to get tape media re-classified from resource to capital. Done.
612.3: PG will determine which small sites can undertake procurement this FY. Ongoing.
613.1: AS will undertake a post mortem on CMS issues at Tier-1. Ongoing.
613.2: DB will prepare one talk to present to the OC Meeting. Done.
613.3: PG will create slides containing tables and notes for information with resource, capital and financials. Ongoing.
613.4: PG will write to sites to provide instructions and request they commence procurement of minimum £10K for capital. Ongoing.
613.5: ALL submit Q3 reports to PG. (Update: most now submitted, others will be submitted this week). Ongoing.
613.6: SL will advise Catalin that the PMB supports development work on creating secure CVMFS repositories hosted on the RAL Stratum-0 and may consider wider bids once the scope is better understood. Done.

ACTIONS AS OF 21.11.16
======================
605.1: DK will investigate costs and timescales of upgrading the OPN Link to 30 and report back to PMB. Ongoing.
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. Ongoing.
613.1: AS will undertake a post mortem on CMS issues at Tier-1. Ongoing.
613.3: PG will create slides containing tables and notes for information with resource, capital and financials. Ongoing.
613.4: PG will write to sites to provide instructions and request they commence procurement of minimum £10K for capital. Ongoing.
613.5: ALL submit Q3 reports to PG. (Update: most now submitted, others will be submitted this week) Ongoing.
614.1: DB will finalise OSC slides and forward to AS for figures update then circulate to PMB.
614.2: DC will liaise with various Operational team members on the CMS issue at Tier-1 to ascertain when the issue was identified and by what mechanism it was escalated.
614.3: PG will follow up with a further reminder to Durham, Brunel and Bristol on submission of Tier-2 HW grants and email STFC to check if all is in order.
614.4: DB will enter pledged amounts into REBUS then check before formal submission.
614.5: PG will advertise the pledges summary on the website to the Ops team and ask them to highlight any issues with the proposed numbers for each site.
614.6: AS will email DC regarding pledges and availability.
614.7: JC will contact Ian Collier to discuss potential date clashes between HEPSYSMAN and WLCG meetings.