GridPP PMB Meeting 688 (26/11/18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, David Colling, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Jon Hays, Roger Jones, Steve Lloyd, Andrew McNab, Gareth Roy, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Jeremy Coles, Dave Kelsey.

1. Authorship Fractions
=======================
DB has been working on spreadsheets for the GridPP6 proposal and reviewed the authorship fractions used previously. He circulated the fractions and members have responded with comments. In GridPP5 some experiments had included CERN in the Tier-1 countries and others had not; this has now been made consistent by excluding CERN authors, which prompted further discussion on why some were included last time. TC advised that Ian Bird had confirmed they are not part of the expected pledge, so they do not need to be included. This may be discussed further at a later date, but for the moment it has been addressed effectively and the revised authorship fractions will be included in the GridPP6 proposal.

2. CRICK Visit
==============
Some PMB members (DB, PC, AS, Ian Collier and AM) visited CRICK on Wednesday at the invitation of the Head of IT and the Head of HTC (Steve Hindmarsh). They were given a tour and heard about computing requirements before giving a presentation on Grid computing. This was well received, and Steve has arranged to visit RAL soon, where he will meet AS for further discussion.

3. 5GB S3 limit for DUNE
========================
There have been recent discussions and email exchanges about the file size limit DUNE had experienced; AM is scheduled to discuss this with them later today. He has had detailed discussions with CERN developers, who have recently got this working. Most of this has now been tested and is operational, so DUNE should be able to use the Echo endpoint very soon, which will help to build confidence.

4. LSST Update
==============
PC gave an update on the LSST situation. It is possible that the LSST model is inappropriate. An issue arose from one of George’s slides, which could be misread in relation to the potential success of using the Grid. PC confirmed the job was completed by US colleagues as a result of internal decisions, not out of necessity or because the Grid could not complete the work. This may be clarified by email, but there is concern that any suggestion that astronomers cannot use the Grid but can use HPC machines could have a negative impact, particularly since the Chair of PPRC attended the talk. He has discussed this further with them. GridPP should agree for the future that Grid middleware is highly successful software, as proven by the LHC and others. There is an overhead in using it, and if distributed computing is not required then it is not appropriate to use the Grid without another good reason. GridPP should set a higher threshold before agreeing that others may use the Grid: there should be a clear rationale for using it, and it should be a use for which we can provide assistance, i.e. it is not appropriate to ‘test’ the Grid to determine its suitability. Others who need simpler access should seek this via IRIS. DB noted that LSST successfully ran 2,500 jobs against a goal of 10,000; initial jobs are traditionally the most problematic, so it is likely the difficult issues had already been mitigated and the remaining 7,500 would have processed more easily. This was impacted by LSST’s inappropriate use of Ganga. There was also discussion on whether batch access is appropriate.

5. OSC Meeting Planning
=======================
DB shared the draft slides for the OSC meeting on Wednesday and invited members to check them for accuracy and make relevant comments or suggestions. Links to the CERNbox version and the combined talk for CRICK were also provided. DB will arrive at Heathrow at 10am and meet other members at 11am at the Starbucks close to the venue. There may be new members on the committee, so the format may differ slightly from previous meetings. DB has contacted Sarah and Tony regarding the GridPP6 proposal and awaits a response.

6. Quarterly Reports
====================
Matt will email GR in future quarters at the deadline to determine what reports remain outstanding so this can be added to the PMB agenda.

7. AOCB
=======
a) PG mentioned the costs that OSC required for the remainder of GridPP5; the only unclear item is Tier-1 for the next FY, which AD will circulate. AS noted that Tony Medland had mentioned the spend plan to DK; AS will check whether AD is on the circulation list for future emails on this.

8. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
AD noted that Friday’s technical meeting was on progress towards a new information system; the minutes are attached to that agenda on Indico. The previous meeting related to Rucio. This week’s meeting will probably cover APEL accounting for HTCondor, as Steve Jones has been covering this at Liverpool. AM will place the Indico link for future meetings on the PMB agenda page.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ confirmed that the Singularity issues will be reported on by AM and appear to be resolved. The issues raised last week are still ongoing but not major.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
The CMS Computing meeting last week was interesting and related to AAA and Echo and whether they could work efficiently. Nothing further of UK-relevance to report.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing UK-specific to report. Of wider interest is that the BOINC gateway has been successfully turned off, which relates to the recent request for Tier-3-like resources. It was used for Monte Carlo material, which AM summarised; there were a very small number of providers in the Geneva area.

SI-4 Production Manager’s report (JC)
-------------------------------------
JC was not present and no report was submitted.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
– We believe the CMS AAA issues have been resolved. We will close the tickets but are continuing to look at ways of building more resilience into the service. The change that appears to have fixed the problems was made on 23rd November. It reduced the chunk size requested from Echo from 64MB to 4MB, meaning that small data requests are served much faster, with a slight (~10%) reduction in performance for large data requests. The change was made only to the CMS AAA service. Since then the SAM tests have all been passing, and 90% of the CMS AAA Hammer Cloud test jobs have been passing, which is extremely good. The Hammer Cloud tests involve jobs at other sites requesting data from RAL. There can be a significant failure rate that has nothing to do with the site, so anything above a 70% success rate is considered a pass. The throughput on the proxy machines is much more balanced with the new chunk size in place.

– The problem reported by NA62 in the previous week, when they could not recall data in a timely manner, was the result of a forgotten cron job on the new system. This cron job assigns new media to tape pools as they run short. The ATLAS tape pool ran out of tapes and a 200 000 file backlog built up before this was noticed. The tape system prioritises writing to tape above recalls (to ensure data is safe), so once the problem was fixed the next ~48 hours were dominated by clearing this backlog.

– Since 22nd November, SAM tests against the old (tape-only) CMS Castor instance appear to be “missing” from the reports (and in some plots appear to indicate 100% failure). If we check the actual results, they are passing. We do not currently understand the issue; migration to the new endpoint is only a week away.

– PPD Tier-2 reverted its switch to IPv6. This resolved a variety of problems for the Tier-2, some of which were being blamed on the Tier-1 (e.g. FTS service failures).

– ATLAS singularity problems have been understood. It turns out that it was nothing to do with privileges (ATLAS jobs should work fine with the privileges we are providing them with). The problem was a missing home directory, which caused misleading error messages later.
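
The chunk-size trade-off and the Hammer Cloud pass threshold described in this report can be illustrated with some simple arithmetic. The sketch below is not the actual proxy logic: it assumes, purely for illustration, that the proxy always fetches whole, chunk-aligned blocks from Echo, and the function names (chunk_count, backend_bytes, hammercloud_passes) are hypothetical. Only the 64MB and 4MB chunk sizes and the 70% threshold come from the report.

```python
MB = 1024 * 1024


def chunk_count(offset: int, length: int, chunk_size: int) -> int:
    """Number of chunk-aligned blocks touched by a read of `length` bytes
    starting at `offset` (assumes whole-chunk, aligned fetches)."""
    if length <= 0:
        return 0
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return last_chunk - first_chunk + 1


def backend_bytes(offset: int, length: int, chunk_size: int) -> int:
    """Bytes that would be pulled from the object store to serve the read,
    under the same whole-chunk assumption."""
    return chunk_count(offset, length, chunk_size) * chunk_size


def hammercloud_passes(success_fraction: float, threshold: float = 0.70) -> bool:
    """Anything above the 70% success rate quoted in the report is a pass."""
    return success_fraction > threshold


if __name__ == "__main__":
    for label, size in (("small read (1 MB)", 1 * MB), ("large read (100 MB)", 100 * MB)):
        for chunk in (64 * MB, 4 * MB):
            print(
                f"{label}, chunk {chunk // MB} MB: "
                f"{chunk_count(0, size, chunk)} request(s), "
                f"{backend_bytes(0, size, chunk) // MB} MB fetched"
            )
    print("90% Hammer Cloud success passes:", hammercloud_passes(0.90))
```

Under these assumptions a 1 MB read pulls 64 MB from the backend with the old chunk size but only 4 MB with the new one, while a 100 MB read needs 25 requests instead of 2. That is consistent in direction with the faster small requests and slightly slower large requests noted above, though the real ~10% figure depends on per-request overheads that this sketch does not model.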

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
There has been no MB meeting.

SI-7 External Contexts (PC)
---------------------------
Nothing of significance to report. DB will consider the merits of inserting an external context slide to ensure the new OSC members are aware of the wider impact of GridPP.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September.
(UPDATE: 17 October meeting with Tony Medland & DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in; GR will upload into Googledocs for info). Ongoing.
687.1: PC will report to PMB after discussions with LSST. Done.

ACTIONS AS OF 26.11.18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. (Update: PG will check the OSC minutes for details and cover with GR). Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. (Update: Almost complete and will circulate current iteration for comment). Ongoing.
678.5: JC to finalise the Storage background document by end September.
(UPDATE: 17 October meeting with Tony Medland & DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in; GR will upload into Googledocs for info). Ongoing.