GridPP PMB Meeting 676 (13/08/18)
=================================
Present: Pete Gronbech (Chair), David Colling (part), Alastair Dewhurst, Roger Jones, Steve Lloyd, Andrew McNab, Louisa Campbell (Minutes).

Apologies: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, Tony Doyle, Dave Kelsey, Andrew Sansum.

1. PM Duties
============
The plan is to hand over most of the routine Project Management duties to others within the collaboration. This includes Quarterly Reporting, setting up Indico agendas and miscellaneous web site editing. PG will continue to run the Tier-1 resource allocation meetings, handle the Tier-2 allocations and keep track of finance.

2. Tier-1 resource meeting LHC sign-off
=======================================
AD confirmed that a procedure has been established to enable experiment liaison officers to review quarterly usage at the Tier-1. AD plans to include this in the Tier-1 Manager’s report from this point on. Since this should be signed off at the PMB level, it was agreed to defer decisions until DB is available at the next PMB.

3. Tier-2 h/w allocation
========================
PG explained to DC that he had already submitted the grant levels to STFC last week, but would try to modify them today in light of CMS wishes.

4. GridPP41 Agenda
==================
Tier-1 session – AD will speak first on the Tier-1 vision for long-term tape planning. 3 talks next – disk storage, Castor decommissioning/consolidation, future Echo aspects and S3 Swift; James Adams will talk on batch system evolution, e.g. future plans; Alex Dibbo on provisioning other services, e.g. cloud, and moving services to the cloud where they can be scaled more easily; Darren Moore on production, monitoring and operational matters; possibly Adrian Coveney on APEL/GOCD; possibly Kashif on fabric elements, e.g. procurement, network and managing large numbers of machines; and Daniela on Dirac in the experiment support session.
RJ will consider some talks, based on DB’s suggestion, for an Experiment Support session split into 2 sections – small experiments and planning for GridPP6. Non-Tier-1 people will present on CMS and LHCb. It was suggested that Daniela could talk on Dirac.
Tier-2 – JC will progress this. SL suggested Gareth and Daniela could talk, as they have contributed to the document.

Storage – JC will coordinate.

5. AOCB
=======
a) DC briefly outlined the way AD and he have agreed to intertwine their two proposals for IRIS Digital Assets. This is for the capitalization of software development to become a digital asset, to be funded by IRIS. They had put in overlapping bids concerning RUCIO and DIRAC. They have now agreed to reference each other’s bids. There is also a Vcycle bid from AM.

b) DIRAC Workshop item. DC explained that IC are hoping to host the next DIRAC workshop. It lasts 4 days and attracts approximately 150 people. It is out of term time so they are looking at hotel venues. The Royal Society is fully booked. They are looking for a subsidy (from GridPP), per attendee, at a similar level to previous conferences. It was agreed this should be deferred until the next PMB in the absence of DB and DK, though SL believes we should support it at some level.

6. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Not present, no report submitted.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
There have been Echo problems at RAL since last Friday; there was an intervention with Dave. Manchester had a Fingal server fail, which caused ATLAS problems. AD

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Not present, no report submitted.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
Not present, no report submitted.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
CMS AAA – the service is heavily loaded (lots of popular data on Echo), causing machines to go down or slow to the point where SAM tests start failing. Restarting the machines (on a cron) helps. Work is under way on adding throttling of some description.

Occasionally seeing (< 1%) transfers failing with Globus “Address already in use”. Correlation with bulk LHCb transfers from Castor to Echo. Ongoing effort to kill stuck transfers that block ports.

Week beginning 6/8/18 CMS jobs - periodic failures when accessing files. Problem appears to be network related. This was tracked down to Docker containers losing all connectivity. Also causing some ATLAS jobs to fail with “lost heartbeat” errors. Problem identified requiring some machines to be re-installed. Batch farm at half capacity while this is being done.

Echo: Unscheduled outage (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=25859). In the evening on Friday 10th August something happened that caused machines in the Echo cluster (Dell16 generation) to start using more memory. 3 machines went down completely and needed someone to go to the machine room to restart them. Given the complexity of the problem there was a limited amount that could be done over the weekend and Echo was declared down. The Echo and Fabric teams have been working on the problem this morning. The downtime has been extended until tomorrow lunch when we hope it should be fixed. I have arranged a meeting with our Ceph consultant tomorrow morning.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

SI-7 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (Alice and LHCb are resolved). (Update: DB will write to DK with DC in copy with proposed way forward – almost complete). Done.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Done.
667.2 ALL Need to do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
671.2: PG and AM to check if there is a common requirement across the Grid that can be negotiated with Dell for a framework agreement (e.g. Storage, Compute, Configurations). Ongoing.
672.3: RJ, DK and AM to draft the Experiment Support background document. Ongoing.
672.4: DK to draft the Security, Trust and Identity background document. Ongoing.
672.5: AD to draft the Tier1 background document. Ongoing.
672.6: JC, SL AM and PG to draft the Tier2 background document. Ongoing.
672.7: PG will consider the agenda for GridPP41 incorporating the GridPP6 Background Documents. Ongoing.
673.2: AD will provide the PMB with an overview of strategy for tapes and drives for the remainder of GridPP5 and GridPP6. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
675.2: RJ to sign off report on Tier-1 LHC usage. Ongoing.

ACTIONS AS OF 13/08/18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
667.2 ALL Need to do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
671.2: PG and AM to check if there is a common requirement across the Grid that can be negotiated with Dell for a framework agreement (e.g. Storage, Compute, Configurations). Ongoing.
672.3: RJ, DK and AM to draft the Experiment Support background document. Ongoing.
672.4: DK to draft the Security, Trust and Identity background document. Ongoing.
672.5: AD to draft the Tier1 background document. Ongoing.
672.6: JC, SL AM and PG to draft the Tier2 background document. Ongoing.
672.7: PG will consider the agenda for GridPP41 incorporating the GridPP6 Background Documents. Ongoing.
673.2: AD will provide the PMB with an overview of strategy for tapes and drives for the remainder of GridPP5 and GridPP6. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
675.2: RJ to sign off report on Tier-1 LHC usage. Ongoing.