GridPP PMB Meeting 652

GridPP PMB Meeting 652 (27.11.17)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, David Colling, Pete Gronbech (Minutes), Dave Kelsey, Steve Lloyd, Andrew Sansum, Gareth Smith.

Apologies: Tony Doyle, Andrew McNab, Roger Jones.

1. QR Status
============
All Q2 reports have now been received and PG can report on them next week.
Q3 is behind schedule; many reports have still not been handed in.
DB said they must be done before Christmas, and asked those at the meeting with outstanding reports to submit them ASAP.

2. Data Management Plan
=======================
PC is still waiting for comments from a couple of groups but has sent the current draft to PPGP.
The text sent to PIs will say that they can add institution-specific details at the end of the DMP if required.
DC has spoken to Antonin regarding LZ.

3. Tape Allocation Policy at the Tier1
======================================
PG raised the point that the GridPP tape allocation process will have to reflect the capacity actually on the ground.
Allocations will have to total less than the sum of the MoU commitments, but historically there has always been significant under-use, and there are options to increase capacity if really necessary.
DB: The problem has been caused mainly by the gap between the resources Atlas requested and those actually used. One option is to allocate Atlas what we realistically think they are going to use; if this is not sufficient, skim 5% off the top of all allocations. Presumably media are relatively quick to purchase if required? AS: Probably; we have a path to procurement.
DB: As a last resort, we do have Tier-2 money to allocate at some point. Could dip into RALPP if it became a crisis.

4. AOCB
=======
None

5. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
DC reported that there had been no suggestions for the agenda beforehand. It was a poorly attended meeting.
Two topics were discussed:
– How to register VAC and vcycle sites so that they appear automatically in DIRAC. Daniela and Simon have automated the process; Steve Jones will try it out.
– Luke sent round a link to the community white paper, asking people to look at it and give feedback; a second draft is now available. PC and DB have provided feedback. It is 85 pages long and is pretty well written, though some work is required to remove duplication. The PMB could read it and comment, perhaps on particular sections. The link went round tbsupport last Friday and the deadline is early this week, though further comments from the PMB after that deadline could be accepted.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing significant to report. No further updates on T2K.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing significant to report.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-4 Production Manager’s Report (JC)
-------------------------------------
JC sent a written report.
There is little of urgency to report from the operations area this week.

1. The security team are actively following up on some Pakiti alerts.
2. Our monthly site review last week revealed that a focus for most sites is migration to SL7/CentOS7.
3. The CHEP2016 proceedings have now been published: http://iopscience.iop.org/volume/1742-6596/898.
4. CMS has actively been following up on T2 IPv6 readiness: https://twiki.cern.ch/twiki/bin/view/CMS/IPv6Status4Sites. RAL appears twice in the list as “not ready”. Brunel and Imperial are okay. Bristol is being followed up. CMS has also requested sites to deploy Singularity by February 2018 (this is under discussion in our ops meetings as the EPEL version is out of date).
5. One of the key areas being followed up within the storage group is XRootD proxy caching.
6. It was noted that the CNAF outage took out the only VOMS server for the planck and ipv6.hepix.org VOs.

SI-5 Tier-1 Manager’s Report (GS)
---------------------------------
The main event was the power outage which happened shortly before last week’s PMB meeting. I did send an e-mail round with a status update on Tuesday. In summary:
– A power cut on Monday 20th November affected the RAL site and part of the wider area. It lasted around ten minutes from just after midday.
– There are two power feeds into the site. One had failed at around 8am that morning; the power outage happened when the second feed also failed.
– Systems on UPS power stayed up. Others didn’t (as expected!). Core systems, including all of Echo, were on UPS power.
– The diesel generator failed to start. In this instance this was not critical – the power returned well within the time limit the UPS batteries can supply.
– All systems came back by early evening on Monday. The last was Castor, as many of its disk servers had been down.
– The diesel generator was run overnight. It was switched in at around 4.30pm and ran until 9am Tuesday morning. So those systems on UPS were effectively diesel powered overnight.
– The second power feed into site was restored on the Tuesday (21st). Power has been stable since.
– The cause of the diesel generator failing to start is not yet understood.
– There was a problem with IPv6 access after the power cut. This took a little while to understand and was resolved on Thursday (23rd).

Echo:
– Re-distribution of data in Echo onto the 2015 capacity hardware is now complete. There are now 8PBytes of usable space in Echo. Some data rebalancing has been done and the Atlas quota has been increased by 500TB.

Services:
– The MySQL database behind the Production FTS instance has been moved to a new distributed MariaDB Galera cluster database.

Tier1 Availabilities for October:
Alice: 100%
Atlas: 100% (RAL-LCG2-ECHO “site” also reporting 100%)
CMS: 96%
LHCb: 100%
OPS: 100%

Comment on the above:
CMS – SRM test failures on 5th & 6th October: these were caused by the CMS disk storage becoming full. This had been triggered as old disk servers were drained for decommissioning ahead of an anticipated use of the Echo storage by CMS. On 6th October five disk servers were put back into service, resolving the problem.
The dashboard requires some updates.

DB asked about:

Procurement status:

The disk capacity tender went out the week before last.
The CPU ITT should go out this week; we are just finalizing with SBS.

Staffing issues:

Darren Moore has started and is being trained to replace GS.
The replacement for Tiju in the Fabric/Production team is stalled.
The replacement for Bruno in the Echo team was also declined, so that recruitment is running again.
The Tier-1 Manager advert is being worked on by AS today.
Contractor plans: a database admin is on a short-term contract as a filler, and we are trying to get a sysadmin for the Fabric team.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

SI-7 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
644.2: PG and AS will document plans and costings for the remainder of GridPP5 taking account of the Oracle tape issues experienced. (Update: a draft will be produced before Christmas). Ongoing.
How are we going to present this? AS has not yet thought through in detail how to present it.
• Change in the cost model: mismatch between requests and usage.
• Investigating suppliers with regard to replacing CASTOR. There is potentially money in FY18 to spend on a solution. Do we have an appetite to move to CERN’s system, or to procure a commercial solution? We do have this marker, as it is an action for the OC, and it needs to be done before Christmas. It could be discussed at a Tier-1 review, which would have to be after the OC. This is a big strategic decision to be made with lots of external (to GridPP) inputs, and it could benefit from a dialogue with all the stakeholders: Diamond and JASMIN, in addition to GridPP. DB asked for a list of current and future stakeholders so that he can check the overlap with UKT0. DIRAC is a stakeholder; it is now growing and may become a major driver.

644.3: AS will put together a starting plan for staff ramp-down. (Update: a draft will be produced before Christmas). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC; AS has emailed Mark. They are now using the service more heavily. The money could be used for tape, but we have to be careful not to buy tape we won’t use; it may be better to charge later rather than during this FY.) Ongoing.
647.1: PC will update Data Management Plan. Ongoing.
647.2: DB will circulate link for Data Management Plan once agreed. Ongoing.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AS will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
650.1: SL will confer with PG and sign up to the Biomed site to ensure our input will be explicitly credited in future. (Update: SL has sent info to Biomed. They wanted a site name; SL said we wanted it to be GridPP. He has emailed twice; so far no response.) Ongoing.

ACTIONS AS OF 27.11.17
======================
644.2: PG and AS will document plans and costings for the remainder of GridPP5 taking account of the Oracle tape issues experienced. (Update: a draft will be produced before Christmas). Ongoing.
644.3: AS will put together a starting plan for staff ramp-down. (Update: a draft will be produced before Christmas). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC; AS has emailed Mark. They are now using the service more heavily. The money could be used for tape, but we have to be careful not to buy tape we won’t use; it may be better to charge later rather than during this FY.) Ongoing.
647.1: PC will update Data Management Plan. Ongoing.
647.2: DB will circulate link for Data Management Plan once agreed. Ongoing.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AS will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
650.1: SL will confer with PG and sign up to the Biomed site to ensure our input will be explicitly credited in future. (Update: SL has sent info to Biomed. They wanted a site name; SL said we wanted it to be GridPP. He has emailed twice; so far no response.) Ongoing.