GridPP PMB Meeting 653 (04.12.17)
=================================
Present: Dave Britton (Chair), Tony Cass, Jeremy Coles, Tony Doyle, Dave Kelsey, Pete Gronbech, David Colling, Roger Jones, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Pete Clarke.

1. Quarterly Report Status
==========================
The Q2 status reports have been circulated by PG. They do not need to be reviewed in detail, except to note a) the Castor issues at the Tier1, which had a knock-on effect, and b) the good progress on migration to Echo.

PG sent an email today regarding missing Q3 reports – Tier1, LHCb, Operations and Other VOs.

Tier1 – GS noted it is in preparation and needs a little more time to finalise. AS will finish the aspects he has to contribute, then it will be submitted as a priority.

LHCb – AM will submit this week.

Other VOs – Duncan is preparing this, PG will push.

Operations – JC is awaiting the final Tier2 report that Duncan sent last week, PG will forward it to JC.

2. Storage Evolution
====================
DB circulated a paper from Alastair Dewhurst this morning. He has responded encouragingly to Alastair and invited comments from the Technical Group. DC and AM will read the paper and raise it at the Technical Meeting on Friday. AM noted that it seems very interesting, but if it is not done at experiment level it may not be useful at Tier2. DB noted that page 17 contains a request for another Tier2 to get involved, and he has responded positively from Glasgow as the next active player in LIGO (Cardiff and Glasgow are the biggest institutes in the UK). Outside the LIGO context it remains to be seen what the experiments are doing, but it would be useful to have it demonstrated with LIGO. This will be discussed at the Technical Meeting and recommendations are encouraged. DB has also sent the report to Sam Skipsey to keep the storage group informed.

3. IGTF 1.88 rollout problem
============================
Jens sent a detailed technical explanation of the situation and why it has arisen. PG summarised the issue: it began when the UK issued a new certificate based on SHA2, and when this was used by those working on SL6 it caused a problem relating to BouncyCastle. A fix has been put in for BouncyCastle, but Jens is going to issue a new certificate which should prevent the problem from occurring.

DK noted that it would have been tested and we were not involved. This should be looked at again, though it does not appear to be a major issue and there appear to be two potential solutions (see also the Production Manager and Tier1 Manager reports, below). From the GridPP and Central Support sides it was dealt with and responded to very quickly and did not become a major issue. DB emailed Robert Frank at Manchester to thank him for undertaking the work to identify the cause of the issue; Steve Jones was also involved. This clearly demonstrates that the team works extremely effectively and efficiently in these situations.

GS noted there was a discussion this morning at the Tier-1 to understand the issue and how it affects us. There was an upgrade and roll-back, which showed that FTS had an issue; the roll-back helped there. If the Tier1 upgrades while Tier2 sites are still using the old version/certificate, it is not clear whether there would be an issue. He confirmed the plan to undertake the upgrade on Tuesday to ensure there is sufficient time ahead of Christmas. We are the last CA to upgrade to SHA2 and it is not clear whether others have had these issues; it may relate to differences in certificate extensions.

4. Biomed
==========
We have been looking to ensure GridPP is acknowledged in publications that arise from Biomed work performed on GridPP resources. Originally, we understood that the easiest way to do this was to sign the EGI MOU to supply (opportunistic) resources to BIOMED, but SL has had email exchanges and summarised that this does not seem to be a productive route, because we want GridPP acknowledged but EGI insist on listing sites. SL suggested we can either continue with this approach and register each site, or insist that Biomed explicitly acknowledge us. PG confirmed that we are the largest national contributor to BIOMED and already meet the stringent requirements of LHC and EGI, so we should be acknowledged. DB suggested we respond with a reminder that we have been the largest supplier for many years and are happy to continue supporting them, but that we must insist on acknowledgement in their publications, and provide a deadline. Alternatively, we could go back to individual Biomed communities, or to EGI in their role as supporter/broker in the Biomed community, and ask them to investigate for us. Though there is some bureaucracy, this should be a straightforward matter of explicit acknowledgement outwith the EGI connection. Another option is to suggest that their acknowledgement in the VO card could simply state “and GridPP resources across the UK”. It was agreed DB will write to VO managers and request this.

ACTION 653.1: DB will write to VO managers requesting they edit VO cards to acknowledge the use of “GridPP resources across the UK”.

5. AOCB
=======
None.

6. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
The next Technical Meeting is Friday – DC will report thereafter.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing significant to report.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
DC asked about the success rate of ATLAS user analysis jobs; RJ confirmed around 85%, and DC noted that CMS is drifting down to around 70%. RJ noted that occasionally one user submits a large batch of jobs, which significantly affects the figures. RJ will look into this. The figure is counted per job and is not weighted by CPU.
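
The difference between a per-job figure and a CPU-weighted figure can be illustrated with a minimal sketch. The job records below are hypothetical and chosen only to show the effect of one user's burst of short failing jobs; they are not taken from ATLAS or CMS monitoring, and the function names are assumptions made for this illustration.

```python
# Hypothetical job records as (succeeded, cpu_hours); the values are
# illustrative only and are not taken from ATLAS or CMS monitoring.
normal_jobs = [(True, 10.0)] * 8
burst_jobs = [(False, 0.1)] * 12  # one user's burst of short failing jobs
jobs = normal_jobs + burst_jobs


def per_job_success_rate(records):
    """Fraction of jobs that succeeded, each job counted once."""
    return sum(1 for ok, _ in records if ok) / len(records)


def cpu_weighted_success_rate(records):
    """Fraction of CPU hours spent in jobs that succeeded."""
    total_cpu = sum(cpu for _, cpu in records)
    good_cpu = sum(cpu for ok, cpu in records if ok)
    return good_cpu / total_cpu


print(f"per-job:      {per_job_success_rate(jobs):.1%}")
print(f"CPU-weighted: {cpu_weighted_success_rate(jobs):.1%}")
```

With these made-up numbers the per-job rate is 40% while the CPU-weighted rate is about 98.5%, which is why a single user's batch can move a per-job figure so much.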

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
Operations news:

1. Updates to 1.88-1 (contains SHA-256 UKeScienceCA-2B ICA) have caused problems at many GridPP sites. The trust anchors broke on SL6. ARGUS (SL6), UIs and VOMS were affected. Lxplus on SL6 was apparently also impacted. Other services needed a restart. Some sites were able to roll back to 1.87 (if they use a local mirror of the EGI CA repo) but others could not while the situation was being debugged. The issue relates to an old version of BouncyCastle on SL6/EL6 with OpenSSL versus the server setup (the SL6 version executes a method that causes an exception). In theory client and server having different versions should have worked, but some middleware could not cope with re-signing of the same keys/DN with a different signature algorithm (SHA2). Robert built an SL6 bouncycastle rpm which fixed the implementation of that function, and Jens found that adding a non-critical extension to the CRL also avoided the exception. Tomorrow Jens plans to release a fresh CRL; in the meantime, sites that could not or did not roll back have updated the bouncycastle version on their ARGUS servers.

2. Pheno have had an issue with the GridPP VOMS admin interface.

3. We are having a push on Pakiti access. It seems over time access has reduced and not all alerts are being proactively tackled.

4. Special topics for the WLCG ops coordination meeting this week are:

– CNAF incident. Consequences, actions, lessons learned
– Update from Container and Archival Storage WGs
– Update on migration of the SSB and SAM to MONIT

5. We are about to survey site usage of SAM results. The question of SAM sources and usage in WLCG is coming up again!

SI-5 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
– Overall Castor OK. There is a planned update to the SRMs (except LHCb) on Wednesday. This is to bring them all up to the same version as the LHCb ones which were upgraded a few weeks ago.
– The LHCb Disk-only space in Castor has become very full. They are as yet unable to use Echo.
o We do not have any good hardware to deploy into Castor. The best we can potentially provide before the end of the year is 3 machines from the 2012 generation (~270TB). We are already in the process of discussing with the Fabric team to ensure that these machines are checked over and we have spares available for failing parts.
o We are concerned with the lack of progress made to move LHCb to Echo. We cannot rely on the 2012 generation of hardware to perform and if we deploy them for LHCb, we would like to know we can remove them within 6 months.

Echo:
• Allocation in Echo for ATLAS increased to 4.1PB. They now have 4PB in datadisk and 100TB in scratchdisk. This is part of the gradual increase of their usage to 5.1PB.

DB noted that the LHCb use of ECHO was discussed extensively at GridPP39 and asked what has changed since. This has been discussed in the Operations meetings and should be moving up the list of priorities. AS confirmed we could put a few old disk servers back into CASTOR soon, but noted there is no spare capacity, as we are using the 2012-generation machines to resolve this in the short term. DB suggested there should be a clear and agreed plan for the transition, benchmarked against available disk on both sides. The migration for ATLAS and CMS was different, as there was more integration between the teams and a more stepped approach. The LHCb issue appears to be the lack of agreement between parties and differing priorities, so it would be very useful to have a clear written agreement. AM noted that there were detailed plans in place but confirmed it is critical to get Chris involved in Echo meetings and get agreement. In the short term checks are ongoing to determine if the service can go back in (we have 275 TB that can be committed now), but meetings should be set up between Chris, Alastair, Raj, AM and AS. AM has an overview of the internal LHCb position and workflows and also attends critical meetings; he will monitor this situation over the coming few weeks. GS noted care should be taken if using just 3 machines, and regarding the time required for ATLAS to release space, but the process can be started.

Certificates:
– Jens has provided input on the problem with the updated UK CA certificate in the IGTF 1.88 rollout. The Tier1 updated and then rolled back in light of the reported problems. We have been left with some issues in our configuration/deployment system (Aquilon) as a result of the rollback. The only service that had problems following the rollout was FTS. The rollback largely fixed the FTS problems.
o We understand there are two solutions being produced: one is the fixed version of the “bouncy castle” code produced by Robert Frank; the other is the use of non-critical extensions that Jens plans to try in a CRL.
o Unless circumstances change we (Tier1) plan to sit tight at the moment. We will see if we can sort out the problem we have in Aquilon (we are hopeful we can). We plan to re-upgrade everything to the 1.88 rollout next Tuesday. This is intended to allow reasonable time for the fixes to be rolled out, while giving us time to check for issues ahead of the Christmas/New Year break.

ACTION 653.2: AS and AM to schedule a meeting with relevant parties to discuss Echo.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report. The last MB was two weeks ago; DB missed it but circulated the minutes.

SI-8 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
644.2: PG and AS will document plans and costings for the remainder of GridPP5 taking account of the Oracle tape issues experienced. (Update: a draft will be produced before Christmas). Ongoing.
644.3: AS will put together a starting plan for staff ramp-down. (Update: a draft will be produced before Christmas). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
647.1: PC will update Data Management Plan. Ongoing.
647.2: DB will circulate link for Data Management Plan once agreed. Ongoing.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AS will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
650.1: SL will confer with PG and sign up to Biomed site to ensure our input will be explicitly credited in future. (Update: SL had sent info to Biomed. They wanted a site name; SL said we wanted it to be GridPP. He has emailed twice, so far no response). Done.

ACTIONS AS OF 04.12.17
======================
644.2: PG and AS will document plans and costings for the remainder of GridPP5 taking account of the Oracle tape issues experienced. (Update: a draft will be produced before Christmas). Ongoing.
644.3: AS will put together a starting plan for staff ramp-down. (Update: a draft will be produced before Christmas). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
647.1: PC will update Data Management Plan. Ongoing.
647.2: DB will circulate link for Data Management Plan once agreed. Ongoing.
649.1: DB will write Introduction of OS documents. Ongoing.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.3: PG will schedule a discussion of the Risk Register at a PMB meeting in December then update this in the OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AS will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
653.1: DB will write to VO managers requesting they edit VO cards to acknowledge the use of “GridPP resources across the UK”.
653.2: AS and AM to schedule a meeting with relevant parties, between now and Christmas, to discuss Echo.