GridPP PMB Meeting 658 (29.01.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Jeremy Coles, David Colling, Tony Doyle, Roger Jones, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Pete Clarke, Pete Gronbech, Dave Kelsey.

1. Data Lakes
=============
DB recently circulated an email from Alistair relating to a meeting he attended with Ian Bird, Simone and others regarding Data Lakes. This relates to prototyping a small-scale test of the Data Lake concept and a request that the UK and the Netherlands join the test due to our potential future heavy investment in SKA. DB summarised the proposal and requirements, suggesting it seems a good idea to be involved, though he recognised the challenges in providing access on machines through RAL and the effort required. Confirmation is required on whether this would be an issue from an access perspective at RAL – in principle we would like to be involved but further investigation is needed on how this would operate. Storage team effort will likely reduce from the end of March, and other commitments of resources (staff and h/w) need to be considered before any commitment can be made. There is a tight timeframe to produce some results by the end of March and it would be helpful initially to confirm CERN’s specific requirements. There was some discussion on the storage team being involved rather than RAL and it was noted that the requirements are relatively small (2 machines with 2 TB).
ACTION 658.1: AS will discuss with CERN and the Netherlands participants if and where we may contribute to Data Lakes.

2. T3 Accounting
================
Emails have been circulating following a message from John Gordon about whether to account Tier-3 resources as part of pledge delivery. The context is a CMS accounting meeting at which a comment was raised that QMUL resources delivered to CMS should not be reported as Tier-2 on WLCG reports. After some email discussion it seems that we pledge resource from the Tier-2 (London) and fund it within the Tier-2; CMS then gets resources from other London sites, and non-CMS sites in London have resources that are purely opportunistic. This is something that should be covered by the various partners and we do not wish to have additional pressure on our sites. DB has added another requirement, that our input be properly acknowledged, and will respond to John Gordon.

3. OSC Documents
================
PG requested drafts to be submitted by today. DB circulated his section (Intro) to the PMB and there were a few comments for change. Some of this goes to AS for input: there is a need to flag some issues, discuss tape and ramp-down plans, etc. The bulk of the text for the main document can be dealt with relatively quickly and AS is working on this. It is likely to be suggested that we will remain on TKD for the remainder of GridPP5 and are exploring options, e.g. we are executing the agreed plan for T10KD, etc. More information will be available by the time of the OSC meeting, but we may have £1M in FY 18 to procure a replacement tape system, which is significant and will suffice to cover requirements; a formal decision is awaited. Actions from the OSC need to be explicitly addressed in the text, as well as some discussion of UKTO to flag the funding established and how it interfaces with GridPP; DB will discuss with PC and AS. DB also needs to discuss with PG the risks that need to be included. This will be clearer once the other sections are complete. There was some discussion of specific points that should be included in this section and that need input from other members, e.g. delivery in the last period. An issue to be considered at the OSC meeting is whether to formally enquire about GridPP6 funding.

4. Overview
===========
DB circulated an email from PC regarding the Balance of Programme Computing Review panel membership. Concern was raised about the absence of an Experimental Particle Physicist on the panel. This is a potential issue and it was suggested, for example, that DC should perhaps be on the panel, or SL, as chair of the GridPP collaboration board, as a natural arms-length representative on the committee. Dirac is appropriately represented, but aside from AS the committee does not have a mechanism to understand future PP requirements. It is not yet clear precisely what the roles of the committee and its members are – it is not necessarily a representational committee, but there is a questionnaire being constructed for circulation to entities that may have input to computing requirements. The rationale is that GridPP would represent its PP interests and state what the requirements are. It was questioned how other projects would be covered/represented on the committee. Another issue is the interpretation of different areas and the distribution of funds to areas that are not as well understood as they could be within the current panel. The investment being reviewed comes from Tony Medland (e.g. LZ and GridPP, etc) and it was agreed that DB should discuss this with PC, then with Tony for perspective, then draft an email to the committee raising concerns about the lack of Particle Physicists on the panel.
ACTION 658.2: DB will discuss with PC and Tony Medland then draft an email raising concerns about the lack of Particle Physicists on the Balance of Programme Computing Review panel.

5. Status of Other Sections
===========================
Other actions for completing sections: AM is contributing the LHCb section and this is in hand. The T1 contribution from GS and AS will be completed in a couple of days. JC’s section is in progress; he is awaiting T2 reports to include.
ACTION 658.3: AS and GS will update their OSC document sections and specifically address the actions raised by the OSC last time.

6. AOCB
=======
a) GridPP40
There have been various suggestions at different meetings for content and these should be fed to DB.
b) PG will chair the meetings over the next three weeks.

7. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No report submitted.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
No report submitted.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
No report submitted.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
No report submitted.

SI-4 Production Manager’s report (JC)
-------------------------------------
There is a lot of ongoing ops activity but little warranting mention to the PMB except perhaps:

1. The GridPP security team have been chasing sites regarding kernel updates ahead of an SVG deadline this Tuesday.

2. Liverpool sysadmins have had pushback from their site networking team, who have suggested that PP pay for their JISC connection and would like traffic to go through a firewall. The discussion was taken up by Pete Clarke, and other sites have responded on the positioning of storage outside the firewall. The issue was raised with Janet, who will increase bandwidth as needed, so this is being actioned at the appropriate level between Liverpool and Janet.

SI-5 Tier-1 Manager’s Report (GS)
---------------------------------
A brief report covering the last week.

• I reported problems a week ago with the ATLAS Castor instance. This problem continued into the week just gone. It was found that one of the tables in the ATLAS stager Castor database was badly fragmented. Overnight Tuesday to Wednesday last week (22/23 Jan) the ATLAS Castor instance was declared down and this table (the “diskcopy” table and the associated indexes) was rebuilt. During Wednesday the ATLAS Castor instance returned to normal operation. There is a large amount of what is effectively dark data (around 1 Pbyte) to be deleted off the disk servers and this will take some time as a steady background operation.
• Last Wednesday’s planned intervention on one of the BMS (Building Management Systems) in the R89 machine room has been postponed. This system manages the pumps. A test on Tuesday (22nd Jan) with one of the pumps running ‘stand-alone’ failed. During the start-up procedure the remaining pumps ramped up to balance the water flow rate and compensate for the loss of one pump. This caused a power surge to pump 4, which tripped, damaging the fuse carrier. This is now being looked at and a new date for the BMS board swap will be planned.
• The upgrade of Echo to the latest Ceph release (“Luminous”) was carried out successfully during the second half of last week. This was done as a rolling update with the service available throughout.

I was asked to comment on the three metrics flagged at the Tier-1 for CMS in the last (2017Q3) quarterly report:

1.6.2 ‘Timely and efficient availability of resources’ CPU efficiency of 65%, one of three T1s at this level.
At the moment I don’t know how much of this is “RAL specific” and how much a wider CMS problem.

1.6.4 ‘Site availability’ around 93%, again the worst CMS T1 site.
The CMS Castor availability figures for the three months were: 91%, 95%, 96%. My narrative in the quarterly report was:
“At the start of the quarter there was a high rate of SRM tests failures (timeouts) but in the middle of the quarter this improved – although we do not know why. (See narrative). The September figure was brought down by CMSDisk becoming full causing a very high rate of SRM test failures for a couple of days.” The specific causes of (Castor) unavailability have been improved and the Q4 availability is 98%, again brought down by another occasion in October when CMSDisk became full.

1.6.5 ‘Data availability by AAA’ A big difference between local and remote performance is being investigated.
This is almost certainly due to the poor performance of the RAL firewall. Steve Lloyd’s network tests (http://pprc.qmul.ac.uk/~lloyd/gridpp/nettest_lcg.html) show clearly that the data rates seen from our local storage or from the RAL Tier2 to our batch worker nodes are good. However, network rates seen to/from other sites are bad. Steve’s figures for transfer success rates are also poor. There are two changes that should improve this. These are the replacement of the RAL firewall and the move to connect to the LHCONE network.

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
No meeting last week.

SI-7 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
644.3: AS will put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
OS documents MUST be done and submitted to PG this week.
649.1: DB will write Introduction of OS documents. Done.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AM will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
655.2: AS to prepare a report on failure of the generator to come up after a recent issue. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: appropriate dates are being considered with AS). Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. Ongoing.
656.2: DC will report on CPU efficiencies. Ongoing.
656.4: DB will contact external contacts to invite them to attend and/or contribute to GridPP40. Ongoing.
657.1: GS and SL will assess network tests for RAL and report to the PMB. (Update: included in this week’s Tier1 report). Done.
657.2: DC to report on the CMS taskforce. Ongoing.
657.3: PG will provide PC with documents and diagrams relating to the management structure. Ongoing.
657.4: DB will ask AS to invite Alison to join the PMB in the next few weeks for an update. Done.

ACTIONS AS OF 29.01.18
======================
644.3: AS will put together a starting plan for staff ramp-down. (Update: a draft will be produced in January). Ongoing.
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY?) Ongoing.
OS documents MUST be done and submitted to PG this week.
649.2: PC will write Wider Context of OS documents. Ongoing.
649.4: GS and AS will write the Tier1 Status section of OS documents. Ongoing.
649.5: JC will write Deployment Status section of OS documents with input from PG. Ongoing.
649.6: RJ, DC and AM will write LHC section of User Reports in OS documents. Ongoing.
649.7: JC will write Other Experiments section of User Reports in OS documents with input from DC and PG. Ongoing.
655.2: AS to prepare a report on failure of the generator to come up after a recent issue. Ongoing.
655.3: PG to consider the agenda and date for Tier1 review and include disaster recovery plans. (UPDATE: appropriate dates are being considered with AS). Ongoing.
656.1: DK will report before the end of February on any actions GridPP should take to comply with GDPR. Ongoing.
656.2: DC will report on CPU efficiencies. Ongoing.
656.4: DB will contact external contacts to invite them to attend and/or contribute to GridPP40. Ongoing.
657.2: DC to report on the CMS taskforce. Ongoing.
657.3: PG will provide PC with documents and diagrams relating to the management structure. Ongoing.
658.1: AS will discuss with CERN and the Netherlands participants if and where we may contribute to Data Lakes.
658.2: DB will discuss with PC and Tony Medland then draft an email raising concerns about the lack of Particle Physicists on the Balance of Programme Computing Review panel.
658.3: AS and GS will update their OSC document sections and specifically address the actions raised by the OSC last time.