GridPP PMB Meeting 631 (02.05.17)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Andrew McNab, Andrew Sansum, Louisa Campbell (Minutes).
Apologies: Tony Cass, Dave Kelsey, Steve Lloyd, Gareth Smith.

1. xyz.gridpp.ac.uk domains (See email from AM 20/4/17)
=======================================================
SL raised this last week and AM commented stating there is no technical issue and it seems a good idea. We also have hep.ac.uk as a potential alternative.

2. LHCb Castor
==============
LHCb Castor rollback is also covered in the Tier-1 report, below. In summary, following a Castor upgrade and the recent SRM upgrade, LHCb work patterns changed, and these factors combined to push Castor over the edge. The SRM upgrade was rolled back but this did not resolve the issue. There have been discussions with CERN, who conclude this may relate to a known bug in this version of Castor. Phillippe escalated this last week, noting that it was inadequate for LHCb requirements. Rob responded acknowledging this and advising that an upgrade was planned for this week, though it has become clear that this timetable was too aggressive, with insufficient time for testing. The upgrade will be rescheduled asap, probably prior to a CERN trip next Tuesday. CERN cannot assist prior to this upgrade because the older version of Castor is unsupported. AM has had confirmation from Phillippe that it has been running well since c. 11pm last night. Work is still being done to identify the exact cause of the issue, but it is hoped it will run adequately until the upgrade takes place.

AS confirmed Phillippe appears satisfied that the problem is being appropriately addressed and DB expressed thanks to Rob for dealing with the issue, including the regrettable cancelling of his planned trip to HEPIX.

3. 2016 Experiment Review
=========================
PG circulated a spreadsheet on the experiments’ use of GridPP resources in 2016, as he has done previously. Metrics are collated from the year’s quarterly reports and compare pledges at Tier-2 and Tier-1 with what was delivered. A comparison is then made between what would have been delivered at pledge and our traditional over-delivery. There were no real surprises; PG provided a brief summary of the results and ran through some examples to clarify the figures. This document can be shared at the OC if requested. The column numbers on the ATLAS spreadsheet differ from those of the corresponding columns on the CMS spreadsheet, but the columns provide the same information. DB suggested changing the column titles to ‘pledged’ and ‘used’ or similar, and perhaps adding a high-level summary spreadsheet pulling out the relevant numbers for the OC. PG will create this.

ACTION 631.1: PG will create a summary spreadsheet of the 2016 Experiment Review figures to extract important figures for the OC.

4. Quarterly Reports
====================
Quarterly reports: one Q416 report is still awaited; it has been finished and will shortly be sent to PG to summarise. PG is awaiting the Q117 reports: he has received 2 and will send out a reminder, as they need to be submitted within the next 2 weeks to allow for the extraction of information for the OC.
ACTION 631.2: ALL to work on OC documents for submission by end May.

5. OC Docs
==========
The OC meeting is scheduled for 16 June and documents have to be submitted by end May, so they need to be written between now and then. PG attached the agenda from the last status report in November; each experiment needs to update this to cover the period from November to now. This and other documents produced by PG require updating, as do other items discussed at the F2F in Sussex.

6. AOCB
=======
a) Date of next PMB meeting – DB is flying from Gatwick to Catania on Monday 8th May at 2.30 for the EGI conference and may join remotely; PG will chair.
b) GridPP39 – RJ confirmed provisional bookings for accommodation and lecture venues at Lancaster for 14-15 September. DB will announce GridPP39 on UPHEPGRID.
ACTION 631.3: DB will announce GridPP39 on UPHEPGRID.

7. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
A brief meeting took place last Friday. AM could not attend due to travel, but Glasgow reached 900 VMs with the VAC setup, which is the largest number ever. It is positive to confirm that it scales to that level. Memory balancing is currently being considered for ATLAS; the containers route previously discussed is potentially the best solution. A software container-based solution should be available for testing very soon, which will allow ATLAS to test it and provide feedback. There was some discussion on how the VAC system operates.
DC updated on the very brief technical group meeting and posted a link to the minutes for the PMB to consider. He noted in particular there will be a presentation from Ian on future certificates.

SI-1 Dissemination Report (SL)
——————————
SL was not present and no report was submitted.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
CMS have crashed EOS due to a bug in the ESI, which means it can become unstable if there is an increase in traffic; however, there is a patch and there are methods of mitigation. Work is ongoing to determine why RAL performance was less efficient than that of other sites; there seem to be differences between jobs reading data locally and non-locally, and other anomalies, all of which are being investigated. DC will update on this. At the RRB, CMS efficiency overall was noted to be below that of other experiments. Issues such as multi-core jobs, moving from the pilot regime etc. are known and are being looked at, as the RAL issue is pulling down wider CMS stats.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
Nothing of significance to report.

SI-5 Production Manager’s report (JC)
————————————-
1. Holidays have meant many people being away and therefore there is not much to report from an operations perspective. It has also meant our GGUS tickets increasing from the low thirties back up to around 42 today (some of those relate to issues with a recent perfSONAR update, the underlying causes of which are not fully understood at present but are being discussed).

2. In today’s ops meeting I was asked to confirm that GridPP Travel could be used for registering for, and travelling to, the Manchester WLCG workshop: https://indico.cern.ch/event/609911/. Registration costs £150. DB confirmed this would be acceptable and encouraged staff to submit visit requests in advance.

3. EGI have been following up with NGIs about WMS usage as they want to begin a decommissioning campaign later in the summer. In the UK we have MICE and T2K using the WMSes, but MICE have now been set up for GridPP DIRAC and T2K has only had low usage by a user rerunning scripts from last year, so we are unsure of their future activity but will suggest DIRAC.

4. We are, along with all NGIs, currently reviewing GOCDB content for all sites.

5. We received a while ago the T2 R&A update for March. Looking at the results per LHC experiment:

a. ALICE: http://wlcg-sam.cern.ch/reports/2017/201703/wlcg/WLCG_All_Sites_ALICE_Mar2017.pdf

All okay.

b. ATLAS: http://wlcg-sam.cern.ch/reports/2017/201703/wlcg/WLCG_All_Sites_ATLAS_Mar2017.pdf

Glasgow: 86%:86%
Birmingham: 85%:85%
Oxford: 89%:89%

c. CMS: http://wlcg-sam.cern.ch/reports/2017/201703/wlcg/WLCG_All_Sites_CMS_Mar2017.pdf

All okay.

d. LHCb: http://wlcg-sam.cern.ch/reports/2017/201703/wlcg/WLCG_All_Sites_LHCB_Mar2017.pdf

QMUL: 82%:82%
Glasgow: 85%:85%

Site follow-up explanations are:

Glasgow: Glasgow experienced a power cut which led to problems with the chilled water supply, meaning air conditioning was unavailable in the main machine room for approx. 2 days. On restart a corrupt DPM database needed to be addressed. Sam was able to unpick and resolve the issue.

Birmingham: DHCP server went crazy after switching network interfaces. Took a few days to recover.

Oxford: Httpd crashed on a few WNs. https://ggus.eu/index.php?mode=ticket_info&ticket_id=126928. In addition a WN offlined itself without an obvious cause – this appears to have affected a few sites.

QMUL: LHCb saw some network issues with the QM link.

SI-6 Tier-1 Manager’s Report (GS)
———————————
Castor:
The main problems have been with Castor – particularly for LHCb. I did circulate some information about this in e-mails to keep you aware that we had a significant problem. We had upgraded the SRMs for LHCb and then Atlas. The upgrade was from the very old version SRM 2.11 to the current major version, 2.16. The OS was upgraded from SL5 to SL6 at the same time.
– To re-cap for Castor LHCb: Following the upgrade to the SRMs LHCb experienced significant problems. The error rate seen was such that their stripping/merging campaign stalled as far as data at RAL was concerned. Shortly before Easter the SRM upgrade was reverted for LHCb – and this made a large improvement. Since then LHCb have been able to carry on with their work. Having said that, there are still problems that require the SRMs to be restarted occasionally and throughput is not adequate. At least one bug has been identified that is fixed in Castor 2.1.16 (we are at 2.1.15). Work is underway testing Castor 2.1.16 and the plan is to move to Castor 2.1.16 (with the SRMs re-upgraded to 2.1.16) as soon as possible.
It should be noted that the SRM upgrade was also carried out for Atlas – who have not experienced such major problems since. It was not done for the CMS and GEN Castor instances.
– We are also seeing problems with the SAM tests for CMS and availability is very poor. CMS have been experimenting with disabling the ‘lazy download’ feature they use to try and improve Castor access.
– Atlas areas on Castor have filled up – causing test failures.

ECHO:
– The ‘oncall’ arrangements for ECHO are being piloted – and have been in place for a few weeks now. There now is around 1.5PByte of Atlas data in ECHO.

IPv6:
– Our IPv6 addressing scheme has been agreed and the perfSONAR nodes on our production network are now running tests over IPv6.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
The meeting was cancelled.

SI-8 External Contexts (PC)
———————————
PC noted an outside chance that some funds may become available; he will provide details if and when they do. AS noted a recent submission of equivalent capital requests to STFC, which is a routine process for UKTO, DMZ and others.

REVIEW OF ACTIONS
=================
628.4: LC will check availability w/c 25th September at Lancaster (27-29th) for GridPP39. (Update: RJ has contacted the Lancaster conference team; 13th-15th September has now been booked, with the PMB on the 13th.) Done.
NEW ACTION: LC will check 16-18th April for GridPP40 at Durham (beginning of the week – CHEP is later in the week – and wait to see when the IOP meeting is). Ongoing.

630.1: AS and PG will commence planning and modelling for OSC documents and couple to plans and decisions on Tier-2 funding (2019-20). Ongoing.

630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.

630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.

630.4: RJ to provide a statement from ATLAS on the importance of the Frontier Service. Done.

630.5: PG will consider high level qualitative milestones required for network developments. Done.

ACTIONS AS OF 02.05.17
======================
NEW ACTION: LC will check 16-18th April for GridPP40 at Durham (beginning of the week – CHEP is later in the week – and wait to see when the IOP meeting is). Ongoing.

630.1: AS and PG will commence planning and modelling for OSC documents and couple to plans and decisions on Tier-2 funding (2019-20). Ongoing.

630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.

630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.

631.1: PG will create a summary spreadsheet of the 2016 Experiment Review figures to extract important figures for the OC.

631.2: ALL to work on OC documents for submission by end May.

631.3: DB will announce GridPP39 on UPHEPGRID.