GridPP PMB Meeting 592

GridPP PMB Meeting 592 (21.03.16)
=================================
Present: Dave Britton(Chair), Tony Cass, Jeremy Coles, Tony Doyle, Dave Kelsey, Andrew Sansum, Pete Gronbech (Minutes).

Apologies: David Colling, Roger Jones, Steve Lloyd, Andrew McNab, Pete Clarke, Gareth Smith, Louisa Campbell.

(Minute taker absent; these are notes)

1. Notes on the GridPP36 Agenda
==================
First Session
DB – New beginning, new challenges…(less than 45 mins)
WLCG Workshop summary? (PC, RJ, PG?)

Second Session
Ian Neilson report and update
Report from LHCONE Tony Cass?? At Taipei meeting on T1 & LHCONE Site networking re-engineering. More focus on addressing infrastructure.

DB has been hearing a lot about DMZ from Tony Hey among others. We achieve a lot of that functionality without calling it DMZ.
AS could perhaps think about a talk on this. ESnet document and paper are available.

David Salmon has signed up, should we get him to give us a short update?
There was a meeting in London to discuss the network attack: what happened and how things have changed. DK to follow up with David Salmon.
Duncan T2 and towards NFL (Network Forward Look).

Session 3
This session is about our plans to meet the goals in the proposal. We are obligated to show sites how it is possible to do this with the effort available, but will not dictate how they have to do it.
AM should give a summary/report from the working group for discussion.
DB will suggest this to AM.
Could have an hour on this after Tea, and move some things to the following day.

Session 4
Tom on case study
Sam storage?
AF (LSST?)
Simon or Daniela on Dirac, Mark Slater on Ganga. When is it most appropriate?
A.N. Other for discussion.
Check with Pete Clarke on new users. LSST issues??

Session 5
PG to chase Roger
The session may need more focus rather than three identical talks.

Session 6
PeteG to check with PeteC.
Does it need subdividing, and do we need or want anything else in the session?
Gap in funding around back half, develop funding streams around UK-T0.
Tier-1 strategy/planning.
Email Tony Cass
Try to move the EU-T0 into the first session.
Maybe move networking into session 6.

2. ALICE
==================
How will Birmingham support the disk with only 0.5 FTE?

Need to think about the numbers in advance in terms of how it affects Birmingham.

ACTION: 592.1 PG to send emails with draft deadlines and final deadlines for OC docs.

3. AOCB
=======
None.

4. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Nothing of significance to report.

SI-1 Dissemination Report (SL)
------------------------------
Nothing of significance to report.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing of significance to report.

SI-5 Production Manager’s report (JC)
-------------------------------------
1) There is a request for Tier-1 sites (in the first instance) to install pakiti clients on their production service nodes. This is an attempt to extend the usefulness of central security monitoring tools to help sites.

2) Some sites continue to receive warnings for some of their WNs in relation to an NSS heap buffer overflow vulnerability (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2016-1950); it has a target date of midnight tonight.

3) Within the WLCG ops coordination activity the Multicore and gLExec task forces have been brought to an end. The former has concluded successfully; the latter ends because of a change in WLCG policy.

4) The final WLCG Tier-2 A/R results for February are out: https://espace.cern.ch/WLCG-document-repository/ReliabilityAvailability/2016/february-16/. Before re-computation we saw for the UK:

ALICE (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_ALICE_Jan2016.pdf):

All okay.

ATLAS (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_ATLAS_Jan2016.pdf):

RHUL: 89%:89%
Lancaster: 0%:0%

CMS (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_CMS_Jan2016.pdf):

RALPP: 80%:80%

LHCb (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_LHCB_Jan2016.pdf):

RALPP: 77%:77%

Reasons for the observed figures are:

RALPP: Both the CMS and LHCb low figures are due to specific CMS jobs overloading our SRM head node (as in previous months). They should have stopped sending those jobs now.

Lancaster: The same path-length problem that has affected ATLAS tests in recent months. This has been re-computed.

RHUL: The largest problem was related to the SRM. For reasons explained in January, the DPM version was upgraded; it took several weeks to get it working again, and this impacted February’s results.

5) A retirement campaign is starting for SL5. We have several sites running DPM on SL5 but the nodes will be retired this year and so we are exploring options for allowing those to be decommissioned later. In addition about 4 sites are running other (ad-hoc) services on SL5 and these will need to be removed.

6) Ewan MacMahon ended his role in Oxford physics last week and therefore with GridPP. I wanted to thank him for all his very useful contributions to GridPP operations over the years and to wish him the best of luck in his new role.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
General:
– Another round of Security patching and rebooting took place last week.

Castor:
– Testing of version 2.1.15 has proceeded. However, a problem with the memory requirements of the Oracle database when running Castor 2.1.15 has been uncovered. Work is ongoing to understand this and we are in contact with both CERN and Oracle.
– A total of four disk servers have crashed in the last fortnight (two for LHCb, one CMS, one GEN). Two of these were from the Clustervision ’11 batch of servers. A further server (Castor GEN instance) reported hardware errors and was taken out of service for BIOS/firmware updates. We will be updating RAID card firmware in eight Clustervision ’11 servers on Monday (21st); these are the CV’11 servers in D1T0 service classes that have not yet had this update.

Networking:
– There was a problem with packet loss within part of the Tier1 network. This did not seem to affect operations. For example, there was no packet loss between disk servers and worker nodes, but there was when accessing the worker nodes from staff desktop systems. The problem was first noticed on Tuesday 1st March. It was finally resolved the following Tuesday (8th March), having been traced to a component in OpenStack, which is being used to develop new cloud infrastructure.

Batch:
– Nothing particular to report apart from a rolling sequence of drains and reboots to pick up the latest security update during this last week.

Procurement – Deliveries.
– XMA CPU tranche – delivered.
– XMA Disk – First part today (Friday 18th); Remainder before Easter.
– HP CPU tranche – 1/4 already here, next quarter before Easter. Remaining half in the first two days after bank holiday Monday.

Availabilities for February.
These figures were:
Alice: 100%
Atlas: 99%
CMS: 100%
LHCb: 99%
OPS: 99%
The main cause of lack of availability was a stop of Castor on the 23rd February for a Security update requiring a reboot of all systems.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
587.2: AM will invite selected small, medium and large sites to contribute presentations at GridPP36 on their plans for site evolution over the next few years and construct a session around this. Ongoing.

588.4: ALL to inform PG of any new roles and other items that need to be inserted into different categories and grants on Researchfish so that he can ensure all are included and circulate to PMB to check. Done

591.1: DK will ask Duncan to collate information from Tier2 sites for the Network & Security session summary. Done

591.2: PG or LC will re-set the size for the Dell logo on Indico which is currently too large. Done

591.3: PC will write to Ian Fuller stating PIs cannot load individual publications into the current Researchfish and advise that each PI will upload a single submission pointing towards PG submission for all GridPP grants. He will also enquire whether they can be marked for no report required and advise SL what to tell the CB. Done

591.4: PG to collate information for inclusion in OSC Financial Report.

591.5: ALL to contribute to the OSC Project Status Report.

591.6: DB to contribute Introduction and International Context for OSC Report.

591.7: PG to contribute Summary of GridPP Status for OSC Report.

591.8: PG to contribute Discussion of Risk Register for OSC Report.

591.9: GS and AS to contribute Tier-1 Status Report for OSC Report.

591.10: JC to contribute Deployment Status for OSC Report.

591.11: RJ to contribute ATLAS User Report for OSC Report.

591.12: DC to contribute LHCb User Report for OSC Report.

591.13: SL to coordinate with Tom Whittle to contribute Impact and Dissemination Report for OSC Report.

591.14: AS to consider how to model a proposal for short-term temporary sign-ins for new users to access the Grid. AS has started discussing with Ian Collier.

ACTIONS AS OF 21.03.16
======================

587.2: AM will invite selected small, medium and large sites to contribute presentations at GridPP36 on their plans for site evolution over the next few years and construct a session around this. Ongoing.

591.4: PG to collate information for inclusion in OSC Financial Report. Ongoing.

591.5: ALL to contribute to the OSC Project Status Report. Ongoing.

591.6: DB to contribute Introduction and International Context for OSC Report. Ongoing.

591.7: PG to contribute Summary of GridPP Status for OSC Report. Ongoing.

591.8: PG to contribute Discussion of Risk Register for OSC Report. Ongoing.

591.9: GS and AS to contribute Tier-1 Status Report for OSC Report. Ongoing.

591.10: JC to contribute Deployment Status for OSC Report. Ongoing.

591.11: RJ to contribute ATLAS User Report for OSC Report. Ongoing.

591.12: DC to contribute LHCb User Report for OSC Report. Ongoing.

591.13: SL to coordinate with Tom Whittle to contribute Impact and Dissemination Report for OSC Report. Ongoing.

591.14: AS to consider how to model a proposal for short-term temporary sign-ins for new users to access the Grid. AS has started discussing with Ian Collier. Ongoing.

592.1: PG to send emails with draft deadlines and final deadlines for OC docs.