GridPP PMB Meeting 571

GridPP PMB Meeting 571 (10.08.15)
=============================
Present: Dave Britton (Chair), Pete Gronbech (Minutes), Andrew McNab, Dave Kelsey, Tony Cass, Tony Doyle, Jeremy Coles, Pete Clarke, Roger Jones, Gareth Smith

1. GridPP5 Status
=======================
We know no more than what is in the letter sent by STFC, which was forwarded to the PMB. It was very similar to the letters sent to the group leaders about the consolidated grants. STFC are not guaranteeing that they will issue the grants before October as planned, which means we run into the redundancy letters problem. STFC do have a financial plan, which appears to be sensible but has to be formally accepted (Council in September). The GridPP collaboration meeting in September will give us the opportunity to tell people that they may get redundancy notices but that we expect the funding to come along. CERN Council will decide on the rebate in September. The letter to the PIs re the CG also mentions that the funding is uncertain pending the CSR outcome. The plan is based on what they expect to get for capital expenditure, but they do not actually have the figures yet. The CSR outcome will be published on 25?th November. The immediate problem is a cash flow issue due to the software problem.
Despite all the uncertainty, Paragraph 2 endorses the GridPP5 program in principle at 90%, ‘but with award levels consistent with normalised levels’ across the program. This will affect academic time (i.e. DB), FEC and travel.
DB spoke to TM and asked if we could have specifics and numbers. The starting point is our 90% scenario. A meeting with TM, Sarah Verth, DB and maybe PG may be arranged.
Beyond GridPP5, the experiment support posts may have to move into the experiments’ remit. We must argue that they should not be removed, as they are not currently in the experiments’ plans. There is already a clear gap here: there is nobody funded to write/deploy experiment software. The experiment support posts help deploy it from the Grid side. The name should be changed from “support” or “liaison”. The GridPP review committee were supportive of the posts.
PC: We played a very good card game with this. Credit to DB. Only problem was calling these posts experiment liaison.
RJ pointed out that the CG could not cover the experiment posts, as they are not based at CG universities; they are at RAL.
PPGP wanted to tension more posts against CG posts.
If it is brought up then we can say they were mislabelled, but we cannot change it for this time round. We got them added on previously, so they will look like extras.

2. WLCG Pledges
=======================
Need to work out if we have sufficient resources at the Tier-2s (in particular). Look at REBUS for the current experimental global requests.
ACTION 571.1
PG to look at this. Look at Steve’s spreadsheet too. Do it this week.

3. GridPP35 Agenda
=======================
DB thanked PC for all his work on the agenda.
Missing a few talks.
Should rename one talk to ‘Latest results from LHC’ rather than ‘from run 2’ as Pentaquark is from run 1 data.
Dave Colling had not responded to his action on who will talk for LZ. DC replied that, depending on timing, it will be Alastair Curry or DC on his behalf.
PC has failed to speak to Paul Alexander; maybe we should pull this talk.
Tony Price is intending to get things working in time for the meeting.
JC suggested Lattice QCD talk? Craig from Plymouth can come. JC can get in touch.
Can we prime these people to give us a list of questions as to what they need to do, know etc and have a discussion session?
Talks may be too long; we should guide speakers to tell us about themselves for 12 mins, followed by 8 mins of discussion: questions, problems, what we can do to help.
What do we want out of this?

ACTION 571.2
JC to speak to Frederic
ACTION 571.3
DC to talk to LUX-Zeplin

ACTION 571.4
PC to talk to LSST – They should have one slide at the end to say these are the issues they want to discuss. First day is politics and non HEP, next day a lot on networking… but currently lacking solid technical talks.
ACTION 571.5
On AS/GS: need to know who is speaking on CEPH. DK took note.
AM is also unsure what we can say that we did not say last time.
PG suggested asking in the OPS meeting what people on the ground thought we should be talking about.
CEPH: DB wants a formal status report so that we (PMB) can decide whether to endorse that course of action.
PC is worried about the format of the other VOs’ talks as we could run out of time; maybe we should collect the questions for the end so it can be managed better.
DB agreed with PG that we have a huge challenge to work out how we are going to deal with sites with 0.5 FTE, but is not sure exactly what topics need to be covered.
Talk from Edinburgh on getting MC jobs running on Archer.
‘Data Centred’ people in Greater Manchester, Andrew McNab could get them to speak.
They are offering space for free.
LHCONE meeting end of October 28/29. Would be good to have some progress by then.
Have run technical tests.
In the past we have had site reports but they got repetitive. We have had reports from experiments but that’s sort of covered.
Perhaps build in more discussion talks.
JC suggested we could have a site status table on different issues such as IPv6, glexec and multicore.
Some of the tables exist and have been discussed in Ops meeting.
If there are things that are stuck how can we help?

4. Quarterly Reports
=======================
PG ran through his report on the Q115 reports. The main issue was that many of them had been delayed partly due to work on the GridPP5 proposal. A full Tier-1 report is still outstanding.
This month’s reports are due in, and not many have arrived yet.
ACTION 571.6
PMB members to get their reports in by the next PMB meeting in two weeks.

5. NI Tier2
=======================
Sue Foffano gave a heads up about this. Then we heard they were not going to go ahead. We thought it was a pity as they may be able to set up a lightweight Tier-2.
What’s in it for us? Does it help us? Does it help the experiments? If they really want to do it we would help but maybe they were not that keen.
We believe the VC was visiting CERN, and it could have been a low cost way to get involved.
PG cautioned that VAC is well suited to LHCb and ATLAS but perhaps not to other VOs; also, if we offer to help and find it difficult, it could cause reputational damage.
DC & DB agreed but felt these are problems we need to overcome.
DB could make direct contact with the contact at QUB. Would prefer to make contacts with the scientist behind this to see what they are actually trying to do.
Other Things:
ACTION 571.7
Gareth to update us on the network issues at RAL
Safe AAI work: Agenda is in an email PC sent round. History of this:
A long time ago, when we discussed SAFE with Jeremy Yates, the view was that SAFE is hopeless and that we should have a joint project to replace it. Over time SAFE has become more than one thing. It creates a database of usernames and passwords so that users can sign on to multiple systems, and it does some accounting. Parts of this have become important. So now the suggestion is that SAFE should become a community tool.
Need to get to the bottom of this. DC thinks it is the authentication part that is being talked about.
Seem to be reinventing structures that we already have with our certificates. DC worried that the stuff they reinvent will not be compatible with what we have.
PC will be at the meeting, so can ask questions. In the end whatever is done has to be aligned with AARC etc. Would like Jens to be there. DK: ‘demonstrate easy integration with WLCG’? What does that mean? The meeting is on Thursday at 2pm at UCL.

6. AOCB
=======================
GridPP36 confirmed at Pitlochry.
SKA meeting on 12th August: SKA is very interesting and we would like to be involved in the computing. We had a high level meeting in London which went well. They may be behind in planning for data centres. We agreed to get together for a technical meeting, to see which of the things that are out there could be useful. The meeting got cancelled because the SKA Board meeting clashed. A GAP analysis recently showed up many things that needed addressing. There has been some fallout from Italy not getting the headquarters. PC: from our point of view we are ready and keen. He feels we have nothing to lose. DC has also heard some info that shows they are ‘a little behind’.
Cliff Brereton has resigned at Hartree. The new interim director is Peter Allen (an STFC person), an ex-astronomer. It sounds like he might be more interested in the scientific community, maybe for cycle stealing etc.
DB and DK thought he may be an interim posting. (He is, according to the Hartree web site.)

STANDING ITEMS
==============
SI-0 Report from Development (Cloud) Group
===========================================
DC noted nothing major to report.

SI-1 Dissemination Report
===========================================
No Report

SI-2 ATLAS weekly review & plans
===========================================
ATLAS have obviously seen the RAL networking issues. We also have patchy use at some sites; this is sometimes on the site side, sometimes misconfigurations. Nothing systematic.
There is a summary of the original network intervention at the T1 in https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2015-08-05.

SI-3 CMS weekly review & plans
===========================================
DC noted nothing major to report.

SI-4 LHCb weekly review & plans
===========================================
AM noted nothing major to report.

SI-5 Production Manager’s Report
===========================================
It has been a relatively quiet period recently. Some recent updates that may be of interest:

1. July was a very good month for our T2 availability and no site fell below the WLCG target of 90% in either availability or reliability.
ALICE (http://wlcg-sam.cern.ch/reports/2015/201507/wlcg/WLCG_All_Sites_ALICE_Jul2015.pdf):
All okay.

ATLAS (http://wlcg-sam.cern.ch/reports/2015/201507/wlcg/WLCG_All_Sites_ATLAS_Jul2015.pdf):
All okay.

CMS (http://wlcg-sam.cern.ch/reports/2015/201507/wlcg/WLCG_All_Sites_CMS_Jul2015.pdf):
All okay.

LHCb (http://wlcg-sam.cern.ch/reports/2015/201507/wlcg/WLCG_All_Sites_LHCB_Jul2015.pdf):
All okay.

The UK easily meets the EGI NGI targets. Typically 6 NGIs a month fail to meet the targets.

2. There is now a WLCG operations coordination website with important links and areas such as articles for sysadmins to share information: http://wlcg-ops.web.cern.ch.

3.a. Linda recently presented to the EGI OMB on changes to the Security Vulnerability Group procedures. These have been adapted to cater for wider deployed software stacks (including checklists for VM endorsers and operators for example), changing roles and collaboration with the EGI CSIRT.

3.b. In recent weeks there has been one EGI SVG Advisory ‘Critical’ risk announced for libuser local root exploit CVE-2015-3245. As the exploit requires local access there are only a limited number of nodes affected. Sites have patched as possible/requested.

4. A power cut at Imperial College led to an outage of the GridPP DIRAC server and their top-BDII for a period on 22nd July. On DIRAC, because the server uses a multi-VO proxy, a decision to implement glexec for pilots means that this now has to be enabled by all sites for all small VOs. An assessment/test of the steps is underway, along with a discussion.

5. ‘gridpp.ac.uk’ CVMFS spaces are generally being decommissioned so that VOs can benefit from more EGI aligned ‘egi.eu’ CVMFS spaces.

6. Machine/Job features work is becoming a focus area again. IC are running this now for LHCb and Oxford/Brunel are starting an HTC deployment.

7. Within ops we have recently reviewed our IPv6 progress. The summary is in https://www.gridpp.ac.uk/wiki/IPv6_site_status. As before there have been some positive changes (for example more dual stack nodes deployed), but progress in many areas (particularly campus allocations) is slow and waiting on institute upgrades.

8. Weekly ops meetings have continued to have a non-LHC VO review each week. In many cases progress is held up by the availability of those on the VO side. A short summary is:

– DiRAC: Proxy delegation has been the main issue (getting renewals through). Documentation does not appear to be present for the problem being tackled. The work has led to improved security in delegation. There are ongoing tests with performance for parallel transfers – a limit of 2000 files has been found.

– LIGO: Understanding the CVMFS repo to use has caused delays. The needed components have now been added to CVMFS and the latest release of their software repo has been published. The next goal is to set up a proper CERNVM machine to be able to upload files using DIRAC.

– LOFAR: There was a Tier-1 meeting at the end of July. Waiting on minutes.

– LSST: Repo has been replicated on CVMFS Stratum-1 (we now have access to lsst.opensciencegrid.org stratum-0 – it is at FNAL… so far the repo in Lyon is not usable by us).

– LZ: Waited on clarifications from VO. Data centre will be at Imperial. Sheffield and Edinburgh to be T2 sites, others welcome to contribute to MC jobs.

– UKQCD: Had issues with large output files swamping WMSes. Jobs now running. Main user has recently been away – now back and will take over the UKQCD VOAdmin role.

– UCLan/GalDyn: Seeing steady work flows. GalDyn used Northgrid to run 24,000 jobs for Q2 2015.

– PRaVDA: Core people have been away. Plan is to have work running across several sites by the end of August.

– GHOST: Rapid progress. Jobs running at Liverpool. Setting up a CVMFS area.

9. A course for sysadmins and users on DIRAC/ganga has been suggested for the autumn.

Multi VO proxy issue, need to deploy glexec at all sites.
Proxy delegation issues with DIRAC.
Performance issue. Number of parallel transfers limit. AS somewhat disappointed at how difficult it’s proving to be to do the transfers for DiRAC.

SI-6 Tier-1 Manager’s Report
===========================================
Operations:
– Main issue was the network problems of the last week or so followed by the (previously planned) intervention by the engineer from Extreme Networks on Tuesday (4th). We have been running with the resilient pair of routers since then. I circulated details in a separate e-mail.
– Disk server deployments: Since the last meeting three disk servers have been added in for each of CMS and LHCb. An increase of 360TB for each VO. There are no further disk server deployments pending.
– Ongoing issues with performance of Castor for CMS batch being worked on as well as an intermittent, low-level, packet loss.

Staffing:
– Andrew has reduced his Tier1 time down to 20% (from 50%)
– Matthew Viljoen has left. (He was 60% Tier1.)
– Carmine Cioffi (one of the DBAs) will be leaving early September. (He is 60% Tier1.)
Plans for recruitment awaiting GridPP5.

Purchasing:
– Tender requirement documents are being prepared.

Further update from Catalin:
There was a problem with accessing DNS on Saturday morning at ~07:30. The Tier-1 was unavailable from 08:30 to 10:00, and the RAL site security people were also affected (monitoring cameras were not working).
GOCDB was unavailable for longer, as was the RAL email service.

What Network@RAL confirmed:
Apparently the core switch rebooted at approx 07:20 on Saturday morning. This set off a cascade effect. Eventually the problem was traced to card 3 on router A. A member of the network team came in and reset the card on router A. The system settled down and was working normally by about 10:00. Email services continued to have poor performance until the afternoon, when the Exchange system was restarted.

SI-7 LCG Management Board Report
===========================================
No management report.

REVIEW OF ACTIONS
Postponed until next time

ACTIONS AS OF 10.08.15
======================
533.3 DK to draft a preliminary response for any future security incidents where the press might contact GridPP (Tom Whyntie should be briefed on this as well in case he is contacted direct).

567.1 DC to report back on CMS RAL CPU efficiency drop in April & May.

568.1 DC to investigate the Open Compute Project and report to the PMB with information about cost-savings, risks, and capital required.

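571.1 PG to look at whether we have sufficient resources at the Tier-2s for the WLCG pledges (check REBUS for the current experiment requests). Look at Steve’s spreadsheet too. Do it this week.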
571.2 JC to speak to Frederic re GridPP35 talk
571.3 DC to talk to LUX-Zeplin re GridPP35 talk
571.4 PC to talk to LSST re GridPP35 talk
571.5 On RAL: need to know who is speaking on CEPH. DK took note.
571.6 PMB members get their quarterly reports in by the next PMB meeting in two weeks.
571.7 Gareth to update us on the network issues at RAL