GridPP PMB Meeting 617

GridPP PMB Meeting 617 (12.12.16)
=================================
Present: Dave Britton (Chair), Tony Cass, Jeremy Coles, David Colling, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Pete Clarke, Tony Doyle

1. CMS tape request
===================
DC advised that CMS is very short of tape and that any substantial additional capacity, particularly at the Tier-1, would be very useful. The pledge for CMS from April 2017 increases from 8 PB to 12.7 PB. It was agreed that up to 5 PB could be allocated to CMS early if it is available, as an advance on next year’s request, not in addition to it.

2. What hardware should people buy?
===================================
This was the main topic of discussion at last week’s technical meeting. DB circulated a draft Tier-2 Evolution document to establish the long-term architecture of Tier-2s in the UK and what they should be aiming for by the end of next year. This requires more input from the experiments; RJ, DC and AS are communicating separately with DB on this. AM commented on specifying the technology and how we envisage the storage being accessed. Institutes will receive money in the near future, so agreement on the way forward is needed, with guidance provided for the forthcoming purchase. Middle-rung sites (e.g. sites with ~1 PB storage) should have guidance on how to evolve over the next 4-5 years. It was mentioned that disk lasts 4-5 years, and there was discussion on whether it will be required for caches, supporting other VOs etc. CMS is not concerned about where storage is located so long as it is visible. DK will comment on this next week.
ACTION 617.1: ALL to review and comment on the Tier-2 Evolution document this week to agree a final version next week.
ACTION 617.2: DC will append a statement to the Tier-2 Evolution document on CMS requirements.

3. Quarterly Report Summary
===========================
(Appendix I)
PG summarised highlights from the emailed reports. These are the first reports to use the new metrics, which are only slightly amended from the previous set.
Tier-1 all metrics green. One member has left the resource team leaving a slight staff shortage this quarter. DB has separately expressed concern over the Tier-1 manager position and AS is addressing this as a priority. The change to the purchasing plan caused some concerns over pricing.
ATLAS – mostly green, with one amber metric on the data for Tier-1, which was 20% below target. RJ commented on the time spent on Tier-1 issues and would like a plan to be put in place; this needs to be addressed separately. AS mentioned a meeting today on the Ceph project and agreed that the situation, which has been ongoing for c. 6 months, is not sustainable. Alistair has been increasing the effort and resource devoted to the Ceph project and a plan will be in place very soon. In the short term Alistair is the linchpin, but over time this will settle and enable him to move away. RJ will consider and articulate priorities.
CMS – 2 metrics are amber – one data provider and a cap on the number of jobs able to run at the Tier-1, but these are known issues. At Tier-2s there were some failed production jobs spread across sites.
LHCb – almost all green; Glasgow was brought on as the 6th LHCb site this quarter.
Other experiments are almost all green; ALICE had increased. MoEDAL and IceCube are now running jobs at Queen Mary and Manchester, along with other sites running jobs on the grid.

Operations – the fraction of available HEPSPEC in use was 71% (target 80%). DB asked whether the unused fraction reflected job inefficiency or idle CPUs; this may be unclear because of the way the information is acquired. Overall it is reasonable to treat 80% as the target, but if there are idle CPUs that other sites could use, it could be more efficient to do so. JC noted that there is probably a mix of both: efficiency has increased, but some sites were still not fully utilised for portions of the quarter. The latter needs to be addressed to improve efficiency at those sites.
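
DB’s question about whether the shortfall comes from idle CPUs or from inefficient jobs can be framed as a product of two factors. The sketch below is illustrative only: it assumes utilisation can be written as slot occupancy multiplied by CPU efficiency, and the function names and example split are hypothetical, not figures from the quarterly reports.

```python
def utilisation(occupancy: float, cpu_efficiency: float) -> float:
    """Return the fraction of available HEPSPEC delivered as CPU work.

    occupancy      -- fraction of job slots that were filled (hypothetical input)
    cpu_efficiency -- CPU time / wall time for the jobs that ran (hypothetical input)
    """
    for name, value in (("occupancy", occupancy), ("cpu_efficiency", cpu_efficiency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return occupancy * cpu_efficiency


def occupancy_needed(target: float, cpu_efficiency: float) -> float:
    """Occupancy required to reach `target` utilisation at a given efficiency."""
    if not 0.0 < cpu_efficiency <= 1.0:
        raise ValueError("cpu_efficiency must lie in (0, 1]")
    return target / cpu_efficiency


if __name__ == "__main__":
    # Made-up split, for illustration only.
    print(f"utilisation: {utilisation(0.80, 0.90):.2f}")
    print(f"occupancy needed for 80%: {occupancy_needed(0.80, 0.90):.3f}")
```

Separating the two factors in this way would show whether a site below target needs more work sent to it (low occupancy) or has jobs that wait on I/O (low CPU efficiency).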

Various minor issues at the Tier-2 sites were noted.

Data group – overall this seems fine, but Jens noted a concern about a definition of the Science DMZ, which it was felt GridPP already provides. DB noted this is a helpful comment at a high level. Jens also commented on the divergence of storage interfaces and on a longstanding ticket against the Tier-1 that has been delayed by a shortage of staff. AM noted a fix in the latest version of CASTOR which pushes it into the forward and is awaiting a downtime for upgrade scheduled for mid-January. GS noted that a separate component for bespoke work needs some attention.
Security appears normal, lots of meetings being attended.
Experiment support – GANGA developers have now stopped working at Imperial, as their effort was spent at the start of the project and there is no longer any funding for GANGA in GridPP; this will be looked at in the future. DB asked what would be required to support GANGA in future without developing new features. There are several features the developers would like to work on, but this is not possible without funding. DB enquired about demand and support for these features and whether summer student projects could help. PG confirmed Mark Smith may be able to assist somewhat but is not directly involved in the project. In the context of TW’s imminent departure, PG will ask Ulrich what is required to take this forward. A great deal of what TW has worked up in the user guide is integral to GANGA.

ACTION 617.3: RJ will establish a priority order for resources to address issues arising.

ACTION 617.4: JC will document what sites and periods CPU is idle and could be used elsewhere and will summarise in an email to the PMB.

ACTION 617.5: PG will discuss with Ulrich requirements for GANGA going forward and report back to the PMB.

4. CyberUK 2017
===============
AS circulated an email and enquired whether someone from GridPP, WLCG or EGI could attend and discuss the security of infrastructure, as it is government funded and supported. There is a call for abstracts, though the event may be looking more towards industry. AS suggested, from experience of Ministerial visits and of how important cyber security is, that we could serve as an exemplar of good practice. Perhaps someone from JISC could cover the redesign of the Janet network to address previously experienced attacks; Romain from WLCG may be a possibility if we decide to actively pursue this. DK will consider this further. CERN is a good and high-profile example of an integrated system, as are the testing and security issues. It is undesirable for inappropriate guidance to be imposed without our input and without the complexity of these issues being demonstrated. TC will mention this to Romain to determine what, if any, input we could have here.

ACTION 617.6: TC will discuss with Romain to consider submitting an abstract for CyberUK 2017.

5. AOCB
=======
a) DB raised an email from Daniella regarding the non-LHC job level and her comment that we are saturated at the 10% level (4,000 jobs running and 50,000 jobs waiting). SL clarified that these are just DIRAC numbers, not the full GridPP resources, so this is not a huge concern. The solution may be to allocate more of the other cores to DIRAC. It is a mixture of groups that recently signed up to use the Grid. PG will liaise with Daniella to determine how to resolve this and set a balance.
ACTION 617.7: SL will look into possible saturation at 10% level for non-LHC jobs and determine if more resources should be allocated.

b) OSC – positive feedback with no particular actions on GridPP.

6. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No report submitted. AM noted we are running jobs on ALICE; in principle this means we can use Liverpool for ALICE and they will not need to set up a VO box. AM is discussing whether the VO boxes can be set up at CERN. This takes pressure off Birmingham as the sole provider of ALICE cycles. VAC will be re-installed, which is a very positive thing.

SI-1 Dissemination Report (SL)
------------------------------
## GridPP Engagement Officer Notes for PMB

### 15 years of GridPP
Thanks to Andrew McNab we have a news item marking 15 years of the Grid!

Fifteen years of the Grid

which means, of course, that now The Grid is old enough to watch The Matrix.

### GridPP-powered results presented at MoEDAL Collaboration Meeting

One of the major topics of the 6th MoEDAL Collaboration Meeting (CERN, Mon 12-Tue 13 Dec 2016) is the paper on new 13 TeV results from the MoEDAL Magnetic Monopole Trapper subdetector [1]. GridPP supported the simulation campaign that produced these results with a record turnaround and has been cited appropriately.

### Ganga release 6.3.0

There is a new version of Ganga, the Python-based UI of choice for GridPP/DIRAC interactions. The UserGuide has been updated accordingly as some of the configuration options have changed [2]. Thanks to Will F (Institute for Research in Schools/CERN@school) and Giuseppe C (EUCLID) for feedback.

[1] https://arxiv.org/abs/1611.06817

[2] https://github.com/gridpp/user-guides/issues/71

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
RJ noted that a service problem last week caused issues with STF file transfers for 8 hours; it was not patched at RAL. Going into the weekend there were issues at several sites, including RAL, caused by a large number of pilot jobs: a user submitted a task that brought things down and caused failures. Essentially this was a user-induced DoS problem.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
No report submitted.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-5 Production Manager’s report (JC)
-------------------------------------
1. The DNS record for planet.gridpp.ac.uk has moved to the new VM host.
2. As requested at the EGI OMB, we are putting forward one or two sites to support the (DPM/dCache) accounting pilot (run by John Gordon).
3. We have received notification of the November T2 R/A figures.
ALICE (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_ALICE_Nov2016.pdf)

– All okay

ATLAS (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_ATLAS_Nov2016.pdf)

– RHUL 89%:91%

– Glasgow 85%:85%

CMS (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_CMS_Nov2016.pdf)

– All okay

LHCb (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/WLCG_All_Sites_LHCB_Nov2016.pdf)

– Liverpool 33%:33%

– Glasgow 28%:28%

Explanations:

RHUL: There was a DPM database move scheduled during the month.

LHCb results were poor (for Liverpool and Glasgow) due to SRM tests which were false positives. This has been corrected and re-computations are being requested.

Glasgow (ATLAS): 1-2 November – Downtime due to Power failure. 7 November – DPM pool node disk042 caused issues with DPM headnode. 17-18 November – DPM pool nodes disk042/disk070 caused issues with DPM headnode freezing SRM.

4. The EGI Security Policy Group has produced a revised draft version of the top-level Security Policy bringing the document up to date in terms of terminology and with the current set of security policy documents. This is being reviewed.

5. A DUNE user submitted 100000 jobs last week, which revealed that the DIRAC instance at IC could only handle 41000. A problem was also seen running the DUNE work on CernVM, as the software tries to check the kernel version before starting. These issues are being followed up.

6. Preliminary results from the WLCG lightweight sites survey are available (https://indico.cern.ch/event/540424/contributions/2194899/subcontributions/212150/attachments/1381450/2100252/LW-sites-161201-v11.pdf), but the survey remains open and all sites are encouraged to complete it. (The UK response so far has been good.) The conclusion thus far is that an early area to target is increased use of shared repositories (e.g. for OpenStack, Docker and Puppet images).

For those wanting a more fine grained view, minutes from the last weekly ops meeting can be found here: https://indico.cern.ch/event/593440/attachments/1383735/2109502/OpsMinutes-06-12-2016.pdf.
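
The percentage pairs in item 3 above can be read against the usual availability and reliability formulas. The sketch below assumes the pairs are availability:reliability and uses the commonly quoted WLCG-style definitions (scheduled downtime is excluded from the reliability denominator); the function names and the example hours are hypothetical and do not reproduce any site's November figures.

```python
def availability(up_h: float, total_h: float, unknown_h: float = 0.0) -> float:
    """Uptime as a fraction of the period for which the status is known."""
    known_h = total_h - unknown_h
    if known_h <= 0:
        raise ValueError("no known time in the period")
    return up_h / known_h


def reliability(up_h: float, total_h: float,
                scheduled_down_h: float = 0.0, unknown_h: float = 0.0) -> float:
    """Uptime as a fraction of known time, excluding scheduled downtime."""
    expected_h = total_h - scheduled_down_h - unknown_h
    if expected_h <= 0:
        raise ValueError("no time outside scheduled downtime in the period")
    return up_h / expected_h


if __name__ == "__main__":
    # Hypothetical period: 100 h total, 80 h up, 10 h of the outage scheduled.
    print(f"availability: {availability(80, 100):.0%}")
    print(f"reliability:  {reliability(80, 100, scheduled_down_h=10):.0%}")
```

Under these definitions a scheduled intervention lowers availability but not reliability, which would be consistent with a site reporting a reliability figure above its availability figure.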

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
– We will be carrying out firmware updates on the RAID cards in the ClusterVision ’13 batch of disk servers on Wednesday.
– As reported before, the testing of Castor 2.1.15 is largely complete. Owing to staff availability this update will be carried out in the New Year, with the intention of completing it by the end of January.
– On Thursday the smaller LHCbUser disk pool was merged into the larger LHCbDst pool. A similar merger is planned for the ATLAS disk pools, which will be done at a time of their convenience.

Tape:
– Migration of LHCb data from ‘C’ to ‘D’ tapes ongoing. Approaching the 50% mark with just over 500 out of the 1000 tapes still to do.

Services:
– There was a problem with the ATLAS Frontier service on Wednesday (7th). We believe the load was caused by a particular ATLAS user. The services on the squid systems needed several restarts through the day and evening.
– There was a change to the FTS service this morning to work around a temporary certificate problem for ATLAS.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report.

SI-8 External Contexts (PC)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. (Update: DB and RJ both had notices from JES system confirming their applications have been approved. PG will discuss with each PI to establish figures and remind them imminent action must be taken). Ongoing.
613.1: AS will undertake a post mortem on CMS issues at Tier-1. (UPDATE: AS has pulled a lot of information together and will speak more to AL, he has good information from Chris Brew and will speak to Rob Appleyard about CASTOR). Done.
616.1: LC will secure venues and accommodation for GridPP38 in Sussex and advise Fab. Done.
616.2: AS will update the PMB on Tier-1 procurement by next week. Ongoing.
616.3: AS & GS to undertake a sanity check on Janet. Ongoing.
616.4: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.

ACTIONS AS OF 12.12.16
======================
610.1: AS/GS Produce suggestions for one or more metrics that will summarise the Tier-1 network availability/performance. Ongoing.
612.3: PG will determine which small sites can undertake procurement this FY. (Update: DB and RJ both had notices from JES system confirming their applications have been approved. PG will discuss with each PI to establish figures and remind them imminent action must be taken). Ongoing.
616.2: AS will update the PMB on Tier-1 procurement by next week. Ongoing.
616.3: AS & GS to undertake a sanity check on Janet. (UPDATE: Routing was checked to ensure Tier-1 traffic did not go out over Janet rather than the OPN. STF site transfers were also checked and will be summarised soon; most data is non-UK destined. The federated access and flow-level information still need to be looked at and understood at a VO level. A large test debug flow was picked up and eliminated). Ongoing.
616.4: DB and SL will discuss how best to progress replacement of TW’s role. Ongoing.
617.1: ALL to review and comment on the Tier-2 Evolution document this week to agree a final version next week.
617.2: DC will append a statement to the Tier-2 Evolution document on CMS requirements.
617.3: RJ will establish a priority order for resources to address issues arising.
617.4: JC will document what sites and periods CPU is idle and could be used elsewhere and will summarise in an email to the PMB.
617.5: PG will discuss with Ulrich requirements for GANGA going forward and report back to the PMB.
617.6: TC will discuss with Romain to consider submitting an abstract for CyberUK 2017.
617.7: SL will look into possible saturation at 10% level for non-LHC jobs and determine if more resources should be allocated.