GridPP PMB Meeting 601

GridPP PMB Meeting 601 (04/07/16)
=================================
Present: Dave Britton (Chair), Pete Clarke, David Colling, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Cass, Jeremy Coles, Dave Kelsey, Andrew McNab, Roger Jones.

1. ECHO Status report – Alastair Dewhurst
==========================================
Slides attached to agenda. The Tier-1 needs to provide 5PTB of usable storage on Echo by next April, available in Castor and deployed into ATLAS and CMS. To ensure the deadline is met, this needs to be operational by October, leaving time to build up operational experience and debug any issues that may arise from running with larger loads. At the F2F in April the Indico link was provided and it was agreed that an update would be provided by early July. Three plans are being considered:
Plan A – Ceph with GridFTP + XrootD plugins (plugins on top so a thin layer with access, this is simple and scalable);
Plan B – if this does not work, we will use CephFS with GridFTP servers (this has more off-the-shelf components than Plan A but is not as scalable); and
Plan C – Software RAID on new disk servers in Castor (but this does not allow us to move away from Castor and is not as scalable as Plan A).

Slide 3 – the new cluster is up with the 2015 procurement: 60 XMA storage nodes with 260 TB of raw storage. It uses the Jewel version of Ceph, which will most likely be the version we run from April 2017. To do: recreate the pools (command lines and benchmarking); set up the various functional tests and monitoring that are not yet in place on the new cluster; set up gateway machines to allow c. 10 GBs external transfer into the cluster so that we can saturate links (machines are available but need to be properly connected to the network; this is not an immediate issue as yet); and build operational procedures and documents.

Slide 4 – Front end: the XrootD plugin is working, no modifications to the source code were necessary and it has authorisation using Gridmap file + AuthDB. CERN hope to use it for exporting data and this appears to work well. Minor fixable bugs are being addressed; none block critical functionality. GridFTP plugin authorisation has been written by Ian Johnson using the same model as XrootD. Work continues on code to address performance issues with FTS transfers.
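
As an illustration only, the "Gridmap file + AuthDB" model can be thought of as two lookups: a certificate DN is mapped to a local account, and that account is then checked against per-path privileges. The sketch below assumes the common grid-mapfile layout (a quoted DN followed by an account name) and uses a simplified, hypothetical rule table in place of a real AuthDB file. The DNs, account names, paths and the `is_allowed` helper are invented for the example and do not reflect the actual Echo configuration or plugin code.

```python
import shlex


def parse_gridmap(text):
    """Map certificate DNs to local account names.

    Assumes one entry per line: a quoted DN followed by an account name.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        dn, account = shlex.split(line)
        mapping[dn] = account
    return mapping


def is_allowed(dn, path, privilege, gridmap, rules):
    """Return True if the DN maps to an account holding `privilege` on `path`."""
    account = gridmap.get(dn)
    if account is None:
        return False  # unknown DN: no mapping, no access
    for prefix, privs in rules.get(account, []):
        inside = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if inside and privilege in privs:
            return True
    return False


# Hypothetical entries, for illustration only.
GRIDMAP = '''
# DN -> local account
"/C=UK/O=eScience/CN=alice example" atlas001
"/C=UK/O=eScience/CN=bob example" cms001
'''

# Simplified stand-in for an AuthDB: account -> [(path prefix, privileges)].
RULES = {
    "atlas001": [("/atlas", "rw")],
    "cms001": [("/cms", "r")],
}

gridmap = parse_gridmap(GRIDMAP)
```

With these sample entries, `is_allowed("/C=UK/O=eScience/CN=alice example", "/atlas/data/file1", "w", gridmap, RULES)` grants access, while a write under `/cms` by the second DN, or any request from an unmapped DN, is refused.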

Slide 5 – Not all targets for early July have been met, but significant progress has been made on critical issues since April (XrootD and GridFTP plugins). We will progress with Plan A and test 2 x 2015 storage nodes in Castor with software RAID. ATLAS and CMS should be functioning by the F2F in Ambleside (30 August 2016). We also plan to write ATLAS logs to Echo (c. 20-30% of total logs into Castor). Regular project management meetings take place for the Tier-1 with milestones and targets; these will be sent to the PMB in advance of the F2F to allow longer term plans to be considered.
DB stated it is encouraging to see the progress since April and there is a better understanding of the scope of the work, but recognised there remains work to be done. DB questioned whether there is sufficient manpower working on this (benchmarking of pools was undertaken with a previous release of Ceph). AD agreed more manpower would be useful; if the cluster had been operating a week earlier there would have been more information available, and a summer student will assist. The focus so far has been on critical work; longer term, less critical work will benefit from the additional manpower of graduate trainees or summer students. DB also enquired about the engagement of ATLAS and CMS. He noted other issues may arise when Echo begins to be used by wider sites and enquired what tests are being done to ensure others can use the interface. AD confirmed that meetings at CERN over the last year have enabled him to speak with developers to demonstrate and test that the system works. CMS workflows and transfers have also been pushed through Echo successfully. This will be more fully tested when larger tests are run.

Items in red indicate delays caused by manpower arriving later than planned. The Ceph cluster backing the Cloud VMs ran into an issue in late April/early May that diverted manpower (at least 1 FTE for c. 2 weeks), which affected progress. However, this did provide the benefit of operational experience in dealing with Ceph issues.

2. EOSC Proposal
==================
PC suggested there is little to add to the discussions from the previous PMB meeting. In summary, two proposals were submitted and one is going forward; decisions are awaited. It is possible that EGI may submit a bid to EGI++ and we should be alert to that and play a role.

3. Draft Policy on supporting new VOs
=====================================
PC noted this has now been sent to the PMB who were invited to make suggested changes or amendments as necessary. PC will send to Tom Whytie for information and input.

ACTION 601.1: ALL to look at the draft policy document supporting new VOs and feed back comments to PC by the end of this week.

4. Confirmation of Tier-1 Review Date
=====================================
The PMB confirmed that the Tier-1 review date is scheduled for 25 October 2016.

5. Theme and Agenda for Ambleside
==================================
It was agreed that the agenda for GridPP37 in August requires to be developed collectively by the PMB and suggestions were invited. DB will send out a call on the UKHEPGRID list. PC will also raise this at the Ops meeting.

ACTION 601.2: DB will send out a call to the UKHEPGRID list inviting suggestions for themes, sessions and presentations for GridPP37.

ACTION 601.3: PC will invite suggestions for themes, sessions and presentations for GridPP37 at the Ops meeting.

6. Tier-2 Accounting Period
============================
It was confirmed that the Tier-2 accounting period commenced on 1 July 2016 and will continue until such time as the next h/w monies need to be disbursed. No h/w will be purchased this financial year; procurement will be undertaken next year. This should be announced around July 2017, once grants are agreed. Sites should be formally advised and PC will mention this at the Ops meeting. However, at the moment, CPU numbers are zero as nothing has come out of the accounting portal for July as yet; this is probably merely a delay. Decisions should be driven by the experiments, and DC and RJ should be involved in discussions. Clarity is required on whether funding will only be available for CPU at CPU sites. Changes in the style of sites may affect the meeting of pledges for storage or manpower, and these issues may need to be addressed. This can be discussed at the F2F in August. Decisions also need to be taken on how ATLAS and CMS prefer resources to be deployed.

ACTION 601.4: PC will mention commencement of accounting period from 1 July 2016 at the Ops meeting.

ACTION 601.5: AS will check why the portal does not yet include July figures.

7. Ganga Support
=================
The PMB has had various discussions on how to continue to support Ganga in GridPP5 but has not identified anything definite. PC noted RCUK had funded the AAAI project (from which we get a little effort) but this would not help with Ganga. However, perhaps we could put in a similar bid on something like an OpenStack cloud. Perhaps DC, AM, Ian Collier and non-HEP people should be emailed to help develop a bid to STFC to develop elements on top of OpenStack for other communities to benefit from using the cloud. UKTO could put something together to submit as an extra work package for an FTE. DC is pushing for Ganga and may be involved in writing this up, perhaps as a 50/50 split with DIRAC. More people are using DIRAC than Ganga right now, but Ganga user numbers have recently increased. DC will provide figures for consideration. PC suggests initiating a PRD for R&D that would not normally be project funded or associated with a particular project. DC should involve Mark Holloman, EUCLID. PC will check guidelines to determine parameters and draft an initial outline to circulate to the PMB next week. There was some discussion on whether GridPP or RAL received any FTE from the AAAI project; AS will check with Jens on the outcome.

ACTION 601.6: PC to check guidelines to submit PRD to STFC to develop elements on top of OpenStack to allow other communities to benefit from the cloud.

8. Q116 report summary
=======================
PG circulated the report in advance of the meeting. In summary:

One or two metrics dipped slightly below target due to Castor and network availability on OPN. Two members of staff left the Tier-1 and the Tier-1 manager role at RAL is currently filled only at 20% FTE. Networking – the project to replace the old UKLight router was delayed due to staff departure and bandwidth limitations on links to JANET. Work on IPv6 networking has been slow/delayed. Several strands may need to be picked up at GridPP37 but IPv6 may need resolution before then.

ATLAS was mainly performing well, some low level performance issues; CMS is good. LHCb was good but experienced a slight drop. The success rate figure was pulled a little low by CVMFS/squid issues which are now resolved.

Other experiments are doing very well (90% efficiency), though ALICE is at 85%. ALICE was using a great deal of CPU (15%) at the Tier-1. From the Deployment & Operations report, the fraction (of resources available) used was lower than usual, partly due to air conditioning issues at Scotgrid and Lancaster. There was a lot of work ongoing at the sites associated with h/w delivered this quarter to ensure grants were charged before the quarter end in March. Glasgow lost some capacity due to heat damage.

VOs – LSST has started production running at some UK sites. Tom Whytie produced a new proposal draft (MOU for small groups at a low level of new users interacting), renamed to "expectations by users"; it is not clear whether this will be superseded or augmented by other documents. AS will read through Tom Whytie’s document, determine whether it is complementary and then send it to PC.
SL5 systems were decommissioned at several sites and there were a number of software upgrades, thus lots of development work at those sites during the quarter.

Good data transfer rates (~50 MB per second) had been achieved between DiRAC and RAL.
Security – there was one incident at Imperial, but it was resolved. The milestone (C3.3) to run the security challenge was delayed but is now underway.
NGI report – There was a new release of APEL. The question of whether the NGI work should now be reported in the Tier-1 report was raised now that CD has left the PMB.
Planning and execution: Oversight committee documents were all submitted on time. Now that we have moved into GridPP5 all reports should be revised and tweaked where necessary for the first quarter of GridPP5.

DB suggested a session on presenting project management for GridPP5 should be included at GridPP37. PG will circulate updated milestones etc in advance of GridPP37 where this can be discussed at the F2F.

ACTION 601.7: AS will read through Tom Whytie’s MOU document and pass to PC.

9. AOCB
=======

PPAP Meeting in Birmingham – Claire wrote to PC requesting he give a presentation on new technologies and R&D in computing in the UK. He responded this is not something that would be relevant, though he could speak on politics and issues surrounding technologies.

10. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No report submitted.

SI-1 Dissemination Report (SL)
------------------------------
## GridPP Engagement Officer Notes for PMB

### Website articles and updates

The following articles have been posted on the GridPP website:

* LSST case study [E1]
* PRaVDA case study [E2]
* Examples and Case Studies – now with DB’s “GridPP Impact Matrix” [E3]
* Services: GridPP DIRAC [E4]
* Services: Ganga [E5]

As ever, feedback, comments and suggestions welcome.

### GridPP Institutes review

Following various requests, TW would like to update the GridPP Institutes page, including the map [E6]. If you spot anything that needs updating please could you email TW specifying the required changes. For example, Warwick needs removing from the map.

### Links and references

* [E1] https://www.gridpp.ac.uk/users/case-studies/lsst/
* [E2] https://www.gridpp.ac.uk/users/case-studies/pravda/
* [E3] https://www.gridpp.ac.uk/users/case-studies/
* [E4] https://www.gridpp.ac.uk/services/gridppdirac/
* [E5] https://www.gridpp.ac.uk/services/ganga/
* [E6] https://www.gridpp.ac.uk/about/collaborating-institutes/

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
No report submitted.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
No report submitted.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing of significance to report.

SI-5 Production Manager’s report (JC)
-------------------------------------
No report submitted.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Tape Library:

We have had stability on the tape library control software since Wednesday 22nd June.
– Achieved by running with two out of the four layers of tape drives in use in the Tier1 library. The other (“facilities”) library is fully in service. This has given us nine of the fifteen ‘D’ drives in use. This has not caused any throughput problem as yet. It does include two of the ‘D’ drives being used for the migration of the Atlas data from the ‘C’ to the ‘D’ drives.
– We have been able to re-introduce the crashes by adding the additional layers of tape servers back in. We have tried provoking the crashes during the day, returning to stability out of hours. This has enabled us to send more information to Oracle as well as start to look for the pattern of what causes the crashes.
– At the moment we are awaiting input from Oracle. We will also decide on our next set of tests.

Castor:
– Testing of Castor 2.1.15 has continued. We are in contact with CERN regarding a problem with the interface of Castor to GridFTP.
– We had a couple of disk server failures for Atlas shortly before the last meeting. Both servers were brought back into production OK. However, in one of them the problems with the RAID array led to all the disks being removed and placed in another chassis before we could recover the array.
– We have seen a few moments when the Atlas stager database has been very heavily loaded. A symptom of this is that the application of the database updates to the ‘standby’ database system falls behind.
– There was a problem with the Atlas Castor instance on Tuesday afternoon (28th). This was fixed by a restart of various processes. We believe a process count was exceeded.

Network:
– Although the OPN has been heavily used in this last fortnight it has not shown the saturation of the previous couple of weeks.

Grid Services:
– One of our three WMS systems (WMS06) has been decommissioned.
– We did have a problem on one of the other WMS systems (lcgwms04) when the disk space filled up owing to a lot of largish user sandboxes. The user responded quickly and positively when contacted and we were able to get the problem fixed.

Status of Last Round of Capacity Purchase:
CPU: (HPE) Work ongoing to get the OS installed.
Disk: (XMA) Installation into Ceph ongoing (now complete).

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
No report submitted.

SI-8 External Contexts (PG)
---------------------------
Nothing to report.

REVIEW OF ACTIONS
=================
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Done.
598.2: DB to reconsider h/w planning. Done.

598.4: GS to provide PG with a report on Tier 1 for first quarter of the year. Done.
599.1: SL will update h/w survey spreadsheet and circulate to PMB. (Update: PG spoke to Ben Morgan and acquired more information regarding Supernemo requirements. Ben will attend PPAN in Birmingham and PG will forward the email to DB to progress if necessary). Ongoing.

ACTIONS AS OF 04/07/16
======================
599.1: SL will update h/w survey spreadsheet and circulate to PMB. (Update: PG spoke to Ben Morgan and acquired more information regarding Supernemo requirements. Ben will attend PPAN in Birmingham and PG will forward the email to DB to progress if necessary). Ongoing.
601.1: ALL to look at the draft policy document supporting new VOs and feed back comments to PC by the end of this week.
601.2: DB will send out a call to the UKHEPGRID list inviting suggestions for themes, sessions and presentations for GridPP37.
601.3: PC will invite suggestions for themes, sessions and presentations for GridPP37 at the Ops meeting.
601.4: PC will mention commencement of accounting period from 1 July 2016 at the Ops meeting.
601.5: AS will check why the portal does not yet include July figures.
601.6: PC to check guidelines to submit PRD to STFC to develop elements on top of OpenStack to allow other communities to benefit from the cloud.
601.7: AS will read through Tom Whytie’s MOU document and pass to PC.