GridPP PMB Meeting 638

GridPP PMB Meeting 638 (26.06.17)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Cass, David Colling, Tony Doyle, Andrew McNab.

1. Network Forward Look
=======================
PC circulated a draft, prompted by a recent Janet meeting attended by DC, which requires inclusion of experiment statements on network requirements for the Tier-2s and the Tier-1, to address what the experiments require of them. PC will contribute further text. There was discussion of the current text for Atlas and CMS, which may imply that Tier-2s require higher bandwidth, although not all of this is required within the 2-year timescale. The blue text will be assessed and comments invited from Atlas and CMS; a summary of Tier-1 and Tier-2 requirements over the next 2 years and the next 5 years will be inserted. RJ circulated an additional paragraph on nucleus sites and Tier-2 sites that should be included. It is suggested that c.4 UK sites require higher bandwidth (associated with higher storage levels). There was some discussion of various institutional standards/requirements and of costs and pricing models that may change in due course (e.g. RAL may pay more per GB than universities). PC will modify the text, noting 4 large Atlas sites which in 2 years' time will naturally scale to 20 and, as the model evolves, increase further. The document should show how the likely evolution of the Atlas model will reduce the requirements at different institutions and concentrate them on a small number of sites.
There is no new text for the CMS statement as yet; PC will request input from DC and Duncan on the CMS element.
The LHCb statement has not changed since last time and does not require input.
The Tier-1 statement appears acceptable (AS will confirm the numbers to PC), taking account of universities paying on a different scale from RAL due to subscriptions with JISC.
ACTION 638.1: PC will update the text on the Network Forward Look document for the forthcoming 2 years.

2. Request for resources from LSST
==================================
This relates to a request from Alessandra: Alice are about to begin a task and asked about resources that may be available from GridPP. The commitment originally given was relatively small and should be met; Alessandra is currently investigating the requirements more accurately. LSST may want to run some workloads, and some sites may be asked whether they wish to contribute for a short period (timescale to be determined). Imperial, Lancaster, Edinburgh and Glasgow are willing to contribute, and it would be useful for RAL to run some where possible. Ideally, we should provide a large burst for a short time, rather than a sustained amount over a longer timescale.

3. Request for old equipment from Emyr
======================================
PG received and circulated an email from Emyr last week requesting donation of old equipment to the African Data Centre for Bioinformatics and Medical Research: 1GB/core, at least 4 cores, AMD or Intel, ideally not more than about 7 years old, and at least 500GB disk. AS had noted the constraints on this within STFC, which may also have relevance to universities. AS gave an update: he has looked through the policy and spoken to Robin Blowfield (security officer), and it seems this is more a business-case issue than a matter of appropriate-practice policy. Thus, if a business case can be put together, the donation may be processed as an exemption on the grounds that it is a valuable contribution to make. AS can check when the next batch of kit will become obsolete. If it can be done without being disruptive or impacting manpower availability we should attempt to provide this, though it is recognised that at the experiments kit is often used until it is no longer of use, so it may not be useful in practical terms. Global challenge staff would be keen to forge connections and this may lead to opportunities in the future. The PMB agreed to explore this within the legal and manpower constraints.
ACTION 638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower implications of donation to the African Data Centre for Bioinformatics and Medical Research.

4. AOCB
=======
a) PG mentioned a thread on industrial impact: there is a case study on spinouts from RCUK projects. We have been asked about Techcube (this did not proceed at the last moment due to challenges with the proposed metrics) and about Lockspace technologies (began at QMUL, then technical people at Imperial, but this seems to have faded away). DC was not in attendance to provide updates. It was noted that our technologies are geared towards larger issues than most SMEs experience.
ACTION 638.3: SL and DC will prepare a statement relating to Lockspace technologies.

5. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
Not in attendance, no report submitted.

SI-1 Dissemination Report (SL)
------------------------------
Nothing of significance to report.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Nothing of significance to report. WLCG meeting took place last week.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
Not in attendance, no report submitted.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing of significance to report.

SI-5 Production Manager’s report (JC)
-------------------------------------
Brunel are using GridPP VO resources, mostly for generic searches, and JC is gathering more information (they are reviewing their ‘big data’ initiatives and have appointed a new lecturer in this area). AS is extracting data from the accounting for UKT0; a verbal review would be possible and site admins can check the system.
Ian Neilson is standing down as security officer from the end of July and the intention is to start a new recruitment very soon. We are now capable of taking over this role.
May’s visibility figures have now been released: Atlas okay, CMS okay, LHCb the same as Atlas (87%). There was a disk server down in Birmingham and a DCal update issue.
There is an EGI workshop, and one topic is the new storage plan discussed at the WLCG workshop last week.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
-------
– There were severe problems with the Atlas SRMs at the end of last week. On Thursday afternoon one of the SRM back-end daemon processes started crashing on each of the Atlas SRMs. A greatly increased number of SRM requests was also seen. Work went on through the remainder of Thursday and Friday but failed to resolve the problem. On Sunday a correction was applied to the Atlas SRMs to filter out double-slashes (“//”) in the incoming requests. This reinstated a fix that had been applied to the old SRMs back in 2014. Since then the Atlas SRMs have worked OK. Work is ongoing to confirm that this really is the solution before applying the fix to the SRMs for the other Castor instances. The high SRM request rate seen is possibly (probably) the response of the Atlas software as it tried to query the status of files and transfers during the problem. Atlas Castor was declared down in the GOC DB from Friday afternoon to Sunday morning, when the fix was applied.
– There were problems with the SRMs for GEN over the weekend of 10/11 June that were not understood at the time.
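
The double-slash filtering mentioned in the first Castor item can be illustrated with a minimal sketch. This is not the fix applied to the SRMs, whose implementation is not described in these minutes. The function name and the sample path are hypothetical, and the sketch assumes it is given only the path component of a request (a full URL such as "srm://host/..." would need its scheme separator preserved).

```python
import re


def collapse_double_slashes(path: str) -> str:
    """Collapse any run of two or more '/' characters into a single '/'.

    Hypothetical helper for illustration only; it expects a bare path,
    not a full URL with a scheme.
    """
    return re.sub(r"/{2,}", "/", path)


if __name__ == "__main__":
    # Hypothetical request path containing repeated slashes.
    print(collapse_double_slashes("/castor/example//atlas///file.root"))
```

Normalising the path before it reaches the back end means every request for the same file takes the same form, however the client built the string.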

Echo:
-----
Problems with the Echo gateways were seen a week or so ago. These coincided with a large increase in requests. In response the Echo XRootD gateways were stopped for a few days and a concurrent connection limit was applied to the GridFTP gateways. High memory usage was observed and steps are being taken to rectify this. The load appeared to come mainly from our batch farm. The addition of XRootD gateways on each of the worker nodes will reduce the load on the central Echo XRootD gateways, and a start has been made on setting these up.

Progress of Echo deployment:
----------------------------
ATLAS are using 2PB out of the 3.1PB allocated to them in Echo this year. Their quota will be increased (and Castor decreased) as more storage nodes are added to Echo. All workflows are running on Echo.
CMS are using 0PB out of the 2.5PB allocated to them in Echo this year. We are testing the performance of AAA access to Echo at the moment. A few workflows that store transient data in Echo are being run.
LHCb are using 0PB out of the 1.5PB allocated to them in Echo this year. A plan was agreed with Chris Haen during the WLCG workshop last week. No testing so far.
ALICE are using 0PB out of the 0PB allocated to them in Echo this year. The plan is to start testing in September 2017.

Networking:
-----------
– We are tracking the ongoing problem with the site firewall that affects data flows.
– Implementation of the third 10Gbit link for the OPN to CERN is now scheduled for Wednesday this week (28th June). We had planned to do this on the 14th June but there were errors seen when testing the new link. These have since been resolved and the change should go ahead on Wednesday.

Hardware:
---------
– We are seeing a high rate of reported disk problems on one (the OCF ’14) batch of disk servers. In some of the cases the vendor finds no fault in the drives that have been removed. We plan to update the RAID card firmware in these systems following testing of the latest version.
– For the last purchase of capacity hardware:
  – The disk servers are around two weeks into their acceptance testing (another week or two to go). So far so good.
  – The CPU has all been cabled up. It is hoped to start the acceptance testing this week.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
No meeting.

SI-8 External Contexts (PC)
---------------------------
Attempts are being made to draw up a document on what UKT0 has undertaken so far.

REVIEW OF ACTIONS
=================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
633.1: AS will put together a proposal for HAG. Done.

ACTIONS AS OF 26.06.17
======================
630.2: DB and PG will continue to work on metrics and funding strategies at the macro level. Ongoing.
630.3: DB will tweak his metrics and funding model based on CPU. Ongoing.
638.1: PC will update the text on the Network Forward Look document for the forthcoming 2 years.
638.2: AS will check when equipment is due to become obsolete and investigate the legal and manpower implications of donation to the African Data Centre for Bioinformatics and Medical Research.
638.3: SL and DC will prepare a statement relating to Lockspace technologies.

Next meeting 3 July – PC may chair if DB is unable to attend remotely.