GridPP PMB Meeting 600 (20/06/16)
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, David Colling, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Doyle, Andrew Sansum, Jeremy Coles.

1. Global Challenges Research Fund
PC and DB attended a meeting on the Global Challenges Research Fund (GCRF) on Tuesday. PC summarised briefly: STFC received a flat cash settlement, and extra money was put into a set of funds, including Global Challenges, which can only be accessed by research whose primary aim is to raise the standard of living of people in developing countries. Various aspirational aspects were raised at the meeting, and break-out groups discussed potential ways of accessing the funds, though not all were particularly relevant to STFC projects. One point of potential interest raised by DB and PC was establishing a programme of STEM training for people from developing countries, who could thereafter take the knowledge back to their own countries while contributing to the PPAN science programme. DB, PC and Craig Buttar (Glasgow) also suggested that this could include a PhD programme whereby, rather than STFC requesting information on projects to be funded, the onus would be on PhD applicants, with assistance from UK-based partners, to demonstrate how their project meets the criteria of Official Development Aid (ODA). They also submitted a suggestion for a Digital Skills Development Programme to train people from ODA-eligible countries in digital skills, including software, sensors, and data analytics applied to scientific problems within the context of the largest scientific collaborations in the world. This would develop transferable skills (e.g. Linux, C++, Python, data transport, data analysis techniques, code parallelisation, sensor development) that could be transferred back to national science programmes or local programmes in their home countries. Projects could then be established for people to work on in three strands (Researchers, Technicians and PhD Students), with some work on other projects, e.g. DIRAC or LHCb, so long as the focus remained on transferable skills.
Although this is an academic exercise, a large sum of money is targeted at this programme: initially through the Newton Fund, commencing this year, which requires match funding; then through the GCRF for STFC next year, at a level equal to the Newton Fund and increasing thereafter, with approximately £700M available over 5 years. Crucially, this may be the only opportunity for non-capital funding, so these opportunities should be explored, though applications will create additional workloads. There will be two more meetings hereafter, and DB urged PMB members to attend where possible and contribute to discussions.

2. SuperNEMO Support

PG summarised an email thread, sent to the PMB on 15 June, offering support to SuperNEMO. Various suggestions have been made, e.g. re-enabling the VOs, and the group were put in touch with Tom for access to the new user manual. Ben Morgan (Warwick University) is rather reluctant to use virtual machines; Tom advised on CVMFS, which was useful. A follow-up email came from Daniella, who runs DIRAC at Imperial, querying how motivated SuperNEMO are to use the grid and how far to take this. The team at Imperial are rather overloaded with existing DIRAC users and are reluctant to continue attempting to meet the needs of a group whose requirements do not align with what we can offer. In summary, our corporate message should be to be helpful in a flexible way wherever possible, but not to press on if a potential user does not wish to use our services; they could reconsider their computing requirements and perhaps submit an SoI for a computing infrastructure. Ben's opinion was that neither virtual machines nor CVMFS can be a hard requirement on their users, and he asked us to package the Grid and DIRAC tools in an alternative way. A suggested response was that we will take this suggestion on board and keep it under consideration, but there is insufficient capacity to reduce this to an app, and this is not a service that other VOs are currently requesting. It is possible that Ben does not have enough context or knowledge, and perhaps discussions should progress with an alternative contact to ensure any areas of miscommunication are addressed. Normally first contact with newly joining groups is made with a management representative who has wider context, and matters are ultimately passed on to technical staff who may not have access to wider information. We should seek to arrange a meeting with a contact at higher management level to discuss more fully.
Ben mentioned Julia Sedgebeer at Imperial and Sheryl Patrick at UCL as people who could become involved, and it may be helpful for DC to initiate informal discussions with Julia in this regard. A uniform approach should be developed for use with other groups: preliminary technical discussions will seek to ascertain initial requirements, after which it is appropriate to establish those requirements in more detail. We should establish, e.g., a management contact, a technical contact, and an indication of required resources, to make clear our desire to assist and to negotiate requirements.
There were earlier communications with LSST which prompted discussions on how to resolve any differences in expectations and agree ways forward. It was agreed that, to ensure consistency, we should expend as much effort on SuperNEMO as previously on UCLID.

ACTION 600.1: DC to contact Julia Sedgebeer at Imperial to informally discuss and address SuperNEMO’s computing needs, and to ask Daniella and Tom to await the outcome of these discussions before progressing further.

ACTION 600.2: DB/PC will consider whether to contact the head of SuperNEMO in the UK to discuss support requirements.

3. Update on Tier-1 tape access
A full update is contained in the Tier-1 Manager’s Report.

4. AOB
a) STFC are undertaking a consultation on a possible call for a CDT, which has been commented on by several people, and enquiries have been received from other GridPP sites as to whether GridPP intend to take a lead. This will not be possible, since running a CDT would require funding 4 PhD students per annum and being embedded in industry. However, GridPP could have some relevance if there were a distributed CDT linking together different institutes. There may also be possibilities relating to the GCRF: set up a CDT, then ask STFC whether additional PhD students from ODA-compliant countries could be funded from the GCRF, thereby helping to meet the requirement for institute funds. The possibility was raised of undertaking this under the banner of UKT0, which includes astronomers. Only schools are invited to respond, so we cannot have direct input, and UKT0 groups may have insufficient gravitas within their institutes to secure agreement for such a proposal. It was agreed GridPP cannot generate studentship funding or industry links but would be pleased to support any regional/institutional applications in this regard.
b) PC attended a DUNE planning meeting in Oxford on Friday where computing was raised, and he confirmed GridPP were happy to assist on any related matters. An attendee was identified as a technical contact, but previous contact has been made with DUNE, and DC will check with the DIRAC team (JC may have information in this regard) to ensure consistency with information previously supplied. These are important clients who will run experiments at CERN (ProtoDUNE).
c) PC noted receipt of a request from the LSST UK Board, arising from their formal discussions with the DOE, who are very interested in what is being done with GridPP and enquired to what extent UK computing support could figure in LSST plans over the next 5 years. This is extremely positive and we should consider it very carefully; PC and DB will draft some thoughts, as this is the first new non-HEP community that may look for a commitment. Included in this will be confirmation that we will provide our best efforts, but that a request for a serious commitment of resources will require them to speak to their LSST programme manager, Colin Vincent, to determine commitments. The response will outline what potential percentages and limitations may be available, and PC, RJ and DC confirmed that Imperial and Lancaster, who have a vested interest, and possibly Edinburgh, may be able to pledge resources already paid for outwith STFC-funded resources.
d) On Friday there was an EGI broadcast from the JET site announcing that decommissioning of the site has commenced.
e) Issues with the GridPP Accounting Matrix appear to have been resolved and the portal is operating properly, though there are some minor niggles.

ACTION 600.3: PC and DB will draft initial thoughts on how best to respond to the enquiry from the DOE regarding GridPP computing support and circulate to the PMB for comment.

5. Standing Items

SI-0 Bi-Weekly Report from Technical Group (DC)
DC noted a meeting is due to take place on Friday. There had been some discussion on storage and on production of the requirements document, which should soon be available. There were no significant updates on Rapid VCycle, and ATLAS and CMS had nothing of note to report. There had been some discussion with Brian Bockelman on storage and how to get this into the Cloud. Security did not have much to report. On GridPP DIRAC, a bug was found whereby a comment in a JDL could take down the DIRAC server; this is being looked at in detail. The meeting at RAL received much positive feedback, and DB thanked Catalin for passing on some very positive comments received relating to the meeting.

SI-1 Dissemination Report (SL)
## GridPP Dissemination Officer Notes for PMB

### CernVM Users Workshop news item

The Euclid talk on their use of CernVM-FS and their data centre structure may be of particular interest to the PMB –

### GridPP Institute press release/web page summaries

In preparation for the PRaVDA and LSST case study/news items, TW has been preparing press release material for news outlets and sister organisations’ press offices. This will include notes on individual sites that were directly involved with the case study in question. Attached (Appendix I) is the entry for QMUL; similar entries are being prepared for Manchester and Birmingham.

SI-2 ATLAS Weekly Review and Plans (RJ)
Two technical interchange meetings took place recently. At one, some American colleagues discussed an idea named ‘Harvester’, which seems very like an SRM; it is possible the proposed tasks can be undertaken on ARC. This will be kept under review, and there was nothing further of note to report operationally.

SI-3 CMS Weekly Review and Plans (DC)
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
No report submitted.

SI-5 Production Manager’s report (JC)
1) We have our Tier-2 availability/reliability figures for May.

ALICE – All okay.

ATLAS –
RHUL: 62%:62%
ECDF: 83%:83%
BHAM: 87%:87%

CMS – All okay.

LHCb –
RHUL: 64%:64%
ECDF: 87%:87%
RALPP: 89%:94%
The site issues encountered during the month are:

RHUL: A site network switch configuration broke several times, for days at a time; this is still being investigated. The compute nodes themselves had no problems.

ECDF: The results are primarily due to a cooling failure on Sunday 8th May. The site was put into unscheduled downtime whilst this was resolved.

BHAM: Awaiting response.

RALPP: The main issue was downtime for a dCache upgrade, which mainly dropped the availability for LHCb. They suffered slightly before that with load-related errors on the SRM, which the upgrade seems to have fixed.

2) The Security training and HEPSYSMAN event take place Tuesday-Thursday this week.

3) We are in the process of setting up a SoLid VO; currently this is driven by Imperial. DUNE has also been configured at a couple of sites and we are assisting with registering it with the EGI ops portal.

4) Pete G put a question to the PMB on 15th June: “Support effort for SuperNEMO too much?”. In the SuperNEMO case the solutions being proposed were not what the VO wanted to pursue, and in such cases the question is how much effort we (GridPP) should put in.

5) Apart from storage end-points and a couple of acknowledged edge-case services (like VO Nagios), GridPP sites have migrated from SL5. We are not expecting any EGI security team tickets in relation to the migration. SL mentioned Chris Walker’s matrix of sites against VOs, which he has now re-established on his webpage.

SI-6 Tier-1 Manager’s Report (GS)
Tier1 Tape Library:

Since the last meeting there has been (and still is) an ongoing problem with the library control software (ACSLS) crashing (or, more correctly, hanging). There are several points relating to this:
– At the moment we do not know the cause of the hangs.
– The frequency of the hangs has varied. On Tuesday 7th June we switched back to running the control software on the main system. This initially seemed to work well but then the rate of software hangs increased. Since then we have flipped back to the ‘spare’ system again and ran some hardware diagnostic tests on the main server – although nothing was found.
– As part of the investigations we have tried running in a reduced configuration. We appear to have stability if we only use the non-Tier1 library.
– We have carried out a couple of interventions at Oracle’s request: Network connection to tape libraries moved to a completely separated network; a complete restart of the Tier1 library to correct a problem with not all the logs being gathered.
– At the end of last week we brought up a new third system running a later version of the “ACSLS” control software. The ACSLS software has been hanging at a rate of around four times per day since then. These are fixed up by a re-starter. We had been seeing more than one type of problem – one of them related to errors when accessing the internal database in the system. These disappeared once this system came into use – as expected with a completely clean new database build.
– We have an open call with Oracle. We had a meeting with them last Monday (13th). We have another meeting with them at 2pm today.

DB suggested making a statement to Oracle at the forthcoming meeting to clarify that, although the operational impact is relatively small, GridPP regard this as a very serious matter: one of the fundamental elements of our infrastructure needs to be stable, and we urge them to take all possible steps to resolve the situation. This also impacts our reputation with existing and future non-HEP clients, as the breakage weakens our position as a destination of choice for tape back-up when alternatives are available elsewhere.

Operationally for the Tier1 the control software being restarted at this rate (four times/day) only has a limited effect on our operations as tapes already in drives continue to be read and written. We note that we have written some three-quarters of a Petabyte to tape in the last four weeks.

– Testing of Castor 2.1.15 has continued. One particular problem (which was traced to a database configuration issue) was fixed. A new problem, whereby GridFTP access via the SRMs doesn’t work, is being investigated. This, plus any other problems that arise, needs resolving before stress testing can resume.

Grid Services:
– The load balancer (a pair of systems running “HAProxy”) that was introduced in front of the “test” FTS3 used by ATLAS has been extended to the production FTS3 service.
– One of our three WMS systems (WMS06) is being decommissioned.

– The OPN has been heavily loaded (saturated inbound) at times over this last month. Outbound traffic over Janet has also been high.
See plots attached for both.

– Nothing particular to report. I believe I have already reported that one of the two batches of 2015 worker nodes (the XMA batch) is in service.

Status of Last Round of Capacity Purchase:
CPU: HPE: Vendor testing completed. Some network reconfiguration needed to enable our benchmarking and final tests to take place.
Disk: (XMA) All acceptance testing and benchmarking completed. Being prepared for installation into CEPH.

Tier1 Availabilities for May 2016:
Alice: 100%
Atlas: 100%
CMS: 100%
LHCb: 99%
OPS: 98% – but I will put in a request for a fix-up on this, as the main cause of the lack of availability affected many sites.

SI-7 LCG Management Board Report of Issues (DB)
The meeting will take place tomorrow.

SI-8 External Contexts (PG)
Main issues were covered. PC noted a JISC meeting on preservation, to which he was invited by Jeremy Yates, that he will attend on Friday. This could prove interesting, and people involved in the LHC context should be involved. It possibly relates to a JISC plan for a research data management service, a type of archive tape storage for which DB has plots of planned processes, so it is helpful to be represented.

REVIEW OF ACTIONS
595.9: JC to discuss instructions for documenting workflows required by new users of the Grid. Ongoing.
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Ongoing.
598.2: DB to reconsider h/w planning. Ongoing.
598.4: GS to provide PG with a report on Tier 1 for first quarter of the year. Done.
599.1: SL will update h/w survey spreadsheet and circulate to PMB.

ACTIONS AS OF 20/06/16
595.9: JC to discuss instructions for documenting workflows required by new users of the Grid. Ongoing.
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Ongoing.
598.2: DB to reconsider h/w planning. Ongoing.
599.1: SL will update h/w survey spreadsheet and circulate to PMB.
600.1: DC to contact Julia Sedgebeer at Imperial to informally discuss and address SuperNEMO’s computing needs, and to ask Daniella and Tom to await the outcome of these discussions before progressing further.
600.2: DB/PC will consider whether to contact the head of SuperNEMO in the UK to discuss support requirements.
600.3: PC and DB will draft initial thoughts on how best to respond to the enquiry from the DOE regarding GridPP computing support and circulate to the PMB for comment.