GridPP PMB Meeting 597 (16.05.16)
=================================
Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Tony Doyle, Tony Cass.

1. RCUK/BIS meeting
===================
PC circulated slides from the RCUK-BIS meeting. There have so far been two meetings at which Swindon representatives met with RCUK and BIS (in January and on May 6th). In the short term no additional funds appear to be available; other routes to funding may be announced by the Government in the Autumn, but this is not certain. A high-level proposal has to be developed by July, which may require information from PC to pass to Charlotte, so people may have to give input at short notice. Finally, the AAII project with DiRAC and GridPP is off the ground, so that may attract small amounts of funding to cover costs.

2. Tier-1 cooling incident
==========================
AS has been following up and provided a preliminary outline of the situation, which is complex and still under investigation. At around 4.30pm on Monday four chillers shut down, leading to a 1°C per minute temperature rise in the machine rooms. There was some mitigation, as independent units cover the UPS and robot rooms in case of main power failures (though only partially, at c. 150 kW of cooling). Fortunately the incident occurred during working hours and building services staff managed a manual over-ride to restart the system. As part of the emergency management some jobs were dropped to keep the heat down, but we were realistically around 5-10 minutes away from a forced shut-down, which would have been the first since 2010. The working theory is that a broadcast storm originating in the new Space building (R100 – a separately managed network) caused difficulty for a sub-unit of the building management system controlling the aircon in the old ATLAS building, which generated various messages and flooded the primary unit. It is not clear why the primary unit shut the aircon down, but there remains a known deficiency in that it does not automatically restart and requires manual intervention. A full post-mortem review is still awaited and the incident can be fully considered once that is complete.

This has been a concern previously for other parts of RAL, i.e. that the BMS systems are on the shared infrastructure (though not accessible to the public). However, the network has had other issues previously which have not caused such a reaction. This could have been catastrophic had it occurred outside of working hours. AS will look out and circulate the paperwork relating to the previous incident in 2010 to determine whether any similarities can be detected. DB reiterated the importance of quickly establishing the cause of the incident and concern over the need for a manual shut-down, as it could impact our ability to offer a robust service if this recurred. GS’s Tier-1 report provides more information on these incidents and notes that not all machines are configured to monitor their internal air temperature and shut themselves down if a certain threshold is reached. Entirely coincidentally, GS has recently instigated a review of procedures for such situations. An internal problem has also been experienced on one of the Cloud systems, which led to a run-through of procedures.

ACTION 597.1: AS will look out and circulate paperwork for the Tier-1 cooling incident experienced in 2010 to determine any similarities with the recent incident.

3. Tier-1 Tape robot problems
=============================
GS circulated some details, which have been incorporated within the Tier-1 report for clarity. It seems likely that a replacement card was responsible for the situation, as changing back to the original card provided a resolution, but GS is still investigating more fully. The two tape libraries are separate but also linked, and recently some problems were experienced, e.g. the Tier-1 library fluctuated while the Oracle engineer was working. There was also some discussion with an Oracle manager last Thursday, before the library came back in the early evening, to ensure sufficient resources were being invested; it was helpful to know that the engineer had remote support. Such an issue is unlikely to recur and, if it does, it should be quickly resolved.

4. Support for Solid VO and wider discussion of the scope of experiments supported by GridPP
============================================================================================
DB summarised: SCD was approached by Dave Newbold about additional tape storage over the next few years, and an approach was made through GridPP enquiring whether Solid would be supported. Dave Newbold thought it might be rejected in the absence of a fee. Clearly GridPP wants to treat all groups equally, but it is probably time that we documented our policy. PC has begun writing a draft policy to circulate for discussion. He also raised with three programme managers in Swindon (Tony, Colin and Charlotte) that we are having conversations with other activities, and that it is perhaps sensible to determine at the outset what computing would be required and to record this in the system, to highlight that this issue increasingly requires consideration. DB circulated a preliminary text explaining GridPP’s position, suggesting we help everyone but that there may be limits which, when reached, will attract some costs. In the GridPP5 proposal we proposed to support the PPAN projects, but decisions are necessary on how best to manage this over the longer term. It was suggested that if projects go beyond a threshold they should seek funding, and that the threshold should be quite low in recognition that it may take some time to capture funds. Any change would apply only to new projects, not retrospectively.

ACTION 597.2: DB will monitor usage via the new accounting portal, to which he will be given access.

ACTION 597.3: PC will work up text for a policy on supporting non-LHC VOs and will circulate it for discussion this week.

5. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
The agenda for the meeting was provided by DC: https://indico.cern.ch/event/532300/

SI-1 Dissemination Report (SL)
——————————
## GridPP Dissemination Officer Notes for PMB

### New user case studies and the “GridPP Impact Matrix”

Both EUCLID and LSST have recently provided us with feedback on their use of GridPP resources and tool sets. DB has suggested the creation of a “GridPP Impact Matrix” that lists such case studies against the resources and tools used, to provide an “at-a-glance” reference for GridPP impact on the website. TW will implement this on the new website with the following case studies (a sketch of a possible layout follows the list):

* CERN@school – LUCID (full workflow with DIRAC)
* GalDyn (full workflow with DIRAC)
* PRaVDA (full workflow with DIRAC)
* LSST (full workflow with Ganga + DIRAC)
* EUCLID (full workflow at the RAL T1)
* DiRAC (data storage/transfer)
* SNO+ (data transfer/networking)
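
A minimal sketch of how such a matrix might be laid out, based only on the case studies above; the columns and tick marks illustrate the idea rather than the final content, which is for TW and DB to decide:

| Case study | DIRAC | Ganga | RAL T1 | Data storage/transfer/networking |
| --- | --- | --- | --- | --- |
| CERN@school – LUCID | ✓ | | | |
| GalDyn | ✓ | | | |
| PRaVDA | ✓ | | | |
| LSST | ✓ | ✓ | | |
| EUCLID | | | ✓ | |
| DiRAC | | | | ✓ |
| SNO+ | | | | ✓ |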

### Local cluster submission with Ganga – success

TW can confirm configuring Ganga to submit to local Condor batch systems (without a grid certificate) works as described. Thanks to Mark Slater, Robert Currie and the Ganga team, Andrew Lahiff (RAL T1 cluster) and Chris Brew (RALPP cluster) for help with testing this.

This is useful to know as it means new users can start porting their workflow(s) to a distributed system (i.e. Condor batch running) immediately after being given an account on a local cluster that supports Ganga (e.g. via CVMFS) – i.e. with no need for a grid certificate or VO configuration. Once they are running “Ganga-style” and in need of extra resources, they should have the motivation to get a grid certificate, join the gridpp incubator VO, etc.
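
As a concrete illustration, here is a minimal sketch of what such a local submission might look like from the Ganga Python prompt on a cluster where Ganga is available; the job name, executable and arguments are placeholders, and it is assumed that the site’s local HTCondor pool is reachable through Ganga’s standard Condor backend:

```python
# Minimal sketch: submit a trivial job to the local HTCondor pool via Ganga.
# Run inside the `ganga` prompt, or save to a file and run `ganga myjob.py`.
# No grid certificate or VO membership is needed for a local backend.

j = Job(name='local-condor-test')               # new Ganga job (placeholder name)
j.application = Executable(exe='/bin/echo',     # trivial payload standing in for a real workflow
                           args=['Hello from the local cluster'])
j.backend = Condor()                            # target the site's local HTCondor batch system
j.submit()                                      # check progress later with jobs(j.id).status
```

Once a workflow runs in this “Ganga-style”, moving to grid resources should in principle be a matter of switching backend (e.g. to a DIRAC backend) plus the certificate and VO steps mentioned above.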

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing of significance to report.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
LHCb is going to start using multi-threaded Gaudi: users will submit multi-process jobs at Manchester first, and this will be rolled out to other sites after testing.

SI-5 Production Manager’s report (JC)
————————————-
1. There was a GDB last week http://indico.cern.ch/event/394782/. Two interesting topics arose. One was the GDB Steering Group (one representative per experiment plus one or two representing sites, to drive in-depth technical discussions); the other concerned updates on the theme of “Lightweight sites – ongoing activities, plans and ideas”. GridPP had several contributions, but additionally we heard about the working group goal – “One of the goals of WLCG Operations Coordination activities is to help simplify what the majority of the sites, i.e. the smaller ones, need to do to be able to contribute resources in a useful manner, i.e. with large benefits compared to efforts invested.” – and some of the ideas being pursued in other WLCG countries. JINR presented on routes including OpenStack; OSG gave an overview of their “Tier-3 in a box” – sending out $10K of configured hardware to 5 sites with central management. There were only a few Tier-2 representatives at the meeting, and we could suggest someone from the UK to volunteer for this. David Crooks was agreed as the appropriate representative.

2. The DUNE (Deep Underground Neutrino Experiment) collaboration have users at several GridPP sites who are interested in using our resources. They are already registered in a FNAL VOMS server and have a CVMFS area, so we have agreed to set them up for further exploration of approaches, first at Sheffield and then at Liverpool.

SI-6 Tier-1 Manager’s Report (GS)
———————————
The main points from the last week are two significant problems.

1. On Monday (9th May) there was a problem with the cooling in the machine room. At around 16:30 the air-conditioning failed owing to a problem with the Building Management System (BMS) which in turn was caused by a network problem elsewhere on site. This resulted in all the Chillers and Pumps going off. We paused batch work and stopped new batch jobs starting. Other users of the machine room took similar actions. The temperatures stabilised. After around 30 minutes staff restarted the pumps and chillers and temperatures fell back. Batch jobs were then un-paused and continued to run. In order to keep the rate of any temperature rise down should there be a recurrence overnight, new batch jobs were not started until the following morning. An outage for the CEs was declared in the GOC DB overnight.

2. Problems with Tape Library:

There are two tape robots here. One is dedicated to the Tier1. The other deals with everything else. They do sit physically next to each other – and there is a “pass-through” port where tapes can be moved between them.

During the week before last one of the two “elevators” within the Tier1 robot stopped working. The elevators move tapes up and down; the “handbots” move tapes horizontally and put tapes into drives. One elevator down was not a great problem – although we were getting it fixed. On Monday of last week (9th) the second elevator stopped working. This happened roughly coincident with the cooling failure we had then. We don’t believe this was the cause – but we didn’t immediately notice it given the more pressing matter that was then ongoing.

With both elevators broken, tape access was still possible but some tape mounts would fail – i.e. a mount could only work where drive and tape were on the same level.

On Tuesday (10th) at 17:45 there were further failures of the Tier1 robot. Among these, a bank of power supplies within the robot was off. (Subsequent work showed the power supplies themselves were OK – but there was no power on the power rail feeding them.) The Tier1 robot was not working from this point.

During Wednesday (11th May) work continued (by Oracle) on the tape library. In parallel, a spare computer system was set up to run the control software pointing only at the non-Tier1 robot. This took a little while to get going but was fairly successful; the system needed restarting a couple of times. However, this meant we were able to provide non-Tier1 tape access.

On Thursday (12th) Oracle continued working and in parallel four tape drives plus some blank tapes were moved from the Tier1 library to the non-Tier1 library. The control software configuration was modified to provide write access to these drives (one for each of the Tier1 Castor instances). In this way we were able to provide some ‘write’ access – but no read. The situation was not stable. In particular the control software was affected by the state of the (non-working) Tier1 library.

During the evening of Thursday the main fault in the Tier1 library was fixed. We continued to run overnight (Thursday to Friday) in the partial configuration described in the previous paragraph. (I.e. only limited write access for the Tier1).

On Friday (13th) Oracle fixed the first ‘elevator’ and put things in the Tier1 library back as they should be. The normal control system was restarted (with some reconfiguration to allow for the four Tier1 drives still in the ‘other’ library) and the whole system brought back into service.

The system has run pretty much OK since then. The control software needed to be restarted once early on Friday evening and there are a few details that are being followed up. The ‘at risk’ (warning) that was left in the GOC DB over the weekend was removed this morning.

The cause of the problems within the library looks to be a card that was replaced on Tuesday to fix the second elevator problem. This board was replaced during the day and – presumably – failed completely at around 17:45 that day. The system came back when the original board was put back on Thursday.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
No Report.

SI-8 External Contexts (PG)
———————————
Covered in agenda item 1.


REVIEW OF ACTIONS
=================
595.9: JC to discuss writing instruction documents for the workflows required by new users of the Grid. Ongoing.
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Ongoing.

596.2: DB & PG will digest CRSG paper and develop a set of figures. Ongoing.

596.3: PG will rationalise figures from the CRSG paper and consult with experiment representatives on whether they are satisfied to proceed with the proposed numbers. Ongoing.

ACTIONS AS OF 16.05.16
======================
595.9: JC to discuss writing instruction documents for the workflows required by new users of the Grid. Ongoing.
596.1: SL will assess existing Tier-2 hardware and its expected lifetime. Ongoing.

596.2: DB & PG will digest CRSG paper and develop a set of figures. Ongoing.

596.3: PG will rationalise figures from the CRSG paper and consult with experiment representatives on whether they are satisfied to proceed with the proposed numbers. Ongoing.

597.1: AS will look out and circulate paperwork for the Tier-1 cooling incident experienced in 2010 to determine any similarities with the recent incident.

597.2: DB will monitor usage via the new accounting portal, to which he will be given access.

597.3: PC will work up text for a policy on supporting non-LHC VOs and will circulate it for discussion this week.

Bank Holiday on 30th May, so no PMB that week.