GridPP PMB Meeting 585

GridPP PMB Meeting 585 (11.01.16)
=================================
Present: Dave Britton (Chair), Pete Gronbech, Tony Doyle, Andrew Sansum, Jeremy Coles, Dave Kelsey, Gareth Smith, Andrew McNab, Roger Jones, Steve Lloyd, Pete Clarke, Louisa Campbell (Minutes).

Apologies: Claire Devereux, Dave Colling, Tony Cass.

1. Small VO support at RAL
===========================
This point relates to a ticket outstanding since October, mentioned in recent email exchanges from GS (also covered in the Tier-1 Manager’s Report, below). The Tier-1 accepted that it should have responded more quickly, regardless of the priority level at which tickets are raised. New VOs will be advised to use Dirac and tickets should be dealt with more proactively. GS advised that he normally picks up tickets weekly, usually as a result of Jeremy’s meeting on Tuesdays and another meeting on Wednesdays. It was recognised that as additional non-HEP communities are brought on board they will also need support to use Dirac.

2. December CPU efficiencies (Alice, CMS low and ATLAS not much better)
=======================================================================
DB raised an email from Andrew Lahiff on 07/01/16. He noted low efficiency across Alice, CMS and ATLAS and enquired whether this may be symptomatic of anything of concern, i.e. is it simply a reflection of new workflows or does it highlight unknown problems. It was accepted that the batch issues around Christmas may explain this; AMcN mentioned job failures before Christmas.

ACTION 585.1: AS to report on RAL job efficiencies before Christmas.

3. Recognising Software contributions (theme of next SSI collaboration meeting)
=======================================================================
DB mentioned a chain of emails on the HEP software forum email list about recognising software contributions. Ben Morgan emailed advising that the Software Sustainability Institute had presented a workshop on the topic and that the next collaboration meeting has this as its theme. This is somewhat peripheral to GridPP as we don’t write the software; however, the meeting is being hosted in Edinburgh on 21-23 March and clashes with IOP in Brighton. AS is creating software within GridPP, which relates to the storage element. DB and AS will consider whether a member of staff, perhaps Sam Skipsey, should attend to monitor and report back and to demonstrate our engagement.

ACTION 585.2: DB and AS will determine who best to send to SSI Collaboration meeting and report back on outcomes.

4. “Capital amounts” (Yates)
============================
PC summarised the background to RCUK on 22nd January. Through several actions, especially those of Jeremy Yates in his central role with the RCUK e-Infrastructure group, a spreadsheet of several £M of EE infrastructure is being pushed to BIS, e.g. Dirac funding of c. £30M. HTC computing was c. £3M per annum and DB noted the need to firm up the figures. Above RCUK (at RUK level) there are moves to push the research councils together. The RCUK infrastructure group is having input there and it is crucial that we present the most accurate costings and aspirations to ensure consideration. DB and PC will attend the ATI meeting on Wednesday 13.01.16 and will determine, by the end of this week, which figures should be inserted. The objectives are two-fold: there is some hope that funding will derive from this, and it should also address any concerns over the restructure and its wider impact. Tony Hey is co-Chair of ELC and is in contact with the RCUK group.

ACTION 585.3: DB and PC to discuss and determine what figures should be included as capital amounts for computing infrastructure.

5. AOCB
=======
a) Glasgow over Christmas
-------------------------
A series of very unfortunate events occurred on New Year’s Day, leading to serious issues in Glasgow. The Kelvin Building had a very brief power spike, causing computers in one room to switch off and the air conditioning in another room to shut down. Three independent monitoring/alarm systems failed to provide warning: the primary BMS sent notifications to the wrong place; the compressor alarm call-out failed because the VOIP phone system went down; and our own monitoring failed because the computers went down.

The resultant excessive heat (estimated at between 70 and 90 degrees) led to melted plastic on light fittings, power distribution units to our nodes and shrink-wraps. Thankfully a member of staff investigating the outage noticed the heat and was able to activate a manual shut-down. We have lost 1000 nodes where on/off switches were melted. Nine discs were lost, but there was no data loss. There were several other losses, including the UPS systems (not for the Grid but for others in the School). DB expressed concern over the reputational and financial risks caused by the failure and any longer-term problems that may arise, which DB will take up with the university. It was recognised that it is fortunate the situation was not a great deal worse.

It was noted that back-up emergency procedures are crucial, but individual system shutdowns are preferable, though DB noted the older systems cannot do this. As a positive outcome, DB mentioned that GU is now considering the provision of a containerised Datacentre by October. DB is currently, and coincidentally, engaged with a group (Cordless) who are investigating the provision of a longer term facility for GU.

b) GridPP grants
----------------
DB confirmed GU, Lancaster and QM grants have been approved and accepted. PG will check with Admin staff for the most up to date communication in this regard.

c) Quarterly reports
--------------------
PG has received all quarterly reports and will summarise next week. He reminded members that new reports are now due.

ACTION 585.4: DB will provide AS with information on costings, etc., for a containerised datacentre.

6. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
DC not present, no report submitted.

SI-1 Dissemination Report (SL)
------------------------------
## GridPP Dissemination Officer Notes for PMB

### Ganga workshop at HEPSYSMAN Jan ’16

There will be a Ganga workshop at the HEPSYSMAN meeting in Manchester, which will explore elements of the GridPP UserGuide (particularly with respect to the GridPP DIRAC setup).

AMcN noted they have agreed to accommodate Tom at RAL 2 days per week (Tue & Wed) – this is only a seating arrangement to accommodate his recent house move.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
GU was an issue; the local group will also be transferred elsewhere as a precaution. Royal Holloway wanted to decommission some discs but there was insufficient time to do this, so this is being looked into. Lancaster recovered from issues before Christmas, but this revealed power issues at the data centre that need to be addressed, probably alongside other updates before Easter so that there will be no additional disruption. At Rutherford, some additional space was found. Some network issues were experienced at RAL, which is a growing concern. File corruption issues possibly caused this.
DB questioned whether more clarification is required. It could be because old routers were involved (the GS report covers this in more detail).

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
DC not present, no report submitted.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
AMcN reported that several Monte Carlo jobs were able to continue running over Christmas, but no new jobs are running at the moment. There was a workshop at the end of 2015 and more multi-processor jobs will begin to run. He will request that some of the Tier-1s provide virtualised resources to validate VMs for workloads that run data (analysis and processes). AMcN asked if they can have access to Tier-1 for EC2 support at RAL to manage resources. GS will investigate.

Tier-2Ds with data: AMcN mentioned they are considering a Tier-2a category (validated for doing analysis with tickets and hosts to storage element = the old Tier2d). There may be some references to these new aims in future, but there are no requests for sites with more discs.

ACTION 585.5: GS will check if Tier-1 access can be provided for EC2 support at RAL to manage resources.

SI-5 Production Manager’s report (JC)
-------------------------------------
A couple of issues arose for some sites over the holiday period. The most serious was at Glasgow where a temporary power cut led to a machine room running without cooling until being manually shut down. This had a knock-on impact on GridPP Nagios monitoring as the site is used in the list for file replication.

Together with unscheduled downtime at Lancaster due to flooding before Christmas, the month’s A/R figures reveal several sites below the 90% target of ATLAS during December. Details will be reported in the coming week(s).

For the holiday period all the experiments reported generally good operations: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes160107#Experiments_Reports.

The quieter period was used to the benefit of DIRAC who moved 120TB over the holiday period with transfers reaching 350MB/s (for details see http://gridpp-storage.blogspot.co.uk/2016/01/update-on-vodiracacuk-data-mopvement.html).
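As a rough sanity check on the figures above, the following sketch estimates the shortest time in which 120TB could move at a sustained 350MB/s. It assumes decimal units (1TB = 1,000,000MB) and treats the quoted peak rate as if it were sustained, so the result is only a lower bound. The function name is hypothetical.

```python
def min_transfer_days(volume_tb: float, rate_mb_s: float) -> float:
    """Lower bound on transfer time in days.

    Assumes decimal units (1 TB = 1,000,000 MB) and that the given
    rate is sustained for the whole transfer, which is optimistic.
    """
    seconds = volume_tb * 1_000_000 / rate_mb_s
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    days = min_transfer_days(120, 350)
    print(f"At least {days:.1f} days at the peak rate")
```

Since 350MB/s was the peak rather than the average, the real transfer would have taken longer than this bound.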

There are two GridPP events this week: A Ganga workshop on Thursday (http://indico.cern.ch/event/465558/) with the aim of helping our sysadmins get to grips with our current Ganga/DiRAC approach for other VOs, and HEPSYSMAN on Friday (https://indico.cern.ch/event/465560/).

Oxford will be losing Sean Brisbane (group sysadmin) and Ewan MacMahon (moving to a new position in Computational Biology) in the coming months and will be looking at ways to minimise the wider impacts.

There is a GDB this week. This is the first with Ian Collier as chair: http://indico.cern.ch/event/394776/. The focus is on security topics.

The PMB would like to record its thanks to Sean Brisbane, and its particular thanks to Ewan MacMahon for his many contributions, which will be very difficult to replace.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
Castor:
– Over the New Year (31/12-01/01) AtlasDataDisk filled up. This caused the ATLAS SAM tests to fail. There was very high load on the couple of servers that did have some remaining space, which in turn led to read failures as well. The problem was alleviated when ATLAS changed their deletion algorithm on 2nd January to preferentially delete larger files first, and free space became available.
– There was a problem with all Castor instances during the day on Monday 4th January, when there were internal problems within Castor. The problem lasted roughly the length of the working day but is not yet understood.
– There were failures of the CMS SRM SAM test at the end of last week. This currently tests our tape instance (CMSTape), which has been busy.
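The idea of deleting larger files first to free space quickly can be sketched as follows. This is an illustration only, not the ATLAS deletion algorithm: the function name, the dictionary layout and the file names are all hypothetical.

```python
def pick_deletions(files: dict, bytes_needed: int) -> list:
    """Choose files to delete, largest first, until enough space is freed.

    `files` maps a (hypothetical) file name to its size in bytes.
    Returns the chosen names in deletion order.
    """
    freed = 0
    chosen = []
    # Sort by size, largest first, so each deletion frees as much as possible.
    for name, size in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        if freed >= bytes_needed:
            break
        chosen.append(name)
        freed += size
    return chosen


if __name__ == "__main__":
    candidates = {"a": 10, "b": 500, "c": 200, "d": 5}
    print(pick_deletions(candidates, 600))
```

Ordering by size means fewer delete operations are needed to reach a given amount of free space, which matters when the disk servers are already under heavy load.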

Networking:
– As reported at the last meeting, we had doubled the link between the UKLight and the RAL border routers to 2*10Gbit, doubling our bandwidth for non-OPN data transfers. However, it was found that the ACLs that act as a firewall over this route were not in place. There were problems at the start of last week as we attempted to put the ACLs back in place and still run with the 2*10Gbit link. There were significant problems during the night of Tuesday/Wednesday 5/6 January, when this link was not working. We have now reverted to a single 10Gbit link for this connection.

DB enquired whether this situation reflects a gap in expertise. GS confirmed that more expertise in some areas might be helpful (e.g. the UKLight router). He confirmed there is sufficient expertise in the organisation, but there is a question over whether the time of those staff has been adequately allocated. DB asked whether sufficient support is being provided by the networking group, given the other significant pressures on them from many other areas of the organisation. AS noted that Tier-1 group capabilities do seem to be an issue; the 0.5 FTE estimated for the GridPP5 grant proposal may prove insufficient. Also, the system administration is becoming increasingly complicated. Internally, consideration is being given to contracting the same consultant that Jasmine recently engaged, and also to a link-up with research infrastructure to deal with this alongside more in-depth training. Externally, changes in the site networking team mean they have a huge workload; they do give the Tier-1 attention, but the volume of work affects timescales and the progression of issues between teams. AS is also progressing a site-networking strategy covering the next 5 years. It was noted that the UKLight router has not come up as critical at TDA meetings, and this is the proper forum for highlighting such issues should they become critical.

DB raised 2 points: 1) GridPP registered low level concerns; and 2) as the UKT0 grows networking becomes increasingly important and we need to cover this when making future grant proposals.

Batch:
– The batch system showed problems that became progressively worse on the approach to, and over, Christmas. This problem was initially flagged up by LHCb and could be seen in intermittent SAM test failures. These problems were resolved between Christmas and the New Year. The problems were found to be partly due to excessive load on the Condor components running on the ARC CEs, caused by the new draining algorithm. There was an additional problem caused by a parameter change introduced with a Condor update.

Procurement:
– The tenders closed on Friday (18th December) and evaluations have been done.

Other:
– There was an e-mail discussion regarding support at the RAL Tier1 for jobs submitted via Dirac for a set of VOs (GGUS #116866). This ticket is quite old and has not yet been completed. We do recognize the importance of enabling this access, which is now working for three of the VOs concerned (GridPP, Pheno, SNO+). Work is ongoing to enable the other two (T2K, NA62).

Actions:

583.3 GS to check whether we have sufficient servers to accommodate increased tape requests and whether we would face the same issues again if LHCb make similar requests in the future:
I have reviewed this and we believe that such a request would be processed significantly better if/when made again. There are three factors here. The most significant is that a parameter change introduced when the Castor tape servers were upgraded to Castor version 2.1.15 has been revised. This parameter controls when the tape system reports to the rest of Castor that files have been read from tape. Castor 2.1.14 tape servers reported every file. The default in version 2.1.15 is every 500 files. We have now reduced this to 20. Secondly, a small improvement to the performance of the tape servers in the relevant service class has been made (a change to the I/O scheduler within Linux). Thirdly, at the time of LHCb’s large recall there were problems with two out of the five disk servers in the service class. The one thing that could be improved in the tape system for LHCb is to increase the number of tape servers. This would improve both resilience and throughput and is being discussed.
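The effect of the reporting interval can be illustrated with a small sketch. It is not Castor code: the function name is hypothetical, and the final flush of a partial batch at the end of a recall is an assumption made for the example. It lists the points in a recall at which a report would be sent for intervals of 1, 20 and 500 files.

```python
def report_points(n_files: int, batch_size: int) -> list[int]:
    """Return the cumulative file counts at which a report is sent.

    A report goes out after every `batch_size` files. Any remainder is
    assumed to be reported once the recall finishes (an assumption for
    this sketch, not a statement about Castor's behaviour).
    """
    points = list(range(batch_size, n_files + 1, batch_size))
    if n_files > 0 and (not points or points[-1] != n_files):
        points.append(n_files)
    return points


if __name__ == "__main__":
    for batch in (1, 20, 500):
        pts = report_points(1200, batch)
        print(f"every {batch} files: {len(pts)} reports, first at file {pts[0]}")
```

A larger interval means fewer reports, but the rest of the system waits longer to learn that the first files are available. An interval of 20 sits between the two extremes.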

583.4 GS will investigate whether there is a need to issue guidelines to the experiments outlining what is acceptable and if we could handle any requests several times greater than the LHCb request:
Although it is difficult to have such certainty we believe the tape system would successfully handle bigger requests. At this stage we do not think it useful to request VOs to moderate numbers of tape requests – although of course they would need to wait longer for their files to be staged.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
Nothing to report since there has been no MB meeting.

REVIEW OF ACTIONS
=================
578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. This element of the action has been closed, but AMcN will put together a template and report then update the PMB in due course. Done.

581.3 ALL members who have not already done so should submit reports to PG asap. Done.

582.4 DC to insert an update in the wiki page regarding communication with LZ. Done.

584.1 ALL to examine specification requirements for future procurements. Done.

584.2 PG will forward Tony’s email regarding recent STFC rule changes on overlap funding to PMB and PIs so that they can plan and make best use of current funding. Done.

584.3 DB will email Tony Medlan to enquire who will replace Jonathan Flynn as Chair of CRSG. Done.

584.4 GS will investigate issues experienced with jobs at RAL between 17-18 December. Ongoing.

ACTIONS AS OF 11.01.16
======================

582.4 DC to insert an update in the wiki page regarding communication with LZ. Ongoing.

584.4 GS will investigate issues experienced with jobs at RAL between 17-18 December. Ongoing.

585.1: AS to report on RAL job efficiencies before Christmas.

585.2: DB and AS will determine who best to send to SSI Collaboration meeting and report back on outcomes.

585.3: DB and PC to discuss and determine what figures should be included as capital amounts for computing infrastructure.

585.4: DB will provide AS with information on costings, etc., for a containerised datacentre.

585.5: GS will check if Tier-1 access can be provided for EC2 support at RAL to manage resources.

Next meeting: DB will be in Chile next week and PG will chair the PMB or cancel in the absence of issues to be raised.