GridPP PMB Meeting 583 (14.12.15)
=================================
Present: Dave Britton (Chair), Claire Devereux, Roger Jones, Pete Gronbech, Tony Cass, Andrew Sansum, Jeremy Coles, Steve Lloyd, Dave Kelsey, Gareth Smith, Pete Clarke.

Apologies: Tony Doyle, Dave Colling, Andrew McNab.

1. Status of quarterly reports?
===============================
PG reported that the status was similar to last week, though he had received the CMS report just prior to the meeting. GS now had the required information from AS and should be able to complete the Tier-1 report this week.

JC had been very busy last week but expected to complete his report this week.

2. Status of GridPP5 grants?
============================
PG reported that after receiving two separate queries from university sites, he emailed STFC asking for an update. They responded: ‘we now have formal project approval. That means that we can get on with the next stage of grant processing and will move quickly. So they are coming.’ At least three sites have subsequently received emails from STFC confirming their proposals had been recommended for funding and that offer letters are en route.

In line with previous years, DB asked STFC for a letter confirming the overall totals etc. STFC advised that the funding announcement now comes from SBS, so there will be a delay, but Sarah Verth will send something to DB and discuss it in the New Year.

3. CERNVM(-FS) workshop at RAL
==============================
There has been a formal request to support the CernVM-FS workshop at RAL.
This equates to c.£2,500 plus industry speakers’ expenses, so approximately £3k in total. This will be a three-day workshop with UK travel expenses, and the costs will be split with SCD. Taken in the round, this seems reasonable. However, DB and AS expressed concern over the proposed £900 lecture theatre hire charge and the £100 for production of badges, though it was confirmed that Indico can be used to register delegates and generate badges.

DK enquired whether the lecture theatre was wireless-enabled, as this is obviously a prerequisite for a workshop. AS believed it was, and questioned the value of arguing a case over the costs, as it may be sensible to ensure the registration fee incorporates the necessary expenses and negates the need for sponsorship. PG enquired why HEPSYSMAN does not get charged, and GS confirmed this is due to its use of an ordinary meeting room. It was agreed that GridPP would be happy to sponsor, but that the venue hire fee is unreasonably high for an STFC-hosted meeting at an STFC venue. From the UK T0 point of view it is worthwhile holding the meeting. AS will pursue this further.

ACTION 583.1 AS will pursue the possibility of reducing venue hire costs for hosting the CernVM-FS workshop at RAL and ensure the venue is wireless-enabled.

4. AOCB
=======
a) PG asked DB whether we plan to ask for an overlap between the dates of the GridPP4+ staff grants and the new GridPP5 grant. DB said that we should ask, as this often proves useful for making best use of the funds at the institutes, provided there is no double charging.

ACTION 583.2 PG to request an extension to the end date of the grants to help with the overlapping period, in line with previous grants.

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
——————————————-
Nothing to report.

SI-1 Dissemination Report (SL)
——————————————-
## GridPP Dissemination Officer Notes for PMB

### GridPP Website 2.0
The new site launched successfully last week [GridPP]. A few requests for renamings (e.g. Brunel University London), additions of links (e.g. external blogs and monitoring sites), etc. have been implemented.

### Representation on STFC’s website
For reference, TW has checked and GridPP are featured on STFC’s new website [STFC], with a link to our new website [GridPP] and a description of the project.

### Reaching out to SESAME Net
With the new GridPP website launched, TW has reached out to SESAME Net [SESAME] via HPC Wales (via Claire Devereux) to see if there are any opportunities for collaboration, knowledge exchange, or sharing best practice with respect to supporting SMEs.

[GridPP] https://www.gridpp.ac.uk

[STFC] http://www.stfc.ac.uk/research/science-roadmap/roadmap-projects/gridpp/

[SESAME] http://sesamenet.eu/

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
RJ reported that Lancaster is recovering from the recent flooding.
Storage migration at QMUL and Sussex is ongoing.
ATLAS is discussing the submission of 48 GB, 8-core heavy-ion jobs.

SI-3 CMS Weekly Review and Plans (DC)
——————————————-
Nothing to report.

SI-4 LHCb Weekly Review and Plans (PC)
——————————————-
Nothing to report.

SI-5 Production Manager’s report (JC)
——————————————-
Relatively quiet in terms of operations issues and news (so only one each this week!):

1. Janet issues continued to affect our infrastructure last week. The ROD dashboard was almost unusable (made worse by an extended GGUS downtime on Wednesday). Janet indicates the DDoS was contained from Tuesday: https://www.jisc.ac.uk/news/janet-network-service-update-10-dec-2015.

2. There was a GDB last week: https://indico.cern.ch/event/319754/. It was the last with Michel Jouvin as chair. The main topics were around the WLCG workshop in February, accounting, the future of the Information System and issues with HS06 scalability.

SI-6 Tier-1 Manager’s Report (GS)
——————————————-
General:
– A couple of significant problems during the week. One was a packet storm across the Tier1 network on Thursday. The other was a problem with throughput to tape for LHCb.

Castor:
We were seeing a bottleneck both in migrations to tape and in dealing with a large number of tape recalls for LHCb at the end of last week and into the weekend. LHCb had put in a large number of recall requests. At first we saw migrations to tape stalling: while Castor prioritises writes to tape, these were running out of resources. The number of tape drives allocated to reads was reduced and the backlog was worked through. For the recalls there have been two issues. The first was that the tape servers (now running Castor 2.1.15) would only report back to Castor once either a tape was finished with or 500 files had been transferred. With a number of tapes being read in parallel, some of which contained around a thousand files to be recalled, this significantly delayed the reporting back that files had indeed been read. This was fixed on Friday. There have also been problems with two (out of the five) disk servers in the cache for this LHCb tape area, which have also throttled performance.
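
To make the reporting delay concrete, here is a minimal toy model of the batched reporting described above (illustrative only, not actual Castor code; the per-file read time is an assumed figure):

```python
# Toy model of the reporting behaviour described above; illustrative
# only, not actual Castor code. SECONDS_PER_FILE is an assumption.

FILES_ON_TAPE = 1000      # ~a thousand files queued for recall on one tape
REPORT_THRESHOLD = 500    # tape server reports only every 500 files
SECONDS_PER_FILE = 2      # assumed average time to read one file

def report_times(files_on_tape, threshold, secs_per_file):
    """Times (seconds) at which the tape server reports completed
    recalls back to the stager under the batched-reporting scheme."""
    times, done = [], 0
    while done < files_on_tape:
        batch = min(threshold, files_on_tape - done)
        done += batch
        times.append(done * secs_per_file)  # one report per full batch
    return times

# The first file is read after ~2 s, but nothing is reported until the
# 500th file: jobs waiting on early files see a long artificial delay.
print(report_times(FILES_ON_TAPE, REPORT_THRESHOLD, SECONDS_PER_FILE))
# -> [1000, 2000]: only two reports for the whole tape
```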

DB enquired whether the LHCb request was unusual and whether we should be looking at their workflow. GS confirmed that, though this was unusual, it was perhaps not unexpected, and that we should check whether we have sufficient servers. There was some discussion around whether a similar request in the future could be accommodated, and recognition that a request five times bigger would be expected to work, but with reduced speed. It was agreed that reporting back on the configuration change made last week would help future performance. GS will investigate whether there is a need to issue guidelines to the experiments outlining what is acceptable in terms of increased tape requests.

Networking:
– There was a packet storm across the network around lunchtime on Thursday (10th Dec). The trigger was a restart of a switch that is one of a pair connecting a new Windows Hypervisor Cluster into the network. Why this triggered the event is not yet clear. Efforts were made to suppress the packet storm, and for a period the Tier1 network was disconnected from the site network. We declared a site outage from 11:45 to 15:15.
– On Tuesday morning (8th) we successfully moved the link between the UKLight Router and the Tier1 network off the old ‘core’ switch. This is currently running with a 2×10Gbit connection. There looks to be a fault in one of the cards in the UKLight router, and we are planning to swap that out in the coming days. The aim is to get this link up to 4×10Gbit.
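
For scale, a back-of-envelope sketch of what the bonded link provides, assuming ideal aggregation and ignoring protocol overhead and traffic-hashing imbalance:

```python
# Back-of-envelope only: ideal aggregate throughput of the bonded
# UKLight link, ignoring protocol overhead and hashing imbalance.

def aggregate_gbytes_per_s(links, gbit_per_link=10):
    """Ideal aggregate throughput in gigabytes per second."""
    return links * gbit_per_link / 8  # convert gigabits to gigabytes

print(aggregate_gbytes_per_s(2))  # current 2x10Gbit -> 2.5 GB/s
print(aggregate_gbytes_per_s(4))  # planned 4x10Gbit -> 5.0 GB/s
```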

Batch:
No significant change. Work continues on tests of allowing “pre-emptable” jobs to run while worker nodes are draining to make space for multi-core jobs.
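
As a minimal sketch of that backfilling idea (assumptions only: this is not the Tier-1’s actual batch-system configuration, and the class and policy names are invented for illustration):

```python
# Illustrative model of backfilling "pre-emptable" jobs onto a draining
# worker node. Not real batch-system code; names/policies are assumed.

from dataclasses import dataclass, field

@dataclass
class WorkerNode:
    total_cores: int = 8
    draining: bool = False                 # draining to open a multi-core slot
    jobs: list = field(default_factory=list)

    def free_cores(self):
        return self.total_cores - sum(j["cores"] for j in self.jobs)

    def try_start(self, job):
        """Refuse normal jobs on a draining node; let pre-emptable jobs
        backfill cores that would otherwise sit idle during the drain."""
        if self.draining and not job["preemptable"]:
            return False
        if job["cores"] > self.free_cores():
            return False
        self.jobs.append(job)
        return True

    def claim_multicore_slot(self, cores_needed=8):
        """When the drain completes, evict pre-emptable jobs for the slot."""
        self.jobs = [j for j in self.jobs if not j["preemptable"]]
        return self.free_cores() >= cores_needed

node = WorkerNode(draining=True)
print(node.try_start({"cores": 4, "preemptable": True}))   # True: backfilled
print(node.try_start({"cores": 2, "preemptable": False}))  # False: refused
print(node.claim_multicore_slot())                         # True: slot freed
```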

Procurement:
No further update regarding the procurements. As reported before, the tenders are set to close on 18th December.

I should have included the Tier1 Availabilities for November 2015 in last week’s report. I add them here for completeness:
Alice: 100%
Atlas: 100%
CMS: 100%
LHCb: 99%
OPS: 100%

Christmas Plans:
RAL closes on the 24th December and re-opens on the 4th January. During this period there will be staff on-call as usual out of hours. This will be augmented by some brief regular (daily) checks of systems.

ACTION 583.3 GS to check whether we have sufficient servers to accommodate increased tape requests and whether we would face the same issues again if LHCb make similar requests in the future. Ongoing.

ACTION 583.4 GS will investigate whether there is a need to issue guidelines to the experiments outlining what is acceptable and if we could handle any requests several times greater than the LHCb request. Ongoing.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
Pete Clarke attended on DB’s behalf, but there is no report as yet; DB will email a report round.

Next meeting
============
The next meeting will be next week, subject to cancellation should there be no major items to discuss.

The first meeting after the festive break will be on 11th January.

REVIEW OF ACTIONS
=================
578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. Ongoing.

580.1 AS will look at tape planning to determine if we want to take forward increasing ALICE space and if this can be easily accommodated in all the scenarios. (AS believes this does not need to be raised at the resource meeting and will implement a plan to use less resource and more capital, as moving Atlas is a 6-month undertaking, though it would be beneficial. Sufficient drives (14c 10d) are necessary and 10d will be purchased. AS has extrapolated usage and cautions that we can keep DiRAC operating through FY16, but in the absence of funding the situation would be critical by end FY16.) Done.

581.2 CD to provide DB and LC with copies of relevant spreadsheets, budgets and planning information from recent EGI conference. Done.

581.3 ALL members who have not already done so to submit reports to PG. Ongoing.

582.1 ALL to bring the LSCG and DPHEP workshops to the attention of key staff who should be registered, and to monitor costs and visit notices. Done.

582.2 AS to advise ALICE to open a dialogue with us regarding additional tape space if they reach a crisis, but in the meantime the situation should be left as previously agreed. Done.

582.3 AS and PC to continue to model the costs and planning before the resourcing meeting (16.12.15). Done.

582.4 DC to insert an update in the wiki page regarding communication with LZ. Ongoing.

582.5 DB will email Sarah and ask her to confirm if a decision on the GridPP5 grant can be provided in the next week. Done.

ACTIONS AS OF 14.12.15
======================
578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. Ongoing.

581.3 ALL members who have not already done so to submit reports to PG. Ongoing.

582.4 DC to insert an update in the wiki page regarding communication with LZ. Ongoing.

583.1 AS will pursue the possibility of reducing venue hire costs for hosting the CernVM-FS workshop at RAL and ensure the venue is wireless-enabled.

583.2 PG to request an extension to the end date of the grants to help with the overlapping period, in line with previous grants.

583.3 GS to check whether we have sufficient servers to accommodate increased tape requests and whether we would face the same issues again if LHCb make similar requests in the future. Ongoing.

583.4 GS will investigate whether there is a need to issue guidelines to the experiments outlining what is acceptable and if we could handle any requests several times greater than the LHCb request. Ongoing.