GridPP PMB Meeting 587

GridPP PMB Meeting 587 (25.01.16)
=================================
Present: Dave Britton (Chair), Claire Devereux, Pete Gronbech, Tony Doyle, Andrew Sansum, Jeremy Coles, Gareth Smith, Andrew McNab, Roger Jones, Steve Lloyd, Pete Clarke, Tony Cass, Louisa Campbell (Minutes).

Apologies: Dave Colling, Dave Kelsey.

1. EVAL Reminder and Concerns
==================================
Members were reminded that the window for submitting yearly results to EVAL closes on 1 February, so information needs to be gathered, especially regarding posts, committees and papers. PG raised this as he has had several queries from sites whose universities have expressed concerns. PG normally pulls together and presents the information, but if PG collates and submits all the information associated with each university then the PI has to submit NIL entries, which causes issues at university level. The process currently in place is not ideal for either universities or STFC. PG will undertake the full listing as usual, then look at a more satisfactory solution and circulate it to the PMB for comment/input.

ACTION 587.1: PG will look at the EVAL information and seek a better solution for submitting the required information. He will forward to the PMB a detailed complaint from Birmingham to STFC highlighting some issues being raised at site level.

2. AOCB
=======
a) GridPP36 Agenda
DB agreed the suggested theme around site evolution from last week’s PMB seems sensible and appropriate. He suggested a Call go out for a small number of sites to present their plans over the next few years and a session constructed around that theme. A large, medium and small site should each do a presentation and then a discussion should be built into the programme around each. AM will progress this and invite contributions from suitable sites to create a session.

ACTION 587.2: AM will invite selected small, medium and large sites to contribute presentations at GridPP36 on their plans for site evolution over the next few years and construct a session around this.

b) DB advised that the planned meeting with Tony Hey has been postponed; an alternative date is awaited.

c) PC is giving a talk on Thursday at AIS, where he will state that GridPP is enthusiastic about running jobs. JC and AM will receive an email from PC about doing something with Cupid, who are extremely keen to be involved. DEAP-3600 (dark matter related) is similarly keen. Keeping in mind that JC works only part time for GridPP, DB suggested that a point person/champion be identified for progressing such matters. Preferably someone should be sought at institutional level; if this is unsuccessful, an alternative should be found by identifying a best-match person.

d) CD summarised the NGI RCUK infrastructure management meeting run by BIS last Friday (some PMB members also attended). Charlotte Jamieson represented STFC; Matthew Dovey had attended a previous meeting. Susan presented on the current position, and Jeremy Yates presented a PDG update. Outcome: it was proposed that a pilot project to inform future bids should receive c. £200K from across all research councils in JISC. This appears to have been well received, though Susan received a number of questions on the spreadsheet that all the projects contributed to. Some research councils included things that are in their current programmes/baselines while others did not. Jeremy Yates may be asked to present some case studies (e.g. multidisciplinary science, leverage from commercial sectors, societal benefits, etc). Charlotte reports that the meeting went well; it appears likely the research councils will fund this and try to disentangle the data to ensure consistency. It is thought possible that the infrastructure elements could develop into future funding in the next Autumn Statement if something is prepared in advance. BIS have this at the top of their agenda, so it seems sensible to have input now so that we remain involved and proactive. STFC are clear about which lines of its programme are to be included (flat cash, minimum viable and optimal scenarios).

e) Project Aeneas is a proposal for the H2020 call closing 20 March, with several work packages structured around a Call tailored toward SKA. One work package is a design study relating to the development of a (European) regional computer centre to process data coming off SKA. STFC wish to be involved as a collaborator, though the amount of money involved is not hugely significant: possibly a 3 year project for £3M with many partners involved (c. €300K = possibly 1 FTE). Security was raised as an important element of this and needs to be incorporated into the proposal as it develops. GridPP was asked whether we supported using GridPP effort to match EU funds for this. This was agreed, as it is a very important opening to bring our experience into the project.

f) CD has now returned from BIS and has been considering her future role in STFC. She has been offered, and has accepted, a new position in the Stakeholder and International team looking after Infrastructure from March. She will be liaising with BIS and will continue to have input, but will be leaving GridPP. DB offered congratulations on behalf of the PMB and thanked her most sincerely for her considerable input to GridPP.

g) It was noted that the CHEP2016 abstract Call will go out in a couple of weeks with an end of March deadline.

h) DB noted an email received from Chris Allton (theorist at Swansea) asking to step down from the GridPP CB as he has been invited to be the UK representative on the C-RSG (Computing Resource Scrutiny Group). He has been invited to propose a replacement for GridPP, but this is not essential as we do not fund Swansea.

i) AS noted that Alison Kennedy has been appointed as Director of the Hartree Centre. This will shortly be announced.

3. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
After the standing items, where we went through the experiments and sites (all of which appear to be making solid and reasonable progress), we turned to the main topic of how to push forward the model that we are suggesting. We decided that we needed a large site to install a good fraction of its resources as a vac site. We concluded that an ATLAS/LHCb site would be ideal for this. Liverpool has been suggested as an ideal site (for several reasons that Andrew can clarify if necessary). We also agreed that the selected site should be protected against any loss of income from GridPP for undertaking the work. Steve Jones has been contacted and is considering the proposal. If the response is negative we will look for another large(ish) site; Glasgow have already said that it is not a good time.

Following last week’s PMB discussion about the evolution of CERN compute resources, Dirk Hufnagel and I (DC) met with Tim and Jan at CERN to discuss this. I reported on the PMB discussions, and the others thought that the proposed changes will hit CMS more than the other experiments. While this might be correct, it is possible that others will also be affected. Tim, Jan, Dirk and I agreed to restart the (very productive) ~monthly meetings that we used to have in the early days of CERN AI. These were usually (always?) attended by Maria Girone, often by Bernd (as they involved resource provisioning) and sometimes by Dirk D. for EOS. So, while I will be wearing a CMS hat in these meetings, as GridPP TD I will report back to the technical meeting anything that is more general in nature.

The next meeting will focus on diskless deployment at sites. In a preliminary discussion, Luke from Bristol said that user analysis jobs took ~14Gb/s per 1000 cores, i.e. ~14Mb/s per core. In tests that we have conducted at Imperial, we found a wide variety of network usage patterns for analysis jobs, with 14Mb/s per core slightly above anything that we found; for example, the H->tau standard ntuple maker takes ~8Mb/s per core, so our tests agree at the ballpark level.
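As a quick check of the Bristol and Imperial figures quoted above, here is a minimal sketch of the unit conversion. It assumes decimal prefixes (1Gb = 1000Mb) and reads the ntuple maker figure as a per-core rate in Mb/s; the function name is hypothetical and used only for illustration.

```python
def per_core_mbps(total_gbps: float, cores: int) -> float:
    """Convert an aggregate rate in Gb/s across `cores` cores to Mb/s per core.

    Assumes decimal prefixes, i.e. 1 Gb = 1000 Mb.
    """
    return total_gbps * 1000.0 / cores


# Bristol figure: ~14 Gb/s per 1000 cores.
bristol_per_core = per_core_mbps(14.0, 1000)

# Imperial figure for the H->tau ntuple maker, assumed to be Mb/s per core.
imperial_per_core = 8.0

# Ratio of the two estimates; a factor below 2 is agreement at the ballpark level.
ratio = bristol_per_core / imperial_per_core

print(bristol_per_core, ratio)
```

Under these assumptions the two estimates differ by less than a factor of two.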

SI-1 Dissemination Report (SL)
——————————
GridPP Dissemination Officer Notes for PMB

GridPP UserGuide update:
TW has added instructions for using the DIRAC File Catalog Command Line Interface to the UserGuide:

https://www.gridpp.ac.uk/userguide/data-on-the-grid/dirac-dfc-cli.html
This covers using the DFC CLI to upload, manage, replicate and download data using the command line, providing the new user with a relatively gentle introduction to putting data on grid Storage Elements.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
There has been a lot of load coming through from ATLAS after some Christmas process went through. Tier-1 experienced some disc failures – this seems to be a generic problem (Scratch tape and buffers). ATLAS has been hit by Login VPNs that affected some generator jobs at 4 of the UK Tier-2 sites but these were resolved very quickly. The Quasi4 report is almost complete.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
Nothing of significance to report.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
Nothing of direct relevance to report.

SI-5 Production Manager’s report (JC)
————————————-
1. The T2 WLCG A/R results for December have been circulated (http://wlcg-sam.cern.ch/reports/2015/201512/wlcg/).

(Figures are availability: reliability.)

ALICE: All okay.

ATLAS:
QMUL: 71%: 76%
RHUL: 89%: 95%
Lancaster: 0%: 0%

CMS:
RALPP: 89%: 89%

LHCb:
Lancaster: 81%: 96%
RALPP: 87%: 87%

Issues encountered:

QMUL: the main issue was an extended downtime during which there was a move to a new lustre file system.

RALPP: very high SRM access rates from CMS caused timeouts for other access on the SRM (including by the SAM tests).

RHUL: weekend outage due to a bad firewall configuration affected 80% of the storage nodes, followed by a week’s outage of two storage nodes due to hardware failures.

Lancaster: odd ATLAS results arose in the middle of an ASAP recalculation (requested because of the floods at the start of December). The recalculated metrics show the site at above 90%.

2. Sites are being ticketed concerning their publishing. This has resulted in some annoyance as the info systems “do not work and cannot accurately represent reality”. In addition, ‘publishing errors’ cause the ROD dashboard to show a critical alert that then has to be acted on within 24 hours by creating a ticket against a site. If this is not done then our NGI gets ticketed for not running the ROD properly. We are following up these concerns.

3. There are a number of small issues that are being followed up by various people. These include adding new VOs, updates to the GridPP User Guide, http deployment, and batch system configuration (publishing but also limiting jobs).

4. Reports suggest the HEPSYSMAN meeting and GANGA workshop were very useful (Anything to follow up here? This may have been reported on previously).

For Information:

A) There is an ATLAS sites jamboree this week: https://indico.cern.ch/event/440821/

The Ganga issue was raised. DB confirmed this needs to be progressed through alternative funding lines.

SI-6 Tier-1 Manager’s Report (GS)
———————————
General:
Shaun DeWitt has announced that he will be leaving around end March. The PMB expressed their appreciation of Shaun’s valuable input over many years. He will be greatly missed and very difficult to replace due to his extensive expertise.

Castor:
– At the start of last week there was a triple disk failure on a disk server in AtlasScratchDisk (GDSS667). Efforts were made to recover files. This had some limited success (some hundreds of files) but most files on the server are lost. ATLAS have been informed.
There were a few tens of thousands of files on the server less than ten days old (which are more likely to be those ATLAS were interested in). We have been seeing a high rate of disk failures in some disk servers lately, of which this is an example, and we are reviewing this.

A meeting is scheduled for Tuesday with Clustervision to look at failure rate statistics and determine what hardware issues are being experienced. Normally we have a clear position on capacity and the options available to deal with issues as they arise. Echo-specific hardware is being considered, including, for example, whether RAID cards are reusable from systems in a temporary way to retarget new hardware to support problems.

Networking:
– We had a problem during Friday morning when one of the set of four links between the UKLight router and the Tier1 core stopped sending packets in one direction. We are actively reviewing the replacement of the UKLight router.

Batch:
– Nothing to report this week.

Procurement:
– The CPU orders have been placed. (I reported at last week’s meeting that the disk order had been placed.)

Actions:

585.1 GS to report on RAL job efficiencies before Christmas. Ongoing.
I do not yet have an answer to this. Please leave ongoing. We did have a batch problem over the second half of December. However, this affected the Condor system and we cannot see how this would have affected the job efficiency figures. We have checked that the January efficiencies are looking OK.

ACTION 587.3: GS to report back on the outcome of meetings on Clustervision 11. This will develop a plan on how to handle an increase caused by catastrophic failures since using old kit to replace existing kit is not effective.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
Next meeting planned for February.

REVIEW OF ACTIONS
=================
582.4: DC to insert an update in the wiki page regarding communication with LZ. Ongoing.

585.1: GS to report on RAL job efficiencies before Christmas. (GS confirms that January submissions are satisfactory.) Ongoing.

585.2: DB and AM will determine who best to send to SSI Collaboration meeting and report back on outcomes. Ongoing.

586.1: DC will discuss proposed IT reorganisation at CERN with Tim Bell. Done.

586.2: AS will contact MoBrain and discuss resources for their EGI project. Next step is to make allocations. Ongoing.

ACTIONS AS OF 25.01.16
======================

582.4: DC to insert an update in the wiki page regarding communication with LZ. Ongoing.

585.1: GS to report on RAL job efficiencies before Christmas. Ongoing.

585.2: DB and AM will determine who best to send to SSI Collaboration meeting and report back on outcomes. Ongoing.

586.1: DC will discuss proposed IT reorganisation at CERN with Tim Bell. Done.

586.2: AS will contact MoBrain and discuss resources for their EGI project. Next step is to make allocations. Ongoing.

587.1: PG will look at the EVAL information and seek a better solution for submitting the required information. He will forward to the PMB a detailed complaint from Birmingham to STFC highlighting some issues being raised at site level.

587.2: AM will invite selected small, medium and large sites to contribute presentations at GridPP36 on their plans for site evolution over the next few years and construct a session around this.

587.3: GS to report back on the outcome of meetings on Clustervision 11. This will develop a plan on how to handle an increase caused by catastrophic failures since using old kit to replace existing kit is not effective.

Many PMB members will be attending Lisbon next week for the WLCG Workshop, including DB, PG, PC and others. Therefore, the next PMB meeting will be 8 February 2016. DB will try to join JC for the PMB in his office – if this is not possible PG will chair.