GridPP PMB Meeting 595

GridPP PMB Meeting 595 (25.04.16)
=================================
Present: Dave Britton(Chair), Tony Cass, Pete Clarke, Jeremy Coles, Tony Doyle, Pete Gronbech, Roger Jones, Steve Lloyd, Andrew McNab, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Gareth Smith, David Colling, Dave Kelsey.

1. OSC Papers to Prepare
=========================
Deadline imminent – DB received an email from Lisa advising that the deadline was actually last week (changed to align with other OSCs). However, she has agreed next Friday as a final deadline, but would welcome an earlier submission if possible. The additional form appears not to be too onerous and summaries can be extracted from the Project Management Map. Attempts are being made to standardise how projects are dealt with and reported on, e.g. the Risk Register. PG will continue working on the financial documents today and then look at the other document. DB will make a start on this later today and tomorrow when travelling.

PG has circulated the most recent iteration with completed sections – the full document still requires detailed review (currently 30 pages).

Introduction – DB completed. EGI information has been slightly changed from previous version.

AM and JC have updates for their sections – PG will integrate these and supply a final draft. PC and DB will read it over and polish on Wednesday morning.

The financial document is half complete – it contains close-out figures for GridPP4 and GridPP4+ and the travel plot from DK. It requires the GridPP4 grant letter for comparison – PG has the GridPP4+ proposed costs and can compare final figures to those, which come in around £73K under and in line with the previous report. Staff figures need to be worked up, along with a summary of the GridPP5 spending plan. PG will try to put together the finished version by tomorrow – PC and DB will look at this alongside the other documents for consistency/accuracy.

PC enquired how accurate the projected costs need to be for the GridPP5 grant tables. We normally report actual spend; the form seeks clarification of spend for each institution. This is straightforward for closed grants but more complex for running projects. It was suggested that PG insert footnotes stating whether figures are projected or fully spent, subject to close-out reports being confirmed.

ACTION 595.1: DB will email AM to supply information for integration into the reports.

ACTION 595.2: PC and DB will read through a final draft of the full OSC document and offer suggestions to PG.

2. GridPP37 Dates
=================
A Doodle poll was circulated with two potential dates in August at Ambleside. The response has been largely positive, with all but 3 respondents able to attend. Only PC cannot make the first dates suggested, so the second date is preferred (Tue 30th Aug – Thu 1st Sept).
The PMB will meet on the Tuesday afternoon, with members travelling on the Tuesday (some may prefer to travel on the Monday).

ACTION 595.3: RJ will confirm the Ambleside reservation and liaise with LC on the details.

3. SKA progress at Manchester
=============================
AM has been speaking to Anna Scaife, who is leading the SKA 2020 proposal and has agreed to an evaluation of DIRAC in the SKA context of a distributed analysis centre at multiple sites (if the funding application is successful). Although OpenStack was discussed, DIRAC appears to be the preferred option. Anna has secured a web address, so it should be set up this week and jobs submitted – Danielle has confirmed this can be done relatively quickly. This is a potentially very positive outcome: if we can influence the direction, that would be very useful for running jobs, cataloguing, etc., which Anna is aware is required. It would also ensure compatibility for storage and cataloguing using the existing services. AM will progress this and the PMB members agree this is a priority to push forward as far as possible.

A GridPP and SKA meeting has not yet been arranged, though the week commencing 31 October has been provisionally agreed. PC will follow up on his email of 18 April to Anna and firm up a date – AM and JC should probably attend.

ACTION 595.4: AM will progress discussions with Anna Scaife on using DIRAC in the SKA distributed analysis centre at Manchester.

ACTION 595.5: PC will prompt Anna Scaife to formalise a date in October to meet and discuss SKA requirements.

4. AOCB
=======
a) PDG (Project Directors Group) meeting:
PC and DB will travel to the PDG (Project Directors Group) meeting on Data Lifecycle and NEI organisation tomorrow (26.04.16) to ensure we have a presence, and there is also RCUK presence there, including Susan Morrell. The meeting is in workshop format, and a spreadsheet has been pre-circulated to be filled in with the algorithm requirements of work undertaken (mostly based on DIRAC) – data modelling, MCMC Bayesian methods, matrix inversions, FFX treecodes, etc. Others have project-by-project fields, etc. Much of this is not relevant to us; however, it is important to be involved to ensure we have input into the decision-making and planning process of the exercise. It is helpful to make clear that our input cannot easily be reduced to such forms.

b) GridPP36 comments:
It was agreed that generally the meeting went well. There could have been a bit more discussion, but compared to previous years this was much better and people were more prepared to engage in discussion than before. There were some issues with the room, including a lack of wifi and mobile phone network access, but this was not greatly detrimental. It was less cold and noisy than last time, but there could be improvements for next time. DB invited comments for the future.

ACTION 595.6: PC will fill in and submit spreadsheets in advance of the PDG (Project Directors Group) meeting on Data Lifecycle and NEI organisation.

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
DC not present, no report submitted.

SI-1 Dissemination Report (SL)
——————————
## GridPP Dissemination Officer Notes for PMB

### Ganga testing on Tier 2/3 clusters

Following GridPP36, TW has started testing using Ganga to submit jobs to local Tier 2/3 clusters. As well as configuring Ganga to submit jobs to one’s local machine and the GridPP DIRAC service, it appears that it can be configured to submit jobs to local clusters too (as one might expect). If new users can be given an account on a site’s local cluster, this would appear to offer the user a route to using GridPP resources that:

* Does not require a grid certificate (removing a significant initial barrier);

* Allows the local site to build and maintain a close relationship with the new user as they get up and running;

* Assuming CVMFS is enabled on the cluster, requires no installation of software – just configuration for Ganga to use the cluster;

* Offers the new user a chance to try out a limited implementation of their workflow on distributed computing services.

Once the user has successfully implemented their workflow on the local cluster, but realises they need more resources, this should provide the motivation to get a grid certificate, join an incubator VO, re-configure Ganga to submit to the GridPP DIRAC service, and start thinking about distributed storage.
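For reference, the pattern TW is testing looks roughly like the sketch below. This is a minimal illustration only: the backend class names used (Local, Condor, Dirac) are assumptions, since which backends are available depends on the Ganga plugins enabled at a given site.

```python
# Minimal sketch of the workflow described above, intended to be run
# inside an interactive Ganga session (where Job, Executable and the
# backend classes are pre-defined). Backend names are assumptions.

# 1. A trivial test job on the user's own machine.
j = Job(name='hello-local')
j.application = Executable(exe='/bin/echo', args=['Hello from Ganga'])
j.backend = Local()
j.submit()

# 2. The same job pointed at the site's local batch cluster instead
#    (the exact backend, e.g. Condor/LSF/PBS, depends on the cluster).
j2 = Job(name='hello-cluster')
j2.application = Executable(exe='/bin/echo', args=['Hello from the cluster'])
j2.backend = Condor()
j2.submit()

# 3. Later, with a grid certificate and VO membership in place, only the
#    backend changes to reach the GridPP DIRAC service.
j3 = Job(name='hello-dirac')
j3.application = Executable(exe='/bin/echo', args=['Hello from DIRAC'])
j3.backend = Dirac()
j3.submit()
```

The attraction is that the job definition stays the same throughout; only the backend changes as the user graduates from local testing to the site cluster and then to DIRAC.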

So far TW has accounts on the QMUL local cluster, the RAL PPD Tier-3 cluster, and is awaiting confirmation for an account on the RAL Tier-1 cluster.

### PRaVDA Case Study draft

TW has passed on a draft of the successful PRaVDA Case Study to the PRaVDA group for comments; to be published on the website once these have been received.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Data appeared at the weekend. Regarding the workshop, some things would have been preferred from the start, e.g. from the pilot, but this is not an operational matter.

Castor broke for ATLAS last week (g-fail2/CAT); a patch is being deployed to fix it.

There were two items relating to Tier-2s: Queen Mary has several randomly failing jobs which need to be resolved; and Glasgow has struggled with Monte Carlo and digitisation, which may indicate capacity is being reached. DB noted Glasgow jobs are using an enormous amount of memory (30 GB), which is disproportionate compared to other sites, and running multiple such jobs on machines is stretching efficiency. These should be on a high-memory queue, which is not available locally, and this needs to be resolved – pilots are not detecting it. RJ will supply the name of a member of staff for Gareth Roy to discuss this with.

ACTION 595.7: RJ will supply the name of a member of staff for Gareth Roy to discuss recent issues relating to Tier-2s.

SI-3 CMS Weekly Review and Plans (DC)
————————————-
DC not present – no report submitted.

SI-4 LHCb Weekly Review and Plans (PC)
————————————–
Nothing of significance to report.

SI-5 Production Manager's Report (JC)
————————————-
March availability/reliability figures for the Tier-2s were reviewed. There were some issues at Lancaster and Queen Mary which are being resolved. Queen Mary runs as a stack and there are issues with submission from CERN, but these are not site-specific issues.

SL5 situation – several sites still run this while others have decommissioned it; the remaining instances should be decommissioned by the end of the month. Progress has been made at ECGF on this, which now successfully has ATLAS jobs queuing.

SI-6 Tier-1 Manager’s Report (GS)
———————————
I have been asked for some information about the RAL network incidents reported at the WLCG Management Board. I detail these incidents at the start of the report.

Network Incidents:
================
The WLCG Service Report to the MB on 19th April stated for RAL:
“Internal network problems on 4 different occasions (24-25-30 March,16 April)”

In fact there were only three incidents – the first two dates above cover the same incident.

Incident on 24/25 March (Thursday evening and then Good Friday):
————————————————————————-
There was a network problem that started on Thursday evening, 24th March shortly after 19:00 and was fixed during the following day (at around 14:00). This was traced to one of the four 10Gbit links from the Tier1 core network to the UKLight router partially failing. This affected data transfers to/from the Tier1. It took us some time to get to the bottom of this. Dropping one of the four links cleared the problem.

Incident on Tuesday 29th March:
—————————————-
One of the RAL core site router stacks had a problem that led to the link between it and the primary of our Tier1 router pair flapping. There should have been an automatic failover to our secondary router – but owing to a configuration error that did not happen. Staff attended on site to force the failover, which resolved the problem. The incident lasted two to three hours. The following day the configuration problem was understood. (The 'priority' setting on the primary router was set to its highest value, 255, with a lower value on the secondary one. This priority is one of the settings that controls the failover between the routers. However, this specific value of 255 means 'don't do anything'. Once understood, this was corrected.)
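As an illustration of why the misconfiguration suppressed the automatic failover, the following toy model (not vendor firmware, and based only on the description above) captures the logic: the special priority value effectively switches the failover comparison off.

```python
# Toy model of the failover behaviour as described in these minutes.
SPECIAL_NO_ACTION = 255  # per the minutes, this value means "don't do anything"

def active_router(primary_priority: int, secondary_priority: int,
                  primary_link_up: bool) -> str:
    """Return which router should carry traffic."""
    if primary_priority == SPECIAL_NO_ACTION:
        # Misconfiguration: failover logic is effectively disabled, so
        # traffic stays on the primary even when its link is flapping.
        return "primary"
    if not primary_link_up:
        return "secondary"
    return "primary" if primary_priority >= secondary_priority else "secondary"

# With the misconfigured value, the flapping link does not trigger failover:
assert active_router(255, 100, primary_link_up=False) == "primary"
# With a sane priority, the secondary takes over as intended:
assert active_router(200, 100, primary_link_up=False) == "secondary"
```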

Incident on 15/16 April:
—————————–
There was a RAL site networking problem. This started shortly before midnight on Friday 15th April and was fully fixed the following afternoon. During this time there were four significant breaks in connectivity. One of the RAL core network switch stacks became inconsistent, in that the stack member switches did not agree about which ones were visible to the other stack members. This problem also appeared on the second core network stack. At present the problem is not fully understood by the networking team. However, logs show a correlation with a board failing in another router that has connections to both these stacks. That router problem has been fixed. I understand that this stack inconsistency problem was also seen during the incident on 29th March.
On this occasion the Tier1 router pair did fail over (a total of three times during the incident as links went up and down), but with the problem affecting both core switch stacks to which we connect, this did not enable us to maintain connectivity. At the end of the incident we were left running through what we consider the secondary of our pair of Extreme x670 routers. The following Tuesday we flipped back to the primary, as that is how we normally run.

A comment on the replacement of the UKLight Router.
——————————————————————
The plan is to replace the existing, old, UKLight router with a stacked pair of Dell Force10 4810 switches. These have been identified and the stack has been set up. We are now looking at how best to configure and test the stack ahead of putting it into production.

DB enquired whether these issues are related or merely a series of unconnected events. It was suggested that Tier-1 networking could possibly be resourced more efficiently. The network has grown over the years from a simple Layer 2 network with a bit of routing at the top and is becoming increasingly complicated due to multiple networks (e.g. the management network, CEPH needing a parallel network, and the hypervisor system requiring virtual networking). A lot of effort has been put into Cloud and CEPH activities – increasing links to, and sharing of, research infrastructure will be helpful, particularly the expertise around Jasmine. Internally we perhaps need to consider rebalancing resources. AS is in discussions with Phil regarding external networking, as the perception is that there is now greater reliability since the old servers were replaced. CICT have had an internal reorganisation over the past 3 months which may have had an impact. The concern is more about development strands than reliability, but this is multi-faceted. It was agreed that a holistic review in the autumn would be very useful to look at the data/stats and suggest a strategy going forward, e.g. networking, staffing, training.

The rest of the report:

Castor:
– Castor 'preprod' is our main test instance of Castor. A new Castor 'preprod' database has been built and set up using an Oracle RAC, to better reflect the configuration we have in production. This is expected to be ready to resume the testing of Castor 2.1.15 on Monday (25th April). We had planned to bring forward the update of the Castor SRMs to a new version in the meantime; however, this will not be done, as CERN have advised against the SRM update until we have moved to Castor version 2.1.15.
– The draining of Castor Atlas disk servers remains very slow. We believe this is related to the very large number of (small) files on the servers. Another approach is being tried to work around this. (We build a list of files to be moved and then request the file moves in blocks – see the sketch after this list.)
– The removal of the "GEN Scratch" storage area in Castor has been announced via an EGI broadcast.
– Atlas have been switched to writing new data to the T10KD drives. This is in advance of migrating their existing data to the T10KD drives/tapes.
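The block-wise draining workaround mentioned above can be sketched as follows. This is illustrative only: request_move() is a hypothetical stand-in for whatever Castor draining tool actually performs the moves.

```python
# Build the full list of files to move off a disk server, then issue the
# move requests in fixed-size blocks rather than all at once.
from typing import Iterable, List

def chunks(items: List[str], block_size: int) -> Iterable[List[str]]:
    """Yield successive blocks of at most block_size items."""
    for start in range(0, len(items), block_size):
        yield items[start:start + block_size]

def request_move(block: List[str]) -> None:
    # Placeholder: in practice this would invoke the Castor tooling.
    print(f"Requesting move of {len(block)} files")

def drain_in_blocks(files_to_move: List[str], block_size: int = 1000) -> None:
    """Request the moves block by block instead of in one large operation."""
    for block in chunks(files_to_move, block_size):
        request_move(block)

# Example with a dummy list of 2,500 file names -> three move requests.
drain_in_blocks([f"/castor/atlas/file{i}" for i in range(2500)])
```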

Networking:
There was one network-related incident in this period – on 15/16 April. This is detailed elsewhere in this report.

Grid Services:
The load balancer (a pair of systems running "HAProxy") introduced in front of the "test" FTS3 instance used by Atlas has been extended to handle around 60% of the requests to this service. This is part of a gradual ramp-up towards handling all requests to this service via the load balancers. If this is successful, the approach of using load balancers will be extended to other services.
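A minimal sketch of the ramp-up idea follows. It is illustrative only, not the actual HAProxy configuration, and the host names are hypothetical: a configurable fraction of requests goes via the load-balanced path, and that fraction is raised in steps towards 100%.

```python
# Illustrative model of splitting traffic between the load-balanced path
# and the direct path during a gradual ramp-up.
import random

LOAD_BALANCED_FRACTION = 0.6  # current stage of the ramp-up (~60%)

def choose_endpoint() -> str:
    """Pick where the next request goes under the current split."""
    if random.random() < LOAD_BALANCED_FRACTION:
        return "fts3-via-haproxy.example.org"  # hypothetical host name
    return "fts3-direct.example.org"           # hypothetical host name

counts = {"fts3-via-haproxy.example.org": 0, "fts3-direct.example.org": 0}
for _ in range(10000):
    counts[choose_endpoint()] += 1
print(counts)  # roughly a 60/40 split
```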

Batch:
– Nothing particular to report.

Tier1 Availabilities for March 2016:
Alice: 100%
Atlas: 98%
CMS: 97%
LHCb: 99%
OPS: 99%
(I have included the OPS availability figures although these are not in the WLCG reports.)

In addition to the network incidents in March, and referred to above, the other main cause of failure of the availability tests was the downtime for Castor on 17th March for security updates to be applied and systems restarted.

ACTION 595.8: AS will make a high level plan on undertaking a review of the Tier-1 in the autumn.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
DB will report on this next week.

SI-8 External Contexts (PG)
———————————
Nothing of significance to report.

REVIEW OF ACTIONS
=================
591.4: PG to collate information for inclusion in OSC Financial Report. Ongoing.
591.5: ALL to contribute to the OSC Project Status Report. Ongoing.
591.7: PG to contribute Summary of GridPP Status for OSC Report. Ongoing.
591.9: GS and AS to contribute Tier-1 Status Report for OSC Report. Ongoing.
591.10: JC to contribute Deployment Status for OSC Report. Ongoing.
591.11: RJ to contribute ATLAS User Report for OSC Report. Ongoing.
591.12: DC to contribute LHCb User Report for OSC Report. Ongoing.
591.14: AS to consider how to model a proposal for short-term temporary sign-ins for new users to access the Grid. AS has started discussing this with Ian Collier. Chris suggested this could be done at the Tier-2; he can offer a central UI to support it, and he and Tom will submit something in writing to the Tuesday Ops meeting. Done – see new action below (595.9).
594.1: PG to circulate draft OSC reports to PMB for comment. Ongoing.
594.2: PG to initiate the production of a GridPP5 Project Map. Ongoing.
594.3: PG will discuss and agree placement of risks on the risk register. Ongoing.
594.4: LC to create a new Standing Item for future PMB agendas ‘External contexts’. Done.
594.5: ALL agree a decision on ALICE storage and communicate to Catalin before Monday 18 April 2016. Done – see new action below (595.10).

ACTIONS AS OF 25.04.16
======================
591.4: PG to collate information for inclusion in OSC Financial Report. Ongoing.
591.5: ALL to contribute to the OSC Project Status Report. Ongoing.
591.7: PG to contribute Summary of GridPP Status for OSC Report. Ongoing.
591.9: GS and AS to contribute Tier-1 Status Report for OSC Report. Ongoing.
591.10: JC to contribute Deployment Status for OSC Report. Ongoing.
591.11: RJ to contribute ATLAS User Report for OSC Report. Ongoing.
591.12: DC to contribute LHCb User Report for OSC Report. Ongoing.
594.1: PG to circulate draft OSC reports to PMB for comment. Ongoing.
594.2: PG to initiate the production of a GridPP5 Project Map. Ongoing.
594.3: PG will discuss and agree placement of risks on the risk register. Ongoing.
595.1: DB will email AM to supply information for integration into the reports.

595.2: PC and DB will read through a final draft of the full OSC document and offer suggestions to PG.

595.3: RJ will confirm the Ambleside reservation and liaise with LC on the details.

595.4: AM will progress discussions with Anna Scaife on using DIRAC in the SKA distributed analysis centre at Manchester.

595.5: PC will prompt Anna Scaife to formalise a date in October to meet and discuss SKA requirements.

595.6: PC will fill in and submit spreadsheets in advance of the PDG (Project Directors Group) meeting on Data Lifecycle and NEI organisation.

595.7: RJ will supply the name of a member of staff for Gareth Roy to discuss recent issues relating to Tier-2s.

595.8: AS will make a high level plan on undertaking a review of the Tier-1 in the autumn.

595.9: JC to discuss writing an instructions document covering the workflows required by new users of the Grid.

595.10: AS to draft a response to ALICE to formalise the actions agreed with Catalin.

Next Meeting: Next Monday is a Bank Holiday, so the next PMB will take place on Monday 9th May.