GridPP PMB Meeting 574 (21.09.15)
=================================
Present: David Britton (Chair), Tony Doyle, Roger Jones, Pete Gronbech, Tony Cass, Gareth Smith, Andrew McNab, Andrew Sansum, Jeremy Coles, Dave Colling (Minutes – Louisa Campbell)

Apologies: Steve Lloyd, Claire Devereux, Dave Kelsey, Pete Clarke

1. Tier-2 Evolution WG
=======================
AMcN summarised the current position. AMcN has distributed the proposed list of members, which includes Chris Brew. A GridPP mailing list has been created and the list has also been posted to PMB. No feedback has been received, which is taken as an indication of general consensus. A JIRA project has also been created – this is a system CERN uses internally for ticketing and which most LHC experiments also use for their own projects. AMcN proposes to use it for GridPP: sites and machines count as components – they can be tracked and comments made on JIRA. CERN lists are used for access to the system, but people outside of the lists cannot access it for reasons of privacy. AMcN will construct lists of tasks and post on JIRA, e.g. ATLAS, and validate workloads in Tier-2. This way he can create tasks and keep a check on progress.
The taskforce will be mentioned at Jeremy’s meeting next week and the link can be included and updated.
DC and AMcN will now use Technical meetings on alternate Fridays as Tier-2 Evolution Task Force meetings and also as a means of recruiting more people and determining who wants to test things out. The timetable discussed at the face-to-face PMB in Liverpool to set this up was quite relaxed, but excellent progress has been made. AMcN has now uploaded some tasks onto the JIRA site and after some discussion it was clear that members have to log in to the system using an existing CERN ID in order to see the tasks. DC, JC and AMcN are administrators for JIRA.

ACTION 574.1
AMcN to undertake various tests on JIRA and discuss at a future PMB.

2. CMS T1 efficiency (DB)
===================
DB noted that efficiency was low again in August, and that a subsequent email from Andrew Lahiff provided conflicting information about efficiency. GS also received an email from Andrew with efficiency plots, which clarified that the numbers of cores allocated and cores used differ: one efficiency figure also counts unused cores, while other figures include efficiency only on cores actually used. DC clarified that two issues affected efficiency at RAL:
1) Mostly Digi-reco jobs that run across all CMS – these caused a drop below 50% efficiency at RAL due to disks being over-stressed. They have now moved to a system of pre-mixing jobs to alleviate this.
2) Multi-core pilot problems, i.e. attempts to achieve maximum use for ongoing work across CMS and experiments in general.
DB asked when we can expect this to be resolved. DC suggested this should no longer be an issue: since the move to the new methodology of pre-mixing, the previous workflow has not run again, so this should resolve the situation. There was some discussion of previous issues when the Dashboard did not show multi-flow jobs correctly. DB noted that if the Dashboard shows CMS 90% efficiency but we see 41% efficiency then CMS will not realise there is a discrepancy, and this could mean they don’t address this serious issue.
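
The difference between the two kinds of efficiency figure can be illustrated with a minimal sketch. The function name and all numbers below are hypothetical and are not taken from the RAL or Dashboard accounting; the sketch only assumes the usual definition of CPU efficiency as CPU time divided by wall-clock time multiplied by the number of cores counted.

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float, cores: int) -> float:
    """CPU efficiency = CPU time / (wall-clock time x number of cores counted)."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)


# Hypothetical multi-core pilot (illustrative values only):
wall = 3600.0        # one hour of wall-clock time
allocated = 8        # cores allocated to the pilot
used = 4             # cores that actually ran payload jobs
cpu = 12960.0        # total CPU seconds consumed by the payloads

# Counting every allocated core, idle ones included.
eff_allocated = cpu_efficiency(cpu, wall, allocated)
# Counting only the cores that were actually used.
eff_used = cpu_efficiency(cpu, wall, used)

print(f"efficiency over allocated cores: {eff_allocated:.0%}")
print(f"efficiency over used cores:      {eff_used:.0%}")
```

With these made-up inputs the same CPU time gives 45% when measured over allocated cores and 90% when measured over used cores, which is the kind of gap that arises when two reports count cores differently.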

ACTION 574.2
DC will liaise with Andrew Lahiff to highlight discrepancy and state this is being resolved (tracked through emails 14-16 September).

3. GridPP talks at e2e meeting
==============================
DB stated that Brian Davis will give one talk and Duncan Rand will give the other. The meeting is being hosted in London at Friends House, Euston Road, and since DB is in London at that time for another meeting he will probably attend.

3. AOCB
=======
a) The EGI User Forum meeting is being held in Amsterdam on 6th-8th April, which clashes with GridPP36 on 4th-6th April 2016 for several members. DB is on the Board and would like to attend the full EGI meeting; CD and others will also doubtless want to attend, and transport from Pitlochry to Amsterdam will be challenging. DB mentioned that previous discussions with Athol Palace Hotel had explored the possibility of 11th-13th April for GridPP36.

ACTION 574.3
DB will put round an email to all PMB members to enquire whether we should avoid clashes between GridPP36 and EGI meeting.

ACTION 574.4
LC will investigate whether reservation at Athol Palace Hotel can be amended to 11-13th April.

b) DC noted that the Tier-2 hardware money cutoff (1.10.15) was agreed, but he had been asked to provide figures for CMS accounting – this has now been done.

c) JET round-up – we have to round some sites up to £10K because we can’t issue a capital grant for less. DB noted that the round-up may also apply to Durham and Sussex. On LHCb there will be no explicit funding for T2D disk. Nothing special is needed by ATLAS. The Bristol figures were approaching £20K in line with the previous year. CMS needed to confirm the allocation to Bristol. It was discussed whether sites not undertaking actual Tier-2 work should receive these funds. Bristol was prototyping a disk-less Tier-2 and DB enquired about the plans going forward at Bristol.

d) Minutes from previous PMB meetings require to be typed up and distributed.

e) DC – RCUK Cloud had its first steering group meeting with 12 people and wider outside involvement. Phil Kershaw is the Chair and there are 4 Co-Chairs including DC. DC had suggested somebody from Scientific and Computing should be included and will enquire about this after the group’s Terms of Reference are agreed. The group reports direct to RCUK.

ACTION 574.5
DC will ask CMS spokesperson about funding Tier-2 sites and determine if value for money was obtained from Bristol performing work but not supporting a full range of Tier-2 workloads.

ACTION 574.6
DB will update minutes from previous PMB meetings and pass to LC for distribution to PMB members.

4. Standing Items
==================
SI-0 Report from Development (Cloud) Group
——————————————-
DC has nothing to report except that evolving Cloud activity and Tier-2 activities are merging.

ACTION 574.7
LC to amend heading for future PMB Minutes to “Bi-Weekly Report from Technical Group”.

SI-1 Dissemination Report
——————————————-
SL absent, no report. DB noted he had visited the website and there is adequate information thereon at the top level but more detail is required.

SI-2 ATLAS Weekly Review and Plans
——————————————-
RJ noted nothing to report as Alastair is currently en route to CERN.

SI-3 CMS Weekly Review and Plans
——————————————-
DC noted that other than the efficiency issues already covered there is nothing significant to report. He noted the workload is run on CMS at home on Monte Carlo generation, but investigations are ongoing to integrate this with other systems.

SI-4 LHCb Weekly Review and Plans
——————————————-
PC noted nothing to report.

SI-5 Production Manager’s report
——————————————-
JC Reports:
1. At the last WLCG GDB Ian Bird requested to receive nominations for a new GDB chair. He gave a talk on proposals for the WLCG Boards going forward: https://indico.cern.ch/event/319751/contribution/1/attachments/1151660/1653517/WLCGBoards.pdf. New here is the suggestion to create a WLCG Technical Forum to prepare for the long-term future of the needed infrastructure. Do we wish to nominate anyone? There will be a vote at the November GDB to elect the new chairperson. AS to discuss with Ian Bird the potential for nominating him as the new Chair of GDB. DB suggests it should be made clear to Ian that the planned Technical Strategy Group of LHCb may in the future cross over with this so a commitment to one of these groups may lead ultimately to a commitment to the other.

2. The next EGI Community Forum is in Bari, Italy, 10-13 November 2015.

3. There will be a new filter for the critical profile for ATLAS WLCG SAM tests so that only production endpoints will be tested and taken into account for site availability metrics. This will be available from the SAM3 interface.

4. We are attempting to take forward Machine Job Features implementations at a couple of sites, but each time we make a push we run into problems that stall progress. LHCb are now indicating that they would like to see this implemented at all sites by the end of the year. This seems ambitious given the rate of progress across WLCG.

5. We are continuing to contribute well to the WLCG Middleware Verification work including with CentOS7 testing. This contribution is well received and welcome within WLCG.

6. We have added a Tier-2 evolution panel to the GridPP Bulletin. The bulletin is updated on a weekly basis: https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest (Though getting consistent and good contributions for items is an ongoing challenge). Andrew McNab has suggested using JIRA for tracking T2 evolution issues – this works well in other projects and we should consider wider use within GridPP (e.g. in the ops team)!

ACTION 574.8
AS to discuss with Ian Bird the potential for nominating him as the new Chair of GDB.

SI-6 Tier-1 Manager’s Report
——————————————-
GS reports:
Castor:
– We are in the process of upgrading the Castor Oracle databases to version 11.2.0.4 (from 11.2.0.3 – which is no longer supported).
This is a multi-step operation which has to allow for our use of Oracle Dataguard to mirror to standby databases. The first major step of this was successfully carried out last Tuesday when the “Neptune” database was upgraded. This hosts the stager databases for the Atlas and GEN instances which were down for some ten hours for this work. The next major step is on 6th October, when all of Castor will be down for a similar length of time. Finally there will be a shorter, but still significant outage on the 13th October.
– We have successfully upgraded the Castor disk servers for disk-only service classes to SL6. We are tackling the disk caches for the tape-backed service classes over the next two or three weeks. However, this is an easier operation as we will just drop a few of the servers out at a time and it should be transparent to the VOs.

Networking:
– There is ongoing work to remove the switch stacks connected to our old ‘core’ switch (from when we had a star topology) and complete the move to the mesh topology. Apart from this needing to be done more generally we have to free up the rack space where the old core switch is located. On Thursday and Friday last week we encountered some significant packet loss to worker nodes on a couple of network stacks (around 10% to one of them). Some work was successfully done at the end of the week to investigate and fix this and packet loss is now looking much better. There are connections to be moved off the old core switch over the coming weeks.
– On Tuesday 30th September the link from our main router pair into the RAL core will be upgraded from a resilient pair of 20Gbit connections to a resilient pair of 40Gbit connections.

FTS:
– We have had some problems with our production FTS3 server in this last week or so. The system became overloaded with over a million queued transfers each for Atlas and DiRAC. ATLAS have moved transfers across to our “test” FTS3 service – and the situation has eased but not gone away yet. An update was applied to the Production FTS3 service last week (at VO’s request) – but it looks like a memory leak has been introduced. We (Andrew Lahiff) are working closely with the FTS3 developers on this.

Purchasing:
I don’t have any details but Martin Bly was working on this at the end of last week. (He is on leave this week).

SI-7 LCG Management Board Report of Issues
——————————————-
DB reports that the MB took place. There were discussions on information use cases about BB1 and its use. DB has posted this to the CHAT window as well as the report on WLCB and the Benchmarking Presentation.
Memory items for the future were also discussed but, regrettably, DB had to leave before the discussion on this item concluded.

ACTION 574.9
DB to obtain information from PC about conclusion of MB discussion on Memory Items for the Future and share with PMB members.

REVIEW OF ACTIONS
=================
571.1 Maintenance – PG to do and DB to inform Tier-2 what we are pledging on their behalf – PG needs to answer so that DB can do this. Done.
571.5 On RAL, need to know who is speaking on CEPH. DK took note. Done
571.6 PMB members get their quarterly reports in by the next PMB meeting in 2 weeks. Ongoing
573.1 Steve Lloyd to chair new metrics group to assess how much things should be metric-driven in new GridPP5 world. Done
573.2 Andrew McNab to ask LHCb datamanagement for a projected profile of requirement within 2015 and 2016 for RAL Tier-1 planning. Ongoing
573.3 Andrew McNab and David Colling to form Tier-2 evolution working group: members in 2wks, terms of ref in 4wks, outline plan in 8wks. Ongoing
573.4 Create/pick an MoU and other required documents: Peter Clarke to form a group with David Kelsey, Jeremy Coles and others.

ACTIONS AS OF 21.09.15
======================
571.6 PMB members get their quarterly reports in by the next PMB meeting in 2 weeks. Ongoing – AS should have all reports back and complete by next PMB.
574.1 AMcN to undertake various tests on JIRA and discuss at a future PMB soon.
574.2 On CMS T1 efficiency discrepancies – DC will liaise with Andy Lahiff to highlight discrepancy and state this is being resolved (tracked through emails 14-16 September).
574.3 Re clash between GridPP36 and EGI User Forum meeting – DB will put round an email to all PMB members to enquire whether we should avoid a clash.
574.4 Re clash between GridPP36 and EGI User Forum meeting – LC will investigate whether reservation at Athol Palace Hotel can be amended to 11-13th April.
574.5 Funding round-up to £10K – DC will ask CMS spokesperson about funding non Tier-2 sites and determine if value for money was obtained from Bristol performing work but not supporting Tier-2.
574.6 DC will update PMB minutes from previous meetings and pass to LC for distribution.
574.7 Report from Development (Cloud) Group – LC to amend heading to “Bi-Weekly Report from Technical Group” for future PMB Minutes.
574.8 AS to discuss with Ian Bird the potential for nominating him as the new Chair of GDB. Done – Ian has responded.
574.9 DB to obtain information from PC about conclusion of MB discussion on Memory Items for the Future and share with PMB members.