GridPP PMB Meeting 579

GridPP PMB Meeting 579 (16.11.15)
=================================
Present: Dave Britton (Chair), Pete Gronbech, Tony Cass, Andrew Sansum, Jeremy Coles, Steve Lloyd, Claire Devereux, Gareth Smith, Peter Clarke, Dave Kelsey, Dave Colling, (Minutes – Louisa Campbell)

Apologies: Tony Doyle, Andrew McNab, Roger Jones, Pete Clarke,

1. Planning for h/w spend. Pledge v future direction
=====================================================
DB questioned whether it is necessary to provide Tier-2 sites with guidance and advice on how to spend forthcoming GridPP grants. PG has had similar enquiries and noted that agreement was reached at the GridPP35 meeting in September to issue guidance. It was suggested we remind the four regional Tier-2s they must deliver to the pledges made for next April – this should be achievable but should be checked. Beyond meeting these pledges there are potentially 3 categories to consider:

a) Small sites should concentrate on CPU (e.g. Cambridge);
b) Large sites should continue to provide the mixture of CPU and disk appropriate to the experiments they support; but for
c) Medium sites (e.g. Oxford/Holloway) – there is a less clear direction: These sites should reflect on the manpower available going forward and any institutional support for hardware and manpower that they can anticipate.

It was suggested that small sites need to concentrate on CPU but it would be helpful to have a clear idea of how quickly the experiments would like to rebalance the computing models for medium sites over the next 4 years.

DB asked for advice on the timing of the next round of h/w funds and it was discussed whether these may fall in the 1st and 3rd or the 2nd and 4th years of GridPP5. STFC have not yet issued guidelines. The requirements may change with LHC schedules, but we have some time to resolve this.

ACTION 579.1 DB to write a paragraph with guidelines on how small sites are expected to use GridPP4+ h/w spends and circulate to PMB in the next few days for comment.

2. Update on ALICE requirements (PG)
====================================
As discussed at the previous PMB meeting, ALICE tape use has increased significantly. ALICE had not provided clear reasons for this increase to the current use of 800TB. There is a concern about increasing this any further at present because it is already greater than the planned ALICE allocation at the end of GridPP5.

DB is awaiting information from AS in regard to tape planning and his concerns include:

a) We need information on what is and what is not possible on tape storage; and
b) In GridPP5 the proposals total 0.68PB of tape for ALICE at the end of GridPP5 and he is hesitant about increasing this now as it can lead to difficulties in the longer term. ALICE acknowledges they are using more than the UK’s fair-share contribution. He would like to retain this level and keep the situation under review with the potential to increase.

It was agreed that we refrain from agreeing to more given that the current level is already at double the amount allocated but we would not ask ALICE to reduce their usage at this point.

ACTION 579.2 DB to contact AS for comment on tape planning before agreeing that ALICE can use the existing 850TB, but we cannot increase this as 680TB was agreed until the end of GridPP5.

ACTION 579.3 PG to keep Catalyn informed that ALICE won’t be asked to delete data and will be advised we are considering scope to increase data but this cannot be agreed at the moment.

3. AOCB
=======
a) Next HEPSYSMAN meeting
———————-
Manchester will be hosting the next HEPSYSMAN meeting on 13/1/16. Some consideration should be given to content and speakers.

b) HPC-SIG meeting. Should GridPP join?
————————————
This is taking place in Sheffield in February. Some PMB members may wish to attend. It was noted that we are formally members, though this is not acknowledged on the website. NGS is listed as an Associate Member and many GridPP institutes will have paid the £100 membership fees to be included in the HPC-SIG mailing list. It was agreed to maintain the status quo as most sites are members through their institutions.

c) Visit Notices
————-
This relates to Action 576.2 – DB has drafted an update to the policy.
It was agreed that the text should be slightly amended to make clear that Visit Notices are not generally required for UK travel but it is expected that the most cost-effective travel and accommodation arrangements will be made and this is under constant review. DB will adjust the text and insert a maximum ‘per night’ cost for hotels. Each institution has established internal procedures, e.g. every trip should be discussed with line managers. Some sites, e.g. Liverpool, require Visit Notices for all travel, which is also fine. It was recognised that STFC rely on institutions to follow due diligence in monitoring/controlling these processes and robust systems are in place for this. DB has tested the new website and confirms the system works well. Agreement was reached to progress with this new system on the new website.

ACTION 579.4 DB will adjust the wording on Visit Notices as discussed.

4. Standing Items
===================
SI-0 Bi-Weekly Report from Technical Group (DC)
——————————————-
DK noted that things were progressing as expected with little to report. He will provide a more formal summary before Christmas.

SI-1 Dissemination Report (SL)
——————————————-
###New website – nearly there!

Thanks again to AM for implementing the Indico meetings and PMB minutes functionality on the new website:

* https://indico.cern.ch/category/overview?selCateg=4452&period=month

* https://vm36.tier2.hep.manchester.ac.uk/collaboration/pmb-minutes/

We have also added information for Collaboration Members about travel (including the link to SLL’s visit notice form) and GridPP jobs.

###GridPP UserGuide

Thanks to Steve J (SJ) and Steve J jnr. for the feedback on the GridPP UserGuide. All comments and corrections have been added to the UserGuide GitHub Issues page:

* https://github.com/gridpp/user-guides/issues

and fixed, linking to the relevant Pull Requests for tracking purposes. The updated UserGuide can, as ever, be found here:

* https://vm36.tier2.hep.manchester.ac.uk/userguide/

NB: we have added a note about the “untrusted connection” warnings you get when accessing the VOMS server pages. Are these ever likely to go away? Who would we ask about this?

###Meeting with Colin Hayhurst, SEPnet Innovations Partner, 10th Nov. 2015

On Tuesday 10th November 2015 TW met with Colin Hayhurst, SEPnet Innovations Partner (based at the Uni. of Sussex). It was a productive meeting – CH had already provided great feedback on the new website. The main points relevant to GridPP:

* Now that the UserGuide and new website are nearly ready, GridPP have a “product” to “sell” to SMEs, something lacking before. This is good.

* It might be useful to have a product tasting session, i.e. a GridPP hack day, for SMEs to come along to now that the product is ready. (This has been suggested before, but we haven’t been in a position to do this. Now we are.)

* CH’s role is about REF submissions for SEPnet universities, tracing impact case studies back to papers published by the SEPnet university researchers. CH asked how this worked for GridPP impact case studies with respect to STFC’s funding requirements (i.e. for GridPP5) – does impact need to be traced back to particular papers, or simply any case study that has used GridPP resources? What is the mechanism by which “impact” counts in terms of satisfying STFC’s or RCUK’s funding requirements? This might be useful to establish and document as part of the New User Engagement Programme.

ACTION 579.5 CD will forward Karen Padmore’s contact details to SL for more information on potential SMEs.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
RJ was absent – no report presented.

SI-3 CMS Weekly Review and Plans (DC)
——————————————-
There was nothing significant to report at this time.

SI-4 LHCb Weekly Review and Plans (PC)
——————————————-
AM was absent – no report presented.

SI-5 Production Manager’s report (JC)
——————————————-
1) On CPU utilization, John Gordon reported that a database merge was underway. Once completed, the APEL team will send an integrated dataset to the portal and they will put the current dev view in the production portal alongside the current one. They are currently developing a major rewrite of the portal.

2) The Tier-2 availability/reliability figures for October show the following:

ALICE: http://wlcg-sam.cern.ch/reports/2015/201510/wlcg/WLCG_All_Sites_ALICE_Oct2015.pdf
All okay.

ATLAS: http://wlcg-sam.cern.ch/reports/2015/201510/wlcg/WLCG_All_Sites_ATLAS_Oct2015.pdf

QMUL: 85%:85%
Lancaster: N/a:N/a?
Liverpool: 85%:100%
Sheffield: 78%:78%

CMS: http://wlcg-sam.cern.ch/reports/2015/201510/wlcg/WLCG_All_Sites_CMS_Oct2015.pdf
All okay.

LHCb: http://wlcg-sam.cern.ch/reports/2015/201510/wlcg/WLCG_All_Sites_LHCB_Oct2015.pdf

QMUL: 79%:79%
Liverpool: 86%:100%
Sheffield: 86%:86%
RAL PPD: 79%:79%

Issues at the sites below 90% in October:

Liverpool: Availability was impacted by several scheduled weekend power outages that required repeated draining of the site.

QMUL: An issue with their second gridftp server, se04, meant an internal DHCP IP address was lost and the CEs needed to be rebooted after a weekend.

Lancaster: was in the middle of an ASAP recalculation. For the recalculation see https://its.cern.ch/jira/browse/ADCMONITOR-407.

RAL PPD: LHCb A/R figures were due to 2 problems: 1. Slapd segfaulting after an openldap update on the ARC CEs prevented LHCb job submission: https://ggus.eu/?mode=ticket_info&ticket_id=117063. 2. A dCache SRM node overloaded by CMS jobs caused various problems, including timeouts for LHCb SRM LS tests. More details at https://ggus.eu/?mode=ticket_info&ticket_id=116872

Sheffield: Had a couple of worker nodes with misconfigured cvmfs. These worker nodes did not impact the ATLAS ASAP metric.

3) A continuing site focus has been T2 Prod disk decommissioning for ATLAS; for non-HEP VOs the discussions have been about the use of pilot jobs.

4) As reported via email (since our last PMB), Ian Collier has been appointed as the next WLCG GDB chair.

SI-6 Tier-1 Manager’s Report (GS)
——————————————-
General:
– The fix for the recently announced vulnerabilities in the RedHat crypt libraries is being rolled out.

Castor:
– We have found a problem on some disk servers of one particular batch that have been updated to SL6. The servers can run slowly and individual commands hang (until a timeout) while making name lookups. So far a total of three servers have been affected spread over a couple of weeks. We can easily fix the problem but do not yet know why it occurs.
– Two of the tape servers are now running Castor version 2.1.15. These can be updated independently of the rest of Castor as they have no interaction with the Castor database. Work is getting underway to test Castor 2.1.15.

Networking:
– We continue to keep a close watch on some low-level packet loss within our network. We also continue with the changes needed to remove the old ‘core’ switch from the network. A site ‘warning’ has been announced for an hour on Wednesday of this week for one of the steps in this.

Batch:
– We have seen a few problems with ATLAS HammerCloud tests failing (loss of heartbeat). This is not yet understood.

Regarding Action 576.4: (glexec test failures for CMS after the final batch of worker nodes was updated – leading to loss of availability).
There are two parts to this:
1) Investigations show that it is the CMS glexec tests that show a specific affinity for the higher numbered worker nodes. Other tests do not show such a preference. This explains why we were not seeing these specific tests fail before the final batch of worker nodes was drained. We do not, as yet, understand why these tests show this preference.
2) We have a new Nagios test in place to pick up SAM test failures for each VO. This is based on one developed at PIC. It needs some more testing which awaits a time when our services are not available.

Procurement:
– The CPU tender documents are visible (i.e. the tender is live).
– We await confirmation that this is the case for the disk tender documents.

ACTION 579.6 GS to provide high level milestone dates for procurement, plans timing.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
No report.

REVIEW OF ACTIONS
=================
574.2 On CMS T1 efficiency discrepancies – DC reports that CMS are running multicore pilots for single-core jobs, whereas ATLAS are doing this correctly, with higher efficiency. Done.

576.2 DB to talk with DK about the policy for requiring visit notices. Done.

576.4 GS to respond to the PMB with the explanation of why the glexec test failures were not seen previously during testing and roll-out of the Tier1 worker node configuration. Also to provide list of resulting actions to mitigate this type of problem in future. Done.

577.4 AS and PC will prepare and circulate a summary report on the UK Tier-0 meeting last week – DB will take this forward. Done.

578.1 PG to contact ALICE to determine the amount of tape required and whether that can be accommodated. Done.

578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. Ongoing.

578.3 DB will discuss the issue of high accommodation costs for CHEP with DK. Done.

578.4 PG will finalise HW grant totals. Done.

578.5 HW grants are almost ready to proceed, DB will make an email proposal for members to consider and sign off on. Done.

578.6 DB to ask the Glasgow team to check historical data on CPU utilisation. Done.

578.7 JC to follow up on CPU utilisation at Ops meeting on 3.11.15. Done.

578.8 DB will consider reputation risks of inadequate support of new VOs. Ongoing.

578.9 DB will test SL’s travel process and confer with DK on Visit Notices then send an email to UKHEPGRID members. Done.

ACTIONS AS OF 16.11.15
======================
578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. Ongoing.

578.8 DB will consider reputation risks of inadequate support of new VOs. Ongoing.

579.1 DB to write a paragraph with guidelines on how small sites are expected to use GridPP4+ h/w spends and circulate to PMB in the next few days for comment.

579.2 DB to contact AS for comment on tape planning before agreeing that ALICE can use the existing 850TB, but we cannot increase this as 680TB was agreed until the end of GridPP5.

579.3 PG to keep Catalyn informed that ALICE won’t be asked to delete data and will be advised we are considering scope to increase data but this cannot be agreed at the moment.

579.4 DB will adjust the wording on visit notices to make clear that staff are required to make the most cost-effective travel and accommodation arrangements.

579.5 CD will forward Karen Padmore’s contact details to SL for more information on potential SMEs.

579.6 GS to provide high level milestone dates for procurement, plans timing.