GridPP PMB Meeting 578 (02.11.15)
=================================
Present: Dave Britton (Chair), Pete Gronbech, Tony Doyle, Tony Cass, Andrew McNab, Andrew Sansum, Jeremy Coles, Steve Lloyd, Claire Devereux, Pete Clarke, Gareth Smith, (Minutes – Louisa Campbell)

Apologies: Peter Clarke, Dave Kelsey, Dave Colling, Roger Jones

1. Changed 2017 REBUS numbers
==============================
PG had noted recently that the 2017 REBUS numbers had been updated. RJ was not present to comment on the increase in the 2017 ATLAS tape requirement, but it was noted that this was somewhat compensated by a reduction in the LHCb tape requirement. The ATLAS change might reflect an adjustment in the computing model of disk vs tape data. DB had noted previously a discrepancy between the CMS and ATLAS tape requirements in 2017, but they were now more balanced. It was recognised that we need to ensure coherent planning between REBUS and GridPP5; on the other hand, our close association with the computing teams in the LHC experiments means that we sometimes know about changes well before they appear in REBUS.

2. ALICE Tape use
=================
A rapid increase in the tape used by ALICE has been noted. This year usage shot up to over 700 TB against an allocation of 420 TB, which is in turn a good deal more than the pledged level of 204 TB calculated according to the UK authorship fraction. PG has checked with ALICE, but the reason for the discrepancy is not clear. He seeks PMB input before pursuing this further, and questioned whether we should be working towards the original pledged amount or attempting to accommodate the higher usage. It was suggested there was not undue pressure on tape in the short term, but we need to be careful about the mid- and longer-term implications.
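
For orientation, a hedged sketch of the arithmetic (only the three UK figures below come from the discussion; the pledge rule in the comment is the standard authorship-fraction scaling):

    # Figures from the discussion; pledge = UK authorship fraction x
    # ALICE's global tape requirement (the fraction itself is not minuted).
    pledge_tb = 204       # UK pledge from authorship fraction
    allocation_tb = 420   # current UK tape allocation
    used_tb = 700         # usage has now passed this level

    print("over allocation by %d TB" % (used_tb - allocation_tb))  # 280 TB
    print("%.1fx the pledge" % (used_tb / float(pledge_tb)))       # 3.4x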

This leads on from point 1: more coherent planning processes should be in place for REBUS. DB noted these issues are challenging and that the next step should be for PG to contact ALICE and confirm (a) the UK authorship fraction and (b) how much tape storage they actually need in the short and mid term. Once we have that information, we can attempt to address the issue.

ACTION 578.1 PG to contact ALICE to determine the amount of tape required and whether that can be accommodated.

3. EGI and VM-based sites
=========================
AM summarised the current situation with UCL not passing availability and reliability testing (due to using the VAC model) and suggested other sites may soon be similarly affected. Previously such sites were removed from EGI monitoring. Interestingly, UCL suddenly seemed to be passing EGI tests again despite nothing having been done. This needs to be investigated: perhaps EGI has recognised that not running tests is not the same as failing tests? AM noted that the Community Platform is a concept introduced at the end of EMI, and it may be possible to register VAC (and possibly the UK DIRAC service) as such a platform. This is effectively an administrative issue relating to access rather than a software issue. DB suggested this should be undertaken formally within the structure of EGI and its feasibility tested. CD noted that Matt Viljoen has now joined EGI Operations and may be a good contact for the future.

ACTION 578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered.

4. ResearchFish deadline 15 November
====================================
PG confirmed that the next period for entering data into ResearchFish is February 2016, so the matter can be reconsidered nearer that time.

5. SCD “Echo” Storage
=====================
Alistair’s recent emails proposed branding the Ceph storage service offered to other communities as ‘Echo’. The name derives from “E”rasure coding, “C”eph, “H”igh-throughput and “O”bject store. Although “echo” could have some unfortunate connotations if things went wrong, nobody objected or could think of anything better. Agreement was reached on this branding.

6. CHEP costs
=============
CHEP is now confirmed as taking place in downtown San Francisco, and concern was raised over the high cost of accommodation ($250–$300 per night). The CHEP organisers argued that CERN and US delegates receive an allowance that would cover this (TC noted that this is not necessarily true for CERN), but DB noted that this is very expensive and would further limit UK contributions. Potential options include:

a) Check out alternative accommodation with easy access to downtown San Francisco; or

b) Fund only plenary speakers to reduce costs (this is considered unattractive since CHEP is a major showcase event for GridPP output).

It has been confirmed that the HEPiX meeting will take place nearby at Berkeley Lab on 17–21 October, the week after CHEP (c. 30 minutes’ travel). Members should raise concerns over the high cost of participation with the organisers at suitable opportunities. We could possibly explore alternative accommodation for c. 15 delegates now to take advantage of any reduced rates, though it is recognised that this may require a deposit.

ACTION 578.3 DB will discuss the issue of high accommodation costs for CHEP with DK.

7. Hardware Grants
==================
PG circulated a list of proposed GridPP4+ hardware grants, based on the metrics accumulated over the two years up to October 2015, and invited comments.

ACTION 578.4 PG will finalise HW grant totals.

ACTION 578.5 HW grants are almost ready to proceed; DB will make an email proposal for members to consider and sign off.

8. Q215 reports
===============
PG confirmed these have now been received, and he circulated a summary report in advance of the PMB meeting, summarising the main points arising. It was noted that attempts to support new VOs can be challenging and, if done incorrectly, potentially damaging to GridPP’s reputation. DB noted that the Tier-2 CPU utilisation numbers need to be better understood; a more focussed discussion of this issue will take place at the Ops meeting on 3.11.15. Annual workshops and other formal processes for engaging with new VOs (e.g. the user board) were discussed, as well as pro-formas with which new VOs can track usage etc.

ACTION 578.6 DB to ask the Glasgow team to check historical data on CPU utilisation.

ACTION 578.7 JC to follow up on CPU utilisation at Ops meeting on 3.11.15.

ACTION 578.8 DB will consider reputation risks of inadequate support of new VOs.

9. AOCB
=======
a) Travel visit notices – SL’s system has now been enabled for GridPP and text drafted to be sent to UKHEPGRID. DK has registered and DB will test it out. It was recognised that rules need to be established, and the question of whether Visit Notices are required at all should be considered. Moving to SL’s system addresses two issues:
i. It can be confusing to have 2 separate processes for visit notices; the experiments all use Steve’s system so why not GridPP?
ii. The old GridPP system would, in any case, have had to be moved to the new website, which would have been some work. So why bother?

ACTION 578.9 DB will test SL’s travel process and confer with DK on Visit Notices then send an email to UKHEPGRID members.

10. Standing Items
===================
SI-0 Bi-Weekly Report from Technical Group (DC)
——————————————-
DC not present. No report.

SI-1 Dissemination Report (SL)
——————————————-
DEAP3600: Jeremy Coles is organising a meeting with the DEAP3600 experiment this week to get them started on the grid.

GridPP website and UserGuide usability testing

Steve Jones has taken advantage of half-term to run some usability tests of the new website and UserGuide material with a willing volunteer (his computer-science student son!). The only problem reported so far is a bug in the VOMS ID web page software when joining the GridPP VO, which is beyond our control. A full report will follow in due course.

Commercial certificates are now in place, and it was noted that these are considerably cheaper from commercial sources (£8) than through JISC (£35). It was recognised that the issues raised over Visit Notices need to be resolved in advance of the website launch.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
RJ not present and no report.

SI-3 CMS Weekly Review and Plans (DC)
——————————————-
DC not present and no report.

SI-4 LHCb Weekly Review and Plans (PC)
——————————————-
Tier-2 sites undertake reconstructions using Tier-1 data. Previously this was done by allocating a single Tier-2 to a Tier-1, but more recently this has been expanded and tested, and it is going well so far. DB noted interest in determining any efficiency hit.

SI-5 Production Manager’s report (JC)
——————————————-
Nothing significant from me this week that is not already known to the PMB (and of interest to it). A snapshot of some items:

1. There is a GDB this week https://indico.cern.ch/event/319753/. One of the first topics is the chair election. There are two candidates.

2. ATLAS T2 prod disk decommissioning is proceeding quickly in the UK.

3. There is a new RCUK Cloud Working Group with plans for a workshop at Imperial on 1st December: http://bit.ly/cloudwgdec15.

4. In case it is lost elsewhere, Andrew McNab raised with the PMB the question of what we want to do about EGI and our VM-based sites, especially sites that only have VMs run by Vac (or Vcycle) and no CREAM/ARC, like UCL. We are seeing low-availability alarms in our ROD dashboard because the probes cannot currently test these sites.

For information:

A. There is an updated version of the EGI Acceptable Use Policy that is awaiting final comments: https://documents.egi.eu/document/2623. The updates were driven by the need to:

1. Generalise to include all EGI service offerings (Grids, Clouds, Long Tail of Science, etc.);
2. Add a policy requirement to acknowledge support in publications;
3. Address liability issues.

B. There was an EGI Operations Management Board last week: https://indico.egi.eu/indico/conferenceDisplay.py?confId=2381. Focus is on the EGI meeting coming up in Bari next week. Relevant to operations are sessions on: Security group activities; the EGI marketplace; federated GPGPU options; EUDAT interoperability and AAI. There was also a talk on the Long Tail of Science (LTOS) activities in EGI. A new VO eu.egi.long-tail has been created together with an approach for authenticating users to various cloud gateways: https://indico.egi.eu/indico/materialDisplay.py?contribId=6&materialId=slides&confId=2381.

SI-6 Tier-1 Manager’s Report (GS)
——————————————-
Castor:
– We have again seen very high load on the ATLAS tape instance. A couple of weeks ago we added five additional servers, doubling the size of the disk cache for AtlasTape at that time. Over the weekend of 24/25 Oct. two of the servers crashed, and another exhibited problems in the middle of last week. For the two servers that crashed, diagnostics found a failing drive in each and a bad battery in one of them. They have been re-running through acceptance testing since the middle of last week; no further faults have been found, although a further disk has been replaced in each. The third server exhibited a network problem that was fixed by a reboot; the cause of this is not understood. A disk has also been replaced in this server.
– Some members of the Castor Team are attending a face-to-face meeting with the CERN Castor Team for the first couple of days of this week.

Grid Services:
– We had problems with virtual machines on one of the Windows hypervisors at the start of last week. This affected the VMs hosted there, including one of the ARC CEs, which was drained before being restarted.

Networking:
– We continue to keep a close watch on some low-level packet loss within our network. We also continue with the changes needed to remove the old ‘core’ switch from the network.

Batch:
Regarding Action 576.4 (glexec test failures for CMS after the final batch of worker nodes was updated, leading to loss of availability): I do not yet have what I consider a full answer – please leave the action ongoing. We now have a replacement Nagios test in place to pick up SUM test failures for each VO. The puzzle remains of why, when only one batch of worker nodes was still to be updated, the large majority of the CMS glexec test jobs ran on the few remaining un-upgraded nodes.
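
For context, Nagios checks of this kind are standalone executables that report status through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of the shape such a per-VO check could take; the fetch_sum_result helper is a hypothetical placeholder, not the actual Tier-1 implementation:

    #!/usr/bin/env python
    # Nagios-style check for per-VO SUM test failures (illustrative only).
    # Exit codes follow the Nagios plugin convention: 0 OK, 2 CRITICAL, 3 UNKNOWN.
    import sys

    def fetch_sum_result(vo):
        # Hypothetical placeholder: the real check would query whatever
        # monitoring source holds the latest SUM test outcome for the VO.
        raise NotImplementedError

    def main(vo):
        try:
            passing = fetch_sum_result(vo)
        except Exception as exc:
            print("UNKNOWN - could not fetch SUM result for %s: %s" % (vo, exc))
            sys.exit(3)
        if passing:
            print("OK - latest SUM tests passing for %s" % vo)
            sys.exit(0)
        print("CRITICAL - SUM test failures for %s" % vo)
        sys.exit(2)

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else "cms")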

Procurement:
The tender documents should, all being well, be made visible this week.

SI-7 LCG Management Board Report of Issues (DB)
——————————————-
1) The Operations Report noted that there was a pending Globus host certificate validation change, which had been delayed once but was scheduled before the end of the year. The issue is that some CAs cannot issue certificates in the required format. DK was asked to confirm that the UK is OK; in any case, WLCG would cope by freezing the affected RPMs in the repository.

2) Memory Requirements: the LHC experiments all basically agreed that 2 GB/core was the baseline, but that some (advertised) resources with up to 4 GB/core would be valuable for some workflows. It was also important to ensure virtual memory was adequate (4–8 GB/core). It was noted that the use of cgroups to restrict memory usage was problematic if the restriction was too tight (a 1.9 GB/core limit killed everything, and some jobs were still being killed until the limit was raised to 8 GB/core); a sketch illustrating such a cgroup limit follows at the end of this report.

3) Presentation on the WLCG workshop in February as kick-off for the technical evolution groups. These are now plural because the medium-term (end of LS2) and long-term (Run 4) problems are different. The presentation is here:
https://indico.cern.ch/event/455391/contribution/2/attachments/1177447/1702968/WorkshopPlanning.pdf

4) PCP – pre-commercial procurement and HNSciCloud. This is approved and starts in January 2016; the UK has a small involvement. The plan is to build on the resulting hybrid cloud service in order to deploy a European Open Science Cloud funded from the INFRADEV-04 (2016) call – this is the initiative that both CERN and EGI have expressed intentions to lead. The LHC experiments have been asked to get together to produce a common use case for this infrastructure.
https://indico.cern.ch/event/455391/contribution/5/attachments/1176651/1701479/HNSciCloud-WLCGMB.pdf

5) DK asked the MB to approve a change of security policy: approval of the IOTA CA and the specific applications that use it. This was agreed, subject to any last-minute objections once people have re-examined the GDB talk containing the details.
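
Referring back to item 2, a minimal sketch of how a batch system might impose such a per-job memory cap with the cgroups v1 memory controller; the filesystem paths are the standard kernel interface, but the limit_job_memory helper and its parameters are illustrative assumptions, not any experiment’s actual configuration:

    # Illustrative cgroups v1 memory cap for a batch job (assumes the
    # memory controller is mounted at the standard location and that
    # the caller has permission to create cgroups).
    import os

    CGROUP_ROOT = "/sys/fs/cgroup/memory"

    def limit_job_memory(job_id, cores, gb_per_core):
        """Create a memory cgroup for a job and cap its usage."""
        path = os.path.join(CGROUP_ROOT, "job_%s" % job_id)
        if not os.path.isdir(path):
            os.mkdir(path)
        limit_bytes = int(cores * gb_per_core * 1024**3)
        # Exceeding this limit triggers the kernel OOM killer, which is
        # why a 1.9 GB/core cap killed every job in the MB discussion.
        with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
            f.write(str(limit_bytes))

    # e.g. an 8-core job slot at the agreed 2 GB/core baseline:
    # limit_job_memory("12345", cores=8, gb_per_core=2.0)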

DB is travelling and will not be attending next week’s PMB meeting. If members decide to proceed with the meeting PG will chair.

REVIEW OF ACTIONS
=================
574.2 On CMS T1 efficiency discrepancies – DC reports that CMS are running multicore pilots on single-core jobs, whereas ATLAS are doing this correctly and achieving higher efficiency. Ongoing.

574.8 DB to obtain information from PC about conclusion of MB discussion on Memory Items for the Future and share with PMB members. Done.

576.2 DB to talk with DK about the policy for requiring visit notices. Ongoing.

576.3 SL to implement authorization for GridPP funded visits on the system already in use by the experiments. Done.

576.4 GS to respond to the PMB with the explanation of why the glexec test failures were not seen previously during testing and roll-out of the Tier1 worker node configuration. Also to provide list of resulting actions to mitigate this type of problem in future. Ongoing.

577.1 SL to provide relevant hardware numbers to PG. Done.

577.2 CD to submit final 2 QReports. Done.

577.3 PG to prepare update on Q2 reports for next PMB. Done.

577.4 AS and PC will prepare and circulate a summary report on the UK Tier-0 meeting last week – DB will take this forward. Ongoing.

ACTIONS AS OF 2.11.15
======================
574.2 On CMS T1 efficiency discrepancies – DC reports that CMS are running multicore pilots on single-core jobs, whereas ATLAS are doing this correctly and achieving higher efficiency. Ongoing.

576.2 DB to talk with DK about the policy for requiring visit notices. Ongoing.

576.4 GS to respond to the PMB with the explanation of why the glexec test failures were not seen previously during testing and roll-out of the Tier1 worker node configuration. Also to provide list of resulting actions to mitigate this type of problem in future. Ongoing.

577.4 AS and PC will prepare and circulate a summary report on the UK Tier-0 meeting last week – DB will take this forward. Ongoing.

578.1 PG to contact ALICE to determine the amount of tape required and whether that can be accommodated.

578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered.

578.3 DB will discuss the issue of high accommodation costs for CHEP with DK.

578.4 PG will finalise HW grant totals.

578.5 HW grants are almost ready to proceed; DB will make an email proposal for members to consider and sign off.

578.6 DB to ask the Glasgow team to check historical data on CPU utilisation.

578.7 JC to follow up on CPU utilisation at Ops meeting on 3.11.15.

578.8 DB will consider reputation risks of inadequate support of new VOs.

578.9 DB will test SL’s travel process and confer with DK on Visit Notices then send an email to UKHEPGRID members.