GridPP PMB Meeting 590 (22.02.16)
=================================
Present: Pete Gronbech (Chair), Dave Britton, Dave Kelsey, Andrew Sansum, Jeremy Coles, Gareth Smith, Roger Jones, Steve Lloyd, Pete Clarke, Claire Devereux, David Colling, Tony Doyle, Tony Cass, Andrew McNab, Louisa Campbell (Minutes).

Apologies: None.

1. Vacuum platform as EGI community platform (AM)
=================================================
AM has suggested submitting VAC as an EGI community platform. As a first step AM has completed the initial section of the template emailed to PMB and sent this to PC in the hope of securing a slot at Thursday’s EGI meeting. AM also included a link to the Vacuum platform technical notes he now has in draft to submit as a software note. The information AM and Andrew Lahiff have worked through can be collated into one document stating that if VMs are created following these rules they can be run on our sites. It is hoped this will satisfy EGI that we have some formal guidelines and can point to this document.

2. Effort and Services in kind for SKA H2020 proposal (AENEAS) (AS)
===================================================================
PC and AS are involved in a Horizon 2020 bid driven by the Astronomy (SKA) community, who are putting together a bid against an H2020 programme targeted at SKA. This sets out a proposal for a European handling facility taking data out of SKA telescopes in S. Africa & Australia then shipping it to Europe. To some extent this is a paper exercise but it could include some test-bed activity. It is similar to many of our existing activities on the Tier-1, and it allows input into how SKA may use UK infrastructure and how our infrastructure may be of use to SKA. It was suggested there may be synergy here with our existing work and this would be good for long term modelling and showcasing our work. It was pointed out this is a relatively modest project (€3M spread across 20-30 sites) but there may be potential to attract some funding for planning and policy and provision for test-bed activities. It would also be a good opportunity to demonstrate to STFC the need to resource such services. At present the activity is only at the European data level, though there have been some test codes so far. JC is also very involved in this, as are Lancaster, Imperial and others.

3. CHEP2016 (PC)
================
PC invited names to put forward for CHEP track convenor. It was highlighted that the position attracts a reasonable workload and demands attention. DB suggests there is good pay-off for being involved and it may, for example, be a positive strand for CVs of early career members of the team. The procedure is based on keywords, with papers then solicited against those keywords, but this is not yet available. Some potential names were suggested, particularly if storage is a theme, and members will supply PC with names for consideration. GridPP may pay for 50% of travel for people involved in CHEP organisation/operation.

ACTION 590.1: ALL contact PC with suggested names for CHEP track convenor.

4. GridPP36 Agenda (PC/PG)
==========================
Work is ongoing with the draft agenda and some topics have been grouped into sub-themes. The importance of positive issues relating to GridPP5 was reiterated. Other potential themes discussed include: HEP direct support staff, WLCG Lisbon reports, a Technical session (DC to chair possibly), various Networking – LHCONE, Tier-1 and Tier-2, access for Tier-2, work on Cloud things, new technologies to be used in GridPP5, PDG, UKT0, ongoing non-HEP VO support tasks, etc.

PG will take ownership of the agenda for this meeting and will progress this. It was agreed there is a good opportunity to demonstrate support for VOs as a central focus and integrated element of GridPP5. SKA through STFC is now naturally part of our business and we should consider including that in the agenda. Consideration will be given as to whether we should make clear to VOs that we support them and prioritise support for new VOs in the same way support was previously prioritised for the LHC.

ACTION 590.2: DB will consider the suggested GridPP36 session themes and email PG.

ACTION 590.3: PG will create a draft of sessions for circulation and develop the agenda for GridPP36.

5. Update on Researchfish
========================
PG has already added a number of papers. He is now looking at other sections and would be grateful for input on these. He enquired if there was any other funding that should be included. Suggestions included INDIGO-DataCloud (only people at RAL are involved in that) and, similarly, AARC. It was suggested we should be able to claim some kind of third-party ownership on these. It was suggested the PDG director group should be included in the ‘influence & policy’ section – DB & DK are part of a Working group (security and access management working groups) with policy documents at the JISC level and DK will send relevant information to PG. Other suggestions include: Big data on Cloud, Cloud working group, CD national representative on Digital research forum.

PG summarised several categories (outlined in his email) that need to be considered and asked for suggested inclusions to be emailed to him, including ‘Roles and Recognition’.

PC summarised an apparent issue with ‘project coordination award’, where a list is visible but individual awards cannot be accessed other than by going through each of the 1100 or so entries individually. PC has prompted Ian Fuller but has not yet received a response. The objective seems to be an ability to see key outputs per grant but this is not possible with the current system. Each university is scored per output on Researchfish, which makes this a very important opportunity to ensure outputs for each grant are recognised. It was suggested there should be a system in place allowing smaller grants to allocate papers from larger coordinations. Hardware grants should have special codes indicating they don’t have to have submissions against them. The Researchfish deadline is imminent (10 March).

ACTION 590.4: ALL to contact PG with summaries of items to include on Researchfish sections.

ACTION 590.5: PC and PG will push Ian Fuller for a response on how to make the Researchfish system more intuitive for inputting information, and will request contact with a representative.

6. AOCB
=======
AS and PC had a very useful telephone meeting with Tom Kitchin, Mark Hollowan and others at EUCLID last Tuesday. Agreement was reached to move forward from this point and get workflow running at RAL and in Edinburgh. Andrew Lahiff, Andrew Washbrook and Marcus (Edinburgh) are involved and meetings will now take place fortnightly. A conference is planned for the end of May and the objective is to ensure some useful work is completed by then, as this is a useful, clear and achievable target. One requirement has already been tested effectively.

7. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
DC left early – nothing significant to report.

SI-1 Dissemination Report (SL)
------------------------------
##GridPP Dissemination Officer Notes for PMB

###New User Engagement Programme – New User First Contact

* Feedback from PMB has been taken on board with a view to re-phrasing the MoU as more of a “handy checklist” to ensure we can provide the best support possible without implying any sort of commitment or signed agreement. Revised draft to follow.

###GridPP UserGuide

* Job submission via DIRAC section tested and added. During testing, TW noted that gridpp jobs were processed very quickly – either gridpp being given priority already or DIRAC is getting very good at finding free sites – thanks either way!

###Industrial contacts

* Contact made (via new website) with the Nuclear Physics and Neutronics group, Clean Energy Europe, Amec Foster Wheeler regarding using grid resources for fusion-related Monte Carlo simulations. Based near Manchester. Discussions ongoing regarding distributing sensitive software (i.e. MCNP) – CVMFS probably not appropriate in this case;

* Contact made (via RAL) with start-up SeeCycle www.seecycle.com regarding using GPU or CPU resources for training image-processing/object recognition algorithms for improving cycling safety.

###Research contacts

* Contact re-established with GalDyn group (UCLan) – the contact has now submitted their thesis and is looking to continue research by moving to large-scale simulations on the grid;

* SuperNEMO – looking to resurrect the VO – ongoing;

* Climate change (Oxford) – GRIDPP-SUPPORT email list working well to support new user from a climate change group at Oxford. Some issues with using Storage Elements via DIRAC – under investigation but the community is supporting as best it can.

SI-2 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
Some issues were experienced with the ATLAS scratch disc at RAL when a disc server went out of service, but no data was lost. The Frontier service was overloaded by Monte Carlo jobs, but this should be resolved in future.
The real data reprocessing is underway; it should be running at both Tier-1s and Tier-2s but is currently only using Tier-1s. There were problems in the analysis queues that were traced to Monte Carlo production jobs overlaying pile-up, which overloaded the global Frontier service, and the analysis queues suffered knock-on effects. PR event – the USA is trying to run 100,000 jobs through the Amazon cloud but they also want the Grid full so that a very large number of simultaneous jobs can be claimed.

SI-3 CMS Weekly Review and Plans (DC)
-------------------------------------
DC left early – nothing significant to report.

SI-4 LHCb Weekly Review and Plans (PC)
--------------------------------------
A very brief report: the current stripping campaign is near completion. Processing Turbo data is about to commence, but in the meantime Monte Carlo jobs are proceeding as normal.

SI-5 Production Manager’s report (JC)
-------------------------------------
From operations:

1. There is a glibc vulnerability to which the infrastructure is responding (CVE-2015-7547).

2. We had another long and useful discussion on “Other VOs” at the ops meeting last week. We are continuing to reach out to help communities but not all are responding. A snapshot would be:

· DEAP3600 – no response

· DiRAC – steady progress. Setting up additional sites (e.g. Leicester)

· GalDyn – No current work

· LIGO – With Andrew’s help they were able to achieve some milestones. For example, they submitted some jobs to the RAL ARC CEs. So far it has only been test work.

· LOFAR – no report (George Ryall)

· LSST – Currently distributing LSST data. Moved to use gridpp VO.

· LZ – no report (David Colling)

· PRaVDA – No current activity but planning to engage again soon.

· UKQCD – Planning to engage again soon.

We are just introducing EUCLID. In addition there have been a number of other queries or engagements at various GridPP sites with examples being: MicroBooNE; SuperNEMO!; an ITER collaborating company (we need to review more carefully their role in the work and the licensing for codes they run); OeRC Climate Prediction.

3. For January WLCG A/R (released later than usual):

ALICE (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_ALICE_Jan2016.pdf):

All okay.

ATLAS (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_ATLAS_Jan2016.pdf):

RHUL 89%:89%

Lancaster 0%:0%

CMS (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_CMS_Jan2016.pdf):

RALPP: 80%:80%

LHCb (http://wlcg-sam.cern.ch/reports/2016/201601/wlcg/WLCG_All_Sites_LHCB_Jan2016.pdf):

RALPP: 77%:77%

Site responses:

RHUL: The largest problem was related to the SRM. The DPM version was upgraded and it took several weeks to get it working again (13 Jan onwards). There were also several short-lived occurrences of running out of space on the SRM for non-ATLAS VOs.

For around 3 days (15-17 Jan) the site suffered from a DNS configuration error by their site network manager which removed their SRM from the DNS, causing external connections such as tests and transfers to fail.

For one day (25 Jan) the site network was down for upgrade to the 10Gb link to JANET. Some unexpected problems occurred extending the interruption from an hour to a day. The link has been successfully commissioned.

Lancaster: The ASAP metric for Lancaster for January is 97.5%. There is a particular problem with the ATLAS SAM tests, related to the path name being too long, which doesn’t affect the site’s production and analysis activity. A re-calculation has been performed.

RALPP: The low figures for both CMS and LHCb are due to specific CMS jobs overloading the site SRM head node. The jobs should have stopped now.

4. There will be several updates being applied to the GridPP website this afternoon so it will be unavailable for a short period today.

5. A note that the EGI 2016 Conference is 6th-8th April in Amsterdam: https://indico.egi.eu/indico/event/2875/overview.

6. There will be a GridPP core-ops meeting this Thursday to review the status of our operations activities.

SI-6 Tier-1 Manager’s Report (GS)
---------------------------------
General:
– Juan Sierra, one of our Database Administrators, will be leaving at the end of March.
– Patches for the most recent security update are being rolled out and systems rebooted. Outages for Castor have been announced for tomorrow morning.

Castor:
– Testing of the 2.1.15 version is ongoing. One problem has been identified (slower file open times) and is being followed up with the developers.
– We had some problems with the ATLAS SRMs on Wednesday (17th Feb), with some SAM tests failing during the evening. This was fixed by restarting the front end processes.

Networking:
– No change to report. The traffic on the OPN link this last week was significant but not so as to cause concern.

Batch:
– Nothing particular to report.

Procurement:
– Disk and CPU capacity orders in place. We have been checking delivery times.

Action 588.8: GS to report on ongoing disc server issues in general.
We had seen a higher rate of disk server problems, particularly through December and January, with a total of 13 server outages in each month. Many of these are from one particular batch of servers (CV’11) which we have looked at in detail. These systems are out of warranty so a small number of decommissioned servers are used as a source of spares. These servers are mainly in the tape caches.
Following a review meeting at the end of January three actions are underway. The first is that failed disks, as well as those showing a high rate of error, have been replaced with those from a different batch of servers. The disks in these servers are 2TB Western Digital drives. The replacements are 2TB Hitachi drives from an older set of servers. Furthermore, the firmware in the RAID card on these servers will be updated. (Any server that has been down has already had this done.) Following the review it has been agreed that disk servers appropriate for use in tape caches will be purchased to replace these servers. At that review we noted a higher level of problems recently on one other batch of servers (Vig’11) which we are monitoring.

AS confirmed that the big procurements are underway but there are still more to place. Since last Monday there have been issues with the SBS system, which has now been unavailable for a week. This has delayed our orders and is a concern for delivering small orders, with a possible knock-on for end of year expenditure for GridPP, though it is hoped it will be resolved before next week. The outage is the result of an update applied more than a week ago and is beyond our control, but it is a concern. AS has asked for more project manager information and contingency planning.

SI-7 LCG Management Board Report of Issues (DB)
-----------------------------------------------
DK and PC attended. It was a very long meeting (almost 2 hours). The main topics under discussion were:

1) Accounting – there was a discussion in September about a desire for publishing Tier-1 accounting reports by the end of the month. It was confirmed that we do report correctly. The plan is to publish the reports first and ask for Tier-1 input if issues arise. Bigger issues include Ian’s desire to have a complete review for WLCG of all reports, what the figures are used for, etc. This may be run by WLCG Operations in April; it will be decided soon and we need to engage to ensure our input is included.

2) Follow-up from the Lisbon workshop – Ian Bird’s summary slides are available and should be considered. Much of the meeting focussed on the medium term future and summarised the current position on various issues. Longer term discussion related to the LHC upgrade and concern over a lack of discussion on this during the workshop. Ian is proposing to establish 4 working groups with immediate effect to consider:
A) Definition of the upgrade problem (long term evolution of computing models, building cost models)
B) Strengthening the HSF – making sure we lead on software performance
C) Performance and Evaluation of performance – can we model and then use info to model useful elements in the future?
D) Prototyping and demonstrators.

At WLCG there is lots of opportunity for the GridPP group to be involved in these various elements and members are encouraged to look at the slides in the first instance.

It was noted there was very little discussion or contention raised at the meeting.

REVIEW OF ACTIONS
=================
582.4: DC to insert an update in the wiki page regarding communication with LZ. JC will send a reminder to DC. Ongoing.

586.2: AS will contact MoBrain and discuss resources for their EGI project. Next step is to make allocations. The ticket is now with Catalin – AS will check and report to the PMB next week. Done.

587.2: AM will invite selected small, medium and large sites to contribute presentations at GridPP36 on their plans for site evolution over the next few years and construct a session around this. Ongoing.

588.1: PG will discuss with DB Oversight meeting re concentrating efforts on paperwork for closing down GridPP4 and leave GridPP5 until later and generate emails to update PMB later in the week. Done.

588.2: ALL to make email suggestions for GridPP36 themes over the next few days. Done.

588.4: ALL to inform PG of any new roles and other items that need to be inserted into different categories and grants on Researchfish so that he can ensure all are included and circulate to PMB to check. Ongoing.

588.5: PG will email Ian Fuller at STFC to update the lists of current grants associated with PIs. Done.

588.6: GS will investigate reasons for saturation on OPN and report back to the PMB with findings. Ongoing.

588.8: GS to report on ongoing disc server issues in general. Done.

589.1: PC and PG will discuss and agree the GridPP36 Agenda. Ongoing.

589.2: LC will announce that GridPP36 registration is open to the UKHEPGRID mailing list and provide a URL link as well as information on travel to the venue from Edinburgh. Done.

ACTIONS AS OF 22.02.16
======================
582.4: DC to insert an update in the wiki page regarding communication with LZ. JC will send a reminder to DC. Ongoing.

587.2: AM will invite selected small, medium and large sites to contribute presentations at GridPP36 on their plans for site evolution over the next few years and construct a session around this. Ongoing.

588.4: ALL to inform PG of any new roles and other items that need to be inserted into different categories and grants on Researchfish so that he can ensure all are included and circulate to PMB to check. Ongoing.

588.6: GS will investigate reasons for saturation on OPN and report back to the PMB with findings. Ongoing.

589.1: PC and PG will discuss and agree the GridPP36 Agenda. Ongoing.

590.1: ALL contact PC with suggested names for CHEP track convenor.

590.2: DB will consider the suggested GridPP36 session themes and email PG.

590.3: PG will create a draft of sessions for circulation and develop the agenda for GridPP36.

590.4: ALL to contact PG with summaries of items to include on Researchfish sections.

590.5: PC and PG will push Ian Fuller for a response on how to make the Researchfish system more intuitive for inputting information, and will request contact with a representative.