GridPP PMB Meeting 603

GridPP PMB Meeting 603 (08.08.16)
Present: Dave Britton(Chair), Tony Cass, David Colling, Tony Doyle, Pete Gronbech, Roger Jones, Dave Kelsey, Steve Lloyd, Andrew McNab, Andrew Sansum, Gareth Smith, Louisa Campbell (Minutes).

Apologies: Jeremy Coles, Pete Clarke.

1. PC’s email about Nuclear Physics
Referring to emails regarding provision of 150 TB for Nuclear Physics at Tier1. It was noted that the original request was made to SCD and then pushed to GridPP; taking it forward would need a coordinated response from one entity. GridPP is targeting the PPAN programme, so this probably falls to UKT0 for consideration. We should clarify expectations, as we have interfaces into the storage system for other communities. The PMB agreed to the use of 150 TB of storage capacity if it can be accessed by existing mechanisms in the GridPP suite of tools.

ACTION 603.1: AS will discuss with Jens and confirm agreement for 150 TB tape storage capacity for the Nuclear Physics request, on the proviso that it can be accessed by existing mechanisms in the GridPP suite of tools.

2. OPN load – GS’s email about OPN load
GS sent round some plots (attached) in his Tier-1 Manager’s Report and summarised the OPN load here. In the first half of the month the link was saturated and ATLAS submitted a ticket requesting a fix. Traffic settled down after the 28th. There were discussions with networking staff at RAL and CERN on the option of putting both links into use; this was thought to be simple (it is already done by some Tier1s). Concerns raised included whether different routing of the links might cause issues with transit times (unlikely) and whether increased loading on Castor might adversely affect performance. The change was made on Thursday morning, and testing confirmed that it worked well. Since then traffic has reached maximum capacity for periods and then recovered, so all appears to be working effectively. There is a separate issue concerning some packet loss, which will continue to be tracked. It was agreed to assess trends over the next couple of months and discuss with the experiments their expectations. The plots show average usage figures at the bottom; another metric to consider is the weekly averages. The single link seems to have coped well up to this point, which suggests we now have headroom, but we should consider the next step unless the costs are prohibitive. AS confirmed that DK and Philip have acquired costings and he will circulate them for discussion. Costs have decreased to those projected for GridPP5. This should be discussed further at the F2F in Ambleside. GS confirmed that breaks have been infrequent and that, in the current configuration where we use both links, we should be in a position to catch up quickly if one link goes down for a short while. For disaster planning we should determine the timescales to implement an increase in bandwidth in the event of a long outage.
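
As an illustration of the weekly-average metric mentioned above, the following is a minimal sketch only. It assumes daily average throughput figures are available as (date, Gbit/s) pairs; that layout, the function name and the sample values are all hypothetical and are not taken from the attached plots. The 10 Gbit capacity is the single-link figure quoted in the Tier-1 Manager’s Report.

```python
from collections import defaultdict
from datetime import date

LINK_CAPACITY_GBPS = 10.0  # single OPN link capacity quoted in the Tier-1 report


def weekly_average_utilisation(samples, capacity_gbps=LINK_CAPACITY_GBPS):
    """Return {(iso_year, iso_week): mean utilisation as a fraction of capacity}.

    `samples` is an iterable of (date, gbps) pairs; this layout is an assumption.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of gbps, sample count]
    for day, gbps in samples:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # group by ISO year and ISO week number
        totals[key][0] += gbps
        totals[key][1] += 1
    return {
        key: (total / count) / capacity_gbps
        for key, (total, count) in totals.items()
    }


# Made-up daily averages for illustration; these are not measured values.
samples = [
    (date(2016, 7, 11), 9.5),
    (date(2016, 7, 12), 9.0),
    (date(2016, 8, 1), 4.0),
    (date(2016, 8, 2), 6.0),
]
weekly = weekly_average_utilisation(samples)
```

With both links in use, the same function could be called with capacity_gbps=20.0 to express the load as a fraction of the combined capacity.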

ACTION 603.2: DK will circulate costings for single links to determine which options would be affordable.

ACTION 603.3: PG will include an item on F2F agenda for costs so that decisions can be made on the best way forward for OPN load.

3. Quarterly Reports
Some quarterly reports have not yet been submitted; PG sent out reminders this morning to those concerned.

4. GridPP37 Agenda
The agenda page is now set up on Indico. PG advised some slots are now confirmed:

First session – DB
Second session – Portal, accounting and benchmarking. Alessandra can no longer attend, so this may be scaled back to a 15-20 minute overview of what each task-force is undertaking, covering the machines and why they are useful. The aim is a sales pitch to encourage involvement, to be revisited at the next GridPP meeting in spring when there will be more involvement by UK sites. Ian Collier could give a high-level WLCG perspective and AM could pick up the UK side (PG will invite Ian Collier). Following on, APEL publishing and how the information feeds in, as well as scaling: Adrian Coveney is confirmed, and George Ryall will speak to provide an understanding of the CRIC elements and the CRIC-light interface, which will be a long-term thing. AM can provide a site perspective. Other WLCG workshop – David Crooks could deliver a talk on the work he is undertaking, which differs from other taskforces, i.e. a contribution from the security team.

Thursday Morning
JC is suggested for a technical talk, with Sam Skipsey and Brian discussing storage. We urgently need to invite these individuals to give talks. Also, Marcus from Edinburgh could talk on NFS; he is speaking on that topic at CHEP, so there may be an overlap. If he is speaking he should have clear guidance on what to cover. JC will invite speakers for session 3.

Session 4 – Tier 1 session. AS will consider the content; elements to consider include Ceph, cloud and networking (possibly IPv6 talks, but these may be included in session 6). Andrew Heath suggested a talk on whether a large site can run services for small sites, which Ian Collier could give (AS will contact and invite Andrew to talk). Tier 2 evolution – this may slot into JC’s session.

Session 5 – PC will give external focus, LSST speaker and other suggestions. We did not get an SKA speaker as none are available and JC was hoping to contact other people – JC may be best placed to give a talk as he is 50% Grid and 50% SKA, but he focusses more on data processing. DB will discuss with PC.

Session 6 could be security and IPv6 – Andrew Dewhurst and Duncan Rand or Dan Traynor to give a talk on their experience with worker nodes and network translations. DK will contact them and invite them to speak so that PG can place titles on the agenda. David Crooks may give a talk on the security centre. DB confirmed that the spreadsheet DK circulated on 31 July for CHEP attendance is a good source of who we can invite to speak about different topics. The programme needs to be fully populated this week; there are fewer people than normal registered (only 41 delegates are registered and we would normally expect an additional 10-15). The attendees list can be looked at to determine speakers (e.g. Ivan Reid). DB will invite Ivan to speak as there is sufficient space. LZ would be worth discussing; DC could give a talk or consider who he can invite to contribute.

ACTION 603.4: PG will invite Ian Collier to present and others for Session 1 & 2.

ACTION 603.5: JC to invite speakers for session 3.

ACTION 603.6: DK will contact Duncan Rand and Dan Traynor and invite them to speak.

ACTION 603.7: AS will discuss a talk in session 4 with Andrew Heath.

ACTION 603.8: DB will discuss external contacts with PC and invite Ivan Reid to give a talk.

Nothing of note to discuss.

6. Standing Items

SI-0 Bi-Weekly Report from Technical Group (DC)
There was no technical meeting last week so nothing of note to discuss.

SI-1 Dissemination Report (SL)
Nothing of note to report.

SI-2 ATLAS Weekly Review and Plans (RJ)
Nothing of note to report.

SI-3 CMS Weekly Review and Plans (DC)
Nothing of note to report.

SI-4 LHCb Weekly Review and Plans (PC)
Nothing of note to report.

SI-5 Production Manager’s report (JC)
JC was not in attendance – no report submitted.

SI-6 Tier-1 Manager’s Report (GS)
The main item to report is the change made last Thursday morning (4th August) bringing the second (“backup”) OPN link into operation.
So far this has worked well. I attach two plots:
1) The “primary” OPN link for the last month. This shows the saturation of the single 10Gbit link that had been occurring. It also shows the fall away that corresponds to the LHC machine development, followed by traffic picking up again.
2) The “backup” link for the last week. This shows:
– Some traffic outbound from RAL from Wednesday to Thursday. It was not part of the plan to start this ahead of balancing the inbound traffic. However, this did not cause problems we were aware of.
– The initial full load on the link (i.e. on both links). The test we made shortly afterwards can be seen. Then the pattern is that we saturate the link(s) at times.

The connection is made using the UKLight router. We, of course, still plan to replace this. Work has been progressing well. Testing of the replacement as far as practical is largely done.

– I had been reporting a specific problem found in the testing of Castor 2.1.15. Work with CERN uncovered a configuration error in one of our configuration files. This was coupled with finding a bug in the way Castor handled comments in the configuration file.
Resolving this has meant that the remaining testing (both functional and performance) has been able to continue.
– Three new disk servers have been added to each of the disk caches in front of the tape system for Atlas, CMS and LHCb. This will enable the removal of some of the oldest disk servers from these areas.

Tape System:
– The tape system has been working well in recent weeks. We had a meeting with Oracle last week to wrap up the issues with the ACSLS library control software. We have another meeting this week to review the hardware problems we had. We anticipate that we will go ahead with preventative maintenance on the libraries, which will require a short working-day downtime.
– The migration of Atlas data from ‘C’ to ‘D’ media is almost complete. There are only around 30 (out of the 1300) tapes left to migrate.

Batch systems:
– The LSST VO has been enabled on the batch farm.
– The 2009 worker nodes are being drained from the batch system ahead of their use as test systems before final decommissioning.
– HPE Worker nodes: Around three-quarters of them have passed acceptance testing. They will undergo benchmarking and power measurements and are expected in service by the end of the month. The remainder are being followed up.

The database behind the LFC was moved to new hardware on Monday 1st August.

The availability figures for July 2016 for all four LHC VOs plus ‘OPS’ were 100%.

SI-7 LCG Management Board Report of Issues (DB)
There has been no MB since the one DB previously circulated.

SI-8 External Contexts (PG)
Nothing to report.

Next PMB is 22 August followed by the F2F in Ambleside.

600.1: DC to contact Julia Sedgebeer at Imperial to informally discuss and address SuperNemo’s computing needs and request Daniella and Tom to await outcome of these discussions before progressing further. Ongoing.
600.2: DB/PC will consider whether to contact the head of SuperNemo in the UK discussing support requirements. Ongoing.
601.1: ALL to look at the draft policy document supporting new VOs and feedback comments to PC by the end of this week. Done.
601.6: PC to check guidelines to submit PRD to STFC to develop elements on top of openstack to allow other communities to benefit from the cloud. Done.
602.1: DK will put costs for CHEP and WLCG workshop attendance into a spreadsheet for PMB to consider. Done.
602.2: AM and AS to resolve the LHCb request at the end of July. Ongoing.
602.3: AS will request that Jens make a presentation to the PMB supported by a written report on plans for AAAI project with Dirac as well as proposed reporting. Ongoing. (UPDATE: this could perhaps be presented at GridPP37 Session 6)
602.4: AS will contact Charlotte to determine how much effort in total STFC is funding in the AAAI project so that we can see if there are other contributions we should expect. Ongoing.
602.5: PG will consider a title for GridPP37. Done.
602.6: PG will pull together a session relating to the Accounting portal perhaps after the tea break on day 1 and potentially running into the afternoon. Done.
602.7: PC will approach George Beckett to provide an update from LSST and pull together a session on non-LHC VOs. Done.
602.8: JC will discuss with Anna the possibility of giving a talk at non-LHC VO session and put together a session with suggested speakers, e.g. Sam Skipsey and Brian. Ongoing.
602.9: AS will consider potential content for Tier1 session. Ongoing.
602.10: PG will circulate information on disk profiling at T2s for Brian and PC to use in forthcoming presentation. Done.

ACTIONS AS OF 08.08.16
600.1: DC to contact Julia Sedgebeer at Imperial to informally discuss and address SuperNemo’s computing needs and request Daniella and Tom to await outcome of these discussions before progressing further. Ongoing.
600.2: DB/PC will consider whether to contact the head of SuperNemo in the UK discussing support requirements. Ongoing.
602.2: AM and AS to resolve the LHCb request at the end of July. Ongoing.
602.3: AS will request that Jens make a presentation to the PMB supported by a written report on plans for AAAI project with Dirac as well as proposed reporting. Ongoing. (UPDATE: this could perhaps be presented at GridPP37 Session 6)
602.4: AS will contact Charlotte to determine how much effort in total STFC is funding in the AAAI project so that we can see if there are other contributions we should expect. Ongoing.
602.8: JC will discuss with Anna the possibility of giving a talk at non-LHC VO session and put together a session with suggested speakers, e.g. Sam Skipsey and Brian. Ongoing.
602.9: AS will consider potential content for Tier1 session. Ongoing.
603.1: AS will discuss with Jens and confirm agreement for 150 TB tape storage capacity for the Nuclear Physics request, on the proviso that it can be accessed by existing mechanisms in the GridPP suite of tools.
603.2: DK will circulate costings for single links to determine which options would be affordable.
603.3: PG will include an item on F2F agenda for costs so that decisions can be made on the best way forward for OPN load.
603.4: PG will invite Ian Collier to present and others for Session 1 & 2.
603.5: JC to invite speakers for session 3.
603.6: DK will contact Duncan Rand and Dan Traynor and invite them to speak.
603.7: AS will discuss a talk in session 4 with Andrew Heath.
603.8: DB will discuss external contacts with PC and invite Ivan Reid to give a talk.


Primary OPN link last month

Back-up OPN link last week