GridPP PMB Meeting 682 (08/10/18)
=================================
Present: Tony Cass, Jeremy Coles, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Roger Jones, Steve Lloyd, Gareth Roy (Chair), Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Dave Britton, Pete Clarke, David Colling, Dave Kelsey, Andrew McNab.

1. Tier-1 Spend Update
======================
An email regarding the Tier-1 spend was received from DK. AD confirmed this is all agreed; ISIS was a main beneficiary, so c. £200K will be spent using their spend code for Procurement. We will be able to spend the full amount of £1.05M on Capital this year. A slight overspend may appear in the accounting, but this is in hand and acceptable.

2. Tier-1 CPU Usage
===================
AD circulated slides on the Tier-1 CPU usage and provided a detailed overview of their content. AD confirmed the graphs showed the percentage of each experiment's fair-share utilisation rather than the percentage of the experiment's pledge/allocation, and there was discussion of elements of the content.

3. AOCB
=======
a) PG had hoped to be able to finalise the QR summary but this is not quite complete and should be available next week.

4. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
———————————————–
DC not present and no report submitted.

SI-1 ATLAS Weekly Review and Plans (RJ)
—————————————
Work is ongoing to move the RAL Frontier service to new h/w, which it is hoped will be completed without downtime. RAL dropping requirements for Oracle and support for Frontier was agreed for ATLAS in the Support meeting this week. There are improvements in the infrastructure that will ensure its robustness. Thanks were expressed for all the support, particularly to AD. The EOS setup at Birmingham was agreed, and ATLAS will share the servers with ALICE.

SI-2 CMS Weekly Review and Plans (DC)
————————————-
DC not present and no report submitted.

SI-3 LHCb Weekly Review and Plans (PC)
————————————–
AM not present and no report submitted.

SI-4 Production Manager’s report (JC)
————————————-
1. The EGI OMB meeting today (https://wiki.egi.eu/wiki/Agenda-2018-10-08) raises no specific UK concerns. Our September A/R results were >97%. One item of note is that EGI is now requesting full-scale rollout of storage accounting.

2. The 2018 Fall HEPiX takes place this week: https://indico.cern.ch/event/730908/timetable/#all.detailed. Just a couple of UK participants (including DPK of course).

3. Birmingham has started the process to decommission its DPM (i.e. moving entirely to EOS).

4. The GridPP VOMS went down on 23rd September and the outage was picked up by snoplus, who were impacted for about 12 hrs. The cause was a corruption in the Manchester VOMS MySQL database, which was resolved within an hour of site notification. In theory our two backup VOMS instances would have allowed snoplus to continue uninterrupted, but their sites only pointed at the master instance. Robert passed back a recommendation that their sites update to also allow proxies from the backup VOMS instances.

5. Notes from the September GDB have been published: https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20180912. The October GDB takes place next week (https://indico.cern.ch/event/651358/), where there will be a further update on the Cost Modelling work.

6. In my last circulated report the August availability/reliability for QMUL was pending an update. I understand that the site had “cooling issues which required the site to be taken off line and a planned power down to upgrade [the] building power supply”. New cooling units were installed and commissioned. The outage impacted the modeal VO as their data is at QMUL. SL noted additional issues getting data due to a potential bug in StoRM, which has now been resolved.

SI-5 Tier-1 Manager’s Report (AD)
———————————
1) LFC outages affecting T2K, SNO+, MICE
Summary: The service has been fully working since the morning of Tuesday October 2nd and we believe the underlying issue has been understood.

History:
Switch to MySQL – 15th August.
First service interruption (writes stopped working) – 19th August. Fixed 21st September.
Second service interruption (writes stopped working) – Tuesday 25th September. Fixed October 2nd.

It was not a database corruption as initially reported but rather an error in the migration process that meant a row in a table was not migrated. This did not immediately cause a problem, as around 6000 new entries could be written to the database before it filled up. Both times we encountered the problem, the database stopped working at exactly the same number of rows. This led us to understand and then fix the problem.

The LFC software is largely unsupported and we were somewhat lucky to find the solution. I am trying to push ahead as fast as possible with the migration away from LFC to DFC. Daniela Bauer has been leading the technical work on this.

2) Echo memory upgrades
Memory upgrades of the Clustervision storage nodes were completed on the 28th September. All Echo storage nodes have now had their memory upgraded. The Clustervision nodes have been functionally tested to work in a Ceph cluster. A few minor issues were uncovered on a small number of machines. The remaining machines are currently being prepared to go into Echo and will be phased into production over the next two weeks. We expect them to be fully in production by the week beginning 22nd October.

3) FTS / gfal version issues
The newest version of gfal made changes to the way certificates are processed. This broke transfers to a significant number of sites (some of which hadn’t upgraded since 2012!). Tier-1 was ticketed to roll back the FTS service, which it promptly did.

There is a separate issue involving the RAL FTS service being unable to connect to a handful of machines at TRIUMF and a Tier-2 in Columbia over IPv6. We are working closely with the other sites. So far it appears to be a configuration problem at their end.

4) New Castor tape service:
Successfully ran the ATLAS tape test; performance was excellent at 2 GB/s. It was reported at the WLCG Archival Storage WG (https://indico.cern.ch/event/756338/attachments/1723845/2784624/update-atlas-data-carousel-wlcg-wg.pdf). We were second only to IN2P3, who got 50% better throughput but had 36 T10KD drives available compared to our 8. As we migrate more VOs to the new Castor instance we will have more drives available (up to 22), so we are likely to double our overall throughput.

The tape endpoint is now available to the full range of SAM tests and we are resolving a handful of remaining errors. No intervention has been scheduled yet to move over ATLAS production work, but we would expect this to be sometime in October. CMS will follow around a month later (to give time to resolve any residual ATLAS problems but still give sufficient time for CMS to resolve issues before Christmas).
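
As a rough cross-check of the throughput figures reported for the tape test, the sketch below works out per-drive rates and a simple projection for 22 drives. It assumes throughput scales linearly with drive count, which the report does not claim, so the projection is an upper-end illustration only; the variable and function names are made up for this example.

```python
# Back-of-envelope check of the tape test figures.
# Values marked "reported" come from the Tier-1 report; linear scaling
# with drive count is an ASSUMPTION made only for this illustration.

RAL_THROUGHPUT_GBS = 2.0   # reported: ATLAS tape test result at RAL (GB/s)
RAL_DRIVES = 8             # reported: T10KD drives available at RAL
IN2P3_RELATIVE = 1.5       # reported: IN2P3 got 50% better throughput
IN2P3_DRIVES = 36          # reported: T10KD drives available at IN2P3
RAL_FUTURE_DRIVES = 22     # reported: upper bound once more VOs migrate


def per_drive(throughput_gbs, drives):
    """Average throughput per drive in GB/s."""
    return throughput_gbs / drives


def linear_projection(throughput_gbs, drives_now, drives_future):
    """Projected throughput if every extra drive added the same rate (assumed)."""
    return per_drive(throughput_gbs, drives_now) * drives_future


ral_per_drive = per_drive(RAL_THROUGHPUT_GBS, RAL_DRIVES)
in2p3_throughput = RAL_THROUGHPUT_GBS * IN2P3_RELATIVE
in2p3_per_drive = per_drive(in2p3_throughput, IN2P3_DRIVES)
projected = linear_projection(RAL_THROUGHPUT_GBS, RAL_DRIVES, RAL_FUTURE_DRIVES)

print(f"RAL per drive:    {ral_per_drive * 1000:.0f} MB/s")
print(f"IN2P3 per drive:  {in2p3_per_drive * 1000:.0f} MB/s")
print(f"Linear projection for {RAL_FUTURE_DRIVES} drives: {projected:.1f} GB/s")
```

Under that linear assumption the ceiling for 22 drives is above the doubling (4 GB/s) quoted in the report, so the doubling estimate looks cautious rather than optimistic.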

5) Procurement
On Thursday 27th September, UK SBS came to RAL and we had a full day’s meeting to sort out the procurement documents. On Friday we received the risk assessment document which we will need to sign because we are specifying that we want to buy from two separate suppliers. I expect the tenders to be issued this week.

6) DUNE transfer testing
I was able to convince the CERN Davix developers that not being able to transfer files over 5GB was a critical problem (for SKA and DUNE). They have now implemented a version of Davix that supports this (https://its.cern.ch/jira/browse/DMC-1090) which took 5 days to write. This is now being tested and should allow Echo S3 to be properly used by DUNE.
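
For context on the 5GB limit, the sketch below shows how an object larger than that could be split into parts for an S3-style multipart upload. This is an illustration only: the report does not say how the Davix fix works, and it is an assumption here that the limit comes from the S3 single-request upload cap and that Echo's S3 gateway follows the limits Amazon documents for S3 (parts of 5 MiB to 5 GiB, at most 10,000 parts). The function `plan_multipart` and the 256 MiB default part size are hypothetical and are not part of Davix.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

# Limits documented for Amazon S3; ASSUMED (not confirmed) to apply to Echo S3.
MIN_PART = 5 * MIB      # minimum size of every part except the last
MAX_PART = 5 * GIB      # maximum size of a single part
MAX_PARTS = 10_000      # maximum number of parts per upload


def plan_multipart(object_size, preferred_part=256 * MIB):
    """Split an object into (part_number, offset, length) tuples.

    Hypothetical helper: it only plans the byte ranges, it does not
    talk to any storage service.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size if the preferred size would need too many parts.
    part_size = max(preferred_part, MIN_PART, math.ceil(object_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object too large for the assumed part limits")
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


plan = plan_multipart(6 * GIB)
print(f"{len(plan)} parts, first part: {plan[0]}, last part: {plan[-1]}")
```

A 6 GiB file, which is over the single-request cap, is planned as a sequence of equal parts that each stay well inside the assumed per-part limit.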

SI-6 LCG Management Board Report of Issues (DB)
———————————————–
DB not in attendance, no report submitted.

SI-7 External Contexts (PC)
———————————
PC not in attendance, no report submitted.

REVIEW OF ACTIONS
=================
ACTIONS AS OF 01/10/18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
678.1: RJ to finalise the Experiment Support background document by end September (Update: RJ will submit very soon). Done.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. Ongoing.
678.5: JC to finalise the Storage background document by end September.
(UPDATE: 17 October meeting with Tony Medland – DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in – GR will upload into Googledocs for info). Ongoing.
680.2: JC will follow up GDPR implications relating to VOMS with DK. Ongoing.

ACTIONS AS OF 08/10/18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress. 08/10/18 – Leicester are producing a PO for tapes and will send to AD to produce an invoice). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. Ongoing.
678.5: JC to finalise the Storage background document by end September.
(UPDATE: 17 October meeting with Tony Medland – DB and PC will attend. This is almost complete and awaiting a few minor elements to be worked in – GR will upload into Googledocs for info). Ongoing.
680.2: JC will follow up GDPR implications relating to VOMS with DK. Ongoing.