GridPP PMB Meeting 680 (24.09.18)
=================================
Present: Dave Britton (Chair), Tony Cass, Pete Clarke, Jeremy Coles, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Steve Lloyd, Andrew McNab, Gareth Roy, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: David Colling, Dave Kelsey, Roger Jones.

1. Pledges
==========
PG sent a spreadsheet to GR and DB, who suggested changing the weights for non-ATLAS sites to half the ATLAS weight this time around, i.e. for 2 years. This will make little difference to the Tier-2 figures; it relates to how pledges are distributed in an internal model used as a sanity check. Pledges at Tier-2 will be met; Tier-1 pledge numbers have not yet been fully included. PG stated what the request was in relation to the Tier-1 pledge; he has discussed it with AD and it is very likely to be met, provided the suggested funds are forthcoming and allowing for potential price changes due to fluctuating exchange rates. If capital is received there is c. 10% headroom on price, as the budget assumes similar prices to last year. DB will include comments for Tier-1 noting ‘subject to funding confirmation’.

2. Quarterly Reports
====================
All reports have now been received except CMS. PG will summarise and process them, then document the procedure to pass on to Matt at Lancaster. AD noted that, once the instructions are documented, Katie has a certificate and could process the CMS information if required, keeping DC informed, in order to get the report completed.

3. AOCB
=======
SL invited feedback on his proposed background document in order to progress it, specifically on CMS (DC) and IRIS (PC) and general comments (DB).

4. Standing Items
=================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
No report submitted.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
– The new RAL Castor tape stager has been set up with 8 tape drives allocated to ATLAS. It has been defined in AGIS and Rucio and a few test files have been migrated to tape. We hope to be able to perform a large-scale tape read test this week, as part of the ATLAS development for the “tape carousel”.

– The old Birmingham DATADISK has been drained and is now “effectively empty”. Jobs at Birmingham have been using Manchester storage. A first look doesn’t show a big drop in efficiency running like this.

– ATLAS is migrating the UK cloud to using Harvester. Most Tier-2s’ production queues have now been switched.

PG noted that a report at last week’s storage meeting described work under way to get ATLAS working on the EOS system at Birmingham, which differs from recent PMB discussion. AD noted that Birmingham had mentioned moving to EOS and had asked whether ATLAS was content with this; it was originally agreed on the stipulation that Birmingham could provide 400TB for ATLAS to use (new ATLAS space tokens in EOS). Whether ATLAS needs the disk at Birmingham was questioned and raised at the ATLAS meeting on Thursday; GR made clear that no pledge was made on this and that it would need thorough testing. Tim thought the plan was to bring all storage back to Birmingham, but this needs to be clarified. AD suggested that if ATLAS does not need the pledge, this should be stated clearly so that decisions can be made.

ACTION 680.1: DB will write to AD and RJ regarding Atlas working on the EOS system at Birmingham.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
No report submitted.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing significant to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
1. MoEDAL simulations that previously ran successfully hit issues last week; we suspect this is because they relied on QMUL’s SE, which has been having problems. A few weeks ago the simulations ran at greater than 90% efficiency.

2. Sites are seeing activity with data movement for ProtoDUNE.

3. We have been seeing various problems with perfSONAR hosts/monitoring, as versions 4.1 and 4.1.1 have had issues. The recommendation is now to move to CentOS7 and version 4.1.2.

4. Under GDPR, sites that expose services to users need to ensure data policies are accessible. Follow-up is needed for the VOMS case where the web-interface is generic. Dave Kelsey gave an EOSC-HUB update at the last GDB: https://indico.cern.ch/event/651357/contributions/3128684/attachments/1714247/2764866/Kelsey12sep18.pdf and the WLCG ops Coordination activity is following up (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes180913).

ACTION 680.2: JC will follow up GDPR implications relating to VOMS with DK.

SI-5 Tier-1 Manager’s Report (AD)
---------------------------------
1) LFC outage affecting T2K, SNO+, MICE
For the last month, we have been having problems with the LFC service affecting T2K, SNO+ and MICE. While files could be read without problems, writing files would fail or register files with zero (or nonsensical) sizes. Last week we identified a database corruption. We declared a downtime at the end of last week to perform a database rollback. In the end we have had to roll back to a database from 18th August. There are roughly 5000 files that have been written successfully since then. We have a log and will contact the sites storing the data to confirm we have the correct size before re-registering them in the LFC. Note: the corruption started 3 days after we had migrated to the new database backend. We have not identified the cause of the database corruption. There will be a review of this downtime. I will be circulating the plan to migrate to the DIRAC File Catalogue to the VOs still using the LFC as soon as possible.

2) SNO+ and NA62 tape data loss.
46 files have had to be declared lost from Castor tape: 10 belonging to SNO+ and 36 to NA62. The files were written to Castor successfully. However, before they were migrated there was a Castor problem that removed the flag telling Castor to copy them to tape. They therefore never got put on tape and were eventually deleted from disk. We are implementing extra checks to catch this in future.

3) ClusterVision memory upgrade
Memory was shipped on Friday and is arriving today. The machines are not in production yet, so it is straightforward to upgrade them. This should be completed this week. They will then be added into the Ceph dev cluster for a few days before being deployed into Echo. Echo is not short of spare capacity at the moment.

4) Status of 2018 Procurement
I have arranged for two members of the SBS procurement team to visit RAL on Thursday this week. The aim is that after that meeting the tenders will be ready to submit. The only issue so far has been a warning from SBS that awarding the contracts to two separate vendors might be open to challenge. This risk is no different from previous years.

5) DUNE Transfer testing
a) It appears that DUNE were able to accidentally delete the S3 bucket (in Echo) with all their data in. They don’t remember doing it but our logs show the command coming from the same source as many other commands. We suspect it was a typo on their part. We are looking into ways of reducing the risk of this happening (but if a production user wants to delete all their data they can!).

b) We have reached a stumbling block with the DUNE transfer testing with Echo S3. DUNE have file sizes frequently up to 8GB and these are failing. This is because the S3 API doesn’t allow uploads greater than 5GB without using multi-part uploads (it is recommended you use multi-part uploads when going above 100MB). The gfal commands (and DynaFed) do not currently support this and development effort is needed. We are evaluating possible solutions (there are plenty of things that will work; they are just not traditional ‘Grid’ solutions).
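
To illustrate the size limits described in 5b, the sketch below plans how a large file would be split for a multi-part upload. It is only an illustration, not what Echo, gfal or DynaFed actually do: the limits (5GB single upload, 5MB minimum part, 10,000 parts) are those documented for Amazon S3 and are assumed here to apply to Echo's S3 gateway too, sizes are treated as binary units, and the names `needs_multipart` and `plan_parts` are hypothetical helpers.

```python
from __future__ import annotations

import math

MIB = 1024 ** 2
GIB = 1024 ** 3

# Limits as documented for Amazon S3; assumed (not confirmed) for Echo's S3 gateway.
MAX_SINGLE_PUT = 5 * GIB   # largest object accepted in one upload request
MIN_PART_SIZE = 5 * MIB    # smallest allowed part (the last part may be smaller)
MAX_PARTS = 10_000         # maximum number of parts per multi-part upload


def needs_multipart(file_size: int) -> bool:
    """True if the file is too large for a single upload request."""
    return file_size > MAX_SINGLE_PUT


def plan_parts(file_size: int, part_size: int = 100 * MIB) -> list[tuple[int, int, int]]:
    """Return (part_number, offset, length) for each part of the upload."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the minimum part size")
    # Grow the part size if the requested one would need too many parts.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))
    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    size = 8 * GIB  # the file size quoted for DUNE
    plan = plan_parts(size)
    print(needs_multipart(size), len(plan), plan[-1])
```

With the default 100MB part size, an 8GB file is split into 82 parts, the last one shorter than the rest. A real client would then upload each byte range as one part and complete the upload once all parts have succeeded.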

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
There was an MB last week and DB summarised:
a) It was noted at the beginning that ProtoDUNE is now using LHCOPN.
b) The next WLCG HSF joint meeting will be on the Eastern Seaboard of the US, possibly w/c 18 March 2019 (potential clash with US computing meeting).
c) Discussion of a new security issue, L1 Terminal Fault (Spectre/Meltdown ‘reloaded’); there is no public knowledge of that yet. CERN, CentOS 7 and other patches exist but require a machine reboot, which may have some impact.
d) There was an update on storage accounting, but this was challenging to hear; it appears to be work in progress.

SI-7 External Contexts (PC)
---------------------------
PC and DB were at the scientific computing forum at CERN last week. There was a presentation on tape and disk storage that was interesting but inconclusive. The technology seems fine, but there is concern over the small number of suppliers, and the recommendations are not clear. DB noted the forthcoming submission of large research grants (GridPP6), so will continue to progress that based upon existing knowledge.
HPC use in HEP: the backdrop to this is that the US is being forced to take this route despite it not being the optimal use of our computing. They are discussing a 1000 Petaflop machine, compared to the UK aspiration of a 10 Petaflop machine.
DB mentioned a project called IRIS-HEP in the US which relates to manpower ($25M over 5 years to create software); the PI is Peter Elmer.
There has been discussion of DUNE and WLCG doing things in common, with the same technical infrastructure but different high-level functions, as the PMB has previously discussed.

REVIEW OF ACTIONS
=================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
675.2: RJ to sign off report on Tier-1 LHC usage. (UPDATE: AD will supply monthly figures on pledges & usage) Ongoing.
678.1: RJ to finalise the Experiment Support background document by end September.
678.2: DK to finalise the Security, Trust and Identity background document by mid October.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September.
678.4: SL to finalise the Tier2 background document by end September.
678.5: JC to finalise the Storage background document by end September.
(UPDATE: 17 October meeting with Tony Medland – DB and PC will attend) Ongoing.
678.6: DB will send an email to the Collaboration Board. (UPDATE – response received from Durham – UK Phenogrid had used a lot of resource over summer but Durham was installing £100K h/w which would repay this to the research pool). Done.
678.7: DB, PG and GR will discuss how GR can take forward Pledges. Done.

ACTIONS AS OF 24.09.18
======================
644.4: AD will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
667.2: PG will do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
675.1: DC to sign off report on Tier-1 LHC usage. Ongoing.
675.2: RJ to sign off report on Tier-1 LHC usage. (UPDATE: AD will supply monthly figures on pledges & usage) Ongoing.
678.1: RJ to finalise the Experiment Support background document by end September. Ongoing.
678.2: DK to finalise the Security, Trust and Identity background document by mid October. Ongoing.
678.3: AD to finalise the Tier1 background document, including tape strategy by end September. Ongoing.
678.4: SL to finalise the Tier2 background document by end September. Ongoing.
678.5: JC to finalise the Storage background document by end September.
(UPDATE: 17 October meeting with Tony Medland – DB and PC will attend) Ongoing.
680.1: DB will write to AD and RJ regarding Atlas working on the EOS system at Birmingham.
680.2: JC will follow up GDPR implications relating to VOMS with DK.