GridPP PMB Meeting 580

GridPP PMB Meeting 580 (23.11.15)
=================================
Present: Dave Britton (Chair), Pete Gronbech, Tony Cass, Andrew Sansum, Jeremy Coles, Steve Lloyd, Gareth Smith, Peter Clarke, Dave Kelsey, Dave Colling, Tony Doyle, Andrew McNab, Roger Jones, (Minutes – Louisa Campbell)

Apologies: Pete Clarke, Claire Devereux.

1. GridPP4+ h/w grants status
==============================
PG update – 16 x grants have been provided for in JES and 14 are currently with STFC so this has progressed quickly. Two institutes remain outstanding – PG confirms they are almost ready to submit. STFC are already confirming approval which is a very quick turnaround.

2. Update on ALICE requirements (PG)
====================================
Following on from last week’s PMB, PG has emailed ALICE regarding the current position and advised that they will not be asked to delete data. We are currently checking whether there is any space to expand but cannot yet confirm that. DB awaits a response from AS, who was not available last week. We need to check capacity but also need realistic requirements from the other experiments as input. AS suggests he can run various models, but these may give different results. It is not clear if ALICE are just writing data or whether they are actually using it as well. If the latter, there will be an impact on tape drives if they are in constant use rather than archiving. GS mentioned we have been exporting a good deal recently and it seems that ALICE is a significant component of this.

ACTION 580.1 AS will look at tape planning to determine if we want to take forward increasing ALICE space and if this can be easily accommodated in all the scenarios.

3. Other grants
===============
PG confirms he has made enquiries at STFC and they advised we will very soon have a response on the GridPP5 grant.

4. AOCB
=======
a) Current status of IC GridPP Cloud, following staff changes?
—————————————————————–
Adam used to look after this and has since moved on. It was previously planned to join the EGI cloud, but it was unclear whether that happened. AM confirmed it did not: there were issues and incompatibilities, and although we were one round of testing away from being certified as a member, this did not progress. Since then IC have built and are testing a new Cloud. It is not clear if OCCI is operational, but our resources from the existing Cloud will be incorporated. If we are to join the EGI cloud it is important to ensure this is possible, but it has never had a significant impact on any of the experiments. The original Cloud remains operational but is currently reduced to 2 machines with a small amount of resource behind it; the remaining resource is on new machines. Some jobs, e.g. AMcN's and a couple of others, are running on the old Cloud. We can move them over to the new Cloud and maintain the old one on minimum resources until all remaining jobs have been moved. Virtual private clouds don’t work with the EC2 interface, which our customers use and expect, so it is essential to ensure this is robust.

The EGI cloud mailing list is still active, and they are allowing sites to join and use an OpenStack storage space. There is ongoing discussion on whether to allow the OpenStack API and we should consider the feasibility of this. It is worth considering joining the OCCI cloud again on that basis and re-joining the mailing list. DB agrees this would be interesting as it could negate the long-running OCCI issues experienced. Oxford and Imperial are thought to have put one up. More discussion is needed on whether to pursue this.

b) WLCG taskforce
——————–
A WLCG taskforce is looking at information and what is in REBUS. They may propose changes to the way REBUS works and is used, and are drafting a future use document. GridPP appeared in the current document and we are being asked if we want any input into the new one. Day-to-day use is well covered but high-level use, e.g. PMB use, is not. DB suggests it is often challenging to find what is required through REBUS: the information is usually there but difficult to locate, so it could be presented better. One suggestion is to give sites a way to supply best estimates of their expected capacity requirements in the future rather than pledges, which are not always accurate; this would be very helpful for planning. There is also some ambiguity about listing machines’ capacity if they are switched on. DB notes an inconsistency between the information available on some Tier-1 sites and the REBUS information; this could be explained by equipment being listed but switched off. Some individuals on experiments (e.g. ATLAS and CMS) are pushing for per-site pledges, but it is thought preferable for us to pledge at Tier-2 level to ensure flexibility in resolving issues at that level; inputting site-level data would require a great deal of work. The more fine-grained pledges become, the more conservative they become. It may not be appropriate to announce full capacity but we can announce the fraction of the pledge we are intending to provision at each site.

ACTION 580.2 AMcN will send round an email to PMB to ask for suggestions for inclusion.

c) PC suggestion for F2F meeting in January/February
——————————————————
DB suggests the proposed timing is too challenging to accommodate due to a multitude of other commitments and meetings as well as start-of-term teaching. We have arranged an F2F meeting in Pitlochry in April (GridPP36) so, unless something pressing arises, this suggestion can be kept under review.

5. Standing Items
===================
SI-0 Bi-Weekly Report from Technical Group (DC)
——————————————-
No report as last week’s technical meeting was cancelled.

SI-1 Dissemination Report (SL)
——————————————-
##GridPP Dissemination Officer Notes for PMB

###Providing “Letters of Recommendation” for SME users

During the GridPP UserGuide testing process, two users were asked for “Letters of Recommendation” in order to be granted grid certificates. TW queried this policy via the UK CA support service and got the reply featured in [1]. So a question for the PMB – can TW supply this letter of recommendation on behalf of GridPP for potential SME users, i.e. specifying GridPP as the “academic project” that entitles them to a grid certificate?

###GridPP Website 2.0

####GridPP Travel policy

TW has updated GridPP travel policy [2].

Thoughts and suggestions, as ever, appreciated.

[1] – reply from UK CA support: Note that the RA Operator not only confirms the identification of the person (face to face visit with a PhotoID they have a reasonable chance of checking isn’t fake), but also that they are entitled to a UK eScience Certificate. I don’t have the exact wording of the entitlement to hand, but it is along the lines of “involved with a UK eScience/Grid/Cloud project”. As such an arbitrary SME person can’t just roll up and demand a certificate (not that it’d be much use to them in any case since pretty much only eScience/Grid/Cloud servers/services trust them – not browsers for instance). I’m guessing that the RA Operator in question wants to ascertain what they need the certificate for so since they aren’t in UK HE/FE then their entitlement will likely revolve around their association with an academic project that requires their involvement and them having a certificate; in which case the project PI or Co-PI would be a good person.

[2] https://vm36.tier2.hep.manchester.ac.uk/collaboration/travel/

ACTION 580.3 AS will investigate discrepancy between what Tom is being sent and realities for certificates/authentication.

ACTION 580.4 After DB updates the Visit Notices wording, a message will be sent to PMB to use the site immediately to test for any issues; if all works well, fully move ahead with the new website.

SI-2 ATLAS Weekly Review and Plans (RJ)
—————————————
Nothing to report globally. Locally, ATLAS were hit with a rogue programme that generated lots of requests which filled up the log.

SI-3 CMS Weekly Review and Plans (DC)
——————————————-
Nothing to report.

SI-4 LHCb Weekly Review and Plans (PC)
——————————————-
Queen Mary is asking whether it is possible to become a T2-D (Tier-2 with Data) site for LHCb. AM has been asking around whether this is viable and it appears to be. LHCb is attempting to remove SRM use, and AM believes QM should still be viable as a T2-D site for LHCb despite this.

SI-5 Production Manager’s report (JC)
——————————————-
1. We continue to review ‘Other VO’ engagements at the weekly ops meeting. Several of the VOs/groups have infrequent updates but we continue to see steady progress with HPC DiRAC (now archiving codes and .svn directories by tar’ing and gzip’ing them first which is an important factor in reducing the number of files) and LSST (just addressing an issue with expiring proxies – FNAL have agreed to 7 day proxies).

2. Andrew McNab has introduced a new depo.gridpp.ac.uk service for uploading files via HTTPS.

3. We continue to see annoying low availability alarms in the ROD dashboard, but the rolling average for the availability has apparently reduced to 20 days which is a slight improvement.

4. There is some IPv6 progress across GridPP sites but the lack of a clear timeline for the production service availability of IPv6 nodes is reducing traction with University networking teams. The latest view is in https://www.gridpp.ac.uk/wiki/IPv6_site_status.

5. Sites are being asked to patch for a Network Security Services vulnerability.

6. An overview of WLCG-wide operations activities can be found in the minutes from the last Operations Coordination Team meeting: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes151119. One item under consideration is the future of the Information System (taking account of experiment needs and plans for Glue 2.0). Details can be found under https://indico.cern.ch/event/454975/attachments/1188757/1724809/ISTF-minutes-12112015.pdf.

7. There was an outage of the GridPP website last Tuesday. There appears to have been a configuration error in the Manchester name servers.

8. Tony Price informed me that: “We are applying for a follow on grant for Pravda and I have been asked about computing costs. I know that we currently do not pay for services but is this something which is likely to change over the next 12 months. My bosses are extremely pleased with the CPU time gridpp enables us to use and we do not want to lose this.” I will find out how much resource was actually (found to be) useful.

ACTION 580.5 JC to provide information on how much resource was found to be useful for Pravda.
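
Item 1 of the Production Manager’s report notes that HPC DiRAC now tars and gzips code and .svn directories before archiving, which reduces the number of files. A minimal sketch of that idea, assuming Python’s standard tarfile module; the function name and paths are hypothetical and this is not DiRAC’s actual tooling:

```python
import os
import tarfile


def bundle_directory(src_dir, archive_path):
    """Pack src_dir into one gzip-compressed tarball.

    Hypothetical helper for illustration only. Returns the number of
    regular files that were bundled into the single archive file.
    """
    # Count the files that would otherwise be archived individually.
    file_count = sum(len(files) for _, _, files in os.walk(src_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path.
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    return file_count
```

The archive should be written outside the directory being bundled, so that the tarball does not try to include itself.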

SI-6 Tier-1 Manager’s Report (GS)
——————————————-
General:
– Generally a quiet week last week. We are chasing down systems that still need the RedHat crypt libraries updated.

Castor:
– No significant changes during last week. We made a parameter change (increase number of nameserver database connections) for LHCb as we try and fix a low level problem whereby some batch jobs fail to write their results into Castor.
– We have seen no further occurrences of the problem whereby servers in one particular batch/configuration (ClusterVision11 servers running SL6 in tape-backed service classes) show a problem during name lookups.

Networking:
– We continue keeping a close watch on some low-level packet loss within our network. We also continue with the changes needed to remove the old ‘core’ switch from the network. There was a site ‘warning’ for an hour last Wednesday for one of the steps – which went well.
– We have seen an increase in external traffic round the bypass link which was saturating during the first part of last week. The OPN has been busy in the last couple of months. One of the changes needed to remove the old core switch is a reconnection of the UKLight Router. We are looking to increase the bandwidth through this both between the Tier core network (increase from 20Gbit to 30 or 40Gbit); and double the bandwidth round the bypass route from 10 to 20Gbit. Work is ongoing regarding the removal of the UKLight router.

Batch:
We have not seen any more problems with ATLAS HammerCloud tests failing (loss of heartbeat), although the original cause is not understood.

Procurement:
Action: 579.6 GS to provide high level milestone dates for procurement, plans, timing:
– The CPU and Disk tender documents are visible (i.e. the tender is live). Dates when tenders went live: CPU: Friday 13th Nov; Disk: Tuesday 17th Nov.
– Tender set to close on 18th December.
– Review responses before Christmas and ask any questions of the vendors if needed.
– We have stated that we will inform the vendors of the result by the week beginning the 15th January. If we have the results sooner, the results will go out then.
– Delivery is expected within 8 weeks after the contract is placed.

There are issues regarding order placing and processing, but the REC is much quicker than previously. The new online system caused some extra work but this appears to be resolved. It would be helpful to have an explanation of HAG and of who is on the valuation team. There are some concerns over procurement that have recently been discussed at STFC.

SI-7 LCG Management Board Report of Issues (DB)
———————————————–
No report.

REVIEW OF ACTIONS
=================
578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. Ongoing.

578.8 DB will consider reputation risks of inadequate support of new VOs. Ongoing.

579.1 DB to write a paragraph with guidelines on how small sites are expected to use GridPP4+ h/w spends and circulate to PMB in the next few days for comment. Done.

579.2 DB to contact AS to comment on tape planning before agreeing that ALICE can use the existing 850TB (but we cannot increase this as 680 was agreed until end GridPP5). Done, but DB will modify slightly and add brief sentence regarding raising experiments that each site hosts. Done.

579.3 PG to keep Catalyn informed that ALICE won’t be asked to delete data and will be advised we are considering scope to increase data but this cannot be agreed at the moment. Done.

579.4 DB will adjust the wording on visit notices to make clear that staff are required to make the most cost-effective travel and accommodation arrangements. Ongoing.

579.5 CD will forward Karen Padmore’s contact details to SL for more information on potential SMEs. Done.

579.6 GS to provide high level milestone dates for procurement, plans, timing. Done.

ACTIONS AS OF 23.11.15
======================
578.2 AM and JC to investigate EGI community platforms to see whether VAC and possibly DIRAC could be registered. Ongoing.

578.8 DB will consider reputation risks of inadequate support of new VOs. Ongoing.

579.4 DB will adjust the wording on visit notices to make clear that staff are required to make the most cost-effective travel and accommodation arrangements.

580.1 AS will look at tape planning to determine if we want to take forward increasing ALICE space and if this can be easily accommodated in all the scenarios.

580.2 AMcN will send round an email to PMB to ask for suggestions for inclusion.

580.3 AS will investigate discrepancy between what Tom is being sent and realities for certificates/authentication.

580.4 After DB updates the Visit Notices wording, a message will be sent to PMB to use the site immediately to test for any issues; if all works well, fully move ahead with the new website.

580.5 JC to provide information on how much resource was found to be useful for Pravda.