General updates
|
Monday 7th November
Monday 31st October
- Please register for the HEPSYSMAN meeting next week. Agenda.
- The APEL team has scheduled a downtime for 10:30am - 12:30pm (UTC) on 1st November. This is to allow us to upgrade our machines with a new kernel.
- There is an extension of the paper call for the International Symposium on Grids and Clouds (ISGC) 2017.
- The next WLCG GDB is on 9th November. The agenda can be found here.
- GridPP website: AM informs us the web server VM and the hypervisor are now running shiny new kernels. He undertook some preliminary checks and the pages, Wiki, database etc all look ok, but we should inform him of any issues observed.
- Thread related to: Change to Approved VOs (and RPMs). What were the conclusions?
Monday 24th October
- Message to GridPP CB regarding Tier-2 hardware spend allocations. GridPP will use the second table.
- Minutes are available from today's WLCG ops meeting.
- The APEL team have notified us that: The APEL Accounting Repository has been doing some internal data processing of its cloud data. To let this run at maximum speed, the summarising of grid sites has been suspended. As a result the portal is not being updated and the Apel.Pub tests show several days since your site last published.
- HEPiX took place last week. See the detailed agenda for more information.
- Request for Grid Engine site to test a new implementation of the machine job features plug-in.
Monday 17th October
- US: PNNL LHCONE system outage planned October 17-21
- There will be an Operations Portal OTAG meeting to discuss new requirements between now and the end of EGI-Engage.
- APEL Accounting - Please Schedule Republishing: The APEL Service is having problems caused by sites republishing large numbers of jobs. This can either be due to conscious gap publishing or to fixing a problem that had existed for some time (eg out of date certificate, parser cron not running).
- On a general note... The UK HPC facility (Archer) is labelled in production to accept ATLAS jobs but is not in general use. This inconsistency has been discussed elsewhere and was passed to ATLAS for discussion.
- ECDF tests: As the site is now exclusively SL7 they are failing some of the SAM probes that were not designed for this OS. The problematic probe is "org.lhcb.WN-sft-lcg-rm-gfal". Other sites may be hit in a similar way. It affected LHCb production results.
- T2 reliability & availability for September 2016:
- There was a report that the "UI WN Tarball" was "Missing vomses config in CVMFS UI". Matt has fixed it.
- Duncan circulated a request to update https://www.gridpp.ac.uk/wiki/IPv6_site_status
- There will be a Campus Network Engineering for Data Intensive Science workshop, 19th October, in London.
- ATLAS - APEL accounting records comparison from Alessandra: The accounting TF has now established a dashboard to compare results from the experiments' accounting records with the APEL results. This contains all the experiments' latest values, which are updated each month with the numbers of the previous finished month. There is also a historical view; these are the UK sites from January until August.
- The WLCG workshop took place last week - agenda
- HEPiX takes place this week - agenda.
Monday 26th September
|
WLCG Operations Coordination - Agendas Wiki Page
|
Monday 7th November
Tuesday 25th October
Monday 3rd October
- There was a WLCG ops coordination meeting last week: Agenda. (Good to review in the ops meeting).
Monday 26th September
Monday 19th September
|
Tier-1 - Status Page
|
Tuesday 8th November
A reminder that there is a weekly Tier-1 experiment liaison meeting. Notes from the last meeting here
- Still some mopping up after CVE-2016-5195
- The CVMFS Stratum0 server has been replaced with newer hardware.
- Intervention by Oracle on the Tier1 tape library went OK last Wednesday.
- Owing to staff availability the upgrade of Castor to version 2.1.15 is being scheduled to take place in January.
|
Storage & Data Management - Agendas/Minutes
|
Wednesday 09 Nov
- Storage related issues from hepsysman?
- ...
Wednesday 02 Nov
- Big picture: an attempt to explain where GridPP fits with other things such as "UKT0" and other infrastructures
Wednesday 26 Oct
- Feedback from WLCHEPiXG - don't miss it!
Wednesday 19 Oct
- Long list of loose ends - accounting and information systems, IPv6 surprising successes
Wednesday 12 Oct
- Initial impressions from WLCG workshop and CHEP-so-far
- Coming events where GridPP storage-and-data-management could be, will be, or should be (re)presented.
|
Tier-2 Evolution - GridPP JIRA
|
Monday 24th October
- Started HTCvcm (HTCondor Vacuum VM), using ATLAS VMs as the starting point, to provide a generic HTCondor client VM that will connect to HTCondor pools run by the local site, experiments, or larger sites.
- Merging LHCb multipayload VM code into DIRAC Pilots repo.
- CernVM team updated CernVM to use kernel with fix for CVE-2016-5195 ("DirtyCOW")
Monday 17th October
- Validation of APEL accounting of VM resources and VM-only sites has been completed.
- From 4th October: A Lightweight sites questionnaire for WLCG sites has been circulated. The aim is to get to a "matrix" of approaches that sites can choose from, depending on criteria that are covered in the questionnaire.
Tue 10 Oct
- Vac-in-a-Box 00.34 supports Vac 01.00 itself rather than pre-release (note that upgrading to 01.00 requires a reboot due to network layout changes)
- "Vacuum Platform" specification published as HSF-TN-2016-04
Wed 05 Oct
|
Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06
|
Monday 26th September
- A problem with the APEL Pub and Sync tests developed last Tuesday and was resolved on Wednesday. This had a temporary impact on the accounting portal.
Tuesday 14th June
- GridPP accounting switched to use the 'new' EGI accounting portal.
- APEL delays from UK sites look about 'normal' (i.e. delays are typical).
Tuesday 9th February
- 4th Feb: The data from the APEL summariser that was fixed yesterday has now propagated through the data pipeline and the Accounting Portal views and the Sync and Pub tests are all working again.
- Sheffield is slightly behind other sites (but looks normal) and so is QMUL.
|
Documentation - KeyDocs
|
Tue 1 Nov
Publishing tutorial updated to use new wording for various measurements.
https://www.gridpp.ac.uk/wiki/Publishing_tutorial#Accounting_transmissions
Tue 20th Sept
GridPP Approved VOs now has a link to RPM versions of the VOMS records. They are available for now via the VOMS RPMS Yum Repository. The latest version, which is consistent with the YAIM records in the Approved VOs doc, is 1.0-1. The plan is that when VO records change, the Approved VOs doc version will be incremented, and RPMs of the changed VOs (only those) will be released carrying the same version stamp as the document. Thus a site that upgrades to "latest" will get the records compatible with the newest version of the GridPP Approved VOs document.
Note: a typical RPM contains the following:
[sjones@hep169]$ rpm -qlp gridpp-voms-dteam-1.0-1.noarch.rpm
/etc/grid-security/vomsdir/dteam
/etc/grid-security/vomsdir/dteam/voms.hellasgrid.gr.lsc
/etc/grid-security/vomsdir/dteam/voms2.hellasgrid.gr.lsc
/etc/vomses/dteam-voms.hellasgrid.gr
/etc/vomses/dteam-voms2.hellasgrid.gr
/root/vo_xml/dteam.xml
The vomsdir (lsc) files (which list the DNs and CA DNs of acceptable certificates) and the vomses files (which give the coordinates of VOMS servers of various VOs) are provided in the normal locations, as if they had been created by YAIM. No other features of YAIM are facilitated by these RPMs. Thus they are useful for migrating from YAIM, but do not provide all the functions of YAIM such as setting SW dirs or other ENV vars etc.
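As an illustration of what the two kinds of file hold, here is a minimal Python sketch that renders the text of a vomses entry and an .lsc file and maps them onto the paths seen in the RPM listing above. It is not how the RPMs are built. The hostname, port, and DNs are placeholders invented for the example, and the `VomsServer` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VomsServer:
    vo: str       # VO name, e.g. "dteam"
    host: str     # VOMS server hostname
    port: int     # port the server uses for this VO
    host_dn: str  # subject DN of the server's host certificate
    ca_dn: str    # DN of the CA that issued that certificate


def vomses_line(s: VomsServer) -> str:
    """One vomses entry: "alias" "host" "port" "server DN" "VO name"."""
    fields = (s.vo, s.host, str(s.port), s.host_dn, s.vo)
    return " ".join(f'"{f}"' for f in fields)


def lsc_text(s: VomsServer) -> str:
    """An .lsc file: the host certificate DN, then the DN of its CA."""
    return f"{s.host_dn}\n{s.ca_dn}\n"


def file_layout(s: VomsServer) -> dict:
    """Map target paths (same layout as the RPM listing) to file contents."""
    return {
        f"/etc/vomses/{s.vo}-{s.host}": vomses_line(s) + "\n",
        f"/etc/grid-security/vomsdir/{s.vo}/{s.host}.lsc": lsc_text(s),
    }


# Placeholder values only: not the real dteam server details.
example = VomsServer(
    vo="dteam",
    host="voms.example.org",
    port=15004,
    host_dn="/DC=org/DC=example/CN=voms.example.org",
    ca_dn="/DC=org/DC=example/CN=Example CA",
)

if __name__ == "__main__":
    for path, text in file_layout(example).items():
        print(path)
        print(text, end="")
```

The sketch only produces strings; writing them to disk, and everything else YAIM used to do, is deliberately left out.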
Tue 6th Sept
Benchmarking procedure. Contains instructions for ARC/Condor, CREAM/Torque, VAC. Needs to be updated for use with other systems.
https://www.gridpp.ac.uk/wiki/Benchmarking_procedure
Mon 1st Aug
LZ VO now up to date in portal, and will be updated in Approved VOs automatically from now on. Sites supporting LZ are advised to read LZ VOMS settings section of https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs (which is between LSST and MAGIC!)
Tue 26th July
Elena has provided VOMS info for DUNE. I'm maintaining it by hand, at present, similarly for LZ.
Both should be present and correct in the Operations Portal, but are not.
https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs
General note
See the worst KeyDocs list for documents needing review now and the names of the responsible people.
|
Interoperation - EGI ops agendas
|
Monday 7th November
- There was an EGI Ops meeting today: agenda
- UMD 3.14.5 released today
- VOMS 3.5.0, which makes RFC proxies the default for voms-proxy-init
- UMD 4.3.0 'October' release, release candidate ready, to be released by end of this week, including:
- ARC, GFAL2, XROOT, Davix, dCache, ARGUS, Gridsite, edg-mkgrid, umd-release for CentOS7
- please start using UMD4/SL6 or UMD4/CentOS7 instead of UMD3/SL6 & please don't use EMI3 any more
- Think there may be a campaign around this soon
- Downtimes due to the vulnerability CVE-2016-5195: request an A/R recomputation
- All the resource centres that were affected by the vulnerability CVE-2016-5195 and that declared a downtime between 2016-10-20 16:00 UTC and 2016-10-31 18:00 UTC are invited to request a recomputation of A/R figures for the days in which the downtime was ongoing.
- ARGO proposal to use GOCDB as the only source of topology information
- VAPOR 2.1 released in September, it replaces GSTAT
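For sites preparing an A/R recomputation request for CVE-2016-5195, the following Python sketch lists the UTC days on which a downtime was ongoing inside the window quoted above (2016-10-20 16:00 UTC to 2016-10-31 18:00 UTC). It assumes one reading of the announcement, namely that only the part of a downtime falling inside the window counts. The function name and the sample downtime are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Window taken from the EGI announcement quoted in the notes.
WINDOW_START = datetime(2016, 10, 20, 16, 0, tzinfo=UTC)
WINDOW_END = datetime(2016, 10, 31, 18, 0, tzinfo=UTC)


def recomputation_days(dt_start, dt_end):
    """Return ISO dates (UTC) on which the downtime was ongoing in the window.

    Both arguments must be timezone-aware datetimes. The downtime is
    clipped to the window first (an assumption about how to read the
    announcement).
    """
    start = max(dt_start, WINDOW_START)
    end = min(dt_end, WINDOW_END)
    if start >= end:
        return []  # no overlap with the window
    # A downtime ending exactly at midnight does not touch that new day.
    last_day = (end - timedelta(microseconds=1)).date()
    days = []
    day = start.date()
    while day <= last_day:
        days.append(day.isoformat())
        day += timedelta(days=1)
    return days


if __name__ == "__main__":
    # Hypothetical downtime, for illustration only.
    print(recomputation_days(
        datetime(2016, 10, 21, 9, 0, tzinfo=UTC),
        datetime(2016, 10, 24, 12, 0, tzinfo=UTC),
    ))
```

The resulting list of days is what would go into the recomputation request; how the request itself is submitted is not covered here.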
|
Monitoring - Links MyWLCG
|
Tuesday 1st December
Tuesday 16th June
- F Melaccio & D Crooks decided to add a FAQs section devoted to common monitoring issues under the monitoring page.
- Feedback welcome.
Tuesday 31st March
Monday 7th December
|
On-duty - Dashboard ROD rota
|
Monday 17th October
- Mostly quiet. We've got six outstanding tickets, four of which have been there for a while. There's one new ticket against Liverpool's ARC-CEs. The final one is there purely to silence the availability alarm at EFDA-JET until the decommissioning process is complete.
Monday 19th September
- Fairly quiet week, with just the usual suspects.
- ROD responses received.
Monday 22nd August
- Unusually quiet week. Nothing significant.
- Portal very slow at times though.
- New rota meet-o-matic request circulated - ROD team members please respond this week!
Monday 28th June
- The ROD rota needs to be updated.
|
Rollout Status WLCG Baseline
|
Tuesday 7th December
- Raul reports: validation of site BDII on CentOS 7 done.
Tuesday 15th September
Tuesday 12th May
- MW Readiness WG meeting Wed May 6th at 4pm. Attended by Raul, Matt, Sam and Jeremy.
References
|
Security - Incident Procedure Policies Rota
|
Tuesday 1st November
- Dirty COW vulnerability - CVE-2016-5195
- A small number of sites are still showing in the EGI monitoring after the deadline. Please check the OPS Portal and, if you get a ticket, acknowledge it promptly with an explanation and a plan for the update.
- The Dutch cybersecurity center (CERT of Dutch government) just published its annual threat assessment report. It gives a very good overview of trends in threats and actors, highly recommended reference material!
- EGI Security Policy Group meeting this week [1]
Tuesday 25th of October
- Dirty COW vulnerability - CVE-2016-5195
- Sites are asked to act to mitigate this as soon as possible - see the advisory. Hopefully by the time the meeting comes we'll have more information on an SL fix (SL7 is available) - when this comes sites will have 7 days to update. Sites not able or wanting to apply mitigation before official patches are available have the option to go into Downtime, without penalty of loss of availability (until 3 days after official patches are available, agreed by EGI Operations).
- EGI-SVG-2016-11476 (canl-c)
- One or two sites still popping up on the monitoring each week.
Monday 17th October
- Due to problems with pattern matching filters, Pakiti was not complaining about some instances until recently. This was in connection to vulnerability EGI-SVG-2016-11476.
- There was an EGI Trust Anchor release 1.78-1. Please upgrade by 2016.10.18 at your earliest convenience. Please check the release notes for more details
- FedCloud Sites have received a 'Heads Up'.
- We are down to a few SL5 services.
Tuesday 4th October
- Sites not upgrading for EGI-SVG-2016-11476 have now been ticketed by EGI CSIRT. Although WNs are not thought to be vulnerable, sites are asked to upgrade them as an indicator of compliance elsewhere. No UK sites have been ticketed _BUT_ it looks like the monitoring is only working through CREAM CEs so there may be ARC CE installations that would show "vulnerable" if/when the monitoring is fixed. Please check.
- Some random stuff from DI4R
- Keynotes (again) on European Open Science Cloud and Human Brain Project
- Good run through of pilots/plans for "Enabling federated login to WLCG" [2]
- Anybody developing software might like to look at OSG's Rob Quick's presentation on the Software Assurance Marketplace SWAMP, a free software QA tool which can help improve code quality and security.
- Summary of the "WISE people take action on Security" workshop
- High level stuff on activities around procurement of infrastructure from public cloud providers [3]
- Bruce Becker gave a good/amusing/thoughtful lightning talk on managing distributed infrastructure in Africa [4]
Tuesday 27th September
- One "critical" risk vulnerability EGI-SVG-2016-11476 reported 21/09. Updates should be applied by 29/09.
- IGTF CA distribution 1.77 release 1 is now available for download from the Repository (and mirrors) [5]
- Matt Doidge has joined the UK NGI security team. Thanks Matt - Ian.
- WISE Security for Collaborating Infrastructures workshop @ DI4R conference 27/09/2016 [6]
The EGI security dashboard.
|
|
Services - PerfSonar dashboard | GridPP VOMS
|
- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).
Tuesday 25th October
- Duncan has recreated the UK perfSONAR mesh. Link here!
Monday 19th September
- UK eScience CA - certificate issuance problems. Jens reported that on 15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was not correctly configured.
- A large number of site admins and other GridPP supporters appeared to be suspended from the dteam VO last week. “During a planned upgrade operation of VOMS service, a system malfunction occurred. As a result, some users received false notification about membership expiration. We are in contact with the software development team in order to identify the cause.”
|
Tickets
|
Monday 31st October 2016, 15.40 GMT
29 Open UK Tickets this week.
Tier 1
124244 (5/10)
LHCB having cvmfs-ish problems, no news for a while (since the day of submission). In progress (5/10)
124606 (24/10)
CMS consistency checking ticket. The ticket asks for lists of LFNs, perhaps it got lost in the noise of last week? The submitter is getting restless. In progress (24/10)
124478 (17/10)
A WMS ticket from an na62 user - this ticket is in a weird limbo as it calls for help from a WMS support unit which of course no longer exists. This ticket risks getting stuck.
Also Dan asks in the ticket how to get na62 added to the list of VOs in GGUS - I'll look into this (unless someone has that information handy?).
ECDF
124592 (22/10)
LHCB problems with an ECDF arc ce. Andy thought he had fixed things, but asked for confirmation a week ago. Waiting for reply (24/10)
Imperial
124241 (5/10)
NA62 having problems with the IC WMS. Daniela asked if there's another UI the Imperials can use to test things as they could not reproduce the error with their UI. Waiting for reply (17/10)
Liverpool
123962 (19/9)
John's schooling of Biomed in the art of Spacetokening continues with the creation of the spacetoken BIOMEDDISK. Perhaps if other sites are supporting biomed on their SEs they could follow suit as an incentive to biomed? In progress (31/10)
IPv6 Perfsonar
124487 (Oxford)
124616 (Durham)
Some sites appear to be having problems with their IPv6 - but teething problems are expected I suppose. Oliver asks for Durham if reverse DNS is needed for IPv6 mesh tests to work (my thought is yes, but I'm often wrong).
|
Tools - MyEGI Nagios
|
13th September 2016
19th July
Both instances of gridppnagios at Oxford and Lancaster have been decommissioned.
12th July 2016
The central ARGO monitoring service started on 1st July. All grid resources are monitored through two Nagios instances:
https://argo-mon.egi.eu/nagios/
https://argo-mon2.egi.eu/nagios/
They have the same interface as gridppnagios. Alarms from these instances go to the Operational Dashboard.
http://argo.egi.eu/ is a web interface which provides availability/reliability figures and site status. It is the equivalent of the old myegi interface with some additional services.
I am planning to decommission both instances of gridppnagios in the coming weeks. I have stopped nagios and httpd on both instances so they will not send tests to grid resources in the UK. I will also decommission storage-monit.physics.ox.ac.uk, which was only used for the storage replication test.
We will keep vo-nagios.physics.ox.ac.uk running until we get a replacement for vo-monitoring.
Monday 13th June
- Active Nagios instance moved to Lancaster
Tuesday 5th April 2016
Oxford had a scheduled network warning so active nagios instance was moved from Oxford to Lancaster. I am not planning to move it back to Oxford for the time being.
Tuesday 26th Jan 2016
One of the message brokers was in downtime for almost three days. Nagios probes pick a random message broker and failover is not working, so a lot of ops jobs hung for a long time. It's a known issue and unlikely to be fixed as SAM Nagios is on its last legs. Monitoring is moving to ARGO and many things are not clear at the moment.
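To make the failure mode concrete, here is a generic Python sketch of the behaviour that was missing: pick a broker at random, but move on to the next one if it is unreachable. This is not the SAM Nagios code. The function name, the `send` callback, and the broker names are all hypothetical.

```python
import random


def send_with_failover(brokers, send, rng=random):
    """Try brokers in random order; return the first one that accepts.

    `send` is a caller-supplied function taking a broker name and raising
    ConnectionError when that broker is unreachable (an assumption made
    for this sketch).
    """
    candidates = list(brokers)
    rng.shuffle(candidates)  # random first choice, as the probes do
    failed = []
    for broker in candidates:
        try:
            send(broker)
            return broker
        except ConnectionError:
            failed.append(broker)  # remember it and try the next broker
    raise ConnectionError(f"all brokers failed: {sorted(failed)}")


if __name__ == "__main__":
    def fake_send(broker):
        # Pretend one broker is in downtime.
        if broker == "mb-a.example.org":
            raise ConnectionError(broker)

    print(send_with_failover(["mb-a.example.org", "mb-b.example.org"], fake_send))
```

Without the loop, a probe that happens to pick the broker in downtime simply hangs or fails, which matches the behaviour reported above.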
Monday 30th November
- The SAM/ARGO team has created a document describing the availability/reliability calculation in the ARGO tool.
|
VOs - GridPP VOMS VO IDs Approved VO table
|
Tuesday 19th May
- There is a current priority for enabling/supporting our joining communities.
Tuesday 5th May
- We have a number of VOs to be removed. Dedicated follow-up meeting proposed.
Tuesday 28th April
- For SNOPLUS.SNOLAB.CA, the port numbers for voms02.gridpp.ac.uk and voms03.gridpp.ac.uk have both been updated from 15003 to 15503.
Tuesday 31st March
- LIGO are in need of additional support for debugging some tests.
- LSST now enabled on 3 sites. No 'own' CVMFS yet.
|
Site Updates
|
Tuesday 23rd February
ALICE:
All okay.
RHUL 89%:89%
Lancaster 0%:0%
RALPP: 80%:80%
RALPP: 77%:77%
- RHUL: The largest problem was related to the SRM. The DPM version was upgraded and it took several weeks to get it working again (13 Jan onwards). Several short-lived occurrences of running out of space on the SRM for non-ATLAS VOs. For around 3 days (15-17 Jan) the site suffered from a DNS configuration error by their site network manager which removed their SRM from the DNS, causing external connections such as tests and transfers to fail. For one day (25 Jan) the site network was down for upgrade to the 10Gb link to JANET. Some unexpected problems occurred extending the interruption from an hour to a day. The link has been successfully commissioned.
- Lancaster: The ASAP metric for Lancaster for January is 97.5 %. There is a particular problem with ATLAS SAM tests which doesn’t affect the site activity in production and analysis and this relates to the path name being too long. A re-calculation has been performed.
- RALPP: Both CMS and LHCb low figures are due to specific CMS jobs overloading the site SRM head node. The jobs should have stopped now.
|
|