Operations Bulletin 010914

Bulletin archive

Week commencing 25th August 2014

Task Areas

General updates

Monday 25th August

GridPP33 took place in Ambleside last week.
Our thanks to Sam and Mohit, Year in Industry students, who have now finished. There will be fewer ticket prompts until new students are in place.
Reminder of 17th March message: new VOMS servers for Ops and LHC experiments. The deadline is Monday 15th September. The experiment pre-prod instances will switch earlier. Already started: ALICE 23rd July; LHCb 22nd August. Pending ATLAS 28th August; CMS 28th August.
For CMS: transition Savannah to GGUS (CMS Computing Operations): September 1st - Disable submission of new tickets; September 30th - Close Savannah (still open issues will be transferred to GGUS).
ATLAS RIPE probes have been handed out to some GridPP sites; these sites should have received a welcome notification message.
A UK CA TAG meeting is planned for 3rd September. One discussion item concerns an opportunity to migrate the UK e-Science CA to a new commercial CA as part of a JANET agreement.
On 18th August the main DNS servers associated with the egi.eu domain were switched from Nikhef to CESNET.

Tuesday 12th August

Bristol suggests it is seeing connection problems mostly, but not exclusively, to US sites.
The GOCDB test server has been updated to v5.3.
GOCDB has received a new service type request for ‘egi.Perun’. Under the lightweight EGI review process, we are required to respond with any suggestions/issues before 14th August. Perun is used by the EGI Fed Cloud to manage users' access rights to cloud services. Every cloud VO therefore needs to be supported by Perun, which is why it has been requested that it be properly registered and then monitored.
LHCb have reported that the dCache problems they have seen recently do not seem to have any correlation with a particular dCache version. Even different endpoints in the same site could fail or work OK. This is all related to xrootd endpoints and some sites have solved the issues that seem to be caused by misconfigurations on their site.
GridPP will receive some RIPE probes for distribution - note there is now a waiting list for UK based requests.

Tuesday 5th August

There was an EGI Operations Management Board (OMB) meeting last week. Several UK issues (VO DM/job approaches, NFS area futures and availability alarm handling) were input for discussion, but due to the tight agenda will be reviewed at the next EGI ops meeting.
Some things to note from the OMB (see also the meeting minutes):
- Editing of the EGI wiki is now restricted to EGI members (SSO) and on request.
- A reminder to keep GOCDB information up-to-date - it is used to populate various tools.
- A federated cloud security survey is in progress.
- There is an EGI Big Data conference 24th-26th September.
- Resource requests - 19 pools registered. 13 available for allocation. It is a brokering service only. There is one request in the system for cloud resources.
- There is a new draft Resource Centre OLA for comment till 15th August. Updates coming for Technical Policy, User and EGI.eu SLAs. Refer to the performance wiki page for a chart showing relevance.
- A monthly release of the Ops portal following 1 week testing and input from TAG has been proposed.
- A SAM probes Task Force has been set up to assess the support status of probes and improve documentation. An initial list of issues is available.
- There are plans for VAPOR - combined Vo Administration and operations PORtal. There is a prototype available.

WLCG Operations Coordination - Agendas

Tuesday 26th August

There was a WLCG coordination meeting last Thursday. Minutes [1] are available.
News: CERN-IT to terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 September 2014.
A study to assess how operational effort in WLCG is used and could be optimised will launch in the next weeks. This will cover the management of sites and site services. It will (generally) not cover the experiment computing operations.
MW baselines: No recent updates
MW Issues: Storm and Argus integration issues. APEL fails to parse accounting records, affecting APEL 1.2.1 (released mid-August). Sites affected should move to 1.2.2. CVMFS upgrade to 2.1.19 almost done.
Oracle: upgrade plans now available.
T0: ARGUS latest version deployed. Looking at decommissioning AFS UI. A few users have already contacted CERN pointing out that they need SLC5 to build their software, as they haven't completed the porting to SLC6 yet - plan to push users to VMs on OpenStack.
Confirmation wanted on the AFS UI tarball support.
T1: No feedback.
T2: No feedback.
ALICE: steady production and analysis activities throughout the past weeks.
ATLAS: No report.
CMS: Finishing samples for CSA14; Computing Analysis Software challenge 2014 extended till mid-September. Users happy with AAA and miniAOD. Reminder for sites: Need to change xrootd redirectors, see this hn post; Need to adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> (e.g. value="T1_DE_KIT_Disk") in the <local-stage-out> section and the same format (but the PhEDEx name for the fallback endpoint) in <fallback-stage-out> NEW; Need to upgrade to CVMFS >= 2.1.19 immediately.
LHCb: Low activity, mainly Monte Carlo simulation and user jobs. For SAM/Nagios, in order to probe the ARC CEs at several UK sites, the probes are now submitted via a WMS instance at RAL-LCG2. It was confirmed that this WMS instance will be kept in production for this purpose at least until 2015.
Tracking tools: no report.
FTS3: no report.
glexec: no report.
Machine/job features: Developer is leaving OSG.
MW readiness: A new version of the WLCG Package Reporter has been released. A new BDII update 9 and Cream-ce 1.6.3 for CMS verification being deployed.
Multicore: no report.
SHA-2: Progress with new VOMS servers - compliance with the WLCG infrastructure being tested, ALICE results show CREAM/ARGUS config issues at some sites. Broadcast next week with a hard deadline of 15th September. Sites that fail the SAM preprod tests by the end of August will be ticketed.
WMS decommissioning: Condor validation - ATLAS and CMS ready. Deployment to production is planned on Wed 1st of October 2014.
IPv6: Ewan ran tests on pure IPv6 EMI-3 UI. Mixed results.
Squid monitoring/HTTP proxy: Reactivated Squid Monitoring TF to track its task list.
Network and transfer metrics: Tasks/membership updated. perfSONAR Toolkit 3.4rc2 became available for testing. Version 3.4 is a major milestone for the WG as it enables access via a REST API and introduces several important performance improvements; a deployment campaign will therefore follow once there is a stable release.
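The CMS reminders above (a phedex-node entry in both stage-out sections, and CVMFS >= 2.1.19) lend themselves to a quick sanity check. What follows is only a minimal sketch in Python 3. It assumes nothing beyond the element names quoted in the CMS item (local-stage-out, fallback-stage-out, phedex-node). The helper names version_at_least and phedex_nodes, the wrapping site-local-config element and the trimmed sample are hypothetical; a real site-local-config.xml contains considerably more.

```python
import xml.etree.ElementTree as ET


def version_at_least(found, minimum):
    """Compare purely numeric dotted versions, e.g. '2.1.19'.

    Suffixes such as 'rc2' are not handled and would raise ValueError.
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))

    return as_tuple(found) >= as_tuple(minimum)


def phedex_nodes(xml_text):
    """Return the phedex-node value in each stage-out section (None if absent)."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in ("local-stage-out", "fallback-stage-out"):
        node = root.find(f".//{section}/phedex-node")
        result[section] = node.get("value") if node is not None else None
    return result


# Hypothetical, heavily trimmed sample: only elements named in the bulletin.
SAMPLE = """
<site-local-config>
  <local-stage-out>
    <phedex-node value="T1_DE_KIT_Disk"/>
  </local-stage-out>
  <fallback-stage-out>
  </fallback-stage-out>
</site-local-config>
"""

if __name__ == "__main__":
    # Assumed installed version string; in practice this would come from the site.
    print("CVMFS new enough:", version_at_least("2.1.15", "2.1.19"))
    for section, value in phedex_nodes(SAMPLE).items():
        print(section, "->", value if value else "phedex-node missing")
```

In the sample the fallback section deliberately lacks a phedex-node, so the check flags it; the version string passed in is likewise an assumed value, not one reported by any site.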

Tuesday 12th August

The next meeting is on 21st August.

Tier-1 - Status Page

Tuesday 12th August

We have resumed draining disk servers after the Castor 2.1.14 upgrade. There were some problems with this that are now resolved.
We have announced that we will shut down both the FTS2 service and the software server used by the small VOs on 2nd September.

Storage & Data Management - Agendas/Minutes

Monday 11th August

Pool nodes at RHUL have received test errors.

Tuesday 5th August

The list of work Jens reviewed last Wednesday:
- WebFTS testing
- Updating storage documentation (the wiki) and testing it
- Upgrading DPM 1.8.7s?
- GLUE 2.0 for storage revisited?
- IPv6
- WebDAV

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 26th August

Sheffield has stopped publishing.

Tuesday 12th August

Accounting looks behind for UCL, Sheffield and Sussex.

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 26th August

KeyDocs now working again. Several documents assigned to Jeremy for re-allocation. Owners need discussion.

Tuesday 12th August

The KeyDocs PHP scripts are not yet working, so we cannot restart our review process.

Interoperation - EGI ops agendas

Tuesday 14th July

Last meeting yesterday.

Agenda: https://wiki.egi.eu/wiki/Agenda-14-07-2014
Minutes:

URT: see agenda for details
SR: In verification: gfal2 v. 2.5.5; active: globus-info-provider-service v. 0.2.1 cream v. 1.16.3; ready to be released: storm v. 1.11.4 lb v. 11.1 wms v. 3.6.5 dcache v. 2.6.28
DMSU report: CREAM CLI/GridSite SegFaults at Long-Lived Proxies solved
Migration of Central SAM services: Note that if services are reinstalled, make sure that patches are applied.
EMI-2/APEL-2 - Looks like UCL is still publishing with APEL-2 publisher
Hoped that gr.net issues resolved on Monday. Summary of discussion to be in minutes.
Next meeting placeholder 28th July, but may not happen (OMD depending)
Please fill out this UMD customer satisfaction survey in the next couple of weeks if you have a moment: https://www.surveymonkey.com/s/MQ6G8BZ

Tuesday 1st July

Today's ops meeting cancelled - partly due to forthcoming 4th EGI annual review.
EMI-2 decommissioning: The situation is followed by COD (GGUS 106354). "Please remember that we passed the decommissioning deadline and after today - Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points" (source PROC16).
There is STILL use of UMD2/EMI2 APEL clients to send accounting data. As of today there are 20 sites (see latest list) still using UMD2/EMI2 APEL clients.

Updates requested for the early adopters table.

Monitoring - Links MyWLCG

Monday 18th August

Consolidation meeting last Friday, Messaging and SAM 3 UI: https://indico.cern.ch/event/334354/
- Looking for experiment input on SAM3 UI with a view to bringing it into production ~ end of October

Kick-off meeting discussing cvmfs monitoring in squid monitoring TF arranged for 28th August.

On-duty - Dashboard ROD rota

Tuesday 26th August

RAL : Nagios jobs staying in queue for long time - to be investigated.
Sussex : Matt needs help probably from some SGE experts.
UCL : No acknowledgement from the site (ticket escalated to second level).
100IT : There is an alarm from EGI federated cloud - this needs discussion.
Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.

Tuesday 12th August

Last week was quiet.
Still one or two responses needed for next rota allocations.

Rollout Status WLCG Baseline

Tuesday 26th August

EMI3 WN tarball update needed soon (GGUS 107869)

Monday 28th July

UMD v.3.8.0 was released on 24th July.

References

Staged Rollout pages (now separated into EMI1 & 2), and the page listing the deployed versions is extractable from the bdii, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Monday 11th August

Topics as mentioned during the last GridPP technical meeting.

There is an issue at the moment in the evaluation of vulnerabilities causing everything rated 'High' by Pakiti to display as 'Critical' in the Dashboard.

The EGI security dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Tuesday 12th August

A reminder to update site status information in the IPv6 pages.
There is a new version (v3.4rc2) of perfSONAR being tested at QMUL [2]. Details here [3].
We will shortly review issues being picked up by perfSONAR and the steps to take when investigating.

Tuesday 22nd July

There was a problem with VOMS admin that was noticed last Thursday: VO page requests resulted in the WEB-INF directory of the jetty app being displayed. A server restart fixed the problem.

Tuesday 17th June

The GridPP VOMS server was updated on 11/06/2014 - no issues reported.

Tickets

Monday 25th August 2014, 22.30 BST
29 Open UK tickets this week.

SUSSEX
107814(22/8)
Ops failures on the Sussex CREAM. Matt sent an e-mail to TB-SUPPORT about this so if anyone could chime in that would be appreciated - ATLAS are running fine, but Ops tests (and possibly other jobs coming in via WMS) are hitting a spot of bother. In my experience delegation errors like this often pass in time, but the errors have been going for over 4 days. Any help appreciated.

107801(21/8)
Perhaps this problem is also affecting Sno+? In Progress (22/8)

RALPP
107844(24/8)
Just a heads up that this atlas "no free space" ticket has been reopened with (possibly unrelated) srm errors. Reopened tickets often sneak past our sentries. Reopened (24/8)

SNOPLUS SOFTWARE DIR to CVMFS (21/8)
LIVERPOOL: 107796
SHEFFIELD: 107798
QMUL: 107799
Sno+ has asked sites to have their VO SW DIR environmental variable point to their cvmfs directory. All three sites are on it, something for anyone rolling out Sno+ support.

BRISTOL
106325(18/6)
Winnie has spotted that the problem of CMS pilots losing contact with the submission host is only (at least recently) affecting their ARC CE. Whilst the CE flavour isn't the only difference between the clusters, this strongly suggests that the problem isn't with the site firewall. On Hold (19/8)

UCL
107711(15/8)
UCL received an Apel-Pub Ops ticket nearly a fortnight ago which has yet to be even acknowledged. I suspect Ben is on holiday, can someone (looking at the Londoners) poke through other channels? Assigned (15/8)

TIER 1
107815(22/8)
DirectJobSubmit Ops failures at the Tier 1. Catalin asks if the Ops jobs can be tuned and have their registration timeouts increased - as it appears only the test jobs are suffering failures of this kind. Waiting for reply (22/8)

Tools - MyEGI Nagios

Monday 14th July

Winnie reported on Saturday 12th July that most UK sites were failing Nagios tests. The problem started with an unscheduled power cut at a Greek site hosting an EGI message broker (mq.afroditi.hellasgrid.gr) at around 2PM on 11th July. The message broker was put in downtime, but top BDIIs continued to publish it for quite a long time. Stephen Burke mentioned in a TB-SUPPORT thread that the default caching time is now 4 days. When I checked on Monday morning only Manchester was still publishing mq.afroditi, and it went away after Alessandra manually restarted the top BDII. It seems that Imperial is configured with a much shorter cache time. Only Oxford and Imperial were almost unaffected; the reason may be that Oxford WNs have the Imperial top BDII as the first option in BDII_LIST. Other NGIs have reported the same problem, and this outage is likely to be considered when calculating availability/reliability. All Nagios tests are now back to normal.

Emir reported this on the tools-admin mailing list: "We were planning to raise this issue at the next Operations meeting. In these extreme cases 24h cache rule in Top BDII has to be somehow circumvented."
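The suggestion that Oxford worker nodes were barely affected because the Imperial top BDII comes first in BDII_LIST can be illustrated with a small sketch. It assumes that BDII_LIST is a comma-separated list of host:port entries which clients try in order, with 2170 as the default port. The host names are placeholders, and parse_bdii_list and first_reachable are hypothetical helpers rather than part of any grid middleware.

```python
def parse_bdii_list(value):
    """Split 'host:port,host:port' into ordered (host, port) tuples."""
    endpoints = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        # Assumption: fall back to the usual BDII port when none is given.
        endpoints.append((host, int(port) if port else 2170))
    return endpoints


def first_reachable(endpoints, unreachable_hosts):
    """Return the first endpoint not listed as unreachable, or None."""
    for host, port in endpoints:
        if host not in unreachable_hosts:
            return host, port
    return None


if __name__ == "__main__":
    # Placeholder host names, not real top BDII endpoints.
    bdii_list = "topbdii-a.example.org:2170, topbdii-b.example.org:2170"
    endpoints = parse_bdii_list(bdii_list)
    print("Order tried:", endpoints)
    print("Used when all are up:", first_reachable(endpoints, set()))
```

Because the first reachable entry wins, a worker node only sees the cache behaviour of whichever top BDII heads its list. Reordering the list changes which cache policy the node is exposed to.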

Tuesday 1st July

There was a monitoring problem on 26th June. All ARC CEs were using storage-monit.phyics.ox.ac.uk for replicating files as part of the Nagios testing. storage-monit was updated but not re-yaimed until later, so it was broken for the morning, leading to all ARC SRM tests failing.

Tuesday 24th June

An update from Janusz on DIRAC:
We had a stupid bug in Dirac which affected the gridpp VO and storage. Now it is fixed and I was able to successfully upload a test file to Liverpool and register the file with the DFC
The async FTS is still under study; there are some issues with this.
I have a link to software to sync user database from a VOMS server, haven’t looked into this in detail yet.

VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30 June 2014

HyperK.org request for support from other sites
- 2TB storage requested.
- CVMFS required

Cernatschool.org
- WebDAV access to storage - world read works at QMUL.
- ideally will configure federated access with DFC as LFC allows.

Monday 16 June 2014

CVMFS
- Snoplus almost ready to move to CVMFS - waiting on two sites. Will use symlinks in existing software

VOMS server: Snoplus has problems with some of the VOMS servers - see GGUS 106243 - may be related to update.

Tuesday 15th April

Is there interest in an FTS3 web front end? (more details)

Impact
- Citation policy (https://www.gridpp.ac.uk/acknowledging.html)

Site Updates

Tuesday 20th May

Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 27th August 2014

Operations report
Successful test of modified disk server draining procedure.
Five disk servers added to cache for AtlasTape.
The termination of the FTS2 service has been announced for the 2nd September.
The software server used by the smaller VOs will be turned off - also on 2nd September.
We are planning to turn off access to the cream CEs (possibly keeping them open for Alice) - although no date yet decided for this.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

N/A

To note

N/A
