Operations Bulletin 201014

Bulletin archive

Week commencing 13th October 2014

Task Areas

General updates

Monday 13th October

UMD 3.8.1 was released on 9th October.
There was a GridPP technical meeting last Friday. Minutes are available.
The CHEP 2015 abstract deadline has moved to 25th October. Who has or will be submitting something?
HEPiX is taking place this week in Nebraska.
Minutes and actions have been made available from the October GDB.

Tuesday 7th October

A reminder to please share your work with everyone via blog posts. In the core-ops meeting it was suggested that there be an incentive... we'll consider that!
Ewan will take a closer look at the middleware package reporter (the Pakiti contender... or ally).
Matt will trial following up on VO Nagios errors from the GridPP Nagios.
There is an IPv6 quarterly meeting this week.
There is a GDB tomorrow 8th October.
The GridPP collaboration meeting is now scheduled for 28th to 30th April 2015.

Tuesday 30th September

There is an IPv6 meeting next week 6th-7th October.
There is a GDB next week on 8th October.
Storage placement - survey TBC.

WLCG Operations Coordination - Agendas

Tuesday 14th October

The next WLCG Network and Transfer Metrics WG meeting is on Wednesday 15th October.
The next ops coordination meeting is on Thursday 16th. Any T2 issues to raise?

Tuesday 7th October

There was a WLCG ops coordination meeting last Thursday (agenda and minutes are available). Some notes follow...

News: HEP_OSlibs-7.0.0-0.el7.cern.x86_64.rpm for CentOS7 has been released; CHEP 2015 15th October abstract deadline (https://indico.cern.ch/event/304944/call-for-abstracts/) approaching; comments on Shellshock.
MW baselines: New versions of the UI and WN are expected in the next UMD release at the end of October; the dCache 2.2.x decommissioning deadline is 31-10-2014.
MW issues: The xroot package deployed with ROOT 6 breaks access to dCache storage, affecting LHCb; a fix is coming. CREAM, WMS, L&B, UI and WN cannot be installed at the moment because the classads package (a dependency of all of them) was declared an orphan in EPEL!
T0 & T1 updates: Mainly SE upgrades
Oracle: Upgrade plans updated.
T0 news: WMSes decommissioned 1st October. Lxplus5 will be stopped in October; AFS UI (removal) discussion ongoing.
T1 feedback: NTR
T2 feedback: NTR
ALICE: Investigation of job failure rates and inefficiencies; HLT farm running as an ALICE site since Sep 24.
ATLAS: DC14 ongoing. Multi-core recommendation: 16GB physical memory per job. Serial production tasks in future will be limited. ARC-CE tests in ATLAS-CRITICAL from 1st October.
CMS: Scale testing of HTCondor and GlideinWMS by OSG - various issues. Reminder: Participate in space monitoring; Update xrootd fallback configuration.
LHCb: dCache storage sites broken when accessed by ROOT6/xrootd; new stripping campaign is currently being prepared; testing new VOMS.
glexec: NTR
Machine job features: NTR
MW readiness: Meeting on 1st October. DPM, CREAM and BDII verification exercises. MW package reporter development. Next meeting 19th November.
Multi-core: 50M events/daily for ATLAS. Continue deployment.
SHA-2: Testing new VOMS for each experiment.
WMS decommissioning: with the deployment of the Condor SAM probes nothing is using WMS anymore. Machines off. WG will end.
IPv6: LHCbDIRAC tested and working
Squid monitoring/HTTP proxy: NTR
Network & Transfer Metrics WG: Shellshock & perfSONAR news. PS 3.4 coming.

Tier-1 - Status Page

Tuesday 14th October

Access for all VOs to our CREAM CEs has been stopped (apart from ALICE and SNO+).
Following problems reported last week with a disk array that supports the Castor databases we have been running with a temporary configuration. The disk array has been fixed and this morning we have a downtime of the Castor Atlas and GEN instances to revert to the 'normal' database configuration.

Storage & Data Management - Agendas/Minutes

Wedn 15 Oct

Report on T1 CEPH plans from Alistair Dewhurst
Feedback from last week's DPM collaboration meeting in Naples

Wedn 01 Oct

Summary of all the exciting events in Amsterdam last week - EUDAT, EGI big data, RDA
DPM 1.8.9 early testing, and (separately) xroot4 early-ish testing. Supporting multiple VOs in one xroot server.

Wedn 17 Sept

iRODS - what it is and why it should choose to collapse on Betelgeuse 7.
Technical problems with Vidyo

Wedn 10 Sept.

High load at L'pool causing low throughput - how to throttle xroot transfers (and is the load necessary or a bug?)
Still testing WebFTS
Prep for DPM workshop

Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06

Tuesday 7th October

GridPP metrics need updating for CMS. Any comments on the metrics page at the moment?
APEL issues for Birmingham and Sussex, and the portal appears to stop at 1st October (being followed-up).

Tuesday 30th September

Slight delay for Birmingham and Sussex.

Tuesday 23rd September

Slight APEL delay for Birmingham.

APEL status: An issue at Sheffield?

Documentation - KeyDocs

See the worst KeyDocs list for documents needing review now and the names of the responsible people.

Tuesday 7th October

Keydocs were reviewed at the core-ops meeting last week. The situation with updates is improving.
The main GridPP website is expected to use WordPress with a plug-in to cover the gridsite aspects.

Interoperation - EGI ops agendas

Monday 6th October

There was a meeting today - link: https://wiki.egi.eu/wiki/Agenda-06-10-2014

EMI-WN 3.1.0 in SR: if anyone is running this in production please get in touch to help get this past rollout
MySQL 5.0 noted to be under Oracle Lifetime Sustaining Support (for some time now).
- See agenda for guidance on middleware consequences
classads "retired" from EPEL repos
- See Kashif's ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=108878
- In the short term these are in the EMI 3 third party repo
- long term seek to return these to EPEL
SL/SLC/CentOS 5 Support Lifetime
- This was highlighted, though not suggested to be urgent

Monitoring - Links MyWLCG

Tuesday 14th October

Meeting last Friday, agenda: https://indico.cern.ch/event/343515/ minutes: https://indico.cern.ch/event/343515/material/minutes/minutes.html
Presentation on FTS3 monitoring: https://indico.cern.ch/event/343515/contribution/5/material/slides/0.pdf
SAM2 to be turned off 1st December (pre-prod 1st November): https://indico.cern.ch/event/343515/contribution/6/material/slides/1.pdf

Next meeting ~ 31st October

On-duty - Dashboard ROD rota

Monday 29th September

Rota being updated.

Monday 22nd September

Quiet week - little to report.
EGI is looking for people to join the ops portal review and testing TF.

Tuesday 2nd September

Sussex is back in business - kept closing their low availability alarm wrt the GGUS ticket.
The UCL ticket is now finally receiving some attention.
Ongoing problems at RAL.

Tuesday 26th August

RAL : Nagios jobs staying in queue for long time - to be investigated.
Sussex : Matt needs help probably from some SGE experts.
UCL : No acknowledgement from the site (ticket escalated to second level).
100IT : There is an alarm from EGI federated cloud - this needs discussion.
Durham : Availability alarms - require constant closing with some comments. Ticket with devs is open.

Tuesday 12th August

Last week was quiet.
Still one or two responses needed for next rota allocations.

Rollout Status WLCG Baseline

Tuesday 26th August

EMI3 WN tarball update needed soon (GGUS 107869)

Monday 28th July

UMD v.3.8.0 was released on 24th July.

References

Staged Rollout pages (now separated into EMI1 & 2), and the page listing the deployed versions is extractable from the bdii, so they should all be reasonably up-to-date:
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

Security - Incident Procedure Policies Rota

Tuesday 14th October

Shellshock updates and follow-up
Banning challenge status

Tuesday 30th September

Shellshock - advice and follow-up.
Note particularly the advisories from WLCG/EGI.

Tuesday 23rd September

High priority vs critical tests in pakiti.
FAX update

The EGI security dashboard.

Services - PerfSonar dashboard | GridPP VOMS

- This includes notifying of (inter)national services that will have an outage in the coming weeks or will be impacted by work elsewhere. (Cross-check the Tier-1 update).

Monday 13th October

perfSONAR 3.4 has been released. Clear documentation on what to do (clean reinstall) coming this week together with information on mesh updates. See the GDB presentation slides 13 and 14.
RIPE have sent a reminder to connect probes that have been handed out (some weeks ago now). Please could the following sites check their status: Lancaster; Brunel; Sussex; and ECDF. 20599 at RAL has never properly connected (DHCP issue?).

Tuesday 7th October

There was a perfSONAR operations meeting last Friday 4th October. See the minutes.

Tuesday 23rd September

There was an LHCONE/LHCOPN meeting last week. These few talks will be of general interest: LHCONE global NOC proposal; perfSONAR update and CMS P2P applications.

Tickets

Monday 13th October 2014, 15.00 BST

Other VO Nagios:
VO Nagios

Sites seeing problems at the time of writing are:
Lancaster - long term errors for pheno & gridpp on one CE (still to be fixed).
Liverpool - short term errors for snoplus on a cream CE (looks like it's rejecting jobs).
RALPP - short term errors for southgrid and pheno on ARC CEs (job submission problem).
Bristol - short term error for southgrid ("Job submission to LRMS failed").
Sheffield - long term errors for gridpp on multiple CEs, short term errors for pheno, t2k and snoplus (timeouts affecting job submission).
QMUL - long term errors for t2k on their SE ("GlueVOInfoPath or GlueSAPath not published").
TIER 1 - long term errors for t2k and snoplus on their respective SEs.

I'm still figuring out how best to present this, please bear with me.

23 open UK tickets this week: 10 Green, 3 Yellow, 1 Orange and 10 Red.

Tier 1
108944(1/10)
CMS having trouble finding some files at RAL during an AAA access test. The RAL team has satisfied the ticket to the first order (confirming that the files in question are indeed in castor), so the ticket could be solved - or at least CMS could be asked to see if they still have trouble accessing the files. In progress (1/10)

108546(16/9)
An atlas ticket, about some job failures that might well not be relevant any more. Looking very stale, and possibly like it could be closed. In progress (22/9)

Also on the 'probably should be on hold' list: 106324 (CMS)

And Chris W, could you please take a peek at: 107880
(Sno+'s odd suse user group needing help).

SUSSEX
108765(24/9)
ROD ticket about the state of the Sussex BDII output. Matt RB tracked it to a problem with their (updated) SGE and has submitted a ticket (109263) which appears to have been picked up. Correctly On Hold (13/10)

IMPERIAL/DIRAC
108723(23/9)
Ticket from Chris W, asking some question about DIRAC. It really could do with some input from him, and Daniela points out the existence of the new dirac user mailing list as a better place for such discussion: https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users. Waiting for reply (1/10)

SHEFFIELD
Could this Sno+ ticket: 109223 (jobs not being assigned to Sheffield) be related to this Sno+ ticket: 109207 (Sno+ SW DIR needs to be pointed to cvmfs)? Just a naive thought if the SW_DIR was one of the requirements for jobs.

That's all Folks, please let me know if I've missed anything out.

Tools - MyEGI Nagios

Tuesday 16th Sep

Multi VO nagios maintained at Oxford has been upgraded to add ARC CE tests.
https://vo-nagios.physics.ox.ac.uk/nagios/
It is currently monitoring gridpp, pheno, t2k.org, snoplus.snolab.ca, vo.southgrid.ac.uk
Should we start monitoring it more actively and open tickets for sites failing tests?

VOs - GridPP VOMS VO IDs Approved VO table

Monday 11th August

Steve J sent an email to hyperk on 7th regarding "software directory for Hyperk (CVMFS)" and entries in the VO ID card.

Monday 14th July 2014

HyperK.org will initially use remote storage (irods at QMUL) - so CPU resources would be appreciated.

Monday 30th June 2014

HyperK.org request for support from other sites
- 2TB storage requested.
- CVMFS required

Cernatschool.org
- WebDAV access to storage - world read works at QMUL.
- ideally will configure federated access with DFC as LFC allows.

Impact
- Citation policy (https://www.gridpp.ac.uk/acknowledging.html)

Site Updates

Tuesday 9th September

Intel announced the new generation of Xeon based on Haswell.

Tuesday 20th May

Various sites but notably Oxford have ARGUS problems. 100s of requests seen per minute. Performance issues have been noted after initial installation at RAL, QMUL and others.

Meeting Summaries

Project Management Board - Members Minutes Quarterly Reports

Empty

GridPP ops meeting - Agendas Actions Core Tasks

Empty

RAL Tier-1 Experiment Liaison Meeting (Wednesday 13:30) Agenda Meeting takes place on Vidyo.

Wednesday 15th October 2014

Operations report
The problems reported last week, where the failure of a battery in a disk array caused problems for the Castor databases, have been fixed. The battery pack was replaced, the databases re-synchronised and yesterday the databases were moved back to their normal configuration.
We are approaching the half way mark for migrating CMS data to the 'D' tapes.
There was an upgrade made to the ARC CEs and ALICE are running jobs through one of them.

WLCG Grid Deployment Board - Agendas MB agendas

Empty

NGI UK - Homepage CA

Empty

Events

UK ATLAS - Shifter view News & Links

Empty

UK CMS

Empty

UK LHCb

Empty

UK OTHER

To note
