Tier1 Operations Report 2018-01-03

RAL Tier1 Operations Report for 13th December 2017

Review of Issues during the week 13th to 20th December 2017.

Echo: • Background scrubbing has continued. It has flushed out more bad disks, causing some callouts through the week.

Network: • Emergency card replacement at Harwell PoP on Thursday morning. This was announced to us and, as expected, caused a short break in two of the three OPN links.

Current operational status and issues

None

Resolved Castor Disk Server Issues

GDSS743 (AtlasDataDisk - D1T0) is back in production.
GDSS705 (AtlasTape - D0T1) is back in production.

Ongoing Castor Disk Server Issues

None

Limits on concurrent batch system jobs.

CMS Multicore 550

Notable Changes made since the last meeting.

Infrastructure: • There was a successful generator load test last Wednesday (13th Dec).

Certificates: • The repeat update to pick up the updated UK CA certificate in the IGTF 1.88 rollout took place successfully last Tuesday (12th Dec), as planned.

Castor: • Three Castor disk servers were moved from LHCb tape buffer to their disk-only (D1T0) storage.

Entries in GOC DB starting since the last report.

No downtime scheduled in the GOCDB between 2017-12-12 and 2017-12-20

Declared in the GOC DB

None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Ongoing or Pending - but not yet formally announced:

Listing by category:

Castor:
- Update systems (initially tape servers) to SL7, configured by Quattor/Aquilon.
- Move to generic Castor headnodes.
Echo:
- Update to next CEPH version ("Luminous").
Networking:
- Extend the number of services on the production network with IPv6 dual stack. (Done for Perfsonar, FTS3, all squids and the CVMFS Stratum-1 servers).
Services:
Internal:
- DNS servers will be rolled out within the Tier1 network.

Open GGUS Tickets (Snapshot during morning of meeting)

Ticket-ID	Type	VO	Notified Site	Resp. Unit	Status	Priority	Creation	Last Update	ToI	Subject
132540	TEAM	lhcb	RAL-LCG2	NGI_UK assign to:lcg-support@gridpp.rl.ac.uk	in progress	top priority	2017-12-18 09:32:00	2017-12-18 11:36:00	Other	Upload problems at RAL
132336	USER	ops	RAL-LCG2	NGI_UK	in progress	less urgent	2017-12-06 14:34:00	2017-12-18 11:40:00	Operations	[Rod Dashboard] Issue detected : org.nagios.GLUE2-Check@site-bdii.gridpp.rl.ac.uk
132314	USER	ops	RAL-LCG2	NGI_UK assign to:lcg-support@gridpp.rl.ac.uk	in progress	less urgent	2017-12-05 10:48:00	2017-12-18 14:10:00	Operations	[Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-SRM-result-ops@arc-ce02.gridpp.rl.ac.uk
131815	USER	t2k.org	RAL-LCG2	NGI_UK	in progress	less urgent	2017-11-13 14:42:00	2017-12-01 19:30:00	Storage Systems	Extremely long download times for T2K files on tape at RAL
130207	USER	mice	RAL-LCG2	NGI_UK assign to:lcg-support@gridpp.rl.ac.uk	on hold	urgent	2017-08-24 09:46:00	2017-12-18 17:22:00	Network problem	Timeouts when copyiing MICE reco data to CASTOR
127597	USER	cms	RAL-LCG2	NGI_UK assign to:lcg-support@gridpp.rl.ac.uk share with:sexton@fnal.gov	on hold	urgent	2017-04-07 10:34:00	2017-10-05 09:14:00	File Transfer	Check networking and xrootd RAL-CERN performance
124876	USER	ops	RAL-LCG2	NGI_UK assign to:lcg-support@gridpp.rl.ac.uk	on hold	less urgent	2016-11-07 12:06:00	2017-11-13 16:55:00	Operations	[Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683	USER	none	RAL-LCG2	NGI_UK assign to:lcg-support@gridpp.rl.ac.uk	on hold	less urgent	2015-11-18 11:36:00	2017-11-06 16:59:00	Information System	CASTOR at RAL not publishing GLUE 2

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Atlas Echo
13/12/17	100	100	100	100	100	100
14/12/17	100	100	100	100	100	100
15/12/17	100	100	100	100	100	100
16/12/17	100	100	100	100	100	100
17/12/17	100	100	100	100	100	100
18/12/17	100	100	100	100	100	100
19/12/17	100	100	100	100	100	100

Hammercloud Test Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC Echo = Atlas Echo (Template 841); CMS HC = CMS HammerCloud


Day	Atlas HC	CMS HC	Comment
13/12/17	99	100	Atlas HC Echo - No test run in time bin
14/12/17	100	100	Atlas HC Echo - No test run in time bin
15/12/17	100	100	Atlas HC Echo - No test run in time bin
16/12/17	98	100	Atlas HC Echo - No test run in time bin
17/12/17	85	100	Atlas HC Echo - No test run in time bin
18/12/17	86	100	Atlas HC Echo - No test run in time bin
19/12/17	100	100	Atlas HC Echo - No test run in time bin

Notes from Meeting.

Ceph scrubbing now runs in the daytime only, to help reduce call-outs at night.
