RAL Tier1 Operations Report for 4th December 2013

Review of Issues during the week 27th November to 4th December 2013.

There was a problem reported last week with one of the WMS systems, WMS05, caused by a user job filling up the available space. Our initial clean-up was insufficient: the WMS05 disk became rather full again and the service stopped accepting jobs overnight Thursday/Friday.
One file has been reported to Atlas as lost. It was found to be missing during the (ongoing) Atlas file renaming.

Resolved Disk Server Issues

Two disk servers (gdss238, gdss239) in AtlasHotDisk were out of production from Thursday to Friday (28-29 Nov) while they were physically moved. (The rack space is required for this year's purchases.)

Current operational status and issues

Ongoing Disk Server Issues

Notable Changes made this last week.

On Friday 29th Nov. the site-BDIIs were updated to EMI-3 update 9.
Some batch system parameters have been adjusted as experience is gained with the new system, notably when Atlas were running a large number of whole node jobs.

Declared in the GOC DB

Wednesday 11th December: UPS/Generator Load Test at 10:00. Site in 'warning' state.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

There will be an interruption to the small VOs' software server as it is to be physically moved.

Listing by category:

Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
Castor:
- Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
Networking:
- Possible move of Tier1 core network switch in January (TBC).
- Implementation of new site firewall.
- Update core Tier1 network and change connection to site and OPN including:
  - Install new Routing layer for Tier1.
  - Change the way the Tier1 connects to the RAL network.
  - These changes will lead to the removal of the UKLight Router.
Fabric:
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).

Entries in GOC DB starting between the 27th November and 4th December 2013.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
lcgfts.gridpp.rl.ac.uk	UNSCHEDULED	OUTAGE	26/11/2013 15:00	26/11/2013 15:15	15 minutes	Investigating problems with restarting FTS2 service after intervention earlier today
lcgft-atlas.gridpp.rl.ac.uk, lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	26/11/2013 09:30	26/11/2013 15:00	5 hours and 30 minutes	Outage of LFC, FTS2 and Atlas 3D/Frontier during work on disk array used by back end database.

Open GGUS Tickets (Snapshot at time of meeting)


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
98249	Red	Urgent	Waiting Reply	2013-10-21	2013-11-18	SNO+	please configure cvmfs stratum-0 for SNO+ at RAL T1
98122	Red	Less Urgent	Waiting Reply	2013-10-17	2013-11-18	cernatschool	CVMFS access for the cernatschool.org VO
97868	Red	Less Urgent	Waiting Reply	2013-10-08	2013-12-03	T2K	CVMFS for t2k.org
97385	Red	Less Urgent	In Progress	2013-09-17	2013-11-18	HyperK	CVMFS for hyperk.org
97025	Red	Less Urgent	On Hold	2013-09-03	2013-11-05		Myproxy server certificate does not contain hostname
86152	Red	Less Urgent	On Hold	2012-09-17	2013-10-18		correlated packet-loss on perfsonar host

Availability Report


Day	OPS	Alice	Atlas	CMS	LHCb	Comment
27/11/13	100	91.1	100	100	58.4	Ongoing problem that affected all sites. (For Alice additional scheduling issue - see 28/11)
28/11/13	100	51.5	100	100	100	Problem scheduling Alice test jobs coming into the 'whole node' queue.
29/11/13	100	100	100	100	100
30/11/13	100	100	100	100	100
01/12/13	100	100	100	100	100
02/12/13	100	100	100	100	100
03/12/13	100	100	100	100	100
