Latest revision as of 12:17, 30 January 2013
RAL Tier1 Operations Report for 30th January 2013
Review of Issues during the week 23rd to 30th January 2013.
- The work on the main site power supply has been completed. This started last June, and one half of the switchgear was brought into service on 17th September. Work on the second half has now been completed and it was brought into use on Monday (28th Jan). This restores resilience in this part of the site power supply.
Resolved Disk Server Issues
- GDSS594 (GenTape - D0T1) was taken out of production on Tuesday (22nd Jan) with multiple disk failures. It was returned to service on Thursday (24th).
- GDSS433 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Friday (25th Jan). It was returned to service on Sunday (27th).
Current operational status and issues
- The batch server process sometimes consumes excessive memory; this is normally triggered by a network/communication problem with the worker nodes. A test for this condition (with an automatic re-starter) is in place.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
- Testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service.)
- Problems with the Top-BDII (the daemon restarts) are seen and are known to cause failures. A rolling upgrade of the Top-BDII is underway.
- A system has been set up for participation in the xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
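The batch server re-starter mentioned above could, in outline, work along these lines. This is only a minimal sketch under assumed names and thresholds (the process name, RSS limit, and restart command are illustrative, not the production configuration):

```python
# Sketch of a memory-watchdog re-starter for a daemon: read the resident
# set size (VmRSS) from /proc and decide whether a restart is needed.
# The 8 GB limit and the helper names are assumptions for illustration.

def should_restart(rss_kb: int, limit_kb: int = 8 * 1024 * 1024) -> bool:
    """Return True when the daemon's resident memory exceeds the limit."""
    return rss_kb > limit_kb

def read_rss_kb(pid: int) -> int:
    """Read VmRSS (in kB) for a process from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return 0
```

A cron job or loop would call `read_rss_kb` on the batch server's PID and, when `should_restart` returns True, invoke the service restart command.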
Ongoing Disk Server Issues
Notable Changes made this last week
- On Thursday (24th Jan) the batch farm was configured to have access to the CVMFS areas for na62 and mice.
- On Tuesday (29th Jan) the Argus server was upgraded to EMI-2/SL6.
- On Tuesday (29th Jan) the RAL status page (http://www.gridpp.rl.ac.uk/status/) was modified to show tape usage information.
- On Wednesday (30th Jan) an upgraded version of the Maui batch scheduler was installed.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Replace central switch (C300). This will:
- Improve the stack 13 uplink.
- Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Upgrade Top-BDIIs to latest (EMI-2) version on SL6.
- Infrastructure:
- Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 23rd and 30th January 2013.
There were no unscheduled entries in the GOC DB for this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | WARNING | 29/01/2013 10:00 | 29/01/2013 12:00 | 2 hours | Update of Argus Server to SL6/EMI-2
Whole Site | SCHEDULED | WARNING | 28/01/2013 08:00 | 28/01/2013 20:00 | 12 hours | Following completion of work on the power supply to RAL, new equipment will be switched in. This will be in parallel with the existing equipment and re-enables redundancy in the transformer/switchgear.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
90995 | Green | Less Urgent | In Progress | 2013-01-29 | 2013-01-30 | CMS | Stageout errors for single workflow at RAL
90986 | Green | Urgent | In Progress | 2013-01-29 | 2013-01-29 | NA62 | FTS channel BELGRID-UCL to RAL-LCG2 for na62
90844 | Green | Less Urgent | In Progress | 2013-01-26 | 2013-01-28 | | LFC for cernatschool.org
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-01-17 | SNO+ | WMS not assigning jobs to Sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-01-24 | NEISS | Support for NEISS VO on WMS
89733 | Red | Urgent | In Progress | 2012-12-17 | 2013-01-21 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
23/01/13 | 100 | 100 | 100 | 100 | 100 |
24/01/13 | 87.5 | 100 | 100 | 100 | 100 | Failed the Site-BDII test as the SL6 test CE had the wrong string in GlueHostOperatingSystemName.
25/01/13 | 100 | 100 | 100 | 100 | 100 |
26/01/13 | 100 | 100 | 68.6 | 100 | 100 | The CE test jobs did not run within the time allowed: the maximum number of AtlasSGM jobs was hit, and these were queued behind SL6 Atlas S/W validation jobs.
27/01/13 | 100 | 100 | 96.2 | 100 | 100 | Four SRM test failures - unable to delete file from SRM.
28/01/13 | 100 | 100 | 100 | 100 | 100 |
29/01/13 | 100 | 100 | 77.4 | 100 | 100 | Repeat of the problem of 26/01/13: the fix to the batch scheduler did not work and the jobs queued behind more SL6 S/W validation jobs.