Latest revision as of 13:16, 2 January 2013
RAL Tier1 Operations Report for 2nd January 2013
Review of Issues during the fortnight
19th December 2012 to 2nd January 2013.
This period mainly covers the Christmas & New Year Holidays (from Friday 21st Dec to Wednesday 2nd Jan). With the exception of the Atlas Castor database problem (see below) it was a fairly quiet period.
- On Christmas Day (25th Dec) a problem appeared with the Atlas Castor stager and SRM databases. This took some time to track down and resulted in intermittent performance of the Atlas Castor instance until the 27th. The cause was finally traced to a spurious error/warning returned for a database password that had not yet expired but was due to expire shortly.
- On Tuesday (1st Jan), at the end of the afternoon, one of the four Top-BDII nodes failed. The Top-BDII service ran in a degraded manner until the following morning.
- Over the holiday there were a couple of minor batch issues picked up and fixed by the on-call team although these did not significantly affect batch work.
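The password problem above suggests a simple preventive check: Oracle records each account's expiry date, so accounts approaching expiry can be flagged before warnings start surfacing as errors. A minimal sketch of that logic, with illustrative account names and cutoff (not the team's actual procedure; in practice the rows would come from a query such as `SELECT username, expiry_date FROM dba_users`):

```python
from datetime import date, timedelta

def expiring_accounts(rows, today, warn_days=14):
    """Return usernames whose password expires within warn_days of today.

    rows: iterable of (username, expiry_date) pairs, e.g. fetched from
    Oracle's DBA_USERS view. Hypothetical data below for illustration.
    """
    cutoff = today + timedelta(days=warn_days)
    return [user for user, expiry in rows
            if expiry is not None and today <= expiry <= cutoff]

# Illustrative result set (not real account names):
rows = [
    ("STAGER_DB", date(2013, 1, 4)),   # expires within 14 days -> flagged
    ("SRM_DB",    date(2013, 6, 1)),   # well in the future -> ignored
]
print(expiring_accounts(rows, today=date(2012, 12, 25)))  # -> ['STAGER_DB']
```

Run periodically (e.g. from cron), such a check would have flagged the imminent expiry days before the warning began disturbing the stager and SRM databases.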
Resolved Disk Server Issues
- GDSS447 (AtlasDataDisk - D1T0) failed with a read only filesystem overnight 18/19 Dec. It was ready to go back into production the next day. However, owing to an error this was not done fully until 24th Dec.
- GDSS449 (AtlasDataDisk - D1T0) failed with a read only filesystem on Tuesday 31st Dec. It was returned to production the next day (1st Jan).
Current operational status and issues
- The batch server process sometimes consumes excessive memory, normally triggered by a network/communication problem with worker nodes. A test for this condition (with an automatic re-starter) is in place.
- On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. One half of the new switchboard has been refurbished and was brought into service on 17th September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (The original date was 18th Dec.)
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates, a problem which disappears when the network is not under load.
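The re-starter mentioned above amounts to a memory watchdog: read the process's resident set size and restart it when it crosses a threshold. A minimal sketch of that decision logic, using Linux's /proc/&lt;pid&gt;/status VmRSS field (the process name, threshold, and sample text below are illustrative, not the actual Tier1 tooling):

```python
def rss_kb(status_text):
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:  123456 kB"
    return None  # field absent (e.g. kernel thread)

def needs_restart(status_text, limit_kb):
    """True if the process's resident memory exceeds the configured limit."""
    rss = rss_kb(status_text)
    return rss is not None and rss > limit_kb

# In production status_text would come from open("/proc/%d/status" % pid).read();
# a hypothetical sample is used here.
sample = "Name:\tpbs_server\nVmRSS:\t 9437184 kB\nThreads:\t12\n"
print(needs_restart(sample, limit_kb=8 * 1024 * 1024))  # 8 GB limit -> True
```

Wrapped in a cron job or monitoring hook that issues a service restart when the check fires, this keeps a slow leak from exhausting the batch server's memory between manual interventions.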
Ongoing Disk Server Issues
Notable Changes made this last week
- On Wednesday/Thursday 19/20th Dec a firmware upgrade was rolled out to one batch of disk servers following a higher rate of problems in that batch.
- The post-mortem report for the power incident on 20th November has been prepared and is available at: RAL_Tier1_Incident_20121120_UPS_Over_Voltage (repeat of information in last report).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 19th December 2012 and 2nd January 2013.
There was one unscheduled outage in the GOC DB for this period which is for the Atlas Castor problems that began on Christmas Day.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | UNSCHEDULED | OUTAGE | 25/12/2012 06:00 | 25/12/2012 12:31 | 6 hours and 31 minutes | ATLAS SRM database problems
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
89733 | Red | Urgent | In Progress | 2012-12-17 | 2012-12-20 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
19/12/12 | 100 | 100 | 100 | 100 | 100 |
20/12/12 | 100 | 100 | 100 | 100 | 100 |
21/12/12 | 100 | 100 | 100 | 100 | 100 |
22/12/12 | 100 | 89.6 | 89.5 | 86.0 | 100 | Monitoring problem affected a number of grid sites.
23/12/12 | 100 | 100 | 100 | 95.1 | 100 | Single SRM test failure "user timeout".
24/12/12 | 100 | 100 | 100 | 100 | 100 |
25/12/12 | 100 | 100 | 45.7 | 100 | 100 | Database problem with cryptic error.
26/12/12 | 100 | 98.3 | 38.1 | 99.5 | 100 | Atlas - ongoing from 25/12. Alice & CMS: Monitoring/BDII problem.
27/12/12 | 100 | 100 | 73.6 | 90.6 | 100 | Atlas - ongoing from 25/12. CMS: Monitoring/BDII problem plus a single SRM test failure.
28/12/12 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout".
29/12/12 | 100 | 100 | 100 | 100 | 100 |
30/12/12 | 100 | 100 | 100 | 100 | 100 |
31/12/12 | 100 | 100 | 99.1 | 100 | 100 | Single SRM Put failure.
01/01/13 | 100 | 100 | 100 | 91.8 | 100 | Two SRM test failures "user timeout".