RAL Tier1 Operations Report for 5th December 2012
Review of Issues during the week 28th November to 5th December 2012
- Overnight Wed/Thu 28/29 Nov there was a problem affecting all of Castor, caused by a crash of the Castor permissions database. This was resolved by the Castor On-Call Team during the night.
- On Thursday 29th Nov there were problems with the Castor CMS instance, traced to a database problem. This was fixed at the end of the afternoon by moving the CMS Castor database to a different node.
- On Friday (30th Nov) there was a problem with the Atlas Frontier service with both nodes unavailable. Atlas raised a GGUS ticket. The problem was fixed and the monitoring of this service is being improved.
- On Monday (3rd Dec) a high rate of failures for Atlas Castor was seen. This was fixed by a bounce of the Atlas Castor database at around 13:00 that day.
- On Tuesday morning (4th Dec), around 08:00 local time, there was a transitory problem that caused a high rate of Castor SRM failures (seen in the FTS). The root of the problem has not been definitively identified but appears to be a network issue.
Resolved Disk Server Issues
- GDSS673 (CMSTape - D0T1) crashed on Tuesday morning, 27th Nov. It was returned to production on Saturday (1st Dec) following a firmware update (required to help identify a faulty disk within the array) and RAID array verification.
- GDSS647 (LHCbDst - D1T0) failed on Thursday (29th Nov) with a problem on a system partition. It was returned to service on Monday (3rd Dec).
- GDSS661 (AtlasDataDisk - D1T0) crashed on Saturday (1st Dec) - returned to service on Monday (3rd Dec).
Current operational status and issues
- Following the power incident the Tier1 has been running with reduced resilience, particularly as regards power supplies for the fibrechannel SAN switches used in the database infrastructure. This particular issue is now resolved. Work continues to replace and re-stock items such as power supplies and PDUs.
- There is an ongoing problem with the Castor Atlas and GEN stager daemons leaking memory. A regular re-starter is now in place for these daemons, and further investigations are taking place with assistance from the Castor developers. (A sketch of this kind of memory-threshold check appears after this list.)
- The batch server process sometimes consumes excessive memory, something which is normally triggered by a network/communication problem with worker nodes. A check for this (with a re-starter) is in place.
- Following checks made on 20th November (at the time of the power incident) it is believed the diesel generator should now work in the event of a further power cut. However, this has not yet been tested. A test (to be confirmed) is proposed for Tuesday 11th December.
- On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December: it involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
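The re-starters referred to above are, in essence, regular checks that restart a daemon once its memory use grows beyond a threshold. The sketch below illustrates the general approach only; the daemon name, memory limit and restart command are placeholders, not the actual values or scripts used for the Castor stager daemons or the batch server.

    #!/usr/bin/env python
    # Minimal sketch of a memory-threshold re-starter (illustrative only).
    # The daemon name, limit and restart command are placeholders, not the
    # actual Castor stager / batch server configuration.

    import subprocess

    PROCESS_NAME = "stagerd"                         # hypothetical daemon name
    RSS_LIMIT_KB = 4 * 1024 * 1024                   # restart above ~4 GB resident
    RESTART_CMD = ["service", "stagerd", "restart"]  # hypothetical restart command

    def total_rss_kb(name):
        """Return the summed resident set size (kB) of all processes called 'name'."""
        proc = subprocess.Popen(["ps", "-C", name, "-o", "rss="],
                                stdout=subprocess.PIPE)
        out = proc.communicate()[0].decode()
        return sum(int(field) for field in out.split())

    def check_and_restart():
        rss = total_rss_kb(PROCESS_NAME)
        if rss > RSS_LIMIT_KB:
            # Daemon has grown past the limit: report and bounce it.
            print("%s using %d kB (limit %d kB) - restarting" %
                  (PROCESS_NAME, rss, RSS_LIMIT_KB))
            subprocess.call(RESTART_CMD)

    if __name__ == "__main__":
        # Intended to be run periodically, e.g. from cron every few minutes.
        check_and_restart()

In practice such a check would normally be driven from cron or a monitoring framework and would log to a file rather than stdout.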
Ongoing Disk Server Issues
Notable Changes made this last week
- On Tuesday (4th Dec) replacement power supplies for the fibrechannel SAN switches used in the database infrastructure were obtained and installed. This removes the most significant resilience issue remaining after the power incident of 20th November.
- The final two batches of worker nodes were drained over the weekend and upgraded to EMI-2 (SL5) on Monday (3rd Dec).
- The final two batches of worker nodes had their overcommit increased yesterday (4th Dec) to make use of hyperthreading (see the illustrative note after this list).
- Thursday 6th December: 'Warning' on the Castor 'GEN' instance while debugging the Castor stager memory leak.
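For context on the overcommit change above, the figures here are purely illustrative and are not the actual RAL settings: a worker node with 8 physical cores presenting 16 logical cores via hyperthreading can be configured with more batch job slots than physical cores (for example 12 rather than 8), trading a small per-job slowdown for better overall throughput from the otherwise idle hardware threads.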
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink.
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Infrastructure:
- Test of move to diesel power in event of power loss. (Proposed - Tuesday 11th December).
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 28th November and 5th December 2012
There were no unscheduled outages in the GOC DB for this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 28/11/2012 10:00 | 28/11/2012 12:00 | 2 hours | Re-install with EMI software version (Upgrade postponed from last week).
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
88596 | Red | Very Urgent | In Progress | 2012-10-19 | 2012-12-01 | T2K | Jobs don't get delegated to RAL
86690 | Red | Urgent | In Progress | 2012-10-03 | 2012-12-04 | T2K | JPKEKCRC02 missing from FTS ganglia metrics
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
Daily availability (%) by VO:
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
28/11/12 | 100 | 100 | 100 | 95.7 | 100 | Single failure of SRM test "User timeout over"
29/11/12 | 96.9 | 100 | 100 | 62.5 | 91.9 | Castor permissions DB crashed; CMS also affected by a separate DB problem.
30/11/12 | 100 | 100 | 100 | 100 | 95.8 | Single SRM test failure. Probably caused by reboot of Router A.
01/12/12 | 100 | 100 | 100 | 100 | 100 |
02/12/12 | 100 | 100 | 100 | 100 | 100 |
03/12/12 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure. Database problem.
04/12/12 | 100 | 100 | 99.5 | 95.9 | 100 | Single failures of SRM tests. Transient network problem.
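As a rough worked example of how a single test failure maps onto these figures (assuming the relevant test runs approximately hourly, which is an assumption rather than a stated fact): one failed test costs about one sample in 24, i.e. 23/24 ≈ 95.8%, which is broadly consistent with the 95.7-95.9% entries annotated as single SRM test failures. The exact value depends on the test frequency and on how the failure window is counted.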