Tier1 Operations Report 2012-11-28

RAL Tier1 Operations Report for 28th November 2012

Review of Issues during the fortnight 14th to 28th November 2012
  • There has been a recurring problem with the Castor Atlas and GEN stager daemons consuming excessive memory. This caused a number of problems, the first on Thursday afternoon (15th Nov) for GEN; the problem also affected Atlas on the 16th & 17th. A regular re-starter is now in place for this daemon; a minimal sketch of how such a re-starter might work is shown after this list.
  • The known problem of the batch server process consuming memory has recurred on a number of occasions since the power cut on the 7th.
  • Overnight 19/20 Nov there was a failure of one of the network stacks (stack 15) which was resolved the following morning. This affected a small number of services including the Atlas Frontier squids.
  • On Tuesday 20th November there was a major power incident during a planned intervention on the electrical system for the UPS. This resulted in an over-voltage being applied to systems on UPS power. All Tier1 systems were unavailable for around 24 hours; Castor services were unavailable for around 50 hours, with batch services brought back after that. There were numerous broken power supplies, PDUs and network switches, and some services (notably batch capacity and tape throughput) have been running at reduced capacity since then. The Tier1 remains with very limited resilience until more replacements can be obtained. On Friday 23rd, the first full day of services following this incident, there were still some residual issues, including a rack of eight disk servers (a mixture of Atlas & CMS) being unavailable for two to three hours and a short network break while one of the site routers (Router A) was restarted. A Post Mortem report for this incident is being prepared.
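The re-starter mentioned above is not described in detail in this report; the sketch below is only a minimal illustration, assuming a cron-driven watchdog that restarts a daemon once its resident memory grows beyond a threshold. The service name, memory limit and restart command are illustrative assumptions, not the actual Tier1 configuration.

 # Hypothetical sketch: restart a daemon whose resident memory exceeds a limit.
 # Intended to be run regularly (e.g. from cron); names and limits are assumed.
 import subprocess
 import sys

 SERVICE = "castor-stager"        # illustrative daemon name, not the real one
 RSS_LIMIT_KB = 8 * 1024 * 1024   # assumed limit: restart above ~8 GB resident

 def rss_kb(pid):
     """Return the resident set size (kB) of a process, read from /proc."""
     with open("/proc/%d/status" % pid) as status:
         for line in status:
             if line.startswith("VmRSS:"):
                 return int(line.split()[1])
     return 0

 def main():
     # pidof exits non-zero if the daemon is not running.
     try:
         pid = int(subprocess.check_output(["pidof", "-s", SERVICE]).strip())
     except subprocess.CalledProcessError:
         sys.exit(SERVICE + " is not running")

     if rss_kb(pid) > RSS_LIMIT_KB:
         # Restart via the init system; adjust for whichever init system is in use.
         subprocess.check_call(["service", SERVICE, "restart"])

 if __name__ == "__main__":
     main()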
Resolved Disk Server Issues
  • GDSS439 (AtlasDataDisk) failed with a read-only filesystem in the early morning of 17th Nov. It was returned to service on the morning of 18th Nov.
  • GDSS629 to GDSS632 (AtlasDataDisk) & GDSS633 to GDSS636 (CMSTape) were unavailable for a few hours on Friday 23rd Nov when the power to the rack was tripped.
  • GDSS611 (LHCbDst - D1T0) was unavailable for a few hours on Friday 23rd Nov. The Castor partitions were not mounted owing to a RAID error.
  • GDSS523 (CMSTape) shut itself down following a (believed erroneous) over-temperature report in the early hours of Sunday 25th November. It was booted up and drained before being checked out, had its IPMI firmware updated, and was returned to service at lunchtime on 26th Nov.
  • A second CMSTape disk server showed a similar temperature problem to GDSS523 (above). It was also drained out on Sunday 25th Nov and returned to service the next day.
Current operational status and issues
  • Although the planned work on Tuesday 20th November resulted in a major problem, some investigation was made into why the diesel generator did not cut in. A minor fault was found, along with a sensitive trip setting. These have been corrected and it is believed the diesel generator would now work, although this has not yet been tested.
  • On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating this with the other half. The work is running to schedule. One half of the new switchboard has been refurbished and was brought into service on 17 September.
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites, in particular we are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
Ongoing Disk Server Issues
  • GDSS673 (CMSTape - D0T1) crashed on Tuesday morning, 27th Nov. It is currently being checked out.
Notable Changes made this last fortnight
  • On Thursday (15th Nov) the job numbers were increased on two further batches of worker nodes.
  • On Monday (19th Nov) the MyProxy service was upgraded from UMD-1 to UMD-2.
  • This morning (28th Nov) lcgui02 was upgraded to EMI-2. Both UIs have now been upgraded.
  • The worker nodes are being progressively re-installed with EMI-2 WN software. Two batches were done on the morning of the 20th November, before the power incident. Since then most of the remaining batches have been done. The final two batches will be drained out this weekend ahead of re-installation.
  • CVMFS continues to be available for testing by non-LHC VOs (including "stratum 0" facilities).
  • Test instance of FTS version 3 continues to be available.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • None
  • Networking:
    • Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
    • Improve the stack 13 uplink
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Continuing overcommit on WNs to make use of hyperthreading.
  • Infrastructure:
    • Test of move to diesel power in event of power loss.
    • Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 14th and 28th November 2012

There were five unscheduled outages in the GOC DB for this period. One was for a restart of the Castor GEN stager to investigate the memory leak problem. The other four all related to the power incident on 20th November.

Service Scheduled? Outage/At Risk Start End Duration Reason
lcgui02 SCHEDULED OUTAGE 28/11/2012 10:00 28/11/2012 12:00 2 hours Re-install with EMI software version (Upgrade postponed from last week).
All CEs (all batch) UNSCHEDULED OUTAGE 22/11/2012 14:40 22/11/2012 17:00 2 hours and 20 minutes Batch services still down following outage for power incident.
Whole site UNSCHEDULED WARNING 22/11/2012 14:40 23/11/2012 17:00 1 day, 2 hours and 20 minutes Systems at Risk after recovery from power incident.
lcgui02 SCHEDULED OUTAGE 21/11/2012 10:00 21/11/2012 12:00 2 hours Re-install with EMI software version.
All Castor SRM endpoints and all CEs. UNSCHEDULED OUTAGE 20/11/2012 12:18 22/11/2012 14:40 2 days, 2 hours and 22 minutes All storage and batch services down due to power incident
All services except Castor SRM endpoints and CEs. UNSCHEDULED OUTAGE 20/11/2012 12:18 21/11/2012 16:00 1 day, 3 hours and 42 minutes All services down due to power incident
Castor GEN (srm-alice, srm-biomed, srm-cert, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) UNSCHEDULED OUTAGE 20/11/2012 11:30 20/11/2012 11:40 10 minutes Outage while we reboot the castor headnodes. This is part of an ongoing investigation into a memory leak.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
88596 Red Very Urgent In Progress 2012-10-19 2012-11-28 T2K Jobs don't get delegated to RAL
86690 Red Urgent In Progress 2012-10-03 2012-11-06 T2K JPKEKCRC02 missing from FTS ganglia metrics
86152 Red Less Urgent On Hold 2012-09-17 2012-10-31 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
14/11/12 100 100 100 95.9 100 Single SRM test failure "user timeout".
15/11/12 97.2 72.7 92.0 92.2 100 Problem with CE configuration.
16/11/12 100 96.0 98.4 100 100 Castor stager memory problem
17/11/12 100 100 85.3 100 100 Castor stager memory problem
18/11/12 100 81.7 100 100 100 Jobs timed out
19/11/12 100 89.9 100 96.0 100 Alice: Jobs timed out; CMS: SRM problem.
20/11/12 30.7 54.7 51.6 44.4 53.9 Power incident took Tier1 down; before that a monitoring problem affected all UK OPS tests.
21/11/12 0 0 0 0 0 Power incident
22/11/12 33.6 28.7 32.0 29.1 34.2 Power incident
23/11/12 100 100 80.3 100 100 Problem with Atlas' monitoring
24/11/12 100 98.6 99.0 95.9 100 Alice: Problem with CEs; Atlas & CMS - single SRM test failure.
25/11/12 100 100 100 100 100
26/11/12 100 88.5 100 100 100 Batch problem.
27/11/12 100 100 100 100 100