Tier1 Operations Report 2011-12-21


RAL Tier1 Operations Report for 21st December 2011

Review of Issues during the week 14th to 21st December 2011.

  • On the evening of Wednesday 14th Dec. there were two outages of the site network: one from around 21:00 to 22:05, the second from 23:15 to 00:15. Network staff were called into site both times. Following these breaks, one of the DNS servers used by the Tier1 was showing very slow response times. The Atlas SRM did not recover from these problems, which also appeared to trigger a further problem with the Atlas SRM database. Following investigations on Thursday morning the Atlas SRM database was migrated to new hardware. (This is hardware that had previously been prepared and to which we are planning to migrate all Castor databases on 5th January.) The service was finally restored early in the afternoon. Including the network breaks, the Atlas SRM was unavailable for a total of 18 hours. A Post Mortem is being prepared for this incident and can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111215_Network_BReak_Atlas_SRM_DB

  • On Friday morning (16th Dec) there was a hang of the PLUTO database, which resulted in an outage of around three hours for the CMS and GEN Castor instances. This was resolved by the database team around 09:30.
  • On Saturday morning (17th Dec) there was a further outage of the site network link between around 04:15 and 05:00. This was fixed by networking staff who were called in. A board in the Site Access Router was replaced. This incident did not cause knock-on problems for the Tier1 (as compared with the outages on 14/15 Dec.)
  • On Monday (19th) it was found that the CIP was reporting stale information. This was a consequence of the move of the Atlas SRM database on the 15th and was immediately fixed.

Resolved Disk Server Issues

  • Early in the afternoon of Thursday 15th Dec. GDSS266 (LHCbDst – D1T0) was taken out of production with a double disk failure, following which the rebuilds were going very slowly. It was returned to production around 09:30 the following morning.
  • Around 05:00 on Sunday (18th Dec) GDSS568 (AtlasDataDisk - D1T0) failed with a read-only filesystem on its system drives. Following a disk replacement it was returned to service around 10:00 on Monday (19th).

Current operational status and issues.

  • None, but note the Tier1 Plans for the Christmas & New Year Holidays as detailed in the blog post at:

http://www.gridpp.rl.ac.uk/blog/2011/12/20/ral-tier1-%E2%80%93-plans-for-christmas-new-year-holiday/

Ongoing Disk Server Issues

  • GDSS332 (LHCbDst) failed with a read-only filesystem around 10:10 this morning. It is currently under investigation.

Notable Changes made this last week

  • None

Forthcoming Work & Interventions

  • Thursday 5th January. First step of the migration of the Oracle database infrastructure. This is the move of the Castor databases and will cause an outage of Castor for some hours (with a batch drain beforehand). The opportunity will also be taken to move some racks of older nodes (required to create space for new deliveries).
  • Monday 9th January. Lcgui01 will be updated to the UMD version of the UI.
  • Tuesday 10th January: Update of final pair of DNS servers at RAL to new hardware.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. There is a significant amount of work that will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.11 upgrade.
    • SRM 2.11 upgrade.
    • Replace hardware running Castor Head Nodes.
    • Move to use Oracle 11g.
    • Update to the Castor Information Provider (CIP).
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to be replaced).
  • Fabric:
    • BIOS/firmware updates; equipment moves in the machine room (consolidation of equipment in racks; some rack moves); other re-configurations (adding IPMI cards, etc.).
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in the new year).
  • VO:
    • Address permissions problem regarding Atlas User access to all Atlas data.
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 14th and 21st December 2011.

  • Castor CMS & GEN instances (srm-alice, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k): UNSCHEDULED OUTAGE, 16/12/2011 06:00 to 16/12/2011 09:00 (3 hours). Backdated entry to track an outage now fixed: a database hang-up caused an outage of some Castor instances.
  • srm-atlas: UNSCHEDULED WARNING, 15/12/2011 15:15 to 16/12/2011 12:00 (20 hours and 45 minutes). To fix the problem causing the outage of srm-atlas we have moved the database to new hardware. Declaring an At Risk (Warning) overnight on this service.
  • srm-atlas: UNSCHEDULED OUTAGE, 15/12/2011 11:00 to 15/12/2011 15:15 (4 hours and 15 minutes). Following network problems overnight we have a problem with srm-atlas. Although the network issues are resolved, a subsequent database issue is stopping srm-atlas working.
  • srm-atlas: UNSCHEDULED OUTAGE, 14/12/2011 22:05 to 15/12/2011 11:00 (12 hours and 55 minutes). Following two site outages, problems are persisting on srm-atlas.
  • Whole site: UNSCHEDULED OUTAGE, 14/12/2011 21:00 to 14/12/2011 22:05 (1 hour and 5 minutes). Networking failure caused a site outage. Staff called out to fix.

Open GGUS Tickets

  • 77528 (Green, Less urgent, In Progress; created 2011-12-16, last updated 2011-12-16; VO: H1): hone jobs cannot be submitted through lcgwms03.gridpp.rl.ac.uk wms-server.
  • 77026 (Red, Less urgent, On Hold; created 2011-11-29, last updated 2011-12-15): BDII.
  • 74353 (Red, Very urgent, In Progress; created 2011-09-16, last updated 2011-12-07; VO: Pheno): Proxy not renewing properly from WMS.
  • 68853 (Red, Less urgent, On Hold; created 2011-03-22, last updated 2011-12-15): Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s).
  • 68077 (Red, Less urgent, In Progress; created 2011-02-28, last updated 2011-09-20): Mandatory WLCG InstalledOnlineCapacity not published.
  • 64995 (Red, Less urgent, In Progress; created 2010-12-03, last updated 2011-09-20): No GlueSACapability defined for WLCG Storage Areas.