Tier1 Operations Report 2011-12-14

From GridPP Wiki

RAL Tier1 Operations Report for 14th December 2011

Review of Issues during the week 7th to 14th December 2011.

  • The operational error on Friday afternoon (2nd Dec) that left the software server used by the non-LHC VOs unavailable over the weekend has been the subject of a post mortem, which is available at:
 https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111202_VO_Software_Server
  • As reported last week there were problems last Tuesday & Wednesday (6/7 Dec) on the Atlas SRM. During Wednesday there were more general problems, which were traced to very slow responses from one of the DNS servers. The DNS issues were resolved that day, partly by the Tier1 re-configuring a large number of nodes to select the DNS servers in a different order, which lightened the load on the one failing DNS server. It is possible that the various problems seen earlier (including those with the Atlas SRM) were triggered by the DNS issues.
  • CMS migration to tape was not working overnight Thursday-Friday (8/9 Dec). This was picked up late Thursday afternoon but initial attempts to resolve it failed. Work continued on Friday morning, when a configuration problem was found and fixed.
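The DNS workaround described above amounts to changing the order in which nameservers are listed in each node's resolver configuration, since the resolver queries them in listed order. A minimal sketch (the addresses below are documentation placeholders, not the actual RAL DNS servers):

```
# /etc/resolv.conf -- before: the overloaded server is queried first
#   nameserver 192.0.2.1   (slow/failing DNS server)
#   nameserver 192.0.2.2
#   nameserver 192.0.2.3
#
# after: healthy servers listed first, so most lookups
# never reach the failing one
nameserver 192.0.2.2
nameserver 192.0.2.3
nameserver 192.0.2.1
```

Applying different orderings across the node population spreads queries over the healthy servers and takes most of the load off the failing one.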

Resolved Disk Server Issues

  • Early Wednesday evening (7th Dec) gdss428 (AtlasDataDisk) failed with a read-only file system. It was restored to service in read-only mode (as a precaution) on Thursday morning (8th), and back into full production on Tuesday (13th) after no further problems were seen.

Current operational status and issues.

  • Work on the Perfsonar network test system is ongoing. The throughput measurements are now made with Perfsonar running on dedicated hardware rather than on a virtual machine. Studies of network behaviour with Perfsonar will continue and, unless specific operational problems arise, will no longer be tracked here.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Roll-out of UMD versions of Top BDIIs complete.
  • Job limits on our grid4000M queue were changed after ATLAS created a new queue pointing specifically to it.
  • DNS servers at RAL were updated on Saturday 10th Dec. (As planned two servers remain to be updated.)

Forthcoming Work & Interventions

  • Thursday 5th January. First step of migration of Oracle database infrastructure. This is the move of Castor databases and will cause an outage of Castor for some hours (with a batch drain beforehand.)

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. A significant amount of work will need to be done during the LHC stop at the start of 2012.

  • Infrastructure:
    • Move part of the cooling system onto the UPS supply. (Should not require service interruption.)
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch Castor and LFC/FTS/3D to new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
  • Castor:
    • Castor 2.1.11 upgrade.
    • SRM 2.11 upgrade.
    • Replace hardware running Castor Head Nodes.
  • Networking:
    • Changes required to extend range of addresses that route over the OPN.
    • Re-configure networking in UPS room.
    • Install new Routing & Spine layers.
    • Final updates to the RAL DNS infrastructure (two DNS servers still to be replaced).
  • Fabric:
    • BIOS/firmware updates, equipment moves in machine room (consolidation of equipment in racks; some rack moves.)
  • Grid Services:
    • Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
  • VO:
    • Address the permissions problem regarding Atlas user access to all Atlas data.
    • Migrate Atlas LFC to CERN.

Entries in GOC DB starting between 7th and 14th December 2011.

There were no entries in the GOC DB for this last week.

Open GGUS Tickets

GGUS ID Level Urgency State Creation Last Update VO Subject
77026 Red Less urgent Waiting reply 2011-11-29 2011-12-07 BDII
76564 Red Very urgent In progress 2011-11-17 2011-12-14 geant4 jobs abort on lcgce05.gridpp.rl.ac.uk
74353 Red Very urgent In progress 2011-09-16 2011-12-07 Pheno Proxy not renewing properly from WMS
68853 Red Less urgent On hold 2011-03-22 2011-11-07 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
68077 Red Less urgent In progress 2011-02-28 2011-09-20 Mandatory WLCG InstalledOnlineCapacity not published
64995 Red Less urgent In progress 2010-12-03 2011-09-20 No GlueSACapability defined for WLCG Storage Areas