RAL Tier1 Operations Report for 9th January 2013
Review of Issues during the week 2nd to 9th January 2013.
- Some problems were seen with the Top-BDII. The relevant daemon is restarted automatically, but some look-up failures are seen while it rebuilds its cache.
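The transient look-up failures during a cache rebuild can be masked on the client side by retrying. A minimal sketch of such a retry wrapper is below; the retry count and delay are illustrative assumptions, not a recommended production policy.

```python
# Illustrative retry wrapper for look-ups that may fail transiently
# (e.g. while a BDII daemon rebuilds its cache after a restart).
# The attempt count and delay are assumptions for illustration only.
import time

def lookup_with_retry(query, attempts=3, delay=5.0):
    """Run a look-up callable, retrying on failure with a fixed delay.

    Re-raises the last exception if all attempts fail.
    """
    for i in range(attempts):
        try:
            return query()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```

For example, `lookup_with_retry(lambda: do_ldap_query(), attempts=3, delay=5.0)` would tolerate up to two transient failures before giving up.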
Resolved Disk Server Issues
Current operational status and issues
- The batch server process sometimes consumes excessive memory, normally triggered by a network/communication problem with the worker nodes. A test for this (with a re-starter) is in place.
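As a hedged illustration of the kind of memory test and re-starter described above (the memory limit, process identification, and restart command here are assumptions, not the actual Tier1 configuration):

```python
# Minimal sketch of a memory watchdog of the kind described above.
# The 8 GB limit and the restart command are illustrative assumptions.
import subprocess

MEM_LIMIT_KB = 8 * 1024 * 1024  # illustrative resident-memory limit (8 GB)

def rss_kb(pid):
    """Return the resident set size (kB) of a process from /proc (Linux)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def check_and_restart(pid, restart_cmd):
    """Restart the service if its memory use exceeds the limit.

    Returns True if a restart was triggered, False otherwise.
    """
    if rss_kb(pid) > MEM_LIMIT_KB:
        subprocess.call(restart_cmd)
        return True
    return False
```

Run periodically (e.g. from cron), such a check restarts the daemon before memory exhaustion affects the whole machine.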
- On the 12th/13th June the first stage of switching, ready for the work on the main site power supply, took place. One half of the new switchboard has been refurbished and was brought into service on 17th September. The work on the second half is over-running slightly, with an estimated completion date of 13th January. (The original date was 18th December.)
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites; in particular we are seeing some poor outbound rates - a problem which disappears when the network is not loaded.
Ongoing Disk Server Issues
Notable Changes made this last week
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Network trunking change as part of investigation (& possible fix) into asymmetric data rates.
- Improve the stack 13 uplink
- Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network.
- Update Spine layer for Tier1 network.
- Replacement of UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Upgrades to BDIIs to latest version on SL6.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 2nd and 9th January 2013.
There were no entries in the GOC DB for this period.
Open GGUS Tickets (Snapshot at time of meeting)

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
90151 | Green | Less Urgent | In Progress | 2013-01-08 | 2013-01-08 | NEISS | Support for NEISS VO on WMS
90132 | Green | Very Urgent | Waiting Reply | 2013-01-07 | 2013-01-08 | LHCb | Raw files not on storage
89733 | Red | Urgent | In Progress | 2012-12-17 | 2012-12-20 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2012-10-31 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
02/01/13 | 100 | 100 | 100 | 100 | 100 |
03/01/13 | 100 | 100 | 99.3 | 100 | 100 | Single failure - unable to delete file from SRM.
04/01/13 | 100 | 100 | 100 | 100 | 100 |
05/01/13 | 100 | 100 | 100 | 100 | 39.6 | LHCb monitoring problem.
06/01/13 | 100 | 100 | 99.2 | 100 | 0 | Atlas: single failure - unable to delete file from SRM; LHCb monitoring problem.
07/01/13 | 100 | 100 | 100 | 100 | 60.0 | LHCb monitoring problem.
08/01/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout".