Tier1 Operations Report 2013-01-23


RAL Tier1 Operations Report for 23rd January 2013

Review of Issues during the week 16th to 23rd January 2013.
  • The AFS problem reported last week was resolved on Monday. Initially it was thought there would be some data loss; however, it was possible to recover the data and the AFS service was returned to production. It also became apparent that not all users were aware of the backup policy for the AFS service.
  • On Saturday (19th Jan) there was a problem with the distribution of renewed CRLs at CERN. This led to SUM test failures for Atlas, CMS and LHCb. For Atlas this was followed by a sequence of failures to delete the test file from the SRM, which extended the period of unavailability into the following day.
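
As background to the CRL incident: a stale CRL can be spotted locally before it starts causing authentication failures. Purely as an illustration (the certificates directory and the *.r0 file naming are conventional assumptions, not details from this report), a short Python sketch that flags CRLs whose nextUpdate has already passed:

  #!/usr/bin/env python
  # Hypothetical local check for stale CRLs: flag any CRL in the usual
  # IGTF certificates directory whose nextUpdate time has already passed.
  # The directory path and *.r0 naming convention are assumptions.
  import calendar
  import glob
  import subprocess
  import time

  CRL_DIR = "/etc/grid-security/certificates"   # assumed conventional location

  for crl_file in glob.glob(CRL_DIR + "/*.r0"):
      out = subprocess.check_output(
          ["openssl", "crl", "-noout", "-nextupdate", "-in", crl_file]).decode()
      # openssl prints e.g. "nextUpdate=Jan 23 10:00:00 2013 GMT"
      next_update = out.strip().split("=", 1)[1]
      expires = calendar.timegm(time.strptime(next_update, "%b %d %H:%M:%S %Y %Z"))
      if expires < time.time():
          print("stale CRL: %s (nextUpdate %s)" % (crl_file, next_update))
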
Resolved Disk Server Issues
  • GDSS235 (AtlasHotDisk) failed with a read-only filesystem on Monday (21st Jan). It was returned to production the following day.
Current operational status and issues
  • The batch server process sometimes consumes excessive memory, which is normally triggered by a network/communication problem with the worker nodes. A test for this condition (with a re-starter) is in place; a minimal sketch of the idea is included after this list.
  • On 12th/13th June the first stage of switching, in preparation for the work on the main site power supply, took place. One half of the switchboard has been refurbished and was brought back into service on 17th September. The work on the second half is scheduled for completion this Friday (25th January).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service; an example of driving a transfer through both clients follows this list.)
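
Regarding the batch server re-starter mentioned above: the idea is simply to poll the resident memory of the batch server process and restart the service once it exceeds a threshold. A minimal sketch, in which the process name, memory limit and restart command are hypothetical stand-ins rather than the production values:

  #!/usr/bin/env python
  # Minimal re-starter sketch: poll the resident memory of a batch server
  # process and restart the service when it exceeds a limit.  The process
  # name, limit and restart command below are hypothetical placeholders.
  import subprocess
  import time

  PROCESS_NAME = "pbs_server"                         # placeholder process name
  RSS_LIMIT_KB = 8 * 1024 * 1024                      # placeholder limit: 8 GB
  RESTART_CMD = ["service", "pbs_server", "restart"]  # placeholder command

  def total_rss_kb(name):
      """Sum the resident set size (kB) of all processes matching name."""
      try:
          out = subprocess.check_output(["ps", "-C", name, "-o", "rss="]).decode()
      except subprocess.CalledProcessError:
          return 0  # no matching process
      return sum(int(field) for field in out.split())

  while True:
      if total_rss_kb(PROCESS_NAME) > RSS_LIMIT_KB:
          subprocess.call(RESTART_CMD)
      time.sleep(60)
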
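On the FTS3 testing: one simple way to compare the two services is to drive the same transfer through both command-line clients. This is a sketch only; the endpoints and SURLs are placeholders, and the glite-transfer-submit (FTS2) and fts-transfer-submit (FTS3) client tools are assumed from the standard EMI packaging rather than taken from this report:

  #!/usr/bin/env python
  # Sketch of submitting the same copy through the FTS2 and FTS3 clients so
  # the two services can be compared side by side.  All endpoints and SURLs
  # below are placeholders.
  import subprocess

  SRC = "srm://source.example.org/some/path/file"   # placeholder source SURL
  DST = "srm://dest.example.org/some/path/file"     # placeholder destination SURL
  FTS2_ENDPOINT = "https://fts2.example.org:8443/glite-data-transfer-fts/services/FileTransfer"
  FTS3_ENDPOINT = "https://fts3.example.org:8446"

  # Each submit command prints a job identifier that can be polled later.
  fts2_job = subprocess.check_output(
      ["glite-transfer-submit", "-s", FTS2_ENDPOINT, SRC, DST]).decode().strip()
  fts3_job = subprocess.check_output(
      ["fts-transfer-submit", "-s", FTS3_ENDPOINT, SRC, DST]).decode().strip()

  print("FTS2 job: %s" % fts2_job)
  print("FTS3 job: %s" % fts3_job)
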
Ongoing Disk Server Issues
  • GDSS594 (GenTape) was taken out of production on Tuesday (22nd Jan) with multiple disk failures.
Notable Changes made this last week
  • On Tuesday (23rd Jan) a first SL6/EMI-2 Top-BDII was introduced into the alias. This is the start of a transparent rolling upgrade of all the Top-BDII nodes; an example query against such a node is sketched after this list.
  • System set-up for participation in xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, has been set up and some initial tests have been run. An outline of such a test submission also follows this list.
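
The new Top-BDII node can be exercised with the same kind of LDAP query used against the existing nodes. A minimal sketch, assuming the conventional BDII port (2170) and base (o=grid); the hostname is a placeholder:

  #!/usr/bin/env python
  # Minimal sketch of querying a Top-BDII with the standard ldapsearch
  # client.  The hostname is a placeholder; port 2170 and base "o=grid"
  # are the conventional BDII settings.
  import subprocess

  BDII_HOST = "top-bdii.example.org"   # placeholder for a node behind the alias

  out = subprocess.check_output([
      "ldapsearch", "-x", "-LLL",
      "-H", "ldap://%s:2170" % BDII_HOST,
      "-b", "o=grid",
      "(objectClass=GlueService)",     # Glue 1.3 service entries
      "GlueServiceEndpoint",
  ]).decode()

  print("%d service endpoints published" % out.count("GlueServiceEndpoint:"))
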
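For the new test queue and CE, an initial test can be as simple as submitting a trivial job. This is an outline only: the CE endpoint and queue name are placeholders, and the CREAM glite-ce-job-submit client (-a for automatic proxy delegation, -r for the target resource) is assumed rather than confirmed by the report:

  #!/usr/bin/env python
  # Outline of a trivial test submission to the new CE.  The CE endpoint and
  # queue name are placeholders; glite-ce-job-submit (CREAM client) is assumed,
  # with -a for automatic proxy delegation and -r for the target resource.
  import subprocess
  import tempfile

  CE = "test-ce.example.org:8443/cream-pbs-gridTest"   # placeholder endpoint/queue

  JDL = """[
  Executable    = "/bin/hostname";
  StdOutput     = "out.txt";
  StdError      = "err.txt";
  OutputSandbox = {"out.txt", "err.txt"};
  ]"""

  with tempfile.NamedTemporaryFile(mode="w", suffix=".jdl", delete=False) as jdl_file:
      jdl_file.write(JDL)
      jdl_path = jdl_file.name

  # Prints the CREAM job identifier on success.
  print(subprocess.check_output(
      ["glite-ce-job-submit", "-a", "-r", CE, jdl_path]).decode().strip())
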
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 and update the core to remove the C300 central switch
      • Change the way the Tier1 connects to the RAL network.
      • The above changes will lead to the removal of the UKLight Router.
    • Network trunking change as part of the investigation into (and possible fix for) the asymmetric data rates. The major network change (above) is expected to resolve this, although a separate change may still be made earlier.
    • Improve the stack 13 uplink. The major network change (above) will resolve this, although a separate change may still be made earlier.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
    • Upgrade Top-BDIIs to latest (EMI-2) version on SL6.
  • Infrastructure:
    • Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 16th and 23rd January 2013.

There were no entries in the GOC DB for this period.

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level  | Urgency     | State             | Creation   | Last Update | VO    | Subject
90589   | Green  | Urgent      | In Progress       | 2013-01-19 | 2013-01-21  | Atlas | RAL-LCG2 SOURCE FT failures
90528   | Yellow | Less Urgent | Waiting for Reply | 2013-01-17 | 2013-01-17  | SNO+  | WMS not assiging jobs to sheffield
90151   | Green  | Less Urgent | In Progress       | 2013-01-08 | 2013-01-21  | NEISS | Support for NEISS VO on WMS
89733   | Red    | Urgent      | In Progress       | 2012-12-17 | 2013-01-21  |       | RAL bdii giving out incorrect information
86152   | Red    | Less Urgent | On Hold           | 2012-09-17 | 2013-01-16  |       | correlated packet-loss on perfsonar host
Availability Report
Day      | OPS % | Alice % | Atlas % | CMS % | LHCb % | Comment
16/01/13 | 100   | 100     | 99.1    | 100   | 100    | Single SRM test failure - unable to delete file from SRM
17/01/13 | 100   | 100     | 100     | 100   | 100    |
18/01/13 | 100   | 100     | 100     | 100   | 100    |
19/01/13 | 100   | 100     | 73.4    | 72.6  | 79.2   | Problem with updating of CRLs at CERN - compounded by our slower rate of updates of the CRL.
20/01/13 | 100   | 100     | 88.9    | 100   | 100    | Multiple failures of SRM test - unable to delete file from SRM following from previous day's problem.
21/01/13 | 100   | 100     | 100     | 100   | 100    |
22/01/13 | 100   | 100     | 100     | 100   | 100    |