Tier1 Operations Report 2013-02-06


RAL Tier1 Operations Report for 6th February 2013

Review of Issues during the week 30th January to 6th February 2013.
  • There was a problem with the Atlas Castor instance during the night and morning of Thursday 31st January. This was traced to a single unresponsive disk server; rebooting the server fixed the problem.
Resolved Disk Server Issues
  • GDSS644 (AtlasScratchDisk D1T0) was found to be responding very slowly on Thursday (31st January), causing problems for the Atlas Castor instance. It was rebooted, which resolved the problem.
Current operational status and issues
  • There has been an intermittent problem with the start rate for batch jobs over the last couple of days (5th/6th February); this is being investigated.
  • There is an open GGUS ticket for a problem seen by the FTS that is caused by an issue within the Castor SRM.
  • The batch server process sometimes consumes excessive memory; this is normally triggered by a network/communication problem with the worker nodes. A check for this condition, with an automatic re-starter, is in place (see the sketch after this list).
  • High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (around 3 PB of storage).
  • Investigations are ongoing into asymmetric bandwidth to/from other sites. Some poor outbound rates are being seen; the problem disappears when the network is not under load.
  • The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
  • A system has been set up for participation in the xrootd federated access tests for Atlas.
  • A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place and is currently being tested by Atlas.
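
In outline, the re-starter mentioned above is a memory watchdog on the batch server. Below is a minimal sketch of how such a check could work, assuming a hypothetical process name, memory limit and restart command; it is illustrative only and not the production script.

  #!/usr/bin/env python
  # Minimal sketch of a memory watchdog / re-starter for the batch server
  # process. The process name, RSS limit and restart command are assumptions
  # for illustration, not the production values.
  import subprocess
  import time

  PROCESS_NAME = "pbs_server"        # assumed batch server process name
  RSS_LIMIT_KB = 8 * 1024 * 1024     # assumed limit: 8 GB resident memory
  CHECK_INTERVAL = 300               # seconds between checks

  def rss_kb(pid):
      """Return the resident set size (kB) of a process, read from /proc."""
      with open("/proc/%d/status" % pid) as status:
          for line in status:
              if line.startswith("VmRSS:"):
                  return int(line.split()[1])
      return 0

  def main():
      while True:
          try:
              pids = subprocess.check_output(["pgrep", "-x", PROCESS_NAME]).decode().split()
          except subprocess.CalledProcessError:
              pids = []              # process not running; nothing to check
          for pid in pids:
              if rss_kb(int(pid)) > RSS_LIMIT_KB:
                  # Restart the service once it exceeds the memory limit.
                  subprocess.call(["service", PROCESS_NAME, "restart"])
                  break
          time.sleep(CHECK_INTERVAL)

  if __name__ == "__main__":
      main()
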
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • On Monday (4th February) the upgrade of the Top-BDII to newer systems running SL6/EMI-2 was completed. There are now three systems behind the Top-BDII alias (see the sketch after this list).
  • The H1 VO has been added to the CVMFS system for smaller VOs.
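
A quick way to confirm that the alias now fronts three systems is to resolve it and probe the standard top-level BDII LDAP port (2170) on each address. The sketch below uses only the standard Python library; the alias name shown is an assumption for illustration rather than the confirmed production name.

  #!/usr/bin/env python
  # Minimal sketch: list the hosts behind the Top-BDII DNS alias and check
  # that each one answers on the BDII LDAP port. The alias name is assumed.
  import socket

  ALIAS = "lcgbdii.gridpp.rl.ac.uk"   # assumed Top-BDII alias
  BDII_PORT = 2170                    # standard top-level BDII LDAP port

  def main():
      name, aliases, addresses = socket.gethostbyname_ex(ALIAS)
      print("%s resolves to %d address(es)" % (ALIAS, len(addresses)))
      for addr in addresses:
          sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          sock.settimeout(5)
          try:
              sock.connect((addr, BDII_PORT))
              print("  %s: port %d open" % (addr, BDII_PORT))
          except socket.error as exc:
              print("  %s: NOT reachable (%s)" % (addr, exc))
          finally:
              sock.close()

  if __name__ == "__main__":
      main()
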
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.13
  • Networking:
    • Replace the central switch (C300). (Tentative date 5th March, but Atlas would like it earlier). This will:
      • Improve the stack 13 uplink.
      • Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
    • Update the core Tier1 network and change the connection to the site and to the OPN, including:
      • Install a new routing layer for the Tier1.
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Removal of AFS clients from Worker Nodes.
  • Infrastructure:
    • An intervention is required on the "Essential Power Board", plus remedial work on three (out of four) transformers.
    • Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.


Entries in GOC DB starting between 30th January and 6th February 2013.

None

Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
91152 Green Less Urgent In Progress 2013-02-04 2013-02-04 CMS RAL tape migration
91146 Green Urgent In Progress 2013-02-04 2013-02-05 Atlas RAL input bandwidth issues
91060 Yellow Less Urgent On Hold 2013-01-31 2013-02-01 CMS glexec issues on a subset of worker nodes
91029 Red Very Urgent In Progress 2013-01-30 2013-02-06 Atlas FTS problem in querying jobs
90528 Red Less Urgent In Progress 2013-01-17 2013-02-04 SNO+ WMS not assigning jobs to Sheffield
90151 Red Less Urgent Waiting Reply 2013-01-08 2013-02-04 NEISS Support for NEISS VO on WMS
89733 Red Urgent In Progress 2012-12-17 2013-02-04 RAL bdii giving out incorrect information
86152 Red Less Urgent On Hold 2012-09-17 2013-01-16 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
30/01/13 100 100 94.9 100 100 Multiple 'unable to delete file from SRM' failures, plus one 'user timeout' failure.
31/01/13 100 100 90.1 100 100 Atlas Castor instance showing lots of timeouts. Traced to a single disk server that was very unresponsive. Reboot of disk server fixed it.
01/02/13 100 92.3 100 100 100 Alice test job exceeded the 330 minute timeout and was cancelled.
02/02/13 100 100 98.5 100 100 Single SRM test failure - unable to delete file from SRM
03/02/13 100 100 98.2 100 100 One user timeout, one failure to delete file.
04/02/13 100 100 100 100 100
05/02/13 100 100 100 100 100