Tier1 Operations Report 2014-03-19


RAL Tier1 Operations Report for 19th March 2014

Review of Issues during the week 12th to 19th March 2014.
  • In the early evening of Wednesday 12th March there was a failure of the primary OPN link to CERN between 17:00 and 19:00. Traffic flowed over the backup link; however, the failover was not clean and during this time we were failing the VO SUM tests.
  • There was a problem with one of the FTS2 agent systems in the early hours of Thursday 13th March. Owing to a configuration error the hypervisor hosting this virtual machine rebooted, and this particular VM was not configured to restart automatically (see the sketch after this list). This was resolved by the primary on-call.
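The FTS2 incident above came down to a guest not being set to restart when its hypervisor rebooted. As a minimal sketch only, assuming a libvirt-managed hypervisor with the libvirt-python bindings (the report does not name the virtualisation platform), the snippet below marks guests to autostart on host boot; the guest name shown is hypothetical.

#!/usr/bin/env python
# Minimal sketch: mark guests so they restart when the hypervisor boots.
# Assumes a libvirt-managed hypervisor with libvirt-python installed; the
# actual virtualisation platform at RAL is not stated in this report.
import sys
import libvirt

def ensure_autostart(names):
    conn = libvirt.open("qemu:///system")    # assumed local connection URI
    try:
        for name in names:
            dom = conn.lookupByName(name)
            if not dom.autostart():
                dom.setAutostart(1)          # start this guest on host boot
                print("enabled autostart for %s" % name)
            else:
                print("autostart already set for %s" % name)
    finally:
        conn.close()

if __name__ == "__main__":
    ensure_autostart(sys.argv[1:])           # e.g. "fts2-agent01" (hypothetical name)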
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There have been problems with the CMS Castor instance through the last week. These are triggered by high load on CMS_Tape - with all the disk servers that provide the cache for this ervice class running flat out (as far as network connectivity goes). Work is underway to increase the throughput of this disk cache.
  • The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that has been reported in recent weeks. Understanding of the problem is significantly adcanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
  • The problem of full Castor disk space for Atlas has been eased. Working with Atlas the file deletion rate has been somewhat improved. However, there is still a problem that needs to be understood.
  • Around 50 files in tape backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at time of migration). CERN will provide a script to re-send the remaining ones to tape.
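The CMS_Tape item above describes the cache disk servers as running flat out on network connectivity. As an illustrative check only, the snippet below samples /proc/net/dev on a disk server to estimate whether a NIC is saturated; the interface name and the 1 Gbit/s line rate are assumptions, not values taken from this report.

#!/usr/bin/env python
# Minimal sketch: estimate NIC throughput on a disk server by sampling
# /proc/net/dev twice, to check whether a cache server is network-bound.
# Interface name and line rate are assumed, not taken from this report.
import time

def tx_rx_bytes(iface):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[8]), int(fields[0])   # tx bytes, rx bytes
    raise ValueError("interface %s not found" % iface)

def throughput(iface="eth0", interval=10, line_rate_bps=1e9):
    tx1, rx1 = tx_rx_bytes(iface)
    time.sleep(interval)
    tx2, rx2 = tx_rx_bytes(iface)
    tx_bps = 8.0 * (tx2 - tx1) / interval
    rx_bps = 8.0 * (rx2 - rx1) / interval
    print("%s: tx %.0f Mbit/s, rx %.0f Mbit/s (%.0f%% of assumed line rate)"
          % (iface, tx_bps / 1e6, rx_bps / 1e6,
             100.0 * max(tx_bps, rx_bps) / line_rate_bps))

if __name__ == "__main__":
    throughput()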
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • The move of the Tier1 to use the new site firewall took place on Monday 17th March between 07:00 and 07:30. FTS (2 & 3) services were drained and stopped during the change. The batch system was also reconfigured so that new batch jobs would not start during this period (see the sketch after this list). The change was successful. There was a routing problem that affected the LFC in particular, and external access from many worker nodes, but this was fixed in around an hour.
  • One batch of worker nodes was updated to the EMI-3 version of the WN software a week ago. So far so good.
  • The EMI3 Argus server is in use for most of the CEs and one batch of worker nodes.
  • The planned and announced UPS/Generator load test scheduled for this morning (19th March) was cancelled.
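The firewall change above required that no new batch jobs start during the intervention window. As a minimal sketch only, assuming an HTCondor batch system with administrative access from the submit host (the report itself does not name the batch software), the snippet below peacefully pauses job starts before the window and resumes them afterwards.

#!/usr/bin/env python
# Minimal sketch: pause and resume batch job starts around a maintenance window.
# Assumes an HTCondor pool and administrative rights; the report does not name
# the batch software, so treat this as illustrative only.
import subprocess

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

def pause_job_starts():
    # Peacefully shut down the startds: running jobs finish, no new ones start.
    run(["condor_off", "-all", "-peaceful", "-startd"])

def resume_job_starts():
    # Bring the startds back so the pool accepts new jobs again.
    run(["condor_on", "-all", "-startd"])

if __name__ == "__main__":
    pause_job_starts()
    # ... network intervention happens here ...
    resume_job_starts()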
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 12th and 19th March 2014.
Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site SCHEDULED WARNING 17/03/2014 07:00 17/03/2014 17:00 10 hours Site At Risk during and following change to use new firewall.
lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk SCHEDULED OUTAGE 17/03/2014 06:00 17/03/2014 09:00 3 hours Drain and stop of FTS services during update to new site firewall.
srm-cms.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 14/03/2014 09:40 14/03/2014 10:26 46 minutes Problem with CMS Castor instance being investigated.
srm-cms.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 14/03/2014 04:15 14/03/2014 07:15 3 hours Currently investigating problems with Oracle DB behind Castor CMS
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
101968 Green Less Urgent On Hold 2014-03-11 2014-03-12 Atlas RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 Red Urgent In Progress 2014-02-09 2014-03-17 ARC CEs have VOViews with a default SE of "0"
101052 Red Urgent In Progress 2014-02-06 2014-03-17 Biomed Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
99556 Red Very Urgent In Progress 2013-12-06 2014-03-06 NGI Argus requests for NGI_UK
98249 Red Urgent Waiting Reply 2013-10-21 2014-03-13 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 Red Less Urgent On Hold 2013-09-03 2014-03-04 Myproxy server certificate does not contain hostname
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
12/03/14 100 100 91.4 93.7 90.4 There was a failure of the Primary OPN link to CERN. Traffic flipped to the backup link but the failover was not complete.
13/03/14 100 100 100 91.7 100 2 SRM test failures (both "User Timeout") See next entry for cause.
14/03/14 100 100 100 72.1 100 SRM test failures. These appear as the same "User Timeout" problem as yesterday - a bad request in the database.
15/03/14 100 100 100 100 100
16/03/14 100 100 100 100 100
17/03/14 100 100 99.1 87.7 100 Atlas: Single SRM Test ("User Timeout"); CMS: Continuation of what are believed to be load triggered problems in CMS_Tape.
18/03/14 100 100 97.9 64.2 100 Atlas: Single SRM Test failure ("could not open connection to srm-atlas"); CMS: Continuation of above problems.