Tier1 Operations Report 2012-06-06

RAL Tier1 Operations Report for 6th June 2012

Review of Issues during the week 30th May to 6th June 2012
  • Note that we have just had a four-day 'weekend' to celebrate the Queen's Jubilee. Overall operations continued OK through the weekend. There was a problem with one of the Top BDIIs on Sunday (3rd June), which was resolved by the on-call person. (Also note one disk server issue reported below.)
  • Last Wednesday (30th May) inaccessible files reported by ALICE were found to be caused by timeouts within the xrootd manager. This was resolved by increasing the timeout threshold.
  • There was a problem with Castor overnight Thursday-Friday (31st May - 1st June) caused by some systems running out of memory. This was triggered by the Castor DLF database being taken down earlier in the week, after which the relevant daemon started to consume memory. This particularly affected LHCb.
  • There was a problem with the CMS tape migrations at the end of last week. A significant backlog (around 12k files) built up. The problem was understood and fixed on Friday (1st June) and the backlog processed by the end of Saturday.
Resolved Disk Server Issues
  • GDSS496 (CMSTape - D0T1) had a problem in the early hours of Saturday 2nd June. The problem was traced to the RAID interface card hanging up. The server was returned to production later that morning.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for outgoing CMS FTS transfers.
  • There are problems with the Top BDIIs that are being investigated and worked around (an illustrative responsiveness check is sketched after this list).
  • A regular test this morning (Wed. 6th June) failed to start the backup diesel generator. Specialists are being called in to investigate. This means that, should there be a general power failure in the meantime, we would not have diesel generator backup power.
  • WMS03 is currently out of service for database maintenance and service re-configuration.
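The Top BDII issue above is being chased by hand; for illustration, the sketch below shows the kind of basic responsiveness check that can be run against a Top BDII. It is not the actual procedure used at RAL: it assumes the Python ldap3 library and a placeholder hostname, and simply performs an anonymous query of the standard BDII LDAP endpoint (port 2170, base o=grid) to count the GlueService entries published.

    from ldap3 import Server, Connection

    # Placeholder hostname for illustration only - not the real RAL Top BDII alias.
    TOP_BDII_HOST = "top-bdii.example.org"

    def check_top_bdii(host=TOP_BDII_HOST):
        """Anonymously query a Top BDII and report how many GlueService entries it publishes."""
        server = Server(host, port=2170)           # BDIIs publish GLUE data over LDAP on port 2170
        conn = Connection(server, auto_bind=True)  # anonymous bind is enough for read-only BDII queries
        conn.search(search_base="o=grid",          # standard GLUE 1.3 base DN
                    search_filter="(objectClass=GlueService)",
                    attributes=["GlueServiceEndpoint"])
        print("%s published %d GlueService entries" % (host, len(conn.entries)))
        conn.unbind()

    if __name__ == "__main__":
        check_top_bdii()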
Ongoing Disk Server Issues
  • GDSS374 (AtlasTape - D0T1) and GDSS607 (LHCbDst - D1T0) are both drained and undergoing re-acceptance testing following earlier failures.
Notable Changes made this last week
  • On Wednesday (30th May) all BDIIs (Site & Top) were upgraded to the latest EMI version (EMI-1 update 15).
  • On Thursday (31st May) the CMS Castor instance was successfully changed to use the new "Transfer Manager" scheduler. (The LHCb & GEN instances were done earlier in the week.)
  • This morning (Wednesday 6th June) a new version of the Castor Information Provider (CIP) was brought into service.
  • Errata and kernel updates are being deployed on worker nodes.
  • On Friday (1st June) ALICE were given access to the grid3000M queue.
Declared in the GOC DB
  • Deploy Transfer Manager for the Atlas Castor instance on Thursday 7th June 2012, 09:00-11:00 (affects ATLAS).
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

The following items will particularly affect services:

  • Castor 2.1.11-9 update (provisionally - Wed 13th June).
  • Update LFC/FTS databases to Oracle 11 (provisionally - Wed 13th June).
  • Replacement of site access router on Tuesday 19th June.
  • Castor Oracle 11 update. (provisionally Wed 27th June).

Listing by category:

  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • One step still remains in extending the IP address range used for disk servers that will use the OPN.
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer. There is now a firm date of 19th June for upgrading the Site Access Router. The replacement of the UKLight Router will follow.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • A reconfiguration and maintenance operation is scheduled for lcgwms03 (non-LHC WMS) from 1st to 7th June.
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months, commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating with the other half.


Entries in GOC DB starting between 30th May and 6th June 2012

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms03 | SCHEDULED | OUTAGE | 01/06/2012 12:00 | 07/06/2012 14:00 | 6 days, 2 hours | database maintenance and service re-configuration
Castor CMS instance: srm-cms | SCHEDULED | OUTAGE | 31/05/2012 09:00 | 31/05/2012 11:00 | 2 hours | downtime to upgrade CMS castor instance to use Transfer Manager
Castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 30/05/2012 09:00 | 30/05/2012 11:00 | 2 hours | downtime to upgrade GEN castor instance to use Transfer Manager
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82740 | Amber | Less Urgent | Waiting Reply | 2012-05-31 | 2012-05-31 | Biomed | CREAM CE lcgce05.gridpp.rl.ac.uk is not working for VO biomed