RAL Tier1 Operations Report for 30th May 2012

Review of Issues during the week 23rd to 30th May 2012

Last Thursday (24th) a problem with xrootd on a draining disk server affected CMS. A workaround is in place.
One of the Site BDIIs failed on Saturday (26th) and was removed from the DNS alias.
On Saturday (26th) a problem with a power controller caused two service nodes (APEL and one of the CMS squids) to fail. The CMS squid was removed from the relevant CMS configs on the same day. These services were restored on Monday (28th).

Resolved Disk Server Issues

GDSS644 (atlasStripInput) was found to have an incorrect installation on 12th May. It was drained and re-installed and returned to service on the 29th May.

Current operational status and issues

Investigations into an ongoing communications problem between the CEs and the batch server continue.
There is a known problem with the handling of some certificates within FTS that is currently causing problems for outgoing CMS FTS transfers.

Ongoing Disk Server Issues

GDSS374 (AtlasTape - D0T1) and GDSS607 (LHCbDst - D1T0) are both drained and undergoing re-acceptance testing following earlier failures.

Notable Changes made this last week

On Monday (28th May) the LHCb Castor instance was successfully changed to use the new "Transfer Manager" scheduler.
On Wednesday morning (30th May) the Transfer Manager was deployed for the Castor GEN instance.
The older disk servers in AliceDisk have now been drained and removed. This means the space token now has around 200TB of storage as planned. (It was temporarily over-allocated because newer servers were added before the old ones were removed.) Of note is that the draining uncovered around 4000 files that were listed in the Castor Nameserver as being of zero size, but that did occupy space on disk. None of these files were recorded by Alice as present at RAL (so there was no data loss), but they did represent some dark data.
Errata and kernel updates are being deployed on worker nodes.

Declared in the GOC DB

Deploy Transfer Manager for Castor. Dates for this are now in the GOCDB.
- 30 May 2012 09:00-11:00 Gen
- 31 May 2012 09:00-11:00 CMS
- 07 Jun 2012 09:00-11:00 ATLAS

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

The following items will particularly affect services:

Listing by category:

Databases:
- Regular Oracle "PSU" patches are pending.
- Switch LFC/FTS/3D to new Database Infrastructure.
- Update LFC/FTS databases to Oracle 11.
Castor:
- Update the Castor Information Provider (CIP) (Wednesday 6th June)
- Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
- Upgrade to version 2.1.12.
Networking:
- One step still remains in extending the IP address range used for disk servers that will use the OPN.
- Install new Routing & Spine layers for Tier1 network.
- Main RAL network updates - early summer. There is now a firm date of 19th June for upgrading the Site Access Router. The replacement of the UKLight Router will follow.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- A reconfiguration and maintenance operation will be scheduled for lcgwms03 (non-LHC WMS) from 1-7 June.
- Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
Infrastructure:
- The electricity supply company plan to work on the main site power supply for 6 months commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. This work has been postponed from the 14th May.

Entries in GOC DB starting between 23rd and 30th May 2012

There were no unscheduled outages during the last week.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
Castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k.	SCHEDULED	OUTAGE	30/05/2012 09:00	30/05/2012 11:00	2 hours	downtime to upgrade GEN castor instance to use Transfer Manager
srm-lhcb.gridpp.rl.ac.uk	SCHEDULED	OUTAGE	28/05/2012 09:00	28/05/2012 11:00	2 hours	downtime to upgrade LHCb castor instance to use Transfer Manager

Open GGUS Tickets


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
68853	Red	Less Urgent	On hold	2011-03-22	2012-04-20		Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82100	Red	Less Urgent	On hold	2012-05-10	2012-05-28	SNO+	default se
82496	Amber	Less Urgent	In Progress	2012-05-24	2012-05-29	T2K	Cannot delegate proxies to FTS
