RAL Tier1 Operations Report for 16th May 2012

Review of Issues during the week 9th to 16th May 2012

On Friday evening (11th May), from 6:00pm until approximately midnight, there was an issue with file transfers via the SRMs. The issue appears to have been related to CRLs from CERN not being updated.
On Sunday May 13th at approx 07:30, the CEs lost contact with the batch server and jobs could not be submitted. The POC restarted torque and maui. We failed some SAM tests because of this.
On Monday May 14th, 13 disk servers (520TB) were deployed into atlasStripInput. They immediately developed problems and were removed from service.
On Tuesday May 15th there were file transfer issues. This was due to the CERN CRL not being updated.

Resolved Disk Server Issues

GDSS607 (LHCbDst - D1T0) failed with FSProbe errors on Friday evening (4th May). It has been drained and removed from service.

Current operational status and issues

Investigations into an ongoing communications problem between the CEs and the batch server continue.
There have been no further problems on the UKLight-SAR link in the last week, although we will continue to track this here.
There is a known problem with the handling of some certificates within FTS that is currently causing problems for LHCb FTS transfers.

Ongoing Disk Server Issues

On Monday afternoon (14th May at 15:30) gdss374 (atlasTape - D0T1) developed FSProbe errors. It was put into ReadOnly state. Subsequently we discovered that there were 34 files on this machine with bad checksums. These files have been declared lost.
Today (Wednesday 16th May) gdss467 (LHCbDst) was found to have memory errors. The machine had no files on it, so it has been removed from service for memory checks.
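
As an illustration of the kind of check that finds files with bad checksums (as on gdss374), the sketch below recomputes a checksum and compares it with a recorded value. It assumes Adler-32 checksums, which this report does not state, and the names adler32_of_chunks, adler32_hex and find_bad_files are hypothetical helpers, not part of any Castor tooling.

```python
import zlib


def adler32_of_chunks(chunks):
    """Return the Adler-32 checksum of an iterable of byte chunks as 8 hex digits."""
    value = 1  # Adler-32 starts from 1, not 0
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def adler32_hex(path, chunk_size=1 << 20):
    """Checksum a file on disk, reading it in chunks to bound memory use."""
    with open(path, "rb") as fh:
        return adler32_of_chunks(iter(lambda: fh.read(chunk_size), b""))


def find_bad_files(expected):
    """Given {path: recorded checksum}, return the paths whose data no longer matches."""
    return sorted(
        path for path, recorded in expected.items()
        if adler32_hex(path) != recorded.lower()
    )
```

Reading in chunks keeps memory use flat for large files; the recorded checksums would come from whatever catalogue the site keeps, which is outside the scope of this sketch.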

Notable Changes made this last week

Thursday 10th May - TapeGateway was deployed for the GEN Castor instance.
Tuesday 15th May - TapeGateway was deployed for the LHCb Castor instance.
Wednesday 16th May - TapeGateway was deployed for the Atlas and CMS Castor instances.
Wednesday 16th May - Upgrade of the non-LHC LFC to v1.8.2.

Declared in the GOC DB

Wednesday 16th May - Short interruption to the Castor Atlas and CMS instances while they are reconfigured to use the newer Castor Tape Gateway.
Wednesday 16th May - Outage to update non-LHC LFC to v1.8.2.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

Databases:
- Regular Oracle "PSU" patches are pending.
- Switch LFC/FTS/3D to new Database Infrastructure.
- Update LFC/FTS databases to Oracle 11.
Castor:
- Deploy Transfer Manager for Castor. We now have proposed dates for this:
  - 28 May 2012 10:00-11:00 LHCb
  - 30 May 2012 10:00-11:00 Gen
  - 31 May 2012 10:00-11:00 CMS
  - 07 Jun 2012 10:00-11:00 ATLAS
- Update the Castor Information Provider (CIP). (Needs to be rescheduled.)
- Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
- Upgrade to version 2.1.12.
Networking:
- Install new Routing & Spine layers for Tier1 network.
- Main RAL network updates - early summer.
- Addition of caching DNSs into the Tier1 network.
Grid Services:
- Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
Infrastructure:
- The electricity supply company plans to work on the main site power supply for 6 months, commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. The start of this work has been postponed from 14th May.

Entries in GOC DB starting between 9th and 16th May 2012

There were no unscheduled outages during the last week.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
srm-atlas.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk,	SCHEDULED	OUTAGE	16/05/2012 10:00	16/05/2012 11:00	1 hour	D/T to upgrade Atlas and CMS Castor instances to use Tape Gateway
lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk,	SCHEDULED	OUTAGE	16/05/2012 10:00	16/05/2012 12:00	2 hours	gLite3.2 update to LFC v1.8.2
srm-lhcb.gridpp.rl.ac.uk,	SCHEDULED	OUTAGE	15/05/2012 10:00	15/05/2012 10:35	35 minutes	downtime to upgrade LHCb castor instance to use tape gateway
srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-superb.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk,	SCHEDULED	OUTAGE	10/05/2012 09:00	10/05/2012 10:00	1 hour	Move Castor GEN instance to use the Tape Gateway.
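
The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch of that arithmetic; the function name outage_minutes and the short labels in the dictionary are illustrative only and do not come from the GOC DB.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"  # day/month/year format used in the table above


def outage_minutes(start, end):
    """Return the length of a downtime window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Start/End pairs copied from the table; the keys are shorthand labels.
windows = {
    "Atlas and CMS Tape Gateway": ("16/05/2012 10:00", "16/05/2012 11:00"),
    "LFC v1.8.2 update": ("16/05/2012 10:00", "16/05/2012 12:00"),
    "LHCb Tape Gateway": ("15/05/2012 10:00", "15/05/2012 10:35"),
    "GEN Tape Gateway": ("10/05/2012 09:00", "10/05/2012 10:00"),
}

durations = {name: outage_minutes(s, e) for name, (s, e) in windows.items()}
total_minutes = sum(durations.values())
```

The computed values agree with the table: 60, 120, 35 and 60 minutes respectively.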

Open GGUS Tickets


GGUS ID	Level	Urgency	State	Creation	Last Update	VO	Subject
68853	Red	Less Urgent	On hold	2011-03-22	2012-04-20		Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82100	Yellow	Less Urgent	In progress	2012-05-10	2012-05-14	snoplus.snolab.ca	default se
82148	Team	top priority	In progress	2012-05-11	2012-05-16	Atlas	RAL-LCG2: failed to contact on remote SRM
