Tier1 Operations Report 2011-06-08

RAL Tier1 Operations Report for 8th June 2011

Review of Issues during the week from 1st to 8th June 2011.

  • GDSS294 (CMSFarmRead - D0T1) failed with a read-only file system on the evening of Monday 9th May. It was returned to production on Tuesday (7th June) after its RAID controller card was replaced.
  • GDSS365 (CMSTemp - D1T0) reported a read-only filesystem and was taken out of production on 20th May. It was returned to production on Tuesday (7th June) after acceptance testing was re-done.
  • Thursday 2nd June: GDSS135 (AtlasFarm - D0T1) was out of production for around 14 hours (early evening on Wednesday 1st June until the morning of Thursday 2nd June). The system suffered a kernel panic and the RAID array was degraded.
  • Wednesday - Thursday (1st - 2nd June) and ongoing since: problems seen with the Castor CMS instance. These were triggered by a specific CMS workflow, although it is not yet understood why it causes the problems seen.
  • Overnight Thursday - Friday (2nd - 3rd June) we received a number of call-outs on our Top-BDIIs: CERN-PROD site information was not found in the top-level BDII. (A way of checking for this is sketched after this list.)
  • On Monday (6th June) 28 corrupt Atlas files were found. These were old files that gave problems while Atlas were copying files from stripInput to simRaw (in order to archive files which were previously only on disk).
  • Changes made this last week:
    • Increased the job start rate for lhcb-pilot jobs.
    • Tuesday 7th June: change to the Castor garbage collection algorithm (to "LRU") for the LHCRawRdst service class.
    • Migration of CE07 from an LCG CE to a CREAM CE was completed. The system was returned to production on Monday (6th June).
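
Regarding the Top-BDII call-outs above: whether a given site's information is visible in a top-level BDII can be checked with a simple LDAP query against the standard GLUE 1.3 schema. The sketch below is illustrative only (the hostname is a placeholder and the ldap3 library is just one way to issue the query); it assumes the usual BDII LDAP port 2170 and an anonymous bind.

    from ldap3 import Server, Connection, ALL

    # Hostname is a placeholder; point this at the top-level BDII being monitored.
    server = Server("top-bdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # normal BDII queries use an anonymous bind

    # GLUE 1.3 site object for CERN-PROD, searched under the standard "o=grid" base.
    found = conn.search(
        search_base="o=grid",
        search_filter="(&(objectClass=GlueSite)(GlueSiteUniqueID=CERN-PROD))",
        attributes=["GlueSiteUniqueID"],
    )

    if found and conn.entries:
        print("CERN-PROD site information present")
    else:
        print("CERN-PROD site information missing from this top-level BDII")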

Current operational status and issues.

  • Since CE07 was taken out of service for re-installation as a CREAM CE, LHCb are not reporting any CE availability via GridView. As a result we no longer see the effect of intermittent problems with CVMFS (which still occur) on this availability view.
  • We are still seeing some intermittent problems with the site BDIIs. Until this is further understood the daemons are being restarted regularly. (A sketch of a check-and-restart approach is given after this list.)
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS appears to experience this as well (but between RAL and foreign T2s). The investigation into this is still ongoing.
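
The regular restarts of the site BDII daemons mentioned above could be driven by a small check-and-restart watchdog run from cron. The sketch below is only illustrative and is not the script actually in use at RAL; it assumes the standard "bdii" init script name and the usual LDAP port 2170, and restarts the daemon only if its port stops accepting connections.

    import socket
    import subprocess
    import sys

    BDII_HOST = "localhost"   # intended to run on the site BDII node itself
    BDII_PORT = 2170          # standard BDII LDAP port

    def bdii_responding(host, port, timeout=10):
        """Return True if the BDII LDAP port accepts a TCP connection."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not bdii_responding(BDII_HOST, BDII_PORT):
        # "service bdii restart" assumes the standard init script name;
        # adjust to whatever service actually runs the daemons here.
        subprocess.call(["service", "bdii", "restart"])
        sys.exit(1)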

Declared in the GOC DB

  • CE08 is being drained and will be unavailable for a gLite update. Scheduled Tuesday - Tuesday (7th - 14th June).
  • At Risk on Castor for Atlas & CMS for an hour on Thursday morning, 9th June. OS and Oracle parameter changes are needed to follow up on the problem with gathering Oracle statistics. (An illustration of a statistics-gathering call is given below.)
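
For context on the statistics problem above: Oracle optimiser statistics for a schema are normally refreshed with the DBMS_STATS package, although the report does not state exactly which step is failing here. A minimal illustration of such a call from Python follows; the connection string and schema name are placeholders, not the real Castor database details.

    import cx_Oracle

    # Placeholders only; the real DSN and schema are site-specific.
    conn = cx_Oracle.connect("admin_user", "password", "db-host.example:1521/CASTORDB")
    cur = conn.cursor()

    # DBMS_STATS.GATHER_SCHEMA_STATS refreshes optimiser statistics for one schema.
    cur.callproc("DBMS_STATS.GATHER_SCHEMA_STATS", ["STAGER_SCHEMA"])

    conn.close()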

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs (convert CE06 to a CREAM CE; Quattorize CE05; gLite updates on CE09). Priority and order to be decided.

Entries in GOC DB starting between 1st and 8th June 2011.

There were no unscheduled entries in the GOCDB for this period.

Service   Scheduled?   Outage/At Risk   Start              End                Duration                         Reason
lcgce08   SCHEDULED    OUTAGE           07/06/2011 16:00   14/06/2011 12:00   6 days, 20 hours                 update of glite3.2 CREAM release
lcgce07   SCHEDULED    OUTAGE           24/05/2011 09:00   06/06/2011 10:30   13 days, 1 hour and 30 minutes   drain and decommission as lcg-CE