Tier1 Operations Report 2011-03-16


RAL Tier1 Operations Report for 16th March 2011

Review of Issues during the week from 9th to 16th March 2011.

  • On Wednesday 9th March GDSS120 (LhcbRawRdst, D0T1) was taken out of production and put into a draining state following a large number of Castor errors. The cause is believed to have been a user error, and the system was returned to production the following day.
  • lcgft-atlas (Atlas frontier server) was unavailable overnight from Thursday to Friday (10-11 March) owing to an operational error.
  • On Saturday morning, 12th March, a problem with the WMS used by the Regional Nagios caused an apparent loss of availability of our CEs.
  • On Saturday 12th March there was a problem with the top BDIIs. The root cause has not been established, but look-ups from RAL to a BDII at CERN were very slow.
  • On Monday afternoon, 14th March, there was a problem with the FTS caused by a database issue. Following this there were problems connecting to both srm-atlas and srm-cms, and FTS transfers continued to fail. Both the Atlas and CMS SRMs were stopped and their queues cleared out before being restarted. At this point the problem was resolved, although the underlying cause has not been found.
  • Summary of changes made during the last week:
    • On Wednesday 9th the Castor Nameserver was successfully upgraded to version 2.1.10.
    • While the above Castor upgrade was in progress, one of the disk arrays used by the Castor Oracle databases was re-cabled to have an isolating transformer in its power feed.
    • Atlas file re-naming has taken place (following the disk pool merge). A problem with file ownership caused by the rename was resolved over the weekend.
    • On Thursday 10th March the gLite update was successfully applied to the Site BDIIs.
    • On Friday 11th March a modification was made to the Castor configuration of the disk servers to roll out a fix for a logging problem that had been causing some server hangs. The fix had been running successfully on the CMS disk servers for some weeks.
    • On Tuesday 15th March there was an outage of the Tier1 services during a successful site network update. At the same time the remaining disk arrays used by the Oracle databases were re-cabled to have isolating transformers in their power feeds. Initial indications are that this has resolved the power-related problems seen by these units.

Current operational status and issues.

  • On Thursday 14th March one of the top-BDII nodes failed to boot correctly after it was accidentally power cycled. Some lookups would have failed until the DNS was updated to remove this node from the top-BDII set. The cause of the configuration problem that prevented a successful restart is being investigated.
  • On Saturday 12th March gdss188 (atlasSimRaw – D0T1) had a read-only file system. It was taken out of production and the cause is being investigated. There were no un-migrated files on the server, so it has not caused any loss of file availability.
  • On Sunday 13th March one of the three LHCb SRM systems (lcgsrm0660) failed. It was removed from the SRM triplet by an emergency DNS change (see the sketch after this list). The system is still under investigation.
  • On Tuesday evening, 15th March, a read-only file system was reported on GDSS426 (AtlasDataDisk, D1T0). The server is currently out of production for investigation.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). This is being investigated.
  • We are aware of a problem whereby the Castor Job Manager can occasionally hang. The service recovers by itself after around 30 minutes. Automated capture of additional diagnostic information is in place and we are awaiting the next occurrence.
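
The two DNS-related items in the list above (the failed top-BDII node and the failed LHCb SRM) rely on the same mechanism: the service name is a round-robin DNS alias that resolves to several hosts, and a broken host is taken out of service by removing its record from the alias. The sketch below is illustrative only; the alias name, port and simple TCP liveness test are assumptions, not the real RAL service details.

  import socket

  # Hypothetical round-robin alias and service port - placeholders, not the
  # actual names used for the top-BDII set or the LHCb SRM triplet.
  ALIAS = "srm-example.gridpp.rl.ac.uk"
  PORT = 8443

  def hosts_behind_alias(alias):
      """Return the distinct IP addresses the alias currently resolves to."""
      return sorted({info[4][0] for info in socket.getaddrinfo(alias, None)})

  def host_responds(ip, port, timeout=5):
      """Crude liveness check: can a TCP connection be opened to the service port?"""
      try:
          conn = socket.create_connection((ip, port), timeout=timeout)
          conn.close()
          return True
      except socket.error:
          return False

  if __name__ == "__main__":
      # Any address that fails the check is a candidate for removal from the alias.
      for ip in hosts_behind_alias(ALIAS):
          state = "OK" if host_responds(ip, PORT) else "NOT RESPONDING"
          print("%s: %s" % (ip, state))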

Declared in the GOC DB

  • None.

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Castor upgrades (to version 2.1.10) - possibly late March (27/28/29 March, during the technical stop). Details of the dates are for discussion at this meeting.
    • Updates to the Castor clients on the Worker Nodes, and the Atlas request to add xrootd libraries to the worker nodes.
  • Switch Castor to new Database Infrastructure.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change; OGMA done) - see the sketch after this list.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Small isolating transformers have been received that should eliminate the electrical noise seen by the disk arrays hosting the Oracle databases. The plan was to install them for one of the Castor systems on 8th March (during the Castor stop - now done), with the others during the networking intervention on 15th March (also now done).
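
The shared-memory item above is a kernel-level change on the database nodes. The report does not say which parameters are being raised, so the sketch below is a minimal illustration only, assuming the usual Linux kernel shared-memory limits (kernel.shmmax / kernel.shmall) that Oracle-backed services depend on; the target values are hypothetical.

  # Minimal sketch: report current kernel shared-memory limits against assumed targets.
  # The parameter choice and target values are assumptions, not the actual RAL settings.
  TARGETS = {
      "/proc/sys/kernel/shmmax": 8 * 1024 ** 3,  # hypothetical 8 GiB maximum segment size, in bytes
      "/proc/sys/kernel/shmall": 4 * 1024 ** 2,  # hypothetical total shared memory, in pages
  }

  def check_shared_memory():
      """Print each setting and whether it already meets the (assumed) target."""
      for path, target in sorted(TARGETS.items()):
          with open(path) as handle:
              current = int(handle.read().strip())
          state = "OK" if current >= target else "needs raising"
          print("%s: current=%d target=%d -> %s" % (path, current, target, state))

  if __name__ == "__main__":
      check_shared_memory()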

Entries in GOC DB starting between 9th and 16th March 2011.

There were no unscheduled entries during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | OUTAGE | 15/03/2011 07:30 | 15/03/2011 13:00 | 5 hours and 30 minutes | Whole site outage for site networking upgrades.
All CEs | SCHEDULED | OUTAGE | 14/03/2011 19:30 | 15/03/2011 07:30 | 12 hours | Drain of batch system ahead of outage for site networking upgrades.
site-bdii.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 10/03/2011 10:00 | 10/03/2011 13:00 | 3 hours | gLite updates on the site BDIIs.
lfc-atlas.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 10/03/2011 09:00 | 11/03/2011 15:00 | 1 day, 6 hours | At Risk while renaming Atlas files in bulk.
All Castor (all SRM endpoints) | SCHEDULED | OUTAGE | 09/03/2011 08:00 | 09/03/2011 15:00 | 7 hours | Update to Castor Nameserver.
All CEs | SCHEDULED | OUTAGE | 08/03/2011 20:00 | 09/03/2011 15:00 | 19 hours | Batch stop for Castor Nameserver update.
lcgic01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/03/2011 12:00 | 01/04/2011 01:00 | 29 days, 12 hours | RGMA registry to be decommissioned.