Tier1 Operations Report 2011-03-23

RAL Tier1 Operations Report for 23rd March 2011

Review of Issues during the week from 16th to 23rd March 2011.

  • In the early hours of Friday 18th March one of the network switch stacks failed. Staff attended on site and resolved the problem, which had lasted around 3 hours in total. The services degraded during the problem were the front ends to the fts and lfc-atlas.
  • On Friday evening disk server gdss460 (AliceTape) failed with a read-only file system. There were no files awaiting migration to tape on this server. It was returned to production on Tuesday (22nd March).
  • On Saturday 12th March gdss188 (AtlasMCTape - D0T1) had a read-only file system and was taken out of production. It was returned to production this morning (Wednesday 23rd). There were no unmigrated files on the server, so it did not cause any loss of file availability.
  • On the evening of Saturday 19th March there were load issues on the Atlas software server. The total number of Atlas batch jobs, along with the rate of job starts, was throttled back. There was a recurrence of this yesterday (Tuesday 22nd March).
  • Summary of changes made during the last week:
    • During the outage a week ago (15th March) the parameter change to increase the shared memory for LUGH & SOMNUS was made.
    • On Tuesday 22nd March (during a pause of batch work for a site network router reboot) the Castor client software on the Worker Nodes was updated to version 2.1.9.

Current operational status and issues.

  • On Thursday 14th March one of the top-bdii nodes failed to boot correctly after it was accidentally power cycled. Some lookups would have failed until the DNS was updated to remove this node from the top-bdii set. The cause of the configuration problem that stopped a successful restart is understood, although a fix has not yet been implemented.
  • On Sunday 13th March one of the three LHCb SRM systems (lcgsrm0660) failed. It was removed from the SRM triplet by an emergency DNS change. A replacement machine has been allocated and is being prepared.
  • On Tuesday evening 15th March a read-only file system was reported on GDSS426 (AtlasDataDisk - D1T0). The server was put back into production in read-only mode during the afternoon of Wednesday 16th March. This morning (23rd March) it was put into draining mode. Once draining has completed it will be removed from production for further tests.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). This is being investigated.
  • We are aware of a problem with the Castor Job Manager, which can occasionally hang. The service recovers by itself after around 30 minutes. Automated capture of more diagnostic information is in place and we are awaiting the next occurrence.
  • We note that the introduction of isolating transformers into the power feeds to the disk arrays for the Oracle Databases appears to have been successful.

Declared in the GOC DB

  • None.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Castor upgrades (to version 2.1.10) - Current proposed dates are Monday 28th March (CMS); Wednesday 30th March (all other Castor instances). To be confirmed at this meeting.
  • Atlas request to add xrootd libraries to worker nodes.
  • Switch Castor to new Database Infrastructure.
  • Address permissions problem regarding Atlas User access to all Atlas data.

Entries in GOC DB starting between 16th and 23rd March 2011.

There were two unscheduled entries this last week, both linked to the failure of a network switch stack overnight Thursday to Friday (17th-18th March).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 22/03/2011 07:50 | 22/03/2011 10:00 | 2 hours and 10 minutes | At Risk during (and following) reboot of site network router.
lfc-atlas.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 18/03/2011 01:00 | 18/03/2011 09:12 | 8 hours and 12 minutes | Two out of three LFC machines are unavailable
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 18/03/2011 01:00 | 18/03/2011 09:13 | 8 hours and 13 minutes | Two out of four machines in the DNS alias are unavailable
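
The Duration values above can be cross-checked from the Start and End timestamps. The following is a minimal Python sketch, assuming the DD/MM/YYYY HH:MM timestamp format shown in the entries. The function name downtime and the entries dictionary are illustrative only and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Timestamp layout assumed from the GOC DB entries: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"


def downtime(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as 'H hours and M minutes'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


# Start and End values copied from the three entries in this report.
entries = {
    "Whole Site": ("22/03/2011 07:50", "22/03/2011 10:00"),
    "lfc-atlas.gridpp.rl.ac.uk": ("18/03/2011 01:00", "18/03/2011 09:12"),
    "lcgfts.gridpp.rl.ac.uk": ("18/03/2011 01:00", "18/03/2011 09:13"),
}

for service, (start, end) in entries.items():
    print(f"{service}: {downtime(start, end)}")
```

Run as written, this reproduces the three Duration values listed for these entries.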