Tier1 Operations Report 2011-01-26


RAL Tier1 Operations Report for 26th January 2011

Review of Issues during the week from 19th to 26th January 2011.

  • As reported at last week's meeting, the OPN link to CERN failed over to the backup link at around 7.30am on the 19th January. This was caused by a broken fibre and the link was repaired (and the traffic switched back) the following morning.
  • On Friday (21st Jan) afternoon there was a problem with gdss498 (AtlasSimstrip - MCDisk), which showed up in the Atlas dashboard and FTS display. The system was put in 'read only' mode for around 100 minutes while investigations took place. The problem related to checksumming and was traced to a configuration error, which was corrected.
  • There was a problem over the weekend (22nd/23rd) with Atlas migrations to tape, which reached a backlog of 4000 pending. A configuration problem was found and fixed on Monday morning, which resolved the issue.
  • By Monday (24th Jan) a large backlog of LHCb batch jobs had built up, with around 10,000 jobs in the queue. Following contact with LHCb this was traced to a configuration problem on the CE that confused LHCb's job submission feedback. This was resolved late on Tuesday (25th). Note that we have been running a good number of LHCb jobs (up to the current maximum of 1500) for part of this last week. Investigations on Wednesday (26th) suggest the same problem has affected Alice, with over 2000 jobs queued.
  • On Thursday 23rd December GDSS337 (GenTape) failed. There was only one un-migrated file (for T2K) on it. The server was returned to production today (26th Jan) following replacement of memory.
  • Tuesday (25th Jan): WMS03 (the non-LHC WMS) was found to be unresponsive in the morning; the service had been unavailable overnight Monday into Tuesday.
  • Wednesday 19th Jan: Glite update applied to site-bdii nodes.
  • Sat/Sun 22/23 Jan: The planned power outage in the Atlas building went OK, with no disruption to services other than that announced prior to the work.
  • Monday 24th Jan: Detailed changes were made to the batch configuration to enable scheduling by node. A corresponding change was made on the CEs on 26th January.

Current operational status and issues.

  • GDSS496 (CMSFarmRead) was taken out of production on Thursday 6th January following a problem. There were two un-migrated files on this server, which had to be declared as lost to CMS (13th Jan). The system is repeating acceptance testing before being returned to production. A Post Mortem will be prepared for this failure.
  • Friday 14th Jan: All three partitions on GDSS283 (CMSFarmRead), which had failed on 25th December, were found to be corrupt, resulting in the loss of 30 CMS files. A Post Mortem has been prepared for this data loss incident. See:
    https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101225_CMS_Disk_Server_GDSS283_Data_Loss
    A replacement server is being prepared to go into CMSFarmRead.

  • Over the last couple of weeks we have been seeing a problem where files with bad checksums (resulting from failed, i.e. incomplete, transfers into Castor) cause migrations to tape to block. This is being worked around by contacting LHCb, who delete the relevant bad files (see the illustrative sketch after this list). A fuller resolution is included in the next version of Castor, still to be rolled out. LHCb are also reviewing how their systems delete files after failed transfers.
  • We are aware of a problem with the Castor job manager, which can occasionally hang. This happened once this last week (on Sunday 23rd January, for Atlas) and remains an open issue. The service recovers by itself after around 30 minutes. Work is ongoing to automate the capture of more diagnostic information when this occurs.
  • We have previously reported a problem with disk servers becoming unresponsive. This remains an ongoing issue, although there were no instances of it this last week. We believe we understand the problem and are testing the fix.
  • Transformer TX2 in R89 is still out of use.
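
The checksum problem described above hinges on the checksum recorded when a file was transferred no longer matching the data actually held on disk. As a purely illustrative sketch (this is not the Castor or LHCb tooling; the report does not name the checksum algorithm, so adler32, commonly used for grid transfers, is assumed, and the file path and expected value below are hypothetical), the following Python fragment shows how a file's checksum could be recomputed and compared against the catalogued value to spot such bad files:

 import zlib

 def adler32_of_file(path, chunk_size=1024 * 1024):
     """Recompute the adler32 checksum of a file, reading it in chunks."""
     checksum = 1  # adler32 starting value
     with open(path, "rb") as handle:
         for chunk in iter(lambda: handle.read(chunk_size), b""):
             checksum = zlib.adler32(chunk, checksum)
     return format(checksum & 0xFFFFFFFF, "08x")

 # Hypothetical usage: 'expected' would come from the checksum recorded when the
 # transfer completed; the path below is illustrative only.
 expected = "0a1b2c3d"
 actual = adler32_of_file("/castor/example/lhcb/datafile")
 if actual != expected:
     print("checksum mismatch - candidate for deletion and re-transfer")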

Declared in the GOC DB

  • 30/31 January: Batch system drain and outage while Castor is down for database updates.
  • 31st January: Castor stop during the upgrade of the Oracle databases. We will also make use of this Castor/batch stop to:
    • Apply OS updates to the batch server.
    • Carry out a network (VLAN) reconfiguration in preparation for making more OPN addresses available to the Tier1 and for doubling the link to the tape systems.
  • 31st Jan - 1st Feb: CMS Castor down for the update to a 64-bit OS on the disk servers and an Oracle update to the Castor database.
  • 3rd February: At Risk on Castor while Puppetmaster replaced.

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Upgrade to 64-bit OS on Castor disk servers to resolve checksumming problem. Possible dates for this are:
    • GEN - To Be Decided.
  • Increase shared memory for OGMA, LUGH & SOMNUS (rolling change). (OGMA done).
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Upgrade all Oracle databases from version 10.2.0.4 to 10.2.0.5 (assuming this upgrade goes OK at CERN; so far only the Castor databases are scheduled).
  • Added after meeting: Castor 2.1.9-10 upgrade. Possibly late March.

Entries in GOC DB starting between 19th and 26th January 2011.

There was one unscheduled entry, an "At Risk" while a faulty disk on WMS03 was replaced.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | SCHEDULED | AT_RISK | 24/01/2011 10:00 | 24/01/2011 13:00 | 3 hours | All CEs At Risk during internal reconfiguration of the batch system to enable possibility of scheduling by node.
Whole site | SCHEDULED | AT_RISK | 22/01/2011 08:00 | 23/01/2011 16:00 | 1 day, 8 hours | Systems at risk during power work in building hosting networking equipment.
lcgic01 | SCHEDULED | OUTAGE | 21/01/2011 14:00 | 24/01/2011 11:00 | 2 days, 21 hours | System unavailable during electrical work in building over weekend.
lcgwms03 | UNSCHEDULED | AT_RISK | 21/01/2011 11:30 | 21/01/2011 12:30 | 1 hour | Disk replacement - RAID software failure.
site-bdii | SCHEDULED | AT_RISK | 19/01/2011 10:00 | 19/01/2011 13:00 | 3 hours | At Risk for applying gLite updates to the RAL site-level BDIIs.