Tier1 Operations Report 2011-06-08

RAL Tier1 Operations Report for 8th June 2011

Review of Issues during the week from 1st to 8th June 2011.

  • GDSS294 (CMSFarmRead - D0T1) failed with a read-only file system on the evening of Monday 9th May. It was returned to production on Tuesday (7th June) after its RAID controller card was replaced.
  • GDSS365 (CMSTemp - D1T0) reported a read-only filesystem and was taken out of production on 20th May. It was returned to production on Tuesday (7th June) after acceptance testing was re-done.
  • Thursday 2nd June: GDSS135 (AtlasFarm - D0T1) was out of production for around 14 hours (early evening on Wednesday 1st June until the morning of Thursday 2nd June). The system suffered a kernel panic and the RAID array was degraded.
  • Wednesday - Thursday (1st - 2nd June) and ongoing since: problems seen with the Castor CMS instance. These were triggered by a specific CMS workflow, although it is not yet understood why it causes the problems seen.
  • Overnight Thursday - Friday (2nd - 3rd June) we received a number of call-outs on our Top-BDIIs: CERN-PROD site information was not found in the top-level BDII. (A way of checking for this is sketched after this list.)
  • On Monday (6th June) 28 corrupt Atlas files were found. These were old files that gave problems while Atlas were copying files from stripInput to simRaw (in order to archive files which were previously only on disk).
  • Changes made this last week:
    • Increased the job start rate for lhcb-pilot jobs.
    • Tuesday 7th June: change to the Castor garbage collection algorithm (to "LRU") for the LHCRawRdst service class.
    • Migration of CE07 from an LCG CE to a CREAM CE was completed. The system was returned to production on Monday (6th June).
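
Regarding the Top-BDII call-outs above: whether a given site's information is visible in a top-level BDII can be checked with a simple LDAP query against the standard GLUE 1.3 schema. The sketch below is illustrative only (the hostname is a placeholder and the ldap3 library is just one way to issue the query); it assumes the usual BDII LDAP port 2170 and an anonymous bind.

    from ldap3 import Server, Connection, ALL

    # Hostname is a placeholder; point this at the top-level BDII being monitored.
    server = Server("top-bdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # normal BDII queries use an anonymous bind

    # GLUE 1.3 site object for CERN-PROD, searched under the standard "o=grid" base.
    found = conn.search(
        search_base="o=grid",
        search_filter="(&(objectClass=GlueSite)(GlueSiteUniqueID=CERN-PROD))",
        attributes=["GlueSiteUniqueID"],
    )

    if found and conn.entries:
        print("CERN-PROD site information present")
    else:
        print("CERN-PROD site information missing from this top-level BDII")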

Current operational status and issues.

  • Since CE07 was taken out of service for re-installation as a CREAM CE, LHCb are not reporting any CE availability via GridView. As a result we no longer see the effect of intermittent problems with CVMFS (which still occur) on this availability view.
  • We are still seeing some intermittent problems with the site BDIIs. Until this is further understood the daemons are being restarted regularly. (A sketch of a check-and-restart approach is given after this list.)
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS appears to experience this as well (but between RAL and foreign T2s). The investigation into this is still ongoing.
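
The regular restarts of the site BDII daemons mentioned above could be driven by a small check-and-restart watchdog run from cron. The sketch below is only illustrative and is not the script actually in use at RAL; it assumes the standard "bdii" init script name and the usual LDAP port 2170, and restarts the daemon only if its port stops accepting connections.

    import socket
    import subprocess
    import sys

    BDII_HOST = "localhost"   # intended to run on the site BDII node itself
    BDII_PORT = 2170          # standard BDII LDAP port

    def bdii_responding(host, port, timeout=10):
        """Return True if the BDII LDAP port accepts a TCP connection."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if not bdii_responding(BDII_HOST, BDII_PORT):
        # "service bdii restart" assumes the standard init script name;
        # adjust to whatever service actually runs the daemons here.
        subprocess.call(["service", "bdii", "restart"])
        sys.exit(1)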

Declared in the GOC DB

  • CE08 is being drained and will be unavailable for a gLite update. Scheduled Tuesday - Tuesday (7th - 14th June).
  • At Risk on Castor for Atlas & CMS for an hour on Thursday morning, 9th June. OS and Oracle parameter changes are needed to follow up on the problem with gathering Oracle statistics. (An illustration of a statistics-gathering call is given below.)
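
For context on the statistics problem above: Oracle optimiser statistics for a schema are normally refreshed with the DBMS_STATS package, although the report does not state exactly which step is failing here. A minimal illustration of such a call from Python follows; the connection string and schema name are placeholders, not the real Castor database details.

    import cx_Oracle

    # Placeholders only; the real DSN and schema are site-specific.
    conn = cx_Oracle.connect("admin_user", "password", "db-host.example:1521/CASTORDB")
    cur = conn.cursor()

    # DBMS_STATS.GATHER_SCHEMA_STATS refreshes optimiser statistics for one schema.
    cur.callproc("DBMS_STATS.GATHER_SCHEMA_STATS", ["STAGER_SCHEMA"])

    conn.close()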

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs (convert CE06 to a CREAM CE; Quattorize CE05; gLite updates on CE09). Priority and order to be decided.

Entries in GOC DB starting between 1st and 8th June 2011.

There were no unscheduled entries in the GOCDB for this period.

Service   Scheduled?   Outage/At Risk   Start              End                Duration                         Reason
lcgce08   SCHEDULED    OUTAGE           07/06/2011 16:00   14/06/2011 12:00   6 days, 20 hours                 update of glite3.2 CREAM release
lcgce07   SCHEDULED    OUTAGE           24/05/2011 09:00   06/06/2011 10:30   13 days, 1 hour and 30 minutes   drain and decommission as lcg-CE