Tier1 Operations Report 2011-05-11

From GridPP Wiki

RAL Tier1 Operations Report for 11th May 2011

Review of Issues during the week from 4th to 11th May 2011.

  • Last Wednesday (4th May) many inbound file transfers for LHCb were failing. The LHCbDst service class was having problems because only one disk server had free space. In the early afternoon two more disk servers (gdss499, gdss513) were added to the service class and a couple of the full servers were drained. The situation then recovered.
  • There was a problem overnight Monday-Tuesday (9-10 May) where the Castor GEN instance was not working owing to a problem in the LSF scheduler with Castor.
  • GDSS206 (AtlasdataDisk - D1T0) was unavailable from Friday evening (6th) until Saturday afternoon. It had two disk failures and this was a precaution while the RAID array rebuilt one of the disks.
  • During the afternoon of Tuesday 10th May there was an outage of the LFC and FTS. During a scheduled At Risk for a routine Oracle update the services were unable to reconnect to the database. This was traced to an Access Control List problem on the databases. A GGUS ticket was received from Atlas. A SIR (Post Mortem) has been requested and is in preparation. (Note: this did not affect the LHCb LFC.)
  • During the last week some servers from AtlasStripInput & AtlasScratchDisk were drained for a while to redistribute free space. This operation is transparent to users and will be done from time to time as a background task.
  • Changes made this last week:
    • Castor GEN instance SRM updated to version 2.10-2 (Tuesday 10th May).
    • Oracle patches applied to Castor databases (except those behind the CMS & GEN instances, which were too busy to patch).
    • Oracle patches applied to 3D, FTS & LFC databases (Tuesday 10th May).
    • OS updates to FTS agent (FTS outage) (Tuesday 10th May).
    • Merging of LHCb Disk Pools (Wednesday 11th May).

Current operational status and issues.

  • On Saturday (30th April) FSPROBE reported a problem on GDSS293 (CMSFarmRead - D0T1) and the server was removed from production. It was put back in service on Sunday (1st May). On Tuesday (3rd May) the server was put into draining mode ahead of further investigations.
  • GDSS294 (CMSFarmRead - D0T1) failed with a read-only file system on the evening of Monday 9th May. It is currently out of production.
  • The load issues on the Atlas software server have continued. We have been running with the maximum number of Atlas production batch jobs, and the start rate held slightly down. Atlas plan to move to CVMFS any day now.
  • We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly.
  • Atlas have reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). The investigation into this is still ongoing.

Declared in the GOC DB

  • Thursday 12th - Thursday 19th May - Drain & Maintenance of WMS03.
  • Tuesday 17th - Thursday 19th May - Drain & Maintenance of CE03.

Advance warning:

  • Wednesday 11th May - Routine maintenance on R89 UPS.
  • Monday 16th May - Oracle patches for Castor databases behind CMS & GEN instances.
  • Tuesday 17th May - Add xrootd libraries to worker nodes.

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 4th and 11th May 2011.

There were no unscheduled entries in the GOCDB for this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgftm, lcgfts, lfc-atlas, lfc.gridpp.rl.ac.uk | SCHEDULED | WARNING | 10/05/2011 12:00 | 10/05/2011 14:00 | 2 hours | LFC and FTS services At Risk during application of Oracle patches to back end database.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 10/05/2011 10:00 | 10/05/2011 12:00 | 2 hours | FTS agent host OS update (includes drain of transfers for first hour).
Castor GEN instance (SRMs) | SCHEDULED | OUTAGE | 10/05/2011 10:00 | 10/05/2011 12:00 | 2 hours | Update of SRM to version 2.10-2.
lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | WARNING | 10/05/2011 09:00 | 10/05/2011 11:00 | 2 hours | At Risk on LHCb LFC and all 3D databases during application of quarterly Oracle patches.
All Castor (SRMs) | SCHEDULED | WARNING | 09/05/2011 11:00 | 09/05/2011 15:00 | 4 hours | Castor services At Risk during application of quarterly Oracle updates to back end databases.
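
The durations listed above can be cross-checked from the start and end timestamps. The following is a minimal sketch only, assuming the timestamps are in day/month/year 24-hour format as they appear in this report. The function names, the constant GOCDB_FORMAT and the shortened entry labels are illustrative and are not part of the GOC DB or any of its tools.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching the entries in this report
# (day/month/year, 24-hour clock). Hypothetical constant name.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the length of a downtime from its start and end strings."""
    began = datetime.strptime(start, GOCDB_FORMAT)
    finished = datetime.strptime(end, GOCDB_FORMAT)
    if finished < began:
        raise ValueError("end precedes start")
    return finished - began


def hours(duration: timedelta) -> float:
    """Convert a timedelta to a number of hours."""
    return duration.total_seconds() / 3600


# (label, start, end) copied from the entries above; labels are shortened
# for illustration only.
ENTRIES = [
    ("LFC/FTS At Risk", "10/05/2011 12:00", "10/05/2011 14:00"),
    ("FTS agent outage", "10/05/2011 10:00", "10/05/2011 12:00"),
    ("Castor GEN SRM outage", "10/05/2011 10:00", "10/05/2011 12:00"),
    ("LHCb LFC / 3D At Risk", "10/05/2011 09:00", "10/05/2011 11:00"),
    ("All Castor At Risk", "09/05/2011 11:00", "09/05/2011 15:00"),
]

if __name__ == "__main__":
    for label, start, end in ENTRIES:
        print(f"{label}: {hours(downtime_duration(start, end)):g} hours")
```

Run as a script, this prints one line per entry with the computed duration, which can be compared against the Duration column.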