Tier1 Operations Report 2011-05-18
RAL Tier1 Operations Report for 18th May 2011
Review of Issues during the week from 11th to 18th May 2011.
- On Thursday (12th May) GDSS212 AtlasScratchDisk (D1T0) was taken out of service with a read-only file system. It was returned to production the following day after its RAID card had been replaced.
- On Monday morning (16th) there was a problem with the database systems behind Castor resulting in most of Castor not working for around an hour.
- On Monday afternoon (16th) GDSS432 AtlasDataDisk (D1T0) was taken out of production for just under an hour to replace memory.
- During the afternoon of Tuesday 10th May there was an outage of the LFC and FTS. During a scheduled At Risk for a routine Oracle update the services were unable to reconnect to the database. This was traced to an Access Control List problem on the databases. A GGUS ticket was received from Atlas. A SIR (Post Mortem) has been requested and is in preparation. (Note: this did not affect the LHCb LFC.) See:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110510_LFC_Outage_After_DB_Update
- Changes made this last week:
- Monday 16th May - Oracle patches for Castor databases behind CMS & GEN instances.
- Tuesday 17th May - Add xrootd libraries to worker nodes.
- Atlas have progressively migrated to use CVMFS for production batch jobs.
Current operational status and issues.
- On Saturday (30th April) FSPROBE reported a problem on GDSS293 (CMSFarmRead - D0T1), which was removed from production. The server was put back in service on Sunday (1st May). On Tuesday (3rd May) the server was put into draining mode ahead of further investigations. The draining completed on 4th May, when the server was removed from production for further investigation & tests.
- GDSS294 (CMSFarmRead - D0T1) failed with a read-only file system on the evening of Monday 9th May. It is currently out of production.
- There have been some intermittent problems with CVMFS. This has caused some failures of LHCb SAM Tests (on the CE). This is being investigated.
- We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly.
- Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). The investigation into this is still ongoing.
Declared in the GOC DB
- Thursday 12th - Thursday 19th May - Drain & Maintenance of WMS03.
- Tuesday 17th - Thursday 19th May - Drain & Maintenance of CE03.
Advance warning:
The following items are being discussed and are still to be formally scheduled:
- Updates to Site Routers (the Site Access Router and the UKLight router) are required.
- Upgrade Castor clients on the Worker Nodes to version 2.1.10.
- Address permissions problem regarding Atlas User access to all Atlas data.
- Minor Castor update to enable access to T10KC tapes.
- Networking upgrade to provide sufficient bandwidth for T10KC tapes.
- Microcode updates for the tape libraries are due.
- Switch Castor and LFC/FTS/3D to new Database Infrastructure.
Entries in GOC DB starting between 11th and 18th May 2011.
There were two unscheduled entries in the GOC DB for this period, both 'Warnings' (At Risk). In both cases the GOC DB entries were created later than they should have been. One case (the Oracle update) was an operational oversight; in the other, the decision to proceed was only made the day before.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcgce03 | SCHEDULED | OUTAGE | 17/05/2011 09:00 | 19/05/2011 17:00 | 2 days, 8 hours | Drain and Reinstallation of CE |
Whole site | UNSCHEDULED | WARNING | 17/05/2011 09:00 | 17/05/2011 10:00 | 1 hour | At Risk during an internal network configuration change. |
srm-alice, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 16/05/2011 11:00 | 16/05/2011 15:00 | 4 hours | Castor services At Risk during application of quarterly Oracle updates to back end databases behind CMS and GEN instances |
lcgwms03 | SCHEDULED | OUTAGE | 12/05/2011 16:00 | 19/05/2011 15:00 | 6 days, 23 hours | lcgwms03 (non-LHC WMS) drain and maintenance |
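
The Duration column can be cross-checked against the Start and End times with a short script. This is only an illustrative sketch: the `ENTRIES` dictionary, its shortened "Castor (CMS and GEN)" label and the `duration` helper are assumptions introduced here, not part of the GOC DB or any Tier1 tooling, and the timestamps are copied by hand from the table above.

```python
from datetime import datetime

# Start/end times copied from the table above (format DD/MM/YYYY HH:MM).
# The dictionary keys are labels chosen for this sketch only.
ENTRIES = {
    "lcgce03": ("17/05/2011 09:00", "19/05/2011 17:00"),
    "Whole site": ("17/05/2011 09:00", "17/05/2011 10:00"),
    "Castor (CMS and GEN)": ("16/05/2011 11:00", "16/05/2011 15:00"),
    "lcgwms03": ("12/05/2011 16:00", "19/05/2011 15:00"),
}

FMT = "%d/%m/%Y %H:%M"


def duration(start, end):
    """Return (days, hours) between two timestamps given in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    # timedelta stores whole days separately from the remaining seconds.
    return delta.days, delta.seconds // 3600


for service, (start, end) in ENTRIES.items():
    days, hours = duration(start, end)
    print(f"{service}: {days} days, {hours} hours")
```

Comparing the printed values with the Duration column is a quick way to catch a mistyped start or end time before an entry is declared.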