Tier1 Operations Report 2011-06-22
RAL Tier1 Operations Report for 22nd June 2011
Review of Issues during the week from 15th to 22nd June 2011.
- Friday (17th) GDSS256 (AliceTape) had a read-only file system and has been taken out of production pending investigation. There were no migration candidates on the system.
- Sunday (19th) A problem was seen on the LHCb SRMs; it was fixed by a restart.
- Sunday evening (19th) There were problems with the Top BDII, with CERN information missing.
- Monday (20th) GDSS120 (LHCbRawRdst) had a file system fault and was replaced by another server (GDSS163).
- Monday afternoon (20th) The CMS LSF machine was also turned off by mistake, resulting in a short outage for srm-cms. (Note added after the meeting.)
- Tuesday (21st June) Problem with srm-cms caused by a deadlock on the (CMS) SRM database.
- Changes made over the last week:
- Last Thursday (16th) Oracle and system parameters were updated on the remaining Castor database nodes. Although it is too early for confirmation, this should fix a problem with gathering the database statistics.
- Monday (20th) At LHCb's request, the maximum wall clock limit on our 6GB queue was increased by 20% (to 120 hours).
Current operational status and issues.
- LHCb are experiencing some issues staging files; these are being investigated by the Castor Team.
- The following points are unchanged from previous reports:
- Since CE07 was taken out of service for re-installation as a CREAM CE, neither Alice nor LHCb includes any CEs in their availability calculations. As a result our LHCb availability is no longer affected by intermittent problems with the LHCb CVMFS test, although we know this test does still fail from time to time.
- We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly.
- Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated.
Declared in the GOC DB
- Thursday 16th June: Castor (All SRMs) At Risk during OS and Oracle parameter changes needed to follow up on a problem with gathering Oracle statistics. (Being rolled out to all nodes in the RACs following tests last week.)
Advance warning:
The following items are being discussed and are still to be formally scheduled:
- A newer version of the Castor clients (2.1.10) will be rolled out across the worker nodes within the next couple of weeks.
- Partial update to Castor in order to prepare for the higher capacity "T10KC" tapes. This is likely to take place on Tuesday 5th July during the LHC technical stop.
- Updates to Site Routers (the Site Access Router and the UKLight router) are required.
- Upgrade Castor clients on the Worker Nodes to version 2.1.10.
- Address permissions problem regarding Atlas User access to all Atlas data.
- Networking upgrade to provide sufficient bandwidth for T10KC tapes.
- Microcode updates for the tape libraries are due.
- Switch Castor and LFC/FTS/3D to new Database Infrastructure.
- Further updates to CEs: (Convert CE06 to a CREAM CE; gLite updates on CE09). Priority & order to be decided.
Entries in GOC DB starting between 15th and 22nd June 2011.
There has been one unscheduled entry in the GOC DB for this period. This is for last Wednesday's problem with the CMS Castor instance. (Actually reported last week.)
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
---|---|---|---|---|---|---
All Castor (all SRM end points) | SCHEDULED | WARNING | 16/06/2011 10:00 | 16/06/2011 12:00 | 2 hours | At Risk during OS and Oracle parameter changes needed to follow up with problem gathering Oracle statistics. (Being rolled out to all nodes in RACs following tests last week.)
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 15/06/2011 11:55 | 15/06/2011 13:25 | 1 hour and 30 minutes | We are investigating a problem with the scheduler within Castor for the CMS instance.