Tier1 Operations Report 2011-11-30
RAL Tier1 Operations Report for 30th November 2011
Review of Issues during the week 23rd to 30th November 2011.
- In the middle of last week we had a problem starting Alice batch work. Early on Thursday morning (24th Nov) Alice found and fixed a problem with the Alice VO box, which resolved the issue.
- On Thursday afternoon (24th Nov) we found a very low job start rate on the farm. This was traced to batch jobs that were in a queued state but already had a particular execution host allocated. That node was disabled in the batch system and the affected jobs were deleted, after which the job start rate rose back to normal levels.
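The symptom in the second item, jobs that stay queued while already tied to one execution host, can be illustrated with a minimal sketch. The `Job` record, the state codes and the host names below are hypothetical and are not taken from the Tier1 batch system; real job data would need to be parsed from the batch system's own status output.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    job_id: str
    state: str                 # assumed codes: "Q" = queued, "R" = running
    exec_host: Optional[str]   # execution host allocated to the job, if any


def stuck_jobs_by_host(jobs: Iterable[Job]) -> Counter:
    """Count jobs that are still queued yet already have an execution host."""
    return Counter(
        job.exec_host for job in jobs
        if job.state == "Q" and job.exec_host
    )


# Made-up sample data: two queued jobs pinned to the same node.
jobs = [
    Job("1001", "Q", "wn0123"),
    Job("1002", "Q", "wn0123"),
    Job("1003", "R", "wn0456"),
    Job("1004", "Q", None),
]

suspects = stuck_jobs_by_host(jobs)
for host, count in suspects.most_common():
    print(f"{host}: {count} queued job(s) with this host allocated")
```

A node that dominates this count would be a candidate for disabling in the batch system, as was done here.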
Resolved Disk Server Issues
- Gdss456 (AtlasDataDisk) had failed with a read-only file system on Wednesday 28th September. This server was replaced on 3rd November.
- On Tuesday morning (22nd Nov) gdss375 (AtlasTape D0T1) had two failed drives and was taken out of production. It was returned to production on Friday morning (25th Nov).
- On Tuesday (29th) the monitoring reported a problem with the xroot daemon on gdss569 (LHCbDst D1T0). This was traced to the clock being out by 90 seconds and fixed later that day.
Current operational status and issues.
- The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only one channel (Birmingham to RAL) remains a cause for concern.
- We continue to work with the Perfsonar network test system to understand some anomalies seen. The initial set-up was on virtual machines; hardware has now been obtained to run Perfsonar.
- CERN has reported issues with AFS callbacks to RAL worker nodes (29/11/2011). This is being investigated.
Ongoing Disk Server Issues
- GDSS296 (CMSFarmRead - D0T1) was set "read-only" on Monday (28th), and will be removed from production. This follows the daily "checksum-mismatch" checks flagging as corrupt four files that had been written the day before.
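As an illustration of what a daily checksum-mismatch check does, the following is a minimal sketch that recomputes a checksum for each file and compares it with a previously recorded value. The choice of Adler-32, the function names and the `recorded` mapping are assumptions made for this example; the report does not state which algorithm or tooling the Tier1 check uses.

```python
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file, read in chunks."""
    value = 1  # standard Adler-32 starting value
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def find_mismatches(recorded):
    """Return the paths whose current checksum differs from the recorded one.

    `recorded` is a hypothetical mapping of file path -> expected checksum,
    standing in for whatever catalogue holds the checksums stored at write time.
    """
    return [
        path for path, expected in recorded.items()
        if adler32_of(path) != expected
    ]
```

Any path returned by `find_mismatches` would be a file whose contents no longer match what was recorded when it was written.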
Notable Changes made this last week
- Roll-out of UMD versions of Top BDII & Site BDII well under way.
Forthcoming Work & Interventions
- Saturday 10th December. Replacement of some DNS servers at RAL. These are ones not mainly used by the Tier1. The two remaining DNS servers mainly used by the Tier1 will be updated in January.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced:
- Regular Oracle "PSU" patches are pending.
- There are also plans to move part of the cooling system onto the UPS supply. The use of temporary power arrangements means this should no longer require downtime of computer systems.
- Switch Castor and LFC/FTS/3D to new Database Infrastructure. The problems that caused the postponement of this migration are now understood and, apart from some detailed re-configuration, the migration should be ready to go at the start of the new year.
- Networking change required to extend range of addresses that route over the OPN.
- Address permissions problem regarding Atlas User access to all Atlas data.
- Replace hardware running Castor Head Nodes (aimed for end of year).
- Updates to the RAL DNS infrastructure (replacing DNS servers)
Entries in GOC DB starting between 23rd and 30th November 2011.
There were no entries in the GOC DB for this last week.
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
76877 | Green | Urgent | In progress | 2011-11-29 | 2011-11-29 | T2K | FTS transfers RALLCG2-VICTORIALCG2 |
76750 | Green | Very urgent | In progress | 2011-11-23 | 2011-11-29 | T2K | Jobs get aborted due to proxy(?) issues |
76735 | Green | Urgent | In progress | 2011-11-22 | 2011-11-25 | vo.londongrid.ac.uk | lcglb02 GSS error |
76564 | Amber | Very urgent | Waiting for reply | 2011-11-17 | 2011-11-29 | | geant4 jobs abort on lcgce05.gridpp.rl.ac.uk |
75395 | Red | Urgent | Unsolved | 2011-10-17 | 2011-11-28 | T2K | WMS 'jumping' (Set unsolved by L&B support.) |
74353 | Red | Very urgent | Waiting for reply | 2011-09-16 | 2011-11-22 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | Less urgent | On hold | 2011-03-22 | 2011-11-07 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |
68077 | Red | Less urgent | In progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published |
64995 | Red | Less urgent | In progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas |