Tier1 Operations Report 2012-05-16

From GridPP Wiki
Revision as of 10:43, 21 May 2012 by John kelly

RAL Tier1 Operations Report for 16th May 2012

Review of Issues during the week 9th to 16th May 2012
  • On Friday evening (11th May), from 6:00pm until approximately midnight, there was an issue with file transfers via the SRMs. The issue appears to have been related to CRLs from CERN not having been updated.
  • On Sunday May 13th at approx 07:30, the CEs lost contact with the batch server and jobs could not be submitted. The POC restarted torque and maui. We failed some SAM tests because of this.
  • On Monday May 14th, 13 disk servers (520TB) were deployed into atlasStripInput. They immediately developed problems and were removed from service.
  • On Tuesday May 15th there were file transfer issues. This was due to the CERN CRL not being updated.
Resolved Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) failed with FSProbe errors on Friday evening (4th May). It has been drained and removed from service.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue.
  • There have been no further problems in the last week on the UKLight-SAR link although we will continue to track this here.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for LHCb FTS transfers.
Ongoing Disk Server Issues
  • On Monday afternoon (14th May at 15:30) gdss374 (atlasTape - D0T1) developed FSProbe errors and was put into ReadOnly state. We subsequently discovered 34 files on this machine with bad checksums. These files have been declared lost.
  • Today (Wednesday 16th May) gdss467 (LHCbDst) was found to have memory errors. The machine had no files on it, so it has been removed from service for memory checks.
Notable Changes made this last week
  • Thursday 10th May - TapeGateway was deployed for the GEN Castor instance.
  • Tuesday 15th May - TapeGateway was deployed for the LHCb Castor instance.
  • Wednesday 16th May - TapeGateway was deployed for the Atlas and CMS Castor instances.
  • Wednesday 16th May - Upgrade of the non-LHC LFC (to v1.8.2).
Declared in the GOC DB
  • Wednesday 16th May - Short interruption to the Castor Atlas and CMS instances as they are reconfigured to use the newer Castor Tape Gateway
  • Wednesday 16th May - Outage to update non-LHC LFC to v1.8.2.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Deploy Transfer Manager for Castor. We now have proposed dates for this:
      • 28 May 2012 10:00-11:00 LHCb
      • 30 May 2012 10:00-11:00 Gen
      • 31 May 2012 10:00-11:00 CMS
      • 07 Jun 2012 10:00-11:00 ATLAS
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months, commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. This work has been postponed from 14th May.
Entries in GOC DB starting between 9th and 16th May 2012

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 16/05/2012 10:00 | 16/05/2012 11:00 | 1 hour | D/T to upgrade Atlas and CMS Castor instances to use Tape Gateway
lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 16/05/2012 10:00 | 16/05/2012 12:00 | 2 hours | gLite3.2 update to LFC v1.8.2
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2012 10:00 | 15/05/2012 10:35 | 35 minutes | downtime to upgrade LHCb castor instance to use tape gateway
srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-superb.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 10/05/2012 09:00 | 10/05/2012 10:00 | 1 hour | Move Castor GEN instance to use the Tape Gateway.
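The Duration column above can be cross-checked against the Start and End times. The following is a minimal sketch in Python, assuming the timestamps use the DD/MM/YYYY HH:MM layout shown in the table. The function name `outage_duration` and the shortened service labels are illustrative only and are not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the table (assumed DD/MM/YYYY HH:MM).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the length of an outage from its start and end timestamps."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    if finish < begin:
        raise ValueError("outage ends before it starts")
    return finish - begin


# Start/end pairs copied from the table; the keys are shortened labels.
outages = {
    "Atlas and CMS SRMs": ("16/05/2012 10:00", "16/05/2012 11:00"),
    "LFC": ("16/05/2012 10:00", "16/05/2012 12:00"),
    "LHCb SRM": ("15/05/2012 10:00", "15/05/2012 10:35"),
    "GEN SRMs": ("10/05/2012 09:00", "10/05/2012 10:00"),
}

for label, (start, end) in outages.items():
    minutes = int(outage_duration(start, end).total_seconds() // 60)
    print(f"{label}: {minutes} minutes")
```

Summing the four values gives the total scheduled downtime declared for the week, counted per GOC DB entry rather than per service.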


Open GGUS Tickets
GGUS ID Level Urgency State Creation Last Update VO Subject
68853 Red Less Urgent On hold 2011-03-22 2012-04-20 Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82100 Yellow Less Urgent In progress 2012-05-10 2012-05-14 snoplus.snolab.ca default se
82148 Team top priority In progress 2012-05-11 2012-05-16 Atlas RAL-LCG2: failed to contact on remote SRM