Tier1 Operations Report 2011-07-06

RAL Tier1 Operations Report for 6th July 2011

Review of Issues during the week from 29th June to 6th July 2011.

  • Tuesday 5th July: During the planned downtime (which included the FTS) there was a problem with connectivity to the disk arrays on the FTS database ("somnus"), and the FTS database failed over to another node in the Oracle RAC. As the FTS was stopped at the time, this did not cause any problems. The FTS database was put back on its ‘correct’ node during the afternoon.
  • The scheduled outages went OK on Tuesday 5th July (see list of changes below). There was a problem restarting LHCb batch work: the batch scheduler (Maui) change made to drain LHCb batch work 24 hours ahead of the intervention had been left in place after the outages finished. This was resolved during the afternoon.
  • Wednesday 6th July: There was a problem starting Alice batch jobs, which appears to have been caused by stale (old) jobs in the batch queue.
  • On Friday (1st July) a user (from the Fusion VO) was banned because their batch jobs were causing problems by creating large files in /tmp (see the illustrative check sketched after this list). Following communication with the VO and the user, the user was un-banned on Monday (4th July).
  • Work is ongoing with LHCb regarding use of xrootd. CMS are now using this.
  • Disk Server Issues:
    • Thursday 30th June: A problem was reported with gdss313 (AtlasDataDisk - D1T0). The machine was not properly configured in LSF and the data on it was not available. This had almost certainly been the case for some time (months) but went undiscovered until data was needed from that server.
    • On Tuesday (28th June) gdss354 (AtlasDataDisk - D1T0), which had previously had a drive replaced, encountered problems. After having its disk controller replaced it was returned to production in 'Read only' mode on Tuesday (28th). However, further problems were encountered on Thursday 30th June and the server has been draining since then.
  • The recent intervention on lcgce06 (drain and decommissioning as an lcg-CE) had to be cancelled because of a problem (with accounts and groups) identified on the existing LHC CREAM-CEs, mainly affecting LHCb and CMS. This was resolved at the end of last week, and the lcgce06 intervention will be re-scheduled.
  • Changes made this last week:
    • On Tuesday 5th July the following changes were made:
      • Update to Castor version 2.1.10-1 in order to prepare for the higher capacity "T10KC" tapes.
      • Update to the UKLight & Site Access Routers and intervention on the problematic link between the Site Access Router and the Firewall.
      • Doubling of the link between the C300 and one of the switch stacks within the Tier1.
      • Move many batch worker nodes to a new IP address range and apply CVMFS update.
      • Quattorization of the CMS Squids.
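
As an illustration of the /tmp issue noted above (the banned Fusion VO user), the following is a minimal sketch of the kind of check that would flag batch jobs writing very large files to local disk. It is not the Tier1's actual tooling: the 1 GB threshold, the scan path and the plain-text reporting are assumptions made for illustration.

  #!/usr/bin/env python
  # Minimal sketch of a /tmp space check of the kind that would catch batch jobs
  # writing very large files to local disk on a worker node. The 1 GB threshold,
  # the scan path and the print-based reporting are illustrative assumptions.
  import os
  import pwd

  TMP_DIR = "/tmp"            # area the problematic jobs were writing to
  THRESHOLD = 1024 ** 3       # flag files larger than 1 GB (assumed limit)

  def oversized_tmp_files(path=TMP_DIR, threshold=THRESHOLD):
      """Yield (owner, size_bytes, filename) for files exceeding the threshold."""
      for root, _dirs, files in os.walk(path):
          for name in files:
              full = os.path.join(root, name)
              try:
                  st = os.lstat(full)
              except OSError:
                  continue            # file vanished or is unreadable; skip it
              if st.st_size > threshold:
                  try:
                      owner = pwd.getpwuid(st.st_uid).pw_name
                  except KeyError:
                      owner = str(st.st_uid)
                  yield owner, st.st_size, full

  if __name__ == "__main__":
      for owner, size, path in oversized_tmp_files():
          print("%s %6.1f GB %s" % (owner, size / float(1024 ** 3), path))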

Current operational status and issues.

  • We have observed some packet loss on the main network link from the RAL site (not the route used by our data). The link has been monitored closely since the intervention yesterday (5th July) to see if the problem has been fixed (see the probe sketched after this list).
  • Issues still remain with LHCb staging files for the LHCbRawRDst service class. This has not been an operational problem during this reporting week because LHCb have not been using this service class so heavily.
  • The following points are unchanged from previous reports:
    • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seem to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated.
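
The packet-loss monitoring mentioned above could, in its simplest form, look like the sketch below: a periodic probe that runs the standard ping utility and parses the loss figure it reports. The target host, ping count and sampling interval are assumptions for illustration; the report does not describe the actual monitoring in place.

  #!/usr/bin/env python
  # Minimal sketch of a packet-loss probe: run ping periodically and record the
  # loss percentage it reports. Target host, count and interval are assumptions.
  import re
  import subprocess
  import time

  TARGET = "example.gridpp.rl.ac.uk"   # hypothetical host on the monitored link
  COUNT = 20                           # pings per sample
  INTERVAL = 300                       # seconds between samples

  def packet_loss(target=TARGET, count=COUNT):
      """Return the percentage packet loss reported by one ping run, or None."""
      out = subprocess.Popen(["ping", "-c", str(count), target],
                             stdout=subprocess.PIPE).communicate()[0]
      match = re.search(r"(\d+(?:\.\d+)?)% packet loss",
                        out.decode("ascii", "replace"))
      return float(match.group(1)) if match else None

  if __name__ == "__main__":
      while True:
          loss = packet_loss()
          print("%s loss=%s%%" % (time.strftime("%Y-%m-%d %H:%M:%S"), loss))
          time.sleep(INTERVAL)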

Declared in the GOC DB

  • Tuesday 12th July - Applying regular updates to lcgui02.

Advanced warning:

The following items are being discussed and are still to be formally scheduled and announced:

  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs (CE06 de-commissioning; gLite updates on CE09 outstanding).

Entries in GOC DB starting between 29th June and 6th July 2011.

There were no unscheduled entries in the GOCDB for this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (SRMs) & FTS. | SCHEDULED | OUTAGE | 05/07/2011 08:00 | 05/07/2011 12:00 | 4 hours | Castor 2.1.10-1 upgrade and Site Networking intervention.
All other service nodes (i.e. not CEs, Castor/SRM or FTS): lcgapel0676, lcgbdii, lcglb01, lcglb02, lcgrbp01, lcgui01, lcgui02, lcgvo-02-21, lcgvo-alice, lcgvo-s3-03, lcgvo-s3-04, lcgwms01, lcgwms02, lcgwms03, lfc-atlas, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk, site-bdii.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/07/2011 08:00 | 05/07/2011 10:47 | 2 hours and 47 minutes | Services unavailable during site networking intervention.
All CEs | SCHEDULED | OUTAGE | 04/07/2011 20:00 | 05/07/2011 12:00 | 16 hours | Batch drain and outage during network and Castor interventions.
lcgce07 | SCHEDULED | OUTAGE | 29/06/2011 15:30 | 04/07/2011 13:00 | 4 days, 21 hours and 30 minutes | Draining and VO re-configuration.
lcgce06 | SCHEDULED | OUTAGE | 29/06/2011 12:00 | 01/07/2011 14:45 | 2 days, 2 hours and 45 minutes | Drain and decommission as lcg-CE.