Production Team Report 2010-05-24


RAL Tier1 Production Team Report for 24th May 2010.

AoD This Week

  • Mon & Tues: John
  • Wed: Catalin
  • Thu & Fri: John

Last Week (17 May - 24 May)

  • Gareth: AoD (<1 day), Lots of meetings (disk server intervention review, UPS v. EMC), HEP SYSMAN planning
  • John: Worked on status display (mimi-pc), disk server blessing, some quattor training
  • Tiju: AoD (4 days), Worked on the Nagios replacement for the SAM tests, deployed a disk server, some quattor training

Changes to Operating Procedures

  • Changes to Disk intervention procedures. Notably:
    • When a disk server fails out of hours, the Primary On Call takes the server out of production (as per existing procedures). The VO is notified that this has taken place. (A template message to be produced, to include the service class; see the sketch after this list.) This notification is to be sent for all service classes.
    • If the server is down, the Fabric on-call is to be notified at the next convenient time, e.g. if this is overnight, notify the following morning (including weekends). They will check the system and, if necessary, attend on site to start the recovery process.
    • Once the Fabric team has intervened, the AoD/Primary On Call should send a note to the VO giving an estimated time to return the server to service, along with a list of files on the server. (The procedure to obtain this list is to be documented as part of this process.)
    • In all cases (i.e. all production service classes) the Fabric team will start the recovery using a disk taken from a spare pool, rather than waiting for the vendor to supply a replacement.
    • When a server will be unavailable for an extended period (for example during a long rebuild), it will be put into a ‘draining’ state. The draining is to be managed by us (Tier1) rather than by the users.
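
A minimal sketch of what the VO notification template might look like, written here as a small Python helper. The function name, field names, message wording and example values are illustrative assumptions only and are not part of the agreed procedure:

 #!/usr/bin/env python
 """Illustrative sketch of the out-of-hours disk server notification (all names hypothetical)."""
 from datetime import datetime

 def build_notification(server, service_class, vo, estimated_return=None, file_list=None):
     """Compose the note sent to the VO when a disk server is taken out of production."""
     lines = [
         "RAL Tier1 disk server intervention",
         "Server:        %s" % server,
         "Service class: %s" % service_class,
         "Affected VO:   %s" % vo,
         "Taken out of production at %s UTC" % datetime.utcnow().strftime("%Y-%m-%d %H:%M"),
     ]
     if estimated_return is not None:
         # Sent in the follow-up note once the Fabric team has given an estimate.
         lines.append("Estimated return to service: %s" % estimated_return)
     if file_list is not None:
         # The procedure for obtaining this list is still to be documented.
         lines.append("Files on the server: %s" % file_list)
     return "\n".join(lines)

 if __name__ == "__main__":
     # Example initial notification (server, service class and VO are made up).
     print(build_notification("gdss123", "lhcbDst", "LHCb"))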

Declared Outages in GOC DB

  • Castor Oracle database Sub-request clean-up: LHCb - Tuesday 25th May
  • CE08 - re-config for glexec - Monday 24th - Wednesday 26th May.
  • Updating Oracle PSU patch: OGMA (25th May), LUGH (27th May), SOMNUS (2nd June)
  • At Risk for UPS test - Tuesday 1st June.

Technical Stops

  • May 31 - June 2:
    • Tuesday 1st June: UPS test (site At Risk)
    • Wednesday 2nd June: Oracle quarterly patching (TBC)
  • June 28-30:
    • Monday 28th June - Transformer checks (Site At Risk at end of afternoon) - TX2
    • Monday 28th June - Transformer checks (Site At Risk for day) - TX3 or 4
  • July 26-28:
    • Transformer checks. (2 days - TX1 & TX3 or 4)

Advance Warning

  • Fabric:
    • Double the network link to the tape robot stack (stack 12), postponed from the last TS. (Requires Castor stop).
    • Swap out the older of the pair of SAN switches in the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
    • Multipath mods to stop errors. (Not yet sure of effect).
    • Microcode update for tape robot
    • Swap Solaris tape controllers (for robot) over (?)
    • New Atlas software server.
    • New kernels and glibc updates on non-castor Oracle RAC nodes. (Done for LUGH). (Added after meeting.)
    • Move one power unit for one EMC array unit behind the non-Castor databases to UPS power.
  • Database:
    • Quarterly Oracle patching
  • Grid Services:
    • CEs will be configured for glexec in rotation.
    • Stop SL4 batch service (August)
    • Add Quattorised BDII to the Top-BDII set. (Below threshold for technical stop).
    • Update to FTS 2.2.4. (Below threshold for technical stop).
    • glite 3.2 WMS (Below threshold for technical stop).
  • Castor:
    • Possible SRM update
  • Networks:
    • Commissioning OPN link