Production Team Report 2010-06-07

RAL Tier1 Production Team Report for 7th June 2010.

AoD This Week

Mon: John, Tue: Catalin, Wed: Gareth, Thu & Fri: John

Last 2 weeks (24 May - 4 June)

  • Gareth: AoD (1 day), 1 week leave, HEP SYSMAN planning, following up on EMC/UPS issues, dust in computer room.
  • John: AoD (4 days). Worked on status display (mini-PC) and on T1 dashboard.
  • Tiju: AoD (3 days). Worked on Nagios replacement for SAM tests.

Changes to Operating procedures

  • Restrictions on Access to HPD room:
    • No access for any visitors.
    • Access by staff to be kept to a minimum.
    • Anyone entering the room must wear a face mask, which can be obtained from Operations.
  • Changes to Disk intervention procedures. Notably:
    • When a disk server fails out of hours, the Primary On Call takes the server out of production (as per existing procedures). The VO is notified that this has taken place. (A template message is to be produced, to include the service class; an illustrative sketch follows after this list.) This notification is to be sent for all service classes.
    • If the server is down, Fabric On-Call is to be notified at the next convenient time, e.g. if this is overnight, notify the following morning (including weekends). They will check the system and, if necessary, attend on site to start the recovery process.
    • Once the Fabric team has intervened, the AoD/Primary On Call should send a note to the VO giving an estimated time to return the server to service, along with a list of files on the server. (The procedure to obtain this list is to be documented as part of this process.)
    • In all cases (i.e. all production service classes) Fabric team will start the recovery using a disk taken from a spare pool, rather than waiting for the vendor to supply a replacement.
    • When a server will be unavailable for an extended period (for example during a long rebuild), it will be put into a ‘draining’ state. The draining to be managed by us (Tier1) rather than the users.
  • Changes to LHC Technical Stops:
    • 4-day technical stop every 6 weeks. Next one on 19-22 July.
    • 1-day stop every 3 weeks for cryo maintenance (on Thursdays). Next one on 24th June.
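
The disk intervention procedure above calls for a template notification message, which is still to be produced. Purely as an illustration of what such a template might contain, here is a minimal Python sketch; the function name (build_notification), the field layout, and the example server name and service class are assumptions for illustration only, not the agreed template.

 #!/usr/bin/env python
 # Minimal sketch (not the agreed template) of the out-of-hours
 # notification sent to a VO when a disk server is taken out of
 # production. Field layout and names are illustrative assumptions.

 def build_notification(server, service_class, vo,
                        eta=None, file_list_url=None):
     """Format a notification for a disk server removed from production."""
     lines = [
         "Subject: RAL Tier1: disk server %s removed from production" % server,
         "",
         "Server:        %s" % server,
         "Service class: %s" % service_class,  # sent for all service classes
         "VO affected:   %s" % vo,
     ]
     if eta is not None:
         lines.append("Estimated return to service: %s" % eta)
     if file_list_url is not None:
         lines.append("List of files on the server: %s" % file_list_url)
     return "\n".join(lines)

 if __name__ == "__main__":
     # Hypothetical values: gdss123 and atlasStripInput are made up here.
     print(build_notification("gdss123", "atlasStripInput", "ATLAS",
                              eta="2010-06-09 12:00 UTC"))

The file_list_url field is a placeholder for the file list mentioned above; it can be filled in once the procedure for obtaining the list is documented.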

Declared Outages in GOC DB

  • At Risk for re-balancing LUGH (morning of Tuesday 8th June).

June 28-30

    • Transformer checks (Site At Risk). TX2 & TX3.

July 26-28

    • Transformer checks (2 days - TX1 & TX4). T.B.C.

Advanced Warning

  • Fabric:
    • Move one power unit for one EMC array unit behind the non-Castor databases to UPS power.
    • Double the network link to the tape robot stack (stack 12), postponed from the last TS. (Requires Castor stop).
    • Swap out the older of the pair of SAN switches in the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
    • Multipath mods to stop errors. (Not yet sure of effect).
    • Microcode update for tape robot.
    • Swap Solaris tape controllers (for robot) over (?).
    • New Atlas software server.
    • New kernels and glibc updates on non-castor Oracle RAC nodes. (Done for LUGH).
  • Database:
    • nothing
  • Grid Services:
    • Upgrade to FTS 2.2.4 (sometime from Tuesday 15th June; below threshold for technical stop).
    • Stop SL4 batch service (Start August)
    • Add Quattorised BDII to Top-BDII set. (Below threshold for technical stop.)
    • glite 3.2 WMS (Below threshold for technical stop).
  • Castor:
    • Possible SRM update
  • Networks:
    • Commissioning OPN link