RAL Tier1 weekly operations Grid 20090810

From GridPP Wiki

Summary of Previous Week

Developments

  • Catalin
  • Derek
    • Drafting detailed plan for SL5 migration
    • Finished quattorising torque server config
    • Started quattorising WN
    • Tested helpdesk database dump speed
  • Matt
    • SL4/SL5 Migration
      • Get final SL4/SL5 VO requirements
      • Test torque submit filter scripts (for directing jobs to nodes with sl4 or sl5 properties)
    • LFC:
      • ATLAS back-end separation planning (depends on timing information for DB cleanup, and final plans for folding in resilience upgrades)
    • FTS:
      • Document procedure to add domain to CMS cloud (added Ukraine to CERN cloud)
      • Document procedure to deal with site name changes (BNL will change soon)
    • Update gLite middleware on SL4 UI
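
As an illustration of the torque submit filter item under Matt above, the following is a minimal sketch of how such a filter might steer jobs to SL4 or SL5 nodes. It assumes the usual Torque behaviour in which qsub pipes the job script to the filter on stdin and uses the filter's stdout as the script to submit. The property names `sl4` and `sl5` come from the item above; the default of `sl4`, the function names, and the choice to rewrite only `#PBS -l nodes=...` directives are assumptions made for illustration, not the scripts actually under test. Resource requests given on the qsub command line, or written without a space after `-l`, are not handled.

```python
import sys

# Assumption: jobs that do not ask for an OS flavour go to SL4 nodes.
DEFAULT_PROPERTY = "sl4"
KNOWN_PROPERTIES = ("sl4", "sl5")


def tag_nodes_spec(spec, prop):
    """Append prop to each '+'-separated node request that names no OS property."""
    chunks = []
    for chunk in spec.split("+"):
        parts = chunk.split(":")
        if not any(p in KNOWN_PROPERTIES for p in parts):
            parts.append(prop)
        chunks.append(":".join(parts))
    return "+".join(chunks)


def filter_script(lines, prop=DEFAULT_PROPERTY):
    """Return the job script lines with an OS property on every nodes request."""
    out = []
    saw_nodes = False
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "#PBS" and "-l" in tokens:
            idx = tokens.index("-l")
            if idx + 1 < len(tokens):
                resources = tokens[idx + 1].split(",")
                for i, res in enumerate(resources):
                    if res.startswith("nodes="):
                        saw_nodes = True
                        spec = res[len("nodes="):]
                        resources[i] = "nodes=" + tag_nodes_spec(spec, prop)
                tokens[idx + 1] = ",".join(resources)
                line = " ".join(tokens)
        out.append(line)
    if not saw_nodes:
        # No nodes request at all: add one, keeping any shebang as the first line.
        insert_at = 1 if out and out[0].startswith("#!") else 0
        out.insert(insert_at, "#PBS -l nodes=1:%s" % prop)
    return out


def main():
    """Filter entry point: job script in on stdin, rewritten script out on stdout."""
    lines = sys.stdin.read().splitlines()
    sys.stdout.write("\n".join(filter_script(lines)) + "\n")
    return 0
```

Installed as a filter, the script would finish by calling `main()`. A job that already requests `nodes=1:sl5` passes through unchanged, while one that requests `nodes=2:ppn=4` has the default property appended.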

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
LFC connections hanging | 2009-08-07 (10:00) | 2009-08-07 (11:00) | ATLAS | High
WMS02 unavailable | 2009-08-08 (01:30) | 2009-08-10 (11:00) | LHC | Medium
lfc0448 - SMART errors detected | 2009-06-15 | Ongoing | ATLAS | Low
WMS01 rebooted | 2009-08-04 (10:45) | 2009-08-04 (11:05) | LHC | Low
helpdesk DB tables not backed up | 2009-07-01 | Ongoing | None | Low

Plans for Week(s) Ahead

Development Priorities

  • Catalin
  • Derek
    • Continue quattorising worker node
    • Document helpdesk installation procedure
  • Matt
    • LFC:
      • ATLAS front-end separation (DNS alias, GOCDB, IS changes)
    • WLCG accounting
    • Test deployment of gLite 3.2 (SL5) UI using Quattor

Resource Requests

Downtimes

Description | Start | End | Affected VO(s)
LFC ATLAS back-end separation | August 26 (08:00) | August 26 (13:00) | ATLAS, MINOS

Requirements and Blocking Issues

Description | Required By | Priority | Status
SL5 Worker Node Kickstart | | High | Post-kickstart configuration needed; not yet suitable for bulk deployment
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.
lfc0448 disk failures | | Low | Disk replacement needed

OnCall/AoD Cover

  • Primary OnCall
  • Grid OnCall: Derek (Matt covering Wednesday)
  • AoD: