RAL Tier1 weekly operations Grid 20090803

From GridPP Wiki

Summary of Previous Week

Developments

  • Catalin
    • Tuned WMS/LB servers
    • Prepared documentation about the LFC separation
    • Intervention on lcgmon01 (+ Kashif and JamesA)
  • Derek
    • Finished quattorising torque server config
    • Started quattorising WN
    • Tested helpdesk database dump speed
  • Matt
    • PPS/CASTOR Pre-Prod interviews (Tuesday)
    • Updated SL4/SL5 migration plan (distributed to VOs)
    • LFC ATLAS plan finalised (with Catalin)
    • Checked Derek's quattor-generated torque configuration
    • Tracked down CMS out-of-memory (OoM) jobs (Derek banned the user at CE level)

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
lfc0448 - SMART errors detected | 2009-06-15 | Ongoing | All | Medium
helpdesk DB tables not backed up | 2009-07-01 | Ongoing | None | Low
lcgmon01 - SMART errors detected | 2009-07-18 | 2009-07-30 | None | Medium

Plans for Week(s) Ahead

Development Priorities

  • Catalin
  • Derek
    • Continue quattorising worker node
    • Document helpdesk installation procedure
  • Matt
    • SL4/SL5 Migration
      • Get final SL4/SL5 VO requirements
      • Test torque submit filter scripts (for directing jobs to nodes with sl4 or sl5 properties)
    • LFC:
      • ATLAS front-end separation (DNS alias, GOCDB, IS changes)
      • ATLAS back-end separation planning (depends on timing information for DB cleanup, and final plans for folding in resilience upgrades)
    • FTS:
      • Document procedure to add domain to CMS cloud (added Ukraine to CERN cloud)
      • Document procedure to deal with site name changes (BNL will change soon)
    • WLCG accounting
    • Test deployment of gLite 3.2 (SL5) UI using Quattor
    • Update gLite middleware on SL4 UI
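
A minimal sketch of the kind of submit filter mentioned under the SL4/SL5 migration item, written in Python for illustration. It assumes TORQUE's standard submit filter interface (job script on stdin, possibly modified script on stdout, non-zero exit rejects the job). The node property names `sl4` and `sl5`, the default of `sl4`, and the helper names are assumptions made here, not the scripts actually under test. Only `#PBS -l` directives inside the job script are handled; resource requests given on the qsub command line are ignored.

```python
import sys

# Assumed node property names; the report only says "sl4 or sl5 properties".
KNOWN_PROPERTIES = ("sl4", "sl5")
DEFAULT_PROPERTY = "sl4"  # hypothetical default for jobs that do not choose


def _tag_nodes(resource_list, prop):
    """Append prop to every node spec in a '-l' value that lacks an OS property."""
    parts = resource_list.split(",")
    for i, part in enumerate(parts):
        if part.startswith("nodes="):
            specs = part[len("nodes="):].split("+")
            specs = [
                s if any(p in s.split(":") for p in KNOWN_PROPERTIES)
                else s + ":" + prop
                for s in specs
            ]
            parts[i] = "nodes=" + "+".join(specs)
    return ",".join(parts)


def add_os_property(script_lines, default=DEFAULT_PROPERTY):
    """Return the job script with an OS property on its nodes request."""
    out = []
    seen_nodes = False
    for line in script_lines:
        tokens = line.split()
        if tokens and tokens[0] == "#PBS":
            changed = False
            for i in range(1, len(tokens) - 1):
                value = tokens[i + 1]
                has_nodes = any(p.startswith("nodes=") for p in value.split(","))
                if tokens[i] == "-l" and has_nodes:
                    seen_nodes = True
                    new_value = _tag_nodes(value, default)
                    if new_value != value:
                        tokens[i + 1] = new_value
                        changed = True
            if changed:
                line = " ".join(tokens) + "\n"
        out.append(line)
    if not seen_nodes:
        # No nodes request in the script: add one, keeping any shebang first.
        pos = 1 if out and out[0].startswith("#!") else 0
        out.insert(pos, "#PBS -l nodes=1:%s\n" % default)
    return out


def main():
    """Filter entry point: read the job script from stdin, write it to stdout."""
    sys.stdout.writelines(add_os_property(sys.stdin.readlines()))
    return 0  # a non-zero exit status would reject the submission
```

To try it as a real filter, `main()` would be called under the usual `__main__` guard. For example, a script containing `#PBS -l nodes=2:ppn=4,walltime=1:00:00` would come out with `nodes=2:ppn=4:sl4`, while a script that already requests `sl5` is passed through unchanged.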

Resource Requests

Downtimes

Description | Start | End | Affected VO(s)
LFC ATLAS front-end separation | August 3 | August 5 | ATLAS
LFC ATLAS back-end separation | August 26 (TBC) | August 26 (TBC) | All

Requirements and Blocking Issues

Description | Required By | Priority | Status
SL5 Worker Node Kickstart | | High | Post-kickstart configuration needed; not yet suitable for bulk deployment
lfc0448 disk failures | | Medium | Disk replacement needed
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.

OnCall/AoD Cover

  • Primary OnCall
  • Grid OnCall: Matt
  • AoD: reschedule Wednesday