RAL Tier1 weekly operations Grid 20090921

From GridPP Wiki

Summary of Previous Week

Developments

  • Andrew
    • Added my email address back to gmetric-eff.pl on csflnx353; made a new RPM and spec file
    • Updated to Frontier-squid 4.0rc8 on CMS VOBOXs; made minor adjustments to documentation
    • Looked at LHCb and ATLAS CPU efficiencies for August
    • Developing ganglia monitoring scripts for LFC (in progress)
    • Wrote draft job plans document for APR
  • Catalin
    • Applied a workaround for workload_manager malloc issues on WMS03
    • Ran LB kickstart tests on new HW (yaim config issues)
    • Contacted ALICE about their SW area and SL5 VOBOX
  • Derek
    • SL5 Migration
  • Matt
    • Worked with Andrew L and Richard on job plans
    • Reviewed progress of disk deployment testing
    • Tested the SL5 batch system
    • Discussed disaster recovery planning with Andrew and Matt V
  • Richard
    • Put version 1.0 of a Grid Services dashboard into production within the RT helpdesk system
    • Developed further Perl scripts that provide custom helpdesk ticket reports and placed them into production. The scripts are now in use by the Grid, Production and CASTOR teams.
    • Continued work on using iptables to throttle excessive connection attempts to BDII servers
    • Developed faster methods of logfile analysis to help with BDII logs

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity

Plans for Week(s) Ahead

Development Priorities

  • Andrew
    • Update maui.conf with latest CPU allocations using Quattor
    • Complete work on LFC monitoring
    • Continue to develop a detailed understanding of CMS computing model, data flows and production jobs
  • Catalin
    • Make LB02 hotswappable (implies re-kickstart)
    • Work on ALICE SW worker node and SL5 VOBOX issues
    • Get acquainted with Quattor
    • Put WMS02 into draining mode
  • Derek
    • Investigate publishing appropriate HEP-SPEC value in information system
    • Update documentation
    • Metrics report
  • Matt
    • Disaster recovery planning
    • Review Grid Services documentation
  • Richard
    • Investigating BDII
    • Investigating Quattor

Resource Requests

Downtimes

Description               | Hosts    | Type                | Start          | End            | Affected VO(s)
FTS drain of RAL channels | lcgfts01 | Unscheduled At Risk | Sep 15 (08:00) | Sep 15 (13:00) | All
LB02 hotswappable         | lcglb02  | Scheduled Outage    | Sep 21 (09:00) | Sep 21 (16:00) | All
WMS02 hotswappable        | lcgwms02 | Scheduled Outage    | Sep 22 (16:00) | Sep 30 (17:00) | LHC

Requirements and Blocking Issues

Description                 | Required By | Priority | Status
Non-capacity HW for testing |             | Medium   | Still using the old HW
Hardware for PPS            |             | Medium   | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon - Sun)
  • Grid OnCall:
  • AoD: