RAL Tier1 weekly operations Grid 20090914

From GridPP Wiki

Summary of Previous Week

Developments

  • Andrew
    • Completed UB report
    • Adjusted settings on CLOUDCMSUS-RALLCG2 FTS channel
    • Adjusted the job-tracing Perl script so that it works on all CEs
    • Corrected the pbslogs2mysql script so that it runs successfully on lcgbatch01
    • Developed Ganglia monitoring scripts for LFC (in progress)
  • Catalin
    • glite-WMS upgrade on lcgwms01
    • Investigated workload_manager issue on WMSes
    • glite-VOBOX upgrade on Alice
    • GridPP
  • Derek
    • SL5 Migration
  • Matt
    • SL5: Maui configuration for VO software installation jobs
    • Reinstalled FTS2.2 endpoint and tested Group (cloud) functionality
    • Provided feedback to IN2P3 about RAL batch system
    • Audited hotswap configuration for service nodes using software RAID
    • Modified PBS nodes Nagios check to take into account WNs in downtime
  • Richard
    • Put into production version 1.0 of a Grid Services dashboard within the RT helpdesk system
    • Developed further Perl scripts for custom helpdesk ticket reports and placed them into production; the scripts are now in use by the Grid, Production and CASTOR teams
    • Continued work on using iptables to throttle excessive connection attempts to BDII servers
    • Developed faster methods for logfile analysis to help with BDII logs

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity

Plans for Week(s) Ahead

Development Priorities

  • Andrew
    • Continue work on LFC monitoring
    • Investigate Atlas and LHCb efficiencies for August
    • Update squid
    • Start to understand in detail the CMS computing model
    • Start to understand the scheduling policies on the batch farm
  • Catalin
    • Apply workaround for workload_manager on WMS servers
    • Alice SW worker node
    • LB02 draining mode
  • Derek
    • Investigate publishing appropriate HEP-SPEC value in information system
    • Incorporate changes to YAIM config from SL5 migration
    • Update documentation to reflect new CEs
    • Metrics report
  • Matt
    • Disaster recovery planning
    • Review progress of disk deployment testing
    • Review Grid Services documentation
  • Richard
    • Investigate BDII
    • Investigate Quattor

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
SL5 migration | lcgce07 | Scheduled Outage | Sep 14 (10:30) | Sep 16 (12:00) | LHC VOs
SL5 migration | lcgce01, lcgce02 | Scheduled Outage | Sep 14 (10:30) | Sep 18 (12:00) | ALICE, non-LHC
FTS drain of RAL channels | lcgfts01 | Unscheduled At Risk | Sep 15 (08:00) | Sep 15 (13:00) | All
LB02 hotswappable | lcglb02 | Scheduled Outage | Sep 21 (09:00) | Sep 21 (16:00) | All

Requirements and Blocking Issues

Description | Required By | Priority | Status
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon)
  • Grid OnCall: Matt (Tue-Sun)
  • AoD: Catalin (Wed)