RAL Tier1 weekly operations Grid 20090928

From GridPP Wiki

Summary of Previous Week

Developments

  • Andrew
    • Investigated why efficiencies are frequently > 100% since the SL5 migration
    • Obtained access to svn repository on quattor01
    • Added pbslogs2mysql script to Quattor and attempted to deploy
    • Modified pbslogs2mysql script to publish times in UT rather than local time for consistency with APEL
    • Modified the maui configuration in Quattor so that CPU allocations can be inserted directly from the User Board schedule; updated it for the October allocations (not yet deployed); updated documentation
    • Continued work on LFC ganglia monitoring
    • Investigating how CMS jobs work
  • Catalin
    • re-installed LB02 and made it hotswappable, updated documentation
    • made ALICE SW node available
    • drained WMS02
  • Derek
    • Incorporated changes from SL5 migration into CE documentation
    • Released new version of yaim-config rpm with updates from SL5 migration and configuration for new VOs
    • Enabled supernemo on SL4 farm
    • Enabled climate-g on ce.ngs
    • Fixed SAM tests not running on ce.ngs after reconfiguration (caused by an account mapping issue)
    • Added vendor reference custom file to fabric hardware queue
    • Fixed the helpdesk mail loop and documented the fix
    • Completed metrics report
    • Kickstarted helpdesk dev box
  • Matt
    • Reviewed progress of disk deployment testing
    • Reviewed Grid Services installation/recovery documentation
    • Helped investigate batch system instabilities
  • Richard
    • Put into production version 1.0 of a Grid Services dashboard within the RT helpdesk system
    • Developed further Perl scripts for providing custom helpdesk ticket reports and placed these into production. Scripts now in use by Grid team, Production team and CASTOR team.
    • Continued work on using iptables to throttle excessive connection attempts to BDII servers
    • Developed faster methods for logfile analysis to help with BDII logs.
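
As an illustration of the batch accounting items above (CPU efficiencies over 100% and publishing times in UT), here is a minimal sketch. It assumes Torque/PBS-style accounting records in which "E" (job end) lines carry an epoch `end=` field plus `resources_used.cput` and `resources_used.walltime` in HH:MM:SS form. The function names and the sample record are hypothetical, and this is not the pbslogs2mysql script itself.

```python
from datetime import datetime, timezone


def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line):
    """Parse one job-end record; return None for other record types.

    Assumed layout: 'timestamp;type;job_id;key=value key=value ...'
    """
    stamp, rec_type, job_id, attrs = line.strip().split(";", 3)
    if rec_type != "E":
        return None
    fields = dict(kv.split("=", 1) for kv in attrs.split() if "=" in kv)
    cput = hms_to_seconds(fields["resources_used.cput"])
    wall = hms_to_seconds(fields["resources_used.walltime"])
    return {
        "job_id": job_id,
        # The epoch field is timezone-independent, so converting it gives
        # UT directly, without relying on the local-time stamp in the log.
        "end_utc": datetime.fromtimestamp(int(fields["end"]), tz=timezone.utc),
        # CPU efficiency = CPU time / wall time; None if wall time is zero.
        "efficiency": cput / wall if wall else None,
    }


def over_100_percent(records):
    """Return the parsed records whose efficiency exceeds 100%."""
    return [r for r in records if r and r["efficiency"] and r["efficiency"] > 1.0]


if __name__ == "__main__":
    # Hypothetical sample record, for illustration only.
    sample = (
        "09/25/2009 09:30:00;E;1234.example;user=someuser "
        "start=1253863800 end=1253867400 "
        "resources_used.cput=01:10:00 resources_used.walltime=01:00:00"
    )
    rec = parse_end_record(sample)
    print(rec["job_id"], rec["end_utc"].isoformat(), round(rec["efficiency"], 3))
```

Flagging records this way only identifies the affected jobs; it does not by itself explain why the ratio exceeds 100%.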

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
lcgce07 partition failure | 18/09/09 | | none yet (potentially alice, cms, lhcb lose resilience) | medium

Plans for Week(s) Ahead

Development Priorities

  • Andrew
    • Complete successful deployment of pbslogs2mysql using Quattor
    • Move the CPU efficiencies gmetric script to lcgbatch01 (using Quattor); update ganglia scripts as appropriate
    • Check that nagios efficiencies monitoring script is working successfully
    • Deploy updated maui configuration for October CPU allocations
    • Begin developing complete documentation for adding a new VO
    • Review CMS VOBOX SLAs
  • Catalin
    • make WMS02 hotswappable (implies re-kickstart)
    • work on ALICE SL5 VOBOX (possibly)
    • Quattor training
    • Castor training
  • Derek
    • n/a
  • Matt
    • Disaster recovery planning
    • Generate disk deployment requests for Q4/09 allocations.
  • Richard
    • Investigating BDII
    • Investigating Quattor

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
WMS02 hotswappable | lcgwms02 | Scheduled Outage | Sep 22 (16:00) | Sep 30 (17:00) | LHC
Batch system unstable | CEs | Unscheduled At Risk | Sep 25 (08:30) | Sep 25 (11:00) | All
Oracle ASM patching | FTS, FTM, LFCs | Scheduled At Risk | Oct 01 (13:30) | Oct 01 (16:30) | All

Requirements and Blocking Issues

Description | Required By | Priority | Status
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Catalin (Mon-Wed), Matt (Thu-Sun)
  • AoD: