RAL Tier1 weekly operations Grid 20090928
From GridPP Wiki
Summary of Previous Week
Developments
- Andrew
- Investigated why efficiencies are frequently > 100% since the SL5 migration
- Obtained access to svn repository on quattor01
- Added pbslogs2mysql script to Quattor and attempted to deploy
- Modified pbslogs2mysql script to publish times in UT rather than local time for consistency with APEL
- Modified the maui configuration in Quattor so that CPU allocations can be inserted directly from the User Board schedule; updated it for the October allocations (not yet deployed); updated documentation
- Continued work on LFC ganglia monitoring
- Investigating how CMS jobs work
- Catalin
- Re-installed LB02 and made it hotswappable, updated documentation
- Made ALICE SW node available
- Drained WMS02
- Derek
- Incorporated changes from SL5 migration into CE documentation
- Released new version of yaim-config rpm with updates from SL5 migration and configuration for new vos
- Enabled supernemo on SL4 farm
- Enabled climate-g on ce.ngs
- Fixed SAM tests not running on ce.ngs after reconfiguration - was due to account mapping issue
- Added vendor reference custom file to fabric hardware queue
- Fixed and documented fix for helpdesk mail loop
- Completed metrics report
- Kickstarted helpdesk dev box
- Matt
- Reviewed progress of disk deployment testing
- Reviewed Grid Services installation/recovery documentation
- Helped investigate batch system instabilities
- Richard
- Put into production version 1.0 of a Grid Services dashboard within the RT helpdesk system
- Developed further Perl scripts for providing custom helpdesk ticket reports and placed these into production. Scripts now in use by Grid team, Production team and CASTOR team.
- Continued work on using IPTABLES to throttle excessive connection attempts to BDII servers
- Developed faster methods for logfile analysis to help with BDII logs.
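As an illustration of the UT publishing change to pbslogs2mysql noted above, the following is a minimal sketch only, not the script itself. It assumes job timestamps are available as seconds since the Unix epoch; the function name `epoch_to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> str:
    """Format an epoch timestamp as a UTC string.

    Assumption: the input is seconds since the Unix epoch. Formatting in
    UTC keeps published times independent of the host's local time zone
    and of daylight-saving changes.
    """
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Hypothetical job end time, for illustration only.
    print(epoch_to_utc(1253867400))
```

Passing an explicit `tz=timezone.utc` avoids relying on the machine's configured time zone, which is what makes local-time output inconsistent across hosts and across the summer/winter time switch.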
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
lcgce07 partition failure | 18/09/09 | | none yet (potentially alice, cms, lhcb lose resilience) | medium |
Plans for Week(s) Ahead
Development Priorities
- Andrew
- Complete successful deployment of pbslogs2mysql using Quattor
- Move the CPU efficiencies gmetric script to lcgbatch01 (using Quattor); update ganglia scripts as appropriate
- Check that nagios efficiencies monitoring script is working successfully
- Deploy updated maui configuration for October CPU allocations
- Begin developing complete documentation for adding a new VO
- Review CMS VOBOX SLAs
- Catalin
- make WMS02 hotswappable (implies re-kickstart)
- work on ALICE SL5 VOBOX (possibly)
- Quattor training
- Castor training
- Derek
- n/a
- Matt
- Disaster recovery planning
- Generate disk deployment requests for Q4/09 allocations.
- Richard
- Investigating BDII
- Investigating Quattor
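For the CPU efficiency monitoring items above, a minimal sketch of the usual efficiency ratio (CPU time divided by wall-clock time, scaled by the number of cores) may be useful when checking for values above 100%. This is not the gmetric or nagios script; the function names and the `(job_id, cpu_seconds, wall_seconds)` record layout are assumptions made for illustration.

```python
def cpu_efficiency(cpu_seconds: float, wall_seconds: float, cores: int = 1) -> float:
    """Return CPU efficiency as a percentage of the available core time."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return 100.0 * cpu_seconds / (wall_seconds * cores)


def flag_over_100(jobs):
    """Return the IDs of jobs whose single-core efficiency exceeds 100%.

    `jobs` is an iterable of (job_id, cpu_seconds, wall_seconds) tuples;
    this layout is hypothetical.
    """
    return [job_id for job_id, cpu, wall in jobs if cpu_efficiency(cpu, wall) > 100.0]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [("job-a", 1800, 3600), ("job-b", 4000, 3600)]
    print(flag_over_100(sample))
```

Listing the individual jobs that exceed 100% (rather than only an average) makes it easier to see whether the anomaly is confined to particular job types or nodes.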
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
WMS02 hotswappable | lcgwms02 | Scheduled Outage | Sep 22 (16:00) | Sep 30 (17:00) | LHC |
Batch system unstable | CEs | Unscheduled At Risk | Sep 25 (08:30) | Sep 25 (11:00) | All |
Oracle ASM patching | FTS, FTM, LFCs | Scheduled At Risk | Oct 01 (13:30) | Oct 01 (16:30) | All |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Catalin (Mon-Wed), Matt (Thu-Sun)
- AoD: