RAL Tier1 weekly operations Grid 20090914
From GridPP Wiki
Revision as of 15:15, 18 September 2009 by Derek ross
Summary of Previous Week
Developments
- Andrew
- Completed UB report
- Adjusted settings on CLOUDCMSUS-RALLCG2 FTS channel
- Adjusted the job-tracing Perl script so that it works on all CEs
- Corrected the pbslogs2mysql script so that it runs successfully on lcgbatch01
- Developed ganglia monitoring scripts for LFC (in progress)
- Catalin
- glite-WMS upgrade on lcgwms01
- Investigated workload_manager issue on WMSes
- glite-VOBOX upgrade on ALICE
- GridPP
- Derek
- SL5 Migration
- Matt
- SL5: Maui configuration for VO s/w installation jobs;
- Reinstalled FTS2.2 endpoint and tested Group (cloud) functionality;
- Provided feedback to IN2P3 about RAL batch system;
- Audited hot-swap configuration for service nodes using software RAID;
- Modified PBS nodes Nagios check to take into account WNs in downtime.
- Richard
- Put version 1.0 of a Grid Services dashboard into production within the RT helpdesk system
- Developed further Perl scripts for providing custom helpdesk ticket reports and placed these into production. Scripts now in use by Grid team, Production team and CASTOR team.
- Continued work on using iptables to throttle excessive connection attempts to BDII servers
- Developed faster methods for logfile analysis to help with BDII logs.
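As an illustration of the PBS nodes Nagios check change listed under Matt above, the idea of ignoring worker nodes that are in downtime might be sketched as follows. This is a minimal sketch and not the script in use at RAL: the host names, the thresholds and the way the downtime list is obtained are all assumptions, and the input is assumed to look like Torque `pbsnodes -a` output.

```python
# Hypothetical sketch of a Nagios-style check that ignores WNs in downtime.
# Host names, thresholds and the downtime list are illustrative assumptions.

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumed sample of `pbsnodes -a`-style output (made-up host names).
SAMPLE = """\
wn001
     state = free
     np = 8

wn002
     state = down,offline
     np = 8

wn003
     state = down
     np = 8
"""


def parse_pbsnodes(text):
    """Return {node_name: set_of_states} parsed from pbsnodes-style text."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current node record
            continue
        if not line[0].isspace():
            current = line.strip()  # an unindented line starts a new node
            nodes[current] = set()
        elif current is not None:
            key, _, value = line.strip().partition(" = ")
            if key == "state":
                nodes[current] = set(value.split(","))
    return nodes


def check_nodes(nodes, in_downtime, warn=1, crit=3):
    """Count down/offline nodes, skipping those declared in downtime."""
    bad = sorted(
        name for name, states in nodes.items()
        if states & {"down", "offline"} and name not in in_downtime
    )
    if len(bad) >= crit:
        status = CRITICAL
    elif len(bad) >= warn:
        status = WARNING
    else:
        status = OK
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    message = "%s: %d unexpected down/offline WN(s)" % (label, len(bad))
    if bad:
        message += ": " + ", ".join(bad)
    return status, message
```

In use, a wrapper would feed the real `pbsnodes -a` output and the current downtime list to `check_nodes`, print the message and exit with the returned status code.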
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
Plans for Week(s) Ahead
Development Priorities
- Andrew
- Continue work on LFC monitoring
- Investigate ATLAS and LHCb efficiencies for August
- Update squid
- Start to understand in detail the CMS computing model
- Start to understand the scheduling policies on the batch farm
- Catalin
- Apply workaround for workload_manager on WMS servers
- ALICE SW worker node
- LB02 draining mode
- Derek
- Investigate publishing appropriate HEP-SPEC value in information system
- Incorporate changes to yaim config from SL5 migration
- Update documentation to reflect new CEs
- Metrics report
- Matt
- Disaster recovery planning
- Review progress of disk deployment testing
- Review Grid Services documentation
- Richard
- Investigating BDII
- Investigating Quattor
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
SL5 migration | lcgce07 | Scheduled Outage | Sep 14 (10:30) | Sep 16 (12:00) | LHC VOs |
SL5 migration | lcgce01, lcgce02 | Scheduled Outage | Sep 14 (10:30) | Sep 18 (12:00) | ALICE, non-LHC |
FTS drain of RAL channels | lcgfts01 | Unscheduled At Risk | Sep 15 (08:00) | Sep 15 (13:00) | All |
LB02 hotswappable | lcglb02 | Scheduled Outage | Sep 21 (09:00) | Sep 21 (16:00) | All |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon)
- Grid OnCall: Matt (Tue-Sun)
- AoD: Catalin (Wed)