RAL Tier1 weekly operations Grid 20091012
From GridPP Wiki
Revision as of 13:00, 12 October 2009 by Matt hodges (Talk | contribs)
Summary of Previous Week
Developments
- Alastair
- Induction
- Andrew
- updated the September pbsjobs MySQL table with data created using UT timestamps; removed entries from 25 problem worker nodes
- UB schedule for September
- wrote Nagios script for checking PhEDEx watchdog agents
- restored efficiencies in Ganglia
- Training: CASTOR, library, display screen equipment
- Catalin
- involved in reconfiguring the LFC frontends
- sorted out issues with Alice SW area
- prepared Alice SL5 VOBOX deployments
- Derek
- n/a
- Matt
- Reconfigured FTS/LFC frontends.
- Progressed disk deployment.
- Reviewed Grid Services installation/recovery documentation.
- Rescaled KSI2K batch capacity from HEP-SPEC06 ratings.
- Richard
- n/a
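The KSI2K rescaling noted under Matt's items can be illustrated with a minimal sketch. It assumes the WLCG-agreed conversion of 4 HEP-SPEC06 per kSI2K; the report does not state which factor was actually applied at RAL, and the node counts and per-node ratings below are hypothetical values chosen only for illustration.

```python
# Assumption: WLCG-agreed conversion factor of 4 HEP-SPEC06 per kSI2K.
# The report itself does not say which factor was used.
HS06_PER_KSI2K = 4.0


def hs06_to_ksi2k(hs06):
    """Convert a HEP-SPEC06 rating to kSI2K."""
    return hs06 / HS06_PER_KSI2K


def batch_capacity_ksi2k(nodes):
    """Total batch capacity in kSI2K.

    nodes: iterable of (node_count, hs06_per_node) pairs.
    """
    total_hs06 = sum(count * rating for count, rating in nodes)
    return hs06_to_ksi2k(total_hs06)


# Hypothetical worker-node generations: (number of nodes, HS06 per node).
example_nodes = [(100, 60.0), (50, 80.0)]
print(batch_capacity_ksi2k(example_nodes))
```

Summing in HEP-SPEC06 first and converting once keeps the conversion factor in a single place, so it is easy to change if a different factor applies.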
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Failure of FTS/LFC RAC hardware | 2009-10-06 | Ongoing | LHC for FTS; ATLAS + non-LHC for LFC | Critical | [2009-10-09] FTS/LFC running on alternate hardware. |
Failure of 3D RAC hardware | 2009-10-06 | Ongoing | ATLAS, LHCb | High | [2009-10-09] Services not yet restored. |
Plans for Week(s) Ahead
Plans
- Alastair
- ATLAS jamboree
- Job plan
- Andrew
- Complete September UB schedule
- Testing CMS skimming without LazyDownload
- Delete some CMS "dark" data
- Training: Oracle self-service, fire equipment
- Catalin
- ready to deploy Squid and FronTier for ATLAS (pending HW provisioning)
- finalise SL5 VOBOX arrangements for Alice
- attend Oracle Self Service Training course
- Derek
- Matt
- Oracle SSC training
- GDB
- Meetings to discuss publishing of HEP-SPEC06
- Disaster recovery planning
- Richard
- n/a
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Failure of FTS/LFC RAC hardware | FTS, FTM, LFC (ATLAS, non-LHC) | Unscheduled Outage | Oct 06 (12:25) | Oct 07 (16:00) | LHC for FTS; ATLAS + non-LHC for LFC |
Failure of 3D RAC hardware | LFC (LHCb) | Unscheduled Outage | Oct 06 (12:25) | Ongoing | ATLAS, LHCb |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
HW for Squid deployment | ATLAS | High | request made via RT Fabric queue |
HW for FronTier deployment | ATLAS | Medium | request made via RT Fabric queue |
HW for SL5 64-bit VOBOX | Alice | Medium | request made via RT Fabric queue |
Non-capacity HW for testing | | Medium | Still using the old HW |
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
Hardware for testing LFC/FTS resilience | | Medium | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Thu)
- Grid OnCall:
- AoD: Catalin (Wed)