RAL Tier1 weekly operations Grid 20091123
Latest revision as of 15:51, 23 November 2009
Summary of Previous Week
Developments
- Alastair
  - Away
- Andrew
  - CMS PhEDEx ganglia monitoring
  - FTS channels: adjustments to STAR-UKILT2BRUNEL, STAR-UKILT2ICHEP, STAR-UKISTHGRIDRALPP for CMS
  - Updated kernel on csflnx414
  - Deleted old CMS files from /store/unmerged
  - Completed automatic generation of UB schedule CPU & disk emails
  - Started work on CMS computing model spreadsheet
  - Training: attended Nagios training session
  - Out sick 1 day
- Catalin
  - **no** progress on remaining SL5 VOBOXes
  - started work on backup, recovery (machines audit)
  - dealt with FronTier following java update
  - sorted out the WMS ICE issue
- Derek
  - Metric report
  - Testbed proposal
  - Adding SL53 i386 to quattor for dev helpdesk
- Matt
  - Kernel updates for FTS/MyProxy
  - Caching CIP provider script (not deployed)
  - Disaster recovery planning
  - Backup/recovery planning
  - Checked batch system for signs of SL4/SL5 crosstalk, and other job allocation problems; appears clean since restart of pbs_server daemon
- Richard
  - CASTOR activities: Finished the new structure for the family of pre-production Quattor templates
  - Built a 32-bit version of a BDII server and updated template to place log files etc. in the RAL-preferred location
  - Took 2 RT tickets on BDII server configs
- Mayo
  - Annual leave Monday and Tuesday
  - Worked on new metrics system
  - Exported data from new metrics gathering system to enable Derek to produce the monthly report
  - Worked on automating tape robot spreadsheet project
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
WMS Jobdirs full | Wed 18 Nov | Thu 19 Nov | All | Medium | Resolved |
FroNTier crash | Wed 11 Nov | Fri 20 Nov | ATLAS | Low | Resolved |
Plans for Week(s) Ahead
Plans
- Alastair
  - Away
- Andrew
  - CMS computing model spreadsheet
  - t2k to t2k.org VO name change
- Catalin
  - start deployment on 2nd ALICE SL5 VOBOX (HW made available on Monday)
  - ready to start deployment on LHCb SL5 VOBOX (waiting for "Quattor ready to go")
  - implement Nagios checks for FronTier
  - continue working on systems audit (backup, recovery)
- Derek
  - Test SCAS
  - Fix problems with CE information system
  - Working on helpdesk end-to-end restore
- Matt
- Richard
  - CASTOR activities: Working with CK and d/b folk to be able to script database setup for new pre-prod instance; also looking at using custom ncm- components for configuration
  - Building and testing a 64-bit version of BDII server
- Mayo
  - Implement feedback into second version of metrics gathering system in preparation for November Metrics
  - Continue working on automated spreadsheet project
  - Continue working on importing Nagios alarm data into svn
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
LHCb SL5 64bit VOBOX deployment using Quattor | 25 Nov 2009 | Medium | HW allocated but Quattor recipe not yet available (RT#53392) |
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
Hardware for Grid Services testbed | | Medium | |
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Thu)
- Grid OnCall:
- AoD: