RAL Tier1 weekly operations Grid 20091109
From GridPP Wiki
Summary of Previous Week
Developments
- Alastair
- Started working on ATLAS code to test different permissions at the Tier 1.
- Updated User Board CPU allocations.
- Continued work producing list of experiment requirements.
- Helped Brian with the disk server problems on gdss403.
- Andrew
- FTS channels: changed STAR-UKILT2ICHEP from SRMCOPY to URLCOPY; drained channels to/from RAL tier-2
- Deployed csflnx414 as a CMS VOBOX for testing
- Installed ProdAgent on csflnx414; learning about ProdAgent; submitted a test workflow to RAL
- Continued development of automated generation of UB schedule spreadsheet
- Testing of CMS skimming jobs
- Attended OPB "Managing data with robots"
- Catalin
- re-installed WMS03 and made it hotswappable
- completed FronTier for ATLAS installation/configuration
- progress with SL5 VOBOX for Alice
- various glite and kernel upgrades
- Derek
- Incorporated comments on Quattor status talk
- Attending Quattor workshop
- Set lcgce01 to Production Status
- Matt
- Backed out 64-bit torque/maui on scheduler
- Metrics feedback to Mayo
- Tested footprints helpdesk for tracking Grid Service issues; provided feedback to Gareth
- Richard
- Updated Job Plan
- CASTOR activities: Built a complete set of Quattor templates for the 4 machines in new pre-prod instance (and exposed a couple of bugs in Quattor in the process!)
- Mayo
- Collected some feedback on the new Metric Gathering System prototype
- Created report view for new Metric Gathering System
- Began writing a script to extract data from the Nagios-alarm-response-Grid spreadsheet for importing into svn
- Worked on a script for automating data collection of tape robot statistics into a spreadsheet
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
PBS service failed to restart | 2009-11-05 13:15 | 2009-11-05 13:50 | All | Minor | Resolved by rolling back to 32-bit torque/maui |
Plans for Week(s) Ahead
Plans
- Alastair
- Continue with any remaining training/tutorials
- Test prod/poweruser0/user permissions at the Tier 1.
- Produce first draft of experiment requirements by Wednesday.
- Work on more efficient code for testing checksums on a disk server.
- Andrew
- Deploy gdss383 to cmsFarmRead
- Continue work on automated generation of UB Schedule spreadsheet
- Add MySQL client, Spreadsheet::WriteExcel, Spreadsheet::ParseExcel to lcgui02
- Continue learning about ProdAgent
- Continue investigating ReadAhead and LazyDownload on CMS skimming jobs
- R89 machine room training
- Catalin
- finish deployment of SL5 VOBOX for Alice
- deploy 2nd ALICE VOBOX (see Fabric helpdesk request)
- kernel upgrades
- Derek
- Investigate/deploy SCAS
- Matt
- Disaster recovery planning
- Richard
- CASTOR activities: Add support for castor config files into Quattor templates
- Apply the recent quattor experience to completing quattor config/build for BDII servers
- Mayo
- Collect more feedback on prototype system
- Work on additional functionality for future releases of the Metric System
- Continue work on script for extracting data from Nagios-alarm-response-Grid spreadsheet for importing into svn
- Continue work on script for automating tape robot spreadsheet
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
kernel upgrades | all CEs | at risk | Wed 11 Nov 09:30 | Wed 11 Nov 12:00 | all |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Hardware for 2nd ALICE SL5 64bit VOBOX | 16 Nov 2009 | High | Request to re-deploy lcg0614 (ALICE SW WN) as SL5 VOBOX (using quattor or not) - RT#53338 |
Hardware for LHCb SL5 64bit VOBOX | 25 Nov 2009 | Medium | Request for HW allocation (RT#53392) |
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this. |
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun)
- AoD: