RAL Tier1 weekly operations Grid 20091109

From GridPP Wiki
Revision as of 15:46, 9 November 2009 by Catalin condurache

Summary of Previous Week

Developments

  • Alastair
    • Started working on ATLAS code to test different permissions at the Tier 1.
    • Updated User Board CPU allocations.
    • Continued work producing list of experiment requirements.
    • Helped Brian with the disk server problems on gdss403.
  • Andrew
    • FTS channels: changed STAR-UKILT2ICHEP from SRMCOPY to URLCOPY; drained channels to/from RAL tier-2
    • Deployed csflnx414 as a CMS VOBOX for testing
    • Installed ProdAgent on csflnx414; learning about ProdAgent; submitted a test workflow to RAL
    • Continued development of automated generation of UB schedule spreadsheet
    • Testing of CMS skimming jobs
    • Attended OPB "Managing data with robots"
  • Catalin
    • Re-installed WMS03 and made it hot-swappable
    • Completed FronTier installation/configuration for ATLAS
    • Made progress with the SL5 VOBOX for ALICE
    • Applied various gLite and kernel upgrades
  • Derek
    • Incorporated comments on Quattor status talk
    • Attending Quattor workshop
    • Set lcgce01 to Production Status
  • Matt
    • Backed out 64-bit torque/maui on scheduler
    • Metrics feedback to Mayo
    • Tested footprints helpdesk for tracking Grid Service issues; provided feedback to Gareth
  • Richard
    • Updated Job Plan
    • CASTOR activities: Built a complete set of Quattor templates for the 4 machines in new pre-prod instance (and exposed a couple of bugs in Quattor in the process!)
  • Mayo
    • Collected some feedback on the new Metric Gathering System prototype
    • Created report view for new Metric Gathering System
    • Began writing a script to extract data from the Nagios-alarm-response-Grid spreadsheet for import into svn
    • Worked on a script to automate collection of tape robot statistics into a spreadsheet

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
PBS service failed to restart | 2009-11-05 13:15 | 2009-11-05 13:50 | All | Minor | Resolved by rolling back to 32-bit torque/maui

Plans for Week(s) Ahead

Plans

  • Alastair
    • Continue with any remaining training/tutorials
    • Test prod/poweruser0/user permissions at the Tier 1.
    • Produce first draft of experiment requirements by Wednesday.
    • Work on more efficient code for testing checksums on a disk server.
  • Andrew
    • Deploy gdss383 to cmsFarmRead
    • Continue work on automated generation of UB Schedule spreadsheet
    • Add MySQL client, Spreadsheet::WriteExcel, Spreadsheet::ParseExcel to lcgui02
    • Continue learning about ProdAgent
    • Continue investigating ReadAhead and LazyDownload on CMS skimming jobs
    • R89 machine room training
  • Catalin
    • Finish deployment of SL5 VOBOX for ALICE
    • Deploy 2nd ALICE VOBOX (see Fabric helpdesk request)
    • Kernel upgrades
  • Derek
    • Investigate/deploy SCAS
  • Matt
    • Disaster recovery planning
  • Richard
    • CASTOR activities: Add support for castor config files into Quattor templates
    • Apply the recent quattor experience to completing quattor config/build for BDII servers
  • Mayo
    • Collect more feedback on prototype system
    • Work on additional functionality for future releases of the Metric Gathering System
    • Continue work on script for extracting data from Nagios-alarm-response-Grid spreadsheet for importing into svn
    • Continue work on script for automating tape robot spreadsheet
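
For the planned work on more efficient checksum testing on a disk server (Alastair's item above), one common approach is to read each file in fixed-size chunks rather than loading it whole. The following is a minimal sketch only, not the Tier 1's actual code. It assumes Adler-32 checksums and a caller-supplied mapping of file paths to expected values; the function names `adler32_of_file` and `find_mismatches` are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string.

    The file is read in chunks so memory use stays bounded for large files.
    """
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running checksum back in to continue the calculation.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def find_mismatches(expected):
    """Return the paths whose computed checksum differs from the expected one.

    `expected` is an assumed input: a dict mapping file path -> hex checksum.
    """
    return [
        path
        for path, want in expected.items()
        if adler32_of_file(path) != want.lower()
    ]
```

Calling `find_mismatches({"/some/file": "0a1b2c3d"})` would return a list containing that path only if its computed checksum differs. The path and value here are placeholders.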

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
Kernel upgrades | all CEs | at risk | Wed 11 Nov 09:30 | Wed 11 Nov 12:00 | all

Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for 2nd ALICE SL5 64bit VOBOX | 16 Nov 2009 | High | Request to re-deploy lcg0614 (ALICE SW WN) as SL5 VOBOX (using quattor or not) - RT#53338
Hardware for LHCb SL5 64bit VOBOX | 25 Nov 2009 | Medium | Request for HW allocation (RT#53392)
Hardware for testing LFC/FTS resilience | - | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for PPS | - | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun)
  • AoD: