RAL Tier1 weekly operations Grid 20091102

Summary of Previous Week

Developments

  • Alastair
    • Finished security audit
    • Deployed disk servers from non-prod to prod
    • Completed most of the CASTOR training, fitting sessions around Shaun's availability
    • Learnt about Tier 2 data storage allocation from Brian
    • Learnt how to make changes with quattor and updated twiki
    • Updated PPD twiki
  • Andrew
    • Completed the consistency check of the Aug 09 APEL and PBS accounting data; resolved problems with the Oct 09 pbsjobs MySQL data (a sketch of this kind of cross-check follows this list)
    • Writing Perl script to generate UB Schedule spreadsheet
    • Attended CMS Offline and Computing Workshop, CERN
    • Obtained CMS production role
    • Meeting with a member of CMS data ops about ProdAgent
    • Deleted 190,000 CMS files in /store/unmerged
    • Training: CERN level 1 & 2 safety
  • Catalin
    • Finished the ALICE disk server deployment
    • Deployed and tested the FronTier/squid server for ATLAS
    • Installed the SL5 VOBOX for ALICE
    • Started the drain operation for WMS03
  • Derek
    • Deployed updated vo config in quattor
    • Fixed quattor directory creation on WNs
    • Writing RAL talk for Quattor workshop
    • Documenting CE information system setup
  • Matt
    • Deployed gLite 3.2 SL5 VOBOX
    • Checked priorities for deploying the Viglen 08 kit once it passes acceptance tests (to meet shortfalls in the ATLAS and LHCb pledges)
  • Richard
    • DSE Training
    • Deployed 5 disk servers into AtlasSimStrip
    • Packaged RT helpdesk scripts plus their associated cron entries as an RPM using DR's layout
    • Repackaged the gmetric-bdii-top.pl and tier1-bdii-top-config RPMs using DR's layout
    • Updated the log-analysis Perl scripts in the gmetric-bdii-top.pl and tier1-bdii-top-config RPMs for better performance: one now shows a ~15x improvement, the other ~10x (the general pattern is sketched after this list)
    • CASTOR activities: continued development of quattor templates for servers in pre-prod instance; also DNS changes
  • Mayo
    • Rolled out first prototype of the new Metric Gathering System
    • Collected some feedback on the new Metric Gathering System prototype
    • Resolved SVN access issues
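
As a rough illustration of Andrew's APEL/PBS consistency check above, here is a minimal sketch that compares per-day job counts and CPU time from two CSV exports (one from the APEL records, one from the pbsjobs MySQL table) and prints the days that disagree. The file names, column names and tolerance are assumptions made for the example; this is not the script that was actually run.

  #!/usr/bin/env python3
  # Hypothetical sketch: cross-check APEL accounting against PBS job records.
  # The CSV exports and their columns (date, jobs, cpu_seconds) are assumed names.
  import csv
  from collections import defaultdict

  def load_daily_totals(path):
      """Sum job counts and CPU seconds per day from a CSV export."""
      totals = defaultdict(lambda: [0, 0.0])
      with open(path) as f:
          for row in csv.DictReader(f):
              totals[row["date"]][0] += int(row["jobs"])
              totals[row["date"]][1] += float(row["cpu_seconds"])
      return totals

  def compare(apel, pbs, cpu_tolerance=0.01):
      """Print the days on which the two accounting sources disagree."""
      for day in sorted(set(apel) | set(pbs)):
          a_jobs, a_cpu = apel.get(day, (0, 0.0))
          p_jobs, p_cpu = pbs.get(day, (0, 0.0))
          cpu_ok = abs(a_cpu - p_cpu) <= cpu_tolerance * max(p_cpu, 1.0)
          if a_jobs != p_jobs or not cpu_ok:
              print("%s: APEL %d jobs / %.0f s CPU, PBS %d jobs / %.0f s CPU"
                    % (day, a_jobs, a_cpu, p_jobs, p_cpu))

  if __name__ == "__main__":
      # apel_aug09.csv and pbsjobs_aug09.csv are placeholder file names.
      compare(load_daily_totals("apel_aug09.csv"),
              load_daily_totals("pbsjobs_aug09.csv"))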
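
The Perl changes behind the BDII log-analysis speed-ups live in the RPMs mentioned above and are not reproduced here. Purely as an illustration of the kind of single-pass, precompiled-regex restructuring that typically gives order-of-magnitude gains on large logs, a minimal sketch follows; the log format and field name in it are invented.

  #!/usr/bin/env python3
  # Illustration only: single pass over a log file with one precompiled regex.
  # The "result=" field is invented for the example.
  import re
  from collections import Counter

  LINE_RE = re.compile(r"result=(?P<result>\w+)")

  def summarise(path):
      """Count query results in a single pass over the log."""
      counts = Counter()
      with open(path) as log:
          for line in log:
              m = LINE_RE.search(line)
              if m:
                  counts[m.group("result")] += 1
      return counts

  if __name__ == "__main__":
      # bdii.log is a placeholder path.
      for result, n in summarise("bdii.log").most_common():
          print(result, n)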

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status

Plans for Week(s) Ahead

Plans

  • Alastair
    • Go through gLite training
    • Finish CASTOR training
    • Update CPU efficiencies
    • Test the UK Frontier/Squid setup using Athena release 15.5.1 (a rough standalone check is sketched after this list)
    • Test prod/poweruser0/user permissions at the Tier 1
    • Continue updating the PPD twiki on ATLAS software
  • Andrew
    • Continue work on automated generation of UB Schedule spreadsheet
    • Deploy a spare service node as a VOBOX using Quattor; install & setup ProdAgent; run a test production job
  • Catalin
    • Finish deployment of the SL5 VOBOX for ALICE
    • Re-install WMS03 (hot-swapping)
    • Integrate the FronTier server into the ATLAS Frontier/squid network
  • Derek
    • Attend quattor workshop (Brussels)
    • Investigate/deploy SCAS
  • Matt
    • Disaster recovery planning
  • Richard
    • Update Job Plan
    • Complete quattor config/build for BDII servers
    • CASTOR activities: Continue work on new pre-prod instance
  • Mayo
    • Collect more feedback on the prototype system
    • Begin working on additional functionality for future releases of the Metric System
    • Work on phase two of the on-call documentation project
    • Write the design specification for the IPMI project
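
As a rough sketch of the kind of standalone check that could back up the "Test UK Frontier/Squid" item above (the real tests will be driven from Athena release 15.5.1), the snippet below requests a Frontier URL through the local Squid proxy twice and reports the cache headers. The proxy host, Frontier URL and port numbers are placeholders, not the actual RAL or CERN endpoints.

  #!/usr/bin/env python3
  # Hypothetical check: fetch a Frontier URL through the site Squid proxy twice
  # and inspect the X-Cache header; host names and URL below are placeholders.
  import urllib.request

  SQUID_PROXY = "http://squid.example.gridpp.rl.ac.uk:3128"
  FRONTIER_URL = "http://frontier.example.cern.ch:8000/atlr/Frontier"

  def check_via_squid(url, proxy):
      """Request the URL via the proxy and report the cache hit/miss headers."""
      opener = urllib.request.build_opener(
          urllib.request.ProxyHandler({"http": proxy}))
      for attempt in (1, 2):
          with opener.open(url, timeout=30) as resp:
              # Squid conventionally reports HIT or MISS in the X-Cache header.
              print("attempt %d: HTTP %d, X-Cache: %s"
                    % (attempt, resp.status, resp.headers.get("X-Cache", "n/a")))

  if __name__ == "__main__":
      check_via_squid(FRONTIER_URL, SQUID_PROXY)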

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
WMS03 hot-swap | lcgwms03.gridpp.rl.ac.uk | Scheduled Outage | Oct 30 (09:00) | Nov 05 (16:00) | non-LHC

Requirements and Blocking Issues

Description | Required By | Priority | Status
HW for Squid deployment | ATLAS | High | Request made via RT Fabric queue; used reserved hardware
HW for FronTier deployment | ATLAS | High | Request made via RT Fabric queue; used reserved hardware
HW for SL5 64-bit VOBOX | ALICE | High | Request made via RT Fabric queue; used reserved hardware
Hardware for testing LFC/FTS resilience | - | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Non-capacity HW for testing | - | Medium | Still using the old HW
Hardware for PPS | - | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Thu)
  • Grid OnCall:
  • AoD: