RAL Tier1 weekly operations Grid 20091026

From GridPP Wiki

Summary of Previous Week

Developments

  • Alastair
    • Perform Security Audit
    • Learn how to deploy disk servers for ATLAS
    • Discuss Job Plan with Matt
    • Discuss allocation of ATLAS disk space with Brian Davies and Stephen Burke
    • Go to Shared Service training
  • Andrew
    • FTS channel adjustments: timeouts doubled for STAR-FIHIPT2 & RALLCG2-CLOUDCMSITALY
    • Disk server deployment (5 servers to cmsFarmRead)
    • APEL & PBS comparisons for CREAM CE
    • Correcting PBS jobs MySQL table for October
    • Resolved problem with PhEDEx mss-remove agent
    • Upgraded PhEDEx to 3.2.9
    • Completed CMS "dark" data removal
    • Investigating consistency between missing files lists from PhEDEx & CASTOR team
  • Catalin
    • CRISTAL level 1 course
    • Finished kickstarts for FronTier and SL5 VOBOX; now waiting for HW
    • Assisted with the LFC ATLAS cleaning operation
    • Disk server deployment for ALICE
  • Derek
    • Updating VO configuration in Quattor
    • Testing helpdesk backup
    • CRISTAL level 1
    • SSC Training
    • Out sick 1 day
  • Matt
    • Determine LHCb service class requirements for new allocation
    • Disk deployment meeting
  • Richard
    • ORACLE SSC Training
    • Further disk server deployments into Atlas NonProd (including updates to the TWiki instructions)
    • Continued work on BDII/Quattor task
    • CASTOR activities: read through SDW's training slides; worked on new pre-prod instance
  • Mayo
    • Worked on the new Metrics Gathering System
    • Thought Bubble website now in operation
    • Initial research into IPMI power control project

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status

Plans for Week(s) Ahead

Plans

  • Alastair
    • Finish security audit (if not already finished)
    • Go through gLite training
    • Go through CASTOR training slides
    • Learn about FTS and the outputs that I will take over from Brian
    • Update CPU efficiencies
  • Andrew
    • Attend CMS Offline & Computing Workshop, CERN
  • Catalin
    • Ready to deploy SL5 VOBOX for ALICE (waiting for HW)
    • Ready to deploy FronTier/squid for ATLAS (waiting for HW)
    • Finish ALICE disk server deployment
    • Start WMS03 drain
  • Derek
    • Test helpdesk restore
    • Updating Quattor VO configuration
    • Update CE documentation
  • Matt
    • Check priorities for deploying Viglen 08 kit after it passes acceptance tests
    • VO requirements capture
    • Disaster recovery planning
  • Richard
    • RPM packaging and installation for new BDII connection throttling script
    • RPM packaging and installation for new BDII monitoring script
    • Complete Quattor config/build for BDII servers
    • CASTOR activities: Continue work on new pre-prod instance
  • Mayo
    • Continue work on the new Metrics Gathering System
    • Begin Stage 2 of on-call documentation project
    • Continue research into IPMI power control project

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
WMS03 hotswappable | lcgwms03.gridpp.rl.ac.uk | Scheduled Outage | Oct 30 (09:00) | Nov 05 (16:00) | non-LHC

Requirements and Blocking Issues

Description | Required By | Priority | Status
HW for Squid deployment | ATLAS | High | Request made via RT Fabric queue
HW for FronTier deployment | ATLAS | High | Request made via RT Fabric queue
HW for SL5 64-bit VOBOX | ALICE | High | Request made via RT Fabric queue
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated to this.

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Thu)
  • Grid OnCall:
  • AoD: