RAL Tier1 weekly operations Grid 20100118

From GridPP Wiki

Summary of Previous Week

Developments

  • Alastair
    • Looked into the results of the HammerCloud test to understand Frontier performance.
    • Made progress with getting ATLAS power users to run at the Tier 1.
    • Updated RAL PP twiki with feedback from ATLAS meeting.
  • Andrew
    • Added checksum checking of migrated files to PhEDEx production instance
    • Wrote Nagios plugin for checking user proxy on CMS VOBOX
    • Added (hidden) option to capacity & efficiency ganglia pages for specifying units (KSI2K or HEP-SPEC06)
    • Added options to all UB schedule scripts for HEP-SPEC06 option
    • Wrote documentation about adding new VO to UB schedule scripts
    • Preparations for CMS Data Ops training
    • Training: online display screen equipment course & self-assessment
  • Catalin
    • Worked on the quattorised deployment of the SL5 LHCb VOBOX
    • Closed the t2k.org issue (user error)
    • Updated WMS03 (non-LHC)
  • Derek
    • Tested the CREAM CE reinstallation instructions
    • Created and tested quattor template to implement BLParser service
    • Added updated voms certificates to yaim config rpm
    • Listened in on GDB
  • Matt
    • Tested R-GMA recovery (Flexible Archiver component)
    • Worked with Carmine on LFC recovery plans
    • Produced 2009/Q4 FTS metrics for quarterly report
  • Richard
    • 2 days A/L
    • Finished plan for BDII changes
    • Continued writing discussion document for DNS proposal
    • Continued work on the CASTOR pre-prod instance
    • Built a test machine as a BDII server to test quattor templates
    • Worked with JK and GS on a script to check CASTOR checksums
  • Mayo
    • Encrypted passwords within the Metric system
    • Added a change password feature to the metric system
    • Fixed a bug within the Metric system
    • Worked on the tape statistics spreadsheet project: converting Excel charts to HTML
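
Andrew's Nagios plugin for checking the user proxy on the CMS VOBOX is not reproduced here; the following is a minimal, hypothetical sketch of how such a check might work, assuming the remaining lifetime is read with `voms-proxy-info -timeleft -file <proxy>` (the thresholds and function names are illustrative, not taken from the actual plugin):

```python
# Hypothetical sketch of a Nagios-style proxy-lifetime check (not the actual
# plugin).  Assumes the time left, in seconds, is obtained via
# `voms-proxy-info -timeleft -file <proxy>`; thresholds are illustrative.
import subprocess

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(timeleft_s, warn_s=6 * 3600, crit_s=3600):
    """Map remaining proxy lifetime (seconds) to a Nagios status and message."""
    if timeleft_s < crit_s:
        return CRITICAL, "proxy expires in %ds" % timeleft_s
    if timeleft_s < warn_s:
        return WARNING, "proxy expires in %ds" % timeleft_s
    return OK, "proxy valid for another %ds" % timeleft_s

def check_proxy(proxy_path):
    """Query the proxy, print a one-line Nagios status, return the exit code."""
    try:
        out = subprocess.check_output(
            ["voms-proxy-info", "-timeleft", "-file", proxy_path])
        timeleft = int(out.strip())
    except Exception as exc:
        print("UNKNOWN: could not query proxy: %s" % exc)
        return UNKNOWN
    status, message = classify(timeleft)
    print(["OK", "WARNING", "CRITICAL"][status] + ": " + message)
    return status
```

The exit code (0/1/2/3) is what Nagios acts on; the one-line message is what shows up in the Nagios display.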

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
FTS DB performance problems | 20100115 11:00 | 20100115 16:00 | LHC | High | Load on the Orisa nodes redistributed by reconfiguring the FTS agents.

Plans for Week(s) Ahead

Plans

  • Alastair
    • Run (hopefully) final tests on Frontier server (after Catalin has performed servlet update) to confirm it is working well.
    • Continue updating RAL PP twiki.
    • Complete version 1 of Tier 1 VO requirements with information that has been provided by Raja.
    • Possibly away/working from home on Tuesday (depends on how long the hospital appointment takes)
  • Andrew
    • Joining CMS Data Ops - away at CERN for training
  • Catalin
    • Finalise the SL5 LHCb VOBOX deployment (hot-swapping issues)
    • Follow up some post-reboot WMS issues with CERN
    • Tidy up the LFC schemas (with Carmine)
    • Exercise the ALICE xrootd (manager + peer) re-installation (on the old SL4 VOBOXes)
  • Derek
    • Implementing BLParser on lcgbatch01
    • Completing testing of CE and CREAM CE for Intervention changes
    • GLexec and SCAS on SL5
  • Matt
    • Finish Grid Services Disaster Recovery document
    • Planning ATLAS/R89 co-hosting of Grid Services
    • Provide test site BDII for CIP upgrade testing
  • Richard
    • Finish discussion document for DNS proposal
    • Continue working on CASTOR pre-prod instance
    • Further work on the Quattor templates for BDII server
    • Re-do existing STP time bookings and enter EGEE timesheets back to starting date
    • 2 days A/L
  • Mayo
    • Automating Metric report system
    • Adding charts to the metric system
    • Web interface and script to fetch data for Tape robot statistics spreadsheet project

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
FTS DB problems | Orisa, FTS agents | Unscheduled | 20100115 11:00 | 20100115 16:00 | LHC

Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for hardware made through the RT Fabric queue
Hardware for testbed | | High | Required for change validation, load testing, etc.; also for phased rollout (which replaces PPS)
Hardware for SCAS servers | Feb 1 2010 | High | Hardware required for production SCAS servers; required to be in place by end of Feb
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | | Medium | Hardware required for a CREAM CE for non-LHC VOs
Pool accounts for Super B VO | | Medium | Required to enable the Super B VO on the batch farm

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall:
  • AoD: Catalin (Wed)