RAL Tier1 weekly operations Grid 20100111

From GridPP Wiki

Summary of Previous Week

Developments

  • Alastair
    • Prepared slides and ran Hammer Cloud test for ATLAS UK meeting in Cambridge.
    • Away at ATLAS UK meeting 6th - 8th January.
  • Andrew
    • Improved CMS monitoring, wrote scripts checking: JobRobot, SAM tests, production jobs, pre-staging of RAW data, proxy on VOBOX, 5 data transfer checks
    • Deleted CMS files in /store/unmerged
    • Investigated various CMS transfer problems (debug instance) & jobs with low CPU efficiencies
    • Completed December accounting; updated the spreadsheet-generating Perl script to handle multiple years
    • Planning KSI2K to HEP-SPEC06 migration
  • Catalin
    • SL5 LHCb VOBOX installation (with Ian)
    • followed up issue with t2k.org 'zero size' LFC entries
    • decommissioned old SL4 ALICE VOBOXes
    • atlasbackup 'exclude files' fix
  • Derek
    • Updated Wordpress
    • Published fairshares in information system
    • Configured GlExec on an SL4 WN
    • Tested CE reinstallation instructions
  • Matt
  • Richard
    • Finished plan for BDII changes
    • Continued writing discussion document for DNS proposal
    • Continued work on the CASTOR pre-prod instance
    • Built a test machine as a BDII server to test quattor templates
    • Worked with JK and GS on a script to check CASTOR checksums
  • Mayo
    • Encrypted passwords within the Metric system
    • Added a change password feature to the metric system
    • Fixed a bug within the Metric system
    • Worked on tape statistics spreadsheet project: converting Excel charts to HTML
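
As a rough illustration of the KSI2K to HEP-SPEC06 migration being planned, the sketch below converts accounting figures between the two units. It assumes the commonly quoted WLCG conversion factor of 4 HEP-SPEC06 per kSI2K; the factor and the function names are assumptions for illustration, not taken from the Tier1 accounting scripts.

```python
# Assumption: 1 kSI2K corresponds to 4 HEP-SPEC06 (commonly quoted WLCG factor).
HS06_PER_KSI2K = 4.0


def ksi2k_to_hs06(ksi2k):
    """Convert a capacity or usage figure from kSI2K to HEP-SPEC06."""
    return ksi2k * HS06_PER_KSI2K


def hs06_to_ksi2k(hs06):
    """Convert a capacity or usage figure from HEP-SPEC06 back to kSI2K."""
    return hs06 / HS06_PER_KSI2K


def convert_monthly_usage(usage_ksi2k_hours):
    """Convert a {VO name: kSI2K-hours} mapping to HEP-SPEC06-hours.

    The mapping layout is hypothetical; real accounting data may be
    organised differently.
    """
    return {vo: ksi2k_to_hs06(hours) for vo, hours in usage_ksi2k_hours.items()}
```

The same factor applies to rates (kSI2K) and integrated usage (kSI2K-hours), since only the CPU power unit changes.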

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status

Plans for Week(s) Ahead

Plans

  • Alastair
    • Look into results of Hammer Cloud test to try to understand slow Frontier performance (run more tests if necessary).
    • Update RAL PP twiki with feedback from ATLAS meeting.
    • Find out whether there is still a need for a Pacman mirror.
  • Andrew
    • Include checksum checking in PhEDEx production instance
    • Write Nagios plugin to check for recently-migrated files with incorrect checksums
  • Catalin
    • finalise SL5 LHCb VOBOX deployment
    • work on t2k.org 'zero size' LFC entries issue
    • work on LFC schemas tidying up (with Carmine)
    • exercise Alice xrootd (manager + peer) re-installation (on old SL4 voboxes)
  • Derek
    • Testing helpdesk restore
    • Verifying CREAM CE installation instructions
  • Matt
    • Test R-GMA recovery (Flexible Archiver component)
  • Richard
    • Finish discussion document for DNS proposal
    • Continue working on CASTOR pre-prod instance
    • Further work on the Quattor templates for BDII server
    • Re-do existing STP time bookings and enter EGEE timesheets back to starting date
    • 2 days A/L
  • Mayo
    • Automating Metric report system
    • Adding charts to the metric system
    • Web interface and script to fetch data for Tape robot statistics spreadsheet project
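
The planned Nagios plugin for checksum checking (under Andrew) could take roughly the following shape. This is a minimal sketch, assuming Adler-32 checksums and a mapping of file path to expected checksum supplied by some other tool; how the list of recently-migrated files and their catalogue checksums would be obtained is not covered, and the function names are hypothetical. Only the standard Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is relied on.

```python
import zlib

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def adler32_of(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 initial value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)


def check_files(expected):
    """Compare files against expected checksums.

    `expected` maps a file path to its expected Adler-32 hex string
    (hypothetical input format). Returns (nagios_code, status_line).
    """
    mismatched = []
    unreadable = []
    for path, want in sorted(expected.items()):
        try:
            got = adler32_of(path)
        except OSError:
            unreadable.append(path)
            continue
        if got != want.lower().zfill(8):
            mismatched.append(path)

    total = len(expected)
    if mismatched:
        return CRITICAL, "CHECKSUM CRITICAL - %d of %d files mismatch: %s" % (
            len(mismatched), total, ", ".join(mismatched))
    if unreadable:
        return UNKNOWN, "CHECKSUM UNKNOWN - %d of %d files unreadable: %s" % (
            len(unreadable), total, ", ".join(unreadable))
    return OK, "CHECKSUM OK - %d files verified" % total
```

A wrapper script would print the returned status line and exit with the returned code, which is what Nagios reads to set the service state.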

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for PPS | | High | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.
Hardware for Grid Services testbed | | Medium |

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Fri)
  • AoD: