RAL Tier1 weekly operations castor 22/03/2010

Summary of Previous Week

  • Matthew:
    • Mostly away on paternity leave
  • Shaun:
    • Detailed Planning of 2.1.8/9 upgrades
    • Started testing SRM 2.8-6 on certification
    • Worked on preparation for PreProd stress testing
    • Planning for LHCb Jamboree
    • Slide prep for open day
  • Chris:
    • Testing maximum number of job slots for root protocol with Raja
    • Built 4 cold stand-by central castor servers (still need the final configuration and basic tests)
    • Deploying disk servers
    • DepMon duties
    • Castor on Call duties
    • Doing work related to Tier1 Security Group project
  • Cheney:
    • ..
  • Tim:
    • Drained CS5116 (may have issues); no data loss
    • T10K testing on preprod
  • Richard:
    • Further tweaks to stress-testing suite in preparation for benchmarking exercise
  • Brian:
    • Disk server removal
  • Jens:
    • ..

Developments for this week

  • Matthew:
    • 2.1.8/2.1.9 strategy
    • Polymorphic servers' 'morphing' scripts
    • Tier1 talk
  • Shaun:
    • LHCb Jamboree
    • Progress Strategic Objectives
    • Testing SRM 2.8-6
    • Progressing Stress Testing on pre-prod
  • Chris:
    • Finish configuring and testing cold stand-by central castor servers
    • Continue doing work related to Tier1 Security Group project
    • DepMon duties
    • Castor on Call duties (Mon-Tue)
    • Progress with Quattor tape server
    • Stress tests on PreProd (part of castor upgrade plan)
    • Prepare certification instance for 2.1.8 upgrade
    • Preparation for 2.1.8/2.1.9 castor upgrade
  • Cheney:
    • ..
  • Tim:
    • More T10K testing prior to CMS migration
    • Installation of new tape servers
  • Richard:
    • Adding "random file size" feature to stress testing suite
    • Running stress tests on pre-prod instance
  • Jens:

Operations Issues

  • An ATLAS user is causing SRM failures. Under investigation.
  • A checksum problem has been reported by ATLAS causing some transfer failures from the CE to the SE.
  • CMS migrations continue to be troublesome. The production team will restart migHunters when problems are observed, but the Nagios alert needs to be revisited since it does not seem to be working.
  • Castor151 (part of Neptune RAC) rebooted on Friday morning for no obvious reason. It was pointed out that backups now run from this node, having been moved from the previously flaky node, which has since been rock solid.
  • During disk server deployments we have again seen instances where not all the required RPMs were installed (roll on Quattor).

Blocking issues

  • IP Address and Cabling for Facilities Instance

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: None

Advanced Planning

  • Gen upgrade to 2.1.8 (2010Q1)
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Chris (Mon-Tue) / Shaun (Wed-Sun)
  • Staff absences:
    • Shaun: Mon, Tue.