RAL Tier1 weekly operations castor 16/11/2009

From GridPP Wiki

Summary of Previous Week

  • Establishing requirements for ATLAS black and white (B&W) lists (Brian)
  • Vulcan now working with array from Kevin (Cheney)
  • Fixed crashed Xen host (Cheney)
  • Nagios work (Cheney)
  • Four T10KB drives now working (Tim)
  • Replaced faulty tape drive (Tim)
  • Installing T10KB drives (Tim)
  • Finalizing configuration on repack (Chris, Tim)
  • Resolved Python problem on repack by obtaining a missing RPM from CERN (Chris)
  • Testing B&W Lists (Chris)
  • Quattor now working with all four preprod central servers (Richard)
  • Depmon and CoD duties (Matthew)
  • Wrote restarter for job manager (Matthew)

Developments for this week

  • Configuring repack server (Chris, Tim, Matthew)
  • Improving resilience on central servers (Chris, Shaun)
  • Configuring access for T2K (Shaun, Jens)
  • Testing Quattor templates for preprod (Richard)
  • Writing restarter for rmmaster (Matthew)
  • Disaster recovery document (Matthew)

Operations Issues

  • Crash of the Xen machine hosting CIP2. The older CIP1 was switched back into production
  • DB problems while migrating services between nodes during application of an Oracle patch. Connections were not kept open, which caused various CASTOR services to stop
  • D2D copies from lhcbUser do not work; under investigation

Blocking issues

none

Planned, Scheduled and Cancelled Down Times

none

Advanced Planning

  • Black and White lists will be tested and introduced on ATLAS
  • Install/enable gridftp-internal on Gen (this year)

Staffing

  • CASTOR on-call person: Shaun
  • Chris away Monday, Shaun away Wednesday, Cheney away Friday