RAL Tier1 weekly operations castor 03/03/2014

From GridPP Wiki

Operations News

  • We have started stress testing the two new 2014 disk server generations now in preprod (gdss726,727)
  • Final testing for 2.1.14 is ongoing, including draining under stress. Our target minor release (-11) will be installed on preprod.
  • ATLAS has requested that we reduce the number of replicas in HOTDISK from 5 to 1

Operations Problems

  • The number of out-of-thread errors is now back to background levels. The increased thread count on the ATLAS stager is being rolled out to all stagers this week.
  • ATLASDATADISK and ATLASSCRATCHDISK are full. The VO is aware.
  • Admin machine lcgccvm02 has crashed ~5 times over the last 5 days. We believe this is related to Hyper-V instability.
  • Elastic Logging instability continues and needs to be fixed before the system can be considered production ready.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB


Advanced Planning

Tasks

  • CASTOR 2.1.14 + SL5/6 testing. The change control has gone through today with few problems.
  • iptables to be installed on lcgcviewer01 to harden the logging system against the injection of junk data by security scans.
  • Quattor cleanup process is ongoing.
  • Installation of new Preprod headnodes

Interventions

  • none

Staffing

  • CASTOR on-call person
    • Rob
  • Staff absence/out of the office:
    • Chris SDB User Meeting (Mon-Wed)
    • Matt EGI DP workshop (Tue-Thu)