RAL Tier1 weekly operations castor 31/03/2014

From GridPP Wiki

Operations News

  • Disk deployments: 1 CV’13 in lhcbDst; 10 CV’13 in lhcbNonProd waiting for blessing; 3 CV’13 on the way to cmsNonProd.
  • Disk draining: 2 ATLAS servers drained and 1 in progress. 3 CMS servers drained and 1 in progress. 1 LHCb server drained; awaiting disk deployment before any further draining.
  • CMSdisk at 7% free
  • CASTOR 2.1.14 upgrade progress: the reversion to 2.1.13-9 software and databases on preprod was successful. The instance passed functional tests and was stress tested by Matt overnight. The software and databases of the central services were then upgraded to 2.1.14-11, but the name server was not switched to 2.1.14 native mode, so that 2.1.13-9 stagers could be tested against 2.1.14-11 central services. The instance passed functional tests in this configuration and is now being stress tested by Matt. The next step is to test 2.1.14-11 stagers against 2.1.14-11 central services with the name server still in 2.1.13 compatibility mode.

Operations Problems

  • CMS load continues to cause problems; we had to restart the transfer and disk managers to get things working again (Monday 10:45 and Tuesday 17:30).
  • transfermanagerd was restarted on fdscdlf02 on Thursday.
  • The vcert SRM and name server are not accessible because of hypervisor issues following the rack move; some configuration may be required to bring them back. Dimitrios is looking into this.
  • A node crash on Neptune caused brief issues with the ATLAS SRM. This is a known issue that has already been logged with Oracle.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • (Tue 1 Apr) Tim will stop all tape access for an hour or so to test the Linux-based controller (ACSLS, which controls the tape robot). Andrew L has been informed.
  • (Tue 1 Apr) Facilities CASTOR upgrade. Downtime between 09:00 and 16:00.

Advanced Planning

Tasks

  • ATLAS would like to store c. 2 million EVNT Monte Carlo files; Brian to discuss with Alastair. Other Tier 1s are not keen, but the RAL Tier 1 / CASTOR should be able to cope with this.

<<<<< REVIEW THIS >>>>>

  • CASTOR 2.1.14 + SL5/6 testing. The change control has gone through today with few problems.
  • iptables to be installed on lcgcviewer01 to harden the logging system against the injection of junk data by security scans.
  • Quattor cleanup process is ongoing.
  • Installation of new Preprod headnodes

Interventions

Staffing

  • CASTOR on-call person
    • Matthew
  • Staff absence/out of the office:
    • (Mon-Fri) Rob A/L
    • (Friday) Bruno, possible A/L