RAL Tier1 weekly operations castor 31/03/2014

Operations News

  • Disk deployments: 1 CV’13 server in lhcbDst; 10 CV’13 servers in lhcbNonProd waiting for blessing; 3 CV’13 servers on the way to cmsNonProd.
  • Disk draining: 2 ATLAS servers drained and 1 in progress; 3 CMS servers drained and 1 in progress; 1 LHCb server drained, with further draining awaiting disk deployment.
  • CMSdisk at 7% free
  • CASTOR 2.1.14 upgrade progress: reversion of the preprod software and databases to 2.1.13-9 was successful. The instance passed functional tests and was stress tested overnight by Matt. The central services software and databases were then upgraded to 2.1.14-11, but the name server was not switched to 2.1.14 native mode, so that 2.1.13-9 stagers could be tested against 2.1.14-11 central services. The instance passed functional tests and is now being stress tested by Matt. The next step is to test 2.1.14-11 stagers against 2.1.14-11 central services with the name server still in 2.1.13 compatibility mode (an illustrative sketch of the kind of functional check used here is given below).
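
    As an illustration of what the functional tests involve, the sketch below does a minimal write/read round trip against a CASTOR instance using the standard client tools (rfcp and nsls). It is an assumption-laden example rather than the actual preprod test suite: the /castor path is a placeholder, and STAGE_HOST / STAGE_SVCCLASS are presumed to already point at the instance under test.

      #!/usr/bin/env python
      """Illustrative CASTOR functional check: write a file in, list it, read it back.
      The /castor path is a placeholder; rfcp/nsls must be on the PATH and
      STAGE_HOST / STAGE_SVCCLASS set for the instance under test."""
      import os
      import subprocess
      import sys
      import tempfile

      TEST_DIR = "/castor/preprod.example/test/functional"   # placeholder path


      def run(cmd):
          """Echo and run one client command, aborting on failure."""
          print("+ " + " ".join(cmd))
          subprocess.check_call(cmd)


      def main():
          # Create a small local file to copy in and back out again.
          with tempfile.NamedTemporaryFile(delete=False) as src:
              src.write(b"castor functional check\n")
          remote = TEST_DIR + "/" + os.path.basename(src.name)

          run(["rfcp", src.name, remote])             # copy into CASTOR
          run(["nsls", "-l", remote])                 # name server sees the file
          run(["rfcp", remote, src.name + ".back"])   # copy it back out

          # Compare the round-tripped contents.
          with open(src.name, "rb") as a, open(src.name + ".back", "rb") as b:
              if a.read() != b.read():
                  sys.exit("read-back mismatch")
          print("basic write/read check passed")


      if __name__ == "__main__":
          main()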

Operations Problems

  • CMS load continues to cause problems; we had to restart the transfermanager/diskmanager daemons to get things working again (Monday 10:45 and Tuesday 17:30).
  • transfermanagerd was restarted on fdscdlf02 on Thursday.
  • The vcert SRM and name server are not accessible because of hypervisor problems after the rack move; some reconfiguration may be needed to bring them back. Dimitrios is looking into this.
  • A node crash on Neptune caused brief problems with the ATLAS SRM; this is a known issue that has already been logged with Oracle.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • (Tue 1 Apr) Tim will stop all tape access for an hour or so to test the Linux-based controller (ACSLS, which controls the tape robot). Andrew L has been informed.
  • (Tue 1 Apr) Facilities CASTOR upgrade; downtime between 09:00 and 16:00.

Advanced Planning

Tasks

  • ATLAS would like to store c. 2 million EVNT Monte Carlo files; Brian to discuss with Alastair. Other Tier 1s are not keen, but the RAL Tier 1 CASTOR should be able to cope with this.

  • CASTOR 2.1.14 + SL5/6 testing. The change control has gone through today with few problems.
  • iptables to be installed on lcgcviewer01 to harden the logging system against the injection of junk data by security scans (an illustrative sketch of such rules is given after this list).
  • Quattor cleanup process is ongoing.
  • Installation of new Preprod headnodes
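
    Purely as an illustration of the iptables hardening mentioned above: rules of the following shape would accept log traffic only from a known source range and drop anything else aimed at the ingest port, so that security scans can no longer inject junk records into the logging system. The port number and subnet used here are placeholders, not the real lcgcviewer01 configuration.

      #!/usr/bin/env python
      """Illustrative iptables hardening for a logging host (run as root).
      LOG_PORT and ALLOWED_SRC are placeholders for the real lcgcviewer01 values."""
      import subprocess

      LOG_PORT = "5544"            # placeholder: the log-ingest port
      ALLOWED_SRC = "10.0.0.0/24"  # placeholder: subnet of the CASTOR servers

      RULES = [
          # Accept log traffic from the known server subnet only...
          ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", LOG_PORT,
           "-s", ALLOWED_SRC, "-j", "ACCEPT"],
          # ...and drop anything else aimed at that port (e.g. security scans).
          ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", LOG_PORT,
           "-j", "DROP"],
      ]

      for rule in RULES:
          print("+ " + " ".join(rule))
          subprocess.check_call(rule)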

Interventions

Staffing

  • Castor on Call person
    • Matthew
  • Staff absence/out of the office:
    • (Mon-Fri) Rob on A/L
    • (Friday) Bruno possibly on A/L