RAL Tier1 weekly operations castor 31/03/2014

Operations News

  • Disk deployments: 1 CV’13 server in lhcbDst; 10 CV’13 servers in lhcbNonProd waiting for blessing; 3 CV’13 servers on the way to cmsNonProd.
  • Disk draining: 2 ATLAS servers drained and 1 in progress; 3 CMS servers drained and 1 in progress; 1 LHCb server drained, with further draining awaiting disk deployment.
  • CMSdisk at 7% free
  • CASTOR 2.1.14 upgrade progress: reversion of the preprod software and databases to 2.1.13-9 was successful. The instance passed functional tests and was stress tested overnight by Matt. The central services software and databases were then upgraded to 2.1.14-11, but the name server was not switched to 2.1.14 native mode, so that 2.1.13-9 stagers could be tested against 2.1.14-11 central services. The instance passed functional tests and is now being stress tested by Matt. The next step is to test 2.1.14-11 stagers against 2.1.14-11 central services with the name server still in 2.1.13 compatibility mode (an illustrative sketch of the kind of functional check used here is given below).
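
    As an illustration of what the functional tests involve, the sketch below does a minimal write/read round trip against a CASTOR instance using the standard client tools (rfcp and nsls). It is an assumption-laden example rather than the actual preprod test suite: the /castor path is a placeholder, and STAGE_HOST / STAGE_SVCCLASS are presumed to already point at the instance under test.

      #!/usr/bin/env python
      """Illustrative CASTOR functional check: write a file in, list it, read it back.
      The /castor path is a placeholder; rfcp/nsls must be on the PATH and
      STAGE_HOST / STAGE_SVCCLASS set for the instance under test."""
      import os
      import subprocess
      import sys
      import tempfile

      TEST_DIR = "/castor/preprod.example/test/functional"   # placeholder path


      def run(cmd):
          """Echo and run one client command, aborting on failure."""
          print("+ " + " ".join(cmd))
          subprocess.check_call(cmd)


      def main():
          # Create a small local file to copy in and back out again.
          with tempfile.NamedTemporaryFile(delete=False) as src:
              src.write(b"castor functional check\n")
          remote = TEST_DIR + "/" + os.path.basename(src.name)

          run(["rfcp", src.name, remote])             # copy into CASTOR
          run(["nsls", "-l", remote])                 # name server sees the file
          run(["rfcp", remote, src.name + ".back"])   # copy it back out

          # Compare the round-tripped contents.
          with open(src.name, "rb") as a, open(src.name + ".back", "rb") as b:
              if a.read() != b.read():
                  sys.exit("read-back mismatch")
          print("basic write/read check passed")


      if __name__ == "__main__":
          main()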

Operations Problems

  • CMS load continues to cause problems; we had to restart the transfermanager/diskmanager daemons to get things working again (Monday 10:45 and Tuesday 17:30).
  • transfermanagerd was restarted on fdscdlf02 on Thursday.
  • The vcert SRM and name server are not accessible because of hypervisor problems after the rack move; some reconfiguration may be needed to bring them back. Dimitrios is looking into this.
  • A node crash on Neptune caused brief problems with the ATLAS SRM; this is a known issue that has already been logged with Oracle.

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

  • (Tue 1 Apr) Tim will stop all tape access for an hour or so to test the Linux-based controller (ACSLS, which controls the tape robot). Andrew L has been informed.
  • (Tue 1 Apr) Facilities CASTOR upgrade; downtime between 09:00 and 16:00.

Advanced Planning

Tasks

  • ATLAS would like to store c. 2 million EVNT Monte Carlo files; Brian to discuss with Alastair. Other Tier 1s are not keen, but the RAL Tier 1 CASTOR should be able to cope with this.

  • CASTOR 2.1.14 + SL5/6 testing. The change control has gone through today with few problems.
  • iptables to be installed on lcgcviewer01 to harden the logging system against the injection of junk data by security scans (an illustrative sketch of such rules is given after this list).
  • Quattor cleanup process is ongoing.
  • Installation of new Preprod headnodes
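
    Purely as an illustration of the iptables hardening mentioned above: rules of the following shape would accept log traffic only from a known source range and drop anything else aimed at the ingest port, so that security scans can no longer inject junk records into the logging system. The port number and subnet used here are placeholders, not the real lcgcviewer01 configuration.

      #!/usr/bin/env python
      """Illustrative iptables hardening for a logging host (run as root).
      LOG_PORT and ALLOWED_SRC are placeholders for the real lcgcviewer01 values."""
      import subprocess

      LOG_PORT = "5544"            # placeholder: the log-ingest port
      ALLOWED_SRC = "10.0.0.0/24"  # placeholder: subnet of the CASTOR servers

      RULES = [
          # Accept log traffic from the known server subnet only...
          ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", LOG_PORT,
           "-s", ALLOWED_SRC, "-j", "ACCEPT"],
          # ...and drop anything else aimed at that port (e.g. security scans).
          ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", LOG_PORT,
           "-j", "DROP"],
      ]

      for rule in RULES:
          print("+ " + " ".join(rule))
          subprocess.check_call(rule)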

Interventions

Staffing

  • Castor on Call person
    • Matthew
  • Staff absence/out of the office:
    • (Mon-Fri) Rob on A/L
    • (Friday) Bruno possibly on A/L