RAL Tier1 weekly operations castor 21/03/2016


Latest revision as of 16:23, 18 March 2016

Operations News

  • NSS patching on the Tier1 was successful
  • 4 ATLAS disk servers in passive draining
  • CERN suggest deploying 2.1.16 to tape servers (Steve Murray)

Operations Problems

  • Draining is not working for ATLAS (it does, however, seem to work for LHCb)
  • The transfermanager on the ATLAS DLF node was not performing TM tasks but was still reporting as up
  • 2.1.15 - would like to reconfigure Oracle to use physical memory only (no swap); testing in preprod now. Also advanced queue configuration testing on preprod against the raltags DB
  • CV11 disk server RAID patching - Monday, after change control
  • 2.1.15 - problems with the configuration required for production to solve slow file open times. Andrey reports that CERN use 100 GB of memory for their CASTOR DB servers to run 2.1.15 (vs our 32 GB), and Oracle are not providing adequate support at the moment. 2.1.15 deployment will not be scheduled for now
  • Could not drain gdss702 (CASTOR 2.1.15) in preprod (all files failed according to draindiskserver -q) - does draining work in 2.1.15?
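The gdss702 drain failure above was diagnosed from `draindiskserver -q` output. A small helper that tallies per-file states could flag the "all files failed" case automatically. This is a minimal sketch only - the one-file-per-line, status-in-last-column report format is an assumption for illustration, not the real CASTOR output format.

```python
# Sketch: tally file states from a draindiskserver -q style report to spot
# drains where every file fails (as seen on gdss702). The assumed format
# (path then status per line) is hypothetical - adapt to the real output.
from collections import Counter

def summarise_drain(report: str) -> Counter:
    """Count the status column of each non-empty report line."""
    states = Counter()
    for line in report.splitlines():
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            states[fields[-1].upper()] += 1
    return states

# Example with made-up report lines:
sample = """\
/castor/ads.rl.ac.uk/prod/file1 FAILED
/castor/ads.rl.ac.uk/prod/file2 FAILED
/castor/ads.rl.ac.uk/prod/file3 RUNNING
"""
counts = summarise_drain(sample)
if counts.get("FAILED", 0) == sum(counts.values()):
    print("all files failing - drain is broken")
```

A cron wrapper could run this against the live command's output and page only when the failed count equals the total.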


Planned, Scheduled and Cancelled Interventions

  • Tuesday 22/03/16 - NSS patching for castor fac etc., including the Juno DB
  • 2.1.15 update to the nameserver will not go ahead. This is due to slow file open time issues on the stager; testing/debugging of the stager issue is ongoing. If these issues are resolved, the proposal is a nameserver upgrade on 16th March with the stager the following week, 22nd March. (RA)


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team, via AL.
  • JJ - GLUE 2 for CASTOR, used for publishing information. RA writing the data-gathering end in Python, JJ writing the GLUE 2 end in LISP. No schedule as yet.
  • Facilities drive re-allocation. ACTION: RA
  • SRM 2.1.14 testing with SdW on VCERT
  • SRM DB dups script - needs automating
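The SRM duplicate clean-up script above is mentioned but not shown on the page. The following is a hedged sketch of what the core step might look like: the real database is Oracle behind the SRMs, so sqlite3 stands in here only to make the example self-contained, and the srm_requests table, file_path column, and "keep the newest row" policy are all assumptions for illustration.

```python
# Hedged sketch of an SRM DB duplicate clean-up (hypothetical schema).
# sqlite3 stands in for the production Oracle database.
import sqlite3

def delete_duplicate_requests(conn: sqlite3.Connection) -> int:
    """Keep the newest row per file_path and delete older duplicates."""
    cur = conn.execute(
        """DELETE FROM srm_requests
           WHERE id NOT IN (
               SELECT MAX(id) FROM srm_requests GROUP BY file_path
           )"""
    )
    conn.commit()
    return cur.rowcount  # number of duplicate rows removed

# Demonstration against an in-memory stand-in database:
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE srm_requests (id INTEGER PRIMARY KEY, file_path TEXT)"
)
conn.executemany(
    "INSERT INTO srm_requests (file_path) VALUES (?)",
    [("/castor/a",), ("/castor/a",), ("/castor/b",)],
)
removed = delete_duplicate_requests(conn)
```

In production the same DELETE would run from cron against the Oracle schema (via an Oracle client library), ideally logging the removed-row count on each run so callout trends can be tracked.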


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing


Interventions

  • NSS patching for castor fac etc

Staffing

  • Castor on Call person next week
    • RA
  • Staff absence/out of the office:
    • BD - out Tuesday, Wed Thurs (at a conf)
    • GS only in on Tuesday


New Actions

  • None this week!

Existing Actions

  • BD - mice ticket: asking for a separate tape pool for d0t1 Monte Carlo
  • GS - is there any documentation re handling broken CIPs? (raised following the CIP failure at the weekend)
  • GS - callout for the CIP only in waking hours?
  • RA - CV11 firmware updates
  • BD to clarify if separating the DiRAC data is a necessity
  • BD to ensure the ATLAS consistency check is quattorised
  • RA to deploy a '14 generation disk server into preprod
  • BD re. WAN tuning proposal - discuss with GS; does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks - ask Tim
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • GS to investigate how/if we need to declare xrootd endpoints in the GOCDB BDII - in progress
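The cleanlostfiles action above asks for its use to be recorded in syslog. A minimal Python sketch of that logging step follows, assuming a Linux host with a local syslog daemon; the program tag and message format are illustrative only, since the actual cleanlostfiles script is not shown here.

```python
# Sketch for the cleanlostfiles syslog action: send an auditable record of
# each removal to the local syslog daemon. Tag and message wording are
# illustrative, not the real tool's format.
import getpass
import syslog

def log_lost_file_removal(path: str) -> str:
    """Log who removed which lost file; return the message for auditing."""
    message = f"cleanlostfiles: {getpass.getuser()} removed {path}"
    syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO, message)
    syslog.closelog()
    return message

entry = log_lost_file_removal("/castor/ads.rl.ac.uk/prod/lostfile")
```

With the daemon facility used here, the entries would normally land in /var/log/messages (or wherever the host's syslog config routes daemon.info), giving the usage trail the action asks for.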