Difference between revisions of "RAL Tier1 weekly operations castor 01/12/2014"
From GridPP Wiki
Latest revision as of 15:03, 2 December 2014
Operations News
- SL6 Headnode work - stress testing is ongoing without incident (using realistic jobs from Alastair / Andrew where possible)
- ATLAS have filled their available space in atlasStripInput, so we have paused draining pending deletions.
- Draining - latest estimate is to complete draining in 7 weeks, assuming no breaks.
- WebDAV access for LHCb is now working
Operations Problems
- gdss720 / gdss763 are both drained, out of production and currently being worked on by Fabric team
- gdss659 is still out of production and will be decommissioned out of CASTOR.
- lcgclsf02 had another sshd failure on Wednesday. We believe this is due to the xroot service on this node spawning an excessive number of subprocesses, which ran into a hard limit in the OS. This will be resolved by shifting this node's xroot responsibilities over to a dedicated node.
- Last week's SRM SAM test failures on LHCb/CMS and the callout on pluto have been explained, and are due to the details of our DB setup. The issue will be fixed when we upgrade to 2.1.14-14.
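The lcgclsf02 failure above can be diagnosed by comparing a service account's process count against the OS limits it can hit. A minimal sketch, assuming a Linux node; the service account name is illustrative, not taken from the minutes:

```shell
# Count processes owned by the suspect service account (name is an assumption;
# substitute the real xroot service user). If ps finds no such user, wc prints 0.
ps -u "${SERVICE_USER:-xrootd}" --no-headers 2>/dev/null | wc -l

# Per-user hard limit on processes ("max user processes") for the current shell.
ulimit -Hu

# System-wide ceilings that a fork-heavy service can also exhaust,
# starving sshd of the ability to spawn sessions:
cat /proc/sys/kernel/pid_max
cat /proc/sys/kernel/threads-max
```

When the first number approaches any of the limits, new fork() calls fail across the box, which matches sshd refusing connections while the node otherwise stays up.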
Blocking Issues
- GridFTP bug in SL6 - stops any Globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.
Planned, Scheduled and Cancelled Interventions
- A Tier 1 Database cleanup is planned so as to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
- Upgrade of CASTOR to SL6 (December)
- Upgrade Oracle DB to version 11.2.0.4 (Late February?)
- Upgrade CASTOR to version 2.1.14-14 OR 2.1.14-15 (Early February)
Advanced Planning
Tasks
- Plan and publish SL6 deployment plans
- Plans to ensure PreProd represents production in terms of hardware generation are underway
- Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
- Switch admin machines from lcgccvm02 to lcgcadm05
- Correct partitioning alignment issue (3rd CASTOR partition) on new CASTOR disk servers
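The alignment task above amounts to checking that each partition starts on a 1 MiB boundary. A minimal sketch of the arithmetic involved; the device name and partition number are illustrative, not taken from the minutes:

```shell
# A partition is 1 MiB-aligned when its start sector (512-byte sectors)
# is divisible by 2048, since 2048 * 512 = 1048576 bytes = 1 MiB.
is_aligned() {
    start_sector=$1
    [ $(( start_sector % 2048 )) -eq 0 ]
}

# parted can report starts in sectors and check alignment directly, e.g.:
#   parted -s /dev/sda unit s print
#   parted -s /dev/sda align-check optimal 3   # the 3rd partition, as above
is_aligned 2048 && echo "aligned"      # typical well-aligned first partition
is_aligned 63   || echo "misaligned"   # legacy DOS-style start, misaligned
```

Misaligned partitions on 4K-sector or RAID-backed disks cause read-modify-write overhead, which is why fixing this before the servers enter production matters.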
Interventions
- Upgrade Production headnodes to SL6 - first upgrade Tuesday 2nd December. Next upgrades (ATLAS and CMS) planned for 9th and 10th December.
Actions
Staffing
- Castor on Call person
- Rob
- Staff absence/out of the office:
- Rob - Out at least 1 day
- Brian - Out Tuesday-Friday.