Difference between revisions of "RAL Tier1 weekly operations castor 08/12/2014"

Latest revision as of 10:53, 4 February 2015

List of CASTOR meetings

Operations News

ATLAS have filled their available space in atlasStripInput, so we have paused draining pending deletions.
Draining - latest estimate is to complete draining in 7 weeks, assuming no breaks.
SL6 has been rolled out to the LHCb headnodes; no serious issues were encountered.
SL6 rollout to the CMS headnodes is scheduled for Tuesday and to the ATLAS headnodes for Wednesday.

Operations Problems

gdss720 / gdss763 are both drained, out of production and currently being worked on by the Fabric team
gdss659 is still but will be decommissioned out of CASTOR.
The CMS xroot redirector has been moved from lcgclsf02 to another temporary node (it was causing issues on lcgclsf02).
CMS has been suffering from CASTOR issues that are thought to stem from a very full cmsDisk (was 3% free).

Blocking Issues

GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.

Planned, Scheduled and Cancelled Interventions

A Tier 1 Database cleanup is planned so as to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
Upgrade of CASTOR headnodes to SL6: CMS on Tuesday 9/12/14, ATLAS on Wednesday 10/12/14, and Gen in the week commencing 5th Jan
Upgrade Oracle DB to version 11.2.0.4 (Late February?)
Upgrade CASTOR to version 2.1.14-14 OR 2.1.14-15 (Early February)

Advanced Planning

Tasks

Plan and publish SL6 deployment plans
Plans to ensure PreProd represents production in terms of hardware generation are underway
Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
Switch admin machines from lcgccvm02 to lcgcadm05
Correct partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers

Interventions

Upgrade Production headnodes to SL6 - Next upgrades (ATLAS and CMS) planned for 9th and 10th December.

Actions

Staffing

Castor on Call person
- Chris

Staff absence/out of the office:
- Rob - Out Thursday/Friday, then off for Christmas (but will be on call at some point)
- Brian - Out all week

@@ Line 1: / Line 1: @@
 [https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor List of CASTOR meetings]
 == Operations News ==
-* SL6 Headnode work - stress testing ongoing without incident. (using realistic jobs from Alastair / Andrew if possible)
 * ATLAS have filled their available space in atlasStripInput, so we have paused draining pending deletions.
 * Draining - latest estimate is to complete draining in 7 weeks, assuming no breaks.
-* Webdav access for LHCb now working
+* SL6 has been rolled out to LHCB headnodes, no serious issues encountered
+* SL6 rollout to CMS headnodes scheduled for Tuesday and Atlas for Wednesday
 == Operations Problems ==
 * gdss720 / gdss763 are both drained, out of production and currently being worked on by Fabric team
 * gdss659 is still but will be decommissioned out of CASTOR.
-* lcgclsf02 had another sshd failure on Wednesday. We believe this is due to the xroot service on this node spawning an excessive number of subprocesses, which ran into a hard limit in the OS. This will be resolved by shifting this node's xroot responsibilities over to a dedicated node.
+* CMS xroot redirector has been moved from lcgclsf02 to another temporary node (was causing issues on LSF02).
-* Last week's SRM SAM test failures on LHCB/CMS and callout on pluto have been explained, and are due to the details of our DB setup. The issue will be fixed when we upgrade to 2.1.14-14.
+* CMS has been suffering from castor issues that are thought to stem from very full cmsDisk (was 3% free)
 == Blocking Issues ==
 * grid ftp bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk server.
 == Planned, Scheduled and Cancelled Interventions ==
 * A Tier 1 Database cleanup is planned so as to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
-* Upgrade of CASTOR to SL6 (December)
+* Upgrade of CASTOR headnodes to SL6 CMS Tuesday 9/12/14 and Atlas Wednesday 10/12/14 and Gen W/C 5th Jan
 * Upgrade Oracle DB to version 11.2.0.4 (Late February?)
 * Upgrade CASTOR to version 2.1.14-14 OR 2.1.14-15 (Early February)
@@ Line 36: / Line 34: @@
 '''Interventions'''
-* Upgrade Production headnodes to SL6 - first upgrade Tuesday 2nd December. Next upgrades (ATLAS and CMS) planned for 9th and 10th December.
+* Upgrade Production headnodes to SL6 - Next upgrades (ATLAS and CMS) planned for 9th and 10th December.
 == Actions ==
@@ Line 43: / Line 41: @@
 == Staffing ==
 * Castor on Call person
-** Rob
+** Chris
 * Staff absence/out of the office:
+** Rob - Out Thurs/Friday then off for Christmas (however doing oncall at some point)
-** Rob - Out at least 1 day
+** Brian - Out all week
-** Brian - Out Tuesday-Friday.
