Difference between revisions of "RAL Tier1 weekly operations castor 12/01/2015"

(In the comparison below, lines removed in this revision are marked "−", lines added are marked "+", and unmarked lines are unchanged.)

== Operations News ==
− * ATLAS atlasStripInput is still very full.
− * Draining - latest estimate is to complete draining in 7 weeks, once it resumes.
− * Kernel and errata upgrade on Castor SL6 headnodes (including reboot) - Tues 23rd 10:00 - 12:00
+ * cmsDisk GDSS757 back in production (failed not responding - no issues found)
+ * Draining - ongoing
+ * GEN SL6 headnode upgrade – success
+ * SL6 name server upgrade postponed due to CASTOR team resource constraints - needs to be rescheduled

== Operations Problems ==
− * Switch and DNS problems last Saturday caused a significant outage; however, we are not aware of any fallout from this.
− * CMS are experiencing poor performance with the xroot stress test - possibly caused by existing heavy load, or possibly by the xroot redirector? Andrew will retry.
− * cedaRetrieve issue
− * RAL Ops SAM tests (lcgccvm02) have been fixed (they were initially giving false positives) - they had failed as a side effect of the SL6 headnode upgrade.
− * GDSS778 (lhcbDst) out for memory failure – now back in.
− * LHCb SRM user file table duplicates – only a few, and now removed.
− * gdss720 / gdss763 are both drained, out of production and currently being worked on by the Fabric team.
− * gdss659 is still out of production, but will be decommissioned from CASTOR.
+ * LHCb ticket investigation - FTS problem, a double prepare-to-put resulting in zero-length files - has hopefully been fixed by a past upgrade. We need to follow up by investigating the affected time periods and providing lists of files to the VOs (190k files in total across all LHC VOs).
+ * Heavy LHCb usage causing significant CASTOR load - SRM test failures. The number of jobs was reduced for the weekend.
+ * Ganglia monitoring - ATLAS is missing some monitoring data. CASTOR team to raise a ticket with support.
+ * Checksum mismatch - Facilities tape issues; Tim is hoping to retrieve the problem file from an offsite tape.

== Blocking Issues ==
(no changes in this section)

== Planned, Scheduled and Cancelled Interventions ==
− * Kernel and errata upgrade on Castor SL6 headnodes (including reboot) - Tues 23rd 10:00 - 12:00
+ * Kernel upgrade on CASTOR SL5 disk/SRM/tape servers. Currently in planning, but likely to be the last week in January.
* A Tier 1 database cleanup is planned to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
* Upgrade of CASTOR headnodes to SL6 - GEN, W/C 5th Jan

== Staffing ==
* Castor on Call person
− ** Matt to 24/12
− ** Rob 25/12-29/12
− ** Matt from 30/12 – 4/1
+ ** Chris

* Staff absence/out of the office:
− ** Only CASTOR person in the office Mon-Wed is Matt
+ ** Shaun is out on Tuesday

Latest revision as of 13:40, 9 January 2015

List of CASTOR meetings

Operations News

  • cmsDisk GDSS757 back in production (it had stopped responding - no issues found)
  • Draining - ongoing
  • GEN SL6 headnode upgrade – a success
  • SL6 name server upgrade postponed due to CASTOR team resource constraints - needs to be rescheduled

Operations Problems

  • LHCb ticket investigation - FTS problem: a double prepare-to-put resulting in zero-length files has hopefully been fixed by a past upgrade. We need to follow up by investigating the affected time periods and providing lists of files to the VOs (190k files in total across all LHC VOs); a rough sketch of such a scan follows this list.
  • Heavy LHCb usage causing significant CASTOR load - SRM test failures. The number of jobs was reduced for the weekend.
  • Ganglia monitoring - ATLAS is missing some monitoring data. CASTOR team to raise a ticket with support.
  • Checksum mismatch - Facilities tape issues; Tim is hoping to retrieve the problem file from an offsite tape (see the checksum sketch below).
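
As a rough illustration of the follow-up mentioned in the first item above (not the actual procedure), the sketch below filters a catalogue listing for zero-length files created inside a given time window. The listing format, the field order, the input file name and the window dates are all assumptions made for the example, not details taken from the ticket.

 # Sketch: list zero-length files created in an assumed time window.
 # Expects a plain-text dump with one file per line:
 #   <size-in-bytes> <creation-time, e.g. 2014-12-15T09:30:00> <path>
 from datetime import datetime
 import sys

 WINDOW_START = datetime(2014, 11, 1)   # assumed start of the affected period
 WINDOW_END = datetime(2015, 1, 12)     # assumed end of the affected period

 def zero_length_in_window(dump_path):
     """Yield paths of zero-length files whose creation time falls in the window."""
     with open(dump_path) as dump:
         for line in dump:
             fields = line.split(None, 2)
             if len(fields) != 3:
                 continue  # skip malformed lines
             size, created, path = fields
             if int(size) != 0:
                 continue
             when = datetime.strptime(created, "%Y-%m-%dT%H:%M:%S")
             if WINDOW_START <= when <= WINDOW_END:
                 yield path.strip()

 if __name__ == "__main__":
     # Usage (hypothetical dump file): python zero_length_scan.py ns_dump.txt
     for path in zero_length_in_window(sys.argv[1]):
         print(path)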

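For the checksum-mismatch item above: CASTOR stores adler32 checksums for files, so a mismatch is normally confirmed by recomputing the checksum of a retrieved copy and comparing it with the catalogued value. The snippet below is a generic adler32 recomputation; the file name and the expected value are placeholders, not values from the Facilities incident.

 # Sketch: recompute an adler32 checksum and compare it with a catalogued value.
 import zlib

 def adler32_of_file(path, chunk_size=1024 * 1024):
     """Compute the adler32 checksum of a file, streaming it in chunks."""
     value = 1  # adler32 starting value
     with open(path, "rb") as f:
         while True:
             chunk = f.read(chunk_size)
             if not chunk:
                 break
             value = zlib.adler32(chunk, value)
     return value & 0xFFFFFFFF  # force an unsigned 32-bit result

 if __name__ == "__main__":
     expected = 0x12345678                           # placeholder catalogue value
     actual = adler32_of_file("retrieved_copy.dat")  # placeholder file name
     print("catalogue %08x, recomputed %08x -> %s"
           % (expected, actual, "match" if actual == expected else "MISMATCH"))
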
Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if the client is using a particular library. This is a show-stopper for SL6 on the disk servers.

Planned, Scheduled and Cancelled Interventions

  • Kernel upgrade on CASTOR SL5 disk/SRM/tape servers. Currently in planning, but likely to be the last week in January.
  • A Tier 1 database cleanup is planned to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
  • Upgrade of CASTOR headnodes to SL6 - GEN, W/C 5th Jan
  • Upgrade the Oracle DB to version 11.2.0.4 (late February?)
  • Upgrade CASTOR to version 2.1.14-14 or 2.1.14-15 (early February)


Advanced Planning

Tasks

  • The DB team need to plan some work which will result in the DBs being under load for approx. 1 hour - not terribly urgent, but it needs to be done in the new year.
  • Provide a new VM(?) to provide CASTOR client functionality for querying the backup DBs
  • Plans to ensure that PreProd represents production in terms of hardware generation are underway
  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Correct the partitioning alignment issue (3rd CASTOR partition) on the new CASTOR disk servers (a rough alignment check is sketched after this list)
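
The alignment item above refers to a partition whose starting byte offset does not fall on the expected boundary. As a rough, generic illustration only (assuming 512-byte logical sectors and a 1 MiB alignment target; the start sectors below are made up, not read from the new disk servers), the Python sketch checks whether each partition start is aligned.

 # Sketch: check partition start alignment (illustrative values only).
 SECTOR_SIZE = 512            # bytes per logical sector (assumed)
 ALIGNMENT = 1024 * 1024      # 1 MiB alignment target (assumed)

 def is_aligned(start_sector, sector_size=SECTOR_SIZE, alignment=ALIGNMENT):
     """Return True if the partition's byte offset falls on the alignment boundary."""
     return (start_sector * sector_size) % alignment == 0

 if __name__ == "__main__":
     # Hypothetical start sectors for three partitions on one disk server;
     # in practice these would be read from the partition table (e.g. with parted or fdisk).
     partitions = {"partition 1": 2048, "partition 2": 41945088, "partition 3": 83888130}
     for name, start in sorted(partitions.items()):
         status = "aligned" if is_aligned(start) else "MISALIGNED"
         print("%s: start sector %d -> %s" % (name, start, status))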

Interventions


Actions

Staffing

  • Castor on Call person
    • Chris


  • Staff absence/out of the office:
    • Shaun is out on Tuesday