RAL Tier1 weekly operations castor 21/03/2016

== Operations News ==

* NSS patching on Tier1 was successful
* 4 atlas disk servers in passive draining
* CERN suggested 2.1.16 be deployed to tape servers (Steve Murray)
* 4 servers put into read-only in atlasStripInput, as part of the plan to decommission 10% of servers once out of warranty (AD & BD); drain next week
* Atlas finished deletion of secondary tape copies in preparation for the C > D migration
* glibc patching for Facilities completed
* 11.2.0.4 client updates are done (LHCb has been restarted already)
* glibc updates applied and all CASTOR systems rebooted. Initial issues with the head nodes: 7 failed to reboot due to their build history. ACTION: RA to revisit the quattor build so that this does not recur.
* 11.2.0.4 DB client update had to be rescheduled and should go ahead Monday 29th; it has been running in pre-prod for a considerable amount of time and should be transparent. ACTION: RA
  
 
== Operations Problems ==

* Draining is not working for atlas (it does, however, seem to work for LHCb)
* transfermanager on the atlas DLF node was not performing TM tasks but was reporting as being up
* 2.1.15 - would like to reconfigure Oracle to use physical memory only (no swap); testing in preprod now. Also advanced queue configuration testing on preprod against the raltags DB.
* gdss620 - still out
* CV11 disk server RAID patching - Monday, after change control
* 2.1.15 - problems with the configuration required for production to solve slow file open times. Andrey reports that CERN use 100GB of memory for the DB servers running CASTOR 2.1.15 (vs our 32GB), and Oracle are not providing adequate support at the moment. 2.1.15 deployment will not be scheduled for now.
* Could not drain gdss702 (CASTOR 2.1.15) in preprod (all files failed according to draindiskserver -q) - does draining work in 2.1.15? (A drain-progress monitoring sketch follows this list.)
* gdss646 (lhcbUser, D1T0) - in production read-only following the loss of a disk
* gdss619 (genTape, D0T1) - back in production now; was out this week
* gdss698 (lhcbDst, D1T0) - has just been pulled out
* gdss677 (cmsTape, D0T1) - currently out and rebuilding its RAID array
* Failure of the main CIP server over the weekend; we have failed over to the test CIP machine and the service is back in operation. The HW failure is to be fixed, then we will fail back over to the production system. ACTION: RA, CC and Fabric
* LHCb job failures - GGUS ticket still open
* Ongoing CMS AAA issues over the last few days - ask Shaun to look at them
* glibc patching - issues with servers coming back are understood: the kernel was not updated by spma before the reboot. Also network interfaces / an SSH key were changed on a storageD machine.
* OPN links - BD investigating what data flow is filling the OPN and SuperJANET at the same time
* CV '11 gen RAID card controller firmware update required - Western Digital drives failing at a high rate. ACTION: RA to follow up with the Fabric team
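The open question above about whether draining works in 2.1.15 suggests watching drain progress rather than just liveness. Below is a rough monitoring sketch, not a tested tool: it assumes, hypothetically, that draindiskserver -q prints one line per file containing a status word such as RUNNING or FAILED; the real output format should be checked and the parsing adjusted before any use.

<source lang="python">
#!/usr/bin/env python
"""Sketch: summarise drain progress for one disk server.

Assumption (not from the minutes): `draindiskserver -q <host>` prints one
line per file with a status word (RUNNING/FAILED/FINISHED/PENDING) on it.
Adjust STATUS_WORDS and the parsing to the real output before relying on it.
"""
import subprocess
import sys
from collections import Counter

STATUS_WORDS = ("RUNNING", "FAILED", "FINISHED", "PENDING")  # assumed labels


def drain_summary(diskserver):
    """Run the drain query and count reported files per status word."""
    out = subprocess.check_output(["draindiskserver", "-q", diskserver])
    counts = Counter()
    for line in out.decode("utf-8", "replace").splitlines():
        words = line.split()
        for status in STATUS_WORDS:
            if status in words:
                counts[status] += 1
                break
    return counts


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "gdss702"
    summary = drain_summary(host)
    total = sum(summary.values())
    print("%s: %d files reported" % (host, total))
    for status, n in summary.most_common():
        print("  %-10s %d" % (status, n))
    # Exit non-zero if every reported file has failed, so cron can flag it.
    if total and summary.get("FAILED", 0) == total:
        sys.exit(2)
</source>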
  
 
== Planned, Scheduled and Cancelled Interventions ==

* Tuesday 22/03/16 - NSS patching for CASTOR Facilities etc., including the Juno DB
* 2.1.15 update to the nameserver will not go ahead. This is due to the slow file open times issue on the stager; testing/debugging of the stager issue is ongoing. If these issues are resolved, the proposal is a nameserver upgrade on 16th March with the stager the following week, 22nd March. (RA)
* 11.2.0.4 client updates (running in preprod) - possible change control for production (see above)
* WAN tuning proposal - possibly put into change control (BD)
* CASTOR Facilities patching scheduled for next week - detailed schedule to be agreed with the Fabric team
 
  
 
== Long-term projects ==

* RA has produced a python script to handle the SRM DB duplication issue which is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team, via AL. (An illustrative sketch follows this list.)
* JJ - GLUE 2 for CASTOR, used for publishing information. RA writing the data-gathering end in python, JJ writing the GLUE 2 end in LISP. No schedule as yet.
* Facilities drive re-allocation. ACTION: RA
* SRM 2.1.14 testing with SdW on VCERT
* SRM DB dups script - needs automating
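The duplicate-handling script referred to above is RA's; purely as an illustration of the cron-job shape, here is a minimal sketch. It assumes the cx_Oracle client is available, and the connection string and the table/column names (srm_file_requests, surl) are hypothetical placeholders rather than the real SRM schema, so they would need to be replaced before it could run against anything.

<source lang="python">
#!/usr/bin/env python
"""Illustrative sketch only: report duplicate SURL rows in an SRM table.

Not RA's script.  The DSN and the table/column names are placeholders;
substitute the real SRM schema.  Meant for cron: exits non-zero when
duplicates are found so a wrapper can raise a notification instead of
a callout.
"""
import sys

import cx_Oracle  # Oracle client libraries must be installed

DSN = "srm-db.example:1521/SRMDB"  # placeholder connection details
QUERY = """
    SELECT surl, COUNT(*)
      FROM srm_file_requests
     GROUP BY surl
    HAVING COUNT(*) > 1
"""


def find_duplicates(user, password):
    """Return (surl, count) rows that appear more than once."""
    conn = cx_Oracle.connect(user, password, DSN)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    duplicates = find_duplicates("srm_reader", "CHANGE_ME")
    for surl, count in duplicates:
        print("duplicate: %s (%d rows)" % (surl, count))
    sys.exit(1 if duplicates else 0)
</source>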
  
 
== Closed Actions ==

* BD to check that SNO+ FTS transfers are migrating to tape if required
* GS to confirm that no other Tier 1s are down w/c 21 March
* BD to coordinate with atlas re bulk deletion before TF starts the repack
 

== Advanced Planning ==

=== Tasks ===

* CASTOR 2.1.15 implementation and testing

=== Interventions ===

* NSS patching for CASTOR Facilities etc.

== Staffing ==

* Castor on-call person next week
** RA
* Staff absence/out of the office:
** BD - out Tuesday, Wednesday and Thursday (at a conference)
** GS - only in on Tuesday

== New Actions ==

* None this week!

== Existing Actions ==

* BD - mice ticket: asking for a separate tape pool for D0T1 for Monte Carlo
* GS - is there any documentation re handling broken CIPs? (raised following the CIP failure at the weekend)
* GS - callout for the CIP only in waking hours?
* RA - CV11 firmware updates
* RA to revisit the quattor build for the head nodes which did not reboot, so that this does not recur
* BD to clarify if separating the DiRAC data is a necessity
* BD to ensure the atlas consistency check is quattorised
* RA to deploy a '14 generation disk server into preprod
* BD re. the WAN tuning proposal - discuss with GS; does it need a change control?
* RA to try stopping tapeserverd mid-migration to see if it breaks - ask Tim
* RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing (a logging sketch follows this list)
* GS to investigate how/if we need to declare xrootd endpoints in GOCDB / BDII - in progress
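For the cleanlostfiles action, the goal is simply to record every invocation centrally. A minimal Python sketch of syslog-backed logging is below; the tag, facility and message format are assumptions rather than an agreed format, and if the real cleanlostfiles is a shell script the equivalent would be a call to logger -t cleanlostfiles.

<source lang="python">
#!/usr/bin/env python
"""Sketch: record each cleanlostfiles invocation to syslog.

Only the logging pattern is the point; the ident, facility and message
format below are assumptions, not the agreed format.
"""
import getpass
import socket
import sys
import syslog


def log_invocation(argv):
    """Log who ran the tool, on which host, and with what arguments."""
    syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_LOCAL0)
    syslog.syslog(
        syslog.LOG_INFO,
        "invoked by %s on %s args=%r"
        % (getpass.getuser(), socket.gethostname(), argv[1:]),
    )
    syslog.closelog()


if __name__ == "__main__":
    log_invocation(sys.argv)
    # ... the actual lost-file cleanup logic would follow here ...
</source>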