RAL Tier1 weekly operations castor 01/02/2016

Latest revision as of 17:45, 29 January 2016

== Operations News ==

* Merging tape pools wiki created by Shaun
* 2.1.15 name server tested

* New SRM on vcert2
* CMS tape no longer an issue, following disk server failure and test files in castor cache
 
 
* New SRM (SL6) with bug fixes available - needs test

* CMS log spam - failing disk-to-disk copies; Juan spotted the incorrect config and corrected it

* noop scheduler now everywhere - a quick verification sketch follows below
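A quick way to confirm the scheduler change stuck on a given disk server, as a minimal sketch assuming the standard Linux sysfs layout (the active elevator is the bracketed entry in /sys/block/&lt;dev&gt;/queue/scheduler):

<pre>
#!/usr/bin/env python
"""Verify that the noop I/O scheduler is active on every disk.

A minimal sketch assuming the standard sysfs layout, where the active
elevator is shown in brackets, e.g. "[noop] anticipatory deadline cfq".
"""
import glob
import sys

bad = []
for path in glob.glob('/sys/block/sd*/queue/scheduler'):
    scheduler_line = open(path).read().strip()
    if '[noop]' not in scheduler_line:
        bad.append('%s: %s' % (path, scheduler_line))

if bad:
    print('\n'.join(bad))
    sys.exit(1)
print('noop active on all checked devices')
</pre>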
 
 
* db dups removal script available

* Brian and Tim updating the new tape pool creation docs

* Gfal-cat command failing for atlas reading of nsdumps from castor: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. Developers looking to fix within: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842 (a reproduction sketch follows)
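For reference, the failure mode can be reproduced outside the gfal-cat wrapper using the gfal2 Python bindings. A minimal sketch; the SURL is a placeholder, not a real nsdump path:

<pre>
#!/usr/bin/env python
"""Minimal reproduction of the failing gfal-cat style read (GGUS 117846).

Uses the gfal2 Python bindings; the SURL below is a placeholder, not
a real nsdump location.
"""
import gfal2

URL = 'srm://srm-atlas.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/...'  # placeholder

ctx = gfal2.creat_context()
try:
    f = ctx.open(URL, 'r')
    data = f.read(1048576)  # first 1 MB, roughly what gfal-cat streams
    print('read %d bytes OK' % len(data))
except gfal2.GError as exc:
    print('gfal read failed: %s' % exc)
</pre>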
 
* 11.2.04 Oracle client updates, on preprod so far so good
 
* Rob testing 2.1.15 on vcert ...
 
* Atlas data read rate hit 20GB/s on 4/12/2015 at 09:30, also seen in the colourful network map
 
* Plan to replace disk array and database nodes underway - also a possible improvement in configuration: frequent writes of an Oracle system log are slow and can be improved by writing to a dedicated area with a different RAID config
 
 
* LHCb batch jobs failing to copy results into castor - changes made seem to have improved the situation but not fixed it (Raja). Increasing the number of connections to the NS db (more threads)

* 2.1.15-20 now on vcert

* SRMdb dups script created
 
* Errata applied on production and all others
 
* 11.2.0.4 clients rolled out on preprod
 
* Attempted fix for LHCb's intermittent stage request failures - increased the 'Stage Request' thread count from 4 to 12
 
 
* BD looking at porting persistent tests to Ceph
== Operations Problems ==

* atlasScratchDisk ... gdss667 triple disk failure, most data lost. 2 x callouts, down on disk servers and hit limit on xroot connections - SAM test failing.
 
* gdss620 - gen tape, still with the fabric team

* CV '11 gen RAID card controller firmware update required - Western Digital drives failing at a high rate
 
 
* atlas trying to recall a file for 2 months! no-one noticed, Brian looking at root cause (file not lost)
 
* gdss676 - 3 lost files // 1% off gdss667 (atlas)
 
* CMS many recalls & migrations - problems as test files on tape and stuck in queue (following issue with tape cache). CMS test now failing, CASTOR actually functioning OK. 
 
 
* fac tape drive broken - Tim has taken it out of the tape pool

* Job recovery files leading to mismatch between castor and rucio. ACTION: Brian to confirm whether LHCb have a similar issue
 
* disk servers - gdss620 failed with 1 canbemigr file; copied from disk, deleted from castor and copied back in.

* castor 'internal communication' issue on Monday - the issue stopped later in the day

* SAM tests failed on Wed night - ACL issues, not castor

* CMS tape disk servers showing high load - fixed by changing transfer manager weightings
 
* gfal-cat does not work on RAL/CERN castor

* atlas data disk filled during the Christmas holiday
 
* gdss687 not back in full production ...

* double disk failure on one server??

* LHCb tape issue - many recall requests from LHCb, not going quickly and causing backups on tape drives / migration queues, compounded by network issues. Also one disk server in an odd state, 100% system CPU. Investigating whether CASTOR 2.1.15 fails to give migrations to tape priority over recalls.
 
* Network switch issue in R26 causing difficulties for standby DB - Monday, resolved Tuesday morning 
 
* Service castor restart on Gen to solve the DB blocking issue after the network issue
 
* DTEAM trying to access a castor file on Atlas - tape could not be read, dteam cannot be used on atlas / lhcb instances 
 
* 2.1.15 on tape servers - possibly some undesirable behaviour, tape being held in drive and not ejected
 
* The 3 disk servers that were out have all gone back in - CPUs replaced
 
* Disk servers name lookup issue (CV11's) - more system than CASTOR. Currently holding CV11 upgrades until understood.
 
* NS dump failed because backup times changed and are now more frequent (Brian)
 
* LHCb ticket re. low level of failures, likely to be failures in nsmkdir. Look for timeouts in the name server logs to prove where the timeout is (a log-scan sketch follows). Mitigation may be to increase the thread count on the NS. Be a little careful with this one, as vastly increasing it could exceed the connection limit to the DB (the number of NS daemons is quite high)
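A minimal log-scan sketch for that check; the log path and line format are assumptions and should be adjusted to the real nsd logs before drawing conclusions:

<pre>
#!/usr/bin/env python
"""Scan name server logs for timeout messages around nsmkdir failures.

The log path and line format are assumptions - point LOG at the real
nsd log file and tune the pattern as needed.
"""
import re
import sys

LOG = '/var/log/castor/nsd.log'  # assumed location
TIMEOUT = re.compile(r'time(d)?[ -]?out', re.IGNORECASE)

hits = 0
for line in open(LOG):
    if TIMEOUT.search(line):
        hits += 1
        sys.stdout.write(line)
print('%d timeout lines in %s' % (hits, LOG))
</pre>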
 
  
 
== Blocking Issues ==
  
 
== Planned, Scheduled and Cancelled Interventions ==

* Merge all castor tape backed pools into one - Shaun looking at putting some instructions together

* 2.1.15 on vcert

* fac disk servers to SL6 - ACTION: CP to talk to Jens / DLS

* Put fac disk servers 34/35 into diamond recall

* 11.2.04 client updates (running in preprod) - possible change control for prod
 
* WAN tuning proposal - possibly put into change control (Brian); the sketch below lists the sysctls typically involved
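For context, a sketch that dumps the kernel parameters such a proposal would usually touch; the list is the usual long-fat-pipe tuning set, not necessarily the exact one proposed:

<pre>
#!/usr/bin/env python
"""Dump the TCP buffer sysctls a WAN tuning change would touch.

The parameter list is the usual long-fat-pipe set, not necessarily
the exact one in the proposal.
"""
PARAMS = [
    'net/core/rmem_max',
    'net/core/wmem_max',
    'net/ipv4/tcp_rmem',
    'net/ipv4/tcp_wmem',
    'net/ipv4/tcp_window_scaling',
]

for param in PARAMS:
    try:
        value = open('/proc/sys/' + param).read().strip()
    except IOError:
        value = '<unreadable>'
    print('%-28s %s' % (param.replace('/', '.'), value))
</pre>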
 
* CASTOR 2.1.15

* Upgrade of Oracle clients to 11.2.0.4 - planning for production rollout, Gen first
* Propose Tues 19th Jan for SL6 upgrade on Fac d0t1 - comms with DLS and CEDA
 
 
 
== Things from the CASTOR F2F ==
 
* CASTOR can run on an arbitrary Ceph underlayer.
 
* No 2.1.16 is planned any time soon.
 
* SL7 test build ready soon
 
  
  
== Long-term projects ==

* RA has produced a python script to handle the SRM db duplication issue which is causing callouts. There is a problem running it as the version of python on the SRM servers is still at 2.4; RA will pursue this. SdW has reviewed it and is confident that it is low risk. A schematic of the dedup approach follows below.
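A schematic of the dedup approach, kept to Python 2.4 syntax given what the SRM servers run; table, column and DSN names are invented placeholders, not the real SRM schema:

<pre>
#!/usr/bin/env python
"""Schematic of the SRM DB duplicate clean-up (illustrative schema).

Kept to Python 2.4 syntax (no 'with' statements) since that is what
the SRM servers still run.  Table, column and DSN names are invented
placeholders, not the real SRM schema.
"""
import cx_Oracle

conn = cx_Oracle.connect('srm/password@srmdb')  # placeholder DSN
cur = conn.cursor()

# Keep the lowest id of each duplicated (request_id, surl) pair and
# delete the rest - the standard dedup pattern.
cur.execute("""
    DELETE FROM srm_requests r
     WHERE r.id NOT IN (SELECT MIN(id)
                          FROM srm_requests
                         GROUP BY request_id, surl)
""")
print('%d duplicate rows removed' % cur.rowcount)
conn.commit()
cur.close()
conn.close()
</pre>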
 
* JJ - Glue 2 for CASTOR, something to do with publishing information??? Not sure there was a specific action associated with this

* JS - replacing Tier1 CASTOR db hardware. ACTION: RA/JS to discuss disk requirements

* Quattorize SRM on SL5?
 
  
  
== Advanced Planning ==

=== Tasks ===

* CASTOR 2.1.15 implementation and testing

=== Interventions ===

* Remaining D0T1 disk servers

== Staffing ==

* Castor on Call person next week
** Rob

* Staff absence/out of the office:
** Brian out Tues 26th - Thurs 4th
** Shaun out 1st Feb for a week, then back in until 1st March
 
== New Actions ==

* Rob to double check 667 has had its firmware updated
* Chris to document the Corbin method (i.e. when a diskpool is having issues, restart the xroot daemons first, then try diskmanagerd, finally use the transfermanager stop - diskmanagerd restart - transfermanager start 'nuclear' method); an illustrative sketch follows
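A minimal sketch of that escalation order, using the service names from the minutes; illustrative only, not a tested tool - re-check the pool after each step before escalating:

<pre>
#!/usr/bin/env python
"""The 'Corbin method' escalation for a misbehaving diskpool.

Service names follow the minutes (xrootd, diskmanagerd,
transfermanagerd); treat this as an illustrative runbook sketch,
not a tested tool.
"""
import subprocess

STEPS = [
    # Step 1: gentlest first - restart the xroot daemons.
    [['service', 'xrootd', 'restart']],
    # Step 2: if the pool is still stuck, restart diskmanagerd.
    [['service', 'diskmanagerd', 'restart']],
    # Step 3: the 'nuclear' option - stop the transfer manager,
    # restart diskmanagerd, then bring the transfer manager back.
    [['service', 'transfermanagerd', 'stop'],
     ['service', 'diskmanagerd', 'restart'],
     ['service', 'transfermanagerd', 'start']],
]

for number, commands in enumerate(STEPS):
    for command in commands:
        subprocess.call(command)
    answer = raw_input('step %d done - pool healthy now? [y/N] ' % (number + 1))
    if answer.strip().lower().startswith('y'):
        break
</pre>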
 
* Rob to send Shaun’s merging tape pools wiki to CERN for review

* Rob and Shaun to review Alison’s wiki page, docs from castor handover and update castor procedures

* Rob to send email to Kevin O'N re any issues with Fac tape config change level 2

* Tim to apply Fac tape config change level 1 - i.e. a DLS read only tape on Monday 1st

* Gareth to share reporting available for fac tape
== Existing Actions ==

* Juan to discuss with Shaun what to do with the workaround applied to “subrequesttodo” procedure in CASTOR 2.1.15.

* Rob to create castor 14 gen build (for Ceph)

* Shaun to present plan to put tape backed pools into one on 29th Jan
 
* Rob to look at DLS / CEDA data rates in relation

* BD re. WAN tuning proposal - discuss with GS, does it need a change control?
* BD recovery files leading to mismatch between castor and rucio. Brian to confirm if LHCb have a similar issue

* RA to try stopping tapeserverd mid-migration to see if it breaks. RA to check with TF for a more detailed explanation of what he did and why.

* KH to ask Kashif if we should update other disk servers with the new BIOS ...

* RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing; a sketch of the syslog hook follows below
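A minimal sketch of what the syslog hook could look like, using the standard-library syslog module; the tag and message format are assumptions, not necessarily what RA implemented:

<pre>
#!/usr/bin/env python
"""Sketch of a syslog hook for tracking cleanlostfiles use.

Uses the standard-library syslog module; the tag and message format
are assumptions, not necessarily what was implemented.
"""
import os
import sys
import syslog

def log_invocation():
    """Record who ran cleanlostfiles and with which arguments."""
    syslog.openlog('cleanlostfiles', syslog.LOG_PID, syslog.LOG_USER)
    user = os.environ.get('SUDO_USER') or os.environ.get('USER', 'unknown')
    syslog.syslog(syslog.LOG_INFO,
                  'invoked by %s: %s' % (user, ' '.join(sys.argv[1:])))

if __name__ == '__main__':
    log_invocation()
    # ... the existing cleanlostfiles logic would run here ...
</pre>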
* RA/BD to discuss DTeam use of atlas and the issues around it

* RA/JS vcert upgrade to 2.15.20 - DONE

* RA to produce a list of requirements for tape drive dedication in future castor versions, to discuss with CERN

* SdW to revisit schema changes to improve SRM DB dup position W/C 12th Oct.

* Production team to check srm03 regarding lack of crl updates

* RA/JJ to look at information provider re DiRAC (reporting disk only etc) - being progressed 15/01/16
* GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING; a sketch of such a check follows below

* GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
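A Nagios-style sketch of such a check; the CRL directory (/etc/grid-security/certificates, *.r0 files written by fetch-crl) is conventional and the 24h threshold is an assumption:

<pre>
#!/usr/bin/env python
"""Nagios-style check for a /etc/noquattor flag and stale CRLs.

The CRL directory and the 24h staleness threshold are conventional
assumptions; adjust before deploying.
"""
import glob
import os
import sys
import time

problems = []

# /etc/noquattor disables quattor runs; report it whenever present.
if os.path.exists('/etc/noquattor'):
    age_days = (time.time() - os.path.getmtime('/etc/noquattor')) / 86400.0
    problems.append('/etc/noquattor present (%.1f days old)' % age_days)

# fetch-crl should refresh CRLs regularly; warn if the newest is stale.
crls = glob.glob('/etc/grid-security/certificates/*.r0')
if not crls:
    problems.append('no CRL files found')
elif time.time() - max(os.path.getmtime(p) for p in crls) > 24 * 3600:
    problems.append('newest CRL older than 24h')

if problems:
    print('WARNING: ' + '; '.join(problems))
    sys.exit(1)  # Nagios exit code 1 = WARNING
print('OK')
sys.exit(0)
</pre>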
 
== Completed actions ==

* CP talk to Jens / DLS re fac disk servers > SL6

* RA/BD review unowned RT tickets - DONE

* GS to arrange meeting castor/fab/production to discuss the decommissioning procedures - DONE

* BD to chase AD about using the space reporting thing we made for him - DONE

* RA to schedule no-op change to be included in quattor / persistent (CMSdisk) - DONE

* RA/TF to contact Steve at CERN re CASTOR priority sending files to tape vs recalls - DONE

* GS/RA to revisit the CASTOR decommissioning process in light of the production team updates to their decommissioning process
