RAL Tier1 weekly operations castor 15/01/2016


Latest revision as of 16:11, 18 January 2016

Operations News


  • db dups removal script available
  • Brian and Tim are updating the new tape pool creation docs
  • 11.2.0.4 Oracle client updates running on preprod - so far so good
  • Rob testing 2.1.15 on vcert ...
  • Atlas data read rate hit 20 GB/s on 4/12/2015 at 09:30, also visible on the colourful network map
  • Plan to replace the disk array and database nodes is underway - there is also a possible configuration improvement: frequent writes of an Oracle system log are slow and could be sped up by writing to a dedicated area with a different RAID config
  • LHCb batch jobs failing to copy results into castor - the changes made seem to have improved the situation but have not fixed it (Raja). Increasing the number of connections to the NS DB (more threads) - see the sizing sketch after this list
  • 2.1.15-20 now on vcert
  • SRMdb dups script created
  • Errata applied on production and all other nodes
  • 11.2.0.4 clients rolled out on preprod
  • Attempted fix for LHCb's intermittent stage request failures - increase 'Stage Request' thread count from 4 to 12.
  • BD looking at porting persistent tests to Ceph
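
A minimal sizing sketch for the thread-count changes above (the NS connections item and the 'Stage Request' change). All numbers here are hypothetical placeholders rather than measured values from our instances; the point is only that per-daemon thread counts multiply up into Oracle sessions and have to stay under the DB session limit.

    # Hypothetical sizing check: per-daemon thread counts multiply up into
    # Oracle sessions, so the totals must stay below the DB session limit.
    # All numbers below are illustrative placeholders, not real RAL values.

    ns_daemons = 8               # number of name server daemons (assumed)
    ns_threads_per_daemon = 12   # proposed NS thread count (assumed)

    stager_daemons = 4           # stager daemons across instances (assumed)
    stage_request_threads = 12   # 'Stage Request' pool raised from 4 to 12

    other_sessions = 50          # SRM, monitoring, DBA sessions etc. (assumed)
    db_session_limit = 300       # Oracle sessions limit (assumed)

    total = (ns_daemons * ns_threads_per_daemon
             + stager_daemons * stage_request_threads
             + other_sessions)

    print("estimated DB sessions: %d of %d" % (total, db_session_limit))
    if total > 0.8 * db_session_limit:
        print("WARNING: little headroom left after the thread increase")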

Operations Problems

  • CMS running many recalls & migrations - problems because test files are on tape and stuck in the queue (following an issue with the tape cache). The CMS test is now failing, although CASTOR itself is functioning OK.
  • fac tape drive broken - Tim has taken it out of the tape pool


  • Job recovery files leading to a mismatch between castor and rucio. Action: Brian to confirm whether LHCb have a similar issue (a reconciliation sketch follows this list)
  • disk servers - 620 failed with 1 file in canbemigr; it was copied from disk, deleted from castor and copied back in
  • castor 'internal communication' issue on Monday - the issue stopped later in the day
  • SAM tests failed on Wednesday night - ACL issues, not castor
  • CMS tape disk servers showing high load - fixed by changing the transfer manager weightings
  • gfal-cat does not work against RAL/CERN castor (a reproduction sketch follows this list)
  • Atlas data disk filled up during the Christmas holiday
  • 687 not yet back in full production ...
  • double disk failure on one server??
  • LHCb tape issue - many recall requests from LHCb were not going through quickly, causing backlogs on the tape drives / migration queues, compounded by network issues. Also one disk server was in an odd state at 100% system CPU. Investigating whether CASTOR 2.1.15 fails to give priority to migration to tape over recalls.
  • Network switch issue in R26 caused difficulties for the standby DB on Monday - resolved Tuesday morning
  • 'service castor restart' on Gen to resolve the BD blocking issue after the network issue
  • DTEAM trying to access a castor file on Atlas - tape could not be read, dteam cannot be used on atlas / lhcb instances
  • 2.1.15 on tape servers - possibly some undesirable behaviour: tapes being held in the drive and not ejected
  • The 3 disk servers that were out have all gone back in - CPUs replaced
  • Disk server name lookup issue (CV11s) - more a system issue than a CASTOR one. Holding the CV11 upgrades until it is understood.
  • NS dump failed because the backup times changed - backups are now more frequent (Brian)
  • LHCb ticket re. a low level of failures, likely failures in nsmkdir. Look for timeouts in the name server logs to prove where the timeout happens (see the log-scan sketch after this list). Mitigation may be to increase the thread count on the NS, but be a little careful as vastly increasing it could exceed the connection limit to the DB (the number of NS daemons is quite high).
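
For the castor/rucio mismatch item above, a rough reconciliation sketch: compare a CASTOR namespace dump with a Rucio replica dump and report entries present on only one side. The dump file names and the one-path-per-line format are assumptions for illustration, not our actual dump formats.

    # Sketch: diff a CASTOR namespace dump against a Rucio replica dump.
    # Assumes both dumps are plain text with one file path per line - the
    # real dump formats may differ, so treat this as illustrative only.

    def load_paths(filename):
        paths = set()
        for line in open(filename):
            line = line.strip()
            if line:
                paths.add(line)
        return paths

    castor = load_paths("castor_ns_dump.txt")     # placeholder file name
    rucio = load_paths("rucio_replica_dump.txt")  # placeholder file name

    print("in CASTOR but unknown to Rucio: %d" % len(castor - rucio))
    print("in Rucio but missing from CASTOR: %d" % len(rucio - castor))

    for path in sorted(castor - rucio)[:20]:      # small sample to eyeball
        print("  dark data? %s" % path)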
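
For the gfal-cat item above, a small reproduction sketch using the gfal2 Python bindings rather than the CLI, so the underlying error is captured; the SURL shown is a made-up placeholder path, not a real file.

    # Sketch: reproduce the gfal-cat failure via the gfal2 Python bindings
    # so the underlying error is visible. The SURL below is a placeholder,
    # not a real file.
    import gfal2

    surl = ("srm://srm-atlas.gridpp.rl.ac.uk/"
            "castor/ads.rl.ac.uk/prod/some/test/file")   # placeholder path

    ctx = gfal2.creat_context()
    try:
        f = ctx.open(surl, "r")          # gfal-cat is essentially open + read
        print(f.read(1024))
    except gfal2.GError as err:
        print("gfal2 error: %s" % err)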
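
For the nsmkdir item above, a sketch of the log check it mentions: scan the name server log for timeout-like entries and bucket them by hour to show where the timeouts cluster. The log path and the matched message fragments are assumptions, so adjust them to whatever the nsd logs actually contain.

    # Sketch: count timeout-like lines per hour in the name server log to
    # back the nsmkdir investigation. The log path and the substrings
    # matched are assumptions about our nsd logging, not known facts.
    import re
    from collections import defaultdict

    LOGFILE = "/var/log/castor/nsd.log"    # assumed location
    PATTERNS = ("timeout", "timed out")    # assumed message fragments

    counts = defaultdict(int)
    for line in open(LOGFILE):
        low = line.lower()
        if any(p in low for p in PATTERNS):
            # assume lines start with a timestamp like "2016-01-15T09:30:..."
            m = re.match(r"(\d{4}-\d{2}-\d{2}[T ]\d{2})", line)
            counts[m.group(1) if m else "unparsed"] += 1

    for hour in sorted(counts):
        print("%s:00  %d timeout-like lines" % (hour, counts[hour]))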

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • fac disk servers to SL6 - Action: CP to talk to Jens / DLS
  • Put fac disk servers 34/35 into diamond recall
  • 11.2.0.4 client updates (running in preprod) - possible change control for prod
  • WAN tuning proposal - possibly put into change control (Brian)
  • CASTOR 2.1.15


  • Upgrade of Oracle clients to 11.2.0.4 - planning for the production rollout, Gen first (a quick client-version check sketch follows this list)
  • Propose Tuesday 19th Jan for the SL6 upgrade on Fac D0T1 - comms with DLS and CEDA
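
A possible acceptance check for the 11.2.0.4 client rollout above: ask cx_Oracle which Oracle client library it is actually linked against on a node. The expected version is the only hard-coded value; everything else is generic.

    # Sketch: check which Oracle client library cx_Oracle is linked against
    # on a node, as a quick acceptance test for the 11.2.0.4 rollout.
    import cx_Oracle

    expected = (11, 2, 0, 4)
    found = cx_Oracle.clientversion()   # returns a tuple, e.g. (11, 2, 0, 4, 0)

    version = ".".join(str(x) for x in found)
    if tuple(found[:4]) == expected:
        print("OK: Oracle client %s" % version)
    else:
        print("MISMATCH: expected 11.2.0.4, found %s" % version)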


Things from the CASTOR F2F

  • CASTOR can run on an arbitrary Ceph underlayer.
  • No 2.1.16 is planned any time soon.
  • SL7 test build ready soon


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. There is a problem running the script because the version of Python on the SRM servers is still 2.4; RA will pursue this (a minimal Python 2.4-compatible sketch follows this list). SdW has reviewed it and is confident that it is low risk.
  • JJ – Glue 2 for CASTOR, something to do with publishing information??? Not sure there was a specific action associated with this
  • JS – replacing Tier1 CASTOR db hardware, ACTION: RA/JS to discuss disk requirements
  • Quattorise the SL5 SRMs?
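
A minimal sketch of the kind of duplicate detection RA's script performs, written to stay within Python 2.4 as found on the SRM servers (so no 'with' statement); the connect string, table and column names are placeholders rather than the real SRM schema.

    # Sketch: list duplicated rows in an SRM request table, written to run
    # under Python 2.4 (no 'with' statement, which arrived later).
    # The connect string, table and column names are placeholders, NOT the
    # real SRM schema.
    import cx_Oracle

    conn = cx_Oracle.connect("srm_reader/password@SRMDB")   # placeholder DSN
    cur = conn.cursor()
    try:
        cur.execute("""
            SELECT surl, COUNT(*)
              FROM srm_requests            -- placeholder table name
             GROUP BY surl
            HAVING COUNT(*) > 1
        """)
        for surl, n in cur.fetchall():
            print("%s appears %d times" % (surl, n))
    finally:
        cur.close()
        conn.close()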


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • Castor on Call person next week
    • Rob
  • Staff absence/out of the office:
    • Chris out Monday


New Actions

  • BD re. WAN tuning proposal - discuss with GS, does it need a change control?
  • CP to talk to Jens / DLS re fac disk servers > SL6


Existing Actions

  • BD: recovery files leading to a mismatch between castor and rucio. Brian to confirm whether LHCb have a similar issue
  • KH Ask Kashif if we should update other disk servers with the new BIOS ...
  • RA/BD Discuss DTeam use of atlas and issues around
  • RA/JS vcert upgrade to 2.1.15-20 - DONE
  • RA to try stopping tapeserverd mid-migration to see if it breaks. RA to check with TF for more detailed explanation of what he did and why.
  • RA produce a list of requirements for tape drive dedication in future castor versions, to discuss with CERN
  • SdW to revisit schema changes to improve SRM DB dup position W/C 12th Oct.
  • Production team check srm03 regarding lack of crl updates
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc) - being progressed 15/01/16
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB / BDII - in progress

Completed actions

  • RA/BD review unowned RT tickets - DONE
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures -DONE
  • BD to chase AD about using the space reporting thing we made for him -DONE
  • RA to schedule no-op change to be included in quattor / persistent (CMSdisk)- DONE
  • RA/TF Contact Steve at CERN re CASTOR priority sending files to tape vs recalls - DONE
  • GS/RA to revisit the CASTOR decommissioning process in light of the production team updates to their decommissioning process