RAL Tier1 weekly operations castor 07/03/2016

From GridPP Wiki

Operations News

  • CASTOR 2.1.15 now works: file open times are good, at 0.1-0.15 s
  • glibc patching for facilities completed
  • 11.2.0.4 DB client updates are done (LHCb has already been restarted)


  • No disk server issues this week
  • glibc updates applied and all CASTOR systems rebooted. There were initial issues with the head nodes: 7 failed to reboot because of their build history. ACTION: RA to revisit the quattor build so that this does not recur.
  • The 11.2.0.4 DB client update had to be rescheduled and should go ahead on Monday 29th. It has been running in pre-prod for a considerable amount of time. This should be transparent. ACTION: RA

Operations Problems

  • Failure of main CIP server over weekend, now back in operation
  • LHCb job failures ticket still open
  • CMS AAA issues
  • glibc patching: the issues with servers coming back are now understood (the kernel was not updated by spma before the reboot). There was also a problem with the network interfaces / SSH key being changed on a storageD machine


  • Main CIP system failed; we have failed over to the test CIP machine. Once the HW failure is fixed, we will fail back to the production system. ACTION: RA, CC and Fabric
  • OPN links: BD is investigating which data flow is filling the OPN and SuperJANET at the same time.
  • LHCb job failures - GGUS ticket open
  • ongoing AAA issues in CMS
  • CV '11 gen RAID card controller firmware update required, as Western Digital drives are failing at a high rate. ACTION: RA to follow up with Fabric team

Planned, Scheduled and Cancelled Interventions

  • The 2.1.15 update to the nameserver will not go ahead because of slow file open times on the stager. Testing/debugging of the stager issue is ongoing. If these issues are resolved, the proposal is to upgrade the nameserver on 16th March, with the stager following the week after, on 22nd March. (RA)
  • 11.2.0.4 client updates (running in preprod) - possible change control for prod (see above)
  • WAN tuning proposal - possibly put into change control BD
  • CASTOR facilities patching scheduled for next week - detailed schedule to be agreed with fabric team.

Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. This script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team, via AL.
  • JJ: Glue 2 for CASTOR, used for publishing information. RA is writing the data-getting end in Python; JJ is writing the Glue 2 end in LISP. No schedule as yet.
  • Facilities drive re-allocation. ACTION: RA
  • SRM 2.1.14 testing with SdW on VCERT

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing


Interventions

  • Juno reboot for patching - possibly Wed 16th March (needs Martin B / Juan / castor)
  • Tier1 Nameserver 2.1.15 update proposed for week before Easter W/C 21st March

Staffing

  • Castor on Call person next week
    • CP Friday 04/03/16 onwards
  • Staff absence/out of the office:
    • SdW - all week and next few weeks
    • RA - all week
    • AP - all week


New Actions

  • CC: is there any documentation on handling broken CIPs? (raised following the CIP failure at the weekend)
  • GS: callout for CIP only in waking hours?
  • RA: CV '11 firmware updates
  • GS to confirm that no other Tier 1s are down W/C 21st March


Existing Actions

  • RA to revisit quattor build for the head nodes which did not reboot so that this does not recur.
  • RA follow up with Fabric re: CV '11 gen RAID card controller firmware update
  • BD to coordinate with ATLAS re bulk deletion before TF starts repack
  • BD to clarify if separating the DiRAC data is a necessity
  • BD to ensure the ATLAS consistency check is quattorised
  • RA to deploy a 14 generation into preprod
  • BD to discuss the WAN tuning proposal with GS: does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
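
For the cleanlostfiles action above, a minimal sketch of how a tool can record each use in syslog is given below. It is not the actual cleanlostfiles change: it assumes the tool can call Python's standard logging module, and the function names (format_usage_record, build_usage_logger) and the record layout are hypothetical. /dev/log is the usual local syslog socket on Linux and is used as the default here.

```python
import logging
import logging.handlers


def format_usage_record(user, argv):
    """Build a one-line record of who ran the tool and with which arguments.

    The 'user=... args=...' layout is an assumption made for this sketch.
    """
    return "user=%s args=%s" % (user, " ".join(argv))


def build_usage_logger(handler=None):
    """Return a logger that forwards usage records to syslog.

    The handler can be passed in, so the function can be exercised without
    a running syslog daemon. By default it writes to the local /dev/log
    socket.
    """
    logger = logging.getLogger("cleanlostfiles")
    logger.setLevel(logging.INFO)
    if handler is None:
        handler = logging.handlers.SysLogHandler(address="/dev/log")
    # Prefix each message with the logger name so it is easy to grep for.
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

At start-up the tool would call build_usage_logger() once and then log format_usage_record(getpass.getuser(), sys.argv[1:]) at INFO level, so that each invocation leaves one line in the system log.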