Difference between revisions of "RAL Tier1 weekly operations castor 22/04/2016"

From GridPP Wiki

Revision as of 09:32, 6 May 2016

Agenda:

1. Problems encountered this week
  - CMS - AAA
  - LHCb job failure
2. Upgrades/improvements made this week
  - gfal
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
  - 2.1.15
  - Progress
  - Planning
5. Special topics
6. Actions
7. Anything for CASTOR-Fabric?
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB


Operations News

  • New MICE user set up

Operations Problems

  • gfalcat does not work with CASTOR. The underlying issue has been fixed for gfalcopy but not for gfalcat (gfal developers responsible) - Tracking
  • atlasScratch: ATLAS users are still having problems accessing atlasScratch files - investigations ongoing
  • GDSS771 crashed - now in draining
  • Draining is not working for ATLAS (it does, however, seem to work for LHCb). Brian has changed parameters as recommended by Shaun, with no improvement. The manual method of draining still works, using diskServerLs and stager_get (to move the file to another disk server)


Planned, Scheduled and Cancelled Interventions

  • CASTOR 2.1.15


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue, which is causing callouts. The script has been tested and will now be put into production as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team via AL.
  • JJ – Glue 2 for CASTOR, used for publishing information. RA is writing the data-getting end in Python; JJ is writing the Glue 2 end in LISP. No schedule as yet.

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing
  • Deployment of SRM 2.14

Staffing

  • All in
  • Castor on Call person next week


New Actions

  • GS to ask Kashif about RAID firmware updates on d0t1 v2011 machines, and whether there are other batches of machines that should be upgraded


Existing Actions

  • BD check if D drives have arrived for WLCG
  • BD report draining issues to CERN
  • BD to work with George P (new hire) to hand over the WAN tuning work
  • RA/AS new tool for monitoring srm db dups - the user type
  • RA to get someone to code review his SRM_DB_DUPLICATES blatting script
  • BD MICE ticket - asking for a separate tape pool for d0t1 for Monte Carlo
  • GS is there any documentation on handling broken CIPs? (raised following the CIP failure at the weekend)
  • GS Callout for CIP only in waking hours?
  • RA to ensure quattorising of the ATLAS consistency check - Rob to talk to Andrew L
  • RA to try stopping tapeserverd mid-migration to see if it breaks - ask Tim.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
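
For the cleanlostfiles action above, one way a command-line tool could record its use in syslog is sketched below. This is only an illustration under stated assumptions: it assumes a tool written in Python, and the function names, the message layout and the example path are hypothetical, not taken from cleanlostfiles itself. It relies only on the standard library syslog module.

```python
import getpass
import syslog


def format_usage_message(user, paths):
    """Build a one-line record of who ran the tool and on which files.

    The field layout (user=, files=, paths=) is a hypothetical choice.
    """
    return "user=%s files=%d paths=%s" % (user, len(paths), ",".join(paths))


def log_usage(paths, ident="cleanlostfiles"):
    """Send one usage record to the local syslog daemon.

    The ident string is an assumption; it becomes the tag on each log line.
    """
    message = format_usage_message(getpass.getuser(), paths)
    # LOG_PID adds the process id, LOG_USER is the generic user-level facility.
    syslog.openlog(ident=ident, logoption=syslog.LOG_PID, facility=syslog.LOG_USER)
    syslog.syslog(syslog.LOG_INFO, message)
    syslog.closelog()
    return message
```

Calling log_usage(["/castor/example/file1"]) at the start of a run would write one line tagged with the ident, so use of the tool could be tracked by searching the system log for that tag.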