RAL Tier1 weekly operations castor 10/08/2009

From GridPP Wiki
Revision as of 10:40, 12 August 2009 by Chris kruk (Talk | contribs)

Summary of Previous Week

  • CIP development on certification (Jens)
  • CASTOR Disaster Recovery document (Matt)
  • Investigated 2.1.8 NS client on 2.1.7 NS DB (Chris)
  • Wrote an LSF log archiving mechanism which archives logs for 3 months on DMF and keeps the last 24 hours of logs on the local partition (Chris)
  • Finished deploying the last remaining cmsTest for gridftp-internal tests (Chris)
  • Upgraded all remaining tape servers to 2.1.8-8 (Chris)
  • Matt was Castor on Duty from Monday to Thursday and Chris on Friday
  • Increased threads on ATLAS JobManager from 2 to 10. The change was reverted to 2 after unsuccessful tests (Shaun)
  • Fixed Atlas tape monitoring metrics on CastorMon box (Shaun)
  • Wrote a new optimised canbemigr script (Shaun)
  • Investigating Atlas "lcg_cr=Invalid argument" problem (Brian)
  • SL8500 robot's handbot problems (Tim)
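
The LSF log archiving item above describes a two-tier retention policy: 24 hours of logs on the local partition and 3 months in the archive. The source does not describe how the mechanism is implemented, so the following is only a minimal sketch of such a policy. The function name `rotate_logs`, the directory arguments, the use of file modification times, and the approximation of 3 months as 90 days are all assumptions made for illustration; the archive directory simply stands in for a DMF-managed filesystem.

```python
import shutil
import time
from pathlib import Path

LOCAL_KEEP_SECONDS = 24 * 3600          # keep the last 24 hours locally
ARCHIVE_KEEP_SECONDS = 90 * 24 * 3600   # "3 months" approximated as 90 days (assumption)


def rotate_logs(local_dir, archive_dir, now=None):
    """Hypothetical helper: move old local logs to the archive, expire old archived logs.

    Returns two lists of file names: (moved, expired).
    """
    now = time.time() if now is None else now
    local_dir, archive_dir = Path(local_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    moved, expired = [], []

    # Local files older than 24 hours go to the archive (mtime is preserved).
    for path in sorted(local_dir.iterdir()):
        if path.is_file() and now - path.stat().st_mtime > LOCAL_KEEP_SECONDS:
            shutil.move(str(path), str(archive_dir / path.name))
            moved.append(path.name)

    # Archived files older than the archive retention period are deleted.
    for path in sorted(archive_dir.iterdir()):
        if path.is_file() and now - path.stat().st_mtime > ARCHIVE_KEEP_SECONDS:
            path.unlink()
            expired.append(path.name)

    return moved, expired
```

A script like this would typically be run periodically (for example from cron) with the LSF log directory and the archive mount point as arguments; both paths are site-specific and are not given in these notes.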

Developments for this week

  • CIP development on certification (Jens)
  • Implement newly written lsf archiving mechanism in production (Chris)
  • Review and update disk server deployment procedure (Chris)
  • Return hotspare disk servers back to production instances (Chris)

Ongoing

  • Find out more about CERN's virtualized certification setup (Chris, Matt)
  • CastorMon monitoring graphs for Gen instance (Brian)
  • Cleaning up database for a future 2.1.8 upgrade (Shaun)
  • Setting up Preproduction (Matt, Chris)

Operations Issues

  • Ongoing problem with gdss213 (AtlasScratchDisk), which has had a disk failure plus errors on 3 other disks.
  • Unsuccessful increase of Atlas JobManager threads from 2 to 10, which degraded DB performance and decreased our efficiency (6th August).
  • CMS tape migration stopped at around 09:00 on 2nd August and restarted on the afternoon of 3rd August.
  • Tape robot down on the morning of 3rd August due to Sun engineer intervention.
  • gdss152 (AtlasSimStrip/disk1tape0) is out of production from 10th August due to hardware intervention.
  • Aircon problems in the machine room from 10th August, causing high temperatures and making it necessary to shut down Castor services.

Blocking issues

none

Scheduled and Cancelled Down Times

none

Changes to Production Milestones

none

Advanced Planning

  • CIP upgrade to include nearline publishing (August)
  • SRM 2.8 upgrade (August)
  • Work with Fabric to add extra RAID card in remaining Viglen'06 disk servers (Second half of August)
  • Database optimization tasks (September)
  • Upgrade nameserver to 2.1.8 (Possibly during September)
  • Black and White lists? (Possibly during September)
  • Improve resiliency to central services (This year)

Staffing

  • Castor on Call person: Chris
  • Cheney on A/L
  • Matt on A/L
  • Shaun on A/L
  • Tim on A/L
  • Jens on A/L from Tuesday