Difference between revisions of "RAL Tier1 weekly operations castor 06/12/2010"

From GridPP Wiki

Latest revision as of 16:03, 6 December 2010

Operations News

  • 2.1.9-10 installed on Facilities instance. Now returned to users for further testing.
  • New puppetmaster has now been installed and is controlling all Facilities and preprod disk servers.
  • All CASTOR EMC RAID arrays are now fed from UPS on one of their power supplies.

Operations Issues

  • The slow network problem affecting 47 disk servers was traced to a faulty transceiver, which was replaced on Tuesday. All instances apart from Gen had suffered from this problem, especially CMS.
  • A power blip on Wednesday afternoon knocked out all disk servers on phase 'A' - around 100 disk servers. Most came up without problem. As of Friday morning, only gdss77 is still out of production.
  • On Thu/Fri night an unknown problem caused the robot 'playground' to fill up with parked tapes, and disabled a number of handbots. An engineer was called out and returned the parked tapes to production and freed up the disabled handbots.
  • The CMS instance was heavily loaded over the weekend. gdss310 stopped responding, and removing it from CASTOR helped. The SRMs crashed repeatedly because StatusOfBringOnline requests were crashing the frontend. The CMS SRMs were upgraded to 2.8-6 to prevent a recurrence.

Blocking issues

  • The lack of production-class hardware running Oracle 10g needs to be resolved before CASTOR for Facilities goes into full production.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description              Start             End               Type      Affected VO(s)
Update ATLAS to 2.1.9-6  06/12/2010 08:00  08/12/2010 18:00  Downtime  ATLAS

Advanced Planning

  • Deploy new puppetmaster
  • Upgrade ATLAS, CMS and Gen disk servers to a 64-bit OS
  • CASTOR upgrade to 2.1.9-10 and SRM upgrade to 2.10 to fix the unavailable status being reported to FTS with draining disk servers
  • CASTOR upgrade to 2.1.9-10 which incorporates the fix for gridftp-internal to support multiple service classes, enabling checksums for Gen
  • CASTOR for Facilities instance in production by end of 2010

Staffing

  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • Chris A/L on Friday