RAL Tier1 weekly operations castor 27/06/2011

From GridPP Wiki
Revision as of 13:38, 27 June 2011 by Matt viljoen (Talk | contribs)


Operations News

  • Tiju has written PUT tests, which run on castoradm1, to test the Facilities CASTOR instance.

Operations Problems

  • LHCb: problems with tape recalls. These were addressed, at least in part, by modifying the stream policy to allow migration only once 20 GB of data is waiting to migrate (or after 1 hour) and by switching the GC policy to LRU. However, problems remain because the pool is being written to and read from concurrently.
  • The CMS LSF server was accidentally switched off. Although this was resolved in about 1 hour, there were knock-on problems the next day with the CMS stager database. These were resolved quite quickly (despite the away day).
  • LHCb disk server gdss120 was found to be writing to the system disk rather than the disk array. This machine was drained and replaced with gdss163, and gdss120 has now been retired from service.
  • The Neptune RAC failed during the weekend for unknown reasons and had to be restarted, resulting in 2-4 hours of downtime. Since this hosts the NS, all instances were affected. There were some knock-on problems on Monday.
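
The LHCb stream-policy change noted above amounts to a simple threshold rule: hold back migration until enough data has accumulated, unless the data has already waited too long. A minimal sketch in Python follows. The 20 GB and 1 hour values come from this report; the names `Stream`, `should_start_migration`, `pending_bytes` and `oldest_wait_seconds` are hypothetical and are not taken from CASTOR's actual policy interface, and decimal gigabytes are assumed.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: decimal gigabytes


@dataclass
class Stream:
    """Hypothetical stand-in for a migration stream; not a CASTOR type."""
    pending_bytes: int          # data waiting to be migrated to tape
    oldest_wait_seconds: float  # how long the oldest waiting file has queued


def should_start_migration(stream, min_bytes=20 * GB, max_wait_seconds=3600):
    """Allow migration once enough data is queued or the wait limit is reached."""
    return (stream.pending_bytes >= min_bytes
            or stream.oldest_wait_seconds >= max_wait_seconds)


if __name__ == "__main__":
    # 5 GB queued for 10 minutes: neither threshold is met, so hold back.
    print(should_start_migration(Stream(5 * GB, 600)))
    # 5 GB queued for over an hour: the time limit forces migration.
    print(should_start_migration(Stream(5 * GB, 3700)))
```

Batching in this way should favour fewer, larger tape writes, at the cost of data staying on disk for up to the wait limit before it is migrated.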

Blocking Issues

  • The lack of production-class hardware running ORACLE 10g needs to be resolved before CASTOR for Facilities can guarantee the same level of service as the Tier1 instances. The hardware has arrived and is awaiting installation.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
CASTOR 2.1.10-1 upgrade (STC) | 05 July 09:00 | 05 July 11:00 | Downtime | All

Advanced Planning

  • Upgrade of CASTOR clients on WNs to 2.1.10-0
  • Move Tier1 instances to new Database infrastructure with a Dataguard backup instance in R26
  • Move Facilities DB instance to new Database hardware running 10g
  • Upgrade SRMs to 2.11, which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

  • CASTOR on-call person: Shaun (Mon-Fri) and Matthew (Sat-Sun)
  • Staff absence/out of the office:
    • Matthew attending HEPSYSMAN on Thursday and Friday