RAL Tier1 weekly operations Fabric 20091207

From GridPP Wiki
Revision as of 15:14, 7 December 2009 by James adams (Talk | contribs)

Summary of week gone

Developments

  • All:
  • Martin:
  • Ian:
    • Finalised new flex license servers for LSF
    • Further Quattor tutorial for Cheney & Matt
    • Provided physical hardware for CIP
      • Went through various configuration options
  • James T:
    • Quattorisation of disk servers
    • Primary on call Mon - Thurs
    • Viglen disk swap out support
    • Post-mortem on gdss138
  • Jonathan:
    • maintained NIS netgroup
    • corrected atlasbackup problems for a few hosts
    • Administrator on Duty (Wednesday)
    • unmounted /home/csf from lcg0617/618
    • Nagios configuration updates
    • tuned the system on nagger to try to reduce the scheduling queue
    • installed the mrtg RPM on nagger and added configuration to collect performance statistics from Nagios (currently at http://nagger.gridpp.rl.ac.uk/mrtg/nagios-[a-n].html)
  • James A:
    • Achieved a working preliminary SINDES server.
    • Upgraded Cacti on thor.gridpp.rl.ac.uk
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss138 double disk failure. (Intervention)
    • gdss149, 162, 163 and 367 given back to castor.
    • gdss77 kernel panic, possibly faulty memory. (Intervention)
    • gdss139 given back to castor.
    • Moved 3 batch systems from R27 to HPD room (CV 2005 rack) with MJB.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss77, 138 and 282.

Absences

  • Jonathan: S/L Monday
  • Jonathan: A/L Thursday am

Operational Issues and Incidents

| Index | Description | Start | End | Severity | Affected VO(s) |
|  | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |
|  | gdss138 double disk failure: two drives failed in quick succession (30 minutes) | Monday 0530-0600 | Ongoing | Severe | LHCb DST data. Data loss confirmed |

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

| Component | Description | Start | End | Affected VO(s) | Type |
| Cacti (http://cacti.gridpp.rl.ac.uk) | Upgrade Cacti software. Subject to team manager's approval of plan. | Tuesday 2009-12-08 13:00 | Tuesday 2009-12-08 17:00 | none | At Risk |

Development priorities

  • All:
    • Work on evacuating A1 Upper (Castor LSF/FlexLM triplet)
  • Martin:
  • Ian:
    • Reconfigure physical CIP again
    • Implement second CIP with Quattor (T2K)
    • Start work on Quattor managed glite 3.2 vobox with Catalin
    • Assist with new disk servers as required
    • Incorporation of latest QWG template updates
  • James T:
    • Quattorisation of disk servers
    • Remove nincom as Ganglia data source for Services_Monitoring
    • Script to compare Overwatch with real CASTOR status
    • TOASTER preparation
  • Jonathan:
    • Quattor implementation for Nagios slave
    • security updates to disk servers to prevent general user logins
    • Nagios configuration updates
  • James A:
    • Continue with SINDES.
    • Upgrade Cacti on cacti.gridpp.rl.ac.uk, install plugins and apply internal patches.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss339 kernel panic. (Intervention)
    • gdss138 double disk failure. (Intervention)
    • Decommissioning old batch systems with Production Team.
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss77, 138, 282 and 339.

Absences

  • Kashif: A/L Wednesday

Fabric On-Call

  • Mon-Sun: Ian Primary on call

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with the Atlas TAG migration to 64-bit because the arrays are being used for the 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on various hardware requests for Services team.
