RAL Tier1 weekly operations Fabric 20091130


Summary of week gone

Developments

  • All:
  • Martin:
    • Work on Viglen08 disk acceptance tests
    • Work on test nodes for LFC database resilience tests
    • Various finance and procurement issues
  • Ian:
    • Wrapped up FP7 bid - submitted Tuesday
    • Tested new features in Quattor for Nagios slave server
    • Imported Quattor updates from QWG - fixed a couple of resulting issues
  • James T:
    • CRISTAL2 preparation
    • Set up ancillary network for Streamline08 disk servers with James A.
    • Quattorisation of disk servers
    • CRISTAL2 Wed - Fri
    • Fabric on call Mon - Thurs
    • Primary on call Fri - Sun
  • Jonathan:
    • updated kernels on NIS servers and rebooted
    • removed mount of /home/csf and added soft-links for Bfactory users for farm nodes
    • wrote paper about backup policy, recovery etc for Tier1 review
    • increased quota for LHCb AFS volume
    • cleared up atlasbackup problems for some nodes
    • created archive backups for ccsc07/15 for Richard
    • Nagios configuration updates
    • released new versions of RPMs and tier1-nrpe-config
    • rebooted nagger for new kernel
    • worked on Quattor configuration of Nagios slave (with help from Ian)
  • James A:
    • Caught up after leave.
    • Tried to focus on SINDES.
    • Helped with general Quattor issues where needed.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss163 double disk failure (test finished).
    • gdss95 and gdss134 given back to Castor.
    • Created graphs of drive failures for MJB.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 163 and 282.


Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6 Oct am | UPS issues to be fixed | Catastrophic | All
- | Gdss138 double disk failure: two drives failed in quick succession (30 minutes) | Monday 0530-0600 | Ongoing | Severe | LHCb Dst data. Data loss confirmed

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
    • Work on evacuating A1 Upper (Castor LSF/FlexLM triplet)
  • Martin:
    • Viglen08 disk acceptance solution
  • Ian:
    • Finalising new flex license servers for LSF
    • Further Quattor tutorial for Cheney
    • Assist with new disk servers
  • James T:
    • Quattorisation of disk servers
    • Decision on Viglen 2008 suggested solution
    • Primary on call Mon - Thurs
  • Jonathan:
    • Quattor implementation for Nagios slave
    • security updates to disk servers to prevent general user logins
    • Nagios configuration updates
  • James A:
    • Continue with SINDES.
    • Make some fixes to the Hardware database for Kash.
    • Update and make changes to Cacti.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Return gdss67 to the Castor team after finishing the test.
    • gdss138 double disk failure. (Intervention)
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss67, 138, 163 and 282.

Absences

  • Jonathan: S/L Monday
  • Jonathan: A/L Thursday am

Fabric On-Call

  • Mon-Thu: James T Primary on call
  • Fri-Sun: Ian Primary on call

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on various hardware requests for Services team.
