RAL Tier1 weekly operations Fabric 20100719

From GridPP Wiki

Developments

  • All:
  • Martin:
    • Sick leave on Friday.
  • Ian:
    • Prepared initial gLite 3.2 LFC for Catalin
    • Installed and began configuring Hyper-V server & hypervisors
    • ATLAS software investigations
    • Initial tests of CernVM-FS (as a possible software server)


  • Tim:
    • Tweaking T10K migration
    • Improving monitoring script for above
    • Monitoring script now logging to Ganglia
    • Removing data from DMF


  • Jonathan:
  • James A:
    • Working on new Quattor server.
    • Wrote a tool and some modules for decoding and replaying Frontier Squid logs.
  • James T:
    • Sick leave on Tuesday
    • At Brian's request, applied WAN tuning to the LHCb, ATLAS and CMS servers that did not already have it.
    • Deployed new version (1.3-1) of TAVS (Tier1 Array Verify Scheduler) to quattorised disk servers. New version supports Adaptec cards and fixes a bug in the error handling.
    • Deployed the RFIO /etc/services change to SL5 disk servers.
    • Wrote script to check consistency of Overwatch and CASTOR (thanks to James A. for help).
    • Wrote syslog survey for security strategy group - expect it in your inbox this week.
    • Applied for certificates for ssv06 hosts (decommissioned Viglen '06 hosts).
    • Very constructive meeting with LSI and Streamline on Friday. We will start acceptance tests early this week.
  • Cheney:
    • Set up ACLs on robot controller
    • IPMI problem on rhubarb
    • Replaced some drives and showed Kash how
    • Tracked down some rogue emails
    • Updated Nagios to monitor disk arrays
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss67 passed acceptance test and given back to CASTOR team.
    • gdss78 passed acceptance test. Re-installing partition and Linux.
    • gdss207 crashed again. (Intervention)
    • gdss474 replaced RAID card. (Fixed). Given back to CASTOR.
    • gdss332 replaced IPMI card with John Kelly.
    • gdss217 replaced 8x1GB memory. Given back to CASTOR.
    • lcgce01 disk partition failure. (Reported)
    • Arranged collection of faulty parts with Viglen, Transtec and VSPL.
    • Hardware failure stats/graphs.
    • gdss231 & 420 replaced battery.
    • gdss536 & 537 replaced LSI RAID cards with Adaptec with Boston engineer. (For testing)
    • Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • James T on sick leave on Tuesday
  • Cheney out Wednesday
  • Martin on sick leave on Friday

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Sick leave on Monday
    • Multipath update
    • Disk ITT evaluation
  • Ian:
    • Virtualisation platform testing & planning
    • Help Cheney with Quattor as required
    • Fabric Automation planning
  • Tim:
    • Library microcode update
    • Facilities CASTOR planning
  • Cheney:
    • Set up central CASTOR servers on Quattor
  • Jonathan:
  • James T:
    • (Re)start Streamline '09 acceptance testing.
    • Overwatch/CASTOR comparison script testing.
    • Security strategy group tasks.
  • James A:
    • New Quattor Server.
    • Thinking about caching DNS servers.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss207: chase vendor for replacement parts.
    • gdss380: run 7-day acceptance test.
    • Work with Streamline engineer.
    • Continue decommissioning old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Martin on sick leave Monday
  • James T on special leave Monday 26th to Friday 30th July
  • Ian annual leave Thursday & Friday

Fabric On-Call

James T Monday-Thursday

Kashif Friday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues

