RAL Tier1 weekly operations Fabric 20100719
Developments
- All:
- Martin:
- Sick leave on Friday.
- Ian:
- Prepared initial gLite 3.2 LFC for Catalin
- Installed and began configuring HyperV server & hypervisors
- ATLAS software investigations
- Initial tests of CernVM-FS (as a possible software server)
- Tim:
- Tweaking T10K migration
- Improving monitoring script for above
- Above now logging to Ganglia
- Removing data from DMF
- Jonathan:
- James A:
- Working on new Quattor server.
- Wrote a tool and some modules for decoding and replaying Frontier Squid logs.
- James T:
- Sick leave on Tuesday
- At Brian's request, applied WAN tuning to the LHCb, ATLAS and CMS servers that didn't have it.
- Deployed new version (1.3-1) of TAVS (Tier1 Array Verify Scheduler) to quattorised disk servers. New version supports Adaptec cards and fixes a bug in the error handling.
- Deployed the RFIO /etc/services change to SL5 disk servers.
- Wrote script to check consistency of Overwatch and CASTOR (thanks to James A. for help).
- Wrote syslog survey for security strategy group - expect it in your inbox this week.
- Applied for certificates for ssv06 hosts (decommissioned Viglen '06 hosts).
- Very constructive meeting with LSI and Streamline on Friday. We will start acceptance tests early this week.
- Cheney:
- Set up ACLs on robot controller
- IPMI problem on rhubarb
- Replaced some drives and showed Kash
- Tracked down some rogue emails
- Updated Nagios to monitor disk arrays
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Decommissioning old batch systems. (R 27)
- gdss67 passed acceptance test and given back to CASTOR team.
- gdss78 passed acceptance test. Re-installing partition and Linux.
- gdss207 crashed again. (Intervention)
- gdss474 replaced RAID card. (Fixed) Given back to CASTOR.
- gdss332 replaced IPMI card with John Kelly.
- gdss217 replaced 8x1GB memory. Given back to CASTOR.
- lcgce01 disk partition failure. (Reported)
- Arranged collection of faulty parts with Viglen, Transtec and VSPL.
- Hardware failure stats/graphs.
- gdss231 & 420 replaced battery.
- gdss536 & 537 replaced LSI RAID cards with Adaptec with Boston engineer. (For testing)
- Streamline/Areca disk servers crashed due to single faulty drive. (Ongoing)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- James T on sick leave on Tuesday
- Cheney out Wednesday
- Martin on sick leave on Friday
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Development priorities
- All:
- Martin:
- Sick Leave Monday
- Multipath update
- Disk ITT evaluation
- Ian:
- Virtualisation platform testing & planning
- Help Cheney with Quattor as required
- Fabric Automation planning
- Tim:
- Library microcode update
- Facilities CASTOR planning
- Cheney:
- Central CASTOR servers set up on Quattor
- Jonathan:
- James T:
- (Re)start Streamline '09 acceptance testing.
- Overwatch/CASTOR comparison script testing.
- Security strategy group tasks.
- James A:
- New Quattor Server.
- Thinking about caching DNS servers.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss207: chase vendor for replacement parts.
- gdss380: run 7-day acceptance test.
- Work with Streamline engineer.
- Continue decommissioning old batch systems. (R 27)
Absences
- Jonathan on partial retirement (not in on Monday and Friday)
- Martin on sick leave Monday
- James T on special leave Monday 26th to Friday 30th July
- Ian annual leave Thursday & Friday
Fabric On-Call
James T Monday-Thursday
Kashif Friday-Sunday