RAL Tier1 weekly operations Grid 20100927
Revision as of 14:29, 27 September 2010 by Alastair dewhurst (Talk | contribs)
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
RAID software failure on lcglb01 | 14 Sep 2010 | | all | low | replacement disk received; Fabric to swap it |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Name change for glite-APEL box | 15 Sep 2010 | early October | Medium | |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Wrote disk-draining script and TWiki page.
- Dealt with data loss at RALPP Higgs group disk last week
- Working on ATLAS software server, testing CVMFS
- 825 test jobs have been run.
- lcg0805 has been set up for production-style testing; need to add queue into ATLAS system.
- Production tasks submitted.
- Writing script to graph transfer times for FTS transfers
- Working on HammerCloud test of CASTOR 2.1.9
- Analysis queue setup
- Need to copy DBrelease into pre-prod and replicate
Andrew
- Fixed (again) ganglia CPU efficiency monitoring (crond wasn't running the script) [Done]
- Setting up & testing glite-APEL [Ongoing]
- CMS data ops
- Running rereco at RAL, PIC, FNAL, ASGC [Ongoing]
- Revising CMS change-control form
Catalin
- improve WMS monitoring [done]
- work on Helpdesk MySQL database migration [done]
- migrate remaining databases [ongoing]
- kernel upgrades on SL5 nodes [ongoing]
- halt old SL4 LFC FEs
- prepare nodes in ATLAS building for power shutdown
Derek
- CREAM CE quattor profile [ongoing]
- Investigating CREAM CE instability [ongoing]
- WN update rollout (folded into kernel update) [done]
Matt
- Further testing of Quattorised gLite3.2 FTS FEs. [Ongoing]
- Quattorisation of MyProxy nodes (write up Change Control). [Ongoing]
- Capacity Signoff meeting followup. [Done]
- Migrated FTS agents after h/w failure. [Done]
- Scheduled FTS drain for LHCb. [Done]
- Tested AFS cache on UI02 (with Ian). [Done]
- Rework FTS change control; factor out ATLAS power off. [Done]
- Work on Nagios plugins (tier1-nagios plugins build with Richard; restarter configuration). [Done]
Richard
- Simplified process of building tier1-nagios-plugins rpm and updated 2 plugins [Done]
- Prepping for kernel updates on the RAL top-level BDIIs
- Working on the "team status page" being developed as an action from team awayday [ongoing]
- Reviewing G/S process documentation [ongoing]
- CASTOR items:
VO Reports
ALICE
ATLAS
CMS
- 31 corrupt MC files at RAL (from gdss280) have been globally invalidated.
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall: