RAL Tier1 weekly operations Grid 20100201
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Disk errors on LB01 host | 27-Jan-2010 | 1-Feb-2010 | all | low | failed disk replaced; service not disrupted |
SQL server reboot problems | 27-Jan-2010 12:00 | 27-Jan-2010 13:30 | | high | server didn't reboot after kernel upgrade; GRUB magic work needed |
ALICE VOBOX upgrade problems | 27-Jan-2010 13:00 | 27-Jan-2010 14:30 | ALICE | high | filesystem problems after kernel and RPM upgrades; machine needed a re-install from scratch |
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for SCAS servers | 2010-02-01 | | High | Hardware required for production SCAS servers; required to be in place by end of Feb |
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). |
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | | | Medium | Hardware required for CREAM CE for non-LHC VOs |
Pool accounts for Super B VO | 2010-01-13 | | Medium | Required to enable Super B VO on batch farm. Done |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- ATLAS: Tier 1 throughput test performed today. RAL + FZK excluded.
- ATLAS: Re-processing run still meant to start on Friday 5th February. Will know exact timetable on 3rd February.
- CMS: continuous data taking will begin on 8th February
- Disk deployment: meeting scheduled for 14:00-15:00 Tuesday
Highlights for Tier-1 VO Liaison Meeting
- ALICE: possible decision to get rid of the lcg-CE at T1s and T2s
- Plan T2K configuration of FTS. Request dedicated diskpool, subject to confirmation of allocation from UB
Detailed Individual Reports
Alastair
- Understand remaining errors from HC test. [Done]
- Continue updating RAL PP twiki. [Ongoing]
- Prepare slides for presentation on computing requirements. [Ongoing]
- Write Nagios script to warn when space tokens are nearly full. [To be implemented when Castor comes back]
- Work with Brian + Chris on re-deploying disk servers to ATLAS space tokens.
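The planned Nagios space token check could take roughly the following shape. This is only a minimal sketch, not the script Alastair intends to write: the function name `check_space_token`, the token label and the 90%/95% thresholds are all hypothetical, and how the used and total figures would be obtained from Castor is left out. The only part taken from Nagios itself is the standard plugin exit code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_space_token(name, used_bytes, total_bytes, warn_pct=90.0, crit_pct=95.0):
    """Return (exit_code, status_line) for one space token.

    The thresholds are illustrative assumptions, not agreed values.
    """
    # Refuse to compute a percentage from nonsensical input.
    if total_bytes <= 0 or used_bytes < 0:
        return UNKNOWN, f"UNKNOWN - {name}: invalid space figures"

    pct = 100.0 * used_bytes / total_bytes

    # Check the critical threshold first so it takes precedence.
    if pct >= crit_pct:
        code, label = CRITICAL, "CRITICAL"
    elif pct >= warn_pct:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"

    return code, f"{label} - {name} is {pct:.1f}% full"
```

A wrapper would print the status line and exit with the returned code, which is how Nagios picks up the state of the check.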
Andrew
- Testing of new CMSSW TTreeCache training patch (still not quite as good as lazy-download; found that it crashes with official re-reco config) [Done]
- Investigated problems with CMS backfill jobs; MC production failed jobs [Done?]
- Started adding PhEDEx-CASTOR consistency Ganglia monitoring
- Test another new CMSSW I/O optimisation patch
- Complete PhEDEx-CASTOR consistency Ganglia monitoring (PhEDEx part done)
- Complete document about automatic job killing
- Display Screen Equipment assessment (had to do a second one) [Done]
Catalin
- WMS01 and 02 upgrades [Done]
- kernel updates [Done]
- re-installed one ALICE VOBOX [Done]
- 1-to-1 on Nagios configuration (with Jonathan)
- chase CERN for LFC schemas tidying up
- test Alice xrootd (manager + peer) re-installation (with Chris)
- quattorise additional LFC frontends (with Ian)
Derek
- Reinstalling lcgce08 with host swap config [Done]
- Reconfiguring lcgce01 [Done]
- Installing SL5 SCAS server
- Testing SL5 GLexec WN
- Setting up testbed site in quattor
- Released new yaim config rpm with updated GridPP VOMS server certificate
- Installed new yaim config rpm on lcgce02 and csfnfs58
Matt
- Test upgrade path from FTS2.1 to FTS2.2 on orisa
- Plan ATLAS/R89 co-hosting of Grid Services
- FTS drain and migration of front-ends back to somnus [Done]
- Plan T2K configuration of FTS, and request dedicated diskpool
Richard
- Manual Handling Training [Done]
- Had a quattor session with IC to demonstrate problems with the current BDII build. Will do a fresh build to test the effect of altering the INSTALL_ROOT template variable (and report findings back to Michel Jouvin for subsequent inclusion in the QWG templates)
- Currently working on one of the Nagios plugins assigned by Cheney
- CASTOR items:
  - Set up CCSE03..07 as CASTOR disk servers [Done]
  - Waiting for resolution of:
    - Disk array problems on castor301
    - Powering off / crashing problem on ccse02
Mayo
- Create system for exporting Metrics report to spreadsheet [Done]
- Adding bar chart to Metric system
- Admin interface for Metric System
VO Reports
ALICE
- possible decision to get rid of the lcg-CE at T1s and T2s
ATLAS
- Tier 1 throughput test performed today. RAL + FZK excluded.
- Re-processing run still meant to start on Friday 5th February. Will know exact timetable on 3rd February.
CMS
- Clean-up at Tier 1s may begin soon in preparation for next data taking period
- Continuous data taking will begin on February 8
- Backfill restarted (Tues night) in order to check CREAMCE monitoring problems
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun)
- AoD: