RAL Tier1 weekly operations Grid 20100125
From GridPP Wiki
Summary of Previous Week
Developments
- Alastair
- Ran tests on Frontier server to confirm it is working well. A small number (<10) of errors are not yet understood.
- Completed version 1 of Tier 1 VO requirements with information that has been provided by Raja.
- Helped deploy disk servers to the ATLAS scratch disk when it became full. Wrote a script to clean the scratch disk should it rapidly fill again.
- Andrew
- CMS data ops training at CERN (ProdAgent; MC production; backfill; creating workflows)
- Catalin
- handed over the SL5 LHCb VOBOX
- ATLAS Frontier 3.22 update
- tested the Alice xrootd (manager + peer) re-installation
- followed up some post-reboot WMS issues with CERN
- Derek
- Added BLParser to lcgbatch01
- Wrote quattor template for SL5 Glexec
- Added extra settings to the YAIM configuration so the site is categorised correctly by GSTAT
- Deployed and tested Staged Rollout SL5 64 bit top BDII
- Completed Multi User pilot job questionnaire
- Matt
- Migrated FTS agents to warm standby host
- Documented R-GMA recovery procedure
- Finished draft of Grid Services Disaster Recovery document
- Provided test site BDII for CIP upgrade testing, and tested CIP output
- Provided input for Grid Team for GridPP4
- Richard
- 2 days A/L
- Catch-up
- Fire Safety Training
- Attended "share out" meeting for Nagios/NRPE plugins
- Re-built the machine sv-08-02 as a test BDII server for Jens' CIP activity
- CASTOR items:
- Re-installed castor303.ads as a CASTOR disk server
- Castor301 needs same treatment but has memory problems
- Mayo
- Encrypted passwords within the Metric system
- Added a change password feature to the metric system
- Fixed a bug within the Metric system
- Worked on tape statistics spreadsheet project: converting Excel charts to HTML
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Disk errors on FTS Agents host | 20100119 14:00 | 20100121 09:00 | LHC | Low | Scheduled migration of agents to standby host |
Plans for Week(s) Ahead
Plans
- Alastair
- Understand remaining errors from HC test.
- Continue updating RAL PP twiki.
- Prepare slides for presentation on computing requirements.
- Write Nagios script to warn when space tokens are near full.
- Andrew
- IO testing: will try new CMSSW TTreeCache patch & compare to lazy download
- Investigate CMS job status reporting problems for my backfill jobs
- Investigate my backfill jobs killed by batch system
- Investigate my aborting MC production jobs at UK T2s/T3s
- Write document about automatic job killing
- Catalin
- WMS01 and 02 upgrades
- kernel updates
- chase CERN about tidying up the LFC schemas
- test Alice xrootd (manager + peer) re-installation (with ChrisK)
- Derek
- Reinstalling lcgce08 with host swap config
- Reconfiguring lcgce01
- Continuing work on Glexec and SCAS
- Matt
- FTS drain and migration of front-ends back to somnus
- Test upgrade path from FTS2.1 to FTS2.2 on orisa
- Planning ATLAS/R89 co-hosting of Grid Services
- Plan T2K configuration of FTS, and request dedicated diskpool
- Richard
- Manual Handling Training
- Feed back results from Jens' CIP testing into Quattor profile for BDII server
- CASTOR items:
- Finish installing the suite of RPMs needed on castor303 (new disk server)
- Re-install castor301 when memory has been fixed.
- Mayo
- Automating Metric report system
- Adding charts to the metric system
- Web interface and script to fetch data for Tape robot statistics spreadsheet project
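The planned Nagios check for space-token fullness could follow the standard Nagios plugin convention of exit codes 0/1/2/3 (OK/WARNING/CRITICAL/UNKNOWN). A minimal sketch is below; the token name, thresholds, and usage figures are illustrative assumptions, not the actual RAL implementation, and in practice the numbers would come from an SRM/CASTOR query.

```python
# Hypothetical sketch of a Nagios-style space-token fullness check.
# Thresholds and the example token name are assumptions for illustration.

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_space_token(name, used_tb, total_tb, warn=0.85, crit=0.95):
    """Return (exit_code, message) for one space token.

    used_tb/total_tb would be obtained from an SRM query in a real check;
    here they are passed in directly.
    """
    if total_tb <= 0:
        return UNKNOWN, f"UNKNOWN - {name}: no capacity reported"
    frac = float(used_tb) / total_tb
    msg = f"{name} {100 * frac:.0f}% full ({used_tb:.1f}/{total_tb:.1f} TB)"
    if frac >= crit:
        return CRITICAL, "CRITICAL - " + msg
    if frac >= warn:
        return WARNING, "WARNING - " + msg
    return OK, "OK - " + msg

if __name__ == "__main__":
    # Example: a token at 92% of capacity triggers a WARNING
    code, msg = check_space_token("ATLASSCRATCHDISK", 92.0, 100.0)
    print(msg)
```

A real plugin would print the message and call `sys.exit(code)` so Nagios picks up the status.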
VO Reports
ALICE
- Status Report (21 Jan) - very stable behaviour, CREAM stability exceptional, "a bit of free resources" (re: RAL farm between week 53/2009 and week 02/2010)
ATLAS
- ATLAS confirmed that RAL WMS not critical for UK operations.
CMS
- RAL ranked 2nd in T1 Site Readiness Ranking on 2010-01-25 (last 2 weeks)
- Only JobRobot and Backfill (re-reco) jobs running recently
- High CMS network usage on Friday was due to lazy-download not being specified in the reco config file
LHCb
- Other Tier-1s (IN2P3, PIC) also reporting low job efficiencies. Suspect LHCb user or application problems.
- LHCb confirmed that RAL WMS not critical for UK operations.
- New release of DIRAC, so testing on SL5 VOBOX should be able to proceed shortly.
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Disk problems on FTS agents host | lcgfts01 | Scheduled | 20100121 07:00 | 20100121 09:00 | LHC |
Requirements and Blocking Issues
Description | Required By | Priority | Status |
---|---|---|---|
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue |
Hardware for Testbed | | High | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). |
Hardware for SCAS servers | Feb 1 2010 | High | Hardware required for production SCAS servers - required to be in place by end of Feb |
Hardware for SL5 CREAM CE for Non LHC SL5 batch access | | Medium | Hardware required for CREAM CE for non-LHC VOs |
Pool accounts for SuperB VO | | Medium | Required to enable the SuperB VO on the batch farm |
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun)
- AoD: