From GridPP Wiki
Summary of Previous Week
Developments
- Alastair
- Ran tests on the Frontier server to confirm it is working well. A small number (< 10) of errors are not yet understood.
- Completed version 1 of the Tier 1 VO requirements document, using information provided by Raja.
- Helped deploy disk servers to the ATLAS scratch disk when it became full. Wrote a script to clean the scratch disk should it rapidly fill again.
- Andrew
- CMS data ops training at CERN (ProdAgent; MC production; backfill; creating workflows)
- Catalin
- handed over the SL5 LHCb VOBOX
- ATLAS Frontier 3.22 update
- tested the Alice xrootd (manager + peer) re-installation
- followed up some post-reboot WMS issues with CERN
- Derek
- Added BLParser to lcgbatch01
- Wrote quattor template for SL5 Glexec
- Added extra settings to the YAIM configuration so the site is categorised correctly by GSTAT
- Deployed and tested Staged Rollout SL5 64 bit top BDII
- Completed Multi User pilot job questionnaire
- Matt
- Migrated FTS agents to warm standby host
- Documented R-GMA recovery procedure
- Finished draft of Grid Services Disaster Recovery document
- Provided test site BDII for CIP upgrade testing, and tested CIP output
- Provided input for Grid Team for GridPP4
- Richard
- 2 days A/L
- Catch-up
- Fire Safety Training
- Attended "share out" meeting for Nagios/NRPE plugins
- Re-built the machine sv-08-02 as a test BDII server for Jens' CIP activity
- CASTOR items:
- Re-installed castor303.ads as a CASTOR disk server
- castor301 needs the same treatment but has memory problems
- Mayo
- Encrypted passwords within the Metric system
- Added a change password feature to the metric system
- Fixed a bug within the Metric system
- Worked on tape statistics spreadsheet project: converting Excel charts to HTML
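The scratch-disk cleanup script mentioned above is not reproduced on this page; a minimal sketch of the idea (delete the least-recently-accessed files once the filesystem passes a usage threshold) might look like the following. All thresholds, the dry-run default, and the age cutoff are assumptions, not the actual script.

```python
import os
import time

def clean_scratch(root, max_used_fraction=0.9, min_age_days=14, dry_run=True):
    """Hypothetical sketch: if the filesystem holding `root` is more than
    `max_used_fraction` full, list (and optionally delete) files not
    accessed for at least `min_age_days`, oldest first."""
    st = os.statvfs(root)
    used = 1 - st.f_bavail / st.f_blocks
    if used < max_used_fraction:
        return []  # plenty of space, nothing to do
    cutoff = time.time() - min_age_days * 86400
    candidates = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished mid-scan
            if atime < cutoff:
                candidates.append((atime, path))
    removed = []
    for _, path in sorted(candidates):  # oldest access time first
        if not dry_run:
            os.remove(path)
        removed.append(path)
    return removed
```

A production version would likely run from cron, log what it removes, and exempt per-VO pinned areas, but those details are not described in the report.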
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity | Status
Disk errors on FTS Agents host | 20100119 14:00 | 20100121 09:00 | LHC | Low | Scheduled migration of agents to standby host
Plans for Week(s) Ahead
Plans
- Alastair
- Understand remaining errors from HC test.
- Continue updating RAL PP twiki.
- Prepare slides for presentation on computing requirements.
- Write a Nagios script to warn when space tokens are nearly full.
- Andrew
- IO testing: will try new CMSSW TTreeCache patch & compare to lazy download
- Investigate CMS job status reporting problems for my backfill jobs
- Investigate my backfill jobs killed by batch system
- Investigate my aborting MC production jobs at UK T2s/T3s
- Write document about automatic job killing
- Catalin
- WMS01 and 02 upgrades
- kernel updates
- chase CERN for LFC schemas tidying up
- test Alice xrootd (manager + peer) re-installation (with ChrisK)
- Derek
- Reinstalling lcgce08 with host swap config
- Reconfiguring lcgce01
- Continuing work on Glexec and SCAS
- Matt
- FTS drain and migration of front-ends back to somnus
- Test upgrade path from FTS2.1 to FTS2.2 on orisa
- Planning ATLAS/R89 co-hosting of Grid Services
- Plan T2K configuration of FTS, and request dedicated diskpool
- Richard
- Manual Handling Training
- Feed back results from Jens' CIP testing into Quattor profile for BDII server
- CASTOR items:
- Finish installing the suite of RPMs needed on castor303 (new disk server)
- Re-install castor301 when memory has been fixed.
- Mayo
- Automating Metric report system
- Adding charts to the metric system
- Web interface and script to fetch data for Tape robot statistics spreadsheet project
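The planned Nagios space-token check above could follow the standard plugin convention of mapping occupancy to OK/WARNING/CRITICAL exit codes. The sketch below assumes the used/total byte counts have already been obtained (e.g. from an SRM or BDII query, which is not shown); the 80%/95% thresholds are illustrative assumptions.

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_space_token(used_bytes, total_bytes, warn=0.80, crit=0.95):
    """Hypothetical check: map space-token occupancy to a Nagios
    (exit_code, status_line) pair. Thresholds are fractions of capacity."""
    if total_bytes <= 0:
        return UNKNOWN, "UNKNOWN: token reports zero capacity"
    frac = used_bytes / total_bytes
    msg = f"{frac:.0%} of {total_bytes / 1e12:.1f} TB used"
    if frac >= crit:
        return CRITICAL, "CRITICAL: " + msg
    if frac >= warn:
        return WARNING, "WARNING: " + msg
    return OK, "OK: " + msg
```

A deployable plugin would print the status line and call `sys.exit()` with the returned code so Nagios can interpret the result.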
VO Reports
ALICE
- Status Report (21 Jan) - very stable behaviour, CREAM stability exceptional, "a bit of free resources" (re: RAL farm between week 53/2009 and week 02/2010)
ATLAS
- ATLAS confirmed that RAL WMS not critical for UK operations.
CMS
- RAL ranked 2nd in the T1 Site Readiness Ranking on 2010-01-25 (over the last 2 weeks)
- Only JobRobot and Backfill (re-reco) jobs running recently
- High CMS network usage on Friday was due to lazy-download not being specified in the reco config file
LHCb
- Other Tier-1s (IN2P3, PIC) also reporting low job efficiencies. Suspect LHCb user or application problems.
- LHCb confirmed that RAL WMS not critical for UK operations.
- New release of DIRAC, so testing on SL5 VOBOX should be able to proceed shortly.
Resource Requests
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s)
Disk problems on FTS agents host | lcgfts01 | Scheduled | 20100121 07:00 | 20100121 09:00 | LHC
Requirements and Blocking Issues
Description | Required By | Priority | Status
Hardware for testing LFC/FTS resilience | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; hardware request made through the RT Fabric queue
Hardware for Testbed | | High | Required for change validation, load testing, etc., and for phased rollout (which replaces PPS)
Hardware for SCAS servers | Feb 1 2010 | High | Hardware required for production SCAS servers; must be in place by end of February
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | | Medium | Hardware required for a CREAM CE serving non-LHC VOs
Pool accounts for SuperB VO | | Medium | Required to enable the SuperB VO on the batch farm
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun)
- AoD: