RAL Tier1 weekly operations Grid 20100322
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
non-LHC LFC schema clean-up | lfc.gridpp.rl.ac.uk, Somnus RAC | SD | Thu 25 Mar 09:00 | Thu 25 Mar 14:00 | non-LHC |
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Added per-vo jobs CE monitoring to ganglia
- non-LHC LFC schema clean up (w/ Carmine)
- Deploying SCAS servers and glexec
- LHC status: preparations for 7 TeV collisions continuing; 900 GeV collisions are possibly planned but not scheduled yet
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Investigate ways of installing ATLAS software in a new AFS test area.
- Monitor ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
- Continue ATLAS disk deployment.
Andrew
- Wrote script & set up cron job for checking APEL-PBS consistency daily [Done]
- Removed a host from renewers & retrievers host list [Done]
- Modified PhEDEx debug/dev instances then prod instance to use CERN FTS 2.2 server [Done]
- Disk server deployment: gdss114-117 to cmsNonProd then cmsWanIn [Done]
- Added per-vo jobs CE monitoring to ganglia [Done, except for lcgce01 which is ongoing]
- Working out how to install FTS monitor on lcgwww [Ongoing]
- CMS data ops
  - Running backfill at RAL and IN2P3 [Ongoing]
  - Ran more MC production at RAL (using cmsHotDisk) [Done]
Catalin
- non-LHC LFC schema clean up (w/ Carmine)
- work on various Nagios checks on grid services hosts
- work on Dataguard replication (w/ Carmine) [ongoing]
- quattorise additional LFC frontends (w/ Ian) [ongoing]
- various grid services yum updates [ongoing]
Derek
- Deploying SCAS servers and glexec
- Change Control for lcgce05 [Done]
- Deploying infrastructure hosts for testbed
- Writing talks for batch system training
- Enabling new vo on ce.ngs host
Matt
- Revise Tier-1 talk.
- Write up production plans.
- Write batch system training material.
- Upgrade FTS to 2.2.3. [Done]
- Update resource profiles for Q2/10. [Done]
Richard
- Using stress-testing script developed for CASTOR to test behaviour of new BDII server
- Re-working the Grid Services Quattorisation Roadmap as a wiki page
- Working on proposal on intra/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
Mayo
- TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
- Create Batch job to run TSBN backend script and update web interface automatically [Done]
- Implement feedback into TSBN web interface
- Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
- Write user experience report on NGS certificate wizard project [Done]
- Writing and configuring Nagios NRPE plugins
VO Reports
ALICE
ATLAS
CMS
- The TTreeCache patches will be put into a patched version of CMSSW soon. This should mean that if lazy-download is not used, there will be no problems. RAL has been chosen as the CASTOR site to be tested. Not known yet when this will take place.
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon - Sun excl Wed), Catalin (Wed)
- AoD: Catalin (Wed)