RAL Tier1 weekly operations Grid 20100329
From GridPP Wiki
Derek ross (Talk | contribs) |
Latest revision as of 09:39, 31 March 2010
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
HW for SL5 CMS Phedex Vobox | | | High | Required to replace the existing SL4 machine |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Support survey sent out to all VOs which use RAL Tier-1. Responses requested by 23rd April.
- Talks from Batch system training
- CMS will start using GGUS team tickets instead of Savannah tickets for Tier-1s
Highlights for Tier-1 VO Liaison Meeting
- Support survey sent out to all VOs which use RAL Tier-1. Responses requested by 23rd April.
- lcgce05 deployed for non-LHC VO access to SL5 WNs
Detailed Individual Reports
Alastair
- Watch for ATLAS problems during LHC first collisions.
- Add extra diskservers to ATLASGROUPDISK space token and set this up in TiersofATLAS.
- Change FTS transfer settings for Tier 2 channels.
Andrew
- Sent out support survey to VOs (responses requested by April 23rd) [Done]
- Added per-VO job monitoring of lcgce01 [Done]
- Sorting out gaps & problems in APEL publishing [Ongoing]
- Installing & setting up FTS monitor, including DN restriction
- I/O tests of official version of patches to go into CMSSW (skimming & reconstruction) [Done]
- CMS data ops
- Started running backfill at FNAL and CNAF [Ongoing]
- Cleaned up some old ProdAgent instances, installed some new ones (0_12_17_patch3)
- PPD staff meeting; batch-system training
Catalin
- non-LHC LFC schema clean up (w/ Carmine) [done]
- work on various Nagios checks on grid services hosts [ongoing]
- work on Dataguard replication (w/ Carmine) [ongoing]
- quattorise additional LFC frontends (w/ Ian) [ongoing]
- various grid services yum updates [ongoing]
- install squid on LHCb VOBOX
Derek
- Enabling new VO on ce.ngs host
- Publishing lcgce05
- Batch system training [Done]
- Writing Open day talk
Matt
- Write up production plans.
Richard
- Re-working the Grid Services Quattorisation Roadmap as a WIKI page [done]
- Working on proposal on intra-/inter-team communication to meet an action from the team away day
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
- Deployed one set of diskservers into the lhcbNonProd s/class and the other into cmsNonProd [done]
- Updating benchmarking tool to meet requirements of pre-prod stress testing
Mayo
- TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
- Create Batch job to run TSBN backend script and update web interface automatically [Done]
- implement feedback into TSBN web interface
- Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
- write user experience report on NGS certificate wizard project [Done]
- writing and configuring Nagios NRPE plugins
VO Reports
ALICE
ATLAS
CMS
- CMS will start using GGUS team tickets instead of Savannah tickets for Tier-1s
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Catalin (Mon-Wed), Derek (Thu - Sun)
- AoD: