RAL Tier1 weekly operations Grid 20100412
From GridPP Wiki
Revision as of 14:22, 12 April 2010 by Mayo agard-olubo
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; new CREAMCE version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy. |
LFC/FTS downtime | 12-Apr-2010 ~11:30 | 12-Apr-2010 ~14:30 | all | high | Failure of one RAC node in the database behind the LFC and FTS services |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
HW for SL5 CMS PhEDEx VOBOX | | | High | Required to replace the existing SL4 machine |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- CMS started MC reprocessing at all Tier-1s (~500 workflows) on 9th April. No problems at RAL so far.
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- All disk servers for ATLAS deployed or back with Fabric.
- Working on ATLAS software server upgrade
- Working on setting up and testing ATLASGROUP disk at RAL.
Andrew
- Deployed gdss121,136,137 from cmsNonProd to cmsWanOut [Done]
- Upgraded prod & debug instances of PhEDEx to 3_3_0 [Done]
- Added new DESY VOMS certificate to tier1-yaim-config [Done]
- Added ganglia CPU efficiency plots with 10-minute time resolution; added new gmetric to lcgbatch01 [Done]
- Added CE metrics to the grid services "special pages" in ganglia [Done]
- March accounting [Done]
- Made corrections to various scripts involving KSI2K to/from HEP-SPEC06 [Done]
- Deleted some CMS files from /store/unmerged [Done]
- VO support survey
  - 3 responses so far, one saying that they don't use RAL
  - Started preparing results document
- CMS data ops
  - Continued running backfill at FNAL and CNAF (moved to PA 0_12_17_patch3)
  - Running MC reprocessing workflows at FNAL and CNAF [Ongoing]
- From previous week:
  - Added new IN2P3 hosts to renewers/retrievers list [Done]
  - Converted FTS script to use Oracle DB rather than webpage [Done]
  - Updated tier1-vobox-config on the 3 CMS VOBOXes [Done]
  - Added per-VO job monitoring to lcgce05 [Done]
  - Upgraded PhEDEx dev instance to 3_3_0 (clean install) [Done]
  - Updated eff-stats.pl to generate eff-stats.csv in HEP-SPEC06 [Done]
  - CMS data ops
    - Ran urgent MC rereco at FNAL for media event [Done]
    - Continued backfill at FNAL & CNAF [Done]
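
The KSI2K to/from HEP-SPEC06 corrections noted in Andrew's list depend on a fixed conversion factor. The following is only a minimal sketch, assuming the WLCG-agreed factor of 4 HEP-SPEC06 per kSI2K. The function and constant names are hypothetical and are not taken from the actual Tier-1 scripts.

```python
# Assumption: 1 kSI2K = 4 HEP-SPEC06 (WLCG-agreed conversion factor).
# Names below are illustrative only, not those used in the RAL scripts.
HS06_PER_KSI2K = 4.0


def ksi2k_to_hs06(ksi2k: float) -> float:
    """Convert a capacity or usage figure from kSI2K to HEP-SPEC06."""
    return ksi2k * HS06_PER_KSI2K


def hs06_to_ksi2k(hs06: float) -> float:
    """Convert a capacity or usage figure from HEP-SPEC06 to kSI2K."""
    return hs06 / HS06_PER_KSI2K


if __name__ == "__main__":
    # Round-trip check on an arbitrary example value.
    value_ksi2k = 2.5
    value_hs06 = ksi2k_to_hs06(value_ksi2k)
    print(value_ksi2k, "kSI2K =", value_hs06, "HEP-SPEC06")
    print(value_hs06, "HEP-SPEC06 =", hs06_to_ksi2k(value_hs06), "kSI2K")
```

Keeping the factor in a single named constant means a script that mixes the two units only has one place to correct.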
Catalin
- Work on various Nagios checks on grid services hosts (new BDII, NOCALLOUT checks) [Ongoing]
- Work on Dataguard replication (with Carmine) [Ongoing]
- Install squid on LHCb VOBOX
- GridPP24 (Wed, Thu)
Derek
- Deployed 5 new top-level BDIIs [Done]
- Stopped publishing LHCb on grid1000M queue [Done]
- Testing config changes to CE for mapping updates for glexec
- Testing queue addition to batch system for ATLAS
- At GridPP + Deployment Board (Wed-Fri)
Matt
- Write up production plans.
Richard
- 2 days at GridPP meeting (Wed-Thu)
- Applied latest OpenLDAP RPMs to quattor-built BDII; now watching this machine to check its behaviour
- Re-working the Grid Services Quattorisation Roadmap as a wiki page [Done]
- Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing Grid Services process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Updating benchmarking tool to meet requirements of pre-prod stress testing [Done]
  - Found database performance problem when running stress tests on pre-prod (missing table indexes)
  - Restart the pre-prod stress-testing run
Mayo
- Implement feedback into TSBN web interface
- Set up scripts that update the TSBN interface to run as scheduled jobs on a Windows machine
- Writing and configuring Nagios NRPE plugins [Done]
- Certificate viewer for NGS cert wizard
- Write PDU power controller query script
VO Reports
ALICE
- Upgrade AliEn v2.18 on VOBOXes
ATLAS
CMS
- For MinimumBias during 7 TeV stable beams, average event size: 152 kB RAW, 125 kB RECO
- (8th April) Request for tape families to be set up for custodial & non-custodial data for upcoming real high-energy physics runs
- (9th April) Started reprocessing of all Summer09 MC samples at all Tier-1s. No problems at RAL so far.
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun excl Wed)
- Grid OnCall:
- AoD: