RAL Tier1 weekly operations Grid 20100301
From GridPP Wiki
Revision as of 13:11, 3 March 2010 by Matt hodges
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon |
ATLAS s/w server overloaded | Sun 21 Feb 2010 | Thu 26 Feb 2010 | ATLAS | medium | |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for testing LFC/FTS resilience. DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. | | | High | Production hardware will be available soon. [2010-02-22] Test hardware available; some config tweaks needed. |
Hardware for Testbed. Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). | | | Medium | Have initial hardware. [2010-02-22] More hardware expected by end of March. |
Hardware for additional SL4 LFC frontends. Required to improve resilience of existing LFC services. | | | Medium | |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Confirming upgrade procedure for FTS2.1 to FTS2.2.
- CMS: Cosmics data taking continued, then splash events over the weekend. 27 splashes from beam 1, 30 splashes from beam 2.
Highlights for Tier-1 VO Liaison Meeting
- Disk deployments for ATLAS and LHCb. No overall change for ATLAS.
- FTS2.2 upgrade path tested, and endpoint available for further testing.
Detailed Individual Reports
Alastair
- Work with Brian on deploying disk servers for the new ATLAS space token requests.
- Monitor the first ATLAS power users who have started to use the Tier-1.
- Monitor/investigate ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
Andrew
- CMS backfill
  - Continued running at RAL; also started at ASGC
  - Will be responsible for running at IN2P3 from 1st March (as well as RAL)
- Investigating why the skimming jobs currently running on Commissioning10 data have very low CPU efficiency
- Disk server deployment (gdss119 to lhcbNonProd, gdss393,414,415 to atlasNonProd then atlasSimStrip) [Done]
- Added a DN (for CMS) to renewer/retriever host list on RAL MyProxy [Done]
- Deleted CMS data from /store/unmerged & /store/testfile-put-*.txt files [Done]
- Made a number of adjustments to maui.cfg due to the ATLAS software disk problems [Done]
- LHCb disk server deployment [To do]
Catalin
- lcgce07 downtime - disk replacement, memory swap [done]
- install APEL patches on CEs [ongoing]
- work on LFC schema tidying up (w/ Carmine) [ongoing]
- quattorise additional LFC frontends (w/ Ian - pending HW provisioning)
- enable ngs.ac.uk on LFC catalog
Derek
- A/L
Matt
- Tier-1 talk
- FTS2.2
  - Confirming upgrade procedure for FTS2.1 to FTS2.2.
  - Initial test of upgrade path from FTS2.1 to FTS2.2 on orisa [Done]
- CA updates (again) on service nodes (including CEs in Derek's absence)
- Test APEL publication with latest patches [Done]
- Request dedicated diskpool for T2K (depends on allocation)
Richard
- Checking behaviour of new/old BDII servers to ensure that important information is not being suppressed
- Working on the Grid Services Quattorisation Roadmap
- Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
  - Adding support for lcg-cp command to stress testing suite
Mayo
- TSBN spreadsheet web interface (first version) [Done]
- TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet
- Create batch job to run TSBN backend script and update web interface automatically
- Writing and configuring Nagios NRPE plugins
VO Reports
ALICE
ATLAS
CMS
- Problems with jobs failing (backfill and skimming) due to gdss364 RAID card issue
- Cosmics data taking continued
- Splash events over the weekend: 27 splashes from beam 1, 30 splashes from beam 2
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Catalin (Mon, Wed-Sun)
- AoD: