RAL Tier1 weekly operations Grid 20100419
From GridPP Wiki
Latest revision as of 14:51, 19 April 2010
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy |
LFC/FTS downtime | 12-Apr-2010 ~11:30 | 12-Apr-2010 ~14:30 | all | high | Failure of one RAC node in the database behind the LFC and FTS services |
APEL publishing problem | | | all | low | APEL publishing isn't working |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
HW for SL5 CMS Phedex Vobox | 19-Mar-2010 | | High | Required to replace the existing SL4 machine |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
- Working on setting up and testing ATLASGROUP disk at RAL.
- Working with B-Physics Group on group analysis requirements (TAG based analysis).
- Looking into ATLAS PFC (Pool File Catalogue) problems.
Andrew
- FTS monitor 1.3 now publicly available on lcgwww (modified code so that DNs are not displayed)
- Attended GridPP 24 on Wed
- CMS data ops
  - Running Spring10 MC redigi/rereco workflows at FNAL & CNAF
- At Bristol on Wed 21st for UK CMS computing F2F
Catalin
- GridPP24 (last Wed-Thu) [done]
- GOCDB tidying-up (VOBOXes) [done]
- install and configure squid on LHCb VOBOX [ongoing]
- work on various Nagios checks on grid services hosts (NOCALLOUT checks) [ongoing]
- Alice VOBOX updates (alien, glite, OS) [ongoing]
Derek
- Consulted about effects due to upcoming uid change for pilot roles [Done]
- Wrote change control requests for CE reconfiguration [Done]
- Testing config changes to CE for mapping updates for glexec [Done]
- Testing queue addition to batch system for atlas [Done]
- At GridPP + Deployment Board (Wed-Fri) [Done]
- Evaluating cloud technology for Grid Services testbed use
- Documenting Grid Services testbed
Matt
- Look at draft User Board allocations; update CPU/disk capacity profiles
- Write up production plans [Done]
Richard
- 2 days at GridPP meeting (Wed-Thu)
- Applied latest OpenLDAP RPMs to quattor-built BDII -- currently running timings on this machine to check behaviour
- Working on a proposal on intra/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Reviewed CASTOR Server Recovery document with Cheney
  - P/P stress testing found Oracle error "ORA-00257: archiver error. Connect internal only, until freed" (caused by lack of disk space on Oracle server)
  - Continuing P/P stress testing
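
The ORA-00257 error above was traced to a lack of disk space on the Oracle server. A free-space check in the style of a Nagios plugin could flag that condition before the archiver stops. The following is only a minimal sketch, assuming Python 3; the function names `classify_free_space` and `check_path` and the 20%/10% thresholds are hypothetical and are not taken from the team's actual Nagios configuration. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_free_space(free_fraction, warn=0.20, crit=0.10):
    """Map a free-space fraction (0.0 to 1.0) to a Nagios status code.

    The default thresholds are illustrative assumptions only.
    """
    if not 0.0 <= free_fraction <= 1.0:
        return UNKNOWN
    if free_fraction < crit:
        return CRITICAL
    if free_fraction < warn:
        return WARNING
    return OK


def check_path(path, warn=0.20, crit=0.10):
    """Return (status, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    free_fraction = usage.free / usage.total
    status = classify_free_space(free_fraction, warn, crit)
    return status, f"DISK {LABELS[status]} - {free_fraction:.0%} free on {path}"
```

A wrapper script would print the returned message and call `sys.exit(status)` so that Nagios or NRPE can read the result; the path to monitor (for example the archive log destination) would have to be supplied by whoever deploys it.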
Mayo
- Implement feedback into TSBN web interface
- Set up the scripts that update the TSBN interface to run as scheduled jobs on a Windows machine
- Writing and configuring Nagios nrpe plugins [Done]
- Certificate viewer for NGS cert wizard
- Write PDU power controller query script [Done]
- Write a script to turn PDU ports off
VO Reports
ALICE
VOBOX sites updated to AliEn v2.18
RAL has been reconfigured to use both available VOBOXes in failover mode (no more use of lcg-CE, only CREAM-CE)
ATLAS
- Not heard anything about LHC technical stop. (Maybe I missed something/it will be announced tomorrow)
- LHC continuing to slowly increase luminosity.
- ATLAS Software and computing week. (More chaotic than usual due to Volcano)
- Fast 'fast re-processing' could start this week.
- ATLAS announced that it would like all disk space in production by June 1st.
CMS
- Spring10 reprocessing of Summer09 MC samples almost complete (all Tier-1s involved)
- This week RAL has the number 1 position in the CMS site readiness ranking
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek (Mon-Sun excl Wed), Catalin (Wed)
- AoD: