RAL Tier1 weekly operations Grid 20100308
From GridPP Wiki
Revision as of 13:20, 10 March 2010 by Matt hodges
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. Production hardware will be available soon. [2010-02-22] Test hardware available; some config tweaks needed. |
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- ATLAS test of CERN SRMs tomorrow (08:00 to 15:00).
- Submit FTS2.2 change control.
Highlights for Tier-1 VO Liaison Meeting
- FTS2.2 upgrade scheduling.
- Testing CREAMCE for non-LHC VOs.
Detailed Individual Reports
Alastair
- Investigate ways of installing ATLAS software in a new AFS test area.
- Monitor ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
- Continue ATLAS disk deployment.
Andrew
- Installed GSL rpms (incl devel) onto lcgui01 (LHCb request) [Done]
- Disk server deployment: 10 from lhcbNonProd to lhcbDst; 1 from lhcbNonProd to lhcbRawRdst [Done]
- Changed PhEDEx debug instance to use the test FTS 2.2 endpoint; FTS 2.2 adjustments for Chris Brew [Done]
- CMS data ops
  - Ran reprocessing of some Commissioning10 cosmics data (prompt reco failed at Tier-0, reprocessing required at RAL) [Done]
  - Ran two MC reprocessing workflows [Ongoing]
  - Continued backfill at RAL; completed backfill at ASGC [Ongoing]
- Next week
  - Do the February 2010 UB schedule
  - Prepare draft questions for RAL Tier-1 VO survey
  - Add per-VO ganglia monitoring for CEs
  - Add a page showing additional channel information for FTS 2.2
Catalin
- enabled ngs.ac.uk on LFC catalog [done]
- work on LFC schema tidying up (w/ Carmine) [ongoing]
- work on Dataguard replication (w/ Carmine) [ongoing]
- quattorise additional LFC frontends (w/ Ian) [ongoing]
- tidying up Nagios configurations (ALICE VOBOX, CE)
Derek
- Debugged problem with magic job submissions
- Deploying SL5 CREAMCE for non-LHC VOs
Matt
- Tier-1 talk.
- FTS2.2:
  - Change Control.
  - t2k.org configuration problems.
  - Confirming upgrade procedure for FTS2.1 to FTS2.2. [Done]
- SL5 CREAM CE installation.
- Update resource profiles for Q2/10.
Richard
- Checking behaviour of new/old BDII servers to ensure that important information is not being suppressed
- Working on the Grid Services Quattorisation Roadmap
- Working on a proposal on intra-/inter-team communication to meet an action from the team away day
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
  - Adding ability to spreadsheet results of new benchmarking tool
Mayo
- TSBN spreadsheet backend script to copy data from castoradm1 to TSBN spreadsheet [Done]
- Create Batch job to run TSBN backend script and update web interface automatically [Done]
- implement feedback into TSBN web interface
- Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
- Begin collaboration with SCT on NGS certificate wizard project
- writing and configuring Nagios NRPE plugins
VO Reports
ALICE
ATLAS
CMS
- Continuous data taking remained in progress
- Reprocessing of some Commissioning10 cosmics data completed
- CASTOR problems with MC reprocessing due to "hot" files
  - Resolved by setting up the cmsHotDisk service class
- Backfill, rereco & MC reprocessing jobs over the past week: 20575 jobs; 5616 KSI2K days CPU time; CPU efficiency 80% (low due to problem with "hot" files)
- Transfers to/from RAL over the past week:
  - from CERN: 4.6 TB
  - from T2s: 7.0 TB
  - from T1s: 1.4 TB
  - to T1s: 5.0 TB
  - to T2s: 3.9 TB
  - migrated to tape: 15.6 TB
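
As a quick cross-check of the CMS transfer figures above, the inbound and outbound totals can be summed with a short Python sketch. The variable names are illustrative and do not come from any RAL tooling. Treating "migrated to tape" as a separate figure rather than a network transfer is an assumption.

```python
# Weekly CMS transfer volumes in TB, copied from the list above.
# Names are illustrative only; this is not part of any site tooling.
inbound_tb = {"CERN": 4.6, "T2s": 7.0, "T1s": 1.4}
outbound_tb = {"T1s": 5.0, "T2s": 3.9}
tape_tb = 15.6  # assumed to be reported separately from network transfers

# Round to one decimal place to match the precision of the report
# and to hide floating-point noise from the additions.
total_in = round(sum(inbound_tb.values()), 1)
total_out = round(sum(outbound_tb.values()), 1)

print(f"Inbound to RAL:    {total_in} TB")
print(f"Outbound from RAL: {total_out} TB")
print(f"Migrated to tape:  {tape_tb} TB")
```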
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek
- AoD: