RAL Tier1 weekly operations Grid 20100222
From GridPP Wiki
Matt hodges (Talk | contribs) |
Latest revision as of 13:25, 24 February 2010
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon |
CRL issues for SL4 batch | Tue 16 Feb 2010 | Wed 17 Feb 2010 | non-LHC | medium | solved; CRLs updated on NFS server |
ATLAS s/w server overloaded | Sun 21 Feb 2010 | Ongoing | ATLAS | medium |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
RAID and memory issues | lcgce07 and lcg0280 | SD | Fri 19 Feb 2010 14:00 | Tue 23 Feb 2010 16:00 | CMS, Alice, LHCb |
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue. Production hardware will be available soon. [2010-02-22] Test hardware available; some config tweaks needed. |
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
Hardware for additional SL4 LFC frontends | | | Medium | Required to improve resilience of existing LFC services |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Ongoing load issues on ATLAS s/w server.
- ATLAS 4GB jobs are having minimal effect on blocked job starts (~1%).
- FTS 2.2 released; starting to test upgrade path.
- Disk deployment: 100TB requested for ATLAS to enable LHCb drain to commence; capacity needed in SimStrip, which was filled over the weekend.
Highlights for Tier-1 VO Liaison Meeting
- Disk deployment: 100TB requested for ATLAS to enable LHCb drain to commence; capacity needed in SimStrip, which was filled over the weekend.
- FTS2.2 testing ongoing; CNAF experiencing problems with upgrade.
Detailed Individual Reports
Alastair
- Work with Brian + Chris on re-deploying disk servers to ATLAS space tokens. [Ongoing]
- Write scripts to monitor effect of 4GB memory limit change on batch system. [Done]
- Monitor/investigate ATLAS MC production and re-processing currently going on at RAL. [Ongoing]
Andrew
- Running backfill at RAL (re-reco of BeamCommissioning09 Cosmics) [ongoing]
- Ran a production workflow: reprocessing of a Summer09 MC sample (generated data is custodial) [Done]
- Added ganglia monitoring of usage of CMS tape pools (per tape pool & combined stack plot) [Done]
- Testing new disk server on CASTOR pre-prod instance with CMSSW (skimming & reconstruction) [ongoing]
- Added .tr (for T2_TR_METU) to CLOUD-CMS-CERN FTS channel [Done]
Catalin
- 're-certified' ATLAS Frontier after 3D migration (with Alastair) [done]
- install APEL patches on CEs [ongoing]
- work on LFC schema tidying up (with Carmine) [ongoing]
- quattorise additional LFC frontends (with Ian; pending HW provisioning)
- lcgce07 downtime - disk replacement, memory swap
Derek
- A/L
Matt
- FTS2.2:
  - Look at GGUS bug regarding checksum scenarios [Done]
  - Test upgrade path from FTS2.1 to FTS2.2 on orisa
- Disk deployment: request 100TB for ATLAS to enable LHCb drain to commence [Done]
- Tier-1 Open Day talk for Grid Services
- Test FTS functionality for T2K [Done]
- CA updates on service nodes (including CEs in Derek's absence) [Done]
- Test APEL publication with latest patches
- Request dedicated diskpool for T2K (depends on allocation)
Richard
- Submitted change control request for rolling out quattorised BDII server [Done]
- Now working with Ian C to "factorise" the template so that non-machine-specific items are distributed to the appropriate points in the hierarchy of templates
- Working on the Grid Services Quattorisation Roadmap
- Writing a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
  - Working on benchmarking plan to establish baseline performance before upgrading to new CASTOR release(s)
  - Set up a "Plan B" CASTOR LSF server in case the need arises [Done]
Mayo
- Adding bar chart to Metric system [Done]
- Admin interface for Metric System [Done]
- TSBN spreadsheet web interface and backend automation script
- writing and configuring Nagios nrpe plugins
VO Reports
ALICE
ATLAS
CMS
- Problems over the past week: Oracle problems affecting transfers (x2); writes to cmsWanIn pending for too long causing transfers to fail (x4); tape migration; tape recall problems (one tape); gdss364 problems caused jobs to fail on 19th-20th Feb
- Transfers to/from RAL over the past week:
  - from CERN: 13.1 TB (Commissioning10 cosmics)
  - from T2s: 2.4 TB
  - to T1s: 19.5 TB
  - to T2s: 20.4 TB
  - migrated to tape: 25.5 TB
- CPU usage over the past week:
  - backfill (re-reco) & MC reprocessing: 7384 KSI2K days, CPU efficiency 92%
  - skimming: 464 KSI2K days, CPU efficiency 51%
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Matt (Mon-Sun)
- AoD: