RAL Tier1 weekly operations Grid 20100510
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy |
RAID software failure on lcglb01 | 4-Apr-2010 | 7-Apr-2010 | all | low | RAID configuration re-built with the same HDDs |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS). Have initial hardware. [2010-02-22] More hardware expected by end of March. |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Request Viglen 08 disk deployment
- Propose to UB schedule for decommissioning of SL4 capacity
- SuperB requested (1TB) storage
Highlights for Tier-1 VO Liaison Meeting
- Disk deployments to meet 2010 pledges actioned
- SL4 decommissioning schedule agreed by User Board
- SuperB request for CASTOR/SRM configuration
- Request to co-schedule remaining two CEs for pilot role reconfiguration being considered
Detailed Individual Reports
Alastair
- Working on ATLAS software server upgrade (testing with Jonathan starting tomorrow)
- Working on setting up and testing ATLASGROUP disk at RAL.
- Working with B-Physics Group on group analysis requirements (TAG based analysis).
- Looking into ATLAS PFC (Pool File Catalogue) problems.
Andrew
- APR [Ongoing]
- Started April accounting [Ongoing]
- Added new FTS endpoint [Done]
- Investigating FTS groups [Ongoing]
- Regenerated LoadTest files with James J. [Done]
- CMS data ops
- Completing reprocessing at FNAL & CNAF
- Started running backfill at RAL & PIC
- Installing & setting up PhEDEx on SL5 VOBOX [Ongoing]
- Learnt how to use the DBS Python API
Catalin
- tidy up Nagios monitoring [ongoing]
- install and configure squid on LHCb VOBOX [ongoing]
- LFC/FTS replication (w/ Carmine) [ongoing]
- Frontier updates
- work on Grid Services change control approved exceptions [done]
- work on RAID issue on lcglb01 [done]
- APR [done]
Derek
- Intervention on lcgce06 for glexec
- Testing CREAM CE 1.6
- ce.ngs.rl.ac.uk removed from site bdii [Done]
- APR [Done]
- Security Service Challenge 4 writeup [Done]
Matt
- APRs [Ongoing]
- Request Viglen 08 disk deployment [Done]
- Capacity Planning (meeting with Andrew L)
- Site BDII performance problems
- Propose to UB schedule for decommissioning of SL4 capacity
Richard
- APR [Done]
- Looking at the site-bdii timeout problem
- Working on a proposal on intra-/inter-team communication to meet an action from the team awayday
- Reviewing G/S process documentation
- Further Nagios items from the to-do list (https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/NagiosTasksToDo)
- CASTOR items:
- Running remaining p/p stress tests
Mayo
- Implement feedback into TSBN web interface
- Set up scripts that update TSBN interface to run as scheduled jobs on a Windows machine
- Writing and configuring Nagios NRPE plugins [Done]
- Certificate viewer for NGS cert wizard
- Write PDU power controller query script [Done]
- Write a script to turn PDU ports off
VO Reports
ALICE
- Would like CREAM-CE v1.6 to be installed ASAP
ATLAS
CMS
- Starting the 'train' model: every Thursday 8pm GVA a new re-reco pass will be carried out at T1s, instead of waiting for requests.
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Catalin
- AoD: