RAL Tier1 weekly operations Grid 20100913
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
---|---|---|---|---|---
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress [19-Jul-2010] So far everything looks good |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
gLite-WMS update + maintenance | lcgwms02 | | Thu 9 Sep 15:00 | Thu 16 Sep 15:00 | LHC |
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server, testing CVMFS
- 825 test jobs have been run.
- lcg0805 has been set up for production-style testing; need to add the queue to the ATLAS system.
- Writing script to graph transfer times for FTS transfers
- Working on HammerCloud test of CASTOR 2.1.9
- Analysis queue setup
- Need to copy DBrelease into pre-prod and replicate
- A/L Wednesday, Thursday and Friday
Andrew
- CMS CASTOR 2.1.9 testing
- Investigated problems with loadtest injection from RAL to Imperial in dev instance [Done]
- Investigated lazy-download problem with CASTOR 2.1.7 & 2.1.9 [Done]
- Updated published CPU capacity [Done]
- Tested the two SL5 CMS Squids by running test rereco jobs [Done]
- Testing gLite 3.2 FTS test instance using PhEDEx debug instance [Done]
- I/O testing with CMSSW 3.8 series with new I/O settings [Next week]
- CMS data ops
- Running data rereco preproduction at RAL [Ongoing]
Catalin
- add new frontends to non-LHC LFC alias [done]
- add new frontends to LHCb LFC alias [done]
- gLite updates WMS01 LHC [done]
- gLite updates WMS02 LHC [ongoing]
- improve WMS monitoring [ongoing]
- add new frontends to ATLAS LFC alias
- work on improving ganglia monitoring for Grid Services [ongoing]
- work on Helpdesk MySQL database migration [ongoing]
Derek
- Catching up
- CREAM CE quattor profile [ongoing]
- Investigating CREAM CE instability [ongoing]
- Deployed quattorised sudo config
- Refactored quattorised atlasbackup configuration
- Intervened on lcgce01 over the weekend (11-12) to resolve a job submission issue
Matt
- Capacity Signoff meeting. [New]
- Further testing of Quattorised FTS FEs. [Ongoing]
- Quattorisation of MyProxy nodes (write up Change Control). [New]
- Assisting Richard with Top BDII problems. [Done]
- Analysis of LHCb job efficiencies during disk server problem period. [Done]
- Change Controls for FTS FE updates. [Done]
- Quattorisation of FTS Agents host. [Done]
Richard
- Some clean-up tasks after last week's upgrade to the RAL top-level BDIIs
- Working on the "team status page" being developed as an action from team awayday [ongoing]
- Reviewing G/S process documentation [ongoing]
- CASTOR items:
- Helped Cheney with quattor issues building head nodes for facilities instance
VO Reports
ALICE
ATLAS
CMS
- CMS Daily Metric for RAL was ERROR on 11 Sep due to a worker node with a read-only /pool filesystem causing SAM tests and Job Robot jobs to fail.
- CMS will start producing the AOD when data taking resumes. This was always in the plan, but will be implemented now. It will result in a modest increase (~10%) in the rate from CERN to Tier-1s.
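
The read-only /pool incident above is the kind of fault that a simple check of the mount table can flag. The following is a minimal sketch only, not the monitoring actually used at RAL: the function name `read_only_mounts` and the sample mount table are hypothetical, and it assumes input in the Linux `/proc/mounts` format.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'.

    Expects text in the /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            read_only.append(mount_point)
    return read_only


# Hypothetical sample data for illustration; not taken from a real worker node.
SAMPLE = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /pool ext3 ro,relatime 0 0
"""

if __name__ == "__main__":
    print(read_only_mounts(SAMPLE))
```

On a worker node the same function could be given the contents of `/proc/mounts`, and a non-empty result for `/pool` would be a reason to take the node out of production before test jobs fail on it.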
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Fri-Sun)
- Grid OnCall: Derek (Mon-Thu)