RAL Tier1 weekly operations Grid 20100802
From GridPP Wiki
Matt hodges
Latest revision as of 14:54, 2 August 2010
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAM CE | 2-Feb-2010 | | CMS | Medium | [10-Feb-2010] WMS patch available soon; new CREAM CE version available soon. [07-Apr-2010] CMS tests have shown that the WMS patches resolve the problem; still waiting for the patch to be installed on the production WMSs in Italy. [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress. [19-Jul-2010] So far everything looks good. |
FTS02 | 21-Jul-2010 | | All | High | SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange a swap-out. |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Working on testing FTS timeout limits.
- Understanding CPU/disk capacities
- Build gLite3.2 FTS test node
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server [ongoing]
- Working on ATLAS Frontier service, monitoring and backup.
- Working on testing FTS timeout limits.
- Working on ATLAS B-Physics software code.
Andrew
- Implementing & testing CMS t1production role [Done]
- CMS storage consistency check; deleted 7 TB of dark data [Done]
- Added VOBOX proxy renewal daemon restarter to SL5 VOBOX [Done]
- Updated PhEDEx Dev instance to 3_3_2 [Done]
- Checking checksums (with PhEDEx) of the last file written to each T10KB tape [Done]
- WMS bulk submission testing - missing jobs on CREAM CEs
- Understanding CPU/disk capacities
- CMS Data Ops
- Data rereco & skims at RAL & FNAL [Done]
- Update PhEDEx prod & debug instances to 3_3_2 [To do]
- CMS CASTOR 2.1.9 testing [To do]
Catalin
- ATLAS Frontier monitoring [ongoing]
- LFC Quattor profiles (SL4 and SL5) [ongoing]
- Prepare various gLite updates
Derek
- Configured CMS t1 prod role on CEs [Done]
- Applied 1250 job limit to ALICE in Maui [Done]
- Investigated CREAM CE requirements functionality as a way to limit job CPU times per VO [Done]
- Configured NGS-UEE publishing for lcgce05
- Writing Strawman Cloud strategy [ongoing]
- CREAM CE Quattor profile [ongoing]
- Investigating CREAM CE instability
Matt
- Build gLite3.2 FTS test node
- Add timeout configuration to local FTS information (SVN)
- Audit WLCG pledges vs. deployed disk
- Finish first pass of ascii FTS docs; look at build system
Richard
- Implemented change on RAL top-level BDIIs [Done]
- Added pre-prod CIP into site BDII
- Working on the "team status page", an action from the team away day [ongoing]
- Reviewing G/S process documentation [ongoing]
- Further work on tool to help automate the wiki page on grid middleware versions [Done]
- CASTOR items:
  - Continue trying to get 2.1.9 functional tests running on pre-prod
  - Update the pre-prod resources being published to support VO testing
VO Reports
ALICE
ATLAS
CMS
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin
- Grid OnCall:
- AoD: