RAL Tier1 weekly operations Grid 20100719
From GridPP Wiki
Revision as of 12:23, 21 July 2010 by Derek ross
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress |
WMS03 | 16-Jul-2010 | 16-Jul-2010 | Non-LHC | low | Was unresponsive and rebooted |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made |
#62179: Request for new CMS pool accounts | 16 July 2010 | | High | [16-07-2010] Request made |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Mayo has now left; please remove any access he may have had
- Only 2 Grid Team members in on Wed-Thu
- New CMS t1production role
- Batch farm full :-), causing issues for CMS :-(
Highlights for Tier-1 VO Liaison Meeting
- Investigating options for limiting Alice jobs after CMS ran work elsewhere over the weekend
- Progressing with enabling new CMS role on batch farm
- Roll out an upgrade of the top level BDIIs next week (At-risk)
- 2 crashes of WMS03 with no obvious cause
Detailed Individual Reports
Alastair
- Working on ATLAS software server on /afs [ongoing]
- Written script to identify unavailable files when a disk server is taken out of production. [testing]
- Looking into slow LHCb transfers between SARA and RAL. (fix with James T now)
- Working to improve pbsjobs database to allow easier monitoring of production work.
- Working on ATLAS Frontier service, monitoring and backup.
Andrew
- Investigated slow transfers of an important MC dataset to many T2s [Done]
- Added Ganglia monitoring of CMS data transfers (volume per day & rates) to/from CERN, T1s, T2s [Done]
- Preparations for new CMS t1production role
- Working on change-control form & implementation plan; submitted request to Fabric for new pool accounts
- Updated FTS monitor to v1.4 [Done]
- Understanding disk & tape capacity calculations
- CMS data ops
- MC production at CNAF
- backfill (MC production) at RAL; testing CREAM CEs
- Data reprocessing at FNAL
- Try glite-APEL installation in testbed [To do]
- Write script for checksum checking of last file on T10KB tapes [To do]
Catalin
- Python course (Mon - Thu) RAL R1
Derek
- Sync'd testbed against QWG profiles [Done]
- Rebooted lcgwms03 [Done]
- Debugging t2k job submission issues
- CIC broadcast for lcgce02 decommission [Done]
- Writing Strawman Cloud strategy [ongoing]
- Sync production templates against QWG
Matt
Richard
- Submitted change control request for updating RAL top-level BDIIs [done]
- Working on the "team status page" being developed as an action from team awayday [ongoing]
- Reviewing G/S process documentation [ongoing]
- Developed a tool to help with automating the wiki page on grid middleware versions [done]
- CASTOR items:
- Continue trying to get 2.1.9 functional tests running on pre-prod
VO Reports
ALICE
- waiting for CREAM-CE 1.6 deployment at RAL
- cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7
ATLAS
CMS
- Because CMS was unable to get any job slots at RAL, v2 of an urgent workflow was run at FNAL. The v1 finally generated at RAL has been deleted.
- Started to use CREAM CEs again due to upgrade of CNAF WMSs; no problems so far.
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin
- Grid OnCall:
- AoD: