RAL Tier1 weekly operations Grid 20110110
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s)
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Last week: at the ATLAS UK meeting at RHUL
- ATLAS TaskForce [ongoing]
- Draining SL08 disk servers deployed to ATLAS service classes. [Done]
- Working on ATLAS permission change. [On hold]
Andrew
- December accounting [Done]
- Updated Maui config (Jan 2011 allocations; converted fully to HS06) [Done]
- Investigating APEL-PBS inconsistency problems and APEL OutOfMemory problems [Ongoing]
- Investigating CMS problems (Job Robot failures, files not migrating, file staging failures)
- CMS data ops
- data & MC rereco at RAL, IN2P3, KIT, PIC [Ongoing]
Catalin
- work on squid deployments for ATLAS [ongoing]
- assist ATLAS FTS requests [ongoing]
- kernel updates on non-Quattor machines [done]
- apply errata templates on Quattorised machines
Derek
- Deploying testbed batch system [ongoing]
- Debugging issue with Magic jobs [ongoing]
- Initial rollout of setting Operating System config on pbs mom on batch workers to sl5 [ongoing]
- Removed the reservation and increased the job limit for atlassgm to 10 to allow more cvmfs validation jobs over the holiday
Matt
- Catch-up: metrics, change controls, etc.
- Deploying PBS JobMon monitoring tools. [Stalled]
- Test FTS SRM/GridFTP ratio configuration. [Stalled]
Richard
- Wrote a gmetric tool to measure the Quattor deploy hit rate, i.e. the percentage of deploys (as found in the SVN repo) that were "seen" by a machine [Done]
- Working prototype of a tool for automatically checking middleware baselines now in place [Done]
- Developing a set of Quattor templates for an ARGUS server [Ongoing]
- Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
- Working on the "team status page" being developed as an action from team awayday [Ongoing]
- Reviewing G/S process documentation [Ongoing]
- CASTOR items:
- Added an LSF server to the "cert-in-a-box" cluster. [Ongoing]
VO Reports
ALICE
ATLAS
CMS
- Mid Week Global Runs start 24th January
- 64-bit version of CMSSW will be tested at sites
LHCb
OnCall/AoD Cover
OnCall Rota
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: