From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
lcgwms03 unresponsive | Fri 26 Nov ~18:05 | Mon 29 Nov 09:00 | non LHC | High | machine rebooted; back into production
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s)
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- At CERN for ATLAS software week!
Andrew
- Capacity planning system project [Ongoing]
- Wrote Python script to generate XML for CMS batch system monitoring; wrote & submitted change control form; deployed [Done]
- Wrote & deployed CMS tape usage monitoring script [Done]
- Investigated disappearance of files from gdss381 (cmsTemp) [Done]
- Updated errata on lcgvo-02-21, lcg0677, lcg0678, lcgapel0676 [Done]
- CMS data ops
  - Skims at FNAL [Done]
  - MC rereco at PIC, KIT [Done]
  - Data rereco preprod at RAL, IN2P3 [Ongoing]
  - WMAgent testing at all 7 CMS Tier-1s [Ongoing]
- Upgrade CMS squids to frontier-squid-2.7.STABLE9-5 [To do]
Catalin
- work on (x)ROOT(d); deploy test infrastructure [ongoing]
- kernel updates and latest errata applied on various systems [ongoing]
- apply latest updates (squid, frontier) on ATLAS Frontier node
- decommission various old systems
- update glite-WMS [done]
- work on Tier1 DB migration plans [ongoing]
- work on WMS monitoring [stalled]
Derek
- Investigation of secure deployment of ssh keys to hosts [ongoing]
- Reinstalling lcgce08 [Done]
- Investigating solutions for whole node scheduling [ongoing]
- A/L (29th-3rd)
Matt
- Tier-1 Resources meeting prep. [New]
- Deploy top BDII on EC2. [Ongoing]
- Quattorisation of FTM. [Ongoing]
- Deploying PBS JobMon monitoring tools. [Stalled]
- Test FTS SRM/GridFTP ratio configuration. [Stalled]
- Switch to gLite 3.2 FTS frontends (November 24). [Done]
- Reprofile disk capacity. [Done]
- Writing storage testbed proposal. [Done]
Richard
- Applied kernel + OS updates to CIP and site and top level BDIIs (including those in testbed) [Done]
- Wrote a gmetric tool to measure Quattor deploy hitrate (i.e. percentage of deploys (as found in SVN repo) that were "seen" by a machine) [Done]
- Updated the ShowQuattorChanges CGI script to show the deploy list [Done]
- Working on the tool for automatically checking middleware baselines [Ongoing]
- Developing a set of Quattor templates for an ARGUS server [Ongoing]
- Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
- Final touches to the CGI script before releasing initial version [Done]
- Working on the "team status page" being developed as an action from team awayday [Ongoing]
- Reviewing G/S process documentation [Ongoing]
- CASTOR items:
  - Built a new cluster within Quattor for building "cert-in-a-box". Stager server built -- now adding other headnode types. [Ongoing]
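For reference, the Quattor deploy hit rate in Richard's report (the percentage of deploys found in the SVN repo that were "seen" by a machine) can be illustrated with a minimal sketch. This is not the actual gmetric tool: the function name `deploy_hit_rate`, the deploy identifiers, and the choice to report 0% for an empty repo list are all assumptions made for illustration, and the step of publishing the value to Ganglia is omitted.

```python
def deploy_hit_rate(repo_deploys, seen_deploys):
    """Return the percentage of repository deploys that a machine has seen.

    Both arguments are iterables of deploy identifiers (hypothetical format).
    """
    repo = set(repo_deploys)
    if not repo:
        # Assumption: with no deploys in the repo, report 0% rather than fail.
        return 0.0
    # Only count deploys that actually exist in the repo.
    seen = repo & set(seen_deploys)
    return 100.0 * len(seen) / len(repo)


# Hypothetical deploy identifiers, for illustration only.
repo_deploys = ["r101", "r102", "r103", "r104"]
seen_deploys = ["r101", "r103", "r104", "r099"]

hit_rate = deploy_hit_rate(repo_deploys, seen_deploys)
print(f"{hit_rate:.1f}%")
```

Deploys reported by the machine but absent from the repo list (such as `r099` here) are ignored, so the result cannot exceed 100%.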
VO Reports
ALICE
ATLAS
CMS
- CMS CASTOR Job Manager outage on 2010-11-27 from 22:59 to 23:34
- Large number of FTS timeouts in outgoing transfers, ongoing since the CASTOR upgrade
- RAL has been the worst-performing CMS Tier-1 over the past 2 weeks (only 30% readiness)
LHCb
OnCall/AoD Cover
OnCall Rota
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: