From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s)
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- ATLAS TaskForce
- Keeping ATLAS up to date with castor upgrade.
- Helping Lancaster with File Loss.
- Working on ATLAS permission change.
Andrew
- Capacity planning system project [Ongoing]
- Resource review meeting [Done]
- November accounting [Done apart from csv file]
- Update CMS squids to latest version; updated Nagios plugin [Done]
- FTS adjustments ATLAS and T2K [Done]
- Investigated networking problems on Monday affecting CMS jobs
- CMS data ops
- WMAgent testing
- Rereco at RAL, IN2P3, KIT [Ongoing]
Catalin
- test squid deployments for ATLAS [ongoing]
- work on (x)ROOT(d); deploy test infrastructure [ongoing]
- kernel updates and last errata applied on various systems [done]
- apply latest updates (squid, frontier) on Atlas Frontier node [done]
- decommission various old systems [done]
- work on Tier1 DB migration plans [ongoing]
- work on WMS monitoring [stalled]
Derek
- Investigation of secure deployment of ssh keys to hosts [ongoing]
- Reinstalling lcgce08 [Done]
- Investigating solutions for whole node scheduling [ongoing]
- A/L (29th-3rd)
Matt
- T2K FTS configuration. [New]
- Handover for A/L. [New]
- Quattorisation FTM. [Ongoing]
- Tier-1 Resources meeting prep. [Done]
- Deploying PBS JobMon monitoring tools. [Stalled]
- Test FTS SRM/GridFTP ratio configuration. [Stalled]
- Deploy top BDII on EC2. [Done]
- Blog top BDII on EC2. [Done]
Richard
- Wrote a gmetric tool to measure Quattor deploy hitrate (i.e. percentage of deploys (as found in SVN repo) that were "seen" by a machine) [Done]
- Working prototype of a tool for automatically checking middleware baselines now in place [Done]
- Developing a set of Quattor templates for an ARGUS server [Ongoing]
- Developing a "pseudo-update" to apply gLite update 19 to BDIIs [Ongoing]
- Working on the "team status page" being developed as an action from team awayday [Ongoing]
- Reviewing G/S process documentation [Ongoing]
- CASTOR items:
- Added an LSF server to the "cert-in-a-box" cluster. [Ongoing]
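The deploy hit rate in Richard's first item is a simple percentage, and can be sketched as follows. This is a minimal illustration, not the actual gmetric tool: the function name `deploy_hit_rate`, the use of revision numbers as deploy identifiers, and the sample values are all assumptions.

```python
def deploy_hit_rate(deployed, seen):
    """Return the percentage of deploys that a machine saw.

    deployed: deploy identifiers found in the repository (hypothetical IDs)
    seen: deploy identifiers the machine reported seeing
    """
    deployed = set(deployed)
    if not deployed:
        return 0.0  # assumption: with no deploys there is nothing to measure
    hits = deployed & set(seen)  # ignore anything seen that was never deployed
    return 100.0 * len(hits) / len(deployed)


# Hypothetical example: four deploys in the repository, three seen by the host.
rate = deploy_hit_rate([101, 102, 103, 104], [101, 103, 104, 999])
print(f"{rate:.1f}")  # 75.0
```

A value computed this way could then be published per host through Ganglia's `gmetric` command.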
VO Reports
ALICE
ATLAS
CMS
- Deleted ~80 TB of old data last week
- Last week's CMS problems:
- Two proxy renewal problems on the CMS VOBOX, causing ~1 hour of failed transfers to RAL. The restarter did not appear to restart it successfully.
- Failing transfers (mainly outgoing) and SAM tests on Sunday
- A cmssgm job was in the running state but had been forgotten by the batch system, which prevented a new software release (required for the current reprocessing) from being installed. This delayed the start of reprocessing at RAL by 1.5 days.
- Network/DNS issues
- Squids denied access from some worker nodes, causing some reprocessing jobs to fail because they could not fail over to CERN
- Central CMS monitoring of squids reported this at 2010-12-06 05:20: "Skipping host lcgsquid01.gridpp.rl.ac.uk as it does not resolve to an IPv4 address"
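The condition behind the monitoring message quoted above (a host name with no IPv4 address) can be checked from any node with a short script. This is a hedged sketch using only the Python standard library; the function name `resolves_to_ipv4` and the injectable `resolver` parameter are assumptions for illustration, and it is not the code used by the central CMS monitoring.

```python
import socket


def resolves_to_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one IPv4 address.

    resolver is injectable (an assumption for testability); by default
    the system resolver is used via socket.getaddrinfo.
    """
    try:
        # Restrict the lookup to the IPv4 address family.
        results = resolver(hostname, None, socket.AF_INET)
    except socket.gaierror:
        # Name does not exist, or has no address in this family.
        return False
    return len(results) > 0
```

Calling `resolves_to_ipv4` with a squid host name on an affected worker node and on a healthy one would help separate a site-wide DNS problem from a node-local one.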
LHCb
OnCall/AoD Cover
OnCall Rota
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: