RAL Tier1 weekly operations Grid 20100726

Operational Issues

Description	Start	End	Affected VO(s)	Severity	Status
Job status monitoring from CREAMCE	2-Feb-2010		CMS	medium	[10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress [19-Jul-2010] So far everything looks good
WMS03	16-Jul-2010	16-Jul-2010	Non-LHC	low	Was unresponsive and rebooted
FTS02	21-Jul-2010		All	high	SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange a swap-out

Downtimes

Description	Hosts	Type	Start	End	Affected VO(s)

Blocking Issues

Description	Requested Date	Required By Date	Priority	Status
HW needed to test Dataguard technology for LFC/FTS	19 May 2010	15 June 2010	Medium	[24-05-2010]HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX	30 June 2010		Medium	[30-06-2010]Request made
#62179: Request for new CMS pool accounts	16 July 2010		High	[16-07-2010]Request made [21-07-2010]Ticket closed by Fabric team [26-07-2010]Pool accounts were created yesterday

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

Working on ATLAS software server on /afs [ongoing]
Written script to identify unavailable files when a disk server is taken out of production. [testing]
Looking into slow LHCb transfers between SARA and RAL. (fix is now with James T)
Working to improve pbsjobs database to allow easier monitoring of production work.
Working on ATLAS Frontier service, monitoring and backup.

Andrew

Completed & submitted change control documents about the new CMS production role [Done]
Prepared changes required for monitoring & accounting for new CMS production role [Done]
PhEDEx backup
- Grid services on-call spreadsheet now contains details about temporarily moving PhEDEx to lcgvo0598
- Ensured lcgvo0598 is ready to run PhEDEx in an emergency. [Done]
CMS Data Ops
- Backfill at IN2P3 & RAL
Add VOBOX proxy renewal restarter to lcgvo-02-21 [To do]
CMS storage consistency check [To do]
A/L Wed - Fri

Catalin

Python course [done]
ATLAS frontier monitoring
LFC quattorising (SL4 and SL5) [ongoing]

Derek

Moved LHCb to grid3000M queue [done]
Writing Strawman Cloud strategy [ongoing]
Sync production templates against QWG [ongoing]
CREAM CE quattor profile

Matt

Using FTS dev endpoint to test new timeout parameters.
Test deployment of gLite 3.2 FTS.

Richard

Submitted downtime for applying the BDII update approved in change control request #62184
Working on the "team status page" being developed as an action from team awayday [ongoing]
Reviewing G/S process documentation [ongoing]
CASTOR items:
- Further progress on getting the 2.1.9 functional tests running on pre-prod

VO Reports

ALICE

ATLAS

CMS

Discussions about having all Tier-1s publish CPU farm information in a common XML format:
- Summary information - number of jobs running, pending, CPU time, wall time, number of jobs with efficiency < 10% (overall & for different groups)
- (Optional) Details about individual jobs
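
The summary fields listed above could be generated along the following lines. This is only a sketch to make the discussion concrete: no schema has been agreed, so the element and attribute names (`farm`, `summary`, `scope`, `low_efficiency`), the `Job` record and the site name are all hypothetical. Efficiency is assumed to mean CPU time divided by wall time, and the low-efficiency count is assumed to apply to running jobs only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

LOW_EFFICIENCY = 0.10  # threshold from the discussion: efficiency < 10%


@dataclass
class Job:
    """Hypothetical per-job record; a real one would come from the batch system."""
    group: str
    state: str           # "running" or "pending"
    cpu_seconds: float
    wall_seconds: float


def is_low_efficiency(job):
    """True if CPU time / wall time is below the threshold (assumed definition)."""
    if job.wall_seconds <= 0:
        return False  # no wall time accumulated yet, efficiency undefined
    return job.cpu_seconds / job.wall_seconds < LOW_EFFICIENCY


def summarise(jobs):
    """Compute the summary numbers for one set of jobs."""
    running = [j for j in jobs if j.state == "running"]
    return {
        "running": len(running),
        "pending": sum(1 for j in jobs if j.state == "pending"),
        "cpu_seconds": sum(j.cpu_seconds for j in jobs),
        "wall_seconds": sum(j.wall_seconds for j in jobs),
        "low_efficiency": sum(1 for j in running if is_low_efficiency(j)),
    }


def add_summary(parent, jobs, **attrs):
    """Append one <summary> element holding the numbers for the given jobs."""
    node = ET.SubElement(parent, "summary", attrs)
    for name, value in summarise(jobs).items():
        ET.SubElement(node, name).text = str(value)
    return node


def farm_summary_xml(site, jobs):
    """Return an XML string with an overall summary plus one summary per group."""
    root = ET.Element("farm", {"site": site})
    add_summary(root, jobs, scope="overall")
    for group in sorted({j.group for j in jobs}):
        members = [j for j in jobs if j.group == group]
        add_summary(root, members, scope="group", name=group)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example_jobs = [
        Job("prod", "running", 900, 1000),
        Job("prod", "running", 50, 1000),
        Job("user", "pending", 0, 0),
    ]
    print(farm_summary_xml("EXAMPLE-T1", example_jobs))
```

The optional per-job details could be added as further child elements under `farm` without changing the summary elements.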

LHCb

OnCall/AoD Cover

Primary OnCall:
Grid OnCall: Derek
AoD:
