RAL Tier1 weekly operations Grid 20100705

Operational Issues

Description	Start	End	Affected VO(s)	Severity	Status
Job status monitoring from CREAMCE	2-Feb-2010		CMS	medium	[10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy

Downtimes

Description	Hosts	Type	Start	End	Affected VO(s)

Blocking Issues

Description	Requested Date	Required By Date	Priority	Status
HW needed to test Dataguard technology for LFC/FTS	19 May 2010	15 June 2010	Medium	[24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
#61658: HW request for CMS Squid VOBOX	30 June 2010		Medium	[30-06-2010] Request made

Developments/Plans

Highlights for Tier-1 Ops Meeting

CE03 now deployed
Ongoing work to finalise the closure of the SL4 batch service
Working on failover CMS PhEDEx VOBOX
Grid Team thin on the ground this week (A/L & WLCG workshop)

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

Working on ATLAS software server on /afs [ongoing]
Written script to identify unavailable files when a disk server is taken out of production [testing]
Looking into slow LHCb transfers between SARA and RAL (fix now with James T)
Working to improve pbsjobs database to allow easier monitoring of production work.
Working on ATLAS Frontier service, monitoring and backup.
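
A minimal sketch of the idea behind the unavailable-files script listed above, not the script itself. It assumes a catalogue dump is available as (file path, disk server) pairs; the function name, paths and host names below are hypothetical. A file counts as unavailable when every known copy sits on the server being removed.

```python
from collections import defaultdict


def files_lost_without(replicas, server):
    """Return the sorted files whose only known copies are on `server`.

    `replicas` is an iterable of (path, host) pairs; this input format is
    an assumption, not the format used by the real script.
    """
    locations = defaultdict(set)
    for path, host in replicas:
        locations[path].add(host)
    # A file becomes unavailable only if the removed server holds all copies.
    return sorted(p for p, hosts in locations.items() if hosts == {server})


# Hypothetical catalogue dump: file_a has a second copy, file_b does not.
replicas = [
    ("/castor/example/file_a", "gdss01"),
    ("/castor/example/file_a", "gdss02"),
    ("/castor/example/file_b", "gdss01"),
    ("/castor/example/file_c", "gdss02"),
]
print(files_lost_without(replicas, "gdss01"))
```

Grouping by file first keeps the check to a single pass over the dump, which matters when a disk server holds many files.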

Andrew

Adjustments to TFC & testing of new service class (cmsTemp) using a backfill workflow [Done]
Put in H/W request for Fabric team for new CMS VOBOX for Squid / PhEDEx failover [Done]
Writing call-out documentation for restarting PhEDEx on another VOBOX [Ongoing]
Updated FTS services.xml; added new domain to RGMA ACL; updated Maui fairshares [Done]
Accounting
- June accounting [Ongoing]
- Investigated CESGA/PBS differences due to dates used in queries [Done]
CMS data ops
- Accounting for previous rereco/skims
- Data rereco at KIT
- MC rereco at RAL & CNAF

Catalin

Test LFC deployment using Quattor [ongoing]
LFC talk for NGS [done]
Frontier monitoring [ongoing]
ALICE CASTOR+xrootd issues [ongoing]

Derek

Testing glexec update [ongoing]
Setting up NGS UEE on worker nodes
Deployed lcgce03 [done]
Implementing updated change control process on dev helpdesk
Attending WLCG Workshop Wed-Fri

Matt

Richard

Planning updates to RAL top-level BDIIs
Further work on the "team status page" being developed as an action from team awayday
Reviewing G/S process documentation
Developed a tool to help with automating the wiki page on grid middleware versions
Writing a Nagios plugin to check the "deltas" in the number of entries in RAL BDII servers
CASTOR items:
- Carried out latest phase in pre-prod upgrade
- Re-ran 2.1.8 functional tests on latest pre-prod s/w after latest re-config
- Started running stress tests
Next Week
- Finishing off 2.1.7 metrics documentation
- Continuing to run stress tests on pre-prod
4.5 days A/L
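
A rough sketch of the threshold logic such a BDII "delta" check might use, not the plugin under development. It assumes the previous and current entry counts have already been obtained (for example from a stored value and a fresh query); the function name and the warning and critical thresholds are hypothetical. Only the exit codes follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_delta(previous, current, warn, crit):
    """Compare two entry counts and return (exit_code, status_line).

    `warn` and `crit` are absolute changes in the number of entries;
    the values used below are illustrative assumptions.
    """
    if previous is None or current is None:
        return UNKNOWN, "BDII UNKNOWN - entry count not available"
    delta = abs(current - previous)
    # Performance data uses the usual label=value;warn;crit layout.
    perfdata = "entries=%d delta=%d;%d;%d" % (current, delta, warn, crit)
    if delta >= crit:
        return CRITICAL, "BDII CRITICAL - entries changed by %d | %s" % (delta, perfdata)
    if delta >= warn:
        return WARNING, "BDII WARNING - entries changed by %d | %s" % (delta, perfdata)
    return OK, "BDII OK - entries changed by %d | %s" % (delta, perfdata)


# Hypothetical counts from two consecutive checks.
code, line = check_delta(previous=10000, current=10350, warn=200, crit=1000)
print(line)
```

A real plugin would finish by exiting with `code` so that Nagios can read the state from the process exit status.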

Mayo

Implement David Meredith's feedback into Certificate viewer [Done]
Integrate certificate viewer module with existing NGS certificate wizard code [Done]
Create Handover Documentation for finished projects [ongoing]
Enter job plan into ssc [Done]
Create Certificate Query class for David Meredith [Done]

VO Reports

ALICE

waiting for CREAM-CE 1.6 deployment at RAL
cannot roll out new xrootd version (20100510-1509_dbg) on Castor 2.1.7

ATLAS

CMS

Data loss: 877 files were lost from gdss67

LHCb

OnCall/AoD Cover

Primary OnCall:
Grid OnCall: Derek
AoD:
