RAL Tier1 weekly operations Grid 20100712
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy |
Software server overloaded | | | Atlas | High | Software server problems |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made |
Developments/Plans
Highlights for Tier-1 Ops Meeting
- Mayo's last week
- Deployed lcgce03 (CREAM CE) for ALICE
- Applied job limits to ATLAS after problems with ATLAS software server
Highlights for Tier-1 VO Liaison Meeting
- LHCb have requested that we raise walltime on grid2000M to 140 hours (from 96)
- Testing update to gLite WN
- New CREAM CE (lcgce03) deployed for ALICE
- Applied job limits to ATLAS after problems with ATLAS software server
Detailed Individual Reports
Alastair
- Working on ATLAS software server on /afs [ongoing]
- Wrote a script to identify unavailable files when a disk server is taken out of production [testing]
- Looking into slow LHCb transfers between SARA and RAL (fix with James T now)
- Working to improve pbsjobs database to allow easier monitoring of production work.
- Working on ATLAS Frontier service, monitoring and backup.
Andrew
- June accounting [Done]
- Wrote BDII-DAC disk capacity consistency checking script [Done]
- Checking new stage-out config for RAL so that unmerged files can be deleted (ProdAgent, SAM tests, site local config required updates) [Done]
- Checking checksums of files from T10KB tapes
- CMS data ops
  - Completing MC reprocessing [Done]
  - Started MC production backfill at RAL & IN2P3 [Ongoing]
  - Real MC production at CNAF
- glite-APEL
  - Reading documentation
  - To do: set up glite-APEL instance in testbed
- Listened to CMS talks at WLCG meeting (EVO)
Catalin
- Test LFC deployment using Quattor [ongoing]
- LFC talk for NGS [done]
- Frontier monitoring [ongoing]
- ALICE CASTOR+xrootd issues [ongoing]
Derek
- Testing glexec update [ongoing]
- Setting up NGS UEE on worker nodes
- Implementing updated change control process on dev helpdesk
- Quattorising CREAM CE
- Mayo leaving stuff
Matt
Richard
- Catch-up after last week's leave
- Planning updates to RAL top-level BDIIs [ongoing]
- Working on the "team status page" being developed as an action from team awayday [ongoing]
- Reviewing G/S process documentation [ongoing]
- Developed a tool to help with automating the wiki page on grid middleware versions [done]
- CASTOR items:
  - Ran stress tests on pre-prod
  - Next week:
    - Assemble results from last week's stress test runs
    - Try to get 2.1.9 functional tests running on pre-prod
  - Finishing off 2.1.7 metrics documentation [ongoing]
Mayo
- Incorporate David Meredith's feedback into certificate viewer [Done]
- Integrate certificate viewer module with existing NGS certificate wizard code [Done]
- Create Handover Documentation for finished projects [ongoing]
- Enter job plan into ssc [Done]
- Create Certificate Query class for David Meredith [Done]
VO Reports
ALICE
- Waiting for CREAM-CE 1.6 deployment at RAL
- Cannot roll out new xrootd version (20100510-1509_dbg) on CASTOR 2.1.7
ATLAS
CMS
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek
- AoD: