From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status
Job status monitoring from CREAMCE | 2-Feb-2010 |  | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon. [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy. [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress. [19-Jul-2010] So far everything looks good.
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s)
gLite-WMS update + maintenance | lcgwms01 |  | Wed 1 Sep 10:00 | 8 Sep 11:00 | LHC
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices
LFC and FTS to be moved into the UPS room | 02 Sep 2010 | 15 Sep 2010 | Medium |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server, testing CVMFS [waiting for farm to empty before running test]
- Writing script to graph transfer times for FTS transfers
- Made TWiki page from Brian's disk draining/checking scripts.
- Working on HammerCloud test of CASTOR 2.1.9 [Analysis queue setup]
- Looking into gdss547, 554 transfer problems to non-UK sites.
Andrew
- CMS CASTOR 2.1.9 testing
  - Preparing for transfers from RAL to Imperial (1.8 TB loadtest data transferred from CERN to RAL) for cmsWanOut stress testing [Done]
  - Stress-testing on cmsFarmRead (with & without lazy-download) [Ongoing]
- Investigating CMS software server file access problems [Done]
- 2009-2010 VO support survey [Ongoing]
- August accounting [Done]
- CMS data ops
  - Running MC redigi/rereco at CNAF [Ongoing]
  - Running data rereco preproduction at KIT, FNAL, RAL [Ongoing]
Catalin
- add new frontends to non-LHC LFC alias
- gLite updates WMS01 LHC [ongoing]
- ATLAS frontier monitoring [ongoing]
- test SL5 LFC Quattor profiles [ongoing]
- work on improving Ganglia monitoring for Grid Services
Derek
- CREAM CE Quattor profile [ongoing]
- Investigating CREAM CE instability [ongoing]
- At GridPP meeting Mon-Thu; on annual leave Friday and the following week
Matt
- Change Controls for FTS FE updates.
- Quattorisation of FTS Agents host.
Richard
- Preparation for next week's roll-out of Quattorised site-level BDIIs
- Tracking the mystery BDII problem reported by Chris Walker at QMUL
- Working on the "team status page" being developed as an action from team awayday [ongoing]
- Reviewing G/S process documentation [ongoing]
- CASTOR items:
  - Helping Cheney with Quattor profile for "combo" CASTOR headnodes
VO Reports
ALICE
ATLAS
CMS
- RAL was in an ERROR state for CMS on 3 and 5 September due to CE/batch system problems affecting SAM tests, JobRobot and production jobs.
- Deleted ~200 TB of old MC data.
- Recently, around 400-500 CMS jobs per day have been killed for exceeding the 2 GB memory limit. These are WMAgent test jobs with a known bad config file (WMAgent is the replacement for ProdAgent).
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall:
- AoD: