RAL Tier1 weekly operations Grid 20101101
From GridPP Wiki
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
SW RAID problems on lcgwms03 (non-LHC) | Fri 22-Oct-2010 | | non-LHC | | Fabric aware of the problem |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server, testing CVMFS
- 825 test jobs have been run.
- lcg0805 has been set up for production-style testing; the queue still needs to be added to the ATLAS system.
- Production tasks submitted.
- ANALY_RAL is now open to normal users and they are successfully using CVMFS.
- Writing script to graph transfer times for FTS transfers [on hold]
- Emergency SRM upgrade
- Emergency ATLAS permission upgrade
- Preparing for ATLAS UK meeting tomorrow.
Andrew
- Capacity planning system project [Ongoing]
- Dealing with APEL problems (both MON box & glite-APEL)
- Fixed pbslogs2mysql (ignores jobs from job arrays) [Done]
- Completed gmetric & ganglia scripts for per-site CMS squid access monitoring [Done]
- Put glite-APEL into production [To do]
- October accounting [To do]
- Investigate setting up batch system plugin required for producing XML job information for CMS [To do]
- CMS data ops
  - Pile-up MC reprocessing at RAL & CNAF [Ongoing]
Catalin
- deploy lcglb03 (gLite3.2 LB) in full (LHC and non-LHC) production
- work on (x)ROOT(d); deploy test infrastructure [ongoing]
- drain lcglb01
Derek
- Updated blparser on lcgbatch01 to fix job state issue on CREAM CEs [done]
- purged 60,000 jobs from lcgce03 stuck in Running state [done]
- Deployed new change control process [done]
- Investigation of secure deployment of ssh keys to hosts [ongoing]
- Change control for providing an additional CREAM CE for ATLAS
- Investigating solutions for whole node scheduling
Matt
- Testing PBS monitoring tools (pbswebmon, JobMon) [Ongoing]
- Further testing of Quattorised gLite3.2 FTS FEs. [Ongoing]
- Quattorisation of MyProxy nodes. [Ongoing]
- Test FTS SRM/GridFTP ratio configuration.
- Disk Deployment meeting.
Richard
- Prepping for Wednesday's update to RAL site-level BDIIs
- Developing a set of Quattor templates for an ARGUS server [Ongoing]
- Developing a "pseudo-update" to apply a gLite update to BDIIs
- Wrote a CGI script for logging hardware requests from G/S team in the Fabric queue in RT [Ongoing]
- Working on the "team status page" being developed as an action from team awayday [Ongoing]
- Reviewing G/S process documentation [Ongoing]
- CASTOR items:
  - Running functional tests on Facilities instance
  - Using grid to run many jobs so as to stress test Facilities instance
VO Reports
ALICE
ATLAS
CMS
- All CMS Tier-1s have been asked to provide XML job information (produced every 10 mins) to be consumed by central monitoring.
- Current work at RAL: pile-up MC reprocessing started last week and finished over the weekend.
- Upcoming T1 reprocessing plans (dates may be subject to change):
  - 2010-11-04 Data rereco
  - 2010-11-13 Pile-up MC redigi/rereco
  - 2010-12-15 Data rereco / MC redigi/rereco
LHCb
OnCall/AoD Cover
- Primary OnCall: Catalin (Mon-Sun)
- Grid OnCall: