RAL Tier1 weekly operations Grid 20100726
Operational Issues
Description | Start | End | Affected VO(s) | Severity | Status |
---|---|---|---|---|---|
Job status monitoring from CREAMCE | 2-Feb-2010 | | CMS | medium | [10-Feb-2010] WMS patch available soon; CREAMCE new version available soon [07-Apr-2010] CMS tests have shown that WMS patches resolve the problem; still waiting for patch to be installed on the production WMSs in Italy [13-Jul-2010] CNAF WMSs have been updated; testing using backfill is in progress [19-Jul-2010] So far everything looks good |
WMS03 | 16-Jul-2010 | 16-Jul-2010 | Non-LHC | low | Was unresponsive and rebooted |
FTS02 | 21-Jul-2010 | | All | High | SMART errors on both FTS02 disks; Fabric have replacements and wish to arrange swap-out |
Downtimes
Description | Hosts | Type | Start | End | Affected VO(s) |
---|---|---|---|---|---|
Blocking Issues
Description | Requested Date | Required By Date | Priority | Status |
---|---|---|---|---|
HW needed to test Dataguard technology for LFC/FTS | 19 May 2010 | 15 June 2010 | Medium | [24-05-2010] HW available; needs to be deployed by Fabric and then handed over to Dataservices |
#61658: HW request for CMS Squid VOBOX | 30 June 2010 | | Medium | [30-06-2010] Request made |
#62179: Request for new CMS pool accounts | 16 July 2010 | | High | [16-07-2010] Request made [21-07-2010] Ticket closed by Fabric team [26-07-2010] Pool accounts were created yesterday |
Developments/Plans
Highlights for Tier-1 Ops Meeting
Highlights for Tier-1 VO Liaison Meeting
Detailed Individual Reports
Alastair
- Working on ATLAS software server on /afs [ongoing]
- Wrote a script to identify unavailable files when a disk server is taken out of production. [testing]
- Looking into slow LHCb transfers between SARA and RAL (fix is with James T now).
- Working to improve pbsjobs database to allow easier monitoring of production work.
- Working on ATLAS Frontier service, monitoring and backup.
Andrew
- Completed & submitted change control documents about the new CMS production role [Done]
- Prepared changes required for monitoring & accounting for new CMS production role [Done]
- PhEDEx backup
  - Grid services on-call spreadsheet now contains details about temporarily moving PhEDEx to lcgvo0598
  - Ensured lcgvo0598 is ready to run PhEDEx in an emergency. [Done]
- CMS Data Ops
  - Backfill at IN2P3 & RAL
- Add VOBOX proxy renewal restarter to lcgvo-02-21 [To do]
- CMS storage consistency check [To do]
- A/L Wed - Fri
Catalin
- Python course [done]
- ATLAS frontier monitoring
- LFC quattorising (SL4 and SL5) [ongoing]
Derek
- Moved LHCb to grid3000M queue [done]
- Writing Strawman Cloud strategy [ongoing]
- Sync production templates against QWG [ongoing]
- CREAM CE quattor profile
Matt
- Using FTS dev endpoint to test new timeout parameters.
- Test deployment of gLite 3.2 FTS.
Richard
- Submitted downtime for applying the BDII update approved in change control request #62184
- Working on the "team status page" being developed as an action from team awayday [ongoing]
- Reviewing G/S process documentation [ongoing]
- CASTOR items:
  - Further progress on getting the 2.1.9 functional tests running on pre-prod
VO Reports
ALICE
ATLAS
CMS
- Discussions about having all Tier-1s publish CPU farm information in a common XML format:
  - Summary information: number of jobs running, pending, CPU time, wall time, number of jobs with efficiency < 10% (overall & for different groups)
  - (Optional) Details about individual jobs
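
As an illustration only, the summary information listed above could be produced along the following lines. This is a minimal sketch, not the format under discussion: the element and attribute names (`farm`, `summary`, `group`, `low_efficiency`), the job dictionary keys and the site name are all hypothetical, and efficiency is assumed to mean CPU time divided by wall time for running jobs.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

LOW_EFFICIENCY = 0.10  # threshold named in the report: efficiency < 10%


def summarise(jobs):
    """Return summary counters for a list of job dicts (hypothetical schema)."""
    running = [j for j in jobs if j["state"] == "running"]
    pending = [j for j in jobs if j["state"] == "pending"]
    # Assumption: efficiency = CPU time / wall time; jobs with no wall time are skipped.
    low = [
        j for j in running
        if j["wall_time"] > 0 and j["cpu_time"] / j["wall_time"] < LOW_EFFICIENCY
    ]
    return {
        "running": len(running),
        "pending": len(pending),
        "cpu_time": sum(j["cpu_time"] for j in running),
        "wall_time": sum(j["wall_time"] for j in running),
        "low_efficiency": len(low),
    }


def farm_xml(site, jobs):
    """Serialise overall and per-group summaries to an XML string."""
    root = ET.Element("farm", site=site)
    by_group = defaultdict(list)
    for job in jobs:
        by_group[job["group"]].append(job)
    # One <summary> for all jobs, then one per group in alphabetical order.
    sections = [("all", jobs)] + [(g, by_group[g]) for g in sorted(by_group)]
    for group, group_jobs in sections:
        node = ET.SubElement(root, "summary", group=group)
        for key, value in summarise(group_jobs).items():
            ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up sample data; times are in seconds.
example_jobs = [
    {"state": "running", "group": "production", "cpu_time": 3400, "wall_time": 3600},
    {"state": "running", "group": "analysis", "cpu_time": 100, "wall_time": 3600},
    {"state": "pending", "group": "analysis", "cpu_time": 0, "wall_time": 0},
]
print(farm_xml("EXAMPLE-SITE", example_jobs))
```

The optional per-job details are left out here; they could be added as further child elements once a schema is agreed.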
LHCb
OnCall/AoD Cover
- Primary OnCall:
- Grid OnCall: Derek
- AoD: