Tier1 Operations Report 2010-05-19
From GridPP Wiki
RAL Tier1 Operations Report for 19th May 2010.
Review of Issues during week 12th to 19th May 2010.
- gdss380 (lhcbmdst) crashed following a disk failure on Friday night / Saturday morning. Later, on Sunday, GGUS ticket 58253 was raised about disk space, and a replacement server was put into the lhcbmdst service class that day. Problems with files being unavailable persisted until this morning (19th May).
- A tape problem on Friday 14th caused the loss of 2 files belonging to the MINOS VO. The MINOS experiment representatives were informed by email.
- A tape problem on Monday 17th caused the loss of a single file for Atlas, who were informed via a GGUS ticket.
Current operational status and issues.
- We are still failing some SAM tests on the site BDII. This is not likely to be fixed in the current SAM setup - see GGUS ticket 58054.
- Ongoing issues with the Atlas software server - lcg0617. There are plans in place to replace this machine.
Declared in the GOC DB:
- At Risks on the remaining Castor instances while a database table is cleaned (as already done for the 'GEN' & 'CMS' instances):
  - Wednesday 19th May: Atlas (today)
  - Tuesday 25th May: LHCb
- Wednesday 19th May: lcgce07.gridpp.rl.ac.uk Downtime for glexec upgrade.
- Until Wednesday 26th May: lcgvo-02-21.gridpp.rl.ac.uk VO box not yet in production.
- Tuesday 1st June (During technical stop) - UPS test (implying site At Risk).
Advance warning:
The following items remain to be scheduled:
- Oracle patching of databases. Will lead to "At Risks" on OGMA (Atlas 3D) on Tuesday 25th May, LUGH (LHCb 3D & LFC) Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
- Doubling of the network link to the network stack for the tape robot and Castor head nodes. Tuesday 1st June (during the technical stop). Will require a Castor stop, FTS drain and batch pause. Planned to coincide with the UPS test.
- CEs taken out of production in rotation (one at a time) while glexec configured. (CE01 and CE06 done so far, CE07 in progress.)
- Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead. Likely to lead to two 'At Risks'.
- Closure of SL4 batch workers at RAL
Entries in GOC DB starting between 12th and 19th May 2010.
There were 2 unscheduled entries in the GOC DB for this last week.
- lcgce07's scheduled downtime was mistakenly set to end too early and had to be extended with an unscheduled entry.
- lcgvo-02-21.gridpp.rl.ac.uk is a CMS VO box not yet in production.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
---|---|---|---|---|---|---
lcgce07.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 17/05/2010 17:00 | 19/05/2010 17:00 | 2 days | Downtime to reconfigure glexec. The previous downtime for this machine was mistakenly set to end on 17/05/2010. |
lcgvo-02-21.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 11:00 | 26/05/2010 11:00 | 9 days | CMS SL5 Phedex VObox not yet in production. |
lcgce07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 10:00 | 17/05/2010 17:00 | 7 hours | Downtime to reconfigure glexec. Alternate CEs are available for the affected VOs. |
srm-cms.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 17/05/2010 09:30 | 17/05/2010 11:30 | 2 hours | At Risk on the Castor CMS instance while redundant information is cleaned up from the Castor database. |
lcgvo-02-21.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/05/2010 15:15 | 17/05/2010 11:00 | 2 days, 19 hours and 45 minutes | CMS SL5 Phedex VObox not yet in production. |
srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 12/05/2010 09:30 | 12/05/2010 11:30 | 2 hours | At Risk on the Castor 'GEN' instance while redundant information is cleaned up from the Castor database. |
lcgce06.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 10/05/2010 10:00 | 12/05/2010 11:35 | 2 days, 1 hour and 35 minutes | Reconfiguring lcgce06 to enable mappings for glexec. |