Tier1 Operations Report 2010-05-26
RAL Tier1 Operations Report for 26 May 2010
Review of Issues during the week 19th to 26th May 2010.
- gdss380 (lhcbmdst) was put back into service on the 20th. It failed again on Saturday morning with symptoms similar to those of its previous failure.
- Problem with the CIP not publishing GlueSA information on Thursday 20th May 2010. This was noticed and fixed before any tickets were raised.
- gdss67 (cmsFarmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th.
- On Thursday 20th there was a problem deploying gdss228 and gdss229 into atlasScratchDisk.
- On Monday 24th gdss207 (aliceTap) was removed from service due to possible file system corruption.
- Overnight into Tuesday 25th May 2010 there were problems with the Atlas SRM machines (/var full).
- On Tuesday morning the CMS SRM failed tests because of a stager error.
Current operational status and issues.
- We are still failing some SAM tests on the site BDII. This is unlikely to be fixed in the current SAM setup - see GGUS ticket 58054.
- Ongoing issues with the Atlas software server (lcg0617). The plans to replace this machine have now been approved.
Declared in the GOC DB
- Wednesday 26th May: lcgce08.gridpp.rl.ac.uk Downtime for glexec upgrade.
- Tuesday 1st June (During technical stop) - UPS test (implying site At Risk).
- Wednesday 2nd June Oracle patching on SOMNUS (LFC and FTS).
Advance warning:
The following items remain to be scheduled:
- Oracle patching of databases. Will lead to "At Risks" on LUGH (LHCb 3D & LFC) Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
- Doubling of network link to network stack for tape robot and Castor head nodes. Tuesday 1st June (during technical stop). Will require a Castor stop, FTS drain and batch pause. Plan to make co-incident with UPS test.
- CEs taken out of production in rotation (one at a time) while glexec configured. (CE08 in progress.)
- Preventative maintenance work on transformers in R89. Being scheduled some weeks ahead. Likely to lead to two 'At Risks'.
- Closure of SL4 batch workers at RAL.
Entries in GOC DB starting between 19th and 26th May 2010.
There was one unscheduled entry in the GOC DB for this last week.
- lcgce07's scheduled downtime was mistakenly set to end too early and had to be extended as an unscheduled outage.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
---|---|---|---|---|---|---
ogma.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 25/05/2010 10:00 | 25/05/2010 12:00 | 2 hours | At Risk during application of Oracle PSU patches.
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 25/05/2010 09:30 | 25/05/2010 11:30 | 2 hours | At Risk on Castor LHCb instance while redundant information cleaned up from Castor database.
lcgce08.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 24/05/2010 10:00 | 26/05/2010 17:00 | 2 days, 7 hours | CE being reconfigured for glexec roles mapping.
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 19/05/2010 09:30 | 19/05/2010 11:30 | 2 hours | At Risk on Castor Atlas instance while redundant information cleaned up from Castor database.
lcgce07.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 17/05/2010 17:00 | 19/05/2010 17:00 | 2 days | Downtime to reconfigure glexec. Previous downtime for this machine was mistakenly set to end on 17/05/2010.
lcgvo-02-21.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/05/2010 11:00 | 25/05/2010 13:32 | 8 days, 2 hours and 32 minutes | CMS SL5 Phedex VObox not yet in production.