Difference between revisions of "Tier1 Operations Report 2014-05-28"
From GridPP Wiki
Revision as of 10:07, 28 May 2014
RAL Tier1 Operations Report for 28th May 2014
Review of Issues during the week 21st to 28th May 2014.
- Maintenance on the diesel generator was carried out as planned on the morning of Thursday 22nd May.
- There were problems with tape access late on Tuesday and on Wednesday (20/21 May). On the Tuesday morning a new tape controller server (ACSLS) had been put into operation, and some six hours later problems started to appear. The change was reverted on Wednesday afternoon by putting the old server back into service.
- This morning's planned network reconfiguration, carried out with an 'at risk' on Castor, ran into some problems that caused a break in access to some disk servers for around 20 minutes. The network change itself was carried out to completion.
Resolved Disk Server Issues
- None.
Current operational status and issues
- Grumbling problems with the WMSs, reported last week, are ongoing. The developers have been contacted.
Ongoing Disk Server Issues
- None.
Notable Changes made this last week.
- Completion of roll out of CVMFS Client version 2.1.19 to whole farm.
- This morning (28th May) arc-ce01 was updated to version 4.1.0-1.
- This morning (28th May) two network switches that provide connectivity to some Castor disk servers were moved to the mesh network.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor (SRMs) and batch (CEs). | SCHEDULED | OUTAGE | 10/06/2014 08:50 | 10/06/2014 15:00 | 6 hours and 10 minutes | Castor and batch services down during upgrade of Castor Nameserver to version 2.1.14. |
arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/06/2014 10:00 | 02/06/2014 12:00 | 2 hours | Upgrade arc-ce02 and arc-ce03 to v. 4.1.0. |
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0. |
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change. |
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned. |
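The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch of that arithmetic, assuming the timestamps are in the DD/MM/YYYY HH:MM form shown in the table; the function name `downtime_duration` is hypothetical and is not part of the GOC DB.

```python
from datetime import datetime

# Timestamp layout assumed from the table above (DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two table timestamps as text.

    Hypothetical helper for illustration; assumes end is not before start.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    parts = []
    if delta.days:
        parts.append(f"{delta.days} days")
    if hours:
        parts.append(f"{hours} hours")
    if minutes:
        parts.append(f"{minutes} minutes")
    return ", ".join(parts) or "0 minutes"


# Castor Nameserver upgrade outage from the table.
print(downtime_duration("10/06/2014 08:50", "10/06/2014 15:00"))
# lcgui02 decommissioning outage from the table.
print(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00"))
```

The two calls reproduce the 6 hours 10 minutes and 28 days 23 hours listed for those entries.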
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Dates for the Castor 2.1.14 upgrade: Nameserver: Tuesday 10th June (now in GOC DB); Stagers: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is largely complete, although a new minor version (2.1.14-12) will be released soon.
- Networking:
  - Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 21st and 28th May 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0. |
All SRMs (All Castor) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change. |
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
105571 | Green | Less Urgent | In Progress | 2014-05-21 | 2014-05-27 | LHCb | BDII and SRM publish inconsistent storage capacity numbers |
105405 | Yellow | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration |
105308 | Yellow | Less Urgent | On Hold | 2014-05-11 | 2014-05-27 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied" |
105161 | Amber | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours) |
105100 | Red | Urgent | In Progress | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14) |
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-05-21 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
21/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 | |
22/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
23/05/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 | |
24/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
25/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
26/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
27/05/14 | 100 | 100 | 99.1 | 100 | 100 | 96 | 100 | Single SRM Get test failure. |
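For a quick week-level view, the daily figures above can be averaged per column. The sketch below uses a simple unweighted mean of the seven daily values for the columns that were not at 100 every day; this is an assumption made for illustration and is not necessarily how official availability figures are aggregated.

```python
from statistics import mean

# Daily percentages copied from the table above (21/05/14 to 27/05/14).
daily = {
    "Atlas": [100, 100, 100, 100, 100, 100, 99.1],
    "Atlas HC": [98, 99, 97, 100, 99, 98, 96],
    "CMS HC": [99, 100, 99, 100, 100, 100, 100],
}

# Unweighted weekly mean per column, rounded to two decimal places.
weekly = {name: round(mean(values), 2) for name, values in daily.items()}

for name, value in weekly.items():
    print(f"{name}: {value}")
```

OPS, Alice, CMS and LHCb are omitted because every daily value in those columns is 100.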