Difference between revisions of "Tier1 Operations Report 2014-05-28"
From GridPP Wiki
Revision as of 10:03, 28 May 2014
RAL Tier1 Operations Report for 28th May 2014
Review of Issues during the week 21st to 28th May 2014. |
- Maintenance on the diesel generator was carried out as planned on the morning of Thursday 22nd May.
- There were problems with tape access late on Tuesday and Wednesday (20/21 May). On the Tuesday morning a new tape controller server (ACSLS) had been put into operation. This change was reverted on Wednesday afternoon by putting the old server back into service.
- This morning's planned network reconfiguration, carried out during an 'at risk' on Castor, ran into some problems, causing a break in access to some disk servers for around 20 minutes. The network change itself was completed.
Resolved Disk Server Issues |
- None.
Current operational status and issues |
- The low-level problems with the WMSs reported last week are ongoing. The developers have been contacted.
Ongoing Disk Server Issues |
- None.
Notable Changes made this last week. |
- Completion of roll out of CVMFS Client version 2.1.19 to whole farm.
- This morning (28th May) arc-ce01 was updated to version 4.1.0-1.
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor (SRMs) and batch (CEs). | SCHEDULED | OUTAGE | 10/06/2014 08:50 | 10/06/2014 15:00 | 6 hours and 10 minutes | Castor and batch services down during upgrade of Castor Nameserver to version 2.1.14. |
arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/06/2014 10:00 | 02/06/2014 12:00 | 2 hours | Upgrade arc-ce02 and arc-ce03 to v. 4.1.0. |
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0. |
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change. |
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned. |
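The Duration column can be cross-checked against the Start and End columns. The following is a minimal sketch in Python, assuming the timestamps are always in day/month/year hour:minute form as shown in the table; the function name `downtime_duration` is hypothetical and is not part of the GOC DB.

```python
from datetime import datetime

# Assumed timestamp layout, matching the Start and End columns above.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the gap between two timestamps as days, hours and minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    parts = []
    for value, unit in ((delta.days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            # Pluralise the unit unless the value is exactly one.
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts) if parts else "0 minutes"


# Nameserver upgrade outage and the lcgui02 decommissioning entry.
print(downtime_duration("10/06/2014 08:50", "10/06/2014 15:00"))
print(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00"))
```

For the two entries shown, this gives 6 hours, 10 minutes and 28 days, 23 hours, which agrees with the durations listed in the table.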
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- Provisional dates for the Castor 2.1.14 upgrade: Nameserver - Tue 10th June; Stagers: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
- On Wednesday 28th May we plan to move the network switches for some Castor disk servers to the mesh network to alleviate a bottleneck. This will be during an at risk on Castor.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Castor 2.1.14 testing is largely complete, although a new minor version (2.1.14-12) will be released soon.
- Networking:
- Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 21st and 28th May 2014. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0. |
All SRMs (All Castor) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change. |
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned. |
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
105571 | Green | Less Urgent | In Progress | 2014-05-21 | 2014-05-27 | LHCb | BDII and SRM publish inconsistent storage capacity numbers |
105405 | Yellow | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration |
105308 | Yellow | Less Urgent | On Hold | 2014-05-11 | 2014-05-27 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied" |
105161 | Amber | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours) |
105100 | Red | Urgent | In Progress | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14) |
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-05-21 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
21/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 | |
22/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
23/05/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 | |
24/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
25/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
26/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
27/05/14 | 100 | 100 | 99.1 | 100 | 100 | 96 | 100 | Single SRM Get test failure. |
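The daily figures can be condensed into a single weekly value per column. The sketch below is illustrative only: it assumes a plain arithmetic mean over the seven days, which is not necessarily how availability is officially aggregated, and the helper name `weekly_mean` is hypothetical. The values are copied from the table above.

```python
from statistics import mean

# Daily percentages for 21/05/14 to 27/05/14, copied from the table.
atlas = [100, 100, 100, 100, 100, 100, 99.1]
atlas_hc = [98, 99, 97, 100, 99, 98, 96]
cms_hc = [99, 100, 99, 100, 100, 100, 100]


def weekly_mean(values):
    """Unweighted average of the daily percentages, to two decimal places."""
    return round(mean(values), 2)


for name, series in (("Atlas", atlas), ("Atlas HC", atlas_hc), ("CMS HC", cms_hc)):
    print(f"{name}: {weekly_mean(series)}")
```

The OPS, Alice, CMS and LHCb columns are 100 on every day, so their weekly mean is 100 and they are omitted from the sketch.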