Difference between revisions of "Tier1 Operations Report 2014-06-18"
From GridPP Wiki
Revision as of 08:08, 18 June 2014
RAL Tier1 Operations Report for 18th June 2014
Review of Issues during the week 11th to 18th June 2014. |
- There was a problem with cream-ce01 which started failing tests on Friday. It was put into a downtime and drained out over the weekend after which its database was reset. It was returned to service on Monday (16th).
- The CMS Castor stager update to version 2.1.14-13 took place yesterday (Tuesday) as planned. There were some difficulties caused by the disk server re-balancer initially after the upgrade. However, these were understood and resolved within the announced outage. Nevertheless the service was placed in a 'warning' in the GOC DB for the rest of the day and overnight.
Resolved Disk Server Issues |
- GDSS586 (AtlasDataDisk - D1T0) failed to restart after kernel/errata updates applied during the Castor update on 10th June. It was returned to production just before this meeting (11th June).
Current operational status and issues |
- None
Ongoing Disk Server Issues |
- None
Notable Changes made this last week. |
- Yesterday (10th June) the Castor nameserver was updated to version 2.1.14-13. While Castor was down the opportunity was taken to update the firmware in some network switches and apply kernel/errata updates to the Castor disk servers.
- A new ARC CE, arc-ce04 has been brought into production.
- ARC CEs arc-ce02 & arc-ce03 have been upgraded to version 4.1.0. (All ARC CEs now updated).
- The host certificate on arc-ce02 has been updated and the new certificate is SHA-2 signed. There was an initial error during the application of this certificate, but that was corrected and the service is now running OK with the new (SHA-2) certificate.
- On the 2nd June LHCb access was removed from cream-ce01.
- Today (11th June) a new tape controller system (ACSLS) is being installed. There have been some problems with the new server. However, the last test (last week) was successful.
Declared in the GOC DB |
- None
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- Dates for the Castor 2.1.14 stager upgrades: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - The Castor 2.1.14 upgrade is underway. Some checks are ongoing before finally deciding whether the stagers will go directly to minor version 2.1.14-13 (rather than 2.1.14-11 as previously planned).
  - CIP-2.2.15 publishes online resources correctly for CASTOR 2.1.14 but not nearline resources (they appear as zero, due to a modification to how data is held in CASTOR). CIP will be updated. At the same time we can double check the Rajanian Problem, which was understood and fixed earlier.
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 11th and 18th June 2014. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor (all SRM endpoints) and batch (all CEs) | SCHEDULED | OUTAGE | 10/06/2014 08:50 | 10/06/2014 12:45 | 3 hours and 55 minutes | Castor and batch services down during upgrade of Castor Nameserver to version 2.1.14. |
arc-ce04.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 10/06/2014 07:05 | 10/06/2014 12:45 | 5 hours and 40 minutes | Stopping work on new CE around upgrade of Castor Storage System. |
Castor (all SRM endpoints) and batch (all CEs) | SCHEDULED | OUTAGE | 10/06/2014 06:50 | 10/06/2014 08:50 | 2 hours | Castor and batch services down for Networking Change before upgrade of Castor Nameserver to version 2.1.14. |
cream-ce01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 07/06/2014 10:00 | 10/06/2014 12:00 | 3 days, 2 hours | EMI-3 update 14 upgrade |
srm-lhcb-tape.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 05/06/2014 15:25 | 05/06/2014 16:42 | 1 hour and 17 minutes | Problem affecting disk cache in front of tape. |
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 05/06/2014 15:25 | 05/06/2014 16:41 | 1 hour and 16 minutes | Problem affecting disk cache in front of tape. Non-Tape service classes unaffected. |
All srm endpoints | SCHEDULED | WARNING | 04/06/2014 08:00 | 04/06/2014 17:00 | 9 hours | Warning (At Risk) on tape systems during testing of new tape library controller. |
arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/06/2014 10:00 | 02/06/2014 12:00 | 2 hours | Upgrade arc-ce02 and arc-ce03 to v. 4.1.0. |
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0. |
All Castor (all srm endpoints): srm-alice, srm-atlas, srm-biomed, srm-cert, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-lhcb, srm-mice, srm-minos, srm-na62, srm-preprod, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change. |
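The Duration column above can be cross-checked against the Start and End timestamps. The following is a minimal Python sketch, assuming the `DD/MM/YYYY HH:MM` layout shown in the table. The function name `downtime_duration` is hypothetical and is not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the table above (day/month/year, 24-hour clock).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps from the table."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


# Two rows copied from the table: the Castor nameserver outage
# and the cream-ce01 EMI-3 upgrade.
castor = downtime_duration("10/06/2014 08:50", "10/06/2014 12:45")
cream = downtime_duration("07/06/2014 10:00", "10/06/2014 12:00")
print(castor, cream)
```

Both results agree with the durations listed for those rows (3 hours and 55 minutes; 3 days, 2 hours).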
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
105571 | Amber | Less Urgent | In Progress | 2014-05-21 | 2014-06-02 | LHCb | BDII and SRM publish inconsistent storage capacity numbers |
105405 | Red | Urgent | In Progress | 2014-05-14 | 2014-06-10 | | please check your Vidyo router firewall configuration |
105100 | Red | Less Urgent | In Progress | 2014-05-02 | 2014-05-30 | CMS | T1_UK_RAL Consistency Check (May14) |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-06-17 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
11/06/14 | 100 | 91.3 | 100 | 99.3 | 100 | 99 | 99 | Problem with Argus server. |
12/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | |
13/06/14 | 100 | 100 | 95.3 | 100 | 100 | 98 | 99 | |
14/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
15/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
16/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
17/06/14 | 100 | 100 | 100 | 72.9 | 100 | 96 | 100 | CMS Castor stager 2.1.14 upgrade. |
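For a single figure per VO, the daily values above can be averaged over the week. The sketch below is a simple unweighted mean of the seven daily percentages copied from the table. This is an assumption made for illustration and may not match how official availability figures are aggregated. The names `availability` and `weekly_mean` are hypothetical.

```python
from statistics import mean

# Daily availability percentages copied from the table above
# (11/06/14 to 17/06/14, in row order).
availability = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [91.3, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 95.3, 100, 100, 100, 100],
    "CMS":   [99.3, 100, 100, 100, 100, 100, 72.9],
    "LHCb":  [100, 100, 100, 100, 100, 100, 100],
}

# Unweighted mean over the seven days for each VO.
weekly_mean = {vo: mean(days) for vo, days in availability.items()}

for vo, value in weekly_mean.items():
    print(f"{vo}: {value:.1f}")
```

The dip for CMS comes almost entirely from 17/06/14, the day of the stager upgrade.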