Difference between revisions of "Tier1 Operations Report 2015-03-04"
From GridPP Wiki
Revision as of 11:46, 4 March 2015
RAL Tier1 Operations Report for 4th March 2015
Review of Issues during the week 25th February to 4th March 2015.
- A test of the problematic Tier1 router was carried out on Thursday morning, 19th Feb.
- Yesterday (Tuesday) there was an outage of part of Castor as some racks containing disk servers (the 2011 batches) were shut down while they were moved within the machine room to make room for new deliveries.
- There were a couple of breaks in network connectivity between 07:00 and 08:00 yesterday (Tuesday 24th Feb) while core site network switches were upgraded.
- On Wednesday (18th Feb) a single file was reported lost to CMS. This file had been picked up by the Castor checksum checker.
Resolved Disk Server Issues
- GDSS757 (CMSDisk - D1T0) failed early on Sunday morning (22nd Feb). Following checking it was returned to service the following morning.
Current operational status and issues
- We are running with a single router connecting the Tier1 network to the site network, rather than a resilient pair.
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- The 2011 disk servers were moved within the machine room to make space for new deliveries. As part of this, the connections to these servers were migrated to the Tier1 mesh network.
- On Wednesday (18th Feb) the Argus policy was updated to support the MICE reco role.
- A Castor nameserver box has been set up to enable queries against Castor metadata to be made without affecting the throughput of production work.
- A system has been set up to provide Atlas with Castor information that is not supplied by the SRM.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | SCHEDULED | WARNING | 11/03/2015 10:30 | 11/03/2015 11:00 | 30 minutes | Warning on site access during firmware updates on pair of network switches. |
lcgfts3.gridpp.rl.ac.uk, | SCHEDULED | WARNING | 05/03/2015 10:00 | 05/03/2015 12:00 | 2 hours | Warning on FTS3 service during upgrade to version v3.2.32 |
Whole site | UNSCHEDULED | WARNING | 05/03/2015 07:45 | 05/03/2015 08:30 | 45 minutes | Warning following installation of a replacement network router. |
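
The Duration column can be cross-checked against the Start and End times. The following is a minimal sketch in Python, assuming the dates are written day/month/year as elsewhere in this report. The helper names (`duration_minutes`, `declared_minutes`, `mismatches`) are illustrative only and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Rows copied from the "Declared in the GOC DB" table:
# (service, start, end, declared duration).
DOWNTIMES = [
    ("Whole site", "11/03/2015 10:30", "11/03/2015 11:00", "30 minutes"),
    ("lcgfts3.gridpp.rl.ac.uk", "05/03/2015 10:00", "05/03/2015 12:00", "2 hours"),
    ("Whole site", "05/03/2015 07:45", "05/03/2015 08:30", "45 minutes"),
]

# Assumption: timestamps are day/month/year hour:minute.
TIME_FORMAT = "%d/%m/%Y %H:%M"


def duration_minutes(start, end):
    """Return the length of the window between start and end, in minutes."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def declared_minutes(text):
    """Convert a declared duration such as '2 hours' or '45 minutes' to minutes."""
    value, unit = text.split()
    return int(value) * (60 if unit.startswith("hour") else 1)


def mismatches(rows):
    """Return the rows whose declared duration disagrees with End minus Start."""
    return [
        row for row in rows
        if duration_minutes(row[1], row[2]) != declared_minutes(row[3])
    ]


if __name__ == "__main__":
    for service, start, end, declared in DOWNTIMES:
        print(service, duration_minutes(start, end), "minutes; declared:", declared)
```

For the three entries above, `mismatches(DOWNTIMES)` returns an empty list, so the declared durations agree with the start and end times.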
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Investigate problems on the primary Tier1 router. Discussions with the vendor are ongoing.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
  - Update to Oracle 11.2.0.4
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
  - Update Castor to 2.1-14-latest.
- Networking:
  - Resolve problems with primary Tier1 Router
  - Make routing changes to allow the removal of the UKLight Router.
  - Enable the RIP protocol for updating routing tables on the Tier1 routers. (Install patch to Router software).
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting since the last report.
- None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
111856 | Green | Less Urgent | Waiting Reply | 2015-02-19 | 2015-03-03 | LHCb | Stalled LHCb jobs over night |
111699 | Green | Less Urgent | In Progress | 2015-02-10 | 2015-02-27 | Atlas | gLExec hammercloud jobs keep failing at RAL-LCG2 & RALPP |
109694 | Red | Urgent | In Progress | 2014-11-03 | 2015-03-04 | SNO+ | gfal-copy failing for files at RAL |
108944 | Red | Urgent | Waiting Reply | 2014-10-01 | 2015-03-03 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
25/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
26/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
27/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | |
28/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | |
01/03/15 | 100 | 100 | 100 | 100 | 100 | 0 | 98 | |
02/03/15 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | |
03/03/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
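
A weekly summary can be derived from the daily figures. The sketch below, in Python, takes a plain arithmetic mean of the seven daily values per column. That averaging method is an assumption made for illustration and is not necessarily how official availability figures are computed. The name `weekly_mean` is hypothetical.

```python
# Daily availability figures copied from the table, 25/02/15 to 03/03/15.
AVAILABILITY = {
    "OPS":      [100, 100, 100, 100, 100, 100, 100],
    "Alice":    [100, 100, 100, 100, 100, 100, 100],
    "Atlas":    [100, 100, 100, 100, 100, 100, 100],
    "CMS":      [100, 100, 100, 100, 100, 100, 100],
    "LHCb":     [100, 100, 100, 100, 100, 100, 100],
    "Atlas HC": [100, 100, 100, 100, 0, 100, 98],
    "CMS HC":   [99, 99, 95, 95, 98, 97, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages (an illustrative assumption)."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for column, values in AVAILABILITY.items():
        print(f"{column}: {weekly_mean(values):.1f}")
```

On this data the Atlas HC column averages about 85.4 because of the single 0 on 01/03/15, and CMS HC averages about 97.6. All other columns are 100 throughout.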