Revision as of 08:28, 19 August 2015
RAL Tier1 Operations Report for 19th August 2015
Review of Issues during the week 12th to 19th August 2015.
- There was a site network outage on Saturday 8th August. The Tier1 was affected from approximately 07:30 until 10:00. The issue was resolved when a member of the network team came on site and re-seated a card in a router.
Resolved Disk Server Issues
- None.
Current operational status and issues
- The post mortem review of the network incident on the 8th April is being finalised.
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked.
- There are some ongoing issues for CMS: a problem with Xroot (AAA) redirection when accessing Castor; slow file open times using Xroot; and poor batch job efficiencies.
Ongoing Disk Server Issues
- gdss720 (part of ATLASDATADISK) crashed on Tuesday 11th. The machine is currently being drained so the fabric team can replace some components.
Notable Changes made since the last meeting.
- Deployed changes to remove the glite-CLUSTER node from the information system, and shut down cream-ce01 and cream-ce02.
- Atlas have transferred a share of their FTS service back to RAL.
- The test of the updated worker node configuration (with grid middleware delivered via CVMFS) continues on one whole batch of worker nodes. We are now draining a second batch of worker nodes.
- Investigative work continues into the ongoing issues with CMS Castor. We have now changed the I/O scheduler on the disk servers.
Declared in the GOC DB
None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of Castor disk servers to SL6. We plan to do this during the second part of August (17-28). Rollout plan to be discussed.
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Exact dates are still to be confirmed; we are looking at the following days in September, although these are likely to be revised:
- Tuesday 8th: day's outage for Atlas & GEN.
- Tuesday 15th: at risk for Atlas & GEN.
- Thursday 17th: day's outage for ALL instances.
- Tuesday 22nd: day's at risk for ALL instances.
- Thursday 24th: half-day outage for ALL instances.
- Extending the rollout of the new worker node configuration.
Listing by category:
- Databases:
- Switch LFC/3D to new Database Infrastructure.
- Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor, Atlas Frontier (LFC done)
- Castor:
- Update SRMs to new version (includes updating to SL6).
- Update the Oracle databases behind Castor to version 11.2.0.4. This will require some downtime (see above).
- Update disk servers to SL6.
- Update to Castor version 2.1.15.
- Networking:
- Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
- Make routing changes to allow the removal of the UKLight Router.
- Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
- Firmware updates on the remaining EMC disk arrays (Castor, FTS/LFC).
Entries in GOC DB starting since the last report.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
---|---|---|---|---|---|---
Whole site | UNSCHEDULED | WARNING | 18/08/2015 08:30 | 18/08/2015 10:00 | 1 hour and 30 minutes | Warning during housekeeping activities on network router. No break in connectivity expected.
Open GGUS Tickets (snapshot during morning of meeting)

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
---|---|---|---|---|---|---|---
115573 | Green | Urgent | In Progress | 2015-08-07 | 2015-08-11 | CMS | T1_UK_RAL Consistency Check (August 2015)
115512 | | Very Urgent | Waiting for Reply | 2015-08-05 | 2015-08-12 | LHCb | User cannot submit jobs directly to RAL
115434 | Green | Less Urgent | Waiting for Reply | 2015-08-03 | 2015-08-07 | SNO+ | glite-wms-job-status warning
115387 | Green | Less Urgent | In Progress | 2015-08-03 | 2015-08-11 | SNO+ | XRootD for SNO+ from RAL
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 | | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes
108944 | Red | Less Urgent | In Progress | 2014-10-01 | 2015-07-17 | CMS | AAA access test failing at T1_UK_RAL
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud

Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
---|---|---|---|---|---|---|---|---
12/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
13/08/15 | 97.2 | 92.0 | 100 | 96.0 | 96.0 | 98 | 100 | CRL problem with certificate dated in the future; CMS: single SRM test failure; LHCb: CEs putting incorrect information into the BDII.
14/08/15 | 100 | 100 | 100 | 96.0 | 100 | 100 | 100 | Single SRM test failure.
15/08/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
16/08/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
17/08/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
18/08/15 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |