Difference between revisions of "Tier1 Operations Report 2014-11-12"
From GridPP Wiki
Latest revision as of 15:37, 12 November 2014
RAL Tier1 Operations Report for 12th November 2014
Review of Issues during the week 5th to 12th November 2014. |
- Over the weekend there was a hardware problem on one of the database nodes behind the Atlas 3D/Frontier service. The Oracle RAC continued to function correctly using the other available nodes and there was no impact on the service.
Resolved Disk Server Issues |
- None
Current operational status and issues |
- Some problems on the Atlas Castor instance. At various times in the last couple of weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the number of reads using xroot and has led to some SAM test failures.
Ongoing Disk Server Issues |
- None
Notable Changes made this last week. |
- Doubled the uplink (10 -> 20Gbit) for the network stack used by the Tape servers serving the T10000B & C drives. The existing link was at full capacity.
- Grid Service updates: BDIIs updated to EMI 3 update 21; LB nodes updated to EMI 3 update 21; LFC updated to EMI 3 update 16.
- OS Errata rolled out to Castor GEN instance (headnodes & disk servers) this morning (12th Nov).
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Atlas, CMS & LHCb Castor instances (srm-atlas, srm-cms, srm-lhcb) | SCHEDULED | WARNING | 19/11/2014 11:00 | 19/11/2014 12:00 | 1 hour | Warning on Castor Atlas, CMS and LHCb instances during application of OS errata updates. |
Castor GEN instance | SCHEDULED | WARNING | 12/11/2014 11:00 | 12/11/2014 12:00 | 1 hour | Warning on Castor GEN instance during application of OS errata updates. |
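The Duration column can be cross-checked against the Start and End columns. The following is a minimal sketch only: the function name `downtime_duration` and the format constant are illustrative and not part of any GOC DB tooling, and the day/month/year timestamp layout is assumed from the entries above.

```python
from datetime import datetime, timedelta

# Assumed layout of the Start/End columns above: day/month/year hour:minute.
GOCDB_TIME_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the length of a declared downtime (hypothetical helper)."""
    begin = datetime.strptime(start, GOCDB_TIME_FORMAT)
    finish = datetime.strptime(end, GOCDB_TIME_FORMAT)
    if finish < begin:
        raise ValueError("end time precedes start time")
    return finish - begin


# Start and End values copied from the srm-atlas/srm-cms/srm-lhcb row above.
castor_errata = downtime_duration("19/11/2014 11:00", "19/11/2014 12:00")
print(castor_errata)
```

For the row used here the result agrees with the declared duration of 1 hour.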
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- First quarter 2015: Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room. (Provisional dates: week of 12-16 January.)
Listing by category:
- Databases:
  - A new database (Oracle RAC) has been set up to host the Atlas 3D database. This is updated from CERN via Oracle GoldenGate. This system is yet to be brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle GoldenGate.)
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update Castor headnodes to SL6.
  - Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
  - Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric:
  - Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (expected first quarter 2015).
Entries in GOC DB starting between the 5th and 12th November 2014. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor GEN instance | SCHEDULED | WARNING | 12/11/2014 11:00 | 12/11/2014 12:00 | 1 hour | Warning on Castor GEN instance during application of OS errata updates. |
Whole site. | UNSCHEDULED | WARNING | 05/11/2014 09:00 | 05/11/2014 10:00 | 1 hour | Putting site At Risk for a reboot of network router. Anticipate only two very short (few seconds) breaks in connectivity. |
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
110079 | Green | Very Urgent | In Progress | 2014-11-12 | 2014-11-12 | T2K | ARC CEs no longer accepting jobs from WMS |
110028 | Green | Less Urgent | Waiting for Reply | 2014-11-10 | 2014-11-11 | CMS | Phedex transfer problems from T1_UK_RAL_Buffer |
109712 | Green | Urgent | On Hold | 2014-10-29 | 2014-11-10 | CMS | Glexec exited with status 203; ... |
109694 | Green | Urgent | On Hold | 2014-11-03 | 2014-11-06 | SNO+ | gfal-copy failing for files at RAL |
108944 | Yellow | Urgent | On Hold | 2014-10-01 | 2014-11-03 | CMS | AAA access test failing at T1_UK_RAL |
107935 | Red | Less Urgent | On Hold | 2014-08-27 | 2014-11-03 | Atlas | BDII vs SRM inconsistent storage capacity numbers |
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-10-13 | CMS | pilots losing network connections at T1_UK_RAL |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
05/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
06/11/14 | 100 | 100 | 89.3 | 100 | 100 | 100 | 100 | Two blocks of test failures (06:00-07:00 & 09:00-10:00) with error 'Handling Timeout'. |
07/11/14 | 100 | 100 | 96.5 | 100 | 100 | 100 | 100 | Several SRM PUT test failures (timeouts). |
08/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
09/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
10/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
11/11/14 | 100 | 100 | 98.3 | 100 | 100 | 100 | 98 | Two SRM test failures one after the other. |
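To summarise the week in a single figure per column, the daily values can be averaged. This is a rough sketch only: it takes a plain unweighted mean of the daily percentages, which is an assumption and not necessarily how the official availability figures are aggregated. The helper name `weekly_mean` is illustrative.

```python
from statistics import mean

# Daily percentages copied from the table above (05/11/14 to 11/11/14).
# Only the two columns that dropped below 100 during the week are listed.
availability = {
    "Atlas": [100, 89.3, 96.5, 100, 100, 100, 98.3],
    "CMS HC": [100, 100, 100, 100, 100, 100, 98],
}


def weekly_mean(values):
    """Unweighted mean of the daily figures, rounded to one decimal place."""
    return round(mean(values), 1)


for name, values in availability.items():
    print(name, weekly_mean(values))
```

All other columns were at 100 on every day of the week, so their mean is 100.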