Difference between revisions of "Tier1 Operations Report 2014-11-26"
From GridPP Wiki
Latest revision as of 15:28, 26 November 2014
RAL Tier1 Operations Report for 26th November 2014
Review of Issues during the week 19th to 26th November 2014.
- On the evening of Tuesday 18th, the CMS transfer manager machine (lcgclsf02) failed. The services failed over to the backup. A replacement machine was prepared and put into service that afternoon. The following day the original system was fixed and returned to service.
- In the early hours of Friday 21st Nov. there was a problem of locking sessions in the Castor database that affected CMS & LHCb. Whilst this was transitory, the cause has been understood and a fix will be provided in a future version of Castor.
Resolved Disk Server Issues
- GDSS673 (LhcbRawDst - D0T1) failed during the evening of Friday 14th November. The server was returned to production around midday on Thursday 20th Nov.
Current operational status and issues
- Some problems on the Atlas Castor instance. At various times in recent weeks the Atlas workload has led to differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the number of reads using xroot and has led to some SAM test failures.
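As an illustration of what a "wait i/o" measurement involves, the sketch below estimates the fraction of CPU time spent in iowait between two snapshots of /proc/stat. It assumes the disk servers run Linux. The function names and the 5 second sampling interval are hypothetical, and this is not a description of the monitoring actually used at the Tier1.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two /proc/stat snapshots."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # Guest time is already counted inside user/nice, so only the first
    # eight counters are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter


if __name__ == "__main__":
    # Hypothetical usage: sample the local machine twice, 5 seconds apart.
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(5)
    with open("/proc/stat") as f:
        second = f.read()
    print(f"iowait over interval: {iowait_fraction(first, second):.1%}")
```

A persistently high value across a group of servers would be consistent with the behaviour described above, although it does not by itself identify xroot reads as the cause.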
Ongoing Disk Server Issues
- None.
Notable Changes made this last week.
- Latest WMS updates (EMI 3 update 22) applied to WMSs.
- FTS3 upgraded to 3.2.30.
- OS Errata updates applied to all Castor systems (apart from GEN instance for which this had already been done).
Declared in the GOC DB
- None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- First quarter 2015: Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room. (Provisional dates: week of 12-16 January.)
- Castor headnode upgrades to SL6: (Assume 4 hour outage of Castor instance in each case for stager updates).
  - Tuesday 2nd Dec - LHCb; Tuesday 9th Dec - CMS; Wednesday 10th Dec - Atlas; Wednesday 7th Jan - GEN; Thursday 8th Jan - Nameserver (transparent - at risk)
Listing by category:
- Databases:
  - A new database (Oracle RAC) has been set up to host the Atlas 3D database. This is updated from CERN via Oracle GoldenGate. This system is yet to be brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle GoldenGate.)
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update Castor headnodes to SL6.
  - Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
  - Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric:
  - Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (expected first quarter 2015).
Entries in GOC DB starting between the 19th and 26th November 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor (srm endpoints) | SCHEDULED | WARNING | 25/11/2014 11:00 | 25/11/2014 12:00 | 1 hour | At risk on some Castor instances while we deploy errata updates |
Whole site | UNSCHEDULED | WARNING | 25/11/2014 07:00 | 25/11/2014 08:00 | 1 hour | Site warning during firewall configuration change. |
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 24/11/2014 10:00 | 24/11/2014 12:00 | 2 hours | At risk for FTS3 upgrade to 3.2.30 |
srm-cms-disk.gridpp, srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 20/11/2014 14:00 | 20/11/2014 15:00 | 1 hour | At risk while we return a CMS Castor headnode to production |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
110360 | Green | Less Urgent | Waiting for Reply | 2014-11-25 | 2014-11-25 | SNO+ | Jobs failing to load cvmfs data |
110244 | Green | Less Urgent | In Progress | 2014-11-19 | 2014-11-24 | None | LFC Webdav support |
109712 | Green | Urgent | On Hold | 2014-10-29 | 2014-11-10 | CMS | Glexec exited with status 203; ... |
109694 | Green | Urgent | On Hold | 2014-11-03 | 2014-11-24 | SNO+ | gfal-copy failing for files at RAL |
109276 | Green | Urgent | In Progress | 2014-11-22 | 2014-11-24 | CMS | Submissions to RAL FTS3 REST interface are failing for some users (re-opened) |
108944 | Amber | Urgent | In Progress | 2014-10-01 | 2014-11-24 | CMS | AAA access test failing at T1_UK_RAL |
107935 | Red | Less Urgent | On Hold | 2014-08-27 | 2014-11-03 | Atlas | BDII vs SRM inconsistent storage capacity numbers |
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-10-13 | CMS | pilots losing network connections at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
19/11/14 | 100 | 100 | 98.3 | 92.8 | 100 | 100 | n/a | SRM test failures. For CMS there was a general problem affecting multiple sites. |
20/11/14 | 100 | 100 | 99.2 | 99.3 | 100 | 99 | n/a | Single SRM test failures. |
21/11/14 | 100 | 100 | 95.8 | 92.4 | 91.7 | 99 | n/a | SRM test failures caused by problem of locking sessions in database. |
22/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a | |
23/11/14 | 100 | 100 | 100 | 100 | 100 | 87 | 95 | |
24/11/14 | 100 | 100 | 100 | 100 | 95.8 | 96 | n/a | Single SRM test failure. |
25/11/14 | 100 | 100 | 99.1 | 100 | 100 | 100 | n/a | Expired proxy affected Atlas tests for many sites. |
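The daily figures above can be summarised per VO. The sketch below takes a simple unweighted mean of the seven daily values, which is an assumption made for illustration; the official availability calculation may weight or aggregate differently. The names `ROWS`, `VOS` and `weekly_mean` are hypothetical, and the numbers are copied from the table.

```python
# Daily availability (%) from the table: day, OPS, Alice, Atlas, CMS, LHCb.
ROWS = [
    ("19/11/14", 100, 100, 98.3, 92.8, 100),
    ("20/11/14", 100, 100, 99.2, 99.3, 100),
    ("21/11/14", 100, 100, 95.8, 92.4, 91.7),
    ("22/11/14", 100, 100, 100, 100, 100),
    ("23/11/14", 100, 100, 100, 100, 100),
    ("24/11/14", 100, 100, 100, 100, 95.8),
    ("25/11/14", 100, 100, 99.1, 100, 100),
]
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")


def weekly_mean(rows):
    """Return the unweighted mean availability per VO over the given rows."""
    means = {}
    for column, vo in enumerate(VOS, start=1):
        means[vo] = sum(row[column] for row in rows) / len(rows)
    return means


if __name__ == "__main__":
    for vo, value in weekly_mean(ROWS).items():
        print(f"{vo:6s} {value:6.1f}")
```

The HammerCloud columns are left out because CMS HC is reported as n/a on most days.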