Difference between revisions of "Tier1 Operations Report 2014-09-17"
From GridPP Wiki
Revision as of 11:08, 17 September 2014
RAL Tier1 Operations Report for 17th September 2014
Review of Issues during the week 10th to 17th September 2014.
- On Saturday (13th Sep) there was a problem with the Atlas Castor instance that persisted into the beginning of Sunday. A number of measures were taken to improve it, although the root cause remains unknown.
- For the second half of last week there were problems with cream-ce02.
- This morning (Wednesday 17th Sep) some machines that run as VMs lost networking. Restarting the network fixed the problem, which is similar to one seen on the 30th August. The network interface configuration on these systems has been changed to work around this.
Resolved Disk Server Issues
- None
Current operational status and issues
- None.
Ongoing Disk Server Issues
- None.
Notable Changes made this last week.
- VO Londongrid enabled on LFC.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- Access to the Cream CEs will be withdrawn for all VOs except ALICE. The proposed date for this is Tuesday 30th September.
Listing by category:
- Databases:
- Apply latest Oracle patches (PSU) to the production database systems (Castor, LFC).
- A new database (Oracle RAC) is being set up that will host the Atlas3D database and be updated from CERN via Oracle GoldenGate.
- Switch LFC/3D to new Database Infrastructure.
- Castor:
- Update Castor headnodes to SL6.
- Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Networking:
- Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric:
- Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (Expected first quarter 2015).
Entries in GOC DB starting between the 10th and 17th September 2014.
- None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
108546 | Green | Less Urgent | In Progress | 2014-09-16 | 2014-09-16 | Atlas | RAL-LCG2_HIMEM_SL6: production jobs failed |
107935 | Yellow | Less Urgent | In Progress | 2014-08-27 | 2014-09-02 | Atlas | BDII vs SRM inconsistent storage capacity numbers |
107880 | Amber | Less Urgent | In Progress | 2014-08-26 | 2014-09-02 | SNO+ | srmcp failure |
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-08-14 | CMS | pilots losing network connections at T1_UK_RAL |
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-09-12 | | Please check your Vidyo router firewall configuration |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
10/09/14 | 100 | 100 | 99.2 | 100 | 100 | 96 | 97 | Single SRM test failure on GET - [SRM_FILE_BUSY] |
11/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
12/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | |
13/09/14 | 100 | 100 | 82.2 | 100 | 100 | 54 | 99 | Problems with Atlas Castor instance |
14/09/14 | 100 | 100 | 91.8 | 100 | 100 | 84 | 98 | Problems with Atlas Castor instance (continued) |
15/09/14 | 100 | 100 | 100 | 100 | 100 | 99 | 98 | |
16/09/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 | |
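
The daily figures in the availability table can be summarised with a short script. The following is a minimal sketch, not part of the Tier1 tooling: the helper names `weekly_means` and `days_below` and the 95% threshold are hypothetical, chosen only for illustration, and the numbers are copied by hand from the table.

```python
# Daily availability figures (percent) copied from the availability table.
DAYS = ["10/09/14", "11/09/14", "12/09/14", "13/09/14",
        "14/09/14", "15/09/14", "16/09/14"]

AVAILABILITY = {
    "OPS":      [100, 100, 100, 100, 100, 100, 100],
    "Alice":    [100, 100, 100, 100, 100, 100, 100],
    "Atlas":    [99.2, 100, 100, 82.2, 91.8, 100, 100],
    "CMS":      [100, 100, 100, 100, 100, 100, 100],
    "LHCb":     [100, 100, 100, 100, 100, 100, 100],
    "Atlas HC": [96, 100, 100, 54, 84, 99, 98],
    "CMS HC":   [97, 99, 96, 99, 98, 98, 99],
}


def weekly_means(table):
    """Return the arithmetic mean of each column over the week."""
    return {name: sum(values) / len(values) for name, values in table.items()}


def days_below(table, days, threshold):
    """Return, per column, the days whose value is below `threshold`.

    The threshold is an illustrative assumption, not a Tier1 target.
    """
    return {
        name: [day for day, value in zip(days, values) if value < threshold]
        for name, values in table.items()
    }


for name, mean in weekly_means(AVAILABILITY).items():
    print(f"{name:9s} weekly mean: {mean:6.2f}")

for name, low_days in days_below(AVAILABILITY, DAYS, 95).items():
    if low_days:
        print(f"{name:9s} below 95 on: {', '.join(low_days)}")
```

With a threshold of 95, the only days flagged are the 13th and 14th for Atlas and Atlas HC, matching the Atlas Castor problems noted in the comment column.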