RAL Tier1 Operations Report for 17th September 2014
Review of Issues during the week 10th to 17th September 2014.
- On Saturday (13th Sep) there was a problem with the Atlas Castor instance that persisted into the beginning of Sunday. A number of measures were taken to improve it, although the root cause remains unknown.
- For the second half of last week there were problems with cream-ce02.
- This morning (Wednesday 17th Sep) there was a problem with some machines that run as VMs: their networking stopped. Restarting the network fixed the problem. This is similar to a problem seen on the 30th August. The configuration of the network interface on these systems has been changed to work around this.
Resolved Disk Server Issues
Current operational status and issues
Ongoing Disk Server Issues
Notable Changes made this last week.
- VO Londongrid enabled on LFC.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- Access to the Cream CEs will be withdrawn for all VOs except ALICE. The proposed date for this is Tuesday 30th September.
Listing by category:
- Databases:
  - Apply latest Oracle patches (PSU) to the production database systems (Castor, LFC).
  - A new database (Oracle RAC) is being set up that will host the Atlas3D database and be updated from CERN via Oracle GoldenGate.
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update Castor headnodes to SL6.
  - Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
  - Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric:
  - Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (expected first quarter 2015).
Entries in GOC DB starting between the 10th and 17th September 2014.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
108546 | Green | Less Urgent | In Progress | 2014-09-16 | 2014-09-16 | Atlas | RAL-LCG2_HIMEM_SL6: production jobs failed
107935 | Yellow | Less Urgent | In Progress | 2014-08-27 | 2014-09-02 | Atlas | BDII vs SRM inconsistent storage capacity numbers
107880 | Amber | Less Urgent | In Progress | 2014-08-26 | 2014-09-02 | SNO+ | srmcp failure
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-08-14 | CMS | pilots losing network connections at T1_UK_RAL
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-09-12 | | Please check your Vidyo router firewall configuration
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
10/09/14 | 100 | 100 | 99.2 | 100 | 100 | 96 | 97 | Single SRM test failure on GET - [SRM_FILE_BUSY]
11/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
12/09/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 |
13/09/14 | 100 | 100 | 82.2 | 100 | 100 | 54 | 99 | Problems with Atlas Castor instance
14/09/14 | 100 | 100 | 91.8 | 100 | 100 | 84 | 98 | Problems with Atlas Castor instance (continued)
15/09/14 | 100 | 100 | 100 | 100 | 100 | 99 | 98 |
16/09/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 |