Latest revision as of 12:21, 28 January 2015
RAL Tier1 Operations Report for 28th January 2015
Review of Issues during the week 21st to 28th January 2015.
- A test was carried out on the problematic router on Thursday afternoon (22nd); it failed within a few minutes of taking over as the master. A manual flip back to the other router was then carried out, which caused a 5-minute break in network connectivity to the Tier1.
- There was a problem with the LHCb SRMs yesterday (Tuesday 27th Jan). Some processes did not reconnect to the databases after the reboots required to pick up security updates.
Resolved Disk Server Issues
Current operational status and issues
- We are running with a single router connecting the Tier1 network to the site network, rather than a resilient pair.
Ongoing Disk Server Issues
Notable Changes made this last week.
- The safety checking of the electrical power circuits in the machine room has been completed.
- The migration of all data off T10000A & B media has been completed.
- On Friday (23rd Jan) the FTS3 service was upgraded to version 3.2.31.
- On Monday (26th Jan) some redundant Castor stager schemas were cleaned up.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor CMS & GEN instances (srm-alice, srm-biomed, srm-cms, srm-cms-disk, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-pheno, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | WARNING | 29/01/2015 09:00 | 29/01/2015 16:00 | 7 hours | Warning while patching Castor disk servers
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/01/2015 09:00 | 28/01/2015 16:00 | 7 hours | Warning while patching Castor disk servers
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Investigate problems on the primary Tier1 router. Discussions with the vendor are ongoing.
- Track appropriate security updates.
- Move of connection for CERN Backup link on Tuesday 3rd Feb.
Listing by category:
- Databases:
  - Application of Oracle PSU patches to database systems.
  - A new database (Oracle RAC) has been set up to host the Atlas 3D database. This is updated from CERN via Oracle GoldenGate. This system is yet to be brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle GoldenGate.)
  - Switch LFC/3D to new Database Infrastructure.
  - Update to Oracle 11.2.0.4
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
  - Update Castor to 2.1-14-latest.
- Networking:
  - Resolve problems with primary Tier1 Router
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
  - Enable the RIP protocol for updating routing tables on the Tier1 routers. (Install patch to Router software).
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting between the 21st and 28th January 2015.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/01/2015 09:00 | 28/01/2015 16:00 | 7 hours | Warning while patching Castor disk servers
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | WARNING | 27/01/2015 09:00 | 27/01/2015 11:52 | 2 hours and 52 minutes | Warning while patching Castor disk servers
lcgfts3 | UNSCHEDULED | WARNING | 23/01/2015 10:00 | 23/01/2015 11:00 | 1 hour | Upgrade of FTS3 service to version 3.2.31
Whole site | SCHEDULED | WARNING | 20/01/2015 08:30 | 22/01/2015 18:00 | 2 days, 9 hours and 30 minutes | Warning during safety checks on power circuits in machine room. Testing carried out during working hours on each day.
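The Duration column in the entries above follows directly from the Start and End timestamps. As a minimal sketch of that arithmetic (the helper names `downtime_duration` and `describe` are illustrative only and are not part of the GOC DB or any Tier1 tooling), the values can be re-derived like this:

```python
from datetime import datetime, timedelta

# Timestamps are assumed to use the day/month/year layout shown in the table.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the length of a downtime window from its start and end strings."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def describe(delta: timedelta) -> str:
    """Render a timedelta as days, hours and minutes, skipping zero parts."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    if not parts:
        return "0 minutes"
    if len(parts) == 1:
        return parts[0]
    return ", ".join(parts[:-1]) + " and " + parts[-1]


if __name__ == "__main__":
    # The "Whole site" entry from the table above.
    print(describe(downtime_duration("20/01/2015 08:30", "22/01/2015 18:00")))
```

Run against the four entries listed, this reproduces the durations as reported, which is a quick consistency check on hand-copied downtime records.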
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
111347 | Green | Urgent | Waiting Reply | 2015-01-22 | 2015-01-26 | CMS | T1_UK_RAL Consistency Check (January 2015)
111120 | Green | Less Urgent | Waiting Reply | 2015-01-12 | 2015-01-22 | Atlas | large transfer errors from RAL-LCG2 to BNL-OSG2
109694 | Red | Urgent | On hold | 2014-11-03 | 2015-01-20 | SNO+ | gfal-copy failing for files at RAL
108944 | Red | Urgent | In Progress | 2014-10-01 | 2015-01-27 | CMS | AAA access test failing at T1_UK_RAL
107935 | Red | Less Urgent | On Hold | 2014-08-27 | 2015-01-20 | Atlas | BDII vs SRM inconsistent storage capacity numbers
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
21/01/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
22/01/15 | 100 | 100 | 99 | 100 | 100 | 100 | 99 | Single SRM test failure - "Handling Timeout".
23/01/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
24/01/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
25/01/15 | 100 | 100 | 100 | 100 | 96 | 100 | 97 | Single SRM test failure on 'list': No such file or directory
26/01/15 | 100 | 100 | 100 | 100 | 100 | 100 | 98 |
27/01/15 | 100 | 100 | 100 | 100 | 88 | 100 | 98 | SRM test failures - problem on SRMs after reboots for security update.