Revision as of 10:25, 21 May 2014
RAL Tier1 Operations Report for 21st May 2014
Review of Issues during the week 14th to 21st May 2014.
- The grumbly problems with the WMSs reported last week are ongoing. The developers have been contacted.
- The problem reported last week, in which some half dozen Atlas files were lost during the draining of a disk server at the end of February, is now understood. This was an isolated incident. Draining has been resumed.
- There was a problem with FTS2 last Friday (16th), which led to a ticket from CMS.
Resolved Disk Server Issues
Current operational status and issues
Ongoing Disk Server Issues
Notable Changes made this last week.
- CVMFS Client version 2.1.19 being rolled out to whole farm following successful testing.
- One new disk server has been deployed to CMS disk. (This replaced a server that failed a couple of weeks ago).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Provisional dates for the Castor 2.1.14 upgrade: Nameserver: Tuesday 10th June; Stagers: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing was largely complete, although a new minor version (2.1.14-12) will be released soon.
- Networking:
  - Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 14th and 21st May 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb-tape.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 20/05/2014 08:00 | 20/05/2014 11:00 | 3 hours | Outage of tape system for update of tape library controller. (Postponed from 13th May).
All SRM end points | SCHEDULED | WARNING | 20/05/2014 08:00 | 20/05/2014 11:00 | 3 hours | Outage of tape system for update of tape library controller. (Postponed from 13th May).
lcgvo08.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 15:00 | 15/05/2014 16:00 | 1 hour | Downtime for system maintenance
lcgvo07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 14:00 | 15/05/2014 15:00 | 1 hour | Downtime for system maintenance
lcglb02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 13:00 | 15/05/2014 14:00 | 1 hour | Downtime for system maintenance
lcglb01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 10:45 | 15/05/2014 11:45 | 1 hour | Downtime for system maintenance
lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 15/05/2014 09:30 | 15/05/2014 10:30 | 1 hour | Downtime for system maintenance
lcgwms06.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 14:00 | 14/05/2014 15:20 | 1 hour and 20 minutes | Downtime for system maintenance
lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 11:30 | 14/05/2014 13:30 | 2 hours | Downtime for system maintenance
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 10:00 | 14/05/2014 12:00 | 2 hours | FTS3 service at RAL unavailable for update to version 3.2.22
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/05/2014 09:30 | 14/05/2014 10:46 | 1 hour and 16 minutes | Downtime for system maintenance
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
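The Duration column follows directly from the Start and End columns, so it can be cross-checked. The following is a minimal sketch, assuming the timestamps use the dd/mm/yyyy hh:mm layout shown in the table. The names `GOCDB_FORMAT` and `downtime_duration` are illustrative only and are not part of any GOC DB tool; the output wording (comma-separated parts) is also a choice made here and differs slightly from the "1 hour and 20 minutes" style used in the table.

```python
from datetime import datetime

# Assumed timestamp layout of the Start and End columns: day/month/year hour:minute.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two table timestamps as readable text."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    # Split the total into whole days, then hours, then leftover minutes.
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts) if parts else "0 minutes"


# The lcgui02 decommissioning entry from the table.
print(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00"))
```

For the lcgui02 entry this gives 28 days, 23 hours, matching the value recorded in the table.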
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
105493 | Green | Urgent | In Progress | 2014-05-16 | 2014-05-16 | CMS | Failing transfers from T1_UK_RAL_Buffer to many sites
105405 | Green | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration
105308 | Green | Less Urgent | On Hold | 2014-05-11 | 2014-05-19 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied"
105161 | Yellow | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours)
105100 | Green | Urgent | On Hold | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14)
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-05-20 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
14/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
15/05/14 | 100 | 100 | 100 | 100 | 100 | 94 | 96 |
16/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
17/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
18/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
19/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
20/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |