Difference between revisions of "Tier1 Operations Report 2014-07-23"
From GridPP Wiki
Revision as of 10:32, 23 July 2014
RAL Tier1 Operations Report for 23rd July 2014
Review of Issues during the week 16th to 23rd July 2014. |
- The recurring crashes of the SRM processes for the Castor GEN instance, which began on Friday (11th), were resolved on Friday 18th. The cause was a malformed file name being sent that was not trapped by the relevant SRM code.
- On Thursday (17th) the Castor disk cache for AtlasTape filled up. This was traced to the garbage collector not running and was immediately fixed.
- Yesterday (Tuesday 22nd) there was a problem with the site network that effectively took the Tier1 off-air from ?? to ??. This coincided with, but was not caused by, an ongoing network update (to use the "RIP" protocol).
Resolved Disk Server Issues |
- None
Current operational status and issues |
- We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
- There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues |
- None
Notable Changes made this last week. |
- New CERN VOMS servers added.
- Castor re-pack instance being updated to version 2.1.14-13 (ongoing now).
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk. |
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.
- The FTS3 service will be updated (to v3.2.26) on Monday morning, 21st July.
- The core RAL site network will be updated to use RIP for network routing on Tuesday 22nd July.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- None.
- Networking:
- Move switches connecting the 2011 disk servers batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 16th and 23rd July 2014. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site | UNSCHEDULED | WARNING | 22/07/2014 12:00 | 22/07/2014 15:00 | 3 hours | Extending the earlier warning during update to site network routing. |
Whole Site | UNSCHEDULED | OUTAGE | 22/07/2014 10:40 | 22/07/2014 11:25 | 45 minutes | Unexpected outage during update to site network routing. |
Whole Site | UNSCHEDULED | WARNING | 22/07/2014 10:00 | 22/07/2014 11:00 | 1 hour | Extending the earlier warning during update to site network routing. |
Whole Site | SCHEDULED | WARNING | 22/07/2014 07:00 | 22/07/2014 10:00 | 3 hours | Warning during update to site network routing. |
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 21/07/2014 11:00 | 21/07/2014 13:00 | 2 hours | Update of FTS3 service to version v3.2.26. |
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk. |
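The Duration column above can be cross-checked against the Start and End columns. The following is a minimal sketch, assuming the timestamps are always in the `dd/mm/yyyy HH:MM` form shown in the table; the function name `downtime_duration` and the short entry labels are illustrative and not part of the GOC DB.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the table above (assumed to be consistent).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two table timestamps."""
    return datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)


# A few rows copied from the table; the labels are shorthand for this sketch.
entries = [
    ("Whole Site outage", "22/07/2014 10:40", "22/07/2014 11:25"),
    ("FTS3 update", "21/07/2014 11:00", "21/07/2014 13:00"),
    ("perfsonar decommissioning", "14/07/2014 11:00", "14/08/2014 11:00"),
]

for label, start, end in entries:
    print(f"{label}: {downtime_duration(start, end)}")
```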
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
106655 | Yellow | Less Urgent | In Progress | 2014-07-04 | 2014-07-16 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam) |
106324 | Red | Urgent | In Progress | 2014-06-18 | 2014-07-07 | CMS | pilots losing network connections at T1_UK_RAL |
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
16/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | |
17/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 | |
18/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 | |
19/07/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
20/07/14 | 100 | 100 | 100 | 100 | 100 | 96 | 100 | |
21/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | |
22/07/14 | 100 | 95.6 | 97.4 | 95.6 | 100 | 96 | 96 | Site networking problem. |
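For a single weekly figure per column, the daily values above can be averaged. This is a rough sketch that assumes a plain arithmetic mean over the seven days is an acceptable summary; the report itself does not state how weekly figures are derived, and the names `daily` and `weekly_mean` are illustrative.

```python
# Daily availability figures (percent) copied from the table, 16/07/14 to 22/07/14.
daily = {
    "OPS": [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 95.6],
    "Atlas": [100, 100, 100, 100, 100, 100, 97.4],
    "CMS": [100, 100, 100, 100, 100, 100, 95.6],
    "LHCb": [100, 100, 100, 100, 100, 100, 100],
    "Atlas HC": [100, 97, 97, 98, 96, 100, 96],
    "CMS HC": [98, 100, 99, 100, 100, 96, 96],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages (an assumption of this sketch)."""
    return sum(values) / len(values)


for column, values in daily.items():
    print(f"{column}: {weekly_mean(values):.1f}")
```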