Difference between revisions of "Tier1 Operations Report 2014-07-09"
From GridPP Wiki
Revision as of 15:16, 4 July 2014
RAL Tier1 Operations Report for 9th July 2014
Review of Issues during the week 2nd to 9th July 2014.
- There were problems with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). details....
- Problems with Atlas multicore jobs on Friday 4th July....
Resolved Disk Server Issues
- None
Current operational status and issues
- None
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- Tuesday (8th July) Atlas Castor instance upgraded to version 2.1.14-13. (to be confirmed....)
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 12:00 | 1 day, 6 hours | Atlas Castor instance down for Castor 2.1.14 Stager Update |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- None.
- Networking:
- Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-atlas | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 12:00 | 1 day, 6 hours | Atlas Castor instance down for Castor 2.1.14 Stager Update |
Castor GEN: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 03/07/2014 07:45 | 03/07/2014 13:00 | 5 hours and 15 minutes | Problem with SRMs for Castor GEN instance. (However Castor itself - e.g. xroot access - working OK). |
Whole site | SCHEDULED | WARNING | 02/07/2014 10:00 | 02/07/2014 11:00 | 1 hour | RAL Tier1 site in warning state due to UPS/generator test. |
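
The Duration column above follows directly from the Start and End timestamps. As a minimal sketch of that arithmetic, the snippet below recomputes the three durations. The function name `downtime_duration` and the date format string are illustrative assumptions, not part of the GOC DB or any GridPP tooling, and its output wording ("5 hours, 15 minutes") differs slightly from the table's "5 hours and 15 minutes".

```python
from datetime import datetime

# Timestamp layout as it appears in the tables of this report (assumed: day/month/year).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as readable text.

    Hypothetical helper for illustration only; not a GOC DB API.
    """
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            # Pluralise the unit unless the value is exactly one.
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts) if parts else "0 minutes"


# The three entries listed in the table above.
print(downtime_duration("08/07/2014 06:00", "09/07/2014 12:00"))  # 1 day, 6 hours
print(downtime_duration("03/07/2014 07:45", "03/07/2014 13:00"))  # 5 hours, 15 minutes
print(downtime_duration("02/07/2014 10:00", "02/07/2014 11:00"))  # 1 hour
```

All three computed values agree with the durations declared in the table.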
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
106640 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | ILC | Failure to submit jobs to RAL-LCG2 CEs |
106610 | Green | Less Urgent | In Progress | 2014-07-02 | 2014-07-02 | HyperK | HyperK support |
106480 | Green | Less Urgent | Waiting Reply | 2014-06-25 | 2014-06-30 | dteam | Publishing meaningful Castor version |
106324 | Yellow | Urgent | In Progress | 2014-06-18 | 2014-07-01 | CMS | pilots losing network connections at T1_UK_RAL |
105571 | Red | Less Urgent | In Progress | 2014-05-21 | 2014-06-30 | LHCb | BDII and SRM publish inconsistent storage capacity numbers |
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
25/06/14 | 100 | 100 | 94.8 | 100 | 100 | 96 | 98 | Several SUM test failures (Invalid Argument). |
26/06/14 | 100 | 100 | 90.6 | 95.8 | 92.6 | 90 | 100 | LHCb Castor Stager 2.1.14 upgrade; Atlas: Several SRM test failures; CMS: Single SRM Put test failure. |
02/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
03/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
04/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
05/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
06/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
07/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
08/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |