Difference between revisions of "Tier1 Operations Report 2014-06-25"
From GridPP Wiki
Revision as of 08:54, 25 June 2014
RAL Tier1 Operations Report for 25th June 2014
Review of Issues during the week 18th to 25th June 2014.
- On Tuesday morning (24th June) the database system for the Atlas SRM crashed and failed over to another node. It was subsequently moved back to the correct RAC node.
Resolved Disk Server Issues
- None
Current operational status and issues
- There have been some problems with xroot following the CMS Castor stager update to version 2.1.14-13 last Tuesday (17th June). The current CMS workload exhausts the number of available xroot slots on some disk servers and then fails over to use the re-director (proxy) to source the files elsewhere. Tuning made a significant improvement but the issue remains.
- There are ongoing problems with xroot on AliceDisk since the Castor 2.1.14 update.
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- On Monday (23rd June) cream-ce02 was upgraded to EMI-3 update 14.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 26/06/2014 10:00 | 26/06/2014 16:00 | 6 hours | LHCb Castor instance down for Castor 2.1.14 Stager Update |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Dates for the Castor 2.1.14 stager upgrades: LHCb - Thu 26th June; Atlas - Tues 1st July.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- The Castor 2.1.14 upgrade is underway.
- The CIP is compatible with Castor version 2.1.14. There is an issue reported by LHCb to be investigated.
- Networking:
- Move the switches connecting the 2011 disk server batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 18th and 25th June 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor GEN instance: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k. | SCHEDULED | OUTAGE | 24/06/2014 09:30 | 24/06/2014 17:00 | 7 hours and 30 minutes | Castor GEN instance down for Castor 2.1.14 Stager Update. |
cream-ce02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/06/2014 11:00 | 24/06/2014 12:00 | 3 days, 1 hour | EMI-3 update 14 |
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 17/06/2014 16:30 | 18/06/2014 12:00 | 19 hours and 30 minutes | Investigating some problems following the Castor 2.1.14 update of the CMS stager. |
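
The Duration column in the table above can be cross-checked against the Start and End timestamps. The following is a minimal sketch, assuming the timestamps are in the `DD/MM/YYYY HH:MM` form shown in the table; the function name `downtime_duration` and the short labels in `entries` are illustrative only and are not part of the GOC DB.

```python
from datetime import datetime

# Timestamp layout as it appears in the table (assumed: day/month/year, 24h clock).
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return the elapsed time between two table timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Start/End pairs copied from the three rows of the table; labels are shorthand.
entries = {
    "Castor GEN": ("24/06/2014 09:30", "24/06/2014 17:00"),
    "cream-ce02": ("21/06/2014 11:00", "24/06/2014 12:00"),
    "srm-cms": ("17/06/2014 16:30", "18/06/2014 12:00"),
}

for name, (start, end) in entries.items():
    print(name, downtime_duration(start, end))
```

The computed values agree with the Duration column: 7 hours 30 minutes, 3 days 1 hour, and 19 hours 30 minutes respectively.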
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
106472 | Green | Less Urgent | In Progress | 2014-06-25 | 2014-06-25 | Atlas | Missing ATLAS file at RAL's tape instance |
106324 | Green | Urgent | In Progress | 2014-06-18 | 2014-06-23 | CMS | pilots losing network connections at T1_UK_RAL |
105571 | Red | Less Urgent | In Progress | 2014-05-21 | 2014-06-02 | LHCb | BDII and SRM publish inconsistent storage capacity numbers |
105405 | Red | Urgent | In Progress | 2014-05-14 | 2014-06-10 | | please check your Vidyo router firewall configuration |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-06-17 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
18/06/14 | 100 | 100 | 100 | 99.2 | 100 | 100 | 90 | Some problems with CMS Castor following the stager upgrade led to single SRM test failure. |
19/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 91 | |
20/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 96 | |
21/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 | |
22/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
23/06/14 | 100 | 66 | 95.5 | 95.9 | 100 | 98 | 96 | Alice had issues with Castor upgrades. Atlas had "file not copied" errors and CMS had an error stating zero number of replicas. |
24/06/14 | 100 | 100 | 96.6 | 100 | 100 | 100 | 99 | Atlas error was: 1 File was NOT copied from SRM |