Difference between revisions of "Tier1 Operations Report 2016-07-20"
From GridPP Wiki
Revision as of 13:09, 15 July 2016
RAL Tier1 Operations Report for 20th July 2016
Review of Issues during the week 13th to 20th July 2016.
- There was a problem over the weekend with some CMS writes to tape-backed areas failing because no tape pool had been defined for these cases. This was resolved on Monday.
- There has been saturation of the inbound 10Gbit OPN link at times over the last week. The bypass route to JANET has also shown high traffic volumes.
- We have seen a couple of periods when there have been high numbers of batch jobs started multiple times. There were some on the 7th/8th July and again last night. The cause is not yet understood.
Resolved Disk Server Issues
- None
Current operational status and issues
- LHCb sees a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand some remaining low level of packet loss seen within a part of our Tier1 network.
Ongoing Disk Server Issues
- None.
Notable Changes made since the last meeting.
- The migration of Atlas data from "C" to "D" tapes continues. We have migrated over 700 of the 1300 tapes so far.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC database to use new Database Infrastructure.
- Castor:
  - Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
  - Update to Castor version 2.1.15. This awaits successful resolution and testing of the new version.
  - Migration of data from T10KC to T10KD tapes (Affects Atlas & LHCb data).
- Networking:
  - Replace the UKLight Router. Then upgrade the 'bypass' link to the RAL border routers to 2*40Gbit.
- Fabric:
  - Firmware updates on older disk servers.
Entries in GOC DB starting since the last report.
- None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
122861 | Green | Very Urgent | Waiting Reply | 2016-07-14 | 2016-07-14 | CMS | SAM3 CE is critical T1_UK_RAL |
122827 | Green | Less Urgent | In Progress | 2016-07-12 | 2016-07-13 | SNO+ | Disk area at RAL |
122818 | Green | Less Urgent | In Progress | 2016-07-12 | 2016-07-12 | Atlas | Object Store at RAL |
122804 | Green | Less Urgent | Waiting Reply | 2016-07-12 | 2016-07-15 | SNO+ | glite-transfer failure |
122364 | Green | Less Urgent | On Hold | 2016-06-27 | 2016-06-29 | | cvmfs support at RAL-LCG2 for solidexperiment.org |
121687 | Yellow | Less Urgent | On Hold | 2016-05-20 | 2016-05-23 | | packet loss problems seen on RAL-LCG perfsonar |
120810 | Green | Urgent | In Progress | 2016-04-13 | 2016-06-24 | Biomed | Decommissioning of SE srm-biomed.gridpp.rl.ac.uk - forbid write access for biomed users |
120350 | Green | Less Urgent | In Progress | 2016-03-22 | 2016-05-06 | LSST | Enable LSST at RAL |
119841 | Red | Less Urgent | On Hold | 2016-03-01 | 2016-04-26 | LHCb | HTTP support for lcgcadm04.gridpp.rl.ac.uk |
117683 | Yellow | Less Urgent | On Hold | 2015-11-18 | 2016-04-05 | | CASTOR at RAL not publishing GLUE 2 |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 729); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
13/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
14/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
15/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
16/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
17/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
18/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
19/07/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |