Difference between revisions of "Tier1 Operations Report 2015-10-28"
From GridPP Wiki
Revision as of 16:22, 27 October 2015
DRAFT - DO NOT USE - RAL Tier1 Operations Report for 28th October 2015
Review of Issues during the week 14th to 21st October 2015.
- item
- item
Resolved Disk Server Issues
- item
Current operational status and issues
- There is an LHCb problem with a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise, we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
- Long-standing CMS issues. The two items that remain are CMS Xroot (AAA) redirection and file open times. Work on the Xroot redirection is ongoing, and a new server has been added in recent weeks. File open times using Xroot remain slow, but this is a less significant problem.
Ongoing Disk Server Issues
- GDSS663
- GDSS665
- GDSS644
Notable Changes made since the last meeting.
- item
- item
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
- Some detailed internal network reconfigurations to enable the removal of the old 'core' switch from our network. This includes changing the way the UKLight router connects into the Tier1 network.
Listing by category:
- Databases:
- Switch LFC/3D to new Database Infrastructure.
- Castor:
- Update SRMs to new version (includes updating to SL6).
- Update disk servers to SL6 (ongoing).
- Update to Castor version 2.1.15.
- Networking:
- Complete changes needed to remove the old core switch from the Tier1 network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric:
- Firmware updates on remaining EMC disk arrays (Castor, LFC).
Entries in GOC DB starting since the last report.
- None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
117171 | | Very Urgent | Waiting for Reply | 2015-10-24 | 2015-10-27 | LHCb | Aborted pilots on arc-ce02.gridpp.rl.ac.uk |
116866 | Green | Less Urgent | On Hold | 2015-10-12 | 2015-10-19 | SNO+ | snoplus support at RAL-LCG2 (pilot role) |
116864 | Green | Urgent | In Progress | 2015-10-12 | 2015-10-26 | CMS | T1_UK_RAL AAA opening and reading test failing again... |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
21/10/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
22/10/15 | 100 | 100 | 98 | 100 | 100 | 100 | 100 | SRM test failure on PUT. (__main__.TimeoutException) |
23/10/15 | 100 | 100 | 92 | 100 | 100 | 100 | 100 | Four SRM test failures: three on ‘GET’, one on ‘PUT’. All: “__main__.TimeoutException”. |
24/10/15 | 100 | 100 | 100 | 100 | 100 | boo | 100 | |
25/10/15 | 100 | 100 | 100 | 100 | 100 | 97 | 98 | |
26/10/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
27/10/15 | 100 | 100 | 100 | 100 | 100 | boo | b00 |