Difference between revisions of "Tier1 Operations Report 2015-03-04"
From GridPP Wiki
Revision as of 12:02, 4 March 2015
RAL Tier1 Operations Report for 4th March 2015
Review of Issues during the week 25th February to 4th March 2015.
- On Thursday (26th Feb) there was a problem with the Argus server that stopped batch work starting for an hour or so.
- On Thursday (26th Feb) a single file was reported lost to CMS. This file had been picked up by the Castor checksum checker.
- A problem was uncovered by LHCb when a particular file could not be read from Castor. A manual fix restored access to the file. However, investigation showed that the problem arose when the file was created a few weeks ago, which was the second occurrence of a rare bug in Castor.
- Five files have been declared lost to CMS. These were all picked up by the checksum checker at various times over the last few weeks.
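For readers unfamiliar with checksum checking, the idea behind the lost-file reports above can be sketched as follows. This is only an illustration: the report does not say which algorithm or tooling the Castor checksum checker uses, so Adler-32 is an assumption here, and the function names `adler32_of_file` and `file_matches_catalogue` are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed each chunk into the running checksum so large files
            # never have to be held in memory at once.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def file_matches_catalogue(path, expected_hex):
    """Compare the checksum computed from disk with the recorded value (assumed hex)."""
    return adler32_of_file(path) == expected_hex.strip().lower().zfill(8)
```

A file whose computed value differs from the recorded one would be flagged for investigation and, if no good copy exists, declared lost to the owning VO.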
Resolved Disk Server Issues
- GDSS757 (CMSDisk - D1T0) failed again early on Friday morning (27th Feb). As this server has failed multiple times, it has been put back into read-only mode and is being drained.
Current operational status and issues
- We are running with a single router connecting the Tier1 network to the site network, rather than a resilient pair.
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- The 2011 disk servers were moved within the machine room to make space for new deliveries. As part of this, the connections to these servers were migrated to the Tier1 mesh network.
- On Wednesday (18th Feb) the Argus policy was updated to support the MICE reco role.
- A Castor nameserver box has been set up to enable queries against Castor metadata to be made without affecting the throughput of production work.
- A system has been set up to provide Atlas with Castor information that is not supplied by the SRM.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | SCHEDULED | WARNING | 11/03/2015 10:30 | 11/03/2015 11:00 | 30 minutes | Warning on site access during firmware updates on a pair of network switches. |
lcgfts3.gridpp.rl.ac.uk, | SCHEDULED | WARNING | 05/03/2015 10:00 | 05/03/2015 12:00 | 2 hours | Warning on FTS3 service during upgrade to version v3.2.32 |
Whole site | UNSCHEDULED | WARNING | 05/03/2015 07:45 | 05/03/2015 08:30 | 45 minutes | Warning following installation of a replacement network router. |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Investigate problems on the primary Tier1 router. Discussions with the vendor are ongoing.
Listing by category:
- Databases:
- Switch LFC/3D to new Database Infrastructure.
- Update to Oracle 11.2.0.4
- Castor:
- Update SRMs to new version (includes updating to SL6).
- Fix discrepancies that were found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Update Castor to 2.1-14-latest.
- Networking:
- Resolve problems with primary Tier1 Router
- Make routing changes to allow the removal of the UKLight Router.
- Enable the RIP protocol for updating routing tables on the Tier1 routers. (Install patch to Router software).
- Fabric:
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting since the last report.
- None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
111856 | Green | Less Urgent | Waiting Reply | 2015-02-19 | 2015-03-03 | LHCb | Stalled LHCb jobs over night |
111699 | Green | Less Urgent | In Progress | 2015-02-10 | 2015-02-27 | Atlas | gLExec hammercloud jobs keep failing at RAL-LCG2 & RALPP |
109694 | Red | Urgent | In Progress | 2014-11-03 | 2015-03-04 | SNO+ | gfal-copy failing for files at RAL |
108944 | Red | Urgent | Waiting Reply | 2014-10-01 | 2015-03-03 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
25/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
26/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
27/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | |
28/02/15 | 100 | 100 | 100 | 100 | 100 | 100 | 95 | |
01/03/15 | 100 | 100 | 100 | 100 | 100 | 0 | 98 | |
02/03/15 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | |
03/03/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
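As a quick way to summarise the week, the two HammerCloud columns above can be averaged. The sketch below copies the values from the table. The unweighted mean is an assumption made for illustration, not necessarily how availability is formally aggregated.

```python
# (day, Atlas HC, CMS HC) copied from the availability table above.
rows = [
    ("25/02/15", 100, 99),
    ("26/02/15", 100, 99),
    ("27/02/15", 100, 95),
    ("28/02/15", 100, 95),
    ("01/03/15", 0, 98),
    ("02/03/15", 100, 97),
    ("03/03/15", 98, 100),
]


def mean(values):
    """Unweighted arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


atlas_hc_mean = mean([atlas for _, atlas, _ in rows])
cms_hc_mean = mean([cms for _, _, cms in rows])

print(f"Atlas HC weekly mean: {atlas_hc_mean:.1f}")  # 85.4
print(f"CMS HC weekly mean:   {cms_hc_mean:.1f}")  # 97.6
```

The Atlas HC figure is dominated by the single 0 recorded for 01/03/15. The other six days average 99.7.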