Tier1 Operations Report 2013-11-13
From GridPP Wiki
Gareth Smith
Latest revision as of 12:07, 13 November 2013
RAL Tier1 Operations Report for 13th November 2013
Review of Issues during the week 6th to 13th November 2013.
- Services were watched closely following the work on the UPS on Tuesday/Wednesday last week. A UPS/generator load test was carried out successfully this morning.
- One batch of worker nodes has continued to give problems and has not been in production.
- One file has been reported lost to ILC. The file was found to be corrupt when investigating why it would not migrate to tape.
Resolved Disk Server Issues
- None
Current operational status and issues
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
Ongoing Disk Server Issues
- GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
Notable Changes made this last week.
- We are now running with just the one (Condor) batch farm. Nodes that were in the Torque/Maui farm when it was stopped last week have been re-configured and added to the Condor farm. The CEs that fronted the old Torque/Maui farm (lcgce01, 02, 04, 10, 11) have been set as not in production in the GOC DB.
- A UPS/generator load test was successfully carried out this morning (Wed 13th Nov). This test was scheduled following the work on the UPS last week.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 6th and 13th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site. | SCHEDULED | WARNING | 13/11/2013 10:00 | 13/11/2013 12:00 | 2 hours | RAL site in warning state due to power generator test. |
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. |
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
98838 | Green | Urgent | In Progress | 2013-11-13 | 2013-11-13 | T2K | no jobs delegated to cream-ce0* |
98833 | Green | Less Urgent | In Progress | 2013-11-12 | 2013-11-13 | SNO+ | Adoption of backup GridPP VOMS servers: lcglb03.gridpp.rl.ac.uk |
98764 | Green | Less Urgent | Waiting Reply | 2013-11-08 | 2013-11-11 | SNO+ | Storage request |
98625 | Red | Urgent | In Progress | 2013-11-04 | 2013-11-12 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2 |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO |
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org |
97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-11-07 | OPS | SHA-2 test failing on lcgce01 |
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-05-11 | | Myproxy server certificate does not contain hostname |
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-11-13 | | LFC webdav support |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host |
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
06/11/13 | 46.2 | 46.2 | 0 | 100 | 46.2 | Batch not restarted until the middle of the day owing to the UPS intervention. |
07/11/13 | 100 | 100 | 62.3 | 100 | 100 | Atlas remained "not available" until the 'old' CEs for the Torque/Maui batch farm were marked out of production in the GOC DB. |
08/11/13 | 100 | 100 | 100 | 100 | 100 | |
09/11/13 | 100 | 100 | 100 | 100 | 100 | |
10/11/13 | 100 | 100 | 100 | 100 | 100 | |
11/11/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "could not open connection to srm-atlas.gridpp.rl.ac.uk" |
12/11/13 | 100 | 100 | 100 | 100 | 100 | |