GarethSmithTestPage2
From GridPP Wiki
Revision as of 09:31, 16 September 2014 by Gareth Smith
RAL Tier1 Operations Report for 9th April 2014
Review of Issues during the week 2nd to 9th April 2014.
- During the afternoon of Thursday 3rd April all three WMS systems reported problems. The problems cleared without our intervention and are believed to have been caused by something in the jobs being submitted.
- Maintenance on the primary OPN link overnight Saturday to Sunday, 5/6 April, took the link down for a few hours. The failover to the backup link did not work properly. The effect can be seen in the failure of the SUM tests from CERN during this time.
- It was reported last week that around 50 files in tape-backed service classes (mainly in GEN) had been found not to have been migrated to tape. This is now fixed.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is largely complete. (A non-Tier1 production Castor instance has been successfully upgraded.) We are starting to look at possible dates for rolling this out (probably around May).
- Networking:
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Scheduled for 29th April)
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
103197 | Green | Less Urgent | Waiting Reply | 2014-04-09 | 2014-04-09 | | RAL myproxy server and GridPP wiki |
102611 | Yellow | Urgent | In Progress | 2014-03-24 | 2014-03-24 | | NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2 |
101968 | Red | Less Urgent | On Hold | 2014-03-11 | 2014-04-01 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors |
101079 | Red | Less Urgent | In Progress | 2014-02-09 | 2014-04-01 | | ARC CEs have VOViews with a default SE of "0" |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-03-13 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
02/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | |
03/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
04/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
05/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
06/04/14 | 100 | 100 | 93.6 | 95.5 | 93.6 | 100 | 100 | Primary OPN link to CERN down. Failover to backup link didn't work properly. |
07/04/14 | 100 | 100 | 86.3 | 86.2 | 81.5 | 100 | 100 | Primary OPN link to CERN down. Failover to backup link didn't work properly. |
08/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |