Difference between revisions of "GarethSmithTestPage"
From GridPP Wiki
Revision as of 09:29, 16 September 2014
RAL Tier1 Operations Report for 2nd April 2014
Review of Issues during the fortnight 19th March to 2nd April 2014.
- There was a short (around 5 minute) break in external connectivity to the Tier1 during the morning of Thursday 20th March and again a similar event the following morning.
- There was a failover of an Atlas Castor database in the early evening of Tuesday 25th March. The failover triggered a call-out, and the database was put back onto its allocated node. The cause is a bug that has been reported to Oracle.
- On Friday 28th March, some of the CE SUM tests were not being run in a timely manner. It was found that, owing to a separate change in the Condor configuration, the test jobs were no longer being prioritised. This has been fixed.
Ongoing Disk Server Issues
- GDSS239 (Atlas HotDisk) crashed this morning. This is being investigated.
Notable Changes made this last fortnight.
- The rollout of WNs updated to the EMI-3 version of the WN software continues and is expected to be completed this week.
- The EMI-3 Argus server is being rolled out for use across all CEs and WNs.
- The old MyProxy server (lcgrbp01.gridpp.rl.ac.uk) was turned off today. Its replacement (myproxy.gridpp.rl.ac.uk) is in production.
- The 2013 purchases of worker nodes are being added to the farm this week.
- Two of the CV2013 disk servers (120TB each) have been added to LHCbDst. A further 9 are being added today. Three further servers are in CMS non-prod and are expected to be moved into production imminently.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
102902 | Green | Urgent | In Progress | 2014-04-01 | 2014-04-02 | MICE & NA62 | Stale .cvmfswhitelist file MICE VO |
102611 | Green | Urgent | In Progress | 2014-03-24 | 2014-03-24 | | NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2 |
101968 | Yellow | Less Urgent | On Hold | 2014-03-11 | 2014-0-01 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors |
101079 | Red | Less Urgent | In Progress | 2014-02-09 | 2014-04-01 | | ARC CEs have VOViews with a default SE of "0" |
99556 | Red | Very Urgent | On Hold | 2013-12-06 | 2014-03-21 | | NGI Argus requests for NGI_UK |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-03-13 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
19/03/14 | 100 | 100 | 100 | 88.6 | 100 | 99 | 73 | Multiple SRM test failures (load problems). |
20/03/14 | 100 | 100 | 99.7 | 99.6 | 100 | 100 | n/a | Atlas: one SRM test failure; CMS: CE test failures on all three ARC CEs (no compatible resources). |
21/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a | |
22/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a | |
23/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a | |
24/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a | |
25/03/14 | 100 | 100 | 99.0 | 89.8 | 100 | 98 | 99 | Atlas: Castor database problem (Atlas_srm DB moved to another RAC node following a DB crash); CMS: SRM SUM test failures spread through the day. |
26/03/14 | 100 | 100 | 100 | 87.1 | 100 | 100 | 99 | Four separate SRM test failures. |
27/03/14 | 100 | 100 | 100 | 96.5 | 100 | 97 | 100 | Two failures of the SRM Put test. |
28/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
29/03/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | |
30/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
31/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
01/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |