From GridPP Wiki
Gareth Smith
Latest revision as of 13:13, 5 February 2014
RAL Tier1 Operations Report for 5th February 2014
Review of Issues during the week 29th January to 5th February 2014.
- During the second part of last week there were problems with the CMS Castor instance. Many timeouts were seen within Castor and batch job efficiencies were very poor. Changes were made that improved the behaviour, including reducing the number of concurrent xroot transfers on each disk server and CMS re-enabling 'lazy download'.
- There was a successful test of a new interface system for the tape libraries on Tuesday morning (4th Feb).
Resolved Disk Server Issues
- None
Current operational status and issues
- We are investigating intermittent failures of Castor access via the SRM (as seen in the availability tests) for multiple Castor instances. The SRM front-end daemons were restarted (for the Atlas, CMS & LHCb instances) late this morning and we will continue to track this problem.
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- The t2k.org VO has been enabled on the ARC CEs.
- CVMFS client version 2.1.17 is being tested on one batch of worker nodes (approx 10% of the batch farm).
- The same batch of worker nodes has also been configured to access the new CernVM-FS Stratum-1 service at RAL (cvmfs-wlcg.gridpp.rl.ac.uk).
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site | SCHEDULED | WARNING | 12/02/2014 10:00 | 12/02/2014 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test. |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is ongoing. A deployment date awaits successful completion of this testing.
- Networking:
  - Implementation of new site firewall. Date for Tier1 proposed to be 11th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new routing layer for Tier1 and change the way the Tier1 connects to the RAL network. (Required before firewall changes on 11th March.)
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 29th January and 5th February 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor endpoints (srm-alice, srm-atlas, srm-biomed, srm-cert, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-lhcb, srm-mice, srm-minos, srm-na62, srm-preprod, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | WARNING | 04/02/2014 08:00 | 04/02/2014 10:00 | 2 hours | Testing of new interface to the tape library. During this time Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape backed service classes will carry on, with files flushed from the disk caches to tape once the testing is completed. |
lcglb03, lcglb04. | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | old EMI-2 hosts to be retired |
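The Duration column can be cross-checked from the Start and End columns. The following is a minimal sketch, assuming the day/month/year timestamp layout used in the tables of this report; the function name `downtime_duration` is illustrative only and is not part of any GOC DB tooling.

```python
from datetime import datetime

# Timestamp layout assumed from the tables in this report (day/month/year).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the interval between two timestamps as days and whole hours.

    Hypothetical helper for checking the Duration column; it assumes that
    end is not earlier than start and does not handle singular forms.
    """
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    hours = delta.seconds // 3600
    parts = []
    if delta.days:
        parts.append(f"{delta.days} days")
    if hours or not parts:
        parts.append(f"{hours} hours")
    return ", ".join(parts)


# The two entries listed above.
print(downtime_duration("04/02/2014 08:00", "04/02/2014 10:00"))  # 2 hours
print(downtime_duration("18/12/2013 11:00", "31/01/2014 00:00"))  # 43 days, 13 hours
```

Both results agree with the durations recorded in the table.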
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
100887 | Green | Less Urgent | In Progress | 2014-01-31 | 2014-01-31 | | Please update gridsite on WebDAV LFC |
100343 | Red | Less Urgent | In Progress | 2014-01-16 | 2014-02-03 | | RAL WMS still generating 512 proxies |
100114 | Red | Less Urgent | On Hold | 2014-01-08 | 2014-01-30 | | Jobs failing to get from RAL WMS to Imperial |
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-30 | | NGI Argus requests for NGI_UK |
98249 | Red | Urgent | On Hold | 2013-10-21 | 2014-01-29 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-01-06 | | Myproxy server certificate does not contain hostname |
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
29/01/14 | 100 | 100 | 98.2 | 54.1 | 100 | CMS: Main availability loss in morning: Condor scheduling (as yesterday); Plus a single SRM test failure. Atlas: Two separate SRM Put test failures. |
30/01/14 | 100 | 100 | 99.7 | 95.9 | 96.0 | One SRM test failure in each case: (Atlas, CMS & LHCb) |
31/01/14 | 100 | 100 | 98.5 | 100 | 100 | Single SRM test failure |
01/02/14 | 100 | 100 | 100 | 100 | 100 | |
02/02/14 | 100 | 100 | 100 | 100 | 100 | |
03/02/14 | 100 | 100 | 99.5 | 98.8 | 95.7 | One SRM test failure in each case: (Atlas, CMS & LHCb) |
04/02/14 | 100 | 100 | 97.4 | 96.0 | 91.9 | A number of SRM test failures across the VOs. |
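As a rough summary of the table above, the daily figures can be averaged per VO. This is only a sketch: a simple unweighted mean of the seven daily percentages is an assumption made here for illustration and is not necessarily how official availability figures are aggregated. The variable and function names are hypothetical.

```python
from statistics import mean

# Daily availability percentages copied from the table above,
# in date order from 29/01/14 to 04/02/14.
daily_availability = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [98.2, 99.7, 98.5, 100, 100, 99.5, 97.4],
    "CMS":   [54.1, 95.9, 100, 100, 100, 98.8, 96.0],
    "LHCb":  [100, 96.0, 100, 100, 100, 95.7, 91.9],
}


def weekly_summary(table):
    """Return {vo: (unweighted mean, worst day)} for each VO.

    Assumption: every day carries equal weight in the mean.
    """
    return {vo: (round(mean(values), 2), min(values)) for vo, values in table.items()}


for vo, (average, worst) in weekly_summary(daily_availability).items():
    print(f"{vo}: mean {average}%, worst day {worst}%")
```

Under that assumption, CMS has the lowest weekly mean, driven almost entirely by the 54.1% recorded on 29/01/14.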