Tier1 Operations Report 2013-02-06
From GridPP Wiki
RAL Tier1 Operations Report for 6th February 2013
Review of Issues during the week 30th January to 6th February 2013.
- There was a problem with the Atlas Castor instance during the night and morning of Thursday 31st January. This was traced to a single unresponsive disk server; rebooting the server fixed the problem.
Resolved Disk Server Issues
- GDSS644 (AtlasScratchDisk D1T0) was found to be responding very slowly on Thursday (31st Jan), causing problems for the Atlas Castor instance, and was rebooted.
Current operational status and issues
- There has been an intermittent problem over the last couple of days (5/6 Feb) with the start rate for batch jobs; this is being investigated.
- There is an open GGUS ticket for a problem seen by the FTS, which has been traced to an issue within the Castor SRM.
- The batch server process sometimes consumes excessive memory, something which is normally triggered by a network/communication problem with the worker nodes. A check for this (with an automatic re-starter) is in place.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates, a problem which disappears when the network is not under load.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- A system has been set up for participation in the xrootd federated access tests for Atlas.
- A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place and is currently being tested by Atlas.
Ongoing Disk Server Issues
- None
Notable Changes made this last week
- On Monday (4th February) the upgrade of the Top-BDII to newer systems running SL6/EMI-2 was completed. There are now three systems in the Top-BDII alias.
- H1 have been added to the CVMFS system for smaller VOs.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Replace the central switch (C300). (Tentative date 5th March, though Atlas would like it earlier.) This will:
- Improve the stack 13 uplink.
- Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Removal of AFS clients from Worker Nodes.
- Infrastructure:
- Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 30th January and 6th February 2013.
None
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
91152 | Green | Less Urgent | In Progress | 2013-02-04 | 2013-02-04 | CMS | RAL tape migration |
91146 | Green | Urgent | In Progress | 2013-02-04 | 2013-02-05 | Atlas | RAL input bandwidth issues |
91060 | Yellow | Less Urgent | On Hold | 2013-01-31 | 2013-02-01 | CMS | glexec issues on a subset of worker nodes |
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-06 | Atlas | FTS problem in querying jobs |
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-04 | SNO+ | WMS not assigning jobs to Sheffield |
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS |
89733 | Red | Urgent | In Progress | 2012-12-17 | 2013-02-04 | | RAL bdii giving out incorrect information |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host |
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
30/01/13 | 100 | 100 | 94.9 | 100 | 100 | Multiple failures "unable to delete file from SRM", plus one 'user timeout' failure. |
31/01/13 | 100 | 100 | 90.1 | 100 | 100 | Atlas Castor instance showing lots of timeouts. Traced to a single disk server that was very unresponsive; a reboot of the disk server fixed it. |
01/02/13 | 100 | 92.3 | 100 | 100 | 100 | 330 min timeout for the job exceeded. Cancelling the job. |
02/02/13 | 100 | 100 | 98.5 | 100 | 100 | Single SRM test failure - unable to delete file from SRM |
03/02/13 | 100 | 100 | 98.2 | 100 | 100 | One user timeout, one failure to delete file. |
04/02/13 | 100 | 100 | 100 | 100 | 100 | |
05/02/13 | 100 | 100 | 100 | 100 | 100 | |