RAL Tier1 Operations Report for 6th February 2013
Review of Issues during the week 30th January to 6th February 2013.
- There was a problem with the Atlas Castor instance during the night and morning of Thursday 31st January. This was traced to a single unresponsive disk server; rebooting the server fixed the problem.
Resolved Disk Server Issues
- GDSS644 (AtlasScratchDisk D1T0) was found to be responding very slowly on Thursday (31st Jan), causing problems for the Atlas Castor instance; it was rebooted.
Current operational status and issues
- There has been an intermittent problem with the start rate for batch jobs over the last couple of days (5/6 Feb); this is being investigated.
- There is a GGUS ticket open for a problem seen by the FTS that is caused by an issue within the Castor SRM.
- The batch server process sometimes consumes excessive memory, normally triggered by a network/communication problem with the worker nodes. A check for this condition (with a re-starter) is in place; a sketch of such a watchdog is given after this list.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (around 3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates, a problem which disappears when we are not loading the network.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service.)
- A system has been set up for participation in the xrootd federated access tests for Atlas.
- A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place and is currently being tested by Atlas.
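The re-starter mentioned above is an internal tool whose details are not given in this report. Purely as an illustrative sketch, a memory watchdog of this kind could look like the Python loop below; the process name, memory limit and restart command are all assumptions, not the actual Tier1 configuration.

    import subprocess
    import time

    # All values here are assumptions for illustration only; the report
    # does not give the real process name, limit or restart command.
    PROCESS_NAME = "pbs_server"            # assumed batch server process
    RSS_LIMIT_KB = 8 * 1024 * 1024         # assumed limit: 8 GB resident
    RESTART_CMD = ["service", "pbs_server", "restart"]
    CHECK_INTERVAL_S = 60                  # seconds between checks

    def total_rss_kb(name):
        """Sum the resident memory (kB) of all processes with this name."""
        result = subprocess.run(["ps", "-C", name, "-o", "rss="],
                                capture_output=True, text=True)
        return sum(int(line) for line in result.stdout.split())

    while True:
        if total_rss_kb(PROCESS_NAME) > RSS_LIMIT_KB:
            # Memory has grown past the limit (e.g. after a communication
            # problem with the worker nodes): restart the service.
            subprocess.run(RESTART_CMD)
        time.sleep(CHECK_INTERVAL_S)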
Ongoing Disk Server Issues
None
Notable Changes made this last week
- On Monday (4th February) the upgrade of the Top-BDII to newer systems running SL6/EMI-2 was completed. There are now three systems behind the top-bdii alias; a way of checking the alias is sketched after this list.
- H1 has been added to the CVMFS system for smaller VOs.
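As an aside, the membership of a round-robin alias like this can be checked with a few lines of Python; the alias name below is a placeholder, not necessarily the real production alias.

    import socket

    # Placeholder alias name; the real production alias is not stated here.
    ALIAS = "top-bdii.example.ac.uk"

    # gethostbyname_ex returns (canonical name, alias list, address list);
    # for a round-robin alias the address list should contain one entry
    # per system behind the alias (three, after this change).
    canonical, aliases, addresses = socket.gethostbyname_ex(ALIAS)
    print(canonical, addresses)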
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to the new database infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Replace the central switch (C300). (Tentative date 5th March, although Atlas would like it earlier.) This will:
    - Improve the stack 13 uplink.
    - Change the network trunking as part of the investigation into (and possible fix for) the asymmetric data rates.
  - Update the core Tier1 network and change the connection to the site and the OPN, including:
    - Installing a new routing layer for the Tier1.
    - Changing the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Add caching DNS servers to the Tier1 network.
- Grid Services:
  - Remove the AFS clients from the Worker Nodes.
- Infrastructure:
  - Intervention required on the "Essential Power Board", plus remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 30th January and 6th February 2013.
None
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91152 | Green | Less Urgent | In Progress | 2013-02-04 | 2013-02-04 | CMS | RAL tape migration
91146 | Green | Urgent | In Progress | 2013-02-04 | 2013-02-05 | Atlas | RAL input bandwidth issues
91060 | Yellow | Less Urgent | On Hold | 2013-01-31 | 2013-02-01 | CMS | glexec issues on a subset of worker nodes
91029 | Red | Very Urgent | In Progress | 2013-01-30 | 2013-02-06 | Atlas | FTS problem in querying jobs
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-02-04 | SNO+ | WMS not assigning jobs to Sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-02-04 | NEISS | Support for NEISS VO on WMS
89733 | Red | Urgent | In Progress | 2012-12-17 | 2013-02-04 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host
Availability Report (daily availability by VO, %)

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
30/01/13 | 100 | 100 | 94.9 | 100 | 100 | Multiple "unable to delete file from SRM" failures, plus one 'user timeout' failure.
31/01/13 | 100 | 100 | 90.1 | 100 | 100 | Atlas Castor instance showing lots of timeouts. Traced to a single disk server that was very unresponsive; a reboot of the disk server fixed it.
01/02/13 | 100 | 92.3 | 100 | 100 | 100 | "330 min timeout for the job exceeded. Cancelling the job."
02/02/13 | 100 | 100 | 98.5 | 100 | 100 | Single SRM test failure: unable to delete file from SRM.
03/02/13 | 100 | 100 | 98.2 | 100 | 100 | One user timeout, one failure to delete a file.
04/02/13 | 100 | 100 | 100 | 100 | 100 |
05/02/13 | 100 | 100 | 100 | 100 | 100 |