Latest revision as of 12:17, 30 January 2013
RAL Tier1 Operations Report for 30th January 2013
Review of Issues during the week 23rd to 30th January 2013.
- The work on the main site power supply has been completed. This started last June, and one half of the switchgear was brought into service on 17th September. Work on the second half has now been completed and it was brought into use on Monday (28th Jan). This restores resilience in this part of the site power supply.
Resolved Disk Server Issues
- GDSS594 (GenTape - D0T1) was taken out of production on Tuesday (22nd Jan) with multiple disk failures. It was returned to service on Thursday (24th).
- GDSS433 (AtlasDataDisk - D1T0) failed with a read-only filesystem on Friday (25th Jan). It was returned to service on Sunday (27th).
Current operational status and issues
- The batch server process sometimes consumes excessive memory; this is normally triggered by a network/communication problem with the worker nodes. A test for this condition (with an automatic re-starter) is in place.
- High load has been observed on the uplink to one of the network stacks (stack 13), which serves the SL09 disk servers (~3 PB of storage).
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network.
- Testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service.)
- Problems with the Top-BDII (the daemon restarts) are seen and are known to cause failures. A rolling upgrade of the Top-BDII is underway.
- A system has been set up for participation in the xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
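The batch server re-starter mentioned above could, in outline, work along these lines. This is only a minimal sketch under assumed names and thresholds (the process name, RSS limit, and restart command are illustrative, not the production configuration):

```python
# Sketch of a memory-watchdog re-starter for a daemon: read the resident
# set size (VmRSS) from /proc and decide whether a restart is needed.
# The 8 GB limit and the helper names are assumptions for illustration.

def should_restart(rss_kb: int, limit_kb: int = 8 * 1024 * 1024) -> bool:
    """Return True when the daemon's resident memory exceeds the limit."""
    return rss_kb > limit_kb

def read_rss_kb(pid: int) -> int:
    """Read VmRSS (in kB) for a process from /proc/<pid>/status (Linux)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return 0
```

A cron job or loop would call `read_rss_kb` on the batch server's PID and, when `should_restart` returns True, invoke the service restart command.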
Ongoing Disk Server Issues
Notable Changes made this last week
- On Thursday (24th Jan) the batch farm was configured to have access to the CVMFS areas for na62 and mice.
- On Tuesday (29th Jan) the Argus server was upgraded to EMI-2/SL6.
- On Tuesday (29th Jan) the RAL status page (http://www.gridpp.rl.ac.uk/status/) was modified to show tape usage information.
- On Wednesday (30th Jan) an upgraded version of the Maui batch scheduler was installed.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Replace central switch (C300). This will:
- Improve the stack 13 uplink.
- Change the network trunking as part of investigation (& possible fix) into asymmetric data rates.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services:
- Checking VO usage of, and requirement for, AFS clients from Worker Nodes.
- Upgrade Top-BDIIs to latest (EMI-2) version on SL6.
- Infrastructure:
- Intervention required on the "Essential Power Board" and remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
Entries in GOC DB starting between 23rd and 30th January 2013.
There were no unscheduled entries in the GOC DB for this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | WARNING | 29/01/2013 10:00 | 29/01/2013 12:00 | 2 hours | Update of Argus Server to SL6/EMI-2
Whole Site | SCHEDULED | WARNING | 28/01/2013 08:00 | 28/01/2013 20:00 | 12 hours | Following completion of work on the power supply to RAL, new equipment will be switched in. This will be in parallel with the existing equipment and re-enables redundancy in the transformer/switchgear.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
90995 | Green | Less Urgent | In Progress | 2013-01-29 | 2013-01-30 | CMS | Stageout errors for single workflow at RAL
90986 | Green | Urgent | In Progress | 2013-01-29 | 2013-01-29 | NA62 | FTS channel BELGRID-UCL to RAL-LCG2 for na62
90844 | Green | Less Urgent | In Progress | 2013-01-26 | 2013-01-28 | | LFC for cernatschool.org
90528 | Red | Less Urgent | In Progress | 2013-01-17 | 2013-01-17 | SNO+ | WMS not assigning jobs to Sheffield
90151 | Red | Less Urgent | Waiting Reply | 2013-01-08 | 2013-01-24 | NEISS | Support for NEISS VO on WMS
89733 | Red | Urgent | In Progress | 2012-12-17 | 2013-01-21 | | RAL bdii giving out incorrect information
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-01-16 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
23/01/13 | 100 | 100 | 100 | 100 | 100 |
24/01/13 | 87.5 | 100 | 100 | 100 | 100 | Failed the Site-BDII test as the SL6 test CE had the wrong string in GlueHostOperatingSystemName.
25/01/13 | 100 | 100 | 100 | 100 | 100 |
26/01/13 | 100 | 100 | 68.6 | 100 | 100 | The CE test jobs did not run within the time allowed: the maximum number of AtlasSGM jobs was hit, and these were queued behind SL6 Atlas S/W validation jobs.
27/01/13 | 100 | 100 | 96.2 | 100 | 100 | Four SRM test failures - unable to delete file from SRM.
28/01/13 | 100 | 100 | 100 | 100 | 100 |
29/01/13 | 100 | 100 | 77.4 | 100 | 100 | Repeat of the problem of 26/01/13: the fix to the batch scheduler did not work and the jobs queued behind more SL6 S/W validation jobs.