From GridPP Wiki
RAL Tier1 Operations Report for 13th November 2013
Review of Issues during the week 6th to 13th November 2013.
- Services were watched closely following the work on the UPS Tuesday/Wednesday last week. A UPS/Generator load test was carried out successfully this morning.
- One batch of worker nodes has continued to give problems and has not been in production.
- One file has been reported lost to ILC. The file was found to be corrupt when investigating why it would not migrate to tape.
Resolved Disk Server Issues
Current operational status and issues
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
Ongoing Disk Server Issues
- GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
Notable Changes made this last week.
- We are now running with just the one (Condor) batch farm. Nodes that were in the Torque/Maui farm when it was stopped last week have been re-configured and added to the Condor farm. The CEs that fronted the old Torque/Maui farm (lcgce01, 02, 04, 10, 11) have been set as not in production in the GOC DB.
- A UPS/generator load test was successfully carried out this morning (Wed 13th Nov). This test was scheduled following the work on the UPS last week.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
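The Duration value in the entry above can be cross-checked from its Start and End timestamps. The following is a minimal sketch, not part of the GOC DB tooling: the function name `downtime_duration` and the constant `GOCDB_FORMAT` are hypothetical, and the day-first timestamp format is assumed from the entry itself.

```python
from datetime import datetime

# Assumed day-first timestamp layout, taken from the Start/End columns above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as readable text."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    # timedelta stores whole days separately from the leftover seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours and {minutes} minutes"


if __name__ == "__main__":
    print(downtime_duration("05/11/2013 07:00", "30/11/2013 23:59"))
```

For the Torque/Maui CE outage this reproduces the listed duration of 25 days, 16 hours and 59 minutes. The sketch assumes the end time is not earlier than the start time.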
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 6th and 13th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site. | SCHEDULED | WARNING | 13/11/2013 10:00 | 13/11/2013 12:00 | 2 hours | RAL site in warning state due to power generator test.
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98838 | Green | Urgent | In Progress | 2013-11-13 | 2013-11-13 | T2K | no jobs delegated to cream-ce0*
98833 | Green | Less Urgent | In Progress | 2013-11-12 | 2013-11-13 | SNO+ | Adoption of backup GridPP VOMS servers: lcglb03.gridpp.rl.ac.uk
98764 | Green | Less Urgent | Waiting Reply | 2013-11-08 | 2013-11-11 | SNO+ | Storage request
98625 | Red | Urgent | In Progress | 2013-11-04 | 2013-11-12 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2
98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org
97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-11-07 | OPS | SHA-2 test failing on lcgce01
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-05-11 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-11-13 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/11/13 | 46.2 | 46.2 | 0 | 100 | 46.2 | Batch not restarted until the middle of the day owing to the UPS intervention.
07/11/13 | 100 | 100 | 62.3 | 100 | 100 | Atlas remained "not available" until the 'old' CEs for the Torque/Maui batch farm were marked out of production in the GOC DB.
08/11/13 | 100 | 100 | 100 | 100 | 100 |
09/11/13 | 100 | 100 | 100 | 100 | 100 |
10/11/13 | 100 | 100 | 100 | 100 | 100 |
11/11/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "could not open connection to srm-atlas.gridpp.rl.ac.uk"
12/11/13 | 100 | 100 | 100 | 100 | 100 |
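As a rough reading aid, a daily availability percentage can be converted into an approximate number of available hours. This is a sketch under the assumption that each figure is the fraction of a 24-hour day in which the tests passed; the report does not state how the figures are computed, and the helper name `available_hours` is hypothetical.

```python
def available_hours(percent: float, window_hours: float = 24.0) -> float:
    """Convert an availability percentage into hours within the window.

    Assumes the percentage is a plain fraction of the window, which is
    an interpretation and not something the report confirms.
    """
    return round(percent / 100.0 * window_hours, 1)


if __name__ == "__main__":
    # Figures taken from the table above.
    for day, percent in [("06/11/13", 46.2), ("07/11/13", 62.3), ("11/11/13", 99.1)]:
        print(day, available_hours(percent))
```

Under that assumption, the 46.2% recorded on 06/11/13 corresponds to roughly 11 hours of availability, which is in line with the comment that batch was not restarted until the middle of the day.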