Tier1 Operations Report 2014-01-08
RAL Tier1 Operations Report for 8th January 2014
Review of Issues during the weeks 18th December 2013 to 8th January 2014. |
- With the Christmas & New Year holiday occurring since the last report this has been a quiet three weeks. Operations were smooth over the holiday period.
- There were a number of restarts of the xrootd daemon on disk servers in AliceDisk over the days 21-23 December. The restarts then stopped; the root cause remains unknown. There was also a problem with the Stager on the Castor GEN instance on Tuesday 24th December, which was traced to a logging problem and fixed.
- The Atlas file renaming in Castor is ongoing, with around 14 million files renamed so far. The number of files lost has been found to be significantly lower (by around 90%) than previously thought, owing to attempts to rename some files more than once.
Resolved Disk Server Issues |
- None.
Current operational status and issues |
- None
Ongoing Disk Server Issues |
- None
Notable Changes made this last week. |
- The batch farm roll-out of a Condor update and a reduction in the memory over-commit (as well as kernel/errata updates) was completed before Christmas.
- The L&B servers lcglb03 and lcglb04 have been replaced by new SL6 EMI-3 update L&B nodes, lcglb01.gridpp.rl.ac.uk and lcglb02.gridpp.rl.ac.uk.
- EMI-3 update 8 has been applied to the LFC nodes.
- There have been some tweaks to the Condor configuration for multicore jobs: multicore accounting groups were added for Atlas to enable fairshares for this type of job, and the algorithm used to free up job slots for multicore jobs was updated.
- One of the tranches of CPU orders is currently being delivered.
Declared in the GOC DB |
- There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
Advance warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
- Networking:
  - Implementation of new site firewall.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 18th December 2013 and 8th January 2014. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | Old EMI-2 hosts to be retired |
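
The Duration column can be cross-checked from the Start and End timestamps. The following is a minimal sketch, assuming the timestamps use the `DD/MM/YYYY HH:MM` layout shown in the table; the function name `downtime_duration` and the constant `GOCDB_FORMAT` are illustrative and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the Start/End columns of the table.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as 'D days, H hours'."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    # timedelta keeps whole days in .days and the remainder in .seconds.
    hours = delta.seconds // 3600
    return f"{delta.days} days, {hours} hours"


if __name__ == "__main__":
    # Entry for the retired L&B hosts.
    print(downtime_duration("18/12/2013 11:00", "31/01/2014 00:00"))
```

For the entry above this reproduces the listed duration of 43 days, 13 hours.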
Open GGUS Tickets (Snapshot at time of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
100086 | Green | Less Urgent | Waiting Reply | 2014-01-07 | 2014-01-08 | T2K | WMS jobs cleared too rapidly |
99768 | Red | Less Urgent | Waiting Reply | 2013-12-13 | 2014-01-07 | Atlas | RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist" |
99647 | Red | Less Urgent | Waiting Reply | 2013-12-12 | 2013-12-17 | SNO+ | lcg-cp connection timeouts |
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-07 | | NGI Argus requests for NGI_UK |
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-01-06 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2014-01-06 | cernatschool | CVMFS access for the cernatschool.org VO |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-01-06 | | Myproxy server certificate does not contain hostname |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host |
Availability Report |
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
18/12/13 | 100 | 100 | 100 | 100 | 100 | |
19/12/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM GET failure: "could not open connection to srm-atlas.gridpp.rl.ac.uk" |
20/12/13 | 100 | 100 | 100 | 100 | 100 | |
21/12/13 | 100 | 100 | 100 | 100 | 100 | |
22/12/13 | 100 | 100 | 100 | 100 | 100 | |
23/12/13 | 100 | 100 | 100 | 100 | 100 | |
24/12/13 | 100 | 100 | 100 | 100 | 100 | |
25/12/13 | 100 | 100 | 100 | 100 | 100 | |
26/12/13 | 100 | 100 | 100 | 100 | 100 | |
27/12/13 | 100 | 100 | 100 | 100 | 100 | |
28/12/13 | 100 | 100 | 100 | 100 | 100 | |
29/12/13 | 100 | 100 | 100 | 100 | 100 | |
30/12/13 | 100 | 100 | 100 | 100 | 100 | |
31/12/13 | 100 | 100 | 100 | 100 | 100 | |
01/01/14 | 100 | 100 | 100 | 100 | 100 | |
02/01/14 | 100 | 100 | 100 | 100 | 100 | |
03/01/14 | 100 | 100 | 100 | 100 | 100 | |
04/01/14 | 100 | 100 | 100 | 100 | 100 | |
05/01/14 | 100 | 100 | 100 | 100 | 100 | |
06/01/14 | 100 | 100 | 100 | 100 | 100 | |
07/01/14 | 100 | 100 | 100 | 100 | 100 | |
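
As a rough summary of the table, the daily figures can be averaged over the reporting period. This is a sketch only: it assumes a simple unweighted mean over the 21 daily values listed (18/12/13 to 07/01/14), which may differ from how availability is aggregated officially, and the helper name `mean_availability` is illustrative.

```python
def mean_availability(daily: list[float]) -> float:
    """Unweighted mean of daily availability percentages."""
    return sum(daily) / len(daily)


# Atlas column from the table: 21 days, all 100 except 99.1 on 19/12/13.
atlas = [100.0] * 21
atlas[1] = 99.1  # second row of the table (19/12/13)

if __name__ == "__main__":
    print(round(mean_availability(atlas), 2))
```

The other columns (OPS, Alice, CMS, LHCb) are 100 on every listed day, so their mean over the period is 100.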