Tier1 Operations Report 2014-03-12
RAL Tier1 Operations Report for 12th March 2014
Review of Issues during the week 5th to 12th March 2014. |
- The problems with the virtual machine infrastructure reported last week have been worked around. Some further migration of VMs is still required, but there should be no, or very minimal, effect on services.
Resolved Disk Server Issues |
- None
Current operational status and issues |
- The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem is significantly advanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
- The Atlas disk space in Castor has become full. We are aware of an ongoing problem where file deletions triggered by Atlas' central service are slow. Some 'manual' deletions of files are taking place to speed up the process.
- Around 50 files in tape-backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at the time of migration). CERN will provide a script to re-send the remaining ones to tape.
Ongoing Disk Server Issues |
- None
Notable Changes made this last week. |
- ILC production role added (to the cream CEs and Argus)
- One batch of WNs has now been updated to the EMI-3 version of the WN.
- Castor 2.1.14 testing of tape servers is underway.
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site | SCHEDULED | WARNING | 19/03/2014 10:00 | 19/03/2014 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test. |
Whole Site | SCHEDULED | WARNING | 17/03/2014 07:00 | 17/03/2014 17:00 | 10 hours | Site At Risk during and following change to use new firewall. |
lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/03/2014 06:00 | 17/03/2014 09:00 | 3 hours | Drain and stop of FTS services during update to new site firewall. |
Advance warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- The Tier1 will move to use the new site firewall on Monday 17th March (as announced in the GOC DB). There will be some interruption to services as seen from outside RAL. Internally, services are expected to continue uninterrupted.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
- Networking:
  - Implementation of new site firewall.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 5th and 12th March 2014. |
- None
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
101968 | Green | Less Urgent | In Progress | 2014-03-11 | 2014-03-12 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors |
101079 | Red | Urgent | In Progress | 2014-02-09 | 2014-02-25 | | ARC CEs have VOViews with a default SE of "0" |
101052 | Red | Urgent | In Progress | 2014-02-06 | 2014-03-06 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk |
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-03-06 | | NGI Argus requests for NGI_UK |
98249 | Red | Urgent | On Hold | 2013-10-21 | 2014-01-29 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-03-04 | | Myproxy server certificate does not contain hostname |
Availability Report |
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
05/03/14 | 100 | 100 | 96.9 | 96.0 | 100 | SRM test failures. |
06/03/14 | 100 | 100 | 91.6 | 100 | 100 | Two blocks of SRM test failures. In all cases the error was "Invalid argument". |
07/03/14 | 100 | 100 | 100 | 100 | 100 | |
08/03/14 | 100 | 100 | 99.2 | 100 | 100 | Single SRM test error on Delete (No such file or directory). |
09/03/14 | 100 | 100 | 100 | 100 | 100 | |
10/03/14 | 100 | 100 | 100 | 100 | 100 | |
11/03/14 | 100 | 100 | 100 | 100 | 100 | |