Tier1 Operations Report 2014-03-12

From GridPP Wiki

Latest revision as of 11:57, 12 March 2014

RAL Tier1 Operations Report for 12th March 2014

Review of Issues during the week 5th to 12th March 2014.
  • The problems with the virtual machine infrastructure reported last week have been worked around. Some further movement of VMs around is still required but there should be no, or very minimal, effect on services.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that have been reported in recent weeks. Understanding of the problem has advanced significantly and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
  • The Atlas disk space in Castor has become full. We are aware of an ongoing problem where file deletions triggered by Atlas' central service are slow. Some 'manual' deletions of files are taking place to speed up the process.
  • Around 50 files in tape-backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at the time of migration). CERN will provide a script to re-send the remaining ones to tape.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • ILC production role added (to the cream CEs and Argus)
  • One batch of WNs now updated to EMI-3 version of WN.
  • Castor 2.1.14 testing of tape servers is underway.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 19/03/2014 10:00 | 19/03/2014 12:00 | 2 hours | RAL Tier1 site in warning state due to UPS/generator test.
Whole Site | SCHEDULED | WARNING | 17/03/2014 07:00 | 17/03/2014 17:00 | 10 hours | Site At Risk during and following change to use new firewall.
lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 17/03/2014 06:00 | 17/03/2014 09:00 | 3 hours | Drain and stop of FTS services during update to new site firewall.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The Tier1 will move to use the new site firewall on Monday 17th March (as announced in the GOC DB). There will be some interruption to services as seen from outside RAL. Internally, services are expected to continue uninterrupted.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
  • Networking:
    • Implementation of new site firewall.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 5th and 12th March 2014.
  • None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
101968 | Green | Less Urgent | In Progress | 2014-03-11 | 2014-03-12 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 | Red | Urgent | In Progress | 2014-02-09 | 2014-02-25 | | ARC CEs have VOViews with a default SE of "0"
101052 | Red | Urgent | In Progress | 2014-02-06 | 2014-03-06 | Biomed | Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-03-06 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | On Hold | 2013-10-21 | 2014-01-29 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 | Red | Less urgent | On Hold | 2013-09-03 | 2014-03-04 | | Myproxy server certificate does not contain hostname
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
05/03/14 | 100 | 100 | 96.9 | 96.0 | 100 | SRM test failures.
06/03/14 | 100 | 100 | 91.6 | 100 | 100 | Two blocks of SRM test failures. In all cases "Invalid argument".
07/03/14 | 100 | 100 | 100 | 100 | 100 |
08/03/14 | 100 | 100 | 99.2 | 100 | 100 | Single SRM test error on Delete (No such file or directory).
09/03/14 | 100 | 100 | 100 | 100 | 100 |
10/03/14 | 100 | 100 | 100 | 100 | 100 |
11/03/14 | 100 | 100 | 100 | 100 | 100 |
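
For a quick week-level view of the daily availability figures above, the values can be averaged per column. The following is a minimal sketch only: the names AVAILABILITY and weekly_mean are hypothetical, and the simple unweighted mean over the seven days is an assumption made for illustration, not necessarily how official availability figures are calculated.

```python
# Daily availability figures (percent), copied from the table above.
# AVAILABILITY and weekly_mean are illustrative names, not part of any
# GridPP or WLCG tool.
AVAILABILITY = {
    "05/03/14": {"OPS": 100, "Alice": 100, "Atlas": 96.9, "CMS": 96.0, "LHCb": 100},
    "06/03/14": {"OPS": 100, "Alice": 100, "Atlas": 91.6, "CMS": 100, "LHCb": 100},
    "07/03/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "08/03/14": {"OPS": 100, "Alice": 100, "Atlas": 99.2, "CMS": 100, "LHCb": 100},
    "09/03/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "10/03/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
    "11/03/14": {"OPS": 100, "Alice": 100, "Atlas": 100, "CMS": 100, "LHCb": 100},
}


def weekly_mean(table):
    """Return the unweighted mean availability per column over all days."""
    columns = list(next(iter(table.values())).keys())
    days = len(table)
    return {col: sum(day[col] for day in table.values()) / days for col in columns}


if __name__ == "__main__":
    for name, mean in weekly_mean(AVAILABILITY).items():
        print(f"{name}: {mean:.2f}")
```

Only the Atlas and CMS columns fall below 100 under this averaging, which matches the days with SRM test failures noted in the Comment column.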