Difference between revisions of "Tier1 Operations Report 2014-01-22"

From GridPP Wiki

Latest revision as of 09:52, 22 January 2014

RAL Tier1 Operations Report for 22nd January 2014

Review of Issues during the week 15th to 22nd January 2014.
  • Generally steady operations.
  • There was a problem with the CMS Castor instance for just under 30 minutes around midday yesterday (Tuesday 21st Jan). The xrootd process on one of the CMS Castor headnodes died and was down until the restarter kicked in.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • None
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • On Thursday (16th Jan) the diskpools behind the aliceTape and genTape service classes were merged.
  • On Friday (17th Jan) xroot for small VOs (on the Castor GEN instance) was enabled.
  • Yesterday (Tuesday 21st Jan) an attempt was made to update FTS3 (to resolve an openssl problem). However, FTS3 then failed to work and the update was backed out.
  • Yesterday (Tuesday 21st Jan) the microcode in the tape libraries was updated. This new version enables use of "T10000D" tape drives.
  • The second (and final) tranche of worker nodes in this year's purchase was delivered earlier this week.
Declared in the GOC DB
  • On Monday, 27th January. 10:00 - 12:00. Upgrade of FTS3 gridsite and openssl. Will remove existing proxies on the server as part of upgrade.
  • There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
  • Networking:
    • Implementation of new site firewall. Date for Tier1 proposed to be 10th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Required before firewall changes on 10th March).
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 15th and 22nd January 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (all SRM endpoints) | UNSCHEDULED | WARNING | 21/01/2014 09:00 | 21/01/2014 13:00 | 4 hours | For microcode updates to the tape robots. Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape backed service classes will carry on, with files flushed from the disk caches to tape once the microcode updates are completed.
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | old EMI-2 hosts to be retired
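The Duration values in the entries above follow directly from the Start and End timestamps. As a cross-check, a minimal Python sketch is given here; the function name `downtime_duration` and the output wording are illustrative assumptions and are not part of the GOC DB itself.

```python
from datetime import datetime

# Timestamp layout used in the Start/End columns above (day/month/year hour:minute).
FMT = "%d/%m/%Y %H:%M"


def _plural(n: int, unit: str) -> str:
    """Return e.g. '1 day' or '43 days'."""
    return f"{n} {unit}" if n == 1 else f"{n} {unit}s"


def downtime_duration(start: str, end: str) -> str:
    """Hypothetical helper: format the interval between two timestamps
    as days, hours and minutes, omitting any zero components."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    parts = []
    if delta.days:
        parts.append(_plural(delta.days, "day"))
    if hours:
        parts.append(_plural(hours, "hour"))
    if minutes:
        parts.append(_plural(minutes, "minute"))
    return ", ".join(parts) or "0 minutes"


if __name__ == "__main__":
    # The two entries listed for this reporting period.
    print(downtime_duration("21/01/2014 09:00", "21/01/2014 13:00"))
    print(downtime_duration("18/12/2013 11:00", "31/01/2014 00:00"))
```

Both listed durations (4 hours, and 43 days, 13 hours) are consistent with their start and end times.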
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
100369 | Green | Less Urgent | In Progress | 2014-01-18 | 2014-01-20 | | Read only LFC accessible only if you have credentials on the read-write LFC
100343 | Yellow | Less Urgent | In Progress | 2014-01-16 | 2014-01-21 | | RAL WMS still generating 512 proxies
100114 | Red | Less Urgent | In Progress | 2014-01-08 | 2014-01-10 | | Jobs failing to get from RAL WMS to Imperial
99768 | Red | Less Urgent | In Progress | 2013-12-13 | 2014-01-07 | Atlas | RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist"
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-21 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-01-14 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-01-06 | | Myproxy server certificate does not contain hostname
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
15/01/14 | 100 | 100 | 100 | 100 | 100 |
16/01/14 | 100 | 100 | 100 | 100 | 100 |
17/01/14 | 100 | 100 | 100 | 100 | 100 |
18/01/14 | 100 | 100 | 100 | 100 | 100 |
19/01/14 | 100 | 100 | 100 | 100 | 100 |
20/01/14 | 100 | 100 | 100 | 100 | 100 |
21/01/14 | 100 | 100 | 100 | 100 | 100 |