RAL Tier1 Operations Report for 20th July 2016

Review of Issues during the week 13th to 20th July 2016.
  • There was a problem over the weekend with some CMS writes to tape-backed areas failing owing to no tape pool having been defined for these cases. This was resolved on Monday.
  • There has been saturation of the inbound 10Gbit OPN link at times over the last week. The bypass route to JANET has also shown high traffic volumes.
  • We have seen a couple of periods when there have been high numbers of batch jobs started multiple times. There were some on the 7th/8th July and again last night. The cause is not yet understood.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There is a problem, seen by LHCb, of a low but persistent rate of failures when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
  • The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise, we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network (a minimal ad-hoc check is sketched below).
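
For illustration only, the sketch below shows a crude way to probe packet loss to a single remote endpoint by parsing the summary line printed by ping. This is not the Tier1's monitoring (routine loss measurements use perfSonar, see GGUS 121687), and the hostname used is a placeholder.

  #!/usr/bin/env python3
  # Illustrative only: a crude packet-loss probe using ping. The Tier1's
  # routine loss monitoring uses perfSonar; this script is not that tooling,
  # and the hostname below is a placeholder, not a real endpoint.
  import re
  import subprocess

  def packet_loss_percent(host, count=100):
      """Run ping against a host and return the reported packet-loss percentage."""
      result = subprocess.run(
          ["ping", "-c", str(count), "-q", host],
          capture_output=True, text=True,
      )
      # Linux ping prints a summary line such as:
      #   "100 packets transmitted, 99 received, 1% packet loss, time ..."
      match = re.search(r"([\d.]+)% packet loss", result.stdout)
      return float(match.group(1)) if match else None

  if __name__ == "__main__":
      loss = packet_loss_percent("remote-endpoint.example.org")  # placeholder host
      print("packet loss: %s%%" % loss)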
Ongoing Disk Server Issues
  • None.
Notable Changes made since the last meeting.
  • The migration of Atlas data from T10KC ("C") to T10KD ("D") tapes continues. We have migrated over 700 of the 1300 tapes so far.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC database to use new Database Infrastructure.
  • Castor:
    • Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
    • Update to Castor version 2.1.15. This awaits successful resolution and testing of the new version.
    • Migration of data from T10KC to T10KD tapes (Affects Atlas & LHCb data).
  • Networking:
    • Replace the UKLight Router. Then upgrade the 'bypass' link to the RAL border routers to 2*40Gbit.
  • Fabric
    • Firmware updates on older disk servers.
Entries in GOC DB starting since the last report.
  • None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
122861 Green Very Urgent Waiting Reply 2016-07-14 2016-07-15 CMS SAM3 CE is critical T1_UK_RAL
122827 Green Less Urgent In Progress 2016-07-12 2016-07-13 SNO+ Disk area at RAL
122818 Green Less Urgent In Progress 2016-07-12 2016-07-12 Atlas Object Store at RAL
122804 Green Less Urgent Waiting Reply 2016-07-12 2016-07-15 SNO+ glite-transfer failure
122364 Green Less Urgent Waiting Reply 2016-06-27 2016-07-15 cvmfs support at RAL-LCG2 for solidexperiment.org
121687 Yellow Less Urgent On Hold 2016-05-20 2016-05-23 packet loss problems seen on RAL-LCG perfsonar
120810 Green Urgent In Progress 2016-04-13 2016-06-24 Biomed Decommissioning of SE srm-biomed.gridpp.rl.ac.uk - forbid write access for biomed users
120350 Green Less Urgent In Progress 2016-03-22 2016-05-06 LSST Enable LSST at RAL
119841 Red Less Urgent On Hold 2016-03-01 2016-04-26 LHCb HTTP support for lcgcadm04.gridpp.rl.ac.uk
117683 Yellow Less Urgent On Hold 2015-11-18 2016-04-05 CASTOR at RAL not publishing GLUE 2
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 729); CMS HC = CMS HammerCloud

Day OPS Alice Atlas CMS LHCb Atlas HC CMS HC Comment
13/07/16 100 100 100 100 100 100 100
14/07/16 100 100 100 100 100 100 100
15/07/16 100 100 100 100 100 100 100
16/07/16 100 100 100 100 100 100 100
17/07/16 100 100 100 100 100 100 100
18/07/16 100 100 100 100 100 100 100
19/07/16 100 100 100 100 100 100 100
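
As a worked example only, the short sketch below collapses the daily figures in the table above (13th to 19th July, all 100%) into a weekly mean per experiment. The values are copied by hand from the table; this is not the availability calculation performed by the dashboards themselves.

  # Illustrative only: weekly mean of the daily availability figures above.
  daily = {
      "OPS":   [100, 100, 100, 100, 100, 100, 100],
      "Alice": [100, 100, 100, 100, 100, 100, 100],
      "Atlas": [100, 100, 100, 100, 100, 100, 100],
      "CMS":   [100, 100, 100, 100, 100, 100, 100],
      "LHCb":  [100, 100, 100, 100, 100, 100, 100],
  }

  for vo, values in daily.items():
      weekly = sum(values) / float(len(values))
      print("%s: weekly availability %.1f%%" % (vo, weekly))  # 100.0% for all this week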