
RAL Tier1 Operations Report for 4th November 2015

Review of Issues during the week 28th October to 4th November 2015.
  • We again saw high load on the AtlasTape Castor instance - exacerbated by the failure of some disk servers in the cache in front of this tape area.
Resolved Disk Server Issues
  • gdss665 (AtlasTape - D0T1) failed on Sat (24th Oct). Following a disk replacement and an update of the disk controller firmware, the system was re-run through acceptance testing for 5 days before being returned to service yesterday (3rd November).
  • gdss663 (AtlasTape - D0T1) failed on Sun (25th Oct). Following a disk and battery replacement and an update of the disk controller firmware, the system was re-run through acceptance testing for 5 days before being returned to service yesterday (3rd November).
Current operational status and issues
  • LHCb continue to see a low but persistent rate of failures when copying the results of batch jobs to Castor. There is also a further problem that sometimes occurs when these (failed) writes are then attempted to storage at other sites.
  • The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network (a minimal measurement sketch follows this list).
  • Long-standing CMS issues: the two items that remain are CMS Xroot (AAA) redirection and file open times. Work on the Xroot redirection is ongoing, with a new server having been added in recent weeks. File open times using Xroot remain slow, but this is a less significant problem (a rough timing check is sketched after this list).
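
As context for the packet-loss item above, the following is a minimal sketch of one way such low-level loss can be measured from a host: it repeatedly runs the standard Linux ping in quiet mode and parses the loss figure from its summary line. This is only an illustration, not the monitoring actually used at RAL; the target list mixes a hostname taken from elsewhere in this report with a purely hypothetical placeholder.

```python
import re
import subprocess

# Hosts to probe. srm-lhcb.gridpp.rl.ac.uk appears elsewhere in this report;
# the second entry is a purely hypothetical placeholder for a remote endpoint.
TARGETS = ["srm-lhcb.gridpp.rl.ac.uk", "remote-site.example.org"]
PING_COUNT = 20


def packet_loss(host, count=PING_COUNT):
    """Run the standard Linux ping in quiet mode and return the reported loss (%), or None."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), "-q", host],
            capture_output=True, text=True, timeout=count * 2 + 10,
        )
    except subprocess.TimeoutExpired:
        return None
    # ping's summary line looks like:
    # "20 packets transmitted, 20 received, 0% packet loss, time 19028ms"
    match = re.search(r"([\d.]+)% packet loss", result.stdout)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    for host in TARGETS:
        loss = packet_loss(host)
        print(f"{host}: {'no result' if loss is None else str(loss) + '% loss'}")
```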
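For the Xroot (AAA) file open times mentioned above, a rough spot check can be made by timing metadata lookups through a redirector with the xrdfs client from the XRootD distribution. The sketch below is an assumption-laden illustration rather than the test CMS actually runs: the redirector address and file path are placeholders, and `xrdfs ... stat` measures lookup/redirection latency rather than a full file open.

```python
import subprocess
import time

# Both values are placeholders, not the real CMS AAA configuration at RAL.
REDIRECTOR = "xrootd-redirector.example.org:1094"
TEST_FILE = "/store/example/testfile.root"


def timed_stat(redirector, path, attempts=5):
    """Time repeated `xrdfs <redirector> stat <path>` calls as a rough proxy for
    redirection/lookup latency. Returns a list of (seconds, success) pairs."""
    samples = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            result = subprocess.run(
                ["xrdfs", redirector, "stat", path],
                capture_output=True, text=True, timeout=120,
            )
            ok = result.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False
        samples.append((time.monotonic() - start, ok))
    return samples


if __name__ == "__main__":
    for seconds, ok in timed_stat(REDIRECTOR, TEST_FILE):
        print(f"{'ok  ' if ok else 'fail'} {seconds:.2f}s")
```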
Ongoing Disk Server Issues
  • gdss707 (AtlasDataDisk - D1T0) has been out of production since Friday (16th Oct). The server was drained and is currently undergoing testing with the fabric team.
  • gdss664 (AtlasTape - D0T1) was removed from service on the 28th Oct. The system was having problems running some network commands; these were resolved by a reboot. A failing disk was also replaced. The disk controller firmware has been updated and the system has been re-running the acceptance tests since yesterday (3rd Nov).
Notable Changes made since the last meeting.
  • None
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Upgrade of remaining Castor disk servers (those in tape-backed service classes) to SL6. This will be transparent to users.
  • Some detailed internal network re-configurations to enable the removal of the old 'core' switch from our network. This includes changing the way the UKLIGHT router connects into the Tier1 network.

Listing by category:

  • Databases:
    • Switch LFC/3D to new Database Infrastructure.
  • Castor:
    • Update SRMs to new version (includes updating to SL6).
    • Update disk servers to SL6 (ongoing).
    • Update to Castor version 2.1.15.
  • Networking:
    • Complete changes needed to remove the old core switch from the Tier1 network.
    • Make routing changes to allow the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, LFC).
Entries in GOC DB starting since the last report.
  • None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
117277 | Green | Urgent | Waiting for Reply | 2015-10-30 | 2015-11-03 | Atlas | UK RAL-LCG2 staging error "bring-online timeout has been exceeded" (over 500 errors)
117248 | Green | Less Urgent | In Progress | 2015-10-28 | 2015-11-03 |  | Incorrect Certificate for SRM (new certs requested but not yet deployed)
116866 | Green | Less Urgent | On Hold | 2015-10-12 | 2015-10-19 | SNO+ | snoplus support at RAL-LCG2 (pilot role)
116864 | Green | Urgent | In Progress | 2015-10-12 | 2015-10-26 | CMS | T1_UK_RAL AAA opening and reading test failing again...
Availability Report

Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud. All figures are percentages.

Day OPS Alice Atlas CMS LHCb Atlas HC CMS HC Comment
28/10/15 100 100 100 100 100 97 100
29/10/15 100 100 100 100 100 93 100
30/10/15 100 100 100 100 100 100 100
31/10/15 100 100 100 100 100 100 100
01/11/15 100 100 100 100 100 100 100
02/11/15 100 100 100 100 96 100 100 Single SRM test failure. "could not open connection to srm-lhcb.gridpp.rl.ac.uk"
03/11/15 100 100 100 100 100 91 100