Latest revision as of 08:50, 11 November 2014
RAL Tier1 Operations Report for 5th November 2014
Review of Issues during the week 29th October to 5th November 2014.
- The problem reported last week with the ARC CEs not keeping up with job processing has been fixed. A modification to use 'per-job' condor history files has sped this process up. Two of the ARC CEs (arc-ce02 & arc-ce03) have been re-installed with this change.
- The Tier1 primary router was rebooted this morning (5th Nov), followed a few minutes later by the secondary. This was part of the investigations into getting the RIP protocol to work.
Resolved Disk Server Issues
- None
Current operational status and issues
- None
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- Oracle patches were applied to the 'somnus' database behind the LFC on Thursday (30th Oct).
- The repacking of the CMS data from T10KB to T10KD tapes is progressing and is now around three-quarters complete.
- A port was opened to allow external Castor WebDAV access (requested by LHCb).
Declared in the GOC DB
None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- First quarter 2015: Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room.
Listing by category:
- Databases:
  - A new database (Oracle RAC) has been set up to host the Atlas3D database. This is updated from CERN via Oracle GoldenGate.
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update Castor headnodes to SL6.
  - Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
  - Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric:
  - Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (expected first quarter 2015).
Entries in GOC DB starting between the 29th October and 5th November 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site. | UNSCHEDULED | WARNING | 05/11/2014 09:00 | 05/11/2014 10:00 | 1 hour | Putting site At Risk for a reboot of network router. Anticipate only two very short (few seconds) breaks in connectivity. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
109845 | Green | Less Urgent | Waiting for Reply | 2014-11-04 | 2014-11-04 | | egi.eu CVMFS repository and GridPP wiki |
109712 | Green | Urgent | In Progress | 2014-10-29 | 2014-10-29 | CMS | Glexec exited with status 203; ... |
109694 | Green | Urgent | In Progress | 2014-11-03 | 2014-11-04 | SNO+ | gfal-copy failing for files at RAL |
109276 | Green | Urgent | On Hold | 2014-10-11 | 2014-11-03 | CMS | Submissions to RAL FTS3 REST interface are failing for some users |
108944 | Yellow | Urgent | On Hold | 2014-10-01 | 2014-11-03 | CMS | AAA access test failing at T1_UK_RAL |
107935 | Red | Less Urgent | On Hold | 2014-08-27 | 2014-11-03 | Atlas | BDII vs SRM inconsistent storage capacity numbers |
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-10-13 | CMS | pilots losing network connections at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
29/10/14 | 100 | 100 | 97.7 | 100 | 100 | 99 | 100 | Two SRM Test failures (both timeouts) |
30/10/14 | 100 | 100 | 100 | 95.9 | 95.9 | 99 | 99 | Both CMS and LHCb had a single SRM test failure just after midnight. |
31/10/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | |
01/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
02/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
03/11/14 | 100 | 100 | 98.3 | 100 | 100 | 100 | 100 | Two consecutive SRM PUT test failures. |
04/11/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
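
To give a feel for what the sub-100 figures in the table mean in time terms, the sketch below converts a daily availability percentage into approximate minutes. It assumes each figure is a simple fraction of a 24-hour day; that is an assumption made for illustration, and the real availability calculation (test intervals, handling of unknown periods) is not described in this report. The function name `downtime_minutes` is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60  # assumption: availability is measured over a full 24-hour day


def downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of a 24-hour day not counted as available."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_DAY


# Sub-100 values taken from the table above.
for day, vo, pct in [
    ("29/10/14", "Atlas", 97.7),
    ("30/10/14", "CMS", 95.9),
    ("03/11/14", "Atlas", 98.3),
]:
    print(f"{day} {vo}: {pct}% is roughly {downtime_minutes(pct):.0f} minutes")
```

Under that assumption, 97.7% corresponds to about 33 minutes, 95.9% to about 59 minutes, and 98.3% to about 24 minutes, which is in line with the one or two failed SRM tests noted in the comments.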