Difference between revisions of "Tier1 Operations Report 2014-12-03"
From GridPP Wiki
Revision as of 15:39, 2 December 2014
RAL Tier1 Operations Report for 3rd December 2014
Review of Issues during the week 26th November to 3rd December 2014.
Resolved Disk Server Issues
- None
Current operational status and issues
- There are ongoing problems on the Atlas Castor instance. At various times in recent weeks the Atlas workload has left differing groups of disk servers spending a lot of time in a "wait i/o" state. This is triggered by the number of reads using xroot and has led to some SAM test failures.
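As an illustration of how time spent in the "wait i/o" state can be quantified on a Linux disk server, the sketch below computes the iowait share of CPU time between two samples of the aggregate `cpu` line from /proc/stat. The function names and the sample counter values are hypothetical and are not taken from the Tier1 monitoring; this is a minimal sketch under those assumptions, not the method used at RAL.

```python
# Field order of the aggregate "cpu" line in /proc/stat (Linux).
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:1 + len(CPU_FIELDS)]]
    return dict(zip(CPU_FIELDS, values))


def iowait_fraction(before, after):
    """Return the fraction of CPU time spent in iowait between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return deltas["iowait"] / total if total else 0.0


# Hypothetical samples; on a live server these lines would be read from
# /proc/stat a few seconds apart.
sample_before = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
sample_after = parse_cpu_line("cpu  150 0 70 900 280 0 0 0 0 0")
fraction = iowait_fraction(sample_before, sample_after)
print(f"iowait share of CPU time: {fraction:.3f}")
```

A persistently high fraction across a group of servers would be consistent with the read-driven load described above, although the threshold that counts as "high" is a site-specific judgement.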
Ongoing Disk Server Issues
- None.
Notable Changes made this last week.
- LHCb Castor headnodes were updated to SL6 on Tuesday 2nd December.
Declared in the GOC DB
- None.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The rollout of the RIP protocol to the Tier1 routers still has to be completed.
- First quarter 2015: Circuit testing of the remaining (i.e. non-UPS) circuits in the machine room. (Provisional dates: week 12-16 January.)
- Castor headnode upgrades to SL6: (Assume 4 hour outage of Castor instance in each case for stager updates).
- Tuesday 2nd Dec - LHCb; Tuesday 9th Dec - CMS; Wednesday 10th Dec - Atlas; Wednesday 7th Jan - GEN; Thursday 8th Jan - Nameserver (transparent - at risk).
Listing by category:
- Databases:
- A new database (Oracle RAC) has been set up to host the Atlas 3D database. This is updated from CERN via Oracle GoldenGate. This system is yet to be brought into use. (Currently Atlas 3D/Frontier still uses the OGMA database system, although this was also changed to update from CERN using Oracle GoldenGate.)
- Switch LFC/3D to new Database Infrastructure.
- Castor:
- Update Castor headnodes to SL6.
- Fix discrepancies found in some of the Castor database tables and columns. (The issue has no operational impact.)
- Networking:
- Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Enable the RIP protocol for updating routing tables on the Tier1 routers.
- Fabric:
- Migration of data to new T10KD tapes. (Migration of CMS from 'B' to 'D' tapes underway; migration of GEN from 'A' to 'D' tapes to follow.)
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room (Expected first quarter 2015).
Entries in GOC DB starting between the 26th November and 3rd December 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-lhcb-tape.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 02/12/2014 10:00 | 02/12/2014 14:00 | 4 hours | OS upgrade (SL6) on headnodes for LHCb Castor instance. |
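The Duration column can be cross-checked against the Start and End columns. The sketch below does this for the single entry above, assuming the day/month/year timestamp layout shown in the table; the helper name `downtime_hours` is illustrative and is not part of any GOC DB tooling.

```python
from datetime import datetime

# Timestamp layout as it appears in the table: DD/MM/YYYY HH:MM.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_hours(start, end):
    """Return the length of a downtime in hours from its start and end strings."""
    begin = datetime.strptime(start, GOCDB_FORMAT)
    finish = datetime.strptime(end, GOCDB_FORMAT)
    if finish < begin:
        raise ValueError("downtime ends before it starts")
    return (finish - begin).total_seconds() / 3600.0


# The LHCb Castor headnode outage listed above.
lhcb_outage = downtime_hours("02/12/2014 10:00", "02/12/2014 14:00")
print(lhcb_outage)  # 4.0
```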
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
110497 | Green | Less Urgent | In Progress | 2014-12-02 | 2014-11-02 | OPS | [Rod Dashboard] Issues detected at RAL-LCG2 |
110397 | Green | Less Urgent | In Progress | 2014-11-26 | 2014-11-27 | dteam | Unable to access LFC webdav interface via browser |
110382 | Green | Less Urgent | In Progress | 2014-11-26 | 2014-11-26 | N/A | RAL-LCG2: please reinstall your perfsonar hosts(s) |
109712 | Amber | Urgent | In Progress | 2014-10-29 | 2014-11-27 | CMS | Glexec exited with status 203; ... |
109694 | Green | Urgent | On Hold | 2014-11-03 | 2014-11-26 | SNO+ | gfal-copy failing for files at RAL |
108944 | Amber | Urgent | In Progress | 2014-10-01 | 2014-11-26 | CMS | AAA access test failing at T1_UK_RAL |
107935 | Red | Less Urgent | On Hold | 2014-08-27 | 2014-11-03 | Atlas | BDII vs SRM inconsistent storage capacity numbers |
106324 | Red | Urgent | On Hold | 2014-06-18 | 2014-11-27 | CMS | pilots losing network connections at T1_UK_RAL |
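The age of each open ticket at the time of the meeting follows directly from the Creation column. The sketch below computes it for three rows of the table above, taking the meeting date of 3rd December 2014 from the report title; the function name `days_open` is illustrative only, and no relationship between age and the Level column is assumed.

```python
from datetime import date

# Meeting date, taken from the report title.
MEETING_DATE = date(2014, 12, 3)

# Creation dates copied from the table above (GGUS ID -> Creation).
CREATED = {
    110397: date(2014, 11, 26),
    107935: date(2014, 8, 27),
    106324: date(2014, 6, 18),
}


def days_open(created, as_of=MEETING_DATE):
    """Return the number of days a ticket has been open as of a given date."""
    return (as_of - created).days


# List the selected tickets, oldest first, with their age in days.
for ticket_id, created in sorted(CREATED.items(), key=lambda item: item[1]):
    print(ticket_id, days_open(created))
```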
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
26/11/14 | 100 | 100 | 100 | 100 | 100 | 99 | n/a | |
27/11/14 | 100 | 100 | 100 | 100 | 100 | 98 | n/a | |
28/11/14 | 100 | 100 | 98.1 | 100 | 100 | 100 | n/a | |
29/11/14 | 100 | 100 | 100 | 98.5 | 100 | 100 | n/a | Start of problems with CMS Castor scheduler headnode. |
30/11/14 | 100 | 100 | 100 | 73.0 | 100 | 100 | n/a | Problems with CMS Castor scheduler headnode. |
01/12/14 | 100 | 100 | 99.2 | 95.9 | 91.9 | 100 | n/a | |
02/12/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
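For a one-line summary of the week, the daily figures can be averaged per VO. The sketch below uses a simple unweighted mean of the seven daily values copied from the table above; this is an assumption made for illustration and may not match how the official availability figures are aggregated. The names `AVAILABILITY` and `weekly_mean` are hypothetical.

```python
# Daily availability (%) per VO, 26/11/14 to 02/12/14, copied from the table.
AVAILABILITY = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 98.1, 100, 100, 99.2, 100],
    "CMS":   [100, 100, 100, 98.5, 73.0, 95.9, 100],
    "LHCb":  [100, 100, 100, 100, 100, 91.9, 100],
}


def weekly_mean(values):
    """Return the unweighted mean of the daily figures, to one decimal place."""
    return round(sum(values) / len(values), 1)


weekly = {vo: weekly_mean(values) for vo, values in AVAILABILITY.items()}
for vo, value in weekly.items():
    print(f"{vo}: {value}")
```

On these figures the CMS average is pulled down mainly by 30/11/14, the day of the Castor scheduler headnode problems noted in the Comment column.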