Difference between revisions of "Tier1 Operations Report 2014-01-22"
From GridPP Wiki
Gareth smith (Talk | contribs) |
Latest revision as of 09:52, 22 January 2014
RAL Tier1 Operations Report for 22nd January 2014
Review of Issues during the week 15th to 22nd January 2014.
- Generally steady operations.
- There was a problem with the CMS Castor instance for just under 30 minutes around midday yesterday (Tuesday 21st Jan). The xrootd process on one of the CMS Castor headnodes died and was down until the restarter kicked in.
Resolved Disk Server Issues
- None.
Current operational status and issues
- None
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- On Thursday (16th Jan) the diskpools behind the aliceTape and genTape service classes were merged.
- On Friday (17th Jan) xroot for small VOs (on the Castor GEN instance) was enabled.
- Yesterday (Tuesday 21st Jan) an attempt was made to update FTS3 (to resolve an openssl problem). However, FTS3 then failed to work and the update was backed out.
- Yesterday (Tuesday 21st Jan) the microcode in the tape libraries was updated. This new version enables use of "T10000D" tape drives.
- The second (and final) tranche of worker nodes in this year's purchase was delivered earlier this week.
Declared in the GOC DB
- On Monday 27th January, 10:00 - 12:00: upgrade of FTS3 gridsite and openssl. Existing proxies on the server will be removed as part of the upgrade.
- There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
- Networking:
  - Implementation of new site firewall. Date for Tier1 proposed to be 10th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Required before firewall changes on 10th March.)
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 15th and 22nd January 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor (all SRM endpoints) | UNSCHEDULED | WARNING | 21/01/2014 09:00 | 21/01/2014 13:00 | 4 hours | For microcode updates to the tape robots. Castor disk services will remain up but there will be no tape access. Tape recalls will stall. Writes to tape backed service classes will carry on, with files flushed from the disk caches to tape once the microcode updates are completed. |
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | old EMI-2 hosts to be retired |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
100369 | Green | Less Urgent | In Progress | 2014-01-18 | 2014-01-20 | | Read only LFC accessible only if you have credentials on the read-write LFC |
100343 | Yellow | Less Urgent | In Progress | 2014-01-16 | 2014-01-21 | | RAL WMS still generating 512 proxies |
100114 | Red | Less Urgent | In Progress | 2014-01-08 | 2014-01-10 | | Jobs failing to get from RAL WMS to Imperial |
99768 | Red | Less Urgent | In Progress | 2013-12-13 | 2014-01-07 | Atlas | RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist" |
99556 | Red | Very Urgent | In Progress | 2013-12-06 | 2014-01-21 | | NGI Argus requests for NGI_UK |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-01-14 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2014-01-06 | | Myproxy server certificate does not contain hostname |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host |
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
15/01/14 | 100 | 100 | 100 | 100 | 100 | |
16/01/14 | 100 | 100 | 100 | 100 | 100 | |
17/01/14 | 100 | 100 | 100 | 100 | 100 | |
18/01/14 | 100 | 100 | 100 | 100 | 100 | |
19/01/14 | 100 | 100 | 100 | 100 | 100 | |
20/01/14 | 100 | 100 | 100 | 100 | 100 | |
21/01/14 | 100 | 100 | 100 | 100 | 100 | |