Tier1 Operations Report 2012-02-15
From GridPP Wiki
Latest revision as of 14:03, 15 February 2012
RAL Tier1 Operations Report for 15th February 2012
Review of Issues during the week 8th to 15th February 2012.
- We note the significant planned outage on Wednesday (8th Feb.) during work on the core Tier1 network.
- On the morning of Thursday (9th Feb) there was a problem with the network link that provides the 'bypass' route for data traffic to Tier2s, which was down between roughly 07:40 and 09:20. The fix was to disable and re-enable the link between the "Site Access" and "UKLight" routers.
- On Thursday (9th Feb) there were problems with the Castor Information Provider (CIP): it was found to be serving cached information. Resolving this led to problems with SAM tests, and the solution was to revert the CIP change.
- On Thursday (9th Feb) the planned update to the MyProxy server ran into problems and was backed out.
- Overnight Thursday/Friday some of the Atlas SRMs crashed. The cause is partly understood and is linked to a similar problem last week.
- On Sunday (12th Feb, 07:04) there was a repeat of the network link problem between the "Site Access" and "UKLight" routers. The Networking team believes it understands the problem.
- Tuesday (14th Feb) Site outage for various upgrades.
- Tuesday (14th Feb) There was a site-wide power anomaly at approximately midday (12:00). Three racks of disk servers rebooted; a small number required manual intervention, but all returned to production.
Resolved Disk Server Issues
- gdss438 had faulty memory replaced during the Tier1 downtime.
Current operational status and issues.
- Atlas are not running any work at RAL due to the LFC migration.
Ongoing Disk Server Issues
- None
Notable Changes made this last week
- Tuesday (14th February) Significant outage to:
- Upgrade the Castor nameserver to 2.1.11-8.
- Migrate the batch farm to use the new UMD batch server and upgrade the worker nodes.
- Physically move the Atlas RAL TAGs database machines.
- There is separate ongoing work to migrate the Atlas LFC from RAL to CERN.
- lcgwms01 was upgraded to the EMI version.
Forthcoming Work & Interventions
- Following the Castor nameserver upgrade to version 2.1.11 on Tuesday 14th February, it is planned to update the Castor stagers on the following dates. New hardware will be brought into use as the various Castor elements are upgraded.
- Monday 20 Feb CMS
- Wednesday 22 Feb ATLAS
- Monday 27 Feb LHCb
- Wednesday 29 Feb Gen
Declared in the GOC DB
- Re-installation of WMS01 for upgrade to UMD distribution (9-17 Feb. including the drain)
- Atlas LFC is in downtime while it is being migrated. (14-02-2012 - 14-03-2012)
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.
- Infrastructure:
- Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
- Databases:
- Regular Oracle "PSU" patches are pending.
- Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
- Castor:
- Update the Castor Information Provider (CIP) (Need to re-schedule.)
- Move to use Oracle 11g.
- Networking:
- Changes required to extend range of addresses that route over the OPN.
- Install new Routing & Spine layers.
- Fabric:
- BIOS/firmware updates, Other re-configurations (adding IPMI cards, etc.)
- Network changes, including:
- Changes to accommodate new networking equipment.
- Grid Services:
- Updates of Grid Services (including WMS, FTS, MyProxy, LFC front ends) to EMI/UMD versions.
Entries in GOC DB starting between 8th and 15th February 2012.
There were 2 unscheduled outages during this period, both to address problems discovered after our planned upgrade.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All CEs | UNSCHEDULED | OUTAGE | 14/02/2012 13:00 | 14/02/2012 16:00 | 3 hours | Putting CEs into downtime while we investigate problems with the batch system |
All CASTOR Instances | UNSCHEDULED | WARNING | 14/02/2012 13:00 | 14/02/2012 16:00 | 3 hours | At-Risk on the tape system following upgrade of castor nameserver |
lfc-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/02/2012 09:00 | 14/03/2012 12:00 | 29 days, 3 hours | ATLAS LFC database migration to CERN |
All CE's and CASTOR Instances | SCHEDULED | OUTAGE | 14/02/2012 08:00 | 14/02/2012 13:00 | 5 hours | Castor Nameserver Upgrade to version 2.1.11. |
All CE's | SCHEDULED | OUTAGE | 13/02/2012 20:00 | 14/02/2012 08:00 | 12 hours | Batch drain ahead of intervention to upgrade Castor Nameserver to version 2.1.11. |
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/02/2012 15:00 | 15/02/2012 12:00 | 5 days, 21 hours | System unavailable - EMI installation |
lcgrbp01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 09/02/2012 10:50 | 09/02/2012 12:20 | 1 hour and 30 minutes | At-Risk for MyProxy service while service switched to new machine, effectively upgrading from glite 3.1 to UMD versions. |
Whole site | SCHEDULED | OUTAGE | 08/02/2012 09:00 | 08/02/2012 16:00 | 7 hours | Outage for intervention on core network within the RAL Tier1. |
All CEs | SCHEDULED | OUTAGE | 07/02/2012 21:00 | 08/02/2012 09:00 | 12 hours | Drain of batch system ahead of intervention on core network within the RAL Tier1. |
lcgwms03.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 07/02/2012 15:00 | 10/02/2012 15:00 | 3 days | System unavailable - EMI installation |
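The durations in the table above can be cross-checked against the start and end stamps. A minimal sketch (the `outage_hours` helper is hypothetical, not part of any GOC DB tooling), assuming the table's DD/MM/YYYY HH:MM timestamp format:

```python
from datetime import datetime

# Timestamp format used in the GOC DB table above (DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"

def outage_hours(start: str, end: str) -> float:
    """Return the duration in hours between two GOC DB-style timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600

# Cross-check two rows from the table:
# unscheduled CE outage on 14th Feb -> 3 hours
print(outage_hours("14/02/2012 13:00", "14/02/2012 16:00"))  # 3.0
# Atlas LFC migration downtime -> 29 days, 3 hours = 699 hours
print(outage_hours("14/02/2012 09:00", "14/03/2012 12:00"))  # 699.0
```

Note that the second check spans the February/March boundary in a leap year (February 2012 has 29 days), which `datetime` subtraction handles automatically.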
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
77026 | Red | Less Urgent | Waiting Reply | 2011-12-05 | 2012-02-03 | | BDII |
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |