Tier1 Operations Report 2012-05-02

From GridPP Wiki

RAL Tier1 Operations Report for 2nd May 2012

Review of Issues during the week 25th April to 2nd May 2012
  • The primary OPN link to CERN failed at around 10:30 Friday morning (27th April) owing to a fibre fault. This was fixed around 23:00 that evening. Traffic failed over to the backup link for this period and there was no operational impact on the Tier1.
  • On Saturday evening (28th April) there was a break in the bypass link (SAR to UKLight router) that was resolved by Networking staff.
  • There were problems with the batch system overnight Thursday/Friday (26/27 April) and again on Sunday morning (29th April). These were partly caused by a problematic worker node, with a lot of jobs going into a 'wait' state. (The blackhole detector also picked up a different failing worker node.) On Sunday the batch (PBS) server needed to be restarted to clear the problems. During this incident there was little effect on running jobs (we continued to run a large number of jobs), but for a few hours we did not start jobs and the queued jobs disappeared.
Resolved Disk Server Issues
  • GDSS209 (AtlasScratchDisk - D1T0) crashed on the evening of Friday 20th April. It was out of service until the Sunday. Following some investigations this server has been put into Read-Only mode ahead of being removed from production.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue. A further parameter change was made on the batch server yesterday (Tuesday 1st May).
  • As reported above there was a further break in the UKLight-SAR link this week. The fibre module in the SAR was replaced by a brand new one as part of the intervention.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for LHCb FTS transfers.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • Alice VOBOX (lcgvo-alice.gridpp.rl.ac.uk) updated to glite-VOBOX v3.2.13-1 (Monday 30th April).
  • A planned intervention on a power board supplied by the UPS took place yesterday (Tuesday 1st May).
Forthcoming Work & Interventions
  • Some modified WAN tuning settings are being rolled out across disk servers.
  • The ganglia server will be replaced.
  • Castor will move to use the new "Tape Gateway" and "Transfer Manager" features.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (needs to be re-scheduled).
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months commencing 14th May. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half.
Entries in GOC DB starting between 25th April and 2nd May 2012

There was one unscheduled entry in the GOC DB during this period. This is the 'warning' when some disk servers were rebooted.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-alice.gridpp.rl.ac.uk, srm-dteam.gridpp.rl.ac.uk, srm-hone.gridpp.rl.ac.uk, srm-ilc.gridpp.rl.ac.uk, srm-lhcb.gridpp.rl.ac.uk, srm-mice.gridpp.rl.ac.uk, srm-minos.gridpp.rl.ac.uk, srm-na62.gridpp.rl.ac.uk, srm-snoplus.gridpp.rl.ac.uk, srm-superb.gridpp.rl.ac.uk, srm-t2k.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 25/04/2012 11:00 | 25/04/2012 13:00 | 2 hours | Warning while some disk servers are rebooted to pick up new kernels.


Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
81669 | Yellow | Less Urgent | In Progress | 2011-04-27 | 2012-04-30 | NA62 | FTS channel for na62.vo.gridpp.ac.uk
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)