Tier1 Operations Report 2012-05-30

RAL Tier1 Operations Report for 30th May 2012

Review of Issues during the week 23rd to 30th May 2012
  • Last Thursday (24th) a problem was found with xrootd on a draining disk server, which affected CMS. A workaround is in place.
  • One of the Site BDIIs failed on Saturday (26th) and was removed from the DNS alias.
  • On Saturday (26th) a problem with a power controller caused two service nodes (APEL and one of the CMS squids) to fail. The CMS squid was removed from the relevant CMS configs on the same day. These services were restored on Monday (28th).
Resolved Disk Server Issues
  • GDSS644 (atlasStripInput) was found to have an incorrect installation on 12th May. It was drained and re-installed and returned to service on the 29th May.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for outgoing CMS FTS transfers.
Ongoing Disk Server Issues
  • GDSS374 (AtlasTape - D0T1) and GDSS607 (LHCbDst - D1T0) are both drained and undergoing re-acceptance testing following earlier failures.
Notable Changes made this last week
  • On Monday (28th May) the LHCb Castor instance was successfully changed to use the new "Transfer Manager" scheduler.
  • On Wednesday morning (30th May) the Transfer Manager was deployed for the Castor GEN instance.
  • The older disk servers in AliceDisk have now been drained and removed. This means the space token now has around 200TB of storage as planned. (It was temporarily over-allocated after newer servers were added before the old ones were removed.) Of note is that the draining uncovered around 4000 files that were listed in the Castor Nameserver as having zero size but that did occupy space on disk. None of these files were recorded by Alice as present at RAL (so there was no data loss), but they did represent some dark data.
  • Errata and kernel updates are being deployed on worker nodes.
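
The zero-size discrepancy noted above for AliceDisk can be illustrated with a minimal sketch. This is not the procedure used at RAL. It assumes a hypothetical dump of nameserver entries, a mapping of on-disk sizes and a set of paths from the VO catalogue. All names (`NameserverEntry`, `find_zero_size_mismatches`, `dark_data`) and the sample paths are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameserverEntry:
    """Hypothetical record from a nameserver dump."""
    path: str
    size: int  # size recorded in the nameserver, in bytes


def find_zero_size_mismatches(ns_entries, disk_sizes):
    """Return (path, bytes_on_disk) for files the nameserver lists as
    zero size but that occupy space on disk.

    ns_entries: iterable of NameserverEntry (assumed input format)
    disk_sizes: dict mapping path -> bytes actually used on disk
    """
    mismatches = []
    for entry in ns_entries:
        on_disk = disk_sizes.get(entry.path, 0)
        if entry.size == 0 and on_disk > 0:
            mismatches.append((entry.path, on_disk))
    return mismatches


def dark_data(mismatches, vo_catalogue):
    """Keep only mismatches that the VO does not record as present at the site."""
    return [(path, size) for path, size in mismatches if path not in vo_catalogue]


# Invented sample data, for illustration only.
ns = [
    NameserverEntry("/alice/a", 0),
    NameserverEntry("/alice/b", 1024),
    NameserverEntry("/alice/c", 0),
]
disk = {"/alice/a": 2048, "/alice/b": 1024, "/alice/c": 0}
vo_known = {"/alice/b"}

suspect = find_zero_size_mismatches(ns, disk)
print(dark_data(suspect, vo_known))
```

Files flagged this way that are also absent from the VO catalogue correspond to the dark data described above: space is consumed, but no experiment data is at risk.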
Declared in the GOC DB
  • Deploy Transfer Manager for Castor. Dates for this are now in the GOCDB.
    • 30 May 2012 09:00-11:00 GEN
    • 31 May 2012 09:00-11:00 CMS
    • 07 Jun 2012 09:00-11:00 ATLAS
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

The following items will particularly affect services:

  • Castor 2.1.11-9 update (provisionally - Wed 13th June).
  • Update LFC/FTS databases to Oracle 11 (provisionally - Wed 13th June).
  • Replacement of site access router on Tuesday 19th June.
  • Castor Oracle 11 update (provisionally - Wed 27th June).

Listing by category:

  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Wednesday 6th June).
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • One step still remains in extending the IP address range used for disk servers that will use the OPN.
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer. There is now a firm date of 19th June for upgrading the Site Access Router. The replacement of the UKLight Router will follow.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • A reconfiguration and maintenance operation will be scheduled for lcgwms03 (non-LHC WMS) from 1-7 June.
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plans to work on the main site power supply for 6 months, commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. This work has been postponed from the 14th May.


Entries in GOC DB starting between 23rd and 30th May 2012

There were no unscheduled outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 30/05/2012 09:00 | 30/05/2012 11:00 | 2 hours | Downtime to upgrade GEN Castor instance to use Transfer Manager
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 28/05/2012 09:00 | 28/05/2012 11:00 | 2 hours | Downtime to upgrade LHCb Castor instance to use Transfer Manager
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82100 | Red | Less Urgent | On hold | 2012-05-10 | 2012-05-28 | SNO+ | default se
82496 | Amber | Less Urgent | In Progress | 2012-05-24 | 2012-05-29 | T2K | Cannot delegate proxies to FTS
