Tier1 Operations Report 2010-05-05

RAL Tier1 Operations Report for 5th May 2010.

Review of Issues during week 28th April to 5th May 2010.

  • As reported last week, there were DNS issues from Monday (26th). These were finally resolved by Thursday morning (29th). The problem was limited to the lookup of some .uk addresses, and the cause was believed to lie outside RAL.
  • As reported last week, gdss290, part of CMSFarmRead, had been out of production since Monday 26th. It was returned to production on Friday (30th) with no data loss.
  • There was a planned outage of the Castor and batch systems, along with "At Risks" for LFC & FTS, on Wednesday 28th April. This lasted the scheduled four hours and the planned work was completed successfully.
  • There was an "At Risk" on the batch system at short notice last Thursday (29th) while 32-bit Castor libraries were installed on the SL5 worker nodes.
  • CE01 (CREAM CE) was unavailable for some days last week while it was drained for glexec to be configured.
  • A security challenge took place on Thursday (29th).

Current operational status and issues.

  • gdss397, part of ATLASDATADISK, failed over the bank holiday weekend and is still unavailable. There is a problem with the RAID controller card. The status of files on the server is (at present) unknown.
  • The FTS has been reporting bad checksums when transferring some files. This is not real data corruption, but a problem in checksum handling.

Declared in the GOC DB:

  • The NGS CE has been declared as an Outage from tomorrow (6th May) for decommissioning.

Advance warning:

The following items remain to be scheduled:

  • Kernel and glibc updates will need to be done on LFC, FTS & LHCb 3D Oracle database (RAC) nodes.
  • CEs will be taken out of production in rotation (one at a time) while glexec is configured.
  • Advance notice: a UPS test (implying a site At Risk) will probably take place during the next LHC technical stop.

Entries in GOC DB starting between 28th April and 5th May 2010.

There was 1 unscheduled entry in the GOC DB for this last week.

  • Extension to the time needed for configuration of CE01 for glexec. Required as the security challenge interfered with the original work plan.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce01.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 29/04/2010 17:00 | 30/04/2010 17:00 | 24 hours | Reconfiguring the CE to enable mappings for glexec
All CEs (batch) | UNSCHEDULED | AT_RISK | 29/04/2010 10:00 | 29/04/2010 13:00 | 3 hours | At Risk: 32-bit versions of Castor libraries will be installed on worker nodes.
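
The Duration column above can be cross-checked against the Start and End times. The following is a minimal sketch only: the function name `duration_hours` and the constant `TIME_FORMAT` are hypothetical, and the day/month/year 24-hour timestamp layout is assumed from the table rows rather than taken from any GOC DB interface.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the table rows (DD/MM/YYYY HH:MM).
TIME_FORMAT = "%d/%m/%Y %H:%M"


def duration_hours(start: str, end: str) -> float:
    """Return the length of a downtime entry in hours (hypothetical helper)."""
    started = datetime.strptime(start, TIME_FORMAT)
    ended = datetime.strptime(end, TIME_FORMAT)
    return (ended - started).total_seconds() / 3600


# The two entries listed in the table for this week.
entries = [
    ("lcgce01.gridpp.rl.ac.uk", "29/04/2010 17:00", "30/04/2010 17:00"),
    ("All CEs (batch)", "29/04/2010 10:00", "29/04/2010 13:00"),
]

for service, start, end in entries:
    print(f"{service}: {duration_hours(start, end):g} hours")
```

Under these assumptions the computed values agree with the 24 hours and 3 hours recorded in the table.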