Tier1 Operations Report 2010-06-16

RAL Tier1 Operations Report for 16th June 2010

Review of Issues during the week from 9th to 16th June 2010.

  • On Wednesday 9th June we saw saturation of the OPN link to CERN, although this data flow was not visible in the experiment dashboards. It was traced to a user running jobs on the batch farm at CERN and copying data directly from Castor at RAL.
  • On Friday (11th) a problem was found with some newly deployed disk servers (mainly for CMS), which were then set to a draining state. These servers had been deployed with SL5 & XFS. The problem was subsequently traced to a configuration issue on a port, and the servers were returned to normal service on Monday. However, one disk server (gdss539, part of CMSFarmRead) showed further problems and remains in 'draining'.
  • Over the weekend there was a problem with the top-BDII that affected (mainly, if not only) OPS VO SAM tests. This started during the night of Friday to Saturday and was resolved late on Saturday evening. The fix was to increase a size parameter within the BDII; a recent update to the BDII would have included this change, but it had not yet been applied. (There was a recurrence of BDII problems across the WLCG grid during Tuesday (15th) affecting many sites. Although this appeared similar, it is believed to have had a different cause, not related to the RAL Tier1.) A sketch of the kind of external check that catches this failure mode is given after this list.
  • The FTS was upgraded to version 2.2.4 this morning (16th June); a sketch of a post-upgrade smoke test also follows below.
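
The BDII incident above is the classic failure mode in which the information system outgrows an internal size limit and queries start failing or returning truncated results, which an external probe catches early. The following is a minimal sketch in Python using the ldap3 library; the hostname and the entry-count threshold are illustrative assumptions rather than the actual RAL monitoring, although port 2170 and the base DN o=grid are the standard BDII conventions.

  import sys
  from ldap3 import Server, Connection

  BDII_HOST = "lcgbdii.gridpp.rl.ac.uk"  # assumed hostname, for illustration only
  BDII_PORT = 2170                       # standard BDII LDAP port
  MIN_ENTRIES = 100                      # arbitrary sanity threshold

  def count_glue_services():
      # Anonymous bind: BDII queries are unauthenticated.
      conn = Connection(Server(BDII_HOST, port=BDII_PORT), auto_bind=True)
      # Count published GlueService entries under the standard GLUE base DN.
      conn.search(search_base="o=grid",
                  search_filter="(objectClass=GlueService)",
                  attributes=["GlueServiceEndpoint"])
      n = len(conn.entries)
      conn.unbind()
      return n

  if __name__ == "__main__":
      try:
          n = count_glue_services()
      except Exception as exc:
          print("BDII CRITICAL: query failed: %s" % exc)
          sys.exit(2)
      if n < MIN_ENTRIES:
          print("BDII WARNING: only %d GlueService entries returned" % n)
          sys.exit(1)
      print("BDII OK: %d GlueService entries" % n)

Run periodically from cron, a check like this would flag the failure as soon as query results shrank or the service stopped answering, rather than waiting for SAM tests to fail.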
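As a sanity check after an FTS upgrade such as the move to 2.2.4, a small test transfer can be submitted and polled using the standard gLite client tools. The Python sketch below wraps glite-transfer-submit and glite-transfer-status; the endpoint URL and the source/destination SURLs are placeholders rather than the real RAL values, a valid grid proxy is assumed, and the terminal state names are those of FTS 2.x.

  import subprocess
  import sys
  import time

  # Placeholder endpoint and SURLs, not the actual RAL configuration.
  FTS_ENDPOINT = ("https://fts.example.ac.uk:8443/"
                  "glite-data-transfer-fts/services/FileTransfer")
  SRC = "srm://source.example.ac.uk/dpm/example.ac.uk/home/dteam/testfile"
  DST = "srm://dest.example.ac.uk/dpm/example.ac.uk/home/dteam/testfile"

  def run(args):
      # Run a gLite client command and return its stdout, aborting on error.
      out = subprocess.run(args, capture_output=True, text=True)
      if out.returncode != 0:
          sys.exit("command failed: %s\n%s" % (" ".join(args), out.stderr))
      return out.stdout.strip()

  # Submit one transfer job and remember its job ID.
  job_id = run(["glite-transfer-submit", "-s", FTS_ENDPOINT, SRC, DST])
  print("submitted job", job_id)

  # Poll until the job reaches a terminal state.
  while True:
      state = run(["glite-transfer-status", "-s", FTS_ENDPOINT, job_id])
      print("state:", state)
      if state in ("Finished", "FinishedDirty", "Failed", "Canceled"):
          break
      time.sleep(30)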

Current operational status and issues.

  • gdss67 (CMSFarmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th May. The memory in this system was replaced, but it is still showing problems and is being worked on.
  • On Monday 24th May gdss207 (AliceTape) was removed from service due to possible file system corruption. The system is not yet back in service; it is being rebuilt.
  • On Thursday 3rd June gdss474 (LHCbMDst) failed with a read-only file system. Work is ongoing to resolve hardware issues on this system.
  • As reported above, gdss539 (CMSFarmRead) is still in a draining state.
  • Dust in the Computer Room - particularly the HPD room: As reported last week, this comes from lagging on cold water pipes, which is abrading in the high airflow in some locations close to the CRAC (Computer Room Air Conditioning) units. This is particularly bad in the HPD room. Investigations last Friday showed that fitting a membrane around the lagging is a workable solution. We await further information as to when this could be done.

Declared in the GOC DB

  • Preventative maintenance work on transformers in R89 will take place in two phases. In both cases an At Risk on the Tier1 site has been declared.
    • Monday - Wednesday 28-30 June. (Old date for LHC Technical Stop).
    • Monday - Thursday 19-22 July. (New dates for LHC Technical Stop).

Advance warning:

The following items remain to be scheduled:

  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Doubling of the network link to the network stack for the tape robot and the Castor head nodes.

Entries in GOC DB starting between 9th and 16th June 2010.

There were no unscheduled outages during the last week.

Service  | Scheduled? | Outage/At Risk | Start            | End              | Duration | Reason
FTS, FTM | SCHEDULED  | AT_RISK        | 16/06/2010 08:00 | 16/06/2010 12:00 | 4 hours  | At Risk during update to FTS version 2.2.4.