Tier1 Operations Report 2011-06-01

RAL Tier1 Operations Report for 1st June 2011

Review of Issues during the week from 25th May to 1st June 2011.

  • Wednesday (25th) GDSS293 (CMSFarmRead - D0T1) was returned to production. It had originally been removed from service following a read-only file system problem on 30th April. The system had been drained before the RAID array was rebuilt and the software re-installed.
  • Wednesday (25th) GDSS120 (LHCbRawRDst – D0T1) was taken out of production at the end of the afternoon, as it was showing a high load following a disk replacement. It was returned to service at the end of the following afternoon (26th).
  • Towards the end of last week (Thursday 26th) there was high load on the lhcbRawRdst and lhcbUser service classes. The situation on lhcbRawRdst was exacerbated while GDSS120 was out of service.
  • Thursday (26th) there was a large backlog of tape migrations for the GEN service class, reaching a peak of 15,000 files. This was resolved and the backlog had cleared by Friday morning. (A monitoring sketch is shown after this list.)
  • Changes made this last week:
    • Some increases in the Atlas maximum job start rates, as these are no longer limited by the Atlas software server.
    • Three disk servers moved from atlasStripInput to atlasScratchDisk on Thursday (26th).
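
The tape-migration backlog noted above cleared once the migrations caught up, but a backlog of this kind is straightforward to watch for automatically. The following is a minimal monitoring sketch only: the list_pending_migrations command, the genTape service-class label and the alert threshold are hypothetical placeholders for illustration, not the Castor query tooling actually used at RAL.

  #!/usr/bin/env python3
  # Illustrative sketch of a tape-migration backlog monitor.
  # The "list_pending_migrations" command and the service-class label
  # below are hypothetical; substitute the local Castor query tooling.
  import subprocess
  import sys
  import time

  SERVICE_CLASS = "genTape"   # hypothetical label for the GEN service class
  THRESHOLD = 10000           # alert once this many files await migration
  POLL_SECONDS = 600          # check every ten minutes

  def backlog_size(service_class):
      """Count files awaiting migration: one output line per file."""
      out = subprocess.check_output(
          ["list_pending_migrations", service_class], text=True)
      return len(out.splitlines())

  def main():
      while True:
          n = backlog_size(SERVICE_CLASS)
          if n > THRESHOLD:
              print("ALERT: %d files awaiting tape migration" % n,
                    file=sys.stderr)
          time.sleep(POLL_SECONDS)

  if __name__ == "__main__":
      main()

A check of this sort, run under cron or as a small daemon, gives early warning when a migration backlog starts to build.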

Current operational status and issues.

  • GDSS294 (CMSFarmRead - D0T1) failed with a read-only file system on the evening of Monday 9th May. It has been drained and is currently out of production.
  • On Friday (20th) GDSS365 (CMSTemp - D1T0) reported a read-only file system and was taken out of production. It has since been drained and remains out of production for further tests.
  • Since CE07 was taken out of service for re-installation as a CREAM CE, LHCb are not reporting any CE availability via GridView. As a result we no longer see the effect of intermittent problems with CVMFS (which still occur) on this availability view.
  • We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly; a watchdog sketch is shown after this list.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and from CERN (i.e. asymmetric performance). CMS appears to see the same effect, although between RAL and foreign Tier2s. The investigation into this is still ongoing.
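
The regular BDII restarts mentioned above lend themselves to a simple watchdog. The sketch below probes the BDII with a trivial LDAP query and restarts it on failure. The port (2170) and base DN (o=grid) are the conventional site-BDII settings, while the "service bdii restart" command and the timeout are assumptions for illustration, not a description of the script actually in use at RAL.

  #!/usr/bin/env python3
  # Illustrative site-BDII watchdog sketch.  The restart command and
  # timeout below are assumptions; adjust to the local service manager.
  import subprocess

  BDII_URI = "ldap://localhost:2170"   # standard site-BDII LDAP endpoint
  BASE_DN = "o=grid"                   # top of the Glue information tree
  TIMEOUT = 30                         # seconds before declaring it hung

  def bdii_responds():
      """Return True if the BDII answers a trivial base-scope query."""
      try:
          subprocess.run(
              ["ldapsearch", "-x", "-H", BDII_URI,
               "-b", BASE_DN, "-s", "base"],
              check=True, timeout=TIMEOUT,
              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
          return True
      except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
          return False

  def main():
      if not bdii_responds():
          # Assumed restart command for an init script named "bdii".
          subprocess.run(["service", "bdii", "restart"], check=False)

  if __name__ == "__main__":
      main()

Run from cron every few minutes, a check like this keeps the information system responding while the underlying intermittent fault is investigated.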

Declared in the GOC DB

  • CE07 is out of production while it is upgraded to a CREAM CE.

Advanced warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10.
  • Address the permissions problem regarding Atlas user access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 25th May and 1st June 2011.

There were no unscheduled entries in the GOCDB for this period.

Service  | Scheduled? | Outage/At Risk | Start            | End              | Duration         | Reason
lcgce07  | SCHEDULED  | OUTAGE         | 24/05/2011 09:00 | 07/06/2011 16:00 | 14 days, 7 hours | Drain and decommission as lcg-CE