Tier1 Operations Report 2011-06-29


RAL Tier1 Operations Report for 29th June 2011

Review of Issues during the week from 22nd to 29th June 2011.

  • Friday morning (24th): the rate of batch job starts for Atlas & CMS was lower than expected. Deleting a couple of old jobs at the head of the batch queue fixed this.
  • Friday afternoon (24th): there was a backlog of CMS migrations to tape. These were many small files and the migration was not being triggered; this was resolved by a manual intervention.
  • Friday (24th): we reported a single lost file to LHCb (from LHCbDst). This was an old file and the loss was picked up by the regular checksum validation process.
  • Over the weekend there were two database problems that caused Castor outages.
    • Saturday (25th) during the evening: at around 20:38 all Castor database instances started to produce a very large number of trace files. The databases themselves were working fine until these trace files filled the OS file systems on all 5 instances. When the audit areas on the databases fill up, the databases hang as a precautionary measure until space has been cleared. Clearing out the older trace files resolved the problem (a sketch of this kind of clean-up is given after this list). The problem affected Castor for a few hours, although the effect appeared different for each VO.
    • Sunday (26th): the Castor name server was affected by a database problem. An outage was declared in the GOC DB for 3 hours (05:30 to 08:30), and slightly longer for LHCb (to 09:20). The cause was database locking from 04:40, with particular Castor internal jobs (Space Monitor) blocking each other. From the logs the database did appear to keep running, but by 07:45 it had hung and had to be stopped. It is probable that the two database problems over the weekend were linked, and a service request has been opened with Oracle to try and understand the cause.
  • Monday (27th): there was a problem with the CMS Castor instance that started around midnight and was resolved around midday by a restart of both the LSF scheduler and the Job Manager within CMS Castor.
  • Disk Server Issues:
    • Over the weekend there were a large number (16) of 'soft' disk errors, with a peak occurring on Friday evening. These errors did not cause the disks to fail, but the affected drives have all been replaced, which caused some server outages.
    • GDSS256 (AliceTape) had a read-only file system on 17th June and was taken out of production pending investigation (there were no migration candidates on the system). It was returned to service on Tuesday (28th June).
    • On Monday (27th) the Atlas disk servers gdss211 & gdss233 (both AtlasGroupDisk - D1T0) were taken out of production with multiple drive failures; gdss233 was returned to production on Tuesday (28th) and gdss211 on Wednesday morning (29th).
    • On Monday afternoon gdss552 (AtlasDataDisk - D1T0) became unresponsive owing to a network card issue and was out of production for a short time (30 minutes).
    • On Tuesday (28th) gdss354 (AtlasDataDisk - D1T0), which had had a drive replaced, encountered further problems. After its disk controller was also replaced it was returned to production in 'read only' mode the same day.
  • Changes made this last week:
    • On Wednesday (22nd June) the tape migration policy for LHCb was updated so that a migration is only triggered once there is 100GB to move, or once files are more than 6 hours old, in order to make tape writes more efficient. Once the backlog had cleared this was changed to 20GB or 1 hour, whichever comes first (a sketch of this trigger logic is given after this list).
    • Castor client version 2.1-10 has been rolled out across the worker nodes.
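
The fix for Saturday's trace-file incident was simply to remove the older trace files so that the database audit areas had free space again. The short Python sketch below illustrates that kind of clean-up in outline only; the trace directory, file suffix and retention period are assumptions for illustration, not the values used on the production Castor databases.

  # Illustrative clean-up of old Oracle trace files.
  # The directory path and 7-day retention are assumptions, not production values.
  import os
  import time

  TRACE_DIR = "/oracle/admin/castor/udump"   # hypothetical trace/audit area
  MAX_AGE_DAYS = 7                           # hypothetical retention period

  cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600
  for name in os.listdir(TRACE_DIR):
      path = os.path.join(TRACE_DIR, name)
      # Remove only regular trace files (*.trc) older than the cutoff.
      if name.endswith(".trc") and os.path.isfile(path) and os.path.getmtime(path) < cutoff:
          os.remove(path)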
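
The LHCb tape-migration change described above amounts to a "size or age, whichever comes first" trigger. The sketch below shows that logic with the post-backlog thresholds (20GB or 1 hour); the function and variable names are illustrative and do not come from the actual Castor policy code.

  # Sketch of the "size or age, whichever comes first" migration trigger.
  # Thresholds reflect the post-backlog policy (20GB or 1 hour); names are illustrative.
  SIZE_THRESHOLD_BYTES = 20 * 1024**3   # 20 GB
  AGE_THRESHOLD_SECONDS = 1 * 3600      # 1 hour

  def should_migrate(total_bytes_queued, oldest_file_age_seconds):
      """Trigger a tape migration when enough data has queued or the oldest queued file is too old."""
      return (total_bytes_queued >= SIZE_THRESHOLD_BYTES
              or oldest_file_age_seconds >= AGE_THRESHOLD_SECONDS)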

Current operational status and issues.

  • We have observed some packet loss on the main network link from the RAL site (not the route used by our data). This is being followed up by the RAL Networking team. There is a planned intervention to try and resolve this on 5th July.
  • LHCb are still experiencing some issues staging files for the LHCbRawRDst service class. Investigations are ongoing.
  • The following points are unchanged from previous reports:
    • Since CE07 was taken out of service for re-installation as a CREAM CE, neither Alice nor LHCb includes any of our CEs in its availability calculations. As a result our LHCb availability is no longer affected by intermittent problems with the LHCb CVMFS test, although we know this test does still fail from time to time.
    • We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly.
    • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign T2s). The pattern of asymmetrical flows appears complex and is being actively investigated.

Declared in the GOC DB

  • Wednesday 29th June to Tuesday 5th July - LCGCE06 - drain and decommission as lcg-CE
  • Wednesday 29th June to Monday 4th July - LCGCE07 - draining and VO re-configuration

Advance warning:

The following items are being discussed and are still to be formally scheduled and announced:

  • On Tuesday 5th July the following are planned:
    • Update to Castor version 2.1.10-1 in order to prepare for the higher capacity "T10KC" tapes.
    • Update to the UKLight Router and intervention on the problematic link between the Site Access Router and the Firewall.
    • Doubling of the link between the C300 and one of the switch stacks within the Tier1.
    • Move many batch worker nodes to a new IP address range.
    • Quattorization of the CMS Squids.
  • Updates to the Site Router (the Site Access Router) are required.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Further updates to CEs (gLite updates on CE09 outstanding).

Entries in GOC DB starting between 22nd and 29th June 2011.

There were two unscheduled entries in the GOC DB for this period, both relating to the Castor outage on Sunday morning.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce07 | SCHEDULED | OUTAGE | 29/06/2011 15:30 | 04/07/2011 13:00 | 4 days, 21 hours and 30 minutes | draining and VO re-configuration
lcgce06 | SCHEDULED | OUTAGE | 29/06/2011 12:00 | 05/07/2011 11:00 | 5 days, 23 hours | Drain and decommission as lcg-CE
srm-lhcb | UNSCHEDULED | OUTAGE | 26/06/2011 06:30 | 26/06/2011 10:20 | 3 hours and 50 minutes | Castor LHCb issues under investigation
All Castor (all srm end points) | UNSCHEDULED | OUTAGE | 26/06/2011 06:30 | 26/06/2011 09:30 | 3 hours | Problems on Oracle DB behind Castor services