Tier1 Operations Report 2010-09-08

RAL Tier1 Operations Report for 8th September 2010

Review of Issues during the week from 1st to 8th September 2010.

  • Gdss547 (AtlasScratchDisk) was found on 23rd August to be unable to communicate with CERN. It was put back into production in 'draining' mode on Thursday 26th August. On 7th September it was removed from service (for use in test systems).
  • On Thursday 2nd September gdss473 (LHCbMdst) was taken out of production for a few hours after reporting hardware errors shortly after being put into service. The RAID card was replaced and the server was returned to production within a couple of hours.
  • Gdss379 (lhcbuser) went down around mid-day on Saturday (4th Sep). It was rebooted and returned to service later that day.
  • On Monday there were problems on CE07, which were fixed by a restart; the system was using a very large amount of memory and taking a very long time to process jobs. CMS and LHCb also encountered job failures owing to CE problems over the weekend.
  • Streaming of the 3D databases from CERN was stopped for a while from late Monday. This was to help resolve a problem at SARA.
  • There was a genuine failure of the RAL-CERN OPN link last Tuesday evening, on the same day that the backup link, and failover to it, was commissioned. (Note added after the meeting.)
  • Maintenance and update of WMS01 (to glite-WMS 3.1.29) completed ahead of schedule on 7th Sep.
  • Planned update to site-BDII completed OK on 7th September.
  • Two new LFC front ends behind the lfc.gridpp.rl.ac.uk alias (the non-LHC LFC instance) were brought into use on 7th Sep. These are located in R89 and will replace the one in the Atlas building in due course. (A small sketch of how to check which front ends sit behind the alias follows this list.)
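
The following is a minimal sketch, in Python, of how one might confirm which front ends currently sit behind the lfc.gridpp.rl.ac.uk alias. Only the alias name is taken from the report above; the use of simple forward and reverse DNS lookups is an illustrative assumption, not the procedure actually used at RAL.

  # Minimal sketch: list the addresses currently behind the lfc.gridpp.rl.ac.uk
  # DNS alias, so the individual LFC front ends in the rotation can be seen.
  # Only the alias name is taken from the report; the rest is illustrative.
  import socket

  ALIAS = "lfc.gridpp.rl.ac.uk"

  def frontends_behind_alias(alias):
      """Return the sorted set of IPv4 addresses the alias resolves to."""
      infos = socket.getaddrinfo(alias, None, socket.AF_INET, socket.SOCK_STREAM)
      return sorted({info[4][0] for info in infos})

  if __name__ == "__main__":
      for addr in frontends_behind_alias(ALIAS):
          # Reverse-resolve each address so the front-end host names
          # (e.g. the new R89 machines) can be recognised by eye.
          try:
              host = socket.gethostbyaddr(addr)[0]
          except socket.herror:
              host = "unknown"
          print(addr, host)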

Current operational status and issues.

  • Gdss280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August. Investigations into this disk server are still ongoing.
  • Gdss81 (AtlasDataDisk) developed a problem (read-only file system) on Wednesday 25th August and was taken out of production. Investigations showed disk array problems and FSCK errors on one of the partitions. It was returned to production, but in 'draining' mode, on Friday 27th.
  • Top-BDII problems, reported over the weekend and into Monday: the top-BDII was returning no information. Investigations have shown problems with the updating process within the BDII. A workaround is in place and work is underway to update to version 3.2, which appears to be working well elsewhere. (A sketch of the kind of query used to confirm the BDII is publishing data follows this list.)
  • Transfer problems to a new batch of disk servers at NorduGrid, affecting only RAL, are ongoing.
  • globus-url-copy transfers from CERN to SL5 64-bit disk servers at RAL do not work. A fix (downgrading globus-gridftp-server from 1.10.1 to 1.8.1) is being tested. This is the cause of the problem on gdss547 noted above.
  • There is a problem with CMS's use of 'lazy download', which causes excessive memory usage on disk servers. This was initially found on the Castor 2.1.9 test instance but has now been seen on the production Castor 2.1.7 service.
  • Performance issues on Castor disk servers for LHCb: last week we reported that multiple servers in the LHCbMdst service class had failed (the servers stopped, but no obvious reason for the failures has been found). A set of incorrect network tuning parameters was identified, and corrected parameters were rolled out to the affected batch of servers on Thursday 2nd Sep. Since then these servers have remained up under heavy load. Over the weekend there was a somewhat similar failure of a different LHCb server (gdss379, referred to above); this is thought to have a different, but as yet unknown, cause. Following the resolution of the server crashes it became clear that it was not possible to run many LHCb batch jobs owing to a severe bottleneck accessing the Castor disk. In order to achieve reasonable job efficiencies the maximum number of simultaneous LHCb batch jobs was reduced (to 200). Investigations into the bottleneck are ongoing. (A sketch of checking a disk server's network tuning parameters follows this list.)
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly once per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. I had reported last week that the cause appeared to be related to temperature; however, further investigations suggest it is related to earth leakage detection. Two (of the four) transformers still require checking as part of the planned work following the first failure of TX2.
  • There has been some discussion about batch scheduling as regards long jobs (Alice in this case) filling the farm during a 'quiet' period and preventing other VOs' jobs from starting until those jobs have finished. This is compounded by further contention between the other VOs (Atlas, CMS, LHCb) as job slots become free. Ways of improving this are still being reviewed. (A toy illustration of a fairshare-style approach follows this list.)
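
Below is a minimal sketch, in Python, of the sort of check that shows up the "returning no information" symptom on a top-BDII: an anonymous LDAP query that counts published GlueService entries. The host name, port and base DN are assumptions based on common BDII conventions rather than values taken from this report, and this is not the Tier1's actual monitoring.

  # Minimal sketch of a check for the 'returning no information' symptom:
  # query the top-BDII anonymously over LDAP and count published GlueService
  # entries. Host name, port and base DN are conventional assumptions, not
  # values taken from the report.
  import subprocess

  BDII_HOST = "lcg-bdii.gridpp.rl.ac.uk"   # hypothetical top-BDII host name
  BDII_PORT = 2170                         # conventional BDII LDAP port
  BASE_DN = "o=grid"                       # conventional GLUE 1.x base DN

  def bdii_entry_count(host=BDII_HOST, port=BDII_PORT, base=BASE_DN):
      """Run an anonymous ldapsearch against the BDII and count 'dn:' lines."""
      cmd = ["ldapsearch", "-x", "-LLL",
             "-H", "ldap://%s:%d" % (host, port),
             "-b", base,
             "(objectClass=GlueService)", "dn"]
      result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
      if result.returncode != 0:
          raise RuntimeError("ldapsearch failed: " + result.stderr.strip())
      return sum(1 for line in result.stdout.splitlines() if line.startswith("dn:"))

  if __name__ == "__main__":
      count = bdii_entry_count()
      print("GlueService entries published:", count)
      if count == 0:
          print("WARNING: top-BDII is returning no information")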
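
The next sketch illustrates, again hypothetically, how one might verify network tuning parameters on a disk server by comparing kernel settings against a desired set. The sysctl names are standard Linux ones; the target values are purely illustrative and are not the parameters that were rolled out to the LHCb servers.

  # Minimal sketch, not the Tier1's actual procedure: compare a disk server's
  # kernel network tuning parameters against a desired set. The sysctl names
  # are standard Linux ones; the target values are purely illustrative.
  EXPECTED = {
      "/proc/sys/net/core/rmem_max": "4194304",
      "/proc/sys/net/core/wmem_max": "4194304",
      "/proc/sys/net/ipv4/tcp_rmem": "4096 87380 4194304",
      "/proc/sys/net/ipv4/tcp_wmem": "4096 65536 4194304",
  }

  def check_tuning(expected=EXPECTED):
      """Return (path, current, wanted) for every parameter that differs."""
      mismatches = []
      for path, wanted in expected.items():
          with open(path) as f:
              current = " ".join(f.read().split())   # normalise tabs/whitespace
          if current != wanted:
              mismatches.append((path, current, wanted))
      return mismatches

  if __name__ == "__main__":
      for path, current, wanted in check_tuning():
          print("%s: have '%s', want '%s'" % (path, current, wanted))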
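
Finally, a toy Python illustration of the fairshare idea behind the batch scheduling discussion: a VO that already holds far more than its target share of job slots is passed over when a slot becomes free. The shares, usage figures and selection rule are invented for illustration and do not describe the production scheduler configuration.

  # Toy illustration only (not the production scheduler configuration): a VO
  # that already holds far more than its target share of job slots is passed
  # over when a slot frees up. Shares and usage figures below are invented.
  TOTAL_SLOTS = 1000

  target_share = {"alice": 0.10, "atlas": 0.45, "cms": 0.30, "lhcb": 0.15}
  running      = {"alice": 700,  "atlas": 150,  "cms": 100,  "lhcb": 50}

  def next_vo_to_schedule():
      """Pick the VO whose usage is furthest below its fairshare target."""
      def deficit(vo):
          used_fraction = running[vo] / float(TOTAL_SLOTS)
          return target_share[vo] - used_fraction   # positive means under-used
      return max(target_share, key=deficit)

  if __name__ == "__main__":
      # With Alice far over its 10% share, the next free slot goes elsewhere.
      print("Next free slot goes to:", next_vo_to_schedule())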

Declared in the GOC DB

  • 9-16 September. WMS02. Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.

Advance warning:

The following items remain to be scheduled/announced:

  • Weekend power outage in the Atlas building, 2/3 October. Plans are under way for the necessary moves to ensure continuity for any services still using equipment in the Atlas building.
  • Glite update on worker nodes.
  • Update firmware in RAID controller cards for a batch of disk servers.
  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-castor databases.
  • During next LHC technical stop (18-21 October): Probable checks on transformers.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Monday 13 September 10:00-12:00 At Risk for switch to local Nameserver on LHCb instance.
    • Tuesday 14 September 10:00-12:00 At Risk for switch to local Nameserver on GEN instance.
    • Wednesday 15 September 10:00-12:00 At Risk for switch to local Nameserver on CMS instance.
    • Thursday 16 September 10:00-12:00 At Risk for switch to local Nameserver on Atlas instance.
    • Upgrade LHCb - during the week beginning 27 September
    • Upgrade Gen (including ALICE) - during the week beginning 25 October
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 1st and 8th September 2010.

There were no unscheduled entries in the GOC DB for this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
site-bdii.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 07/09/2010 08:30 | 07/09/2010 17:00 | 8 hours and 30 minutes | At Risk while switching over to a quattorised pair of site-level BDIIs.
lcgwms01 | SCHEDULED | OUTAGE | 01/09/2010 10:00 | 07/09/2010 14:40 | 6 days, 4 hours and 40 minutes | Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.