Tier1 Operations Report 2010-09-22


RAL Tier1 Operations Report for 22nd September 2010

Review of Issues during the week from 15th to 22nd September 2010.

  • Transfer problems to a new batch of disk servers at NorduGrid, affecting only RAL. Fixed on Friday (17th Sep.) when a linecard terminating the RAL primary link on the CERN router was replaced.
  • There have been a number of problems relating to the CEs. These are being followed up and mitigation is in place. The longer-term fix will be a gLite update, which is being combined with a move to Quattorize the CEs.
  • On Friday (17th) srm-205 (LHCb) was rebooted. LHCb had reported at the daily WLCG meeting that around half their jobs were stalling. This has been traced to a memory issue. The amount of memory in the system will be increased.
  • On Monday (20th) it was found that disk server GDSS81 (Atlas) was disabled rather than in draining mode. This was fixed by putting it back into draining mode. The change is believed to have been triggered when the Atlas nameserver configuration was changed the previous Thursday.
  • Yesterday (Tuesday 21st) disk server GDSS364 (CMS Temp) was found to have a hardware problem. It was returned to production in draining mode at the end of that afternoon pending further diagnostics.
  • On Tuesday morning (21st) the RAL Site Firewall was updated. We declared an outage from 07:30 to 10:00. This went well, although there was a delay restarting FTS transfers: there had been a separate hardware problem on the machine that hosts the FTS agents, and the opportunity was taken to fix this during the scheduled outage. However, the machine developed further problems and the agents were switched across to the hot standby machine.
  • The final Castor reconfigurations to use a local NameServer have been completed. (The last one was Atlas, on Thursday 16th Sep.)

Current operational status and issues.

  • GDSS280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August. As reported last week, this server was returned to production on the morning of 15th September. It again gave FSProbe errors and was taken back out of production the next day (16th). 30 un-migrated files were lost. A review of the problems encountered will be followed up via a post mortem.
  • Problem: globus-url-copy transfers from CERN to the SL5 64-bit disk servers at RAL do not work. A fix (downgrading globus-gridftp-server from 1.10.1 to 1.8.1) is being tested on several servers; an illustrative test sketch follows this list.
  • Performance issues on Castor disk servers for LHCb: this is being kept under observation. We are normally running a maximum of 800 LHCb batch jobs and checking the job success rates. There was an operational problem over the weekend: the maximum number of jobs had been raised to 1000 and, on Sunday, when LHCb submitted a large amount of batch work, the jobs showed very high failure rates. The number of LHCb batch jobs has been reduced (to 250 at present) while further investigations take place.
  • We are aware of and taking appropriate action regarding a recent Security alert.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. It was reported last week that the cause appeared to be related to temperature; however, further investigations suggest the cause is related to earth-leakage detection. Two (of the four) transformers still require checking as part of the planned work following the first failure of TX2.
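
As an illustration of the globus-url-copy testing mentioned above, the minimal sketch below attempts a small GridFTP transfer to each disk server carrying the downgraded globus-gridftp-server and reports the result. The hostnames and target path are hypothetical placeholders, a valid grid proxy (e.g. from voms-proxy-init) is assumed, and the real verification also involves transfers initiated from CERN rather than locally.

  #!/usr/bin/env python
  # Minimal illustrative sketch: attempt a small GridFTP put to each disk server
  # under test and report whether globus-url-copy succeeds.
  # Hostnames and the remote path are hypothetical placeholders.
  import subprocess

  TEST_SERVERS = ["gdss-test1.gridpp.rl.ac.uk", "gdss-test2.gridpp.rl.ac.uk"]
  LOCAL_FILE = "file:///tmp/gridftp-test.dat"          # small local test file
  REMOTE_PATH = "/exportstage/test/gridftp-test.dat"   # hypothetical target path

  for host in TEST_SERVERS:
      dest = "gsiftp://%s:2811%s" % (host, REMOTE_PATH)  # 2811 = default GridFTP port
      rc = subprocess.call(["globus-url-copy", "-vb", LOCAL_FILE, dest])
      if rc == 0:
          print "%s: transfer OK" % host
      else:
          print "%s: transfer FAILED (exit code %d)" % (host, rc)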

Declared in the GOC DB

  • Mon 27 - Wed 29 Sep: Castor upgrade for LHCb. (Arrangements for LHCb batch and file transfers still need to be confirmed.) It is proposed that ahead of this outage we will:
    • Drain and turn off the CEs used by LHCb. (This will leave the other LHC experiments with one CE each; an illustrative sketch of the drain step follows this list.)
    • Drain out and stop LHCb transfers in the FTS.
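
For illustration of the CE drain step above: draining typically amounts to closing the relevant batch queue so that running jobs finish while no new work is accepted. The minimal sketch below assumes a Torque batch server behind the CEs and a queue named "lhcb"; both the scheduler and the queue name are assumptions for illustration only, and the FTS drain is a separate step handled by the FTS administrators.

  #!/usr/bin/env python
  # Illustrative sketch only: close a (hypothetical) LHCb batch queue so that no
  # new jobs are accepted, then show the queue summary so the drain can be watched.
  import subprocess

  QUEUE = "lhcb"  # hypothetical queue name; the real queue name may differ

  # Stop the queue accepting new submissions; queued and running jobs continue.
  subprocess.call(["qmgr", "-c", "set queue %s enabled = False" % QUEUE])

  # Show per-queue job counts; repeat until the queue is empty, then stop the CE.
  subprocess.call(["qstat", "-Q", QUEUE])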

Advance warning:

The following items remain to be scheduled/announced:

  • Weekend power outage in the Atlas building, 2/3 October. Plans are well under way for the necessary moves to ensure continuity for any services still using equipment in the Atlas building.
  • Glite update on worker nodes - ongoing.
  • Wednesday 20th October (during LHC technical stop): UPS maintenance.
    • During next LHC technical stop (18-21 October): Probable checks on transformers.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade Gen (including ALICE) - during the week beginning 25 October
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 15th and 22nd September 2010.

There was one unscheduled entry in the GOC DB during the last week. This was for the FTS service and was caused by hardware problems on the system that hosts the FTS agents.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts | UNSCHEDULED | OUTAGE | 21/09/2010 10:00 | 21/09/2010 11:00 | 1 hour | Retrospective outage for FTS. The FTS service at RAL failed during our scheduled outage and we had to swap the FTS service to a hot standby node. This took about 1 hour until 11:00 local time.
Whole Site | SCHEDULED | OUTAGE | 21/09/2010 07:30 | 21/09/2010 10:00 | 2 hours and 30 minutes | Outage declared for whole site while updates applied to site firewall.
lcgfts | SCHEDULED | OUTAGE | 21/09/2010 06:30 | 21/09/2010 07:30 | 1 hour | Drain of FTS ahead of outage on site for updates to firewall.
lcgce01, lcgce06, lcgce08, srm-atlas | SCHEDULED | AT_RISK | 16/09/2010 10:00 | 16/09/2010 12:00 | 2 hours | Atlas Castor instance and dependent services at risk while we change to using a local nameserver.
lcgce01, lcgce06, lcgce07, srm-cms | SCHEDULED | AT_RISK | 15/09/2010 10:00 | 15/09/2010 12:00 | 2 hours | CMS Castor instance and dependent services at risk while we change to using a local nameserver.
lcgwms02 | SCHEDULED | OUTAGE | 09/09/2010 15:00 | 15/09/2010 09:59 | 5 days, 18 hours and 59 minutes | Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.