Tier1 Operations Report 2010-09-15

RAL Tier1 Operations Report for 15th September 2010

Review of Issues during the week from 8th to 15th September 2010.

  • Gdss280 (CMSFarmRead) showed FSProbe errors and was taken out of production on Thursday 19th August. This server was returned to production this morning (15 Sep).
  • The Top-BDII problems reported last week, when the top-BDII was returning no information, have been resolved. At the time of last week's meeting a workaround was in place; by the end of last Thursday (9th September) all five nodes had been updated to version 3.2 and the problem was resolved.
  • There was a problem on CE01 on Friday and Saturday and an outage was declared for it. The solution was to recreate its internal database.
  • On Tuesday morning the RAL Site Firewall was rebooted. Problems have been seen, with round trip times of up to 1 second reported. However, we have seen no evidence that Tier1 operations were impacted.
  • This morning, Wed 15th Sep, ACLs were changed on the Site Access Router to fix a problem whereby, owing to an old error, the open port range for gridftp transfers was insufficiently wide. This problem came to light as a result of Castor 2.1.9 testing (see the sketch after this list).
  • A problem was found on disk server GDSS379, which had not been returned to production correctly following an intervention on 4th September. This was corrected this morning (15th Sep).
  • The problem reported last week with the use of 'lazy download' by CMS causing excessive memory usage on disk servers has been fixed.
  • Castor has been reconfigured to use a local NameServer for the LHCb, GEN and CMS instances. (Atlas scheduled for tomorrow). There was a problem following the LHCb move that was fixed shortly afterwards and the procedures updated so this did not occur for the other instances.
  • The planned WMS02 upgrade was completed successfully today.
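
GridFTP data channels use a configurable port range (commonly set on the servers via the GLOBUS_TCP_PORT_RANGE environment variable), and the router/firewall ACLs must cover the same range for transfers to succeed. The following is a minimal, illustrative sketch of the kind of consistency check involved; the port numbers are placeholders and not the range actually used at RAL.

  # Minimal consistency check (hypothetical values): compare the GridFTP data-channel
  # port range configured on a disk server with the range opened in the router ACL.
  import os

  # Range assumed to be opened in the Site Access Router ACL (illustrative numbers only).
  ACL_PORT_RANGE = (20000, 25000)

  def parse_globus_range(value):
      """Parse a GLOBUS_TCP_PORT_RANGE value such as '20000,25000' or '20000 25000'."""
      lo, hi = value.replace(",", " ").split()
      return int(lo), int(hi)

  def check(acl_range, globus_value):
      lo, hi = parse_globus_range(globus_value)
      if lo < acl_range[0] or hi > acl_range[1]:
          print(f"MISMATCH: GridFTP uses {lo}-{hi}, ACL only allows {acl_range[0]}-{acl_range[1]}")
      else:
          print(f"OK: GridFTP range {lo}-{hi} is inside ACL range {acl_range[0]}-{acl_range[1]}")

  if __name__ == "__main__":
      # On a real server this value would come from the environment or the gridftp configuration.
      check(ACL_PORT_RANGE, os.environ.get("GLOBUS_TCP_PORT_RANGE", "20000,25000"))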

Current operational status and issues.

  • Gdss81 (AtlasDataDisk) had a problem (read-only file system) on Wednesday 25th August and was taken out of production. Investigations showed disk array problems, and FSCK errors on one of the partitions. It was returned to production, but in 'draining' mode, on Friday 27th. It will be left in this mode as it is an old server that will be retired from use in due course.
  • There are ongoing transfer problems to a new batch of disk servers at NorduGrid, affecting only RAL. Networking teams have been looking at this, but it is not yet resolved.
  • There have been a number of problems relating to CEs. These are being followed up. The longer term fix will be a glite update which is being combined with a move to Quattorize the CEs.
  • There is a problem whereby globus-url-copy transfers from CERN to SL5 64-bit disk servers at RAL do not work (see the transfer-test sketch after this list). A fix (downgrading globus-gridftp-server from 1.10.1 to 1.8.1) is being tested on several servers.
  • Performance issues on Castor Disk Servers for LHCb. This is being kept under observation. We are normally running a maximum of 800 LHCb batch jobs and checking the job success rates. There was an operational error on Monday that allowed more than 800 batch jobs to run, but this was resolved. We believe the problem of servers crashing has been fixed.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LSC/FTS/3D services was moved to UPS power as a test of whether the electrical noise has been reduced sufficiently. The test is ongoing, but some errors (roughly one per week) have been seen.
  • On Saturday 10th July transformer TX2 in R89 tripped. This transformer is the same one that tripped some months ago, and for which remedial work was undertaken. I had reported last week that the cause appeared to be related to temperature. However, further investigations suggest the cause is related to earth leakage detection. Two (of the four) transformers still require checking out as part of planned work following the first failure of TX2.
  • There has been some discussion over the batch scheduling as regards long jobs (Alice in this case) filling the farm during a 'quiet' period and preventing other VOs' jobs from starting until those jobs have finished. This is compounded by further contention between the other VOs (Atlas, CMS, LHCb) as job slots become free. Ways of improving this are still being reviewed. At the moment we are running with a low rate of job starts for Alice but an elevated maximum job count.
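
As a reference point for the globus-url-copy problem noted above, the sketch below shows one way such a transfer could be exercised end-to-end. The hostnames and file paths are placeholders rather than real CERN or RAL endpoints, and a valid grid proxy is assumed; this is an illustration of the test, not the actual procedure used.

  # Rough sketch of exercising a gsiftp transfer like the failing CERN -> RAL SL5 case.
  # Hostnames and paths below are placeholders, not real endpoints; a valid grid proxy
  # (e.g. from voms-proxy-init) is assumed to be in place.
  import subprocess

  SOURCE = "gsiftp://source.example.cern.ch/tmp/testfile"       # placeholder source
  DEST   = "gsiftp://diskserver.example.rl.ac.uk/tmp/testfile"  # placeholder SL5 64-bit server

  def test_transfer(source, dest):
      """Run globus-url-copy verbosely and report whether the transfer succeeded."""
      result = subprocess.run(
          ["globus-url-copy", "-vb", source, dest],
          capture_output=True, text=True,
      )
      if result.returncode == 0:
          print("transfer OK")
      else:
          print("transfer FAILED")
          print(result.stderr)
      return result.returncode == 0

  if __name__ == "__main__":
      test_transfer(SOURCE, DEST)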

Declared in the GOC DB

  • Thursday 16 September 10:00-12:00 At Risk for switch to local Nameserver on Atlas instance.
  • Mon 27 - Wed 29 Sep: Castor upgrade on LHCb. (Still need to confirm arrangements for LHCb batch and file transfers).

Advance warning:

The following items remain to be scheduled/announced:

  • Weekend Power Outage in Atlas building 2/3 October. Plans under way for the necessary moves to ensure continuity for any services still using equipment in the Atlas building.
  • Tuesday 21st Sep. RAL Site firewall Firmware update.
  • Glite update on worker nodes - ongoing.
  • Wednesday 20th October (during LHC technical stop): UPS maintenance.
    • During next LHC technical stop (18-21 October): Probable checks on transformers.
  • Monday 13th December (just after LHC 2010 run ends): UPS test.
  • Castor Upgrade to 2.1.9.
    • Upgrade Gen (including ALICE) - during the week beginning 25 October
    • Upgrade CMS - during the week beginning 8 November
    • Upgrade ATLAS - during the week beginning 22 November

Entries in GOC DB starting between 8th and 15th September 2010.

There were three unscheduled entries in the GOC DB for this last week. Two relate to the firewall reboot and one to the problem on CE01.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce01, lcgce06, lcgce07, srm-cms | SCHEDULED | AT_RISK | 15/09/2010 10:00 | 15/09/2010 12:00 | 2 hours | CMS Castor instance and dependent services at risk while we change to using a local nameserver
lcgce01, lcgce03, lcgce05, lcgce07, srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-superb, srm-t2k | SCHEDULED | AT_RISK | 14/09/2010 10:00 | 14/09/2010 12:00 | 2 hours | Alice and Gen Castor instances and dependent services at risk while we change to using a local nameserver
Whole Site | UNSCHEDULED | AT_RISK | 14/09/2010 07:30 | 14/09/2010 08:30 | 1 hour | At Risk while site firewall is rebooted.
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 14/09/2010 06:30 | 14/09/2010 08:30 | 2 hours | Drain of FTS ahead of reboot of site firewall.
lcgce01, lcgce07, lcgce08, srm-lhcb | SCHEDULED | AT_RISK | 13/09/2010 10:00 | 13/09/2010 12:00 | 2 hours | LHCb Castor instances and dependent services at risk while we change to using a local nameserver
lcgce01.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 10/09/2010 17:52 | 13/09/2010 17:52 | 3 days | System is suffering high load and is not processing jobs correctly
lcgwms02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/09/2010 15:00 | 15/09/2010 09:59 | 5 days, 18 hours and 59 minutes | Maintenance and update (glite-WMS 3.1.29). Includes time for drain ahead of intervention.
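
As a quick cross-check of the durations quoted above, the short sketch below recomputes two of them from the start and end timestamps in the table.

  # Quick check that the quoted durations follow from the start/end timestamps in the table.
  from datetime import datetime

  FMT = "%d/%m/%Y %H:%M"

  entries = [
      ("lcgce01 outage",  "10/09/2010 17:52", "13/09/2010 17:52"),
      ("lcgwms02 outage", "09/09/2010 15:00", "15/09/2010 09:59"),
  ]

  for name, start, end in entries:
      delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
      hours, rem = divmod(delta.seconds, 3600)
      minutes = rem // 60
      print(f"{name}: {delta.days} days, {hours} hours and {minutes} minutes")
  # Prints 3 days, 0 hours and 0 minutes for lcgce01, and
  # 5 days, 18 hours and 59 minutes for lcgwms02, matching the table.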