RAL Tier1 weekly operations castor 06/09/2010

Work previous week

Matthew:
- A/L
Shaun:
- ..
Chris:
- Preparing DB test script for Rich
- Doing 2.1.9-related work
- Helping experiments to do tests on PreProd
- Preparing for 2.1.9 test upgrade on certification/preproduction
Richard:
- ..
Brian:
- ..
Jens:
- ..

Degraded performance for LHCb due to reduced bandwidth on lhcbMdst has led to an excessive number of jobs failing in Dirac because of low job efficiency. This is possibly related to LHCb starting to download whole files over gridftp in August. The problem coincided with three disk servers failing within 24 hours due to a networking misconfiguration with Quattor, which was subsequently fixed. Throttling the batch system within Dirac reduced the failures, but reducing the job slots for all of LHCb did not help. Reducing the RFIO (gridftp) job slots from 400 to 16 increased the network bandwidth. However, LHCb is still suspending transfers of raw data from CERN until the problem has been resolved.
gdss379 (lhcbUser) SL08 crashed for no apparent reason.
globus-url-copy doesn't work from CERN on V09 64-bit disk servers, as GLOBUS_TCP_PORT_RANGE doesn't seem to be picked up by gridftp. The problem is fixed by downgrading globus-gridftp-server from 1.10.1 to 1.8.1.
There are transfer problems to a new batch of disk servers at NorduGrid, affecting only RAL.
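
For the GLOBUS_TCP_PORT_RANGE item above, a quick sanity check of the variable itself may help rule out a malformed or missing value before blaming the server version. The sketch below is only an illustration: the helper name parse_port_range is hypothetical, it assumes the usual "min,max" form of the variable, and it only sees the environment of the process that runs it, not necessarily what the gridftp server is started with.

```python
import os


def parse_port_range(value):
    """Return (low, high) for a 'min,max' port range string, or None if unusable.

    Hypothetical helper for illustration; not part of any Globus tooling.
    """
    if not value:
        return None
    # Accept a comma or whitespace between the two numbers.
    parts = [p for p in value.replace(",", " ").split() if p]
    if len(parts) != 2:
        return None
    try:
        low, high = int(parts[0]), int(parts[1])
    except ValueError:
        return None
    # Ports must be valid and the range must not be inverted.
    if not (0 < low <= high <= 65535):
        return None
    return low, high


if __name__ == "__main__":
    raw = os.environ.get("GLOBUS_TCP_PORT_RANGE")
    rng = parse_port_range(raw)
    if rng is None:
        print("GLOBUS_TCP_PORT_RANGE is unset or malformed: %r" % (raw,))
    else:
        print("GLOBUS_TCP_PORT_RANGE covers ports %d to %d" % rng)
```

Running this in the same environment the server is launched from would show whether the variable is present and sensible; it says nothing about whether the server honours it.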

PreProd

Lazy downloading for CMS is causing disk servers to run out of memory. This was later found to be a bug in the CMS software stack.

Any ongoing production problems will jeopardize the timeline for starting 2.1.9 upgrades at the end of this month.

Entries in/planned to go to GOCDB: None