Glasgow SC4

From GridPP Wiki

Latest revision as of 13:25, 25 January 2008

This page is a logbook for Glasgow in Service Challenge 4

Transfer Tests

Preamble: Initial tests of transfers between RAL and Glasgow revealed a serious problem transferring data between dCache and DPM. Rates were dire, ~2Mb/s.

Transfer rates from DPM (Edinburgh) to DPM (Glasgow) achieved 200Mb/s without tuning. Transferring back to Edinburgh gave a lower write speed (~80Mb/s), probably because their pool has NFS-mounted filesystems (poor write performance).

2005-12-22

Set the se2 pool to RDONLY, leaving pool1 and pool2 writable, with 5 files on the channel. Noticed that 4 files would go to pool1 and 1 to pool2 - so the pool selection algorithm evidently works on a per-filesystem, rather than per-server, basis.
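The per-filesystem behaviour can be illustrated with a toy model. This is only a sketch that assumes selection is a uniform draw over writable filesystems (the real DPM selector may well differ, e.g. round-robin), with pool1 holding 5 filesystems and pool2 one:

```python
import random

# Toy model only: treat pool selection as a uniform draw over writable
# filesystems (an assumption, not the actual DPM code). pool1 has 5
# filesystems and pool2 has 1, matching the 2005-12-22 setup; se2 is
# RDONLY so it receives no writes.
FILESYSTEMS = ["pool1"] * 5 + ["pool2"] * 1

def place_files(n_files, rng):
    """Assign each incoming file to the pool owning the chosen filesystem."""
    counts = {"pool1": 0, "pool2": 0}
    for _ in range(n_files):
        counts[rng.choice(FILESYSTEMS)] += 1
    return counts

# Over many files, roughly 5/6 of the writes land on pool1.
print(place_files(5, random.Random(0)))
```

With 5 concurrent files this tends to give splits like the observed 4:1; the exact 10:10 split seen below when the filesystem counts are equal suggests the real selector is closer to round-robin than to a random draw.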

Experimenting with forcing transfers to one pool, or allowing them to float:

Glasgow SC4 File Transfer Tests

Writable Pools        Files  Streams  Size  Number  Bandwidth (Mb/s)  Notes
pool1 (5 fs), pool2     5       4     1GB     20          220         16 transfers to pool1, 4 to pool2
pool1 (5 fs)            5       4     1GB     20          212
pool2                   5       4     1GB     20          196
pool1 (1 fs), pool2     5       4     1GB     20          216         10 transfers to each pool
pool1 (1 fs), pool2     5       1     1GB     20          203         10 transfers to each pool
pool1 (1 fs), pool2     5      10     1GB     20          213         10 transfers to each pool

File:Ed2gla var pools 20051222.png

Differences in transfer rates are minor, and probably within error bars. Suspect transfers are being limited by network or by Edinburgh output data rate. No great effect from multiple streams.

2005-12-24

On basis of above tests, initiated a 1TB transfer from Ed DPM to Gla DPM. 10 simultaneous files, 4 streams per file:

Transfer Bandwidth Report:
  1000/1000 transferred in 35683.9609549 seconds
  1e+12 bytes transferred.
Bandwidth: 224.190358523Mb/s
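As a sanity check, the quoted rate follows directly from the byte count and duration (Mb/s here meaning 10^6 bits per second, and 1GB = 10^9 bytes):

```python
# Recompute the quoted FTS figure: 1000 x 1GB files in 35683.96 s.
bytes_moved = 1000 * 10**9      # 1e+12 bytes total
seconds = 35683.9609549

bandwidth_mbps = bytes_moved * 8 / seconds / 1e6  # megabits per second
print(f"{bandwidth_mbps:.2f} Mb/s")  # 224.19 Mb/s
```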

Steady as she goes, basically.

2006-01-09

1TB Upload to RAL

Was seeing really good rates to RAL when doing lcg-rep testing, so triggered a 1TB transfer overnight, using 5 streams. Rate was an excellent 331Mb/s.

Transfer Bandwidth Report:
  1000/1000 transferred in 24165.1441939 seconds
  1e+12 bytes transferred.
Bandwidth: 331.055338872Mb/s

I did vary the number of concurrently transferred files, from 3 up to 8, during the transfer, which had no noticeable effect on the transfer rate. It did, however, affect the load average (naturally) and, interestingly, the system CPU load on the machine: 45% for 8 concurrent transfers, 25% for 3.

I also applied the SC3 kernel tweaks during the transfer, but again there was no noticeable effect on the rate. Possibly they will have more effect on the sink than the source?

File:Gla2ral1TB-2006-01-09.gif

2006-01-26

Tuning pools for incoming transfers seems to be tricky (see Glasgow DPM Tuning), but settled on having pool1 with two writable filesystems and pool2 with 1. Then 5 files seem to almost guarantee that there's always a file being written to pool2, without having so many transfer streams as to provoke a load crisis on one of the pools. 3 streams also keeps the total number of TCP streams reasonable.

Initial rate was quite good - ~200Mb/s, but this declined after about 4am, down to as low as ~120Mb/s at 0830. Overall the rate was 166Mb/s:

Transfer Bandwidth Report:
  998/1000 transferred in 47889.3235981 seconds
  998000000000.0 bytes transferred.
Bandwidth: 166.717744168Mb/s

File:Ral2gla 5f 3s 1TB pool1 2 writeable day rate.png

The two failures were interesting:

 Transfer: srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G to srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/pytest/tfr000-file00171
 Size: 1000000000
 FTS Reason: No such active transfer  - RAL-GLAtransXXo0MOgp
 Transfer: srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G to srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/pytest/tfr000-file00219
 Size: 1000000000
 FTS Reason: Transfer succeeded.

Haven't seen either of these before. Looking on the DPM, both files transferred successfully, so I suspect this was an error in FTS itself - should check the logs.
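Reports in the shape shown above can be tallied with a short script when checking the logs. The field prefixes here are assumed from the excerpt, not a documented FTS output format:

```python
from collections import Counter

# Sketch: count "FTS Reason" lines in a report laid out like the excerpt
# above (Transfer: / Size: / FTS Reason: records). The prefixes are an
# assumption based on that excerpt.
def tally_reasons(text):
    reasons = Counter()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("FTS Reason:"):
            reasons[line[len("FTS Reason:"):].strip()] += 1
    return reasons

report = """\
Transfer: srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G to srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/pytest/tfr000-file00171
Size: 1000000000
FTS Reason: No such active transfer  - RAL-GLAtransXXo0MOgp
Transfer: srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G to srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/pytest/tfr000-file00219
Size: 1000000000
FTS Reason: Transfer succeeded.
"""

print(tally_reasons(report))
```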

Finally, write rates were very patchy - indicating that the balancing of incoming writes isn't working. Snapshot from the end of the transfer shows that pool2 was not transferring continuously:

File:Ral2gla 5f 3s 1TB pool1 2 writeable end.png

Rates to pool1 and pool2 are out of sync, so I think the network may have become the limiting factor at this point.

Work to be done on i/o rates to our pools, I think.

2006-10-27

Inbound test from Edinburgh to Glasgow

Started a 24 hour test from Edinburgh to Glasgow at ~1600 (needed Matt to set up new FTS channels for the site).

The rate was, as expected, rather disappointing. It started at ~200Mb/s, rising to ~250Mb/s as, presumably, the traffic calmed on the university's WAN and the Janet backbone.

However, this is still way below what we know the hardware is capable of: 3 disk servers alone managed to hit 800Mb/s+ when sinking data from a source connected directly to their switches. During the write test we had 6 disk servers available and DPM used them all.

Outbound test from Glasgow to Edinburgh

Started up the outbound test after seeding in files from Edinburgh (at the same rate achieved for the inbound test).

The rate was even worse than the inbound rate! It struggled up to 80Mb/s. Ganglia clearly showed the load spread nicely over all the disk servers, so why was the rate so bad?

Then on Sunday morning, between ~0600 and 0920 the rate dropped to zero, with all files failing. The FTS server said:

 Reason: Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !

but it was not clear whose SRM was down.

Action Plan

  1. iperf tests from off campus to old production site and to new disk servers
  2. Set up a test DPM using svr023 and a couple of the spare disk servers
    1. Test internally
    2. Experiment with stack settings on external tests
    3. Use SL3?

File:Glasgow-tfr-tests-2006-10.png