Edinburgh SC4

From GridPP Wiki

Latest revision as of 12:03, 1 March 2007

This page contains a log of the FTS tests that were carried out as part of Edinburgh's participation in SC4. These tests were also used to understand the local dCache setup.

Service_Challenge_Transfer_Tests

Outline of tests (draft)

The SC4 requirement is for a sustained transfer of 1TB of data from the RAL Tier-1. As a warm-up for this test, I will transfer smaller amounts of data from the Tier-1 and at the same time modify our dCache setup to observe the effect that the following have on the data transfer rate:

  • only 1 NFS mounted pool
  • only NFS mounted pools
  • only 1 RAID volume pool
  • only RAID pools
  • all pools available for use

It is expected that there will be a decrease in the transfer rate when only the NFS mounted pools are available to the dCache, but I would like to obtain quantitative results. This data will be used to modify our dCache setup in the future towards a more optimal configuration. In addition to modifying the dCache setup, it is also possible to use FTS to modify the configuration of the RAL-ED channel (in terms of the number of concurrent file transfers and parallel streams). It is hoped that there will be sufficient time to study these effects on the transfer rate.

12/12/05

Started trying to initiate FTS tests. FTS was accepting the jobs, but querying the transfer status produced a strange error message. The problem was eventually resolved when a new myproxy -d was issued.

13/12/05

FTS tests were started properly. Initially I just used Matt Hodges' test script to start some transfers in order to observe the performance before any tuning took place. This also gave me a chance to study the FTS logs and ganglia monitoring pages. Submitted a batch transfer of files:

Size Concurrent Files Parallel Streams
50GB 5 5

Initially transfers were successful, then started seeing error messages in the dCache pool node gridftp door logs:

 12/13 16:14:48 Cell(GFTP-dcache-Unknown-998@gridftpdoor-dcacheDomain) : CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue
 12/13 16:14:48 Cell(GFTP-dcache-Unknown-994@gridftpdoor-dcacheDomain) : CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue
 12/13 16:14:48 Cell(GFTP-dcache-Unknown-999@gridftpdoor-dcacheDomain) : CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue

Not clear what is causing this. Had to cancel the transfer because of it. Even if I now submit just a single file for transfer, I get these error messages. Set up another transfer:

Size Concurrent Files Parallel Streams
25*1GB 5 1

Only saw 11Mb/s. The FTS log files reported a problem with pool dcache_24, confirmed by the dCache monitoring. Not sure why; possibly excessive load. Also having a problem with the gridftp door on the admin node not starting up. For the moment I have disabled it, and all traffic is now going through the pool node.

14/12/05

Set up another transfer, passing the option -g "-p 10" to FTS. Now using Chris Brew's test script.

Size Concurrent Files Parallel Streams
50*10MB=500MB 5 10
Transfer IDs are d4f03598-6c95-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 11:36:20 GMT 2005
Active Jobs: Done 4 files (1 active, 45 pending and 0 delayed) - Wed Dec 14 11:36:50 GMT 2005
Active Jobs: Done 5 files (5 active, 40 pending and 0 delayed) - Wed Dec 14 11:36:57 GMT 2005
...
Active Jobs: Done 49 files (0 active, 0 pending and 1 delayed) - Wed Dec 14 11:40:36 GMT 2005
Transfer Finished - Wed Dec 14 11:40:37 GMT 2005
Transfered 49 files in 257 s (+- 10s)
Approx rate = 1561 Mb/s

Saw a rate of 1561Mb/s according to this! This number cannot be correct; there must be an error in the test script. Set up another transfer, again passing the option -g "-p 10" to FTS.
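The reported figure can be sanity-checked directly from the numbers in the script output (49 files of 10MB in 257 s), assuming decimal units (1 MB = 10^6 bytes, 1 Mb = 10^6 bits) as is usual for network rates:

```python
def effective_rate_mbps(n_files, file_size_mb, duration_s):
    """Effective transfer rate in Mb/s, assuming decimal units
    (1 MB = 10**6 bytes, 1 Mb = 10**6 bits)."""
    total_bits = n_files * file_size_mb * 1e6 * 8
    return total_bits / duration_s / 1e6

# The 50*10MB test: 49 files completed in 257 s.
print(round(effective_rate_mbps(49, 10, 257), 2))  # → 15.25
```

So the true rate for this test was roughly 15Mb/s, two orders of magnitude below what the script printed; the same formula does reproduce the 18Mb/s reported for the later 10*1GB test (8 files of 1GB in 3516 s), so the error seems specific to this run.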

Size Concurrent Files Parallel Streams
10*1GB=10GB 5 10
Transfer IDs are dc24f7dc-6c98-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 11:58:00 GMT 2005
Active Jobs: Done 1 files (5 active, 4 pending and 0 delayed) - Wed Dec 14 12:24:20 GMT 2005
...
Active Jobs: Done 8 files (0 active, 0 pending and 2 delayed) - Wed Dec 14 12:56:35 GMT 2005
Transfer Finished - Wed Dec 14 12:56:36 GMT 2005
Transfered 8 files in 3516 s (+- 10s)
Approx rate = 18 Mb/s

So now back to a low transfer rate of 18Mb/s. Try transferring smaller files (10*100MB = 1GB).

Size Concurrent Files Parallel Streams
10*100MB=1GB 5 10
Transfer IDs are dc662920-6ca4-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 13:24:23 GMT 2005
Active Jobs: Done 1 files (5 active, 4 pending and 0 delayed) - Wed Dec 14 13:26:08 GMT 2005
...
Active Jobs: Done 10 files (0 active, 0 pending and 0 delayed) - Wed Dec 14 13:29:35 GMT 2005
Transfer Finished - Wed Dec 14 13:29:36 GMT 2005
Transfered 10 files in 313 s (+- 10s)
Approx rate = 261 Mb/s

So now up to a respectable 261Mb/s. It looks like dCache may be having problems transferring large files, possibly timing out. Perform another test with 100*100MB = 10GB.

Size Concurrent Files Parallel Streams
100*100MB=10GB 5 10
Transfer IDs are 0364f5a2-6ca6-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 13:32:16 GMT 2005
Active Jobs: Done 2 files (5 active, 93 pending and 0 delayed) - Wed Dec 14 13:34:15 GMT 2005
...
Active Jobs: Done 100 files (0 active, 0 pending and 0 delayed) - Wed Dec 14 14:21:20 GMT 2005
Transfer Finished - Wed Dec 14 14:21:21 GMT 2005
Transfered 100 files in 2945 s (+- 10s)
Approx rate = 278 Mb/s

Decent transfer rate of 278Mb/s. Now observing problems with dCache pools going offline (as reported by the web interface). The offline pools are ones that are NFS mounted from the University SAN; 4 of the 10 NFS mounted pools remain online. FTS transfers were hanging when trying to use these pools as transfer destinations. Seeing java processes in status D (uninterruptible sleep).

# ps aux|grep " D "
root      4353  0.0  2.1 622700 84668 pts/0  D    11:29   0:00 /usr/java/j2sdk1.
root      4393  0.0  2.1 622700 84668 pts/0  D    11:29   0:00 /usr/java/j2sdk1.
root     11186  0.0  1.5 506484 58760 pts/0  D    15:04   0:00 /usr/java/j2sdk1.
root     11864  0.0  1.5 509448 62100 pts/0  D    15:20   0:00 /usr/java/j2sdk1.
root     13029  0.0  1.6 584296 65912 pts/0  D    16:27   0:00 /usr/java/j2sdk1.
root     13194  0.0  0.0  1740  584 pts/0    S    16:32   0:00 grep  D

The problem was not resolved by restarting dcache-pool or NFS. A reboot was required.
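The ps | grep check above can be done more robustly in a few lines of Python (a minimal sketch; status "D" is uninterruptible sleep, which usually means a process blocked on I/O, such as a hung NFS mount):

```python
import subprocess

def d_state_processes():
    """Return (pid, command) pairs for processes in uninterruptible
    sleep ('D' state) by parsing ps output, rather than grepping for
    a bare " D " which can also match command names."""
    out = subprocess.run(["ps", "-eo", "pid,stat,comm"],
                         capture_output=True, text=True, check=True).stdout
    stuck = []
    for line in out.splitlines()[1:]:          # skip the header row
        pid, stat, comm = line.split(None, 2)
        if stat.startswith("D"):               # "D", "D<", "Dl", ...
            stuck.append((int(pid), comm))
    return stuck

print(d_state_processes())
```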


16/12/05

Noticed that if I try to make gridftp use > 10 parallel streams (dCache -> dCache), the transfer does not work and Graeme's python script repeatedly outputs:

Child:  /opt/glite/bin/glite-transfer-status -l 449b254e-6e2e-11da-a18f-e44be7748cb0
Overall status:  Active
Matching for duration in fts query line number 6 failed.
Found

and the pool node gridftp log reports:

12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: SocketRedirector(Thread-632):Adapter: done, EOD received ? = false
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: Closing data channel: 1 remaining: 9 eodc says there will be: -1
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: SocketRedirector(Thread-633):Adapter: done, EOD received ? = false
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: Closing data channel: 2 remaining: 8 eodc says there will be: -1
...

Now change the dCacheSetup file on the pool node to see if this has any effect on performance. First modify it so that it only uses 1 parallel stream:

# Set number of parallel streams per GridFTP transfer
parallelStreams=1

(The default was 10.) Restart dcache-opt and see what sort of transfer rate we get - 8.37Mb/s (100MB file, 1 stream). Now try submitting the same job, but with the "-p 2" option - 8.30Mb/s. This is strange. Why did I not get the error messages seen above? Try again with "-p 11" - same response as above with the "Matching duration...". glite-transfer-status returns:

State:       Waiting
Retries:     1
Reason:      Transfer failed. ERROR the server sent an error response: 426 426 Transfer aborted, closing connection :Unexpected Exception : java.net.SocketException: Connection reset

Possibly I need to change parallelStreams in the admin node config file as well. Do this, then re-run the tests:

 1 stream  - 9.37Mb/s
 2 streams - 7.52Mb/s
 10 streams- 9.43Mb/s
 11 streams- failed with same error as above.

??

What about the .srmconfig/config.xml file? I had been using:

<buffer_size> 131072 </buffer_size>
<tcp_buffer_size> 0 </tcp_buffer_size>
<streams_num> 10 </streams_num>

Now try again with streams_num set to 1, but pass the "-p 10" option to gridftp.

  "-p 1",  <streams_num> 1 - 9.40Mb/s  : FTS logs report that 1 stream was used
  "-p 10", <streams_num> 1 - 10.65Mb/s : FTS logs report that 10 streams were used, so this does not appear to have any influence.

What about modifying the buffer sizes? There are also corresponding buffer sizes in dCacheSetup. Change config.xml to this:

 <buffer_size> 2048 </buffer_size>
 "-p 1",  <streams_num> 1 - 10.76Mb/s  : FTS logs report that 1 stream was used
 "-p 10", <streams_num> 1 - 15.01Mb/s  : 10 streams

Faster in both cases. Try reducing the buffer size further to 1024 and run the tests again.

 "-p 10", <streams_num> 1 - 9.42Mb/s : 

Change to 4096:

 "-p 10", <streams_num> 1 - 10.77Mb/s

The ~10Mb/s limit may be due to the NFS mounted disk pools that dCache is using. I will make these pools read-only for now to see what effect this has on the transfer rate. The setup is now:

nfs-test
linkList :
  nfs-test-link  (pref=10/10/0;ugroups=2;pools=1)
poolList :
  dcache_28  (enabled=true;active=22;links=0;pgroups=1)
  dcache_26  (enabled=true;active=24;links=0;pgroups=1)
  dcache_22  (enabled=true;active=4;links=0;pgroups=1)
  dcache_30  (enabled=true;active=24;links=0;pgroups=1)
  dcache_24  (enabled=true;active=0;links=0;pgroups=1)
  dcache_32  (enabled=true;active=22;links=0;pgroups=1)
  dcache_25  (enabled=true;active=26;links=0;pgroups=1)
  dcache_27  (enabled=true;active=23;links=0;pgroups=1)
  dcache_29  (enabled=true;active=8;links=0;pgroups=1)
  dcache_23  (enabled=true;active=1;links=0;pgroups=1)
  dcache_31  (enabled=true;active=25;links=0;pgroups=1)
ResilientPools
 linkList :
 poolList :
default
 linkList :
  default-link  (pref=10/10/10;ugroups=2;pools=1)
 poolList :
  dcache_1  (enabled=true;active=1;links=0;pgroups=1)
  dcache_7  (enabled=true;active=25;links=0;pgroups=1)
  dcache_14  (enabled=true;active=13;links=0;pgroups=1)
  dcache_13  (enabled=true;active=16;links=0;pgroups=1)
  dcache_20  (enabled=true;active=6;links=0;pgroups=1)
  dcache_6  (enabled=true;active=22;links=0;pgroups=1)
  dcache_16  (enabled=true;active=13;links=0;pgroups=1)
  dcache_8  (enabled=true;active=23;links=0;pgroups=1)
  dcache_11  (enabled=true;active=19;links=0;pgroups=1)
  dcache_4  (enabled=true;active=29;links=0;pgroups=1)
  dcache_18  (enabled=true;active=7;links=0;pgroups=1)
  dcache_21  (enabled=true;active=5;links=0;pgroups=1)
  dcache_3  (enabled=true;active=0;links=0;pgroups=1)
  dcache_17  (enabled=true;active=11;links=0;pgroups=1)
  dcache_19  (enabled=true;active=6;links=0;pgroups=1)
  dcache_2  (enabled=true;active=1;links=0;pgroups=1)
  dcache_12  (enabled=true;active=19;links=0;pgroups=1)
  dcache_9  (enabled=true;active=22;links=0;pgroups=1)
  dcache_10  (enabled=true;active=19;links=0;pgroups=1)
  dcache_5  (enabled=true;active=25;links=0;pgroups=1)
  dcache_15  (enabled=true;active=12;links=0;pgroups=1)

Notice the writepref value (0) for the nfs-test-link. This makes the NFS mounted pools read only.
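The effect of a zero write preference can be illustrated with a small sketch (hypothetical code, not dCache's actual pool manager; the pool and link names are a subset taken from the listing above, and the only rule encoded is the one stated here, that writepref=0 excludes a link's pools from writes):

```python
# Illustrative model: a link's pools are candidates for an operation
# only if the corresponding preference is non-zero.
links = {
    "nfs-test-link": {"readpref": 10, "writepref": 0,
                      "pools": ["dcache_22", "dcache_23"]},  # NFS pools (subset)
    "default-link":  {"readpref": 10, "writepref": 10,
                      "pools": ["dcache_1", "dcache_2"]},    # subset
}

def candidate_pools(op):
    """Pools eligible for 'read' or 'write' under the stated rule."""
    pref = "readpref" if op == "read" else "writepref"
    return [p for link in links.values() if link[pref] > 0
            for p in link["pools"]]

print(candidate_pools("write"))  # → ['dcache_1', 'dcache_2']
```

With writepref=0 the NFS pools drop out of the write candidates but remain readable, which is exactly the read-only behaviour wanted here.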

Ed to RAL

Just as a test, I tried transferring a 100MB file from the Ed dCache to the RAL dCache.

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
100MB 1 50 24.75

So there are definitely issues with writing to our dCache. This would imply that it is NFS causing the problem. If I perform another test, but this time take a file that is definitely on a non-NFS mounted pool, then I get:

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
1*1GB 1 50 95.54
5*1GB 1 50 143.06
5*1GB 5 50 see below

In the last test above, the transfers started and 3 files were successfully copied to RAL. Via ganglia I was seeing rates of ~30MB/s out of my pool node! However, the python script then started outputting "Matching for duration in fts query line number 6 failed". Not clear what is causing this. Could the transfer rate be too high? Try lowering the number of streams:

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
5*1GB 5 10 300.96
5*1GB 5 20 get same Matching error again
5*1GB 5 10 289.66
20*1GB 5 10 317.00
20*1GB 10 10 388.93
20*1GB 20 10 1 file Done, 19 went into Waiting state
15*1GB 20 10 422.08

So we seem to be able to write to the Tier-1 at a decent rate when coming from a non-NFS mounted pool. Now try a transfer with identical parameters, but with the 1GB file coming from an NFS mounted pool (from dcache_27 = scotgrid10).

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
1*1GB 20 10 110.03
10*1GB 20 10 391.92
15*1GB 20 10 430.12

This shows that reading from an NFS mounted pool gives a good transfer rate; writing performance appears to be terrible. Now test writing to pools that are connected via fibre channel.

17/12/05

RAL-ED, writing to the pools that reside on the RAID disk.

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
10*1GB 20 10 125.43
15*1GB 20 10 156.37
15*1GB 20 20 Same error as before - Matching for duration in fts query line number 6 failed, plus entry in dCache gridftp logs.
20*1GB 20 10 143.58
20*1GB 5 11 Same error as before - Matching for duration in fts query line number 6 failed, plus entry in dCache gridftp logs.
20*1GB 1 11 Same error as before - Matching for duration in fts query line number 6 failed, plus entry in dCache gridftp logs.

We seem to have reached a parallel stream limit of 10. Unsure what is imposing this limit. Try some GLA-ED transfers to see if the same limit exists with DPM.

GLA-ED

Writing data into the non-NFS pools.

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
10*1GB 5 10 56.08
10*1GB 5 20 57.0
20*1GB 20 20 files being transferred, then all 20 went into Waiting for some reason.

ED-GLA

Size Concurrent Files Parallel Streams (-p) Effective Rate (Mb/s)
20*1GB 5 10 Files going into Waiting state, timeouts appearing in fts logs.
20*1GB 20 10 ditto
5*1GB 20 10 181
10*1GB 20 10 122.76
10*1GB 20 30 122.74
50*1GB 20 10 transfers going into Waiting, SRM timeouts in FTS logs (30 min limit reached)



20/12/05

Want to perform some systematic testing of the Ed to RAL channel to see what effect changing the number of parallel streams and concurrent files has on transfer rate and file transfer success.
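The parameter grid for this systematic test can be sketched in a few lines (hypothetical code: the real transfers were driven by the test scripts mentioned above; only the -g "-p N" gridftp option is taken from this log, and the parameter values mirror the tables below):

```python
from itertools import product

concurrent_files = [5, 10]
parallel_streams = [1, 2, 4, 8, 10, 12]

def sweep():
    """Enumerate the (concurrent files, parallel streams) grid,
    recording the product that the tables below call
    'Con File * Paral Streams'."""
    runs = []
    for nf, ps in product(concurrent_files, parallel_streams):
        runs.append({
            "concurrent_files": nf,
            "parallel_streams": ps,
            "total_streams": nf * ps,
            "gridftp_opts": f'-g "-p {ps}"',
        })
    return runs

for run in sweep():
    print(run)
```

Each entry corresponds to one row of the tables below; the total_streams column is useful because the failures so far appear tied to the per-transfer stream count rather than the overall total.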

Ed to RAL

If there is no entry in the Notes column, then all file transfers were successful.

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
5*1GB 5 1 5 238.53
5*1GB 5 2 10 201.34
5*1GB 5 4 20 228.53
5*1GB 5 8 40 264.89
5*1GB 5 10 50 122.17 2 done, 3 waiting
5*1GB 5 12 60 110.2 1 done, 4 waiting - FTS logs show 426 426 Transfer aborted. All transfers talking to gftp0446.
5*1GB 5 12 60 224.16 3 done, 2 waiting - FTS logs show 426 426 Transfer aborted. All transfers talking to gftp0444.
5*1GB 5 18 90
5*1GB 5 20 100
5*1GB 5 22 110
5*1GB 5 24 120
5*1GB 5 26 130
5*1GB 5 28 140
5*1GB 5 30 150
5*1GB 5 32 160
10*1GB 10 1 10 148.65
10*1GB 10 2 20 256.89 8 Done, 2 Waiting. FINAL:NETWORK: Transfer failed due to possible network problem - timed out
10*1GB 10 10 100 373.92
10*1GB 10 12 120 288.69 5/10. 426 errors again.
10*1GB 10 20 200 103.97 4/10. 426 errors again.

RAL to Ed

If no mention is made in the Notes column, then all file transfers were successful.

  • Files being put into the non-NFS mounted pools.
  • dCacheSetup file on both admin and pool node using parallelStreams=1 (yes, dcache services have been restarted after changing file).
Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
5*1GB 5 1 5 144.82
5*1GB 5 2 10 148.26
5*1GB 5 4 20 154.55
5*1GB 5 6 30 35.2 1 Done, 4 waiting. Strange since the files are all in the dCache if I do an ls -l in /pnfs/...
5*1GB 5 8 40 156.51
5*1GB 5 10 50 156.13
5*1GB 5 12 60 20.95 3 done, 2 waiting, 426 error again. FTS log shows that it took ~20 mins between starting transfer and it finishing. Pool gridftpdoor logs show same messages as before 4 remaining: 6 eodc says there will be: -1 etc.
5*1GB 5 14 70 5 waiting immediately after submission. 426 errors in FTS logs. Pool node logs show similar errors to above.
5*1GB 5 16 80 Same as above.
10*1GB 10 1 10 156.25 Pool node log repeatedly contains CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue, but files still transferred.
10*1GB 10 2 20 144.25
10*1GB 10 4 40 156.56
10*1GB 10 6 60
10*1GB 10 8 80 146.02
10*1GB 10 10 100 151.88
10*1GB 10 12 120 142.90 3 done, 7 waiting. 426 errors again in FTS logs.
10*1GB 10 14 140
10*1GB 10 16 160

21/12/05

ED-ED (dCache to DPM)

Just want to look at some dCache to DPM transfers to see how bad the transfer rate is.

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
5*1GB 5 10 50 0 Transfers were taking place very slowly (I could see file size increasing in the DPM filesystem), but then SRM timed out.
5*100MB 5 10 50 9.14 Writing to NFS mounted RAID disk.
5*100MB 5 20 100 9.13 Writing to NFS mounted RAID disk.
5*100MB 5 20 100 3.79 2/5 done, 3 files exist. Writing to NFS mounted SAN.
5*100MB 5 20 100 9.31 5/5 done. Writing to filesystem local to DPM admin node.

ED-GLA (dCache to DPM)

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
5*100MB 5 5 25 8.90
5*100MB 5 10 50 9.12
5*100MB 5 20 100 9.34


So we are seeing the same consistently low transfer rate from dCache into DPM (even accounting for writing to different filesystems that are mounted in different ways).

ED-ED (DPM to dCache)

Copy files from the DPM pool that resides on the DPM admin node so that there are no conflicts with simultaneous reading and writing to the RAID array.

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
5*100MB 5 10 50 34.27
5*100MB 5 20 100 0 Same problem as before when using > 10 streams into the dCache. Files immediately go into Waiting state.
10*100MB 10 10 100 105.40
50*100MB 50 10 500 88.61
10*1GB 10 1 10 171.57
10*1GB 10 1 10 170.01 Again, to check. Not sure why it is slower than GLA-ED.
10*1GB 10 2 20 172.81
10*1GB 10 5 50 172.41
10*1GB 10 10 100 152.44


GLA-ED (DPM to dCache)

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
5*1GB 5 10 50 152.63
10*1GB 10 1 10 228.93
10*1GB 10 1 10 234.24 Did this as a check.
10*1GB 10 2 20 205.43
10*1GB 10 5 50 186.49
10*1GB 10 10 100 169.18
10*1GB 10 20 200 0 See same problems when using > 10 streams. 426 426 Data connection. data_write() failed: Handle not in the proper state

Why is the GLA-ED rate decreasing as the number of parallel streams increases? Could this be related to the value of parallelStreams on the dCache server?

ED-RAL (dCache to dCache)

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
15*1GB 15 1 15 423.05
15*1GB 15 5 75 457.31
15*1GB 15 10 150 398.23
15*1GB 15 10 150 453.14
15*1GB 15 15 225 426 errors again.


ED-GLA (DPM to DPM)

These files are all coming from the NFS mounted pools.

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
15*1GB 15 1 15 159.74
15*1GB 15 5 75 125.02 14/15. SRM timeout for one file. Error in srm__setFileStatusSOAP-ENV:Client - Invalid state
15*1GB 15 10 150 99.40 14/15. SRM timeout for one file.
15*1GB 15 20 300 68.64 11/15. SRM timeout for four files.

Now try with a 1GB file coming from a pool that is local to the DPM head node.

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
15*1GB 15 1 15 161.13
15*1GB 15 5 75 77.36 11/15. Error in srm__setFileStatusSOAP-ENV:Client - Invalid state
15*1GB 15 10 150 80.04 12/15. Error in srm__setFileStatusSOAP-ENV:Client - Invalid state
15*1GB 15 20 300
5*1GB 5 1 5 127.67
5*1GB 5 10 50 114.33
5*1GB 5 20 100 116.01
5*1GB 5 50 250 118.63

The above transfer rates are not a true reflection of what happened. Performing a dpns-ls of the destination directory at GLA shows that the files that went into the waiting state were in fact transferred. However, due to the above SOAP error in FTS, the file status was never set to Done, so the files remained in the waiting state until the SRM eventually timed out. This meant that Graeme's script never got round to calling srm-adv-del. Looking at the ganglia plots of the DPM node shows that there were peaks of up to ~30MB/s in the data output rate.

12/01/06

ED-RAL

1000*1GB files, dCache to dCache, with 10 concurrent files and 5 streams. You can see from the plots that the transfer took approximately 5 hours, giving a rate of about 440Mb/s. A few of the transfers failed. Now need to try to improve the transfer rate in the reverse direction (NFS and RAID5 issues, I think).
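The quoted ~440Mb/s is consistent with the raw numbers (assuming decimal units, 1 GB = 10^9 bytes, and taking the duration as a round 5 hours):

```python
# Sanity check of the ~440Mb/s figure: 1000 x 1GB in roughly 5 hours.
total_bits = 1000 * 1e9 * 8     # 1000 files of 1 GB, in bits
duration_s = 5 * 3600           # approximate duration from the plots
print(round(total_bits / duration_s / 1e6))  # → 444 Mb/s
```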

File:ED-RAL-1TB-scotgrid-switch.png
ScotGrid switch, green is traffic in

16/01/06

ED-DUR

First test of transferring files from the Edinburgh DPM to the Durham DPM (the dCache to DPM issue still exists).

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
10*1GB 10 5 50 90.33Mb/s File NFS mounted from the RAID'ed disk.
100*1GB 10 5 50 92.75Mb/s 90/100 transferred. File NFS mounted from the RAID'ed disk.


ED-GLA

Potential fix for the dCache to DPM problems that we have been seeing:

export RFIO_TCP_NODELAY=yes

in /etc/sysconfig/dpm-gsiftp and restart dpm-gsiftp. (From the web: TCP_NODELAY is for a specific purpose: to disable the Nagle buffering algorithm. It should only be set for applications that send frequent small bursts of information without getting an immediate response, where timely delivery of data is required; the canonical example is mouse movements.)
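What the RFIO_TCP_NODELAY=yes setting asks the transfer layer to do can be shown on a plain socket (an illustration only, not DPM code): setting TCP_NODELAY disables Nagle's algorithm so small writes go out immediately instead of being coalesced while waiting for an ACK.

```python
import socket

# Disable Nagle's algorithm on a TCP socket, as RFIO_TCP_NODELAY=yes
# is intended to do for the rfio/gsiftp data connections.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# A non-zero value means Nagle is disabled on this socket.
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
s.close()
```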

Size Concurrent Files Parallel Streams (-p) Con File * Paral Streams Effective Rate (Mb/s) Notes
10*1GB 10 5 50 96Mb/s 9/10 successful. File from the RAID'ed disk.
20*1GB 10 5 50 85Mb/s 18/20 successful. File from the RAID'ed disk. Failed due to file already existing.


27/01/06

RAL-ED

Started a 1TB transfer with 10 concurrent files and 10 streams. Seeing rates of ~130Mb/s into our RAID 5 disk (not over NFS) before the ScotGrid machines were powered down for maintenance.

Had to apply a change to the dCache setup at Edinburgh to allow the FTS transfers to succeed.