Edinburgh SC4
Latest revision as of 12:03, 1 March 2007
This page contains a log of the FTS tests that were carried out as part of Edinburgh's participation in SC4. In conjunction, these tests were used to understand the local dCache setup.
Service_Challenge_Transfer_Tests
Outline of tests (draft)
The SC4 requirement is for a sustained transfer of 1TB of data from the RAL Tier-1. As a warm-up for this test, I will transfer smaller amounts of data from the Tier-1 and at the same time modify our dCache setup to observe the effect that the following have on the data transfer rate:
- only 1 NFS mounted pool
- only NFS mounted pools
- only 1 RAID volume pool
- only RAID pools
- all pools available for use
It is expected that there will be a decrease in the transfer rate when only the NFS mounted pools are available to the dCache, but I would like to get quantitative results. This data will be used to modify our dCache setup in the future towards a more optimal configuration. In addition to modifying the dCache setup, it is also possible to use FTS to modify the configuration of the RAL-ED channel (in terms of the number of concurrent file transfers and parallel streams). It is hoped that there will be sufficient time to study these effects on the transfer rate.
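For rough planning, a short sketch (Python, using assumed round-number rates, not measurements) of how long the 1TB sustained transfer would take at various aggregate rates:

```python
# Rough planning arithmetic for the SC4 1 TB sustained transfer.
# time = bytes * 8 / rate_in_bits_per_second (decimal TB, as used on this page).
def transfer_hours(size_tb, rate_mbit_s):
    """Hours to move size_tb terabytes at rate_mbit_s megabits/second."""
    bits = size_tb * 1e12 * 8
    return bits / (rate_mbit_s * 1e6) / 3600.0

for rate in (100, 200, 400):
    print(f"{rate} Mb/s -> {transfer_hours(1, rate):.1f} h")
# 100 Mb/s -> 22.2 h, 200 Mb/s -> 11.1 h, 400 Mb/s -> 5.6 h
```

So a sustained rate of a few hundred Mb/s keeps the 1TB test within a working day.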
12/12/05
Started trying to initiate FTS tests. FTS was accepting the jobs, but querying the transfer status produced a strange error message. The problem was eventually resolved when a new myproxy -d was issued.
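For reference, the renewed delegation was along these lines (the exact invocation was not recorded; -d registers the credential under the proxy DN, and the MyProxy server name is site-specific):

```
myproxy-init -d -s <myproxy-server>
```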
13/12/05
FTS tests were started properly. Initially just using Matt Hodges' test script to start some transfers in order to observe the performance before any tuning took place. This also gave a chance to study the FTS logs and ganglia monitoring pages. Submitted a batch transfer of files:
Size | Concurrent Files | Parallel Streams |
---|---|---|
50GB | 5 | 5 |
Initially transfers were successful, then started seeing error messages in the dCache pool node gridftp door logs:
12/13 16:14:48 Cell(GFTP-dcache-Unknown-998@gridftpdoor-dcacheDomain) : CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue
12/13 16:14:48 Cell(GFTP-dcache-Unknown-994@gridftpdoor-dcacheDomain) : CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue
12/13 16:14:48 Cell(GFTP-dcache-Unknown-999@gridftpdoor-dcacheDomain) : CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue
Not clear what is causing this. Had to cancel the transfer because of it. Even if I now just submit a single file for transfer, I get these error messages. Set up another transfer:
Size | Concurrent Files | Parallel Streams |
---|---|---|
25*1GB | 5 | 1 |
Only saw 11Mb/s. The FTS log files reported a problem with pool dcache_24, confirmed by the dCache monitoring. Not sure why; possibly excessive load. Also having problems with the gridftp door on the admin node not starting up. For the moment, I have disabled it and all traffic is now going through the pool node.
14/12/05
Set up another transfer, passing the option -g "-p 10" to FTS. Now using Chris Brew's test script.
Size | Concurrent Files | Parallel Streams |
---|---|---|
50*10MB=500MB | 5 | 10 |
Transfer IDs are d4f03598-6c95-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 11:36:20 GMT 2005
Active Jobs: Done 4 files (1 active, 45 pending and 0 delayed) - Wed Dec 14 11:36:50 GMT 2005
Active Jobs: Done 5 files (5 active, 40 pending and 0 delayed) - Wed Dec 14 11:36:57 GMT 2005
...
Active Jobs: Done 49 files (0 active, 0 pending and 1 delayed) - Wed Dec 14 11:40:36 GMT 2005
Transfer Finished - Wed Dec 14 11:40:37 GMT 2005
Transfered 49 files in 257 s (+- 10s)
Approx rate = 1561 Mb/s
Saw a rate of 1561Mb/s according to this! This number cannot be correct; there must be an error in the test script. Set up another transfer, passing the option -g "-p 10" to FTS.
Size | Concurrent Files | Parallel Streams |
---|---|---|
10*1GB=10GB | 5 | 10 |
Transfer IDs are dc24f7dc-6c98-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 11:58:00 GMT 2005
Active Jobs: Done 1 files (5 active, 4 pending and 0 delayed) - Wed Dec 14 12:24:20 GMT 2005
...
Active Jobs: Done 8 files (0 active, 0 pending and 2 delayed) - Wed Dec 14 12:56:35 GMT 2005
Transfer Finished - Wed Dec 14 12:56:36 GMT 2005
Transfered 8 files in 3516 s (+- 10s)
Approx rate = 18 Mb/s
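The script's reported rates can be cross-checked by hand; a sketch (Python, file sizes in decimal MB as quoted above). The 10GB run is self-consistent at ~18 Mb/s, while the same arithmetic on the earlier 500MB run gives ~15 Mb/s rather than 1561 Mb/s, supporting the suspicion of an error in the test script:

```python
# Cross-check of reported rates: rate (Mb/s) = files * size_bytes * 8 / (seconds * 1e6)
def rate_mbit_s(n_files, file_mb, seconds):
    return n_files * file_mb * 1e6 * 8 / seconds / 1e6

print(rate_mbit_s(8, 1000, 3516))   # 10 GB run: ~18 Mb/s, matches the log
print(rate_mbit_s(49, 10, 257))     # 500 MB run: ~15 Mb/s, not 1561 Mb/s
```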
So now back to a low transfer rate of 18Mb/s. Try transferring smaller files (10*100MB = 1GB).
Size | Concurrent Files | Parallel Streams |
---|---|---|
10*100MB=1GB | 5 | 10 |
Transfer IDs are dc662920-6ca4-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 13:24:23 GMT 2005
Active Jobs: Done 1 files (5 active, 4 pending and 0 delayed) - Wed Dec 14 13:26:08 GMT 2005
...
Active Jobs: Done 10 files (0 active, 0 pending and 0 delayed) - Wed Dec 14 13:29:35 GMT 2005
Transfer Finished - Wed Dec 14 13:29:36 GMT 2005
Transfered 10 files in 313 s (+- 10s)
Approx rate = 261 Mb/s
So now up to a respectable 261Mb/s. It looks like dCache may be having problems transferring large files, possibly timing out. Perform another test with 100*100MB = 10GB.
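A rough calculation (assuming the aggregate bandwidth is shared evenly across the 5 concurrent transfers) supports the timeout hypothesis for the 1GB files:

```python
# At an aggregate ~18 Mb/s shared by 5 concurrent transfers, each 1 GB file
# takes well over half an hour -- long enough to trip a 30-minute SRM timeout
# (a limit that shows up later in the FTS logs on this page).
aggregate_mbit_s = 18
concurrent = 5
file_bits = 1e9 * 8
per_file_s = file_bits / (aggregate_mbit_s / concurrent * 1e6)
print(per_file_s / 60)  # ~37 minutes per file
```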
Size | Concurrent Files | Parallel Streams |
---|---|---|
100*100MB=10GB | 5 | 10 |
Transfer IDs are 0364f5a2-6ca6-11da-a18f-e44be7748cb0
Transfer Started - Wed Dec 14 13:32:16 GMT 2005
Active Jobs: Done 2 files (5 active, 93 pending and 0 delayed) - Wed Dec 14 13:34:15 GMT 2005
...
Active Jobs: Done 100 files (0 active, 0 pending and 0 delayed) - Wed Dec 14 14:21:20 GMT 2005
Transfer Finished - Wed Dec 14 14:21:21 GMT 2005
Transfered 100 files in 2945 s (+- 10s)
Approx rate = 278 Mb/s
Decent transfer rate of 278Mb/s. Now observing problems with dCache pools going offline (as reported by the web interface). The offline pools are ones that are NFS mounted from the University SAN; 4 of the 10 NFS mounted pools remain online. FTS transfers were hanging when trying to write files into these pools. Seeing java processes in status D.
# ps aux|grep " D "
root  4353  0.0  2.1 622700 84668 pts/0 D 11:29 0:00 /usr/java/j2sdk1.
root  4393  0.0  2.1 622700 84668 pts/0 D 11:29 0:00 /usr/java/j2sdk1.
root 11186  0.0  1.5 506484 58760 pts/0 D 15:04 0:00 /usr/java/j2sdk1.
root 11864  0.0  1.5 509448 62100 pts/0 D 15:20 0:00 /usr/java/j2sdk1.
root 13029  0.0  1.6 584296 65912 pts/0 D 16:27 0:00 /usr/java/j2sdk1.
root 13194  0.0  0.0   1740   584 pts/0 S 16:32 0:00 grep D
Problem not resolved by restarting dcache-pool or NFS. A reboot was required.
16/12/05
Noticed that if I try to make gridftp use > 10 parallel streams (dCache -> dCache), the transfer does not work and Graeme's python script has repeating output of:
Child: /opt/glite/bin/glite-transfer-status -l 449b254e-6e2e-11da-a18f-e44be7748cb0
Overall status: Active
Matching for duration in fts query line number 6 failed. Found
and the pool node gridftp log reports:
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: SocketRedirector(Thread-632):Adapter: done, EOD received ? = false
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: Closing data channel: 1 remaining: 9 eodc says there will be: -1
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: SocketRedirector(Thread-633):Adapter: done, EOD received ? = false
12/16 12:20:05 Cell(GFTP-dcache-Unknown-207@gridftpdoor-dcacheDomain) : CellAdapter: Closing data channel: 2 remaining: 8 eodc says there will be: -1
...
Now change the dCacheSetup file on the pool node to see if this has any effect on performance. First modify it so that it only uses 1 parallel stream.
# Set number of parallel streams per GridFTP transfer
parallelStreams=1
(default was 10). Restart dcache-opt and see what sort of transfer rate we get - 8.37Mb/s (100MB file, 1 stream). Now try submitting the same job, but with the "-p 2" option - 8.30Mb/s. This is strange: why did I not get the error messages as above? Try again with "-p 11" - same response as above with the "Matching for duration..." error. glite-transfer-status returns:
State: Waiting
Retries: 1
Reason: Transfer failed. ERROR the server sent an error response: 426 426 Transfer aborted, closing connection :Unexpected Exception : java.net.SocketException: Connection reset
Possibly I need to change parallelStreams in the admin node config file. Do this, then re-run the tests.
1 stream - 9.37Mb/s
2 streams - 7.52Mb/s
10 streams - 9.43Mb/s
11 streams - failed with same error as above.
??
What about the .srmconfig/config.xml file? I had been using:
<buffer_size> 131072 </buffer_size>
<tcp_buffer_size> 0 </tcp_buffer_size>
<streams_num> 10 </streams_num>
Now try again with streams_num of 1, but pass the "-p 10" option to gridftp.
"-p 1", <streams_num> 1 - 9.40Mb/s : FTS logs report that 1 stream was used "-p 10", <streams_num> 1 - 10.65Mb/s : FTS logs report that 10 streams were used, so this does not appear to have any influence.
What about modifying the buffer sizes? There are also corresponding buffer sizes in dCacheSetup. Change config.xml to this:
<buffer_size> 2048 </buffer_size>
"-p 1", <streams_num> 1 - 10.76Mb/s : FTS logs report that 1 stream was used "-p 10", <streams_num> 1 - 15.01Mb/s : 10 streams
Faster in both cases. Try reducing the buffer size further to 1024 and run the tests again.
"-p 10", <streams_num> 1 - 9.42Mb/s :
Change to 4096:
"-p 10", <streams_num> 1 - 10.77Mb/s
The ~10Mb/s limit may be due to the NFS mounted disk pools that dCache is using. I will make these pools read only for now to see what effect this has on the transfer rate. The setup is now:
nfs-test
  linkList :
    nfs-test-link  (pref=10/10/0;ugroups=2;pools=1)
  poolList :
    dcache_28  (enabled=true;active=22;links=0;pgroups=1)
    dcache_26  (enabled=true;active=24;links=0;pgroups=1)
    dcache_22  (enabled=true;active=4;links=0;pgroups=1)
    dcache_30  (enabled=true;active=24;links=0;pgroups=1)
    dcache_24  (enabled=true;active=0;links=0;pgroups=1)
    dcache_32  (enabled=true;active=22;links=0;pgroups=1)
    dcache_25  (enabled=true;active=26;links=0;pgroups=1)
    dcache_27  (enabled=true;active=23;links=0;pgroups=1)
    dcache_29  (enabled=true;active=8;links=0;pgroups=1)
    dcache_23  (enabled=true;active=1;links=0;pgroups=1)
    dcache_31  (enabled=true;active=25;links=0;pgroups=1)
ResilientPools
  linkList :
  poolList :
default
  linkList :
    default-link  (pref=10/10/10;ugroups=2;pools=1)
  poolList :
    dcache_1  (enabled=true;active=1;links=0;pgroups=1)
    dcache_7  (enabled=true;active=25;links=0;pgroups=1)
    dcache_14  (enabled=true;active=13;links=0;pgroups=1)
    dcache_13  (enabled=true;active=16;links=0;pgroups=1)
    dcache_20  (enabled=true;active=6;links=0;pgroups=1)
    dcache_6  (enabled=true;active=22;links=0;pgroups=1)
    dcache_16  (enabled=true;active=13;links=0;pgroups=1)
    dcache_8  (enabled=true;active=23;links=0;pgroups=1)
    dcache_11  (enabled=true;active=19;links=0;pgroups=1)
    dcache_4  (enabled=true;active=29;links=0;pgroups=1)
    dcache_18  (enabled=true;active=7;links=0;pgroups=1)
    dcache_21  (enabled=true;active=5;links=0;pgroups=1)
    dcache_3  (enabled=true;active=0;links=0;pgroups=1)
    dcache_17  (enabled=true;active=11;links=0;pgroups=1)
    dcache_19  (enabled=true;active=6;links=0;pgroups=1)
    dcache_2  (enabled=true;active=1;links=0;pgroups=1)
    dcache_12  (enabled=true;active=19;links=0;pgroups=1)
    dcache_9  (enabled=true;active=22;links=0;pgroups=1)
    dcache_10  (enabled=true;active=19;links=0;pgroups=1)
    dcache_5  (enabled=true;active=25;links=0;pgroups=1)
    dcache_15  (enabled=true;active=12;links=0;pgroups=1)
Notice the writepref value (0) for the nfs-test-link. This makes the NFS mounted pools read only.
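For the record, the write preference would be changed through the PoolManager admin interface; a sketch of the command (syntax recalled from the dCache admin shell, so treat as indicative rather than exact):

```
(PoolManager) admin > psu set link nfs-test-link -writepref=0
(PoolManager) admin > save
```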
Ed to RAL
Just as a test, I tried transferring a 100MB file from the Ed dCache to the RAL dCache.
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
100MB | 1 | 50 | 24.75 |
So there are definitely issues with writing to our dCache. This would imply that it is NFS causing the problem. If I perform another test, but this time take a file that is definitely on a non-NFS mounted pool, then I get:
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
1*1GB | 1 | 50 | 95.54 |
5*1GB | 1 | 50 | 143.06 |
5*1GB | 5 | 50 | see below |
In the last test above, the transfers started and 3 files were successfully copied to RAL. Via ganglia I was seeing rates of ~30MB/s out of my pool node! However, the python script then started outputting Matching for duration in fts query line number 6 failed.
Not clear what is causing this. Could the transfer rate be too high? Try lowering the number of streams:
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
5*1GB | 5 | 10 | 300.96 |
5*1GB | 5 | 20 | get same Matching error again |
5*1GB | 5 | 10 | 289.66 |
20*1GB | 5 | 10 | 317.00 |
20*1GB | 10 | 10 | 388.93 |
20*1GB | 20 | 10 | 1 file Done, 19 went into Waiting state |
15*1GB | 20 | 10 | 422.08 |
So I seem to be able to write to the Tier-1 at a decent rate when coming from a non-NFS mounted pool. Now try a transfer with identical parameters, but with the 1GB file coming from an NFS mounted pool (from dcache_27 = scotgrid10).
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
1*1GB | 20 | 10 | 110.03 |
10*1GB | 20 | 10 | 391.92 |
15*1GB | 20 | 10 | 430.12 |
This shows that reading from an NFS mounted pool gives a good transfer rate; writing performance appears to be terrible. Now test writing to pools that are connected via fibre channel.
17/12/05
RAL-ED, writing to the pools that reside on the RAID disk.
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
10*1GB | 20 | 10 | 125.43 |
15*1GB | 20 | 10 | 156.37 |
15*1GB | 20 | 20 | Same error as before - Matching for duration in fts query line number 6 failed, plus entry in dCache gridftp logs. |
20*1GB | 20 | 10 | 143.58 |
20*1GB | 5 | 11 | Same error as before - Matching for duration in fts query line number 6 failed, plus entry in dCache gridftp logs. |
20*1GB | 1 | 11 | Same error as before - Matching for duration in fts query line number 6 failed, plus entry in dCache gridftp logs. |
Seem to have reached a parallel stream limit of 10. Unsure what is imposing this limit. Try some GLA-ED transfers to see if same limit exists with DPM.
GLA-ED
Writing data into the non-NFS pools.
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
10*1GB | 5 | 10 | 56.08 |
10*1GB | 5 | 20 | 57.0 |
20*1GB | 20 | 20 | files being transferred, then all 20 went into Waiting for some reason. |
ED-GLA
Size | Concurrent Files | Parallel Streams (-p) | Effective Rate (Mb/s) |
---|---|---|---|
20*1GB | 5 | 10 | Files going into Waiting state, timeouts appearing in fts logs. |
20*1GB | 20 | 10 | ditto |
5*1GB | 20 | 10 | 181 |
10*1GB | 20 | 10 | 122.76 |
10*1GB | 20 | 30 | 122.74 |
50*1GB | 20 | 10 | transfers going into Waiting, SRM timeouts in FTS logs (30 min limit reached) |
20/12/05
Want to perform some systematic testing of the Ed to RAL channel to see what effect changing the number of parallel streams and concurrent files has on transfer rate and file transfer success.
Ed to RAL
If there is no entry in the Notes column, then all file transfers were successful.
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
5*1GB | 5 | 1 | 5 | 238.53 | |
5*1GB | 5 | 2 | 10 | 201.34 | |
5*1GB | 5 | 4 | 20 | 228.53 | |
5*1GB | 5 | 8 | 40 | 264.89 | |
5*1GB | 5 | 10 | 50 | 122.17 | 2 done, 3 waiting |
5*1GB | 5 | 12 | 60 | 110.2 | 1 done, 4 waiting - FTS logs show 426 426 Transfer aborted. All transfers talking to gftp0446. |
5*1GB | 5 | 12 | 60 | 224.16 | 3 done, 2 waiting - FTS logs show 426 426 Transfer aborted. All transfers talking to gftp0444. |
5*1GB | 5 | 18 | 90 | | |
5*1GB | 5 | 20 | 100 | | |
5*1GB | 5 | 22 | 110 | | |
5*1GB | 5 | 24 | 120 | | |
5*1GB | 5 | 26 | 130 | | |
5*1GB | 5 | 28 | 140 | | |
5*1GB | 5 | 30 | 150 | | |
5*1GB | 5 | 32 | 160 | | |
10*1GB | 10 | 1 | 10 | 148.65 | |
10*1GB | 10 | 2 | 20 | 256.89 | 8 Done, 2 Waiting. FINAL:NETWORK: Transfer failed due to possible network problem - timed out |
10*1GB | 10 | 10 | 100 | 373.92 | |
10*1GB | 10 | 12 | 120 | 288.69 | 5/10. 426 errors again. |
10*1GB | 10 | 20 | 200 | 103.97 | 4/10. 426 errors again. |
RAL to Ed
If no mention is made in the Notes column, then all file transfers were successful.
- Files being put into the non-NFS mounted pools.
- dCacheSetup file on both admin and pool node using parallelStreams=1 (yes, the dcache services have been restarted after changing the file).
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
5*1GB | 5 | 1 | 5 | 144.82 | |
5*1GB | 5 | 2 | 10 | 148.26 | |
5*1GB | 5 | 4 | 20 | 154.55 | |
5*1GB | 5 | 6 | 30 | 35.2 | 1 Done, 4 waiting. Strange since the files are all in the dCache if I do an ls -l in /pnfs/... |
5*1GB | 5 | 8 | 40 | 156.51 | |
5*1GB | 5 | 10 | 50 | 156.13 | |
5*1GB | 5 | 12 | 60 | 20.95 | 3 done, 2 waiting, 426 error again. FTS log shows that it took ~20 mins between starting the transfer and it finishing. Pool gridftpdoor logs show the same messages as before (4 remaining: 6 eodc says there will be: -1, etc.). |
5*1GB | 5 | 14 | 70 | | 5 waiting immediately after submission. 426 errors in FTS logs. Pool node logs show similar errors to above. |
5*1GB | 5 | 16 | 80 | | Same as above. |
10*1GB | 10 | 1 | 10 | 156.25 | Pool node log repeatedly contains CellAdapter: cought SocketTimeoutException: Do nothing, just allow looping to continue, but the files were still transferred. |
10*1GB | 10 | 2 | 20 | 144.25 | |
10*1GB | 10 | 4 | 40 | 156.56 | |
10*1GB | 10 | 6 | 60 | | |
10*1GB | 10 | 8 | 80 | 146.02 | |
10*1GB | 10 | 10 | 100 | 151.88 | |
10*1GB | 10 | 12 | 120 | 142.90 | 3 done, 7 waiting. 426 errors again in FTS logs. |
10*1GB | 10 | 14 | 140 | | |
10*1GB | 10 | 16 | 160 | | |
21/12/05
ED-ED (dCache to DPM)
Just want to look at some dCache to DPM transfers to see how bad the transfer rate is.
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
5*1GB | 5 | 10 | 50 | 0 | Transfers were taking place very slowly (I could see file size increasing in the DPM filesystem), but then SRM timed out. |
5*100MB | 5 | 10 | 50 | 9.14 | Writing to NFS mounted RAID disk. |
5*100MB | 5 | 20 | 100 | 9.13 | Writing to NFS mounted RAID disk. |
5*100MB | 5 | 20 | 100 | 3.79 | 2/5 done, 3 files exist. Writing to NFS mounted SAN. |
5*100MB | 5 | 20 | 100 | 9.31 | 5/5 done. Writing to filesystem local to DPM admin node. |
ED-GLA (dCache to DPM)
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
5*100MB | 5 | 5 | 25 | 8.90 | |
5*100MB | 5 | 10 | 50 | 9.12 | |
5*100MB | 5 | 20 | 100 | 9.34 |
So I am seeing the same consistently low transfer rate from dCache into DPM (even accounting for writing to different filesystems that are mounted in different ways).
ED-ED (DPM to dCache)
Copy files from the DPM pool that resides on the DPM admin node so that there are no conflicts with simultaneous reading and writing to the RAID array.
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
5*100MB | 5 | 10 | 50 | 34.27 | |
5*100MB | 5 | 20 | 100 | 0 | Same problem as before when using > 10 streams into the dCache. Files immediately go into Waiting state. |
10*100MB | 10 | 10 | 100 | 105.40 | |
50*100MB | 50 | 10 | 500 | 88.61 | |
10*1GB | 10 | 1 | 10 | 171.57 | |
10*1GB | 10 | 1 | 10 | 170.01 | Again, to check. Not sure why it is slower than GLA-ED. |
10*1GB | 10 | 2 | 20 | 172.81 | |
10*1GB | 10 | 5 | 50 | 172.41 | |
10*1GB | 10 | 10 | 100 | 152.44 | |
GLA-ED (DPM to dCache)
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
5*1GB | 5 | 10 | 50 | 152.63 | |
10*1GB | 10 | 1 | 10 | 228.93 | |
10*1GB | 10 | 1 | 10 | 234.24 | Did this as a check. |
10*1GB | 10 | 2 | 20 | 205.43 | |
10*1GB | 10 | 5 | 50 | 186.49 | |
10*1GB | 10 | 10 | 100 | 169.18 | |
10*1GB | 10 | 20 | 200 | 0 | See same problems when using > 10 streams. 426 426 Data connection. data_write() failed: Handle not in the proper state |
Why is the GLA-ED rate decreasing as the number of parallel streams increases? Could this be related to the value of parallelStreams on the dCache server?
ED-RAL (dCache to dCache)
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
15*1GB | 15 | 1 | 15 | 423.05 | |
15*1GB | 15 | 5 | 75 | 457.31 | |
15*1GB | 15 | 10 | 150 | 398.23 | |
15*1GB | 15 | 10 | 150 | 453.14 | |
15*1GB | 15 | 15 | 225 | | 426 errors again. |
ED-GLA (DPM to DPM)
These files are all coming from the NFS mounted pools.
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
15*1GB | 15 | 1 | 15 | 159.74 | |
15*1GB | 15 | 5 | 75 | 125.02 | 14/15. SRM timeout for one file. Error in srm__setFileStatus: SOAP-ENV:Client - Invalid state |
15*1GB | 15 | 10 | 150 | 99.40 | 14/15. SRM timeout for one file. |
15*1GB | 15 | 20 | 300 | 68.64 | 11/15. SRM timeout for four files. |
Now try with a 1GB file coming from a pool that is local to the DPM head node.
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
15*1GB | 15 | 1 | 15 | 161.13 | |
15*1GB | 15 | 5 | 75 | 77.36 | 11/15. Error in srm__setFileStatus: SOAP-ENV:Client - Invalid state |
15*1GB | 15 | 10 | 150 | 80.04 | 12/15. Error in srm__setFileStatus: SOAP-ENV:Client - Invalid state |
15*1GB | 15 | 20 | 300 | ||
5*1GB | 5 | 1 | 5 | 127.67 | |
5*1GB | 5 | 10 | 50 | 114.33 | |
5*1GB | 5 | 20 | 100 | 116.01 | |
5*1GB | 5 | 50 | 250 | 118.63 |
The above transfer rates are not a true reflection of what happened. Performing a dpns-ls of the destination directory at GLA shows that the files that went into the waiting state were in fact transferred. However, due to the above SOAP error in FTS, the file status was never set to Done and therefore the files were always in a waiting state until the SRM eventually timed out. This meant that Graeme's script never got round to calling srm-adv-del. Looking at the ganglia plots of the DPM node shows that there were peaks in the data output rate of up to ~30MB/s.
12/01/06
ED-RAL
1000*1GB files, dCache to dCache, 10 concurrent files, 5 streams. You can see from the plots that the transfer took approximately 5 hours, giving a rate of about 440Mb/s. A few of the transfers failed. Now need to try to improve the transfer rate in the reverse direction (NFS and RAID5 issues, I think).
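The quoted rate is consistent with the numbers; a quick sketch (Python, using the 1000 x 1GB total and the ~5 hour duration read off the plots):

```python
# 1000 x 1 GB in ~5 hours works out at roughly 440 Mb/s, matching the plots.
total_bits = 1000 * 1e9 * 8
hours = 5
print(total_bits / (hours * 3600) / 1e6)  # ~444 Mb/s
```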
16/01/06
ED-DUR
First test of transferring files from the Edinburgh DPM to the Durham DPM (the dCache to DPM issue still exists).
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
10*1GB | 10 | 5 | 50 | 90.33 | File NFS mounted from the RAID'ed disk. |
100*1GB | 10 | 5 | 50 | 92.75 | 90/100 transferred. File NFS mounted from the RAID'ed disk. |
ED-GLA
Potential fix for the dCache to DPM problems that we have been seeing:
export RFIO_TCP_NODELAY=yes
in /etc/sysconfig/dpm-gsiftp and restart dpm-gsiftp. (From the web: TCP_NODELAY is for a specific purpose; to disable the Nagle buffering algorithm. It should only be set for applications that send frequent small bursts of information without getting an immediate response, where timely delivery of data is required (the canonical example is mouse movements) ).
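The same option can be set programmatically on any TCP socket; a minimal Python sketch (not part of the DPM setup, just an illustration of what RFIO_TCP_NODELAY=yes asks the gridftp server to do on its data connections):

```python
import socket

# Create a TCP socket and disable Nagle buffering, the same effect that
# RFIO_TCP_NODELAY=yes requests for the DPM gridftp data connections.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # non-zero when set
s.close()
```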
Size | Concurrent Files | Parallel Streams (-p) | Con File * Paral Streams | Effective Rate (Mb/s) | Notes |
---|---|---|---|---|---|
10*1GB | 10 | 5 | 50 | 96 | 9/10 successful. File from the RAID'ed disk. |
20*1GB | 10 | 5 | 50 | 85 | 18/20 successful. File from the RAID'ed disk. Failures due to the files already existing. |
27/01/06
RAL-ED
Started a 1TB transfer with 10 concurrent files and 10 streams. Seeing rates of ~130Mb/s into our RAID 5 disk (not over NFS) before the ScotGrid machines were powered down for maintenance.
Had to apply a change to the dCache setup at Edinburgh to allow the FTS transfers to succeed.