FTS vs srmcp
This page documents the results of file transfer tests using FTS and srmcp in bulk copy mode. Note that srmcp is the FNAL SRM client. srmCopy is the SRM API function.
Note that all tests are being performed with Edinburgh's production dCache, which is set up so that data is written into the RAID disk only. Also, the SC3 kernel tweaks have been applied to the door nodes to try to improve throughput.
See Edinburgh FTS testing for the results of transferring data from Sheffield dCache to Edinburgh dCache using FTS. In summary, using FTS with 10 concurrent files and a single GridFTP stream gave a rate of 203Mb/s. This was via two GridFTP doors (on srm.epcc and dcache.epcc) that we have in our dCache setup.
No file deletion
When the same test (250GB, 10 files, 1 stream) was re-run using the FTS filetransfer.py script, but this time without srm-adv-del-ing the files as they were transferred into the Ed dCache, the measured transfer rate was 197Mb/s. This is approximately the same as the previous case, where the files were deleted, and indicates that srm-adv-del has little effect on transfer performance.
Use srmcp in batch mode, where srm.batch contains a list of source and destination SRM SURLs. Time how long it takes to complete the transfer.
time srmcp -debug=false -copyjobfile=srm.batch
Then srm-adv-del them after the transfer.
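A hypothetical sketch of this workflow, assuming the copyjobfile format is one "sourceSURL destinationSURL" pair per line; the hostnames and /pnfs paths here are illustrative, not the real test endpoints:

```shell
# Build a copyjobfile with one "source dest" SURL pair per line.
# Endpoints below are placeholders, not the actual Ed/Sheffield SRMs.
SRC=srm://source.example.ac.uk:8443/pnfs/example.ac.uk/data/test
DST=srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/test
rm -f srm.batch
for i in $(seq 1 25); do
    echo "$SRC/file$i $DST/file$i" >> srm.batch
done
wc -l < srm.batch    # one line per transfer pair

# Time the bulk transfer, then advisory-delete the destination copies.
# (Commented out here: needs a live SRM endpoint and a valid grid proxy.)
# time srmcp -debug=false -copyjobfile=srm.batch
# for i in $(seq 1 25); do srm-advisory-delete "$DST/file$i"; done
```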
Initial transfer of 200 1GB files from Edinburgh dCache to Sheffield dCache resulted in only 30 files being transferred. The srmcp client reported this error repeatedly:
nonfatal error [org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.SRMException: TransferManager errortoo many transfers!] retrying
Try again, this time transferring 50 files in batch mode - only 31 files transferred and same error seen. Maybe I'm exceeding the maximum number of transfers that can occur simultaneously. Trying again with 25 files in batch mode is successful, no errors returned and all files transferred in 1906 seconds, giving a rate of about 105Mb/s.
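As a quick sanity check on that figure (25 x 1 GB in 1906 s, taking GB as 10^9 bytes and Mb/s as 10^6 bits per second):

```shell
# 25 files x 1e9 bytes x 8 bits, over 1906 seconds, in units of 1e6 bits/s
awk 'BEGIN { printf "%.0f Mb/s\n", 25 * 1e9 * 8 / 1906 / 1e6 }'
# prints: 105 Mb/s
```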
dCache has a default maximum limit of 30 active GridFTP transfers that it can sustain. Note that in bulk mode all transfers are attempted simultaneously (analogous to the number of concurrent files in FTS), but the SRM does not perform any queuing of these requests. The maximum limit can be raised by editing the relevant line in the RemoteGsiftpTransferManager section of the srm.batch config file on the dCache SRM node. The change can also be made (non-permanently) via the admin interface.
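A purely illustrative fragment of what that edit looks like; the `-max_transfers` option name is an assumption and may differ between dCache versions, so check the RemoteGsiftpTransferManager section of your own srm.batch:

```
#  In the RemoteGsiftpTransferManager "create" block of
#  /opt/d-cache/config/srm.batch (option name is an assumption):
#
#      ... -max_transfers=30 ...
#
#  Raising the value, e.g. to -max_transfers=50, lifts the 30-transfer
#  ceiling after the dCache services are restarted.
```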
Sheffield -> Edinburgh bulk transfer
In the test I ran, data was being transferred into the dCache at about 20MB/s (~160Mb/s). The load on the pool node (dcache.epcc) was also fairly high. This is understandable, since the srmCopy mode of operation involves direct communication with the pool node and does not route the data via a door node.
Also, the number of parallel streams used for each GridFTP transfer is controlled by the parallelStreams parameter on the door server; changing the num_streams parameter in the srmcp client has no effect. After setting parallelStreams to 1 (from 10) and rerunning the tests, the transfer rate improved significantly, as can be seen in the network plot:
The network profile was not as regular as in the previous case, but the 100 1GB files were transferred in 35mins => ~380Mb/s. A second, larger transfer test gave:
250*1GB files transferred in 5245 seconds => 381Mb/s
This is a huge increase on the previous value of 160Mb/s and on the rate achieved with FTS (203Mb/s). The same test was also run with the default kernel tuning parameters (not the SC3 ones) and there was no observable impact on the rate; however, the load on the machine was lower (about 11 compared to 18). So I have decided to stick with the default TCP values on dcache.epcc for now.
Now run two batch transfers simultaneously to see what effect this has - little or no improvement in transfer rate. It must be noted that the Edinburgh dCache setup has >10 write pools available, but all on dcache.epcc, none on srm.epcc. Since we are running srmCopy, the transfer involves direct communication with the pool node and does not first pass through one of the GridFTP doors, so the fact that two doors are available is of no consequence.
Fairer comparison to the FTS transfer script
Since I have typically been using the FTS transfer script in the mode where it deletes each file after it has been transferred, I will run an srmcp bulk transfer where srm-advisory-delete is used after each bulk transfer of 10 files.
250*1GB files transferred and deleted in 6438 seconds => 310Mb/s
Again, this is faster than the case where the filetransfer script/FTS was used to transfer 250 1GB files with 10 concurrent files and a single parallel stream, deleting the files immediately after they were transferred.
strace analysis of FTS/srmcp disk writes
K. Georgiou at Imperial has performed an strace analysis of disk writes on one of their dCache pool nodes during FTS and srmcp transfers to that node. The intention was to observe the impact that these transfer mechanisms had on disk writes in an attempt to explain some of the behaviour that is documented above.
The operation that was performed was:
$ strace -f -e lseek,write -p $PID -ttt -o transfer
where PID is the java process for the dCache gridftp door.
The output of strace gives the disk block that was written to by the process being observed. This information can be plotted against time to give a profile of the disk writes during the transfer of the file into the dCache. The two figures below show one example of this:
The FTS plot on the left shows a much more random disk access pattern than the srmcp plot on the right. This random disk access is likely to degrade the performance of the transfer.
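The (timestamp, offset) pairs for such a plot can be pulled out of the trace with something like the following sketch. The line format assumed here for `strace -f -ttt -o transfer` is `<pid> <epoch.usecs> lseek(<fd>, <offset>, SEEK_SET) = <offset>`, which may vary between strace versions; the two sample lines are fabricated stand-ins for a real trace.

```shell
# Fabricated sample of strace output (real data comes from the command
# above, attached to the gridftp door's java process).
cat > transfer <<'EOF'
1234 1143210123.456789 lseek(34, 1048576, SEEK_SET) = 1048576
1234 1143210123.987654 lseek(34, 2097152, SEEK_SET) = 2097152
EOF

# Strip the punctuation so the seek offset becomes a clean awk field,
# then emit "timestamp offset" pairs ready for gnuplot.
awk '/lseek/ { gsub(/[(),]/, " "); print $2, $5 }' transfer > seek_profile.dat
cat seek_profile.dat
```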
FTS in srmCopy mode
The STAR-ED FTS channel (i.e. all sites to Edinburgh) has now been set up to use srmCopy instead of 3rd party GridFTP (urlcopy). The Edinburgh dCache configuration is as before, with two GridFTP doors and read/write pools. Since we are now using srmCopy, data will go straight to the disk pool rather than being routed via a GridFTP door first. Since we only have a single disk server but two GridFTP doors, this mode of operation could potentially lower our writing rate.
NOTE: With the gLite 1.5 version of FTS, there is a bug when operating it in srmCopy mode that causes a file descriptor leak on the channel transfer agent. This has been documented here and a fix is present in the gLite 3.0 release.
25 concurrent files. Single stream (a netstat on the disk server during the transfer reveals that there are 50 connections opened up between the Ed and Gla disk servers; 25 are control channels (connected to 2811) and 25 are data channels, connected to ports in 50000-52000).
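One way to separate those two connection types in a `netstat -tn` dump: control connections go to the GridFTP port 2811, data connections to the 50000-52000 passive range observed in this test. The two sample lines below are fabricated stand-ins for real netstat output.

```shell
# Fabricated netstat -tn lines; field 5 is the foreign address:port.
cat > netstat.sample <<'EOF'
tcp 0 0 129.215.175.1:40001 130.209.239.1:2811 ESTABLISHED
tcp 0 0 129.215.175.1:40002 130.209.239.1:50123 ESTABLISHED
EOF

# Count control channels (port 2811) vs data channels (50000-52999,
# a slightly generous match for the observed 50000-52000 range).
awk '$5 ~ /:2811$/                 { c++ }
     $5 ~ /:5[0-2][0-9][0-9][0-9]$/ { d++ }
     END { print c+0, "control,", d+0, "data" }' netstat.sample
```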
Transfer Bandwidth Report: 50/50 transferred in 1253.38757515 seconds 50000000000.0 bytes transferred. Bandwidth: 319.135124626Mb/s
As it turns out, we see an even faster write rate into our dCache (previously seeing 276Mb/s from Glasgow). As can be seen from the ganglia plots (1540-1600), there was little network traffic between the dCache nodes (Out/blue line) since all of the data was being transferred directly to the disk server (dcache.epcc). The load on dCache was slightly higher than the previous case where urlcopy was being used, but this is to be expected since it is doing all of the work now. The head node (srm.epcc) had essentially no load on it.
Another point to note is that there were no error messages in the dCache SRM logs, unlike the case with the IC-HEP dCache yesterday. I will need to retest, but these errors are possibly the result of a problem we had with the Edinburgh dCache, which was being used as a source during that test.
Now repeat the test but with 7 concurrent files, just to get an idea of how it influences the rate.
Transfer Bandwidth Report: 50/50 transferred in 1222.87713408 seconds 50000000000.0 bytes transferred. Bandwidth: 327.09745636Mb/s
Remarkably it is even faster and the load on our disk server isn't as large. Do another test:
Transfer Bandwidth Report: 50/50 transferred in 1332.07108617 seconds 50000000000.0 bytes transferred. Bandwidth: 300.284274731Mb/s
OK, rate dropped a little, but still better than the urlcopy case.
Currently the number of parallel GridFTP streams is 1, since this is set by the parallelStreams parameter in the destination dCacheSetup file and cannot be overridden by passing the -p=N_s option to FTS. I think the parallelStreams setting takes precedence in all cases.
RAL-PP -> ED, GLA
Run a test from another dCache source SRM, just to compare to the DPM case above.
Transfer Bandwidth Report: 100/100 transferred in 4156.05011821 seconds 100000000000.0 bytes transferred. Bandwidth: 192.490460232Mb/s
Hmm, clearly not as fast. The maximum outbound rate from RAL-PP -> RAL has been recorded as 380Mb/s, so why do we not see this here? Networking issues?
Run identical test into Glasgow DPM, just for comparison.
Transfer Bandwidth Report: 54/55 transferred in 2389.1915729 seconds 54000000000.0 bytes transferred. Bandwidth: 180.814299238Mb/s
So, maybe RAL-PP do have some networking issues when sending data to other Tier-2s.
1TB test GLA -> ED
Transfer Bandwidth Report: 1000/1000 transferred in 22697.0156629 seconds 1e+12 bytes transferred. Bandwidth: 352.469246125Mb/s
Very decent transfer rate. The 1-minute load on the dcache pool node (4 CPUs) was about 20 for the duration of the transfer. This is slightly higher than the load on the node during a urlcopy transfer, but not significantly so. The load on the head node (dual core) was ~1. Additionally, there is no inter-node traffic, since srmCopy transfers directly to the pool node when dCache is the destination SRM.
Multiple Parallel GridFTP streams
Changed the number of parallel GridFTP streams that dCache uses during a transfer by editing /opt/d-cache/config/dCacheSetup on the head node and restarting the dcache-core services.
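A minimal sketch of the relevant dCacheSetup line, assuming the simple key=value syntax used in that file (the parallelStreams name is the one referred to throughout these tests):

```
# /opt/d-cache/config/dCacheSetup on the head node; restart the
# dcache-core services after editing for the change to take effect.
parallelStreams=10
```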
parallelStreams was increased from 1 to 10 and a 50GB transfer run:
Transfer Bandwidth Report: 50/50 transferred in 1957.17592192 seconds 50000000000.0 bytes transferred. Bandwidth: 204.37610923Mb/s
Running a netstat -tap on the pool node shows that there are now 11 (1 control+10 data) open connections to the source disk server per file, but this has had a detrimental effect on the final transfer rate. The load on the machine is also higher than the case where only a single stream is used.