RAL Tier1 CASTOR SRM tests T1toT2

Timetable

Castor Aggregate File Tests

Date        Start Time:Duration  Site
2006-10-04  11.00:2hrs           Glasgow

2006-10-04  11.00:2hrs           Glasgow
            11.15:2hrs           Lancaster

2006-10-04  11.00:2hrs           Glasgow
            11.15:2hrs           Lancaster
            11.30:2hrs           UCL-Central

2006-10-04  11.00:2hrs           Glasgow
            11.15:2hrs           Lancaster
            11.30:2hrs           UCL-Central
            11.45:2hrs           Cambridge

SURLs

There are 100GB of files (100 * 1GB) stored on Castor which are to be used for the tests.
Their SURLs are as follows.

srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/disk1tape1/dteam/j/jkf/castorTest/1GBcanned000
srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/disk1tape1/dteam/j/jkf/castorTest/1GBcanned001

... and so on until ...

srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/disk1tape1/dteam/j/jkf/castorTest/1GBcanned098
srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/disk1tape1/dteam/j/jkf/castorTest/1GBcanned099

A full list is available here.
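
The naming pattern above is regular enough that the list can be regenerated locally. The following is a minimal sketch, assuming the sequence runs from 000 to 099 without gaps and that a SURL file holds one SURL per line; the function names and the one-per-line layout are illustrative assumptions, not taken from the filetransfer.py documentation.

```python
# Hypothetical helpers: rebuild the 100 test SURLs from the naming pattern.
BASE = ("srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/"
        "disk1tape1/dteam/j/jkf/castorTest/1GBcanned")


def castor_surls(first=0, last=99):
    """Return the SURLs numbered first..last, zero-padded to three digits."""
    return ["%s%03d" % (BASE, i) for i in range(first, last + 1)]


def write_surl_file(path, surls):
    """Write one SURL per line (assumed file layout)."""
    with open(path, "w") as fh:
        fh.write("\n".join(surls) + "\n")
```

Calling write_surl_file("castorSURLs.txt", castor_surls()) would then produce a file with the same name as the one referred to in the next section.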

filetransfer.py Command

Ensure that version 0.5.2-1 of the script is installed on the machine from which the FTS transfers will be submitted. Earlier versions cannot take a file of source SURLs as an argument, and the 0.5.0 release has a bug that cancels transfers after an hour.

The options and arguments that should be used are:

 --duration => time in minutes that the test should run for.
 --delete => delete the files from the destination after each transfer has taken place.
 --uniform-source-size => check the metadata (size) of just one file rather than of every source SURL.
 Checking all 100 would take a long time and is unnecessary when all source files are the same size.
 castorSURLs.txt => the file containing the list of source SURLs.
 srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/castorTest => the destination endpoint. It is
 formed from an SRM hostname and a directory which exists, or will be created, in the SRM namespace.

A typical example would be...

filetransfer.py --duration=120 --delete --uniform-source-size \
 srm://ralsrma.rl.ac.uk:8443//castor/ads.rl.ac.uk/prod/grid/hep/disk1tape1/dteam/j/jkf/castorTest/1GBcanned[000:099] \
 srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/castorTest

Ganglia Endpoints


Results

Overall Summary

So far, the best rates to each site, according to Ganglia, have been as follows:
RALPPD => 400Mbps
Lancaster => 100Mbps
Edinburgh => 200Mbps
Glasgow => 30 - 50Mbps, but only with the number of concurrent files set to 1.
When the number of concurrent files is greater than 1, the rate is barely detectable, on the order of a few Kbps.
Birmingham => 400Mbps peak, but this rate could not be sustained

RALPPD

17-Aug-06
The Castor->RALPPD test was submitted using the command below (using version 0.5.0-1 of the script) and ran for an hour before being cancelled by the script.

 filetransfer.py --number=500 --delete -u -g"-p 1"
 srm://castor-srmv1.ads.rl.ac.uk:8443//castor/ads.rl.ac.uk/test/grid/hep/disk/dteam/j/jkf/1GBcanned[001:025]
 srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/TestFiles

Lancaster

21-Aug-06
The Castor->Lancaster test was submitted using the command below (using version 0.5.0-1 of the script) and was allowed to run for approximately an hour before being manually cancelled.

 filetransfer.py --number=500 --delete -u -g"-p 1"
 srm://castor-srmv1.ads.rl.ac.uk:8443//castor/ads.rl.ac.uk/test/grid/hep/disk/dteam/j/jkf/1GBcanned[005:025]
 srm://fal-pygrid-20.lancs.ac.uk:8443/pnfs/lancs.ac.uk/data/dteam/TestFiles

Glasgow

25-Aug-06
The figures below come from the filetransfer.py script (version 0.5.0-1).

channel-set -f 5, and using different source files

 rate - barely detectable

channel-set -f 5, and using same source file

 rate - barely detectable

channel-set -f 1, and using different source files

 rate - 32Mbps

channel-set -f 1, and using same source file

 rate - 37Mbps

18-Aug-06

Transfer Bandwidth Report Summary
=================================
transfer 0 (a45345d3-5eaf-11db-b949-e6cd043d6f48)
61/100 (61000000000.0) transferred. Started at 14:51:45, Canceled at 15:51:22, Duration = 0:59:37, Bandwidth = 136.404571674Mb/s
transfer 1 (69d7df85-5eb8-11db-b949-e6cd043d6f48)
64/100 (64000000000.0) transferred. Started at 15:54:33, Canceled at 16:54:35, Duration = 1:0:2, Bandwidth = 142.119974639Mb/s
transfer 2 (302e4871-5ec1-11db-9cda-d8e9ca8bb8b4)
66/100 (66000000000.0) transferred. Started at 16:57:22, Canceled at 17:57:19, Duration = 0:59:56, Bandwidth = 146.795180623Mb/s
transfer 3 (f5cd6e0d-5ec9-11db-9cda-d8e9ca8bb8b4)
72/100 (72000000000.0) transferred. Started at 18:0:9, Canceled at 19:0:31, Duration = 1:0:21, Bandwidth = 159.029164846Mb/s
transfer 4 (943843fb-5ed2-11db-983f-bee6fd519f4e)
73/100 (73000000000.0) transferred. Started at 19:2:21, Canceled at 20:1:40, Duration = 0:59:18, Bandwidth = 164.104040354Mb/s
transfer 5 (f178aa90-5eda-11db-983f-bee6fd519f4e)
82/100 (82000000000.0) transferred. Started at 20:3:28, Canceled at 21:3:51, Duration = 1:0:23, Bandwidth = 181.059464779Mb/s
transfer 6 (75cd7302-5ee2-11db-9aeb-884ba7711b8a)
81/100 (81000000000.0) transferred. Started at 21:5:32, Canceled at 22:5:34, Duration = 1:0:1, Bandwidth = 179.933476843Mb/s
transfer 7 (696035f5-5eeb-11db-9aeb-884ba7711b8a)
83/100 (83000000000.0) transferred. Started at 22:7:14, Canceled at 23:7:33, Duration = 1:0:18, Bandwidth = 183.495147493Mb/s
transfer 8 (97dc90cf-5ef3-11db-bb97-9f39275fa11b)
88/100 (88000000000.0) transferred. Started at 23:9:30, Canceled at 0:8:58, Duration = 0:59:27, Bandwidth = 197.312074118Mb/s
transfer 9 (2656be09-5efc-11db-bb97-9f39275fa11b)
90/100 (90000000000.0) transferred. Started at 0:10:52, Canceled at 1:11:24, Duration = 1:0:32, Bandwidth = 198.221413217Mb/s
transfer 10 (e1e5e7a7-5f04-11db-8fa8-bed50c69e441)
80/100 (80000000000.0) transferred. Started at 1:13:18, Canceled at 2:3:42, Duration = 0:50:24, Bandwidth = 211.628663176Mb/s
transfer 11 (03714569-5f0d-11db-8fa8-bed50c69e441)
80/100 (80000000000.0) transferred. Started at 2:1:22, Canceled at 2:50:27, Duration = 0:49:4, Bandwidth = 217.319135586Mb/s
transfer 12 (b96c6ed9-5f13-11db-be92-a9c20fdf7840)
97/100 (97000000000.0) transferred. Started at 2:48:10, Canceled at 3:48:12, Duration = 1:0:1, Bandwidth = 215.453996401Mb/s
transfer 13 (3e0ae03f-5f1a-11db-be92-a9c20fdf7840)
96/100 (96000000000.0) transferred. Started at 3:47:1, Canceled at 4:46:54, Duration = 0:59:53, Bandwidth = 213.734181929Mb/s
transfer 14 (71cbf3dc-5f22-11db-bcbd-b6d19d9d7877)
92/100 (92000000000.0) transferred. Started at 4:45:48, Canceled at 5:46:24, Duration = 1:0:36, Bandwidth = 202.37585275Mb/s
transfer 15 (f0ffc265-5f2a-11db-bcbd-b6d19d9d7877)
87/100 (87000000000.0) transferred. Started at 5:46:27, Canceled at 6:46:57, Duration = 1:0:30, Bandwidth = 191.722507401Mb/s
transfer 16 (b9b37c5e-5f33-11db-873e-aafc1f6dd63c)
78/100 (78000000000.0) transferred. Started at 6:48:34, Canceled at 7:37:48, Duration = 0:49:13, Bandwidth = 211.271038287Mb/s
transfer 17 (b0369de0-5f3b-11db-873e-aafc1f6dd63c)
89/100 (89000000000.0) transferred. Started at 7:34:14, Canceled at 8:33:47, Duration = 0:59:32, Bandwidth = 199.316927729Mb/s
transfer 18 (a3e278c9-5f42-11db-87c2-b87c63e6ef06)
79/100 (79000000000.0) transferred. Started at 8:35:36, Canceled at 9:35:32, Duration = 0:59:55, Bandwidth = 175.763215737Mb/s
transfer 19 (f83dd84a-5f4b-11db-87c2-b87c63e6ef06)
61/100 (61000000000.0) transferred. Started at 9:37:18, Canceled at 10:36:42, Duration = 0:59:23, Bandwidth = 136.9423075Mb/s
transfer 20 (9038c539-5f55-11db-bfd2-ca92950a4d5c)
24/100 (24000000000.0) transferred. Started at 10:39:28, Canceled at 11:8:22, Duration = 0:28:54, Bandwidth = 110.710029247Mb/s
transfer 21 (f7e157dd-5f59-11db-bfd2-ca92950a4d5c)
40/100 (40000000000.0) transferred. Started at 11:11:0, Active at 11:51:25, Duration = 0:40:25, Bandwidth = 131.912676408Mb/s

Date of Submission was 19/10/2006
Total number of FTS submissions = 22
 1663/2200 transferred in 75580.4656711 seconds
 1.663e+12bytes transferred.
Average Bandwidth:176.024319007Mb/s
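
The average figure in the summary is consistent with simple arithmetic: total bytes multiplied by eight, divided by the elapsed seconds, expressed in units of 10^6 bits. A minimal sketch of that check follows; the function name is illustrative, and the decimal definition of Mb (10^6 bits) is inferred from the numbers above rather than taken from the script's source.

```python
def bandwidth_mbps(bytes_transferred, seconds):
    """Average rate in megabits per second, taking 1 Mb as 10**6 bits."""
    return bytes_transferred * 8 / seconds / 1e6


# Figures from the summary above: 1.663e+12 bytes in 75580.4656711 seconds.
rate = bandwidth_mbps(1.663e12, 75580.4656711)
print("%.3f Mb/s" % rate)  # agrees with the reported average to three decimals
```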

Edinburgh

25-Aug-06
The number of files on the RAL-Ed channel was initially 1, then increased to 5 (approx 4pm) and then 10 (approx 4.15pm)

File:Castor to Ed 250806 4pm Network.png

Birmingham

27-Aug-06
I configured the channel to use 1 file (and 5 streams) to start with, until I had transferred 10GB, and got an average rate (from Ganglia) of about 150Mb/s. I then increased the number of concurrent files from 1 to 5; the rate rapidly increased beyond 400Mb/s before falling to under 200Mb/s. A few transfers were in the Waiting state:

   Destination:
 srm://epgse1.ph.bham.ac.uk:8443/srm/managerv1?SFN=/dpm/ph.bham.ac.uk/home/dteam/castortest/tfr-file015
   State:       Waiting
   Retries:     2
   Reason:       Transfer failed. ERROR the server sent an error response:
 425 425 Can't open data connection. timed out() failed

After 15 minutes or so, the transfer stalled. I left it in this state for just under 20 minutes, then gave up hope and decided to give FTS a kick by setting the number of concurrent transfers to 10. The transfer resumed before stalling again. Several transfers then failed on their 3rd attempt with the same FTP 425 error as above. I cancelled the FTS transfer at this point. 36GB had been transferred at an average rate of 86Mb/s.

File:Castor to Bir 270806 4pm Network.png

I had another go, again with poor rates and similar problems; however, the rate was a bit better:

   35/50 transferred in 1579.58328986 seconds
   35000000000.0 bytes transferred.
 Bandwidth: 177.261941043Mb/s

Imperial College

04-Sep-06

Test1: Failed

filetransfer.py --number=5 --delete -u -g"-p 1" srm://castor-srmv1.ads.rl.ac.uk:8443//castor/ads.rl.ac.uk/test/grid/hep/disk/dteam/j/jkf/1GBcanned[001:005] srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/castortest/sep04

Reason:      Failed SRM copy context. put  on httpg://gfe02.hep.ph.ic.ac.uk:8443/srm/managerv1 ; id=-2147344621 Error is 
RequestFileStatus#-2147344620 failed with error:[ retrieval of "from" TURL failed with error rs.state = Failed rs.error = null]

Test2: Successful with changed SURL Path:

filetransfer.py --number=1 --delete --uniform-source-size -g"-p 1" srm://castor-srmv1.ads.rl.ac.uk:8443//castor/ads.rl.ac.uk/test/grid/hep/disk/dteam/j/jkf/castorTest/1GBcanned001 srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/castortest/sep04

Transfer Bandwidth Report Summary
=================================
transfer 0 (afef7179-3c1b-11db-88b9-d346ee9ad713)
1/1 (1000000000.0) transferred. Started at 14:46:57, Done at 14:59:13, Duration = 0:12:15, Bandwidth = 10.8730536802Mb/s

Total number of FTS submissions = 1
 1/1 transferred in 735.763864994 seconds
 1000000000.0bytes transferred.
Average Bandwidth:10.8730536802Mb/s

filetransfer.py --number=5 --delete --uniform-source-size -g"-p 1" srm://castor-srmv1.ads.rl.ac.uk:8443//castor/ads.rl.ac.uk/test/grid/hep/disk/dteam/j/jkf/castorTest/1GBcanned[000:005] srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/castortest/sep04

Transfer Bandwidth Report Summary
=================================
transfer 0 (31597dd7-3c20-11db-88b9-d346ee9ad713)
5/5 (5000000000.0) transferred. Started at 15:25:20, Done at 15:53:14, Duration = 0:27:53, Bandwidth = 23.8968979369Mb/s

Total number of FTS submissions = 1
 5/5 transferred in 1673.85742307 seconds
 5000000000.0bytes transferred.
Average Bandwidth:23.8968979369Mb/s
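
To compare runs across sites, the per-transfer lines of these reports can be parsed mechanically. The sketch below assumes the line layout seen in the reports on this page; the regular expression and function name are illustrative and are not taken from the filetransfer.py source.

```python
import re

# Pattern inferred from the report lines on this page; an assumption, not a
# format guaranteed by filetransfer.py.
LINE_RE = re.compile(
    r"(?P<done>\d+)/(?P<total>\d+) \((?P<nbytes>[\d.]+)\) transferred\. "
    r"Started at (?P<start>[\d:]+), (?P<state>\w+) at (?P<end>[\d:]+), "
    r"Duration = (?P<duration>[\d:]+), Bandwidth = (?P<mbps>[\d.]+)Mb/s"
)


def parse_report_line(line):
    """Return a dict of fields for one per-transfer line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    hours, minutes, seconds = (int(x) for x in m.group("duration").split(":"))
    return {
        "files_done": int(m.group("done")),
        "files_total": int(m.group("total")),
        "bytes": float(m.group("nbytes")),
        "state": m.group("state"),  # e.g. Done, Canceled, Active
        "duration_s": hours * 3600 + minutes * 60 + seconds,
        "mbps": float(m.group("mbps")),
    }


sample = ("5/5 (5000000000.0) transferred. Started at 15:25:20, "
          "Done at 15:53:14, Duration = 0:27:53, Bandwidth = 23.8968979369Mb/s")
print(parse_report_line(sample))
```

Because the pattern is applied with search rather than a full-line match, trailing text after the bandwidth figure does not prevent a match.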