RALPP Local SC4 Preparations

Upgrading the UI/FTS Client

The first task was to upgrade one of our UIs with the latest version of the FTS client software.

For this I set up a new repository on our Yum server and added it to the yum.conf file on the UI.
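
For reference, the repository stanza looks something like the sketch below; the repository name and baseurl here are placeholders rather than our actual server:

[fts-client]
name=FTS client packages
baseurl=http://yum.example.pp.rl.ac.uk/fts-client/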

The first attempt ran into problems because I was following the instructions given at https://uimon.cern.ch/twiki/bin/view/LCG/FtsClientInstall rather than those given at https://uimon.cern.ch/twiki/bin/view/LCG/FtsClientInstall13, which meant that I installed an old version.

To configure the client I copied the /opt/glite/etc/services.xml file from lcgui02.gridpp.rl.ac.uk to the same location on my UI.
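
Something along these lines does the job, assuming login access to lcgui02 (any copy mechanism would do):

scp lcgui02.gridpp.rl.ac.uk:/opt/glite/etc/services.xml /opt/glite/etc/services.xml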

At that point I was able to run the suggested tests:

glite-transfer-channel-list -v
glite-transfer-list -v Pending

We should now be able to do some transfers.

Initial Transfer Tests

Derek sent a list of files on the Tier 1 SRM that I could copy to the Tier 2.

The steps needed were:

myproxy-init

I had some slight trouble here because my certificate files are in a non-standard place; setting the X509_USER_CERT and X509_USER_KEY environment variables eventually solved the problem.
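
For anyone else hitting this, the fix looks like the following; the paths are illustrative, so point them at wherever your certificate and key actually live:

export X509_USER_CERT=$HOME/certs/usercert.pem
export X509_USER_KEY=$HOME/certs/userkey.pem
myproxy-init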

After a few false starts getting the SRM destination format right, I was able to run:

glite-transfer-submit srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-0 srm://heplnx204.pp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-0

I gave it my MyProxy password and it transferred the file successfully to heplnx204.

I then used srm-advisory-delete to mark the file for deletion. Annoyingly, it expects the SURL in a different format.
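
The difference is the /srm/managerv1?SFN= part, which srm-advisory-delete does not want (this is exactly what the sed expressions further down strip out). Compare:

srm://heplnx204.pp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/pp.rl.ac.uk/data/dteam/fts_test/fts_test-0
srm://heplnx204.pp.rl.ac.uk:8443/pnfs/pp.rl.ac.uk/data/dteam/fts_test/fts_test-0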

Next I created a transfer file listing all twenty files on the Tier 1 to be copied to the Tier 2 and submitted this with:

glite-transfer-submit -f fts-test.list
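
Each line of the list is simply a source SURL and a destination SURL separated by whitespace (which is why the clean-up loop below pulls out column two with awk), e.g.:

srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-0 srm://heplnx204.pp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/pp.rl.ac.uk/data/dteam/fts_test/fts_test-0
srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-1 srm://heplnx204.pp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/pp.rl.ac.uk/data/dteam/fts_test/fts_test-1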

Checking the status with glite-transfer-status -l showed that the first file had failed: although I had advisory-deleted it, it had not actually been removed and the overwrite was not allowed.

To "delete" the files I used

for file in `awk '{print $2}' fts-test.list | sed -e 's$/srm/managerv1?SFN=$$'`
do
/opt/d-cache/srm/bin/srm-advisory-delete $file
done

Then I manually deleted the corresponding files in dCache.

Looking at the ganglia monitoring plots (gmond needed to be restarted, so the plots aren't complete), it looks like the peak transfer rate was about 25MB/s, which equates to approximately 200Mb/s.

Next steps:

  1. Set up the channel to transfer 2,4,... files in parallel (see the sketch after this list)
  2. Move the GridFTP dCache door of the head node onto the Pool node - With only one pool node this should be a big win since the data will only have to make one hop.
  3. Start tweaking the GridFTP transfer parameters
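
The per-channel settings live on the FTS server and are changed with glite-transfer-channel-set. A sketch of what step 1 would look like is below, but both the option letter and the channel name are from memory rather than checked, so verify against glite-transfer-channel-set --help before trying it:

glite-transfer-channel-set -f 2 RAL-RALPP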

Status at 02-12-2005

I've been working on the script (see below), which should now be able to manage the simple tests I want to do initially, but my current tests are being hammered by SRM errors; about 60% of the file transfers fail with:

  Source:      srm://dcache.gridpp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-3
  Destination: srm://heplnx204.pp.rl.ac.uk:8443/srm/managerv1?SFN=/pnfs/pp.rl.ac.uk/data/dteam/fts_test/2/fts_test-3
  State:       Waiting
  Retries:     1
  Reason:       Failed on SRM get: SRM getRequestStatus timed out on get

This seems to be down to load on the dCache PNFS server.

The throughput on single-file transfers is very promising though; I've already seen rates of over 250Mb/s even without doing any performance tuning. However, since I cannot run any bulk transfers, I cannot get any results for sustained rates.

Status at 06-12-2005

The FTS service seems much more stable now (touch wood); I was able to run some 100 and 200GB transfers yesterday with rates of 180 to 200Mb/s.

That's still with a single file transferred at once, the GSIFTP door on the admin node rather than the data node, and no tuning of the GSIFTP parameters.

Status at 08-12-2005

Ran a 1TB test from the Tier 1 last night which completed successfully with a rate of 197Mb/s.

Settings as on 06-12-2005.

Status at 12-12-2005

Increased the number of files transferred in parallel to 2. On small transfers this increases the rate significantly; I've managed to transfer 2GB at 307Mb/s. For anything larger than a 2GB transfer, though, the rate drops below the single-file rate. Looking at the monitoring on the dCache pool node, I suspect that it is running out of memory in the (P)NFS cache. Hopefully moving the GridFTP door to the pool node will cut out the admin-to-pool node transfer and eliminate this problem.

Status at 14-12-2005

Ran another 1TB test last night after having reduced the number of parallel file transfers back to 1. The test stopped after 865 out of 1000 files, apparently due to the RAL MyProxy server locking up. However, the file sizes and times on the server were enough for me to calculate some numbers for the transfer.

In the incoming directory on the target server,

find . -type f -printf '%C@ %s\n' | sort -n -k 1

will get you the second at which each file's transfer ended and the size of the file. You can then feed that into your favourite linear-regression package and calculate a rate. That gives me 194Mb/s for last night's test, which matches well with the previous tests.
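
If you'd rather not reach for a separate package, awk can do the least-squares fit in the same pipeline. This is only a sketch: it regresses cumulative bytes against the completion times from the find above, so the slope of the fit is the average rate:

find . -type f -printf '%C@ %s\n' | sort -n -k 1 | awk '
  NR == 1 { t0 = $1 }                    # offset the times to keep the sums precise
  { t = $1 - t0; total += $2             # cumulative bytes against completion time
    n++; sx += t; sy += total
    sxx += t*t; sxy += t*total }
  END { slope = (n*sxy - sx*sy) / (n*sxx - sx*sx)   # least-squares slope in bytes/s
        printf "%.0f Mb/s\n", slope * 8 / (1024 * 1024) }'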

Status at 11-01-2006

Upgraded dCache to version 1.6.6-3 yesterday and moved the GridFTP door onto the pool node rather than the admin node. The upgrade went fairly smoothly, apart from some problems getting the correct data into the information system and a repeat of the globus TCP port range being mis-set when I started the door on the pool node.
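
For the record, the port range is controlled by the standard Globus environment variable shown below; exactly where the dCache door picks it up depends on how it is started, so treat this as a sketch, with the range itself a placeholder for whatever the firewall allows:

export GLOBUS_TCP_PORT_RANGE="50000,52000"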

Ran some tests with ten 1GB files to test different combinations of the number of parallel files and the number of streams:

Files  Streams  Rate
  1       5     214Mb/s
  2       5     327Mb/s
  4       5     384Mb/s
  8       5     360Mb/s
  4      10     391Mb/s

Ran a 1TB test using 4 parallel files, each with 10 parallel streams. It lost some files when I had to restart the GridFTP door to solve the globus TCP port range problem, but still transferred 0.985TB at 397Mb/s.

Status at 26-01-2006

Ran a 1TB test in the reverse direction (RALPP -> RAL) successfully, with a rate of 388Mb/s (4 files at once, 10 streams per file).

Test Script

I've improved the test script to handle multiple "transfer lists" and to better delete the files at the end. It also now asks for the MyProxy password and cancels a transfer once it has no pending or active files left.

#!/bin/bash
# Submit one FTS job per transfer list given on the command line, poll
# them until they finish, estimate the overall rate, and clean up.

files_list=$*

# Prompt for the MyProxy password without echoing it
read -esp "MyProxy Password:" passwd
echo

# Count how many of the submitted jobs are currently Active or Pending
function active_transfers {
  active=0
  pending=0
  for transfer in ${id[*]}
  do
    case `glite-transfer-status $transfer` in
      "Active" )
	let "active+=1"
      ;;
      "Pending" )
        let "pending+=1"
      ;;
      * )
      ;;
    esac
  done
  echo $active $pending
}

# Per-job accounting: count the Done/Active/Pending/Waiting files in each
# job, and cancel a job whose remaining files are all stuck in Waiting
function file_statuses {
  now_done=0
  now_wait=0
  now_pend=0
  now_active=0
  for i in `seq 1 ${#id[*]}`
  do

    let j=i-1
    transfer=${id[$j]}

    case `glite-transfer-status $transfer` in
      "Active" )
        eval ` \
               glite-transfer-status -l $transfer 2>/dev/null | \
               awk ' \
                     BEGIN{ a=0;p=0;w=0;d=0 }; \
                     /State: *Active/{a++}; \
                     /State: *Done/{d++}; \
                     /State: *Waiting/{w++}; \
                     /State: *Pending/{p++}; \
                     END{print "let \"apend=" p "\"\;"}; \
                     END{print "let \"await=" w "\"\;"}; \
                     END{print "let \"aacti=" a "\"\;"}; \
                     END{print "let \"now_pend+=" p "\"\;"}; \
                     END{print "let \"now_active+=" a "\"\;"}; \
                     END{print "let \"now_done+=" d "\"\;"}; \
                     END{print "let \"now_wait+=" w "\""}' \
             `
        # No pending or active files left but at least one stuck in
        # Waiting: give up on this job
        if [ $apend -eq 0 -a $aacti -eq 0 -a $await -ge 1 ]
        then
          glite-transfer-cancel $transfer
        fi

      ;;
      "Pending" )
        let "now_pend+=${num[$j]}"
      ;;
      "Done" )
        let "now_done+=${num[$j]}"
      ;;
      * )
      ;;
    esac
  done
}

# Submit one FTS job per list file, recording its ID and its file count
i=0
for file in $files_list
do
  id[$i]=`glite-transfer-submit -p $passwd -f $file`
  num[$i]=`grep --count -E '^srm' $file`
  let "i++"
done

echo Transfer IDs are ${id[*]}

# Wait for at least one job to go Active before starting the clock
until [ `active_transfers | awk '{print $1}'` -ge 1 ]
do
  sleep 5
done
starts=`date +%s`

echo Transfer Started - `date`

# Poll until no job is Active or Pending, reporting progress as files finish
ndone=0
until [ "`active_transfers`" = "0 0" ]
do
  sleep 5

  file_statuses
  if [ $now_done -ne $ndone ]
  then
    echo "Active Jobs: Done $now_done files ($now_active active, $now_pend pending and $now_wait delayed) -" `date`
    ndone=$now_done
  fi
done
ends=`date +%s`

echo Transfer Finished - `date`
# Elapsed wall-clock time in seconds
secs=`dc -e "$ends $starts - p"`
files=0

# Count the files that ended up in state Done across all jobs
for transfer in ${id[*]}
do
   eval ` \
               glite-transfer-status -l $transfer 2>/dev/null | \
               awk ' \
                     BEGIN{ a=0;p=0;w=0;d=0 }; \
                     /State: *Done/{d++}; \
                     END{print "let \"files+=" d "\"\;"}'
             `
done

echo Transferred $files files in $secs s '(+- 10s)'
# Assumes 1GB files: each is 8192Mb, so files * 8192 / seconds gives Mb/s
rate=`dc -e "$files 8192 * $secs / p"`
echo Approx rate = $rate Mb/s

# Advisory-delete each file that completed, then remove it through the
# GridFTP door as well (note the two different URL formats)
for transfer in ${id[*]}
do
  for file in `glite-transfer-status -l $transfer 2>/dev/null | \
    awk '/Destination/{ file=$2; getline; if ( $2 == "Done" ) print file }' | \
    sed -e 's$/srm/managerv1?SFN=$$'`
  do
    /opt/d-cache/srm/bin/srm-advisory-delete $file
    edg-gridftp-rm `echo $file | sed -e 's/srm/gsiftp/;s/:8443//'`
  done
done

It's very basic and assumes a number of things (fixed 1GB file sizes among them) but should give rough estimates of the rates.

heplnx101 - ~/SC4 $ ./test-fts.sh fts-test-5-1.list fts-test-5-2.list
MyProxy Password:
Transfer IDs are 7df8fdcc-6359-11da-9db1-af193b49b5ed 7e35b8ad-6359-11da-9db1-af193b49b5ed
Transfer Started - Fri Dec 2 17:31:44 GMT 2005
Active Jobs: Done 1 files (1 active, 7 pending and 1 delayed) - Fri Dec 2 17:33:35 GMT 2005
Active Jobs: Done 2 files (1 active, 6 pending and 1 delayed) - Fri Dec 2 17:34:19 GMT 2005
Active Jobs: Done 0 files (1 active, 4 pending and 0 delayed) - Fri Dec 2 17:36:41 GMT 2005
Active Jobs: Done 1 files (0 active, 0 pending and 4 delayed) - Fri Dec 2 17:41:57 GMT 2005
Transfer Finished - Fri Dec 2 17:41:58 GMT 2005
Transferred 3 files in 614 s (+- 10s)
Approx rate = 40 Mb/s

That was an attempt to transfer ten files; it only managed three, which is a 70% failure rate! The overall rate is low because the script spends most of its time waiting for the SRM to time out before abandoning each file.