SC4 Aggregate Throughput

Summary

Objective

Demonstrate 1Gb aggregate bandwidth in and out of RAL to 4 Tier 2 centres.

Timetable

GridPP Aggregate File Transfer Tests
Date                      Objective
2006-02-27                Setup for transfer tests. SRM tuning. Try 50GB transfers to/from RAL
2006-02-28 to 2006-03-01  Rate test to Tier2s. Sustain 48hrs of transfers at 250Mb/s to each T2.
2006-03-02 to 2006-03-03  Continue rate test to Tier2s. Add CERN load into RAL to check rates can be sustained
2006-03-06                Setup for transfer tests. SRM tuning. Try 50GB transfers to/from RAL
2006-03-07 to 2006-03-08  Rate test from Tier2s. Sustain 48hrs of transfers at 250Mb/s from each T2.
2006-03-09 to 2006-03-10  Continue rate test from Tier2s. Add CERN load into RAL to check rates can be sustained

Ganglia Endpoints

Tier 1 to 4 Tier 2s

Preparations

RAL T1

  • Checked the replication status of /pnfs/gridpp.rl.ac.uk/tfr2tier2/canned1G (the expected source file): it is on 4 disk servers and has been pushed to 2 more.
  • Unable to ping castorgridsc.cern.ch from the OPN hosts. CERN reported an outage at SARA last week; mailed to check whether it is still down.

Birmingham

Glasgow

  • DPM Filesystems converted to xfs during 2.7 upgrade.
  • Input rate test: 51GB of files transferred from RAL at 400Mb/s. Transfer to RAL ran at 300Mb/s, but Jamie was running some internal tests concurrently which probably brought the rate down a bit; it still exceeds the target.

Manchester

QMUL

Date Description
2006-02-27 25x2GB files two streams RAL -> QMUL: Bandwidth: 154.314824833Mb/s
2006-02-27 25x2GB files two streams QMUL -> RAL: Bandwidth: 424.945137527Mb/s

The commands which I used were:

QMUL-RAL:

[mazza@grid05 mazza]$ filetransfer.py --background --ftp-options="-p 2" --number=25 --delete \
srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/canned2G \
srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/qmul/canned2G_QMUL_to_RAL

RAL-QMUL:

[mazza@grid05 mazza]$ filetransfer.py --background --ftp-options="-p 2" --number=25  --delete \
srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned2G \
srm://se01.esc.qmul.ac.uk:8443/dpm/esc.qmul.ac.uk/home/dteam/canned2G_RAL_to_QMUL

Tuesday 2006-02-28

Overview

Plan to start in the morning, once Andrew Sansum has reported on the RAL network status.

Each T2 should trigger a backgrounded 1000GB transfer using the RAL canned data, e.g.:

 filetransfer.py -g "-p 2" --delete --number=1000 --ignore-status-error --logfile=transfer1 --poll-time=60 \
 srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/tfr2tier2/canned1G \
 srm://se2-gla.scotgrid.ac.uk:8443/dpm/scotgrid.ac.uk/home/dteam/trtest

I suggest using a poll time of 60s - these are big transfers and we do not want to crash the FTS server by overloading tomcat.

N.B. Because the streams option in FTS 1.3 does not function, the number of streams is fixed for the whole transfer at submission time through the -g option. The number of concurrent files can, however, be modified on the fly, e.g.:

 glite-transfer-channel-set -f 10 RAL-GLA

At the target rate of 250Mb/s this 1TB transfer will take just under 9 hours. The aim should be to run with a single transfer request at the start. At the end of the day, sufficient requests can be scheduled to maintain the load overnight (use the --logfile= option to get the script to write to a different output file).
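
As a quick sanity check on that estimate, here is a minimal sketch of the arithmetic (assuming decimal units throughout, i.e. 10^9-byte files and 10^6-bit Mb):

 # Back-of-the-envelope duration estimate for 1000 x 1GB canned files
 # at the 250Mb/s target rate. Decimal units are assumed to match the
 # Mb/s figures used on this page.
 files = 1000
 file_size_bytes = 1e9          # each canned1G file, taken as 10^9 bytes
 target_rate_mbps = 250.0       # target rate per T2 in Mb/s
 total_bits = files * file_size_bytes * 8
 seconds = total_bits / (target_rate_mbps * 1e6)
 print("Estimated duration: %.1f hours" % (seconds / 3600.0))   # ~8.9 hours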

Bandwidth Measurements

As each transfer is long and there may be concurrent transfer requests going on, the bandwidth measured by the transfer script may not be accurate. Graeme is working on getting gridftp data published to RGMA for use with gridview, but this will only work for DPMs, not for the dCache sites.

For the moment, I suggest using ganglia bandwidth plots to get an idea of how the transfers are going.

Start Times

To get a better idea of how bringing in more Tier2s affects things we should start in this order:

Site Transfer Start Time
Birmingham 1000
Glasgow 1015
Manchester 1030
QMUL 1045

Results

Overall Summary

Transfers out to the Tier 2s started between 1015 (BHAM, GLA) and 1100 (QMUL, MAN). RAL was shipping out at ~800Mb/s, but probably 100-150Mb/s of that was "background", so the test rate was ~650-700Mb/s (which accords with summing the rates seen at the 4 sites).

RAL had not managed to get their second 1Gb link in place, so this traffic was shared with the normal 1Gb production infrastructure at RAL. As the test proceeded we started to see network dropouts at RAL, eventually provoking user complaints, so the transfers were halted at 1450. The RAL network people don't currently understand why the network failed so badly, so the plan is to repeat the test in a slightly more controlled fashion on Thursday, when more tracers will be available to understand the failure modes.

File:Aggregate ral network dropout 20050228.png

Glasgow

Rate initially was ~240Mb/s. This dropped to ~200Mb/s when QMUL and MAN started their transfers. When the BHAM channel got stuck at 1200 the rate jumped to ~300Mb/s.

Overall we transferred 416GB at an average rate of ~200Mb/s:

 Transfer Bandwidth Report:
 416/1000 transferred in 17016.7657559 seconds
 416000000000.0 bytes transferred.
 Bandwidth: 195.57182885Mb/s

File:Aggregate glasgow day 20050228.png

No problems suffered at the Glasgow end - we showed in the setup phase we could take data at at least 400Mb/s, so we were not stressed here.

Manchester

Manchester summary from the transfer script:

 Transfer Bandwidth Report:
 96/1000 transferred in 5851.14570904 seconds
 96000000000.0 bytes transferred.
 Bandwidth: 131.256345029Mb/s

However, 212 1GB files have been transferred to Manchester at a rate of 121Mb/s, calculated using the last file timestamp.
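
For completeness, here is a minimal sketch of this sort of timestamp-based rate calculation; the start time and last-file timestamp below are illustrative assumptions, not Manchester's actual values:

 # Estimate the achieved rate from the request start time and the timestamp
 # of the last file to arrive, rather than trusting the script's own report.
 # Times and sizes below are illustrative; decimal units assumed.
 from datetime import datetime
 files_done = 212
 file_size_bytes = 1e9                               # 1GB canned files
 start_time = datetime(2006, 2, 28, 11, 0, 0)        # assumed request start
 last_file_time = datetime(2006, 2, 28, 14, 50, 0)   # assumed timestamp of last file
 elapsed = (last_file_time - start_time).total_seconds()
 rate_mbps = files_done * file_size_bytes * 8 / (elapsed * 1e6)
 print("%d files in %.0f s -> %.1f Mb/s" % (files_done, elapsed, rate_mbps))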

Manchester nodes, which are also dcache pools, were loaded with running jobs.

File:Man-sc4-20060228.png

Birmingham

Two small transfer tests of one file each before 10am were successful; however, when the proper transfers were started at 10am, no transfer entered the Active phase. We cancelled the test and initiated a new one at 1015. From the Ganglia plots and from monitoring the files on the SE, the transfers did take place, though the status was still reported incorrectly:

 Summary: Active 0, Done 0, Pending 1000, Submitted 0,
 Child:  /opt/glite/bin/glite-transfer-status -l 67253321-a842-11da-8fab-a05498efeebd
 FTS status query for 67253321-a842-11da-8fab-a05498efeebd failed (ignored due to --ignore-status-error switch).
 FTS Error: status: getFileStatus: requestID <67253321-a842-11da-8fab-a05498efeebd> was not found
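
For reference, here is a minimal, hypothetical sketch (not the actual filetransfer.py code) of the kind of error-tolerant polling that the --ignore-status-error switch implies, reusing the glite-transfer-status command and request ID from the log above:

 # Poll FTS for the request status and treat transient getFileStatus
 # failures as non-fatal, logging them and carrying on.
 import subprocess
 import time
 REQUEST_ID = "67253321-a842-11da-8fab-a05498efeebd"   # request ID from the log above
 POLL_TIME = 60                                        # seconds, as recommended earlier
 for _ in range(10):                                   # bounded loop for the example
     proc = subprocess.run(["glite-transfer-status", "-l", REQUEST_ID],
                           capture_output=True, text=True)
     if proc.returncode != 0 or "was not found" in proc.stderr:
         # Transient FTS error: note it and keep going, watching Ganglia instead.
         print("status query failed (ignored): %s" % proc.stderr.strip())
     else:
         print(proc.stdout)
     time.sleep(POLL_TIME)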

We decided to proceed with the test rather than interrupting it (leaving the debugging for later) and to use the Ganglia plots to monitor the rates:

File:AggregateBirmingham20050228.png

We observed initial rates of ~200Mb/s, which fell to ~120Mb/s soon after QMUL and MAN started their transfers. The transfers stopped at 1200. It turned out that the RAL-BHAM channel agent was stuck; transfers resumed at 1315 after Matt's intervention. Rates fluctuated around 120Mb/s until all transfers were stopped at 1450. The transfer did not cause any perturbation at Birmingham.


QMUL

File:060228 RAL-QMUL.gif


 At 15:18 the situation was the following.
 Summary: Done 75, Pending 408, Waiting 17,

Wednesday 2006-03-01

Overview

On the second day, they rested...

Though if Manchester and QMUL were able to get ganglia onto their disk nodes this would be useful ;-)


Thursday 2006-03-02

Overview

Ramp up in a more controlled way than on Tuesday. Bring in 1 Tier2 per hour, giving the RAL network people time to understand how and why the network is failing so badly.

Proposed schedule:

Site Transfer Start Time
Glasgow 0900
QMUL 1000
Birmingham 1100
Manchester 1200

Again, 1000x1GB transfers should be submitted. The plan will remain flexible, so stay glued to the GRIDPP-SC list!

Status at 1300

CMS were transferring at ~500Mb/s this morning, which was a bit of a surprise! We backed off starting Glasgow. However, FTS then went down and it took a while to get it going again.

FTS is now back up, and the CMS transfers will continue. We will add more of our sites to bring up the load. Current proposed start times for sites are:

Site Transfer Start Time
Glasgow 1400
Birmingham 1430
Manchester 1500
QMUL 1530 (if needed)

Results

Imperial

Time Info
8h00 CMS PhEDEx transfers are smooth at 50Mb/s File:Ic-phedex.gif
9h40 CMS PhEDEx stopped to test publishing File:Ic-phedex-down.gif
 ?? Agreed to put PhEDEx back online


Tuesday 2006-03-07

It fell over again.


Thursday 2006-03-09

RAL's firewall is now capped at 800Mb/s. Sites started transfers at 9am and hit ~700Mb/s within 15 minutes. After this, adding more sites just lowered the rates to the sites already transferring. RAL were stable and no dropouts occurred throughout the transfer, which ran until 1215.

At 11am Lancaster joined and added some lightpath traffic. There were some concerns about this placing load exclusively on one of RAL's disk servers, but the traffic to the other T2s through the production network stayed steady.

File:Aggregate-ral-day 20060309.png

Wednesday 2006-03-15

After the successful test last Thursday, the plan was now to run a 48 hour test to 4 Tier 2s (Birmingham, Glasgow, Manchester and QMUL) over SJ4 and simultaneously to Lancaster over the UKLIGHT link.

Sites started load at 9am. QMUL had some problems with their disk server and had to drop out quite early. Lancaster produced load via Brian's ATLAS certificate, which caused some asymmetry in the RAL load.

However, despite these minor problems, we did run very successfully for 48 hours:

File:Aggregate-48hr-fromral-composite.png

On the second day Oxford also generated load to try to increase the rate through the SJ4 link. As the FTS plots show, this was successful, but there is a suspicion that some other traffic was also going through the SJ4 link.

File:Fts-sj4-rate.gif

Data Transferred

48 Hour Transfer Tests from RAL
Site          Data Transferred (GB)  Average Rate (Mb/s)  Failure Rate (%)  Comments
Glasgow       3774                   175                  2.5               FTS sensitive to network failures and timeouts?
Birmingham    2972                   137                  0.9
Manchester    1700                   80
Lancaster     13000                  600
Oxford        994                    168                  0.6
SJ4 Subtotal  9440                   437
Total         22440                  1038

N.B. The subtotal and total rates are calculated from the data moved over the full 48 hours, not by summing the per-site rates (Oxford's rate is averaged over its shorter active period).

Wednesday 2006-03-22

After the successful test last week, the plan was now to reverse the direction and upload data to RAL from 4 Tier 2s over SJ4 and from Lancaster over UKLIGHT. Oxford were still in downtime for their 2.7.0 upgrade, so the SJ4 Tier 2s were Birmingham, Bristol, Glasgow and Manchester.

Sites started loading at 9am, with Lancaster generating load using srmcp with Brian's ATLAS certificate. However, very limited write space for ATLAS at the T1 meant that Brian's copies were quite fragile.

This turned out to be the least of our issues, though. Early on Thursday morning the FTS server became extremely unstable, virtually halting transfers. Several restarts were needed on Thursday morning, and it was 10am before old transfers could be cancelled and new ones started. Brian tried to add load from Lancaster again, but FTS behaved rather erratically on the LANC-RAL channel.

After this, we ran steadily at about 800Mb/s, but then at 0230 on Friday morning the FTS backend database packed up, halting all transfers.

File:Aggregate-48hr-toral-composite.png

Data transferred through FTS shows the pattern (but not the early srmcp success from Lancaster):

File:Fts-rate-toral-48hour.gif

A note of success was that Glasgow were able to get gridview working (select GLSGOW from drop down list).

Data Transferred

48 Hour Transfer Tests to RAL
Site           Data Transferred (GB)  Average Rate (Mb/s)  Failure Rate (%)  Comments
Glasgow        3315                   175                  29                Did lose a little time when load was switched off
Birmingham     2922                   154                  22
Manchester     1089                   57                   42
Lancaster      5900                   312                  47                Failure rate is only for FTS managed transfers (1.9TB)
Bristol        1659                   88                   34
SJ4 Sub Total  8985                   475
Total          14885                  787

N.B. Rates are calculated only over 42 hours, until the Oracle db failed.

Conclusions

  1. As expected, the T2s don't have a great deal of trouble shipping data back out to RAL - it's a less stressful i/o operation.
  2. The RAL disk servers can take what we throw at them, though at the beginning, when Lancaster were able to ship out over the OPN, the load across the disk server cluster was significant. (However, given that the servers stood up to the SC3 tests, this is probably nothing to worry too much about.)
  3. The FTS server needs attention: it should certainly be upgraded, and probably load balanced by moving the agents away from the web services frontend. There may be issues with the backend db on which the Oracle DBAs might be able to comment.