Ganga robot tests
From GridPP Wiki
Revision as of 17:16, 20 March 2010 by Alessandra forti
Manchester Ganga robot tests. These are run both by Manchester, to test local tweaks, and by ATLAS, to test its configuration. Each entry has a date, a link to the test and a description.
- 2010-03-17 Test 1199: Clone of 1078 run by me. One-hour functionality test, which was OK. Unfortunately the cluster was loaded with production jobs, which didn't allow a real test. All 4 servers now have 4 Gb/s bonded links going into a switch with 7 Gb/s links to the big Cisco. Maui config: GROUPCFG[atlpil] MAXJOB=70 MAXIJOB=10
- 2010-03-18 Test 1203: Clone of 1199 run by me. 12 hours. 99% efficiency. There are some really good improvements compared to 991. Athena running time is 4059 (over 12 hours) compared to 2381 (over 27 hours). The mean CPU efficiency is 46% vs 47%, but the sigma is 15 as opposed to 29, i.e. the histogram shows a healthier distribution. Overall efficiency is 99% vs 86% in test 991. Event/Athena time and event/s are marginally better. Overall it looks healthier. Maui config: GROUPCFG[atlpil] MAXJOB=90 MAXIJOB=10. Cluster full of production jobs. Load peaks on the SEs were around 80 this afternoon, on se04 only. se02 is having connection problems with Nagios; se01 and se03 were well behaved.
- 2010-02-01 Test 1078: Clone of 1040 run by me. 48% efficiency. Had problems with jobs being killed because the previous job couldn't compile. It may be a problem with our NFS servers, since we have increased the bandwidth of the storage, but I'm not sure. We need to see if it happens again or if it was a problem with the release. Maui config: GROUPCFG[atlpil] MAXJOB=70 MAXIJOB=10
- 2010-01-11 Test 1036: Test run by ATLAS to test the Frontier server at RAL. 100% efficiency. Low-intensity job scheduling though.
- 2010-01-12 Test 1040: Clone of 909 run by me to test stability. 87% efficiency. Max concurrent jobs 160, then reduced to 140 halfway through when one of the servers was overloaded. The server with 2x2 Gb/s bonded links doesn't overload anymore. Bought additional interfaces for the other servers.
- 2010-01-21 Test 1060: Test run by ATLAS to test the Frontier server at RAL. Slightly higher intensity. Introduced double bonding on all servers. Introduced a delay between batches of jobs in the PBS server; will have to play with that. Max concurrent jobs is still 140. Two servers have degraded RAID, which might affect the test; working to fix that. Result: 72% efficiency, interrupted. There was a problem with ATLAS CMTCONFIG, with the error "CMTCONFIG is not available on the local system: NotAvailable (required of task: i686-slc5-gcc43-opt)". The degraded servers didn't create problems; se03 and se04 behaved quite well. Need to try a more demanding test next week. Maui config: GROUPCFG[atlpil] MAXJOB=70 MAXIJOB=10
- 2010-01-21 Test 1063: Clone of 1040 run by me. Ignore: it started with jobs failing without any error at all. I wanted to pause it but stopped it instead. Will try to find out what's up. Maui config: GROUPCFG[atlpil] MAXJOB=70 MAXIJOB=10
- 2009-12-01 Test 923: Clone of 909 run by me to test stability. 95.6% efficiency. Running with an average of 140 jobs. Errors due to incorrect checksums on input files and an inconsistent XML file. Network bandwidth doubled on the server that crashed.
- 2009-12-01 Test 932: Low-impact test run by ATLAS to measure RAL Frontier server response time. Errors in Manchester due to release 15.5.0 not being installed.
- 2009-12-04 Test 953: Clone of 932 run by me after I installed release 15.5.0.
- 2009-12-09 Test 964: Clone of 932 run by ATLAS, still testing RAL Frontier server response. 100% efficiency. Number of concurrent jobs really low. Manchester OK also with continuous submission.
- 2009-12-14 Test 991: Clone of 909 run by me to test stability. 86% efficiency. Running constantly with 200 jobs. The second server, where most of the data are, was struggling instead. Most of the failures were due to the pilot eventually killing the job. Additional cards have been ordered, and a test switch will be used to set up the 4 servers in a configuration as near as possible to the final one. Cap on the number of jobs decreased for now.
- 2009-11-27 Test 909: Custom test run by me to verify cluster stability. Cloned from a previous UK test. ~50% efficiency. Errors due to one of the data servers crashing. It was OK up until then.
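
The entries above are not in date order, so the efficiency trend is hard to read at a glance. The following is a minimal Python sketch that lists the reported efficiencies chronologically. The figures are copied from the entries above; tests with no efficiency figure (932, 953, 1063, 1199) are left out, and the names `REPORTED`, `chronological` and `format_table` are made up for this illustration and are not part of Ganga or the robot.

```python
from datetime import date

# (test id, date, reported efficiency in percent), copied from the log above.
# Tests for which the log gives no efficiency figure are omitted.
REPORTED = [
    (1203, date(2010, 3, 18), 99.0),
    (1078, date(2010, 2, 1), 48.0),
    (1036, date(2010, 1, 11), 100.0),
    (1040, date(2010, 1, 12), 87.0),
    (1060, date(2010, 1, 21), 72.0),   # interrupted test
    (923, date(2009, 12, 1), 95.6),
    (964, date(2009, 12, 9), 100.0),
    (991, date(2009, 12, 14), 86.0),
    (909, date(2009, 11, 27), 50.0),   # logged as "~50%"
]


def chronological(rows):
    """Return the rows sorted by date, then by test id."""
    return sorted(rows, key=lambda row: (row[1], row[0]))


def format_table(rows):
    """Build a plain-text table of the rows in date order."""
    lines = ["date        test  efficiency"]
    for test_id, day, efficiency in chronological(rows):
        lines.append(f"{day.isoformat()}  {test_id:>4}  {efficiency:5.1f}%")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(REPORTED))
```

The numbers are not directly comparable across tests, since the ATLAS Frontier tests (964, 1036, 1060) ran at much lower intensity than the stability clones of 909.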