Getstats supports simple hypothesis testing using a two-sample t-test. If you have two samples (i.e., configurations) and want to determine whether one configuration's results are larger than, smaller than, or equal to the other's, use the --twosamplet transform. The --twosamplet transform operates like the overhead transform: the first file on the command line is compared to each subsequent file on the command line.
Before executing the command, pick your null hypothesis, which is what you assume to be true (and would like to disprove). For example, if you just spent time optimizing a function, you should assume your new software is no faster than the existing software and seek to prove otherwise. You can also assume that two samples are equal and then seek to differentiate them (if you fail, the results are statistically indistinguishable).
For a primer on hypothesis testing, I suggest reading any statistics book such as Ott and Longnecker's "An Introduction to Statistical Methods and Data Analysis", MathWorld at http://mathworld.wolfram.com/HypothesisTesting.html, or Wikipedia at http://en.wikipedia.org/wiki/Hypothesis_testing.
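To make the three null hypotheses concrete, here is a minimal Python sketch of a two-sample t-test. It is not getstats code: it assumes a pooled-variance (Student's) test, and the function names and the sample timings are made up for illustration. The t distribution's CDF is approximated numerically so that the sketch needs only the standard library.

```python
import math
from statistics import mean, variance


def t_cdf(t, df, steps=2000):
    """Student's t CDF, integrating the density with Simpson's rule."""
    if t == 0:
        return 0.5
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # probability mass between 0 and |t|
    return 0.5 + area if t > 0 else 0.5 - area


def two_sample_t(sample1, sample2):
    """Return one p-value per null hypothesis (pooled-variance t-test)."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample1) - mean(sample2)) / se
    cdf = t_cdf(t, df)
    return {
        "u1 <= u2": 1 - cdf,                # alternative: u1 > u2
        "u1 >= u2": cdf,                    # alternative: u1 < u2
        "u1 == u2": 2 * min(cdf, 1 - cdf),  # alternative: u1 != u2
    }


# Hypothetical elapsed times in seconds, made up for illustration.
new_code = [9.1, 9.0, 9.3, 9.2, 9.1]       # Sample 1
old_code = [10.2, 10.4, 10.1, 10.3, 10.5]  # Sample 2

p_values = two_sample_t(new_code, old_code)
for null_hyp, p in p_values.items():
    result = "REJECT H_0" if p < 0.05 else "ACCEPT H_0"
    print(f"{null_hyp}  {p:.3f}  {result}")
```

In the optimization scenario, the null hypothesis of interest is u1 >= u2 (the new code is no faster); a small p-value for that row is the evidence that the optimization helped.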
For example, to compare grep:reboot.res with grep:noreboot.res, you should run getstats --twosamplet grep:noreboot.res grep:reboot.res. This command produces the basic tabular report, and afterwards each quantity is compared as follows:
    grep:noreboot.res: High z-score of 2.33972893857958 for elapsed in epoch 7.
    grep:noreboot.res: Linear regression slope for sys is: 1.856%.
    grep:reboot.res: High z-score of 2.82303417219122 for elapsed in epoch 1.
    grep:reboot.res: High z-score of 2.33133550896239 for sys in epoch 3.
    grep:reboot.res: High z-score of 2.47125762323635 for io in epoch 1.

    grep:noreboot.res
    NAME     COUNT  MEAN    MEDIAN  LOW     HIGH    MIN     MAX     SDEV%  HW%
    Elapsed  10     38.751  38.699  38.580  38.921  38.465  39.307  0.614  0.439
    System   10     1.796   1.790   1.677   1.915   1.580   2.080   9.255  6.620
    User     10     23.806  23.730  23.614  23.998  23.430  24.330  1.130  0.808
    Wait     10     13.149  13.158  12.912  13.386  12.725  13.797  2.519  1.802
    CPU%     10     66.071  66.019  65.556  66.586  64.899  67.075  1.090  0.779

    grep:reboot.res
    NAME     COUNT  MEAN    MEDIAN  LOW     HIGH    MIN     MAX     SDEV%  HW%    O/H
    Elapsed  10     40.422  40.661  39.885  40.960  38.301  40.788  1.859  1.330  4.314
    System   10     1.693   1.700   1.620   1.766   1.560   1.930   6.005  4.296  -5.735
    User     10     23.718  23.745  23.451  23.985  23.180  24.220  1.572  1.124  -0.370
    Wait     10     15.011  15.102  14.569  15.454  13.481  15.632  4.124  2.950  14.168
    CPU%     10     62.875  62.764  62.166  63.584  61.561  64.802  1.576  1.127  -4.837

    Comparing grep:reboot.res (Sample 1) to grep:noreboot.res (Sample 2).

    Elapsed: 95%CI for grep:reboot.res - grep:noreboot.res = (1.148, 2.195)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     0.000    REJECT H_0
        u1 >= u2    u1 < u2     1.000    ACCEPT H_0
        u1 == u2    u1 != u2    0.000    REJECT H_0

    System: 95%CI for grep:reboot.res - grep:noreboot.res = (-0.232, 0.026)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     0.944    ACCEPT H_0
        u1 >= u2    u1 < u2     0.056    ACCEPT H_0
        u1 == u2    u1 != u2    0.112    ACCEPT H_0

    User: 95%CI for grep:reboot.res - grep:noreboot.res = (-0.393, 0.217)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     0.724    ACCEPT H_0
        u1 >= u2    u1 < u2     0.276    ACCEPT H_0
        u1 == u2    u1 != u2    0.552    ACCEPT H_0

    Wait: 95%CI for grep:reboot.res - grep:noreboot.res = (1.396, 2.329)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     0.000    REJECT H_0
        u1 >= u2    u1 < u2     1.000    ACCEPT H_0
        u1 == u2    u1 != u2    0.000    REJECT H_0

    CPU%: 95%CI for grep:reboot.res - grep:noreboot.res = (-4.009, -2.382)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     1.000    ACCEPT H_0
        u1 >= u2    u1 < u2     0        REJECT H_0
        u1 == u2    u1 != u2    0.000    REJECT H_0
From this report, we can see that grep with an intervening reboot runs for a longer period of time than without the intervening reboot (because we reject the null hypothesis of u1 <= u2 for Elapsed time). We also see that System and User times are indistinguishable for the two tests. Wait time and CPU utilization, however, are distinguishable (reboot has higher Wait and lower CPU utilization).
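As a sanity check on how to read the report, the Elapsed confidence interval can be approximately reproduced from the summary tables alone. The sketch below is not getstats code and rests on two assumptions that the text does not state: that SDEV% is the sample standard deviation expressed as a percentage of the mean, and that the interval uses a pooled variance with n1 + n2 - 2 = 18 degrees of freedom. The constant 2.101 is the standard two-sided 95% critical value of Student's t for 18 degrees of freedom.

```python
import math

# (count, mean, SDEV%) for Elapsed, copied from the two tables in the report.
reboot = (10, 40.422, 1.859)    # Sample 1
noreboot = (10, 38.751, 0.614)  # Sample 2

# Two-sided 95% critical value of Student's t with 18 degrees of freedom.
T_CRIT_DF18 = 2.101


def pooled_ci(sample1, sample2, t_crit):
    """Confidence interval for mean1 - mean2 from summary statistics."""
    n1, mean1, sdev_pct1 = sample1
    n2, mean2, sdev_pct2 = sample2
    # Assumption: SDEV% is the standard deviation as a percentage of the mean.
    sd1 = mean1 * sdev_pct1 / 100
    sd2 = mean2 * sdev_pct2 / 100
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se


low, high = pooled_ci(reboot, noreboot, T_CRIT_DF18)
print(f"95%CI for reboot - noreboot = ({low:.3f}, {high:.3f})")
```

Under these assumptions the result lands within rounding error of the reported (1.148, 2.195). Because the whole interval lies above zero, the data are inconsistent with u1 == u2 and with u1 <= u2, which matches the two REJECT H_0 rows for Elapsed.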
If you want a quieter version of the t-test, pass --set rejectonly=1 so that only rejected hypotheses are displayed. For example, getstats --set warn=0 --set rejectonly=1 --twosamplet produces the following:
    grep:noreboot.res
    NAME     COUNT  MEAN    MEDIAN  LOW     HIGH    MIN     MAX     SDEV%  HW%
    Elapsed  10     38.751  38.699  38.580  38.921  38.465  39.307  0.614  0.439
    System   10     1.796   1.790   1.677   1.915   1.580   2.080   9.255  6.620
    User     10     23.806  23.730  23.614  23.998  23.430  24.330  1.130  0.808
    Wait     10     13.149  13.158  12.912  13.386  12.725  13.797  2.519  1.802
    CPU%     10     66.071  66.019  65.556  66.586  64.899  67.075  1.090  0.779

    grep:reboot.res
    NAME     COUNT  MEAN    MEDIAN  LOW     HIGH    MIN     MAX     SDEV%  HW%    O/H
    Elapsed  10     40.422  40.661  39.885  40.960  38.301  40.788  1.859  1.330  4.314
    System   10     1.693   1.700   1.620   1.766   1.560   1.930   6.005  4.296  -5.735
    User     10     23.718  23.745  23.451  23.985  23.180  24.220  1.572  1.124  -0.370
    Wait     10     15.011  15.102  14.569  15.454  13.481  15.632  4.124  2.950  14.168
    CPU%     10     62.875  62.764  62.166  63.584  61.561  64.802  1.576  1.127  -4.837

    Comparing grep:reboot.res (Sample 1) to grep:noreboot.res (Sample 2).

    Elapsed: 95%CI for grep:reboot.res - grep:noreboot.res = (1.148, 2.195)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     0.000    REJECT H_0
        u1 == u2    u1 != u2    0.000    REJECT H_0

    Wait: 95%CI for grep:reboot.res - grep:noreboot.res = (1.396, 2.329)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 <= u2    u1 > u2     0.000    REJECT H_0
        u1 == u2    u1 != u2    0.000    REJECT H_0

    CPU%: 95%CI for grep:reboot.res - grep:noreboot.res = (-4.009, -2.382)
        Null Hyp.   Alt. Hyp.   P-value  Result
        u1 >= u2    u1 < u2     0        REJECT H_0
        u1 == u2    u1 != u2    0.000    REJECT H_0
There are several other variables that control the test. To replace u1 and u2 with their test names (e.g., u1 would be replaced with grep:reboot.res), pass --set verbosettest=1. The confidence level can be adjusted with --set confidencelevel. To determine whether two samples differ by a given delta, use --set twosampledelta=delta.
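One common way to think about a delta test is as the null hypothesis u1 - u2 == delta, which a two-sided test rejects exactly when delta falls outside the confidence interval for the difference. The sketch below applies that reading to the Elapsed interval from the report. This is an assumption about what a delta test means in general, not a description of how getstats implements twosampledelta, and the helper name is hypothetical.

```python
def delta_is_plausible(ci, delta):
    """True if delta lies inside the interval, i.e. H_0: u1 - u2 == delta
    is not rejected at the interval's confidence level."""
    low, high = ci
    return low <= delta <= high


# 95%CI for grep:reboot.res - grep:noreboot.res (Elapsed), from the report.
elapsed_ci = (1.148, 2.195)

for delta in (0.0, 1.0, 2.0):
    verdict = "ACCEPT H_0" if delta_is_plausible(elapsed_ci, delta) else "REJECT H_0"
    print(f"u1 - u2 == {delta}: {verdict}")
```

Raising the confidence level widens the interval, so fewer deltas are rejected; lowering it has the opposite effect.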
Finally, if you want to compare each sample to every other sample in a pair-wise manner, pass --pairwiset instead of --twosamplet.
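The difference between the two modes is only in which pairs of files get compared. The following sketch lists the pairs for each mode; it is an illustration, not getstats code. The third file name is hypothetical, and I assume each unordered pair is compared once, which the text above does not spell out.

```python
from itertools import combinations


def baseline_pairs(files):
    """--twosamplet style: the first file against each subsequent file."""
    first, rest = files[0], files[1:]
    return [(first, other) for other in rest]


def all_pairs(files):
    """--pairwiset style (assumed): every unordered pair of files."""
    return list(combinations(files, 2))


# grep:cold.res is a hypothetical third result file.
files = ["grep:noreboot.res", "grep:reboot.res", "grep:cold.res"]

print(baseline_pairs(files))
print(all_pairs(files))
```

With n files, the first mode yields n - 1 comparisons and the second n(n - 1)/2, so pair-wise reports grow quickly as files are added.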