Validating Statistical Experiments

Ed warothe
6 min read · May 8, 2021

How to tell a good experimental design from a bad one

Introduction

Image by Mudassar Iqbal from Pixabay

Whether it’s trying to comprehend the why behind evolutionary psychology questions (the field that ties our ‘apparently’ sophisticated habits to those of primates), predict the trajectory of a muon shot through a magnetic field, or argue about the latest election polls, experiments follow us in every aspect of modern-day life. This article answers the very question millions have: how can I tell if what I’m reading is true?

I start by answering why experiments are important, then delve into the variables and factors that crafty tricksters manipulate to push their agenda. Sadly, most of these people are not tricksters; they are simply not aware of the techniques they should use to validate their statistical tests.

The Why

Statistical experiments are what allow mankind to push its limits and advance beyond what was thought possible. If an experiment is well designed, genuine discovery becomes far more likely. What separates good experimental design from bad? A good statistical experiment is specific: it spells out the question it is trying to answer to the letter. This is done via a hypothesis test, which simply tries to ascertain whether one group or process differs from another. An example: you are the production manager and would like to find out whether machine A is more efficient than machine B on the production line. The hypothesis, in this case, checks for a difference between A and B, and you delve deeper only if a difference shows up. The hypothesis test is broken into two statements: the null hypothesis, which says there is no difference, and the alternative, which states that there is one. Most statistical experimental results you see confirm a difference and report a p-value, which is the probability of observing a result at least as extreme as the one measured if the null hypothesis were true; the smaller the p-value, the stronger the evidence against the null. Mind you, statistical significance does not imply a real effect or difference. The experiment itself might be flawed and still return a statistically significant p-value.
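
As a minimal sketch of that machine A versus machine B question, the snippet below frames the null and alternative hypotheses and runs a two-sample t-test with SciPy. The hourly output figures and group sizes are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical hourly output (units produced) for the two machines.
machine_a = rng.normal(loc=102, scale=8, size=25)
machine_b = rng.normal(loc=98, scale=8, size=25)

# H0: the mean output of A equals the mean output of B.
# H1: the means differ (two-sided alternative).
t_stat, p_value = stats.ttest_ind(machine_a, machine_b)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null: the machines appear to differ.")
else:
    print("Fail to reject the null: no evidence of a difference.")
```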

Sample Size

So what should we look out for in a published statistical experiment? Since we are making predictions about an entire population based on a sample, the sample size is very important to get right. If it is too small, we cannot be very sure about our results; or rather, our confidence intervals are too wide. It cannot be too big either, mostly due to budget constraints. So we start by stating a confidence level, say 90% (which means we are 90% confident that the true effect lies within a given range of values). A larger sample size narrows that range. Where do we get our confidence intervals? From an estimate of the experimental error variance.

Confidence Intervals

What is the margin of error? It is the maximum distance we expect between our sample estimate and the true value, at a chosen confidence level. A confidence interval is our estimate plus or minus the margin of error. As stated earlier, these figures are obtained by estimating the error variance of the experiment. Types of errors in experimental design will be discussed in detail a bit later. The reader should be keen to note, however, that they are different from type 1 and type 2 errors: a type 1 error means rejecting a true null hypothesis (detecting a change where there isn't one), and a type 2 error means failing to detect a change that exists (accepting a false null hypothesis).
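
To make the margin of error concrete, here is a small sketch (on made-up, normally distributed data) that computes a 90% confidence interval for a sample mean and shows how the interval tightens as the sample grows.

```python
import numpy as np
from scipy import stats

def margin_of_error(sample, confidence=0.90):
    """Half-width of the confidence interval for a sample mean."""
    n = len(sample)
    sem = np.std(sample, ddof=1) / np.sqrt(n)          # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return t_crit * sem

rng = np.random.default_rng(0)
for n in (20, 80, 320):
    sample = rng.normal(loc=100, scale=15, size=n)
    moe = margin_of_error(sample)
    mean = sample.mean()
    print(f"n={n:4d}  estimate={mean:6.1f}  90% CI: ({mean - moe:.1f}, {mean + moe:.1f})")
```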

Power Analysis

The next thing you should look out for in experimental research is power. Power is the probability that an experiment correctly rejects a false null hypothesis (Stephanie, Statistics How To), or rather, the probability of detecting an effect that is really there. If you take a look at most published experiments you'll see that a power of 0.8 (80%) or above was used. But where does this figure come from? Picture this: you are a researcher and want to design your experiment to detect a difference that is meaningful enough; too small a difference may not be worth acting on, and chasing too big a difference could mean wasted resources. So through iteration (and domain knowledge), you structure your experiment to detect your specified difference reliably. Power is positively correlated with sample size and, as a side note, is equal to 1 minus the type 2 error rate.
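
A common way to turn those choices into a sample size is a power calculation. The sketch below uses statsmodels to ask how many observations per group would be needed to detect a hypothetical medium effect (d = 0.5) with 80% power at a 5% significance level; the numbers are assumptions, not a recipe.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5)
# with 80% power at a 5% significance level.
n_per_group = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"required sample size per group: {n_per_group:.1f}")

# Power actually achieved with only 30 observations per group.
achieved = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"power with n = 30 per group: {achieved:.2f}")
```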

Effect Size

Another item of importance is effect size. Most statistical experiments flaunt their p-values and statistical significance, but don't be fooled: those numbers cannot be taken seriously unless they are backed by a sizeable effect as well. Effect size is the difference between the two group means divided by a standard deviation, either the standard deviation of one group (Glass's delta) or, more commonly, the pooled standard deviation of both (Cohen's d). It is not confined to the range 0 to 1; by convention, values around 0.2 count as small, 0.5 as medium, and 0.8 or above as large, so anything above 0.5 indicates a fairly sizeable effect.
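
Here is a short sketch of the pooled-standard-deviation variant (Cohen's d), applied to two made-up samples like the machine outputs above.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

rng = np.random.default_rng(3)
a = rng.normal(102, 8, size=25)
b = rng.normal(98, 8, size=25)
print(f"Cohen's d = {cohens_d(a, b):.2f}")
```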

Test

The last but not least item when looking through a statistical experiment is the kind of test performed. This can range from a simple Student's t-test or ANOVA in the case of continuous data, to chi-square tests for categorical data, to multiple-comparison corrections like the Bonferroni method, which seek to reduce the probability of spurious experimental results.

In my experience (so far), most of these tests build on two basic tests: the t-test and the chi-square test. I'll briefly discuss the gist of these two, but for further information, a quick Google search yields a ton of intuitive explanations of several methods.

Most of the experiments I have come across tend to compare two groups, and the statistic being compared is usually the mean. Now, consider a situation in which your sample size is less than 30, your sample values more or less fit a normal distribution, and they are independent of each other. A t-test is used to check whether the difference between the two groups' means is statistically significant or due to random chance. The t-test is relatively easy to set up, and online calculators abound. Hand calculation of the t-statistic only requires the two means (x̄1 and x̄2), the two sample sizes (n1 and n2), and the pooled variance s² (a weighted average of the two groups' variances).

t = (x̄1 − x̄2) / √[s²(1/n1 + 1/n2)], where s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

The resulting t statistic is cross-referenced in a statistical table, together with the degrees of freedom (n1 + n2 − 2), to yield the p-value. If the p-value falls below the chosen significance level (commonly 0.05), it indicates a difference between the two means. ANOVA and regression analysis build on the same machinery and extend it to comparisons across more than two groups and to larger samples.
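
The sketch below implements that hand calculation on made-up data and checks it against SciPy's built-in test; the two should agree.

```python
import numpy as np
from scipy import stats

def two_sample_t(x1, x2):
    """Pooled two-sample t statistic and p-value, computed by hand."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value from the t distribution
    return t, p

rng = np.random.default_rng(1)
x1 = rng.normal(10.0, 2.0, size=15)
x2 = rng.normal(11.5, 2.0, size=15)

print(two_sample_t(x1, x2))
print(stats.ttest_ind(x1, x2))   # should match the hand calculation
```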

The chi-square test cross-tabulates categorical results and uses a single value to check whether the results are due to chance or whether there is a pattern between the variables. This value is the chi-square statistic, and it is checked against a chi-square table or distribution to get the p-value. The greater the number of categories (and hence degrees of freedom), the more this distribution resembles a normal distribution. Online calculators for this test are also widely available.
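
For illustration, here is a minimal chi-square sketch on a hypothetical pass/fail cross-tabulation for the two machines, using SciPy's chi2_contingency; the counts are invented.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tabulation: rows are machines A and B,
# columns are counts of passed and failed quality checks.
observed = np.array([[90, 10],
                     [78, 22]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}, degrees of freedom = {dof}")
```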

The keen reader should look through the experimental tests used for research to ensure the above basic rules are followed. Since we do not live in a perfect world, how do we handle unavoidable errors via our assumptions? Let us consider this next.

Errors

Errors can be broadly broken down into two main types: systematic error and experimental (random) error. Experimental error is the error inherent in all experiments, the difference between our result and the true value, and it cannot be done away with. The best we can do is account for it via confidence intervals and, in some cases, include an error term in the calculations.

Systematic error is what we focus on and aim to eliminate. To achieve this, the two techniques with the greatest impact are balance and randomization. But why is randomization important? The simple explanation is that we are trying as hard as possible to account for all factors affecting our result and to neutralize unseen factors, also referred to as confounding factors. Balance (for instance, choosing groups of equal size) ensures we avoid unnecessary complexity in our experimental design.
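
As a tiny sketch of both ideas, the snippet below randomly shuffles a set of hypothetical experimental units and splits them into two equal-sized treatment groups; the unit count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)

units = np.arange(40)        # 40 hypothetical experimental units
rng.shuffle(units)           # randomization breaks links with unseen (confounding) factors

# Balance: assign the shuffled units to two equal-sized treatment groups.
group_a, group_b = units[:20], units[20:]
print("group A:", sorted(group_a))
print("group B:", sorted(group_b))
```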

Another way to keep error in check is to ensure our samples are independent of each other and normally distributed, when the chosen test assumes it.

Conclusion

This article takes a deliberately simplified view of experimental design and acknowledges the complexity, experience, and iteration that go into planning an experiment. I will be posting deeper explanations of factorial design and mixed methods to complement some of the methods above. I would also recommend the book A First Course in Design and Analysis of Experiments by Gary W. Oehlert, which has excellent explanations of several concepts mentioned here.

I am very convinced that our superiority (or is it efficiency?) as a species has been and remains massively dependent on our ability to design successful experiments.

References

Stephanie. (2021, January 6). Statistical Power: What It Is, How to Calculate It. Statistics How To. https://www.statisticshowto.com/statistical-power/
