By Ben Csiernik
I am obsessive about being appropriately critical of research; I both love and hate everything about it. One of the things that I am most critical of is the concept of statistical power, and by proxy, the importance of sample sizes in studies (a.k.a. the number of participants in a study). After seeing a few discussions on social media over the past few weeks around some studies with VERY small sample sizes, I thought I would try to explain why we need to be extremely cautious with how we interpret these studies. After you finish this article, you should be able to understand, and maybe even explain, the dangers of small sample sizes and underpowered studies.
But first, a few plain language definitions:
P-value: I apologize to every statistician who may ever read this article, but for today’s lesson, we are going to only discuss p-values in the sense that they are often the cut off for “statistical significance”. So JUST for today, all we need to know is that in research, if a p-value from a study is under the cut off (often 0.05), then the finding is statistically significant.
Power: This is the probability that the test we run will find a statistically significant difference when this difference actually exists in the real world. Here’s an example: We want to test the idea that fresh oranges taste better than 3-week old oranges. If our power is set to a value of 0.8 or 80% (the most commonly used power in studies), we have an 80% chance of discovering that the fresh oranges taste better, as long as fresh oranges actually taste better in life, which I hope we all can agree on. High levels of power help us avoid wrongly declaring that something doesn’t have an effect, when it actually does. A study with LOW POWER is more likely to conclude that there is no difference in the tastiness of fresh oranges over old, gross oranges, and well folks, that’s just plain wrong.
Effect Size: In the simplest of language, the effect size tells us how big the difference in results between two groups is. Imagine we have two groups of people with back pain, and one group gets no treatment, while the other group receives an exercise program. At the end of the study we can compare how well exercise worked on back pain compared to receiving no treatment, and we can determine an effect size from there! The most commonly used statistic for effect sizes in musculoskeletal health research (and beyond) is Cohen’s D, which is the difference between the two group averages divided by their standard deviation. Often, Cohen’s D values are interpreted as: small = 0.2, medium = 0.5, and large = 0.8. Now since numbers like this are meaningless on their own, you will find two small images below further expanding what this means in real life, thanks to the wonderful website https://rpsychologist.com/cohend/.
On the left is a Cohen’s D of 0.3 (small-ish), and on the right is a Cohen’s D of 0.8 (pretty big!).
Okay so that might not be that simple, so I’ll expand. Basically, as Cohen’s D values/effect size increases, the more people feel…more better. This means that MORE people in one group will likely get better with one treatment over the other (in our example, exercise over doing nothing). Here is a visualized form of a Cohen’s D value of 0.8 to go along with the description above on the right. As you’ll see from the visual, while there’s a lot of overlap between the two groups, the light blue group is doing better than the dark blue group.
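For anyone who likes to see the arithmetic, here is a minimal Python sketch of how a Cohen’s D could be calculated and turned into the kind of "overlap" numbers the visualization shows. The pain improvement scores are made up purely for illustration, the function names are my own, and the overlap figures assume both groups follow normal curves with the same spread.

```python
from statistics import NormalDist, mean, variance


def cohens_d(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / pooled_var ** 0.5


def overlap(d):
    """Share of the two group curves that overlaps (assumes equal-spread normal curves)."""
    return 2 * NormalDist().cdf(-abs(d) / 2)


def probability_of_superiority(d):
    """Chance that a random person from the better group beats a random person from the other."""
    return NormalDist().cdf(d / 2 ** 0.5)


# Hypothetical pain improvement scores (higher = better), invented for this example
exercise = [3, 5, 4, 6, 2, 5]
no_treatment = [2, 4, 3, 5, 1, 4]

d_example = cohens_d(exercise, no_treatment)
print(f"Cohen's D for the made-up data: {d_example:.2f}")

for d in (0.3, 0.8):
    print(f"D = {d}: overlap {overlap(d):.0%}, "
          f"probability of superiority {probability_of_superiority(d):.0%}")
```

Under those assumptions, a Cohen’s D of 0.8 still leaves roughly 69% overlap between the groups, and a randomly picked person from the better group beats a randomly picked person from the other group only about 71% of the time.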
Now that we understand that we want high power to ensure we are finding true results when they exist, and what different effect sizes mean, we can finally get to today’s topic, sample size.
A crucial step in the design of a study is figuring out how many participants need to be included in the study, in order to properly test what we are intending to test for! Now this is another simplification, but in order to determine how many participants we need in a study, we need to know the power we want our study to have, the effect size we are expecting to detect, and the alpha level (a.k.a. the p-value cut off for statistical significance, often 0.05). Add a little math and science to those values, and you can determine how many participants you need. This is called a ‘Sample Size Calculation’, and you SHOULD BE CHECKING to see if the journal articles you are reading have done this. If one of these isn’t done, or if it’s done incorrectly, you will most likely end up with a sample size that is too small. When a sample size is too small, you now have an underpowered analysis, and finally, after 800 words, we are getting to why this is problematic.
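The "little math" can be sketched in a few lines of Python. This is a rough illustration only: it uses the textbook normal-approximation formula for comparing two equal-sized groups, not whatever software a given study used, and the function name is my own. A proper t-test based calculation typically lands a participant or two higher per group.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed in EACH of two groups (two-sided test).

    Normal approximation: n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cut off for statistical significance
    z_power = z.inv_cdf(power)          # extra margin needed to reach the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# The conventional small, medium, and large Cohen's D benchmarks
for d in (0.2, 0.5, 0.8):
    print(f"Cohen's D = {d}: about {n_per_group(d)} people per group")
```

The thing to notice is how quickly the required sample grows as the expected effect shrinks: halving the effect size roughly quadruples the number of people needed.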
To illustrate what I perceive as the biggest risk of an underpowered study (a study with too small of a sample size to detect a real effect), I did a little simulation (using code from https://osf.io/84hba/). I ran 2500 ‘studies’, with 5 different sample sizes (10, 20, 40, 70, 100) and with 5 different power levels (10%, 30%, 50%, 70%, 90%). Here’s the thing though, unlike in real life, I knew exactly what the true effect size of my experiment was. In this case, the true effect size I selected was Cohen’s D = 0.3.
In my graph, what you will see is the average effect size (on the y axis), that corresponds with the statistical power (on the x axis). Here’s the catch though, these are only the values for ‘studies’ in my simulation that returned a statistically significant p-value, which in this case was less than 0.05. Without further ado, here is the result of my simulation.
So now what? This graph shows that if we took a whole bunch of studies that *actually* had 10% statistical power (a.k.a. a sample size way too small in this case), the average effect size of the statistically significant studies was pretty darn close to Cohen’s d = 1.25. That’s a MASSIVE difference compared to the true effect size of 0.3. Even if we have 50% statistical power, the average of the ‘studies’ that would turn out to be statistically significant will have an effect size that is more than double what the true effect size is. This means that when studies are underpowered AND they find a statistically significant finding, they tend to MASSIVELY overestimate the effect of an intervention. That means that lots of research out there claims that things work way better than they actually do.
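If you want to watch this inflation happen yourself, here is a simplified Python sketch of the same idea. To be clear, this is NOT the OSF code linked above. It assumes both groups have a known standard deviation of 1 and uses a simple z-test instead of a t-test, so the exact numbers will differ from my graph, but the pattern should be the same. The seed, names, and settings are my own choices for illustration.

```python
import random
from statistics import NormalDist, fmean

TRUE_D = 0.3       # the true effect size we secretly know
ALPHA = 0.05       # cut off for statistical significance
N_STUDIES = 2500   # simulated 'studies' per sample size
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)


def simulate(n_per_group, rng):
    """Return (share of significant studies, average effect among the significant ones)."""
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference when SD = 1
    significant = []
    for _ in range(N_STUDIES):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(TRUE_D, 1.0) for _ in range(n_per_group)]
        observed_d = fmean(treated) - fmean(control)  # SD assumed to be 1, so already in D units
        if abs(observed_d) / se > Z_CRIT:             # 'statistically significant'
            significant.append(observed_d)
    share = len(significant) / N_STUDIES
    average = fmean(significant) if significant else float("nan")
    return share, average


rng = random.Random(2024)  # fixed seed so the run is repeatable
results = {n: simulate(n, rng) for n in (10, 20, 40, 70, 100)}

for n, (share, average) in results.items():
    print(f"n = {n:>3} per group: {share:.0%} significant, "
          f"average significant effect = {average:.2f} (true = {TRUE_D})")
```

With 10 people per group, only around one in ten simulated studies should come out significant, and the ones that do report an effect several times larger than the true 0.3. As the groups get bigger, more studies reach significance and their average creeps back toward the truth.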
The other side of the coin is that a study with low power is, as described above, more likely to not find a difference between groups in a study, EVEN if a true effect exists. So, when we combine the fact that underpowered studies probably won’t find a difference between groups, and if they do, the effect size is massively overblown, we NEED to be cautious with our interpretations. Simply put, studies with very small sample sizes, especially in the musculoskeletal field, should be interpreted cautiously. Underpowered studies should be used for exploratory purposes, and should not necessarily be used as justification for us to believe in whichever intervention was being tested.
Now if you’re still reading, I do want you to remember that small sample sizes don’t always mean a study/analysis is underpowered. When we are looking for a really big effect size from an intervention, we can actually have quite a small sample size. Here’s a little case study I’ve designed for us. We want to test to see if doing sit ups is better than doing push ups for neck pain. We are interested in seeing a medium effect (Cohen’s D = 0.5), we are using an alpha level of 0.05, and our minimum desired power is 0.8. When I do my sample size calculation, I am told I need 64 people in EACH group in order to detect the difference I am expecting to see. Below, you will see what my actual statistical power for my study would be depending on how many people I put in each group. See how low the power is with a small sample size in this case?
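The same relationship can be approximated in a few lines of Python. This is a rough sketch using a normal approximation rather than the exact t-test calculation behind the case study, so the values will be slightly optimistic, and the function name is my own.

```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size=0.5, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5  # how far the true effect sits from zero, in standard errors
    # Chance of landing beyond either significance cut off when the effect is real
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (10, 20, 30, 40, 50, 64):
    print(f"{n:>3} per group -> power of about {approx_power(n):.0%}")
```

With only 10 people per group, the sit ups versus push ups study would have roughly a 1 in 5 chance of detecting a true medium effect, and it only crosses the 80% mark once we get to around 64 people per group.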
Now, if we were expecting a big effect size, one that’s bigger than the 0.5 from our case study, you can see below that we would actually need a lot fewer people. If we were expecting an effect size of 1.0, we would only need around 18 people in each group.
Unfortunately, we don’t really see effect sizes this big in the musculoskeletal world; they’re super rare. Therefore, I encourage everyone to be cautious with their interpretation of studies with small sample sizes, and I hope you now know a little more about why. If you want to learn more about this from people much smarter than me, I’ve provided some links below.
1) How to perform a power analysis (a 39 minute lecture on Power analysis by Dan Quintana from the University of Oslo)
2) 16 Minutes of Fun - YouTube
3) Coursera - Full Online Course on Statistical Inferences - Warning: basic statistics background is encouraged.
Author’s Note for statisticians #1: You’re right, calling something “underpowered” is a little lazy. The correct term should be “underpowered to detect the desired effect” or “not sensitive to detect certain effect sizes” as Richard Morey points out here: Richard Morey's Thoughts on the Term 'Underpowered'
Author’s Note for statisticians #2: If any statistician finds themselves reading this, you’re right, there are a few generalizations and one or two liberties taken. Hopefully the main takeaway still stands, that for the most part, we should be cautious about large effect sizes stemming from studies with small sample sizes in the field of musculoskeletal health.