The Notion of Statistical Power in Clinical Trials That Require Statistical Analysis
In this blog, we provide a brief overview of the notion of statistical power, along with factors that can impact it.
The notion of statistical power is one of the critical points to take into account when carrying out studies that require statistical analysis. In simple terms, the reason is that most statistical analyses include inferential analysis, in particular hypothesis tests. Hypothesis tests come with the possibility of two kinds of error, and when statistical power is low, there is a high chance of error.
As a result, the findings might not reveal the real nature of the association between the factors being studied. High statistical power is an important target, especially in clinical trials, as it is directly related to the drug manufacturer's interest. In this blog, we will review the notion of power and present several factors that can impact it.
What is Statistical Power?
To define statistical power, consider a hypothesis test with two competing hypotheses: the null hypothesis and the alternative hypothesis. Based on the available evidence, i.e., the collected data, the null might or might not be rejected in favor of the alternative. In most cases, the alternative hypothesis represents the desired conclusion; that is, the researcher tries to show that the null hypothesis is not true. This makes two kinds of error possible, called type I and type II errors:
- Type I error: The null hypothesis is true, but the test rejects it in favor of the alternative hypothesis.
- Type II error: The alternative hypothesis is true, but the test fails to reject the null in favor of the alternative.
As mentioned before, in most tests, particularly in the area of clinical trials, the desired conclusion is placed in the alternative hypothesis. For instance, when trying to prove bioequivalence, two one-sided tests are set up simultaneously. The null hypotheses of the two tests are that the geometric mean ratio is smaller than 0.80 and greater than 1.25, respectively.
The alternative hypotheses are the opposites of the nulls, i.e., that the geometric mean ratio is larger than 0.80 and smaller than 1.25, respectively; bioequivalence is concluded only when both nulls are rejected. Therefore, drug manufacturers are interested in controlling the type II error.
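The two one-sided tests can be sketched in code. The snippet below is a minimal illustration, not a regulatory-grade analysis: it applies a normal approximation to a sample of paired log-ratios (a real bioequivalence analysis would typically use a t-based interval on the geometric mean ratio of AUC or Cmax), and the function name and inputs are hypothetical.

```python
from statistics import NormalDist, mean, stdev
from math import log, sqrt

def tost_bioequivalence(log_ratios, alpha=0.05):
    """Two one-sided tests (TOST) for bioequivalence on paired log-ratios.

    H0a: mean log-ratio <= log(0.80); H0b: mean log-ratio >= log(1.25).
    Bioequivalence is concluded only if BOTH nulls are rejected.
    Uses a normal approximation for simplicity."""
    n = len(log_ratios)
    m, se = mean(log_ratios), stdev(log_ratios) / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    reject_low = (m - log(0.80)) / se > z_crit    # ratio significantly above 0.80
    reject_high = (log(1.25) - m) / se > z_crit   # ratio significantly below 1.25
    return reject_low and reject_high

ratios = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, 0.01, -0.03, 0.02, 0.00]
print(tost_bioequivalence(ratios))  # tightly clustered around 0 -> True
```

Note that the decision requires rejecting both nulls, which is why failing to control the type II error of either one-sided test can sink the whole study.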
It is worth noting that regulatory agencies are mostly concerned with the type I error, as they try to avoid certifying a drug by mistake. The type I error is controlled by setting strict significance levels for the tests, such as 0.05 or 0.1, which are usually mandated by agencies in their guidelines.
The probability of the complement of the type II error is called the statistical power of the test. Therefore, the power of a test is the probability of correctly rejecting the null in favor of the alternative.
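Under the simplifying assumptions used throughout this post (a normally distributed test statistic with known variance), the power of a one-sided z-test has a closed form. The sketch below computes it with the Python standard library; the numbers are purely illustrative.

```python
from statistics import NormalDist

def z_test_power(theta1, n, sigma=1.0, alpha=0.05):
    """Power of a one-sided z-test of H0: theta = 0 vs H1: theta = theta1 > 0."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)      # critical value under the null
    shift = theta1 * n ** 0.5 / sigma   # mean of the test statistic under H1
    return nd.cdf(shift - z_crit)       # P(reject H0 | H1 is true)

print(round(z_test_power(theta1=0.5, n=25, sigma=1.0, alpha=0.05), 3))  # -> 0.804
```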
Methods of calculating statistical power are diverse and rather complex; in fact, statistical power is one of the most complex topics in statistics, for several reasons that we will not discuss in detail here. One important reason is that deriving the true distribution of the test statistic under both the null and the alternative is typically not straightforward and on many occasions not possible.
This is especially difficult in certain situations, such as when the data do not follow a specific distribution or when several alternatives are considered at the same time.
Some classic methods provide a clear-cut formula for power, which makes calculating it very convenient. However, those formulas usually depend on assumptions about the data distribution that might not be met; most of them, in particular, require the data to be normally distributed.
There are other methods, based on simulation, that can be more realistic. However, they are usually specific to a given design rather than universal, and they depend heavily on real data that might not be readily available.
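As a sketch of the simulation approach, the snippet below estimates the power of a simple one-sided z-test by Monte Carlo: draw many samples under the alternative and count how often the test rejects. All numbers (effect, sample size, replication count) are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean

def simulated_power(theta1, n, sigma=1.0, alpha=0.05, reps=20_000, seed=1):
    """Monte Carlo estimate of the power of a one-sided z-test of H0: mean = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(theta1, sigma) for _ in range(n)]
        z = mean(sample) * n ** 0.5 / sigma   # z-statistic (sigma assumed known)
        rejections += z > z_crit
    return rejections / reps

print(round(simulated_power(theta1=0.5, n=25), 3))  # close to the analytic 0.804
```

In this toy case the analytic answer is available, so the simulation is only a check; its value is in designs where no closed-form power formula exists.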
Factors that Impact Statistical Power
There are several factors that can impact statistical power. Some of them are inherent, meaning that they depend on the nature of the data and might not be controllable. Others are adjustable (at least theoretically, although adjusting them might be practically difficult or costly!). These factors allow researchers to design their study so that the power is as high as possible.
Below are the main factors that impact statistical power. The effect of each factor is illustrated with a figure containing two plots, and each plot consists of two probability distributions: the one on the left is the distribution of the test statistic under the null hypothesis, while the one on the right is its distribution under the alternative (or, more precisely, under the true effect size).
For simplicity, it is assumed that the test statistic under both the null and the alternative is known and normally distributed. The dashed area in each figure represents the statistical power of the test.
1. The Significance Level:
The probability of the type I error is referred to as the significance level (also called the size of the test). The significance level and the statistical power move together: increasing one causes the other to increase as well (note that smaller significance levels are more desirable, so there is a trade-off). One simple explanation is that a higher significance level means the test rejects more often (we are less conservative); as a result, the chance of a correct rejection increases too, which makes the power larger.
Figure 1: Two tests with α = 0.03 (left) and α = 0.05 (right), other factors remain the same.
Figure 1 displays this fact (the dashed area represents the power of the test). While the shape and position of the test statistic under both the null and the alternative remain the same (since the sample size and the effect size are kept fixed), increasing the significance level shifts the critical value, and hence the left edge of the dashed area, to the left, causing the power to go up.
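The same comparison can be checked numerically with the closed-form normal power formula for an idealized one-sided z-test; the effect size and sample size below are illustrative.

```python
from statistics import NormalDist

def power(alpha, theta1=1.0, n=9, sigma=1.0):
    """Power of a one-sided z-test as a function of the significance level."""
    nd = NormalDist()
    return nd.cdf(theta1 * n ** 0.5 / sigma - nd.inv_cdf(1 - alpha))

for a in (0.03, 0.05):
    print(f"alpha = {a:.2f}: power = {power(a):.3f}")
```

Raising alpha from 0.03 to 0.05 lowers the critical value, so a larger slice of the alternative distribution falls in the rejection region and the power goes up.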
2. The Effect Size:
An effect size refers to the magnitude of the result as it occurs, or would be found, in the population. In other words, and in the context of clinical trial studies, the effect size is the real effect of the drug on the whole population that might take it. The true value of this effect is, of course, unknown, and we normally rely on conjectures based on data from previous studies.
Let's denote the effect in the null hypothesis by θ0 and the effect in the alternative hypothesis by θ1. The closer θ1 is to θ0, the more the distributions of the test statistic under the null and the alternative overlap. As a result, the null is more likely not to be rejected in favor of the alternative.
Figure 2: Two tests with θ0 = 0, θ1 = 1 (left) and θ1 = 3 (right), other factors remain the same.
Figure 2 shows the impact of the effect size on statistical power. Notice that the shapes of the test statistic distributions and the starting point of the dashed area remain the same (since the sample size and the significance level are kept fixed). However, shifting the distribution under the alternative to the right, as in the right frame, causes the power to increase.
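Matching the setup of the figures (a unit-variance normal test statistic centered at θ0 under the null and at θ1 under the alternative), the increase in power with the effect size can be verified directly; θ0 = 0 and the θ1 values mirror Figure 2.

```python
from statistics import NormalDist

def power(theta1, theta0=0.0, alpha=0.05):
    """Power when the test statistic is N(theta0, 1) under the null
    and N(theta1, 1) under the alternative (one-sided test)."""
    nd = NormalDist()
    return nd.cdf((theta1 - theta0) - nd.inv_cdf(1 - alpha))

for t1 in (1.0, 3.0):
    print(f"theta1 = {t1}: power = {power(t1):.3f}")
```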
3. The Sample Size:
A larger sample size results in higher statistical power. A simple explanation is that the variance of the test statistic, under both the null and the alternative, is inversely related to the sample size. A small sample size increases the variance of the test statistic, which makes both distributions flatter and more mixed, resulting in lower power.
This fact is represented in Figure 3. While the significance level and the effect size are the same in both tests, the smaller sample size results in flatter distributions in the right frame. The effect of sample size on power can also be justified by another argument. By definition, power is the probability of rejecting the null when it should truly be rejected. In this definition, the condition, "should truly be rejected," refers to the real population, while the event, "reject," refers to the sample.
As a result, the larger the sample size, the better the true state of the population is reflected in the sample. This leads to more true rejections, which implies higher statistical power.
Figure 3: Two tests with different sample sizes: small (right) and large (left), other factors remain the same.
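The sample-size effect can likewise be checked numerically under the same idealized one-sided z-test assumptions (the effect size and sample sizes below are illustrative).

```python
from statistics import NormalDist

def power(n, theta1=0.5, sigma=1.0, alpha=0.05):
    """Power of a one-sided z-test as a function of the sample size."""
    nd = NormalDist()
    return nd.cdf(theta1 * n ** 0.5 / sigma - nd.inv_cdf(1 - alpha))

for n in (10, 40):
    print(f"n = {n:2d}: power = {power(n):.3f}")
```

Inverting this formula gives the familiar sample-size rule n = ((z_(1-α) + z_power) σ / θ1)², which is how the sample size is typically chosen to hit a target power such as 80%.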
4. Variability in the Data:
The larger the variance of the sample, the larger the variance of the test statistic, and, as explained for the sample size, a large variance of the test statistic results in lower power. The impact of variance on statistical power can be illustrated similarly using Figure 3.
Note that this factor has been studied less than the other three. One main reason is that it appears to be inherent to the data and not readily adjustable, so it seemingly cannot be used to improve the power of the test.
However, an example provided by Friedman et al. shows that with a better study design (or, more precisely, a better design for data collection and cleaning), one can improve the power while keeping the sample size, the significance level, and the effect size the same!
The example concerns a clinical trial where the impact of a new drug on cholesterol level is of interest. The sample is divided into an intervention group and a control group, and the impact of the drug is measured by comparing the mean cholesterol levels of the two groups. There are two possible designs.
One can measure the cholesterol level only at the end of the trial and compare the mean values of the two groups. Alternatively, one can measure it both at the beginning and at the end of the trial, calculate the difference for each individual, and then make an inference by comparing the mean differences. Friedman et al. show that the latter design results in a lower variance and consequently a higher statistical power.
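The variance-reduction argument can be sketched with a small simulation. The design and all numbers below are illustrative assumptions, not the actual Friedman et al. example: each subject's endpoint is made strongly correlated with their baseline, so subtracting the baseline removes most of the between-subject variability.

```python
import random
from statistics import NormalDist, mean, stdev

def trial_power(use_change_score, effect=-10.0, n=30, reps=1_000, seed=7):
    """Monte Carlo power of a two-sample z-test, treatment vs control,
    analyzing either endpoints only or endpoint-minus-baseline changes."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.95)   # one-sided alpha = 0.05

    def arm(shift):
        base = [rng.gauss(200, 30) for _ in range(n)]       # between-subject spread
        end = [b + shift + rng.gauss(0, 10) for b in base]  # endpoint tracks baseline
        return base, end

    rejections = 0
    for _ in range(reps):
        cb, ce = arm(0.0)      # control: no systematic change
        tb, te = arm(effect)   # treatment: cholesterol drops on average
        if use_change_score:
            x = [e - b for b, e in zip(tb, te)]   # treatment changes
            y = [e - b for b, e in zip(cb, ce)]   # control changes
        else:
            x, y = te, ce                         # endpoint-only comparison
        se = (stdev(x) ** 2 / n + stdev(y) ** 2 / n) ** 0.5
        rejections += (mean(y) - mean(x)) / se > z_crit
    return rejections / reps

print(trial_power(use_change_score=False))  # endpoint-only design
print(trial_power(use_change_score=True))   # change-score design: much higher power
```

With the same sample size, significance level, and true drug effect, the change-score analysis rejects far more often, because the baseline subtraction strips out the large between-subject variance.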
Why BioPharma Services for Your Next Drug Development Project?
At BioPharma Services Inc., we use solid statistical approaches to evaluate data and report statistical measures. For every statistical notion, there are normally various methods available, and we strive to choose, among all the available models, the one with the most concrete scientific background, to achieve the most precise outcomes. We believe that a statistical model should not be used before its theory is fully understood and its rationale has been verified. We then translate the theoretical knowledge into practical steps and finally implement it in a statistical programming language (mostly SAS). The code is tested several times using verification data, so the results provided to our clients are theoretically sound, practically understandable, and programmatically verified.
If, at any time, our clients would like us to explain the rationale behind a method used or a result observed, we do our best to refer to the theoretical background using as simple and understandable language as possible.
Written By: Jafar Farsani, Senior Biostatistician