In the first article of this series, we looked at understanding type I and type II errors in the context of an A/B test, and highlighted the issue of “peeking”. In the second, we illustrated a way to calculate always-valid p-values that were immune to peeking. We will now explore multiple hypothesis testing, or what happens when multiple tests are conducted on the same family of data.

We will set things up as before, with the false positive rate \(\alpha = 0.05\) and false negative rate \(\beta=0.20\).

```
library(pwr)
library(ggplot2)
set.seed(1)
mde <- 0.1 # minimum detectable effect
cr_a <- 0.25 # the expected conversion rate for group A
alpha <- 0.05 # the false positive rate
power <- 0.80 # 1-false negative rate
ptpt <- pwr.2p.test(h = ES.h(p1 = cr_a, p2 = (1+mde)*cr_a),
sig.level = alpha,
power = power
)
n_obs <- ceiling(ptpt$n)
```

To illustrate the concepts in this article, we are going to use the same `monte_carlo`

utility function that we used previously:

```
#
# monte carlo runs n_simulations and calls the callback function each time with the ... optional args
#
monte_carlo <- function(n_simulations, callback, ...){
simulations <- 1:n_simulations
sapply(1:n_simulations, function(x){
callback(...)
})
}
```

We’ll use the `monte_carlo`

utility function to run 1000 experiments, measuring whether the p.value is less than alpha **after n_obs observations**. If it is, we reject the null hypothesis. We will set the effect size to 0; we know that there is no effect and that the null hypothesis is globally true. In this case, we expect about 50 rejections and about 950 non-rejections, since 50/1000 would represent our expected maximum false positive rate of 5%.

```
set.seed(1)
# make our "true" effect zero: the null hypothesis is always true
effect <- 0
cr_b <- (1+effect)*cr_a
observations <- 2*n_obs
reject_at_i <- function(observations, i){
conversions_a <- rbinom(observations, 1, cr_a)
conversions_b <- rbinom(observations, 1, cr_b)
( prop.test(c(sum(conversions_a[1:i]),sum(conversions_b[1:i])), c(i,i))$p.value ) < alpha
}
# run the simulation
rejected.H0 <- monte_carlo(1000,
callback=reject_at_i,
observations=n_obs,
i=n_obs
)
# output the rejection table
table(rejected.H0)
```

```
## rejected.H0
## FALSE TRUE
## 939 61
```

In practice, we don’t usually test the same thing 1000 times; instead, we test it once and state that there is a maximum 5% chance that we have falsely said there was an effect when there wasn’t one^{1}.

## The Family-Wise Error Rate (FWER)

Now imagine we test two separate statistics using the same source data, with each test constrained by the same \(\alpha\) and \(\beta\) as before. What is the probability that we will detect *at least one* false positive considering the results of both tests? This is known as the family-wise error rate (FWER^{2}^{3}), and would apply to the case where a researcher claims there is a difference between the populations if any of the tests yields a positive result. It’s clear that this could present issues, as the family-wise error rate Wikipedia page illustrates:

Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random sampling error alone.

What is the FWER for the two tests? To calculate the probability that *at least one* false positive will arise in our two-test example, consider that the probability that one test will not reject the null is \(1-\alpha\). Thus, the probability that both tests will not reject the null is \((1-\alpha)^2)\) and the probability that *at least one* test will reject the null is \(1-(1-\alpha)^2\). For \(m\) tests, this generalizes to \(1-(1-\alpha)^m\). With \(\alpha=0.05\) and \(m=2\), we have:

```
m<-2
1-(1-alpha)^m
```

`## [1] 0.0975`

Let’s see if we can produce the same result with a Monte Carlo simulation. We will run the Monte Carlo for `n_trials`

and run `n_tests_per_trial`

. For each trial, if *at least one* of the `n_tests_per_trial`

results in a rejection of the null, we consider that the trial rejects the null. We should see that about 1 in 10 trials reject the null. This is implemented below:

```
set.seed(1)
n_tests_per_trial <- 2
n_trials <- 1000
rejects <- 0
for(i in 1:n_trials){
# run the sim
rejected.H0 <- monte_carlo(n_tests_per_trial,
callback=reject_at_i,
observations=n_obs,
i=n_obs
)
if(!is.na(table(rejected.H0)[2])) {
rejects <- rejects + 1
}
}
# Calculate FWER
rejects/n_trials
```

`## [1] 0.103`

Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!

A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data show significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\), then they must control for the FWER and carefully consider how individual tests reject the null.

## Controlling the FWER

There are many ways to control for the FWER, and the most conservative is the Bonferroni correction. The “Bonferroni method” will reject null hypotheses if \(p_i \le \frac{\alpha}{m}\). Let’s switch our `reject_at_i`

function for a `p_value_at_i`

function, and then add in the Bonferroni correction:

```
set.seed(1)
p_value_at_i <- function(observations, i){
conversions_a <- rbinom(observations, 1, cr_a)
conversions_b <- rbinom(observations, 1, cr_b)
prop.test(c(sum(conversions_a[1:i]),sum(conversions_b[1:i])), c(i,i))$p.value
}
n_tests_per_trial <- 2
n_trials <- 1000
rejects <- 0
for(i in 1:n_trials){
# run n_tests_per_trial
p_values <- monte_carlo(n_tests_per_trial,
callback=p_value_at_i,
observations=n_obs,
i=n_obs
)
# Bonferroni: adjust the p-values and reject any cases with p-values <= alpha
rejects <- rejects + sum(any(p_values*n_tests_per_trial <= alpha))
}
# Calculate FWER
rejects/n_trials
```

`## [1] 0.055`

With the Bonferroni correction, we see that the realized false positive rate is back near the 5% level. Note that we use `any(...)`

to add 1 if any hypothesis is rejected.

Until now, we have only shown that the Bonferroni correction controls the FWER for the case that all null hypotheses are actually true: the effect is set to zero. This is called controlling in the *weak sense*. Next, let’s use R’s `p.adjust`

function to illustrate the Bonferroni and Holm adjustments to the p-values:

```
set.seed(1)
n_tests_per_trial <- 2
n_trials <- 1000
res_bf <- c()
res_holm <- c()
for(i in 1:n_trials){
# run n_tests_per_trial
p_values <- monte_carlo(n_tests_per_trial,
callback=p_value_at_i,
observations=n_obs,
i=n_obs
)
# Bonferroni: adjust the p-values and reject/accept
bf_reject <- p.adjust(p_values, "bonferroni") <= alpha
res_bf <- c(res_bf, sum(any(bf_reject)))
# Holm: adjust the p-values and reject/accept
holm_reject <- p.adjust(sort(p_values), "holm") <= alpha
res_holm <- c(res_holm, sum(any(holm_reject)))
}
# Calculate FWER
sum(res_bf)/length(res_bf)
```

`## [1] 0.055`

`sum(res_holm)/length(res_holm)`

`## [1] 0.055`

We see that the Holm correction is very similar to the Bonferroni correction in the case that the null hypothesis is always true.

# Strongly controlling the FWER

Both the Bonferroni and Holm corrections guarantee that the FWER is controlled in the *strong sense*, in which we have any configuration of true and non-true null hypothesis. This is ideal, because in reality, we do not know if there is an effect or not.

The Holm correction is uniformly more powerful than the Bonferroni correction, meaning that in the case that there *is* an effect and the null is false, using the Holm correction will be more likely to detect positives.

Let’s test this by randomly setting the effect size to the minimum detectable effect in about half the cases. Note the slightly modified `p_value_at_i`

function as well as the `null_true`

variable, which will randomly decide if there is a minimum detectable effect size or not for that particular trial.

Note that in the below example, we will not calculate the FWER using the same `any(...)`

construct from the previous code segments. If we were to do this, we would see that the both corrections have the same FWER *and* the same power (since the outcome of the trial is then decided by whether at least one of the hypotheses was rejected for the trial). Instead, we will tabulate the result for each of the hypotheses. We should see the same false positive rate^{4}, but great power for the Holm method.

```
set.seed(1)
p_value_at_i <- function(observations, i, cr_a, cr_b){
conversions_a <- rbinom(observations, 1, cr_a)
conversions_b <- rbinom(observations, 1, cr_b)
prop.test(c(sum(conversions_a[1:i]),sum(conversions_b[1:i])), c(i,i))$p.value
}
n_tests_per_trial <- 2
n_trials <- 1000
res_bf <- c()
res_holm <- c()
for(i in 1:n_trials){
null_true <- rbinom(1,1,prob = 0.5)
effect <- mde * null_true
cr_b <- (1+effect)*cr_a
# run n_tests_per_trial
p_values <- monte_carlo(n_tests_per_trial,
callback=p_value_at_i,
observations=n_obs,
i=n_obs,
cr_a=cr_a,
cr_b=cr_b
)
# Bonferroni: adjust the p-values and reject/accept
reject_bf <- p.adjust(p_values, "bonferroni") <= alpha
for(r in reject_bf){
res_bf <- rbind(res_bf, c(r, null_true))
}
# Holm: adjust the p-values and reject/accept
reject_holm <- p.adjust(sort(p_values), "holm") <= alpha
for(r in reject_holm){
res_holm <- rbind(res_holm, c(r, null_true))
}
}
# the rows of the table represent the test result
# while the columns represent the null truth
table_bf <- table(Test=res_bf[,1], Null=res_bf[,2])
table_holm <- table(Test=res_holm[,1], Null=res_holm[,2])
# False positive rate
fpr_bf <- table_bf['1','0']/sum(table_bf[,'0'])
fpr_holm <- table_holm['1','0']/sum(table_holm[,'0'])
print(paste0("FPR Bonferroni: ", round(fpr_bf,3), " FPR Holm: ", round(fpr_holm,3)))
```

`## [1] "FPR Bonferroni: 0.029 FPR Holm: 0.029"`

```
# Power
power_bf <- table_bf['1','1']/sum(table_bf[,'1'])
power_holm <- table_holm['1','1']/sum(table_holm[,'1'])
print(paste0("Power Bonferroni: ", round(power_bf,3), " Power Holm: ", round(power_holm,3)))
```

`## [1] "Power Bonferroni: 0.713 Power Holm: 0.774"`

```
# Comparing the Power of Holm vs. Bonferroni
print(paste0("Power Holm/Power Bonferroni: ", round(power_holm/power_bf,3)))
```

`## [1] "Power Holm/Power Bonferroni: 1.084"`

Indeed, we observe that while the realized false positive rates of both the Bonferroni and Holm methods are very similar, the Holm method has greater power. These corrections essentially reduce our threshold for each test so that across the family of tests, we produce false positives with a probability of no more than \(\alpha\). This comes at the expense of a reduction in power from the optimal power (\(1-\beta\)).

We have illustrated two methods for deciding what null hypotheses in a family of tests to reject. The Bonferroni method rejects hypotheses at the \(\alpha/m\) level. The Holm method has a more involved algorithm for which hypotheses to reject. The Bonferroni and Holm methods have the property that they *do* control the FWER at \(\alpha\), and Holm is uniformly more powerful than Bonferroni.

This raises an interesting question: What if we are not concerned about controlling the probability of detecting at least one false positive, but something else? We might be more interested in controlling the expected proportion of false discoveries amongst all discoveries, known as the false discovery rate. As a quick preview, let’s calculate the false discovery rate for our two cases:

`table_holm`

```
## Null
## Test 0 1
## 0 959 229
## 1 29 783
```

`table_bf`

```
## Null
## Test 0 1
## 0 959 290
## 1 29 722
```

```
fdr_holm <- table_holm['1','0']/sum(table_holm['1',])
fdr_holm
```

`## [1] 0.03571`

```
fdr_bf <- table_bf['1','0']/sum(table_bf['1',])
fdr_bf
```

`## [1] 0.03862`

By choosing to control for a metric other than the FWER, we may be able to produce results with power closer to the optimal power (\(1-\beta\)). We will look at the false discovery rate and other measures in a future article.

*Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn.*

And a maximum 20% chance that we said there wasn’t an effect when there was one.↩

Hochberg, Y.; Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley. p. 5. ISBN 978-0-471-82222-6↩

A sharper Bonferroni procedure for multiple tests of significance↩

Verify this by inspecting the

`table_bf`

and`table_holm`

variables.↩

You may leave a comment below or discuss the post in the forum community.rstudio.com.