Robin has a bag containing red and green balls. Robin wants to test the following hypotheses, where $\pi$ is the proportion of green balls in the bag:
$H_0\colon \pi=\frac{1}{2}$ and $H_1\colon \pi\ne\frac{1}{2}$
Robin is allowed to take out a ball at random, note its colour and then replace it: this is called a trial. Robin can do lots of trials, but each trial has a certain cost.
Robin wants to test these hypotheses as cheaply as possible, so suggests the following approach:
"I will do at most 50 trials. If the p-value* drops below 0.05 at any point, then I will stop and reject the null hypothesis at the 5% significance level, otherwise I will accept it."
Robin tells you about this plan. What advice could you give to Robin? Warning: the computer needs a little thinking time to run the simulations!
In this simulation, you can:
specify the number of green and red balls actually in the bag (the true proportion of greens is shown with a green dashed line on the graph); note that in a real experiment we would not know this!
specify the number of trials (up to 200)
specify the proportion for the null hypothesis (which we took to be $\frac{1}{2}$ above)
choose whether to show the proportion of green balls after each ball is picked
choose whether to show the p-value after each ball is picked*
rerun the simulation ("Repeat experiment")
The "Final p-value" shows the p-value at the end of the experiment, and the orange lines are at 0.1, 0.05 and 0.01.
Here are some questions you could consider as you think about Robin's approach:
What do you notice about the patterns of proportions and p-values? Is there anything which is the same every time or most times you run the simulation?
If we repeat the experiment lots of times, how often does $H_0$ get rejected using Robin's approach? Does the answer to this depend on how many trials we perform?
Does the answer change if you change the true proportion of greens in the bag?
What would happen if you changed the hypothesised proportion $\pi$?
What would happen if you changed the significance level from 5% to 10% or 1%?
You may want to ask and explore other questions as well.
Rejecting $H_0$ when it is true is called a Type I error.
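If you would like to explore the questions above away from the interactive simulation, here is a minimal Python sketch of Robin's approach. It is not the code behind the simulation on this page; the function names (`p_value`, `run_once`, `rejection_rates`), the default of 2000 repetitions and the fixed seed are all assumptions made for illustration. For each simulated experiment it records two things: whether the p-value ever dropped below the significance level (Robin's rule), and whether the p-value at the final trial was below it (testing once, at the end).

```python
import random
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p_value(greens, k, pi0=0.5):
    """Twice the smaller tail probability under H0, capped at 1 (an assumption)."""
    pmf = [comb(k, i) * pi0**i * (1 - pi0) ** (k - i) for i in range(k + 1)]
    lower = sum(pmf[: greens + 1])   # P(X <= greens)
    upper = sum(pmf[greens:])        # P(X >= greens)
    return min(1.0, 2 * min(lower, upper))


def run_once(true_pi, max_trials=50, pi0=0.5, alpha=0.05, rng=random):
    """Simulate one experiment.

    Returns (ever_below, final_below):
      ever_below  - True if the p-value was below alpha after any trial,
                    which is when Robin would stop and reject H0
      final_below - True if the p-value after the last trial is below alpha
    """
    greens = 0
    ever_below = False
    for k in range(1, max_trials + 1):
        if rng.random() < true_pi:   # draw a ball with replacement
            greens += 1
        if p_value(greens, k, pi0) < alpha:
            ever_below = True
    final_below = p_value(greens, max_trials, pi0) < alpha
    return ever_below, final_below


def rejection_rates(true_pi, runs=2000, max_trials=50, pi0=0.5, alpha=0.05, seed=0):
    """Estimate how often H0 is rejected under each rule over many experiments."""
    rng = random.Random(seed)
    ever = final = 0
    for _ in range(runs):
        e, f = run_once(true_pi, max_trials, pi0, alpha, rng)
        ever += e
        final += f
    return ever / runs, final / runs


if __name__ == "__main__":
    robin, end_only = rejection_rates(true_pi=0.5)
    print("H0 true: Robin's rule rejects in", robin, "of runs")
    print("H0 true: testing only at the end rejects in", end_only, "of runs")
```

With `true_pi=0.5` the null hypothesis is true, so both printed figures are estimated Type I error rates. Because a p-value below the level at the last trial also counts as "at any point", the first figure can never be smaller than the second. Try changing `max_trials`, `alpha` or `true_pi` and compare what you see with the simulation.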
* To read more about p-values, have a look at What is a Hypothesis Test? The p-values here are calculated like this: after $k$ trials, we find twice the probability of obtaining this number of greens or a more extreme number in $k$ trials, assuming that $H_0$ is true. The graph shows how this p-value changes with $k$.
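The calculation described in this footnote could be sketched in Python as follows. This is one possible reading rather than the page's own code: the name `two_sided_p_value` is hypothetical, "more extreme" is taken to mean the smaller of the two tails, and the result is capped at 1, since doubling a tail probability can otherwise exceed 1.

```python
from math import comb


def two_sided_p_value(greens, k, pi0=0.5):
    """p-value after k trials with `greens` green balls, assuming H0: pi = pi0.

    Assumption: "twice the probability of this number of greens or a more
    extreme number" is read as twice the smaller tail, capped at 1.
    """
    # Binomial probabilities P(X = i) for i = 0..k under H0
    pmf = [comb(k, i) * pi0**i * (1 - pi0) ** (k - i) for i in range(k + 1)]
    lower = sum(pmf[: greens + 1])   # P(X <= greens)
    upper = sum(pmf[greens:])        # P(X >= greens)
    return min(1.0, 2 * min(lower, upper))


if __name__ == "__main__":
    print(two_sided_p_value(0, 5))   # no greens in 5 trials: 2 * (1/32) = 0.0625
    print(two_sided_p_value(0, 6))   # no greens in 6 trials: 2 * (1/64) = 0.03125
```

On this reading, with $\pi=\frac{1}{2}$ under $H_0$, the p-value cannot fall below 0.05 until at least six trials have been done, and then only if all six balls are the same colour.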
This resource was inspired by the controversy surrounding a paper published in Nature Communications, as discussed by Casper Albers here.