This problem is designed to help students understand the meaning of hypothesis tests, and especially why the experiment, including the sample size, must be fully specified before we begin; otherwise our results may be meaningless. There is an important technique called sequential testing which allows one to stop an experiment early while the results remain valid, but significant care must be taken in this situation, as this resource shows. (Bayesian inference takes an alternative approach to this, but that is another story entirely.)
In this resource, we use a binomial test, but the principles are more generally applicable. The solution section provides a more detailed explanation of these ideas.
Possible approach
Students would benefit from having some exposure to hypothesis testing before looking at this simulation. It would also be very helpful for them to have access to the simulation themselves so that they can explore it.
To put the problem in a real-world context as opposed to picking balls from a bag, you could ask students to suggest real-life contexts where we would want to or have to limit the number of trials in an experiment. For example, we could be doing laboratory experiments in which all of the materials involved are expensive. Or we might be trialling a new drug, where it costs a large amount to test it on each person, or where only a limited number of people have the condition the drug is designed to treat. It might be that this is an experiment on animals, and we wish to limit the number of animals we are working with for ethical reasons. A further reason, related to cost, is that each trial takes a long time, perhaps a day or two, so it is not feasible to do very large numbers of trials.
You could then explain that Robin, the experimenter, has suggested a way of saving money, as described in the problem. Your students, as budding statisticians, will need to consider Robin's proposed method, and explain why it is good and will save money, or why it is broken and will potentially give a misleading answer.
Students may require guidance on how to use the simulation. For example, they could begin with 2 red balls, 2 green balls, $H_0\colon \pi=\frac{1}{2}$ and 50 trials, hide the p-values graph, and just note the proportion of the experiments in which $H_0$ is rejected based on the final p-value. They could then repeat this but note the proportion of the experiments in which the p-value ever drops below 0.05. What does this suggest?
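If students do not have access to the simulation, the comparison described above can be approximated with a short script. The following is only a rough sketch and not the resource's own simulation: it assumes that balls are drawn with replacement (so each trial is red with probability `true_pi`), that the test is an exact two-sided binomial test of $H_0\colon \pi=\frac{1}{2}$, and that the p-value is recomputed after every trial. The function names and the default of 2000 repeated experiments are illustrative choices.

```python
import random
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def two_sided_p_value(k, n):
    """Exact two-sided binomial p-value for k reds in n trials under H0: pi = 1/2."""
    # Under pi = 1/2 the distribution is symmetric, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def run_experiment(n_trials, true_pi, rng):
    """Run one experiment; return (final p-value, smallest p-value seen so far)."""
    reds = 0
    p = 1.0
    min_p = 1.0
    for n in range(1, n_trials + 1):
        reds += rng.random() < true_pi  # True counts as 1 red ball
        p = two_sided_p_value(reds, n)
        min_p = min(min_p, p)
    return p, min_p


def rejection_rates(n_experiments=2000, n_trials=50, true_pi=0.5, alpha=0.05, seed=1):
    """Proportion of experiments rejecting H0 at the end, and at any point."""
    rng = random.Random(seed)
    final = ever = 0
    for _ in range(n_experiments):
        p_final, p_min = run_experiment(n_trials, true_pi, rng)
        final += p_final < alpha  # decision made only after the last trial
        ever += p_min < alpha     # decision made as soon as p dips below alpha
    return final / n_experiments, ever / n_experiments


final_rate, ever_rate = rejection_rates()
print(f"Rejected using the final p-value: {final_rate:.3f}")
print(f"p-value ever dropped below 0.05:  {ever_rate:.3f}")
```

Here $H_0$ is true, so every rejection is a false positive. Students could compare the two printed proportions with the nominal 5% level, then change `n_trials` or `true_pi` to see whether the pattern persists.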
Students could then go on to change some of the parameters in a systematic fashion, exploring whether their initial ideas hold true more generally.
Key questions
Is it necessary to specify the number of trials in advance?
What would happen if we didn't?
Possible extension
Is there any way of stopping the experiment early and still obtaining useful results?
What is the benefit of doing more trials? Surely we would still only reject $H_0$ 5% of the time? You can use the simulation to explore this.
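One way to explore this question without the simulation is to calculate exactly how often a fixed-size test rejects $H_0$, both when $H_0$ is true and when it is false. The sketch below assumes an exact two-sided binomial test of $H_0\colon \pi=\frac{1}{2}$ at the 5% level; the alternative value $\pi = 0.6$ and the sample sizes are hypothetical choices for illustration only.

```python
from math import comb


def two_sided_p_value(k, n):
    """Exact two-sided binomial p-value for k reds in n trials under H0: pi = 1/2."""
    # Under pi = 1/2 the distribution is symmetric, so double the smaller tail.
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def rejection_probability(n, true_pi, alpha=0.05):
    """Probability that a test with a fixed n rejects H0 when the true proportion is true_pi."""
    total = 0.0
    for k in range(n + 1):
        if two_sided_p_value(k, n) < alpha:
            # Add the chance of seeing exactly k reds when the proportion is true_pi.
            total += comb(n, k) * true_pi ** k * (1 - true_pi) ** (n - k)
    return total


print("trials  P(reject | pi = 0.5)  P(reject | pi = 0.6)")
for n in (20, 50, 100, 200):
    under_h0 = rejection_probability(n, 0.5)
    under_h1 = rejection_probability(n, 0.6)
    print(f"{n:6d}  {under_h0:20.3f}  {under_h1:20.3f}")
```

Students could compare the two columns as the number of trials grows and discuss what each column says about the test.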
Possible support
There are several things which can be changed in the simulation, and it is easy to get lost. Students will benefit from being systematic, and guiding them to structure their exploration and recording of results will help them to understand what is happening.