Powerful Hypothesis Testing

Age 16 to 18

Challenge Level Yellow star

Why do this problem?

This problem is designed to help students understand that the power of a test depends on a variety of factors, which makes it a far more intricate question than that of the significance level of a test. It can also lead to an understanding that interpreting the result of a hypothesis test is not straightforward: what does a non-significant result actually mean? Is it that the null hypothesis is true, or that the experiment was simply not powerful enough to discover that it is false? The distinction between these possibilities is crucial in many areas where hypothesis testing is performed: it is too easy to incorrectly assert that the null hypothesis is true (or likely to be true). This links in well with the activity Hypothetical Shorts.

As an extension, it is also possible to work out algebraically the probability of rejecting the null hypothesis if it is false; it is important, though, to also develop a sense of how different factors affect the answer.

In this resource, we use a binomial hypothesis test for simplicity of description, but the principles apply more generally.

Possible approach

Students would benefit from having some exposure to hypothesis testing before looking at this simulation. It would also be very helpful for them to have access to the simulation themselves so that they can explore it.

The problem could be posed in a real-world context as opposed to picking balls from a bag: you could ask students to suggest real-life contexts where we would be interested in distinguishing between two competing hypotheses. For example, we could be trying to find out whether a new drug is better than the standard one, or whether eating certain foods for breakfast or doing a certain amount of exercise improves students' chances of passing a particular test. The former would lead to a decision about whether to use the drug in future, while the latter might affect advice on how best to prepare for tests. Nevertheless, the theoretical ideas are subtle enough that it is probably simpler to work with abstract coloured balls for the actual activity.

You could then explain that Robin, the experimenter, wants to know how likely it is that the experiment will successfully reject the null hypothesis if it is false. (Robin already knows that if the null hypothesis is true, the test will reject it with a probability of 5%, the significance level.)

Students may require guidance as to how to use the simulation. For example, they could begin with the default of 2 red balls, 3 green balls, $H_0\colon \pi=\frac{1}{2}$ and 50 trials, and note the proportion of experiments in which $H_0$ is rejected after running a large number of experiments. (The simulation provides this figure for students.) They could then do this again with a different proportion of red and green balls and note what changes. It would be good to ask students to make a prediction before they rerun the simulation, and to compare their prediction with the actual results.
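
For teachers who want to check the simulation's figure offline, the default setup can be sketched in Python. This is a minimal sketch under several assumptions that are not stated in the resource: that $\pi$ is the proportion of red balls, that balls are drawn with replacement, and that the test is two-tailed, rejecting $H_0$ when twice the smaller tail probability is below 0.05. The function names are illustrative only and are not part of the NRICH simulation.

```python
import random
from math import comb


def two_sided_p_value(x, n, p0):
    """Two-tailed binomial p-value: twice the smaller tail, capped at 1.

    Assumption: this doubling rule is one common convention; the NRICH
    simulation may use a different one.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    lower_tail = sum(pmf[: x + 1])  # P(X <= x) under H0
    upper_tail = sum(pmf[x:])       # P(X >= x) under H0
    return min(1.0, 2 * min(lower_tail, upper_tail))


def simulate_rejection_rate(red, green, p0, trials, experiments,
                            alpha=0.05, seed=0):
    """Proportion of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    p_true = red / (red + green)  # assumed: pi is the proportion of red balls
    # Decide once, for every possible count of reds, whether H0 is rejected.
    rejects = [two_sided_p_value(x, trials, p0) < alpha
               for x in range(trials + 1)]
    rejected = 0
    for _ in range(experiments):
        # Draw `trials` balls with replacement and count the reds.
        reds_drawn = sum(rng.random() < p_true for _ in range(trials))
        rejected += rejects[reds_drawn]
    return rejected / experiments


rate = simulate_rejection_rate(red=2, green=3, p0=0.5,
                               trials=50, experiments=10_000)
print(f"Proportion of experiments rejecting H0: {rate:.3f}")
```

Changing `red`, `green` or `trials` one at a time mirrors the exploration suggested above. Because the binomial distribution is discrete, the rejection rate under this rule when $H_0$ is true will typically come out somewhat below 5% rather than exactly 5%.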

Students could then go on to change some of the parameters in a systematic fashion and consider the questions provided.

Key questions

What does a significant result (one with the p-value below 0.05) tell us?
What factors affect the probability of obtaining a significant result if the null hypothesis is false?
What does a non-significant result (one with the p-value above 0.05) tell us?

Possible extension

Can you theoretically work out the probability of obtaining a significant result if the null hypothesis is false?
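
One way to approach this, sketched here in Python, is to find the critical region under $H_0$ and then add up the probabilities of landing in that region under the true proportion. The sketch assumes a two-tailed test with at most 2.5% in each tail; the names `critical_region` and `rejection_probability` are illustrative and not taken from the resource.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def critical_region(n, p0, alpha=0.05):
    """Return (lower, upper): reject H0 when X <= lower or X >= upper.

    Assumption: each tail is made as large as possible while keeping its
    probability under H0 at or below alpha / 2.
    """
    pmf = [binom_pmf(k, n, p0) for k in range(n + 1)]
    lower, cum = -1, 0.0
    for k in range(n + 1):            # grow the lower tail from 0 upwards
        if cum + pmf[k] > alpha / 2:
            break
        cum += pmf[k]
        lower = k
    upper, cum = n + 1, 0.0
    for k in range(n, -1, -1):        # grow the upper tail from n downwards
        if cum + pmf[k] > alpha / 2:
            break
        cum += pmf[k]
        upper = k
    return lower, upper


def rejection_probability(n, p0, p_true, alpha=0.05):
    """Probability of a significant result when the true proportion is p_true."""
    lower, upper = critical_region(n, p0, alpha)
    return sum(binom_pmf(k, n, p_true)
               for k in range(n + 1) if k <= lower or k >= upper)


# Vary one factor (the number of trials) while holding the others fixed.
for n in (20, 50, 100, 200):
    power = rejection_probability(n, p0=0.5, p_true=0.4)
    print(f"n = {n:3d}: probability of rejecting H0 = {power:.3f}")
```

Setting `p_true` equal to `p0` gives the actual probability of rejecting a true null hypothesis, which for a discrete test is at most the nominal 5%.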

Possible support

Students will benefit from being systematic when working with the simulation and recording their results as they go. There are several factors involved, and adjusting just one factor at a time is a wise thing to do.

To work out the answer to the question of what a significant result means, students may need prompting to use a tree diagram.
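
The tree diagram can be mirrored in a few lines of Python. This is only a sketch: the first set of branches needs a starting probability that $H_0$ is true, which the resource does not supply, so the value 0.5 below is purely hypothetical, as is the power of 0.24. It also assumes a single alternative with a known power. The function names are illustrative.

```python
def tree_branches(prior_h0_true, alpha, power):
    """Joint probabilities for the four end points of the tree diagram."""
    return {
        ("H0 true", "significant"): prior_h0_true * alpha,
        ("H0 true", "not significant"): prior_h0_true * (1 - alpha),
        ("H0 false", "significant"): (1 - prior_h0_true) * power,
        ("H0 false", "not significant"): (1 - prior_h0_true) * (1 - power),
    }


def prob_h0_true_given(result, branches):
    """P(H0 true | result): the 'H0 true' branch over all branches ending in result."""
    total = sum(p for (_, r), p in branches.items() if r == result)
    return branches[("H0 true", result)] / total


# Hypothetical inputs chosen for illustration only.
branches = tree_branches(prior_h0_true=0.5, alpha=0.05, power=0.24)
for result in ("significant", "not significant"):
    posterior = prob_h0_true_given(result, branches)
    print(f"P(H0 true | {result}) = {posterior:.3f}")
```

With these made-up inputs, a non-significant result leaves the probability that $H_0$ is true only a little above the assumed starting value of one half, which is the point behind the key question about non-significant results.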
