If we run the simulation with 50 trials, 2 red balls and 2 green balls, with $H_0\colon\pi=\frac{1}{2}$, we discover that about 5% of the time, the final p-value is less than 0.05. It might take a lot of experiments to get an accurate percentage: I did 100 experiments, and 3 times the final p-value was less than 0.05. (This is what the significance level means: it is the probability that the null hypothesis will be rejected given that it is true.)
I then did another 100 experiments, this time counting the experiments in which the p-value went below 0.05 at any point during the 50 trials: this happened in 18 of the 100. This suggests that the probability of the p-value going below 0.05 at some point is much higher than 0.05, and so Robin's approach is likely to reject the null hypothesis even when there is insufficient evidence to do so.
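The two counts described above can be reproduced with a short simulation. This is a minimal sketch, not the simulation used for this article: it assumes that balls are drawn with replacement and that the p-value comes from an exact two-sided binomial test, and the names `p_value`, `run_experiment` and `summarise` are made up for illustration.

```python
import random
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p_value(k, n, p0=0.5):
    """Exact two-sided binomial p-value for k greens in n trials (assumed test)."""
    pmf = [comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    # Add up every outcome that is no more likely than the one observed.
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def run_experiment(rng, trials=50, green=2, red=2, p0=0.5):
    """Return (final p-value, lowest p-value seen after any trial)."""
    p_green = green / (green + red)  # drawing with replacement (assumption)
    greens = 0
    current = lowest = 1.0
    for n in range(1, trials + 1):
        if rng.random() < p_green:
            greens += 1
        current = p_value(greens, n, p0)
        lowest = min(lowest, current)
    return current, lowest


def summarise(experiments=100, seed=0, alpha=0.05, **kwargs):
    """Count experiments whose final p-value, or any p-value, is below alpha."""
    rng = random.Random(seed)
    results = [run_experiment(rng, **kwargs) for _ in range(experiments)]
    final_hits = sum(final < alpha for final, _ in results)
    ever_hits = sum(lowest < alpha for _, lowest in results)
    return final_hits, ever_hits


final_hits, ever_hits = summarise()
print(f"final p-value below 0.05: {final_hits} out of 100")
print(f"p-value below 0.05 at some point: {ever_hits} out of 100")
```

Changing `seed` shows how much the counts vary from one batch of 100 experiments to the next, which is why a lot of experiments are needed for an accurate percentage.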
In fact, there is a theorem which says that if the null hypothesis is true and we keep doing trials for ever, the probability that the p-value will go below 0.05 at some point is 1. (This is clearly true if the null hypothesis is false, as the proportion of green balls will tend to the true proportion and so the p-value will tend to 0. The amazing result is that the statement holds even if the null hypothesis is true.) So if we allow ourselves to do lots of trials, Robin's approach gradually becomes even worse. For example, when I experimented with 200 trials, the p-value went below 0.05 on 23 occasions out of 100. This seems a little worse than with 50 trials, but not by that much. It turns out that one needs to do a huge number of trials to reach, say, a probability of 0.5 of obtaining a p-value less than 0.05 at some point.
Fixing Robin's approach
There is a way that we could sometimes stop early and thereby save money. Let's say that we decide that we're going to do 50 trials. If we reach the 45th trial, say, and see that it is impossible for the p-value to drop below 0.05 by the 50th trial, we can stop and accept the null hypothesis. This would take a little calculation, but could save Robin some money without invalidating the conclusion.
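The "little calculation" might look something like the following sketch. It assumes an exact two-sided binomial test, and the function names are hypothetical. The idea is that the final p-value is smallest when the remaining balls are either all green or all red, so only those two extreme cases need checking.

```python
from math import comb


def p_value(k, n, p0=0.5):
    """Exact two-sided binomial p-value for k greens in n trials (assumed test)."""
    pmf = [comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def can_still_reject(greens, done, total=50, p0=0.5, alpha=0.05):
    """Could the p-value after `total` trials still end up below alpha?"""
    remaining = total - done
    fewest = greens              # every remaining ball is red
    most = greens + remaining    # every remaining ball is green
    # The p-value is smallest at one of these two extremes.
    best = min(p_value(fewest, total, p0), p_value(most, total, p0))
    return best < alpha


print(can_still_reject(23, 45))  # False: safe to stop after 45 trials
print(can_still_reject(29, 45))  # True: the last 5 trials could still matter
```

Stopping in the first case does not change the conclusion, because the decision after 50 trials would have been the same whatever the last 5 balls turned out to be.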
There are also more sophisticated ways of analysing a sequence of trials such as these, which can allow one to reject the null hypothesis earlier if it is wrong. One needs to take account of the above problems, and adjust the calculations of p-values as one goes to ensure that the probability of incorrectly rejecting $H_0$ is still only 5%. This technique is known as sequential analysis, and is very important in modern statistics.
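To give a flavour of what such an adjustment involves, here is a deliberately crude sketch: it keeps Robin's habit of checking after every trial, but looks for a single stricter threshold so that the overall probability of wrongly rejecting $H_0$ within 50 trials stays at or below 5%. It assumes an exact two-sided binomial test, the function names are hypothetical, and real sequential methods are considerably more refined than this.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def p_value(k, n, p0=0.5):
    """Exact two-sided binomial p-value for k greens in n trials (assumed test)."""
    pmf = [comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def prob_ever_below(threshold, trials=50, p0=0.5):
    """Exact probability, when H0 is true, that the p-value drops below
    `threshold` after at least one of the first `trials` trials."""
    alive = {0: 1.0}  # greens so far -> probability of not having crossed yet
    crossed = 0.0
    for n in range(1, trials + 1):
        nxt = {}
        for k, prob in alive.items():
            for k2, step in ((k, 1 - p0), (k + 1, p0)):
                mass = prob * step
                if p_value(k2, n, p0) < threshold:
                    crossed += mass  # H0 would be rejected here
                else:
                    nxt[k2] = nxt.get(k2, 0.0) + mass
        alive = nxt
    return crossed


def calibrate(overall=0.05, trials=50, p0=0.5):
    """Bisect for a per-look threshold whose overall error is at most `overall`."""
    lo, hi = 0.0, overall
    for _ in range(30):
        mid = (lo + hi) / 2
        if prob_ever_below(mid, trials, p0) <= overall:
            lo = mid
        else:
            hi = mid
    return lo


print("overall error using 0.05 at every look:", prob_ever_below(0.05))
threshold = calibrate()
print("stricter per-look threshold:", threshold)
print("overall error with that threshold:", prob_ever_below(threshold))
```

The price of being allowed to stop early is that each individual look has to meet a much stricter standard than 0.05.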
Changing the conditions
If we change the true proportion of green balls and the hypothesised proportion $\pi$ to match it, then we still see similar behaviour to that observed earlier.
If, though, we change the true proportion to be something other than $\pi$, say by using 3 green balls and 2 red balls while keeping $H_0\colon \pi=\frac{1}{2}$, we observe that $H_0$ is rejected much more frequently. In my experiments, $H_0$ was rejected 20 times out of 100 in this case. This is good, as in this case we know that $H_0$ is not correct.
The more extreme the difference between the hypothesised $\pi$ and the true proportion, the more frequently $H_0$ is rejected.
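This trend can be checked without simulation, at least for the test based on the final p-value alone. The sketch below assumes an exact two-sided binomial test and 50 trials, with hypothetical function names. It adds up the probabilities of all the final counts of green balls that would lead to $H_0$ being rejected, for several true proportions.

```python
from math import comb


def p_value(k, n, p0=0.5):
    """Exact two-sided binomial p-value for k greens in n trials (assumed test)."""
    pmf = [comb(n, j) * p0**j * (1 - p0) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def rejection_probability(true_p, trials=50, p0=0.5, alpha=0.05):
    """Probability that the final p-value is below alpha when the true
    proportion of green balls is true_p."""
    total = 0.0
    for k in range(trials + 1):
        if p_value(k, trials, p0) < alpha:
            # This final count rejects H0; add its probability under true_p.
            total += comb(trials, k) * true_p**k * (1 - true_p) ** (trials - k)
    return total


for true_p in (0.5, 0.6, 0.7, 0.8):
    print(true_p, round(rejection_probability(true_p), 3))
```

Here `true_p=0.6` corresponds to 3 green balls and 2 red balls, and `true_p=0.5` gives the probability of rejecting $H_0$ when it is actually true.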