Poll reporting and Margin of Error

For good or ill – well, ill – polling has become a cornerstone of the Australian political process. This is not uncommon in and of itself: for nearly a century polling has been deployed in many countries as an instrument to measure public opinion, particularly with respect to political parties, leaders, and their policies. Australian political parties are developing a reputation for dispensing with leaders on the strength of poll results, but that's a matter for the parties – an issue of confidence and paranoia.

There is a further peculiarity with respect to polling in Australia, though, concerning how polls are reported. In most countries where polling has taken hold as a device for assessing public opinion, with results commonly reported in newspapers, on TV and on radio, it's conventional to cite the margin of error (and possibly the sample size) whenever poll results are reported. Not so in this country. Why this should be is difficult to say. The argument here is that the oversight should be remedied, and quickly, especially since it wouldn't greatly trouble media networks and publishers to do so.

To argue the importance of this detail in evaluating the overall results of a poll, it's useful to give a brief explanation of how the margin of error features in a poll and how it is calculated. Many people will likely be familiar with this already – those who are may skip to the last two paragraphs.

To tease out the significance of margin of error, it’s best to begin with the process by which, on a probabilistic basis, talking to a smallish group of people about an issue might be useful in determining the opinions of a country’s entire population. How can asking 1,200 people (for example) a question possibly yield a result that represents the responses a population of 23 million people might give?

The reasoning behind this isn't difficult, and it's worth setting out here to substantiate the claim that the limitations of polling should be kept in mind, and reported whenever poll results are. If a poll finds that 60% of the sample say 'yes' to a question and 40% say 'no', this should mean that if you walk up to anyone and ask the same question, the probability that the answer will be 'yes' is 60%. Six in ten people will say 'yes'. The poll (assuming the sample asked the question is representative of the population) tests the probability of any individual's answer to the question being 'yes' (or 'no').

Instead of a response to a question in a survey, suppose we want to test the probability that a coin, when tossed, will yield a ‘heads’ or ‘tails’ result. This is a good example because we already have the answer – 50% or half/half. We obviously can’t toss all the coins there are and get the actual result from all the times they might be tossed. But we can use a certain number of coin tosses as a sample and generalise from this sample to develop a general rule about the probability of a ‘heads’ or ‘tails’ result for the entire population of coins, and every event in which one of them might be tossed.

Tossing a coin once gives a sample of 1. The result is either 'heads' or 'tails'. Depending on how things pan out, the result of this poor survey of coin tosses is that in 100% of cases coins will come down 'heads', or that in 100% of cases coins come down 'tails'. So we might generalise this result to all the events of coin tosses and say that coins, when tossed, will always come down heads. This conclusion isn't entirely wrong – we'd be right half the time. But it is wildly inaccurate.

To improve the accuracy of this survey, the sample of coin tosses can be increased. Toss a coin 10 times (or 10 coins once) and it's very unlikely that the survey would show that in 100% of cases the result is 'heads'. Toss a coin 100, 500 or 1200 times and this becomes much, much more unlikely. The ideal but impractical 'sample' is an infinite number of coin tosses, which would yield the general result that a coin comes down 'heads' exactly 50% of the time, and 'tails' the other 50% (again, this is a good example because, unlike a question in a poll, we have the answer in advance).
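
To make this concrete, here's a minimal sketch in Python (my own illustration, not part of the original argument) that estimates the probability of 'heads' from samples of increasing size:

```python
import random

def heads_proportion(n_tosses: int) -> float:
    """Toss a fair coin n_tosses times and return the share of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

for n in (1, 10, 100, 500, 1200):
    print(f"{n:>5} tosses: {heads_proportion(n):.1%} heads")

# A single toss always reports 0% or 100% heads; as n grows,
# the estimate settles closer and closer to 50%.
```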

Counting all the coin tosses in history would yield perfectly accurate results as to the probability of a ‘heads’ or ‘tails’ result.  Taking a sample as representative of all coin tosses will never yield perfectly accurate results.  That is to say, toss a coin 10 times, and it’s quite likely it won’t come down on ‘heads’ five times and ‘tails’ five times.  Toss a coin 100 times and it still probably won’t come down ‘heads’ exactly 50 times and ‘tails’ the other 50.  However, if the conclusion from a survey of 100 coin tosses is that the probability of a ‘heads’ result is 53% and the probability of a ‘tails’ result is 47%, that’s a much more accurate result than the sample of one coin gave (a coin always comes down ‘heads’ or always comes down ‘tails’).

The larger the sample of coin tosses, the closer the result is likely to get to the 'right' answer: 50% 'heads', 50% 'tails'. The 'right' answer to a question to which the answer is 'yes' or 'no' may not be that 50% say 'yes' and 50% say 'no'. But whatever the true answer is, to get a perfectly accurate answer every individual in the population would have to be asked the question. If the sample is representative of the population, though, the likelihood that the answer the sample gives matches this perfectly accurate result increases as the sample size increases.

Although the result of a poll that puts a question to a sample of individuals in the population will not be absolutely accurate, we can at least (using, for example, the rules for coin tosses) work out how inaccurate the result is likely to be. This inaccuracy is consistent and unchanging for a given sample size, whether what is being tested is coin tosses or the answer to a question in a survey. Further, it is essentially unaffected by how large the total population is: a sample of 1200 people is as effective in assessing the opinions of a population of 23 million people as it is for a population of 350 million.
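
That claim about population size can be checked with the textbook finite-population correction. The sketch below is my own illustration: the 1.96 multiplier corresponds to the conventional 95% confidence level, and p = 0.5 is the worst case for a yes/no question.

```python
import math

def moe_for_population(n: int, population: int, p: float = 0.5) -> float:
    """95% margin of error for a sample of n drawn from a finite population."""
    moe = 1.96 * math.sqrt(p * (1 - p) / n)               # simple 95% margin of error
    fpc = math.sqrt((population - n) / (population - 1))  # finite-population correction
    return moe * fpc

print(f"{moe_for_population(1200, 23_000_000):.4%}")   # ~2.83%
print(f"{moe_for_population(1200, 350_000_000):.4%}")  # ~2.83% -- effectively identical
```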

Because most political polls use particular sample sizes, we don't need to calculate or remember the degree of inaccuracy, or Margin of Error, in every case (there are calculators available online that will let you plug in a figure and get a result). So I'll take two examples: a sample size of 500 and a sample size of 1200. Again, the statistical margin of error is the same whether the sample is coins or people being asked a question.
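
For readers who'd rather compute it than use an online calculator, here's a minimal stand-in – an assumption-laden sketch using the textbook formula at 95% confidence and the worst-case p = 0.5, not any pollster's exact method. The exact figures come out slightly under the rounded +/-5% and +/-3% used below.

```python
import math

def margin_of_error(sample_size: int) -> float:
    """Textbook 95% margin of error for a proportion, worst case p = 0.5."""
    return 1.96 * math.sqrt(0.5 * 0.5 / sample_size)

print(f"n = 500:  +/-{margin_of_error(500):.1%}")   # ~4.4%, conventionally rounded to 5%
print(f"n = 1200: +/-{margin_of_error(1200):.1%}")  # ~2.8%, conventionally rounded to 3%
```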

If 500 coins are tossed, or 500 people are asked a question, then provided the coin tosses are representative of all coin tosses, or the people representative of the population, there is a 95% chance the poll result will be accurate to within 5% of the 'true' result. For example, if 500 coins are tossed 20 times, then 19 out of those 20 times the coins will come down 'heads' somewhere between 225 and 275 times. Likewise 'tails'. So the Margin of Error for this poll, with this sample size, is +/-5% (at a 95% level of confidence).

If 1200 coins are tossed (or 1200 people asked a question), as can be expected the Margin of Error is smaller: +/-3% at the same 95% level of confidence. If 1200 coins are tossed 20 times, 19 out of 20 times the coins will come down 'heads' somewhere between 564 and 636 times. Three percent of 1200 is 36, and so there is a 95% chance the result of 1200 coin tosses will fall within 36 either side of the 'true' result.
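
The '19 out of 20' claim can also be verified by brute force. This sketch (the 10,000-repeat figure is my own arbitrary choice, made for a stable estimate) repeatedly tosses 1200 fair coins and counts how often the heads total lands in the 564-636 band:

```python
import random

def heads_in_1200() -> int:
    """Toss 1200 fair coins and return the number of heads."""
    return sum(random.random() < 0.5 for _ in range(1200))

trials = 10_000
in_band = sum(564 <= heads_in_1200() <= 636 for _ in range(trials))
print(f"{in_band / trials:.1%} of repeats fell within 564-636")  # roughly 19 in 20
```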

So it is that if 1200 people are asked a question (and 1200 is a common sample size for polls precisely because it gets us to that +/-3% margin of error), then 19 out of 20 times the result will be within a similar margin of whatever the true result is. Suppose, for example, that 1200 Australian citizens are asked whether they prefer politician x or politician y. There are all kinds of problems with this question, not just probabilistic ones, but for the moment let's suppose the result is that 52% (624) say they prefer x and 48% (576) say they prefer y. The question is, to what extent does this result reflect the result we would get if all 23 million Australians were asked (or all those old enough to answer)? And what does it tell us about the reputation and popularity of x and y?

In this case – a finding of 52% for x, 48% for y – there is no result at all as to which of the two politicians is more popular. With a sample size of 1200 the margin of error is +/-3%, as previously established. Any result within a six percentage-point range is 'statistically insignificant'. It's not just less likely to be right; it's a meaningless difference. Going back to coin tosses, a result of 52% 'heads' with a sample of 1200 doesn't mean that overall coins will come down 'heads' more often than they do 'tails'. The result contains error because the sample is not large enough. So it is with the popularity of politicians x and y: in generalising from a sample of 1200 to a population, a difference of less than 3% either side of an even split must be written off.

A result of 52% popularity for x and 48% for y from a sample of 1200 respondents therefore means that x and y are equally popular, as far as this methodology can tell us. A result of 53% to 47% likewise yields a statistically insignificant difference in popularity – one politician is no more popular than the other, as far as we can tell from the sample. A result of 54% to 46%, though, places politician y in trouble. In this case each result sits more than the +/-3% margin of error away from an even split, and the difference is therefore statistically significant. Particularly significant for the politician in question.
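
Put as code, the decision rule the last three paragraphs apply might look like this (the function name, the whole-percentage-point inputs and the strict reading of the 3-point boundary are my own choices):

```python
def lead_is_significant(share_pct: int, moe_pct: int = 3) -> bool:
    """share_pct is the leading politician's share in whole percentage points."""
    return abs(share_pct - 50) > moe_pct

for share in (52, 53, 54):
    verdict = "statistically significant" if lead_is_significant(share) else "no result"
    print(f"{share}% vs {100 - share}%: {verdict}")
# 52/48 and 53/47 fall within the margin; only 54/46 clears it.
```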

One last thing to return to here: this +/-3% margin of error is calculated at a certain level of confidence – 95%. If the coins are tossed or the question asked twenty times, it should reasonably be expected that on one of those occasions the error in the result will be larger than 3% either way. So there is another dimension of uncertainty here, one that means we cannot be entirely sure even of results that fall outside the margin of error. Repeating the poll – asking the question again – seems a good way to deal with this problem, although repeating a survey in such a way as to achieve reliably consistent results brings its own set of problems.

When polls are reported in Australian media the margin of error is often omitted. I say ‘Australian media’ because this is unusual.  On-screen graphics on TV in most countries include the margin of error with the poll result. Newsreaders, by convention, will mention the margin of error. Not in Australia, though, and it isn’t clear why.

Of course, TV and radio reports of poll results are most often second-hand reports. There is usually a primary source for the poll in one newspaper or another, and online, and here the margin of error is given. But when the details of a poll are published on pages eight and nine of a newspaper, it is not unusual to see a related headline on the front page screaming, for example, 'Politician y loses ground!' Nor is it unusual that the headline references a result that pages eight and nine reveal is statistically insignificant, according to the margin of error. A loss of 3% against rival politician x does not constitute losing ground. It is no result at all: nothing has changed.

(For my own grasp of MoE and much more I'm indebted to Professor Warwick Blood, under whose tutelage I began teaching research methods.)