Daily Kos

Statistics for Political Bloggers

Mon Nov 24, 2003 at 10:46:54 PM PDT

Today's entry is about margins of random sampling error in political surveys.  I'm qualified to teach you about this because, before I sold out and became a lawyer, I was a math major in college.

Disclaimer: The most important kind of survey error comes from picking a sample that is in some way skewed or biased.  Essentially all surveys you see in newspapers, press releases, and blogs are telephone surveys.  Pollsters do their best, but non-answers, unlisted numbers, people without phones, people with multiple phones, and more can skew their call lists.  Also, people sometimes lie to survey takers.  None of this is captured by the analysis below.

The analysis below looks at a basic problem.  You have a target group.  For a general election, this is typically likely voters.  For a Democratic primary, it is likely primary voters or caucus attenders, as the case may be.  There are millions of these critters out there.  The survey typically asks 300 to 1500 people in the relevant group some questions.  The questions we care most about ask respondents to choose from a list of candidates for a particular office.

No matter how perfect your sampling methods are, random chance is going to cause the 300 to 1500 people you pick to not prefer candidates in exactly the same proportions as the entire target population.  But it is possible to show, with some fancy mathematics, that as your sample gets larger, your results will look more and more like the general population.  Random fluke results tend to average out in larger surveys.  Fancy mathematics also shows that if a survey of the same population is repeated, the results tend to cluster around one value, and that the more distant values are quite unlikely.

The flukiness of survey results due to random sampling differences from the population is very well defined mathematically.  In cases where the target population is significantly greater than the survey sample, there are only two formulas that really matter, plus a couple of corollaries.

Any particular result in a survey has its own margin of error.  For example, in a survey with a sample size of 408, if Dean polls 32%, the margin of error of this result at the 95% confidence level is 4.5%.  Popular convention states the margin of error at the 95% confidence level, which means that if the survey is repeated over and over again, 95% of the results will fall within the margin of error range.

This convention is arbitrary.  A 95% confidence level corresponds to results within 1.96 standard deviations of the "mean" result.  A 99% confidence level corresponds to results within 2.58 standard deviations of the "mean" result.  A 90% confidence level is 1.65 standard deviations from the "mean".  A single standard deviation from the mean is a 68% confidence level -- the results are within that range about two thirds of the time.  A useful rule of thumb is to remember that two-thirds of the time, a survey result will be within half the margin of error.
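For anyone who wants to check these multipliers, here is a minimal sketch using Python's standard library.  It assumes survey results follow a normal distribution, and the function names are my own, invented for illustration.

```python
from statistics import NormalDist

def z_for_confidence(confidence: float) -> float:
    """Two-sided multiplier: how many standard deviations cover `confidence`."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def confidence_for_z(z: float) -> float:
    """Share of results falling within z standard deviations of the mean."""
    std_normal = NormalDist()
    return std_normal.cdf(z) - std_normal.cdf(-z)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> Z = {z_for_confidence(level):.2f}")

print(f"Within one standard deviation: {confidence_for_z(1.0):.1%}")
```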

The margin of error formula for large target population sizes is as follows:

MOE=Z*SQRT(P*(1-P)/N)

Where MOE is the margin of error at the confidence level for the Z chosen, Z is the number of standard deviations from the mean that defines the confidence interval, P is the percentage result expressed as a decimal, and N is the survey size.  SQRT means "square root of" and is the symbol that looks like a checkmark on your calculator.  Hence in the Dean example above it looks like this:

MOE=1.96*SQRT(.32*(1-.32)/408) which is 4.5%.
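If you would rather not punch this into a calculator, the same formula can be sketched in a few lines of Python.  The function name `margin_of_error` is just a label I made up; the numbers are the Dean example from above.

```python
from math import sqrt

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """MOE = Z * SQRT(P * (1 - P) / N), with P given as a decimal."""
    return z * sqrt(p * (1 - p) / n)

# Dean polls 32% in a survey of 408 people, at the 95% level (Z = 1.96).
dean_moe = margin_of_error(0.32, 408)
print(f"Dean MOE: {dean_moe:.1%}")
```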

When the margin of error for an entire survey is presented the "P" figure used is 50%, which is the point at which a survey is least accurate and hence a conservative estimate.  The margin of error for individual results is generally lower.  The MOE of a survey is purely a function of survey size.  It is as follows:


Survey Size          MOE

  100                 9.8%
  200                 6.9%
  300                 5.7%
  400                 4.9%
  500                 4.4%
  600                 4.0%
  1,000               3.1%
  1,500               2.5%
  3,000               1.8%
  10,000              0.98%
  50,000              0.44%
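A quick way to reproduce whole-survey figures like these is to plug P = 50% into the formula.  The sketch below does that for a handful of sample sizes drawn from the ranges mentioned in this entry; the function name `survey_moe` is my own.

```python
from math import sqrt

def survey_moe(n: int, z: float = 1.96) -> float:
    """Whole-survey MOE: the single-result formula evaluated at P = 0.5."""
    return z * sqrt(0.5 * 0.5 / n)

# Sample sizes chosen for illustration only.
for n in (100, 400, 1000, 1500, 50000):
    print(f"N = {n:>6,}   MOE = {survey_moe(n):.2%}")
```

Note that quadrupling the sample size only cuts the MOE in half, which is why very large surveys are rare.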

Most political surveys are conducted with samples of 400-1500.  Subsamples are often 100-300 in size.  The largest survey I use on a regular basis is the American Religious Identification Survey, which has a sample size of 50,000 and subsamples of 1,000.

Now onto the issue of comparing two results.

Suppose that you have a survey with a sample size of 408, and one candidate, Dean has 32% of the people supporting him, and Gephardt has 22% supporting him.  What is the likelihood that the gap is a statistical fluke?

The formula for this is as follows:

MOE of gap=Z*SQRT((P1*(1-P1)/N)+(P2*(1-P2)/N))

As applied in this case, we use a Z=1.96 for the customary 95% confidence level, P1 is Dean's level of support, and P2 is Gephardt's level of support.  The gap P1-P2=10%.  But, how accurate is that? The survey size is 408 so:

MOE=1.96*SQRT((.32*(1-.32)/408)+(.22*(1-.22)/408))

This produces a gap MOE of 6.05%.  So, the real gap between Dean and Gephardt is 10%+-6.05% (i.e. there is a 95% chance that the gap is between 4% and 16%), with about two-thirds of the results likely to come between 7% and 13%.
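Here is the same gap calculation as a short Python sketch.  It implements the formula exactly as written in this entry (which treats the two results as if they were independent); the function name `gap_moe` is hypothetical.

```python
from math import sqrt

def gap_moe(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """MOE of gap = Z * SQRT(P1*(1-P1)/N + P2*(1-P2)/N)."""
    return z * sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

dean, gephardt, n = 0.32, 0.22, 408
gap = dean - gephardt
moe = gap_moe(dean, gephardt, n)

print(f"Gap: {gap:.0%} +/- {moe:.2%}")
print(f"95% range: {gap - moe:.1%} to {gap + moe:.1%}")
```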

Now, if this is too much trouble, there is a good approximation of the gap MOE formula that is fairly easy to calculate and better than most crude methods.

MOE gap is approximately equal to the average of the MOE of candidate one's results and the MOE of candidate two's results, times a factor of 1.4.

For example, in the example above Dean has a MOE of 4.5% and Gephardt has a MOE of 4.0%.  The average is 4.25%, and that times 1.4 is 5.95%, which is quite close to the real answer.  (This works because the number inside the square root of the formula is the sum of the two candidates' terms, which is quite close to two times the average of those terms; the square root of 2 is 1.41, and with some algebra you can see that it comes out quite close.)
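To see how close the shortcut comes, the sketch below compares it against the full gap formula for the Dean and Gephardt numbers.  The function names are invented for illustration, and it uses unrounded MOEs, so the shortcut figure differs slightly from the hand calculation above.

```python
from math import sqrt

def moe(p: float, n: int, z: float = 1.96) -> float:
    """Single-result margin of error."""
    return z * sqrt(p * (1 - p) / n)

def formula_gap_moe(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """Gap MOE from the full formula in this entry."""
    return z * sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

def shortcut_gap_moe(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """Shortcut: 1.4 times the average of the two individual MOEs."""
    return 1.4 * (moe(p1, n, z) + moe(p2, n, z)) / 2

full = formula_gap_moe(0.32, 0.22, 408)
quick = shortcut_gap_moe(0.32, 0.22, 408)
print(f"Full formula: {full:.2%}   Shortcut: {quick:.2%}")
```

The shortcut gets worse as the two candidates' shares move further apart, so treat it as a sanity check rather than a replacement.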

Poll

Did this make sense?

28% (10 votes)
11% (4 votes)
20% (7 votes)
2% (1 vote)
2% (1 vote)
8% (3 votes)
11% (4 votes)
14% (5 votes)

35 votes


Permalink | 26 comments

  •  Re: Statistics for Political Bloggers (4.00 / 13)

    Rate this diary entry.

    "Those who can make you believe absurdities can make you commit atrocities" -- Voltaire

    by ohwilleke on Mon Nov 24, 2003 at 10:49:46 PM PDT

  •  Re: Statistics for Political Bloggers (none / 0)

    One more point.  The MOE formula doesn't tell you how much confidence to put in a "zero" result.  For example, suppose that no one responds that they support Kucinich in a poll of 400 people.  What is the upper limit on Kucinich's support at the 95% confidence level?

    The answer is that the maximum level of support p for a candidate with zero respondents, at a confidence level C, with both expressed as decimals, satisfies:

    (1-p)^N=(1-C).

    In the example above (1-p)^400=.05

    Or, alternately, log (1-p)=(log .05)/400

    Someone whose scientific calculator hasn't broken, or has a slide rule or log table handy, can figure out what that number is.
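For those without a working calculator, a minimal Python sketch of this calculation follows.  It simply solves (1-p)^N=(1-C) for p; the function name is my own invention.

```python
def max_support_given_zero(n: int, confidence: float = 0.95) -> float:
    """Solve (1 - p)^N = 1 - C for p, the largest plausible true support."""
    return 1 - (1 - confidence) ** (1 / n)

# Nobody out of 400 respondents names the candidate.
upper = max_support_given_zero(400)
print(f"Upper limit at 95% confidence: {upper:.2%}")
```

In other words, a zero result in a poll of 400 is consistent with true support of up to roughly three quarters of one percent.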

    "Those who can make you believe absurdities can make you commit atrocities" -- Voltaire

    by ohwilleke on Mon Nov 24, 2003 at 10:56:50 PM PDT

  •  Re: Statistics for Political Bloggers (none / 0)

    I like this.. I honestly suck at math, but this makes a little sense to me, and since I hated stats in college, this refreshes my memory a little.
  •  Re: Statistics for Political Bloggers (3.66 / 3)

    As a Statistics Professor, I want to make a few corrections and points:

    1. Under a normal distribution, about 68% (not 64%) of the data is within one standard deviation of the mean. Other than that, your analysis of the single proportion confidence interval is correct.

    2. The formula MOE of gap=Z*SQRT((P1*(1-P1)/N)+(P2*(1-P2)/N)) is for use when you have two independent samples. Since the Dean and Gephardt numbers come from the same poll, they are dependent statistics (errors in favor of Dean would go against the others), and this formula is not appropriate here.

    3. The biggest problem in my opinion: This analysis assumes a random sample from a population of actual voters. The most difficult problem for pollsters is to adjust the sampled population to match the actual population. They may use Bayesian techniques such as adjusting the sample for known demographics of the population. This is why we see Fox's numbers consistently different from Newsweek's numbers, as they are applying different methods in determining "likely voters".

    •  Re: Statistics for Political Bloggers (none / 0)

      As a Statistics Professor, I want to make a few corrections and points:

      etc.

      I knew there was an option missing in the poll.

      * My head hurts.

    •  Re: Statistics for Political Bloggers (none / 0)

      Fair criticism on all points, I stand corrected.  How would you suggest approaching the statistical significance of a gap between two results in the same survey on a practical basis?

      My problem is with those who simply say: Well, Dean's numbers are 32% +- 4.5% and Gephardt's numbers are 22% +- 4.0%, hence the gap is subject to an 8.5% inaccuracy.  That overstates the amount of error, since it assumes both results sit at the extremes of their margins of error, and hence is really more than a 95% confidence level.

      "Those who can make you believe absurdities can make you commit atrocities" -- Voltaire

      by ohwilleke on Mon Nov 24, 2003 at 11:18:20 PM PDT


      •  Re: Statistics for Political Bloggers (none / 0)

        How would you suggest approaching the statistical significance of a gap between two results in the same survey on a practical basis?

        If it was just two choices, you simply use the one proportion Confidence Interval.

        If there are more than two choices, it's more complicated as you are making multiple comparisons - it follows a multinomial distribution and you could use a Chi-square goodness of fit analysis.
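For a rough practical number, one standard textbook approach (not spelled out in this thread, so treat it as an assumption) is to account for the negative covariance between two shares from the same multinomial sample.  That gives a variance for the gap of (P1 + P2 - (P1 - P2)^2)/N.  A minimal sketch, with a hypothetical function name:

```python
from math import sqrt

def same_sample_gap_moe(p1: float, p2: float, n: int, z: float = 1.96) -> float:
    """Gap MOE when both shares come from one multinomial sample.

    Var(p1 - p2) = Var(p1) + Var(p2) - 2*Cov(p1, p2), with Cov = -p1*p2/N,
    which simplifies to (p1 + p2 - (p1 - p2)**2) / N.
    """
    return z * sqrt((p1 + p2 - (p1 - p2) ** 2) / n)

# Dean 32%, Gephardt 22%, sample of 408.
moe = same_sample_gap_moe(0.32, 0.22, 408)
print(f"Same-sample gap MOE: {moe:.2%}")
```

This comes out somewhat wider than the independent-samples formula gives, because an error in Dean's favor tends to come at the other candidates' expense.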

    •  Re: Statistics for Political Bloggers (none / 0)

      As a math and stats major, I concur with Mo on the dependence of the survey.  I'm glad that you brought this issue up, as few people realize the significance of the confidence interval.  A 95% confidence interval means that 95% of the time the results are within the MOE.  However, this also means that 5% of the time (i.e. in 1 of every 20 surveys) the result falls outside the MOE.

      In terms of "likely voters", I'm not sure how the pollsters take this into account.  I'm pretty sure that some of them ask the respondent questions, and if they decide the person isn't a likely voter they completely disregard his response.  Others may attach a weighting to a vote (i.e. person X supports Dean, but based on questions that we asked X, we determine that there is a 50% chance that he'll actually make it to the polls and vote).  Anyone know for sure how pollsters do this?

      •  Re: Statistics for Political Bloggers (none / 0)

        The likely voter method seems to be a closely held metric. Zogby, for example, claims theirs is better.

        See my post below re response rates (another untalked about issue)...

        "Politics is the art of looking for trouble, finding it everywhere, diagnosing it incorrectly and applying the wrong remedies." - Groucho Marx

        by DemFromCT on Mon Nov 24, 2003 at 11:45:41 PM PDT


        •  Re: Statistics for Political Bloggers (none / 0)

          Another diary entry a few days ago noted that one of the main reasons polls other than Zogby were off in 2000 was that their estimates for the proportion of voters who were black were about 2% less than what exit polls reflected, so the samples systematically underrepresented black voters when survey responses were adjusted to reflect national demographics.

          "Those who can make you believe absurdities can make you commit atrocities" -- Voltaire

          by ohwilleke on Mon Nov 24, 2003 at 11:54:52 PM PDT


  •  Question (none / 0)

    "For example, in the example above Dean has a MOE of 4.5% and Gephardt has a MOE of 4.0.  The average is 4.25%, and that times 1.4 is 5.95%, which is quite close to the real answer. "

    How did you calculate the MOE for Gephardt and Dean in this case?

    •  Re: Question (none / 0)

      The Dean MOE is calculated in the text.  The Gephardt MOE is calculated as follows:

      1.96*SQRT(.22*(1-.22)/408)=0.0402, which is approximately 4.0%.  I should have written 4% rather than 4.0%, which overstates the accuracy of my answer.

      "Those who can make you believe absurdities can make you commit atrocities" -- Voltaire

      by ohwilleke on Mon Nov 24, 2003 at 11:27:17 PM PDT


  •  The WaPo poll as an example... (3.50 / 2)

    Claudia Deane, the asst. polling director for the WaPo, maintains a polling page with some other important discussions, such as the contact and response rates for the polls, covered in "About Washington Post Response Rates."

    She answers e-mail questions within a day or two.  For example, I emailed a question about how the WaPo poll defines regions of the country (see my diary about where the Midwest is... who knew?).

    There's more to polling analysis than statistics, although you and the prof were very helpful. For example, "Parties Spinning the Polls: Who's Right, Who's Wrong?" by Terry Neal is very instructive re why the WaPo poll gives Junior higher scores than other national polls.

    "Politics is the art of looking for trouble, finding it everywhere, diagnosing it incorrectly and applying the wrong remedies." - Groucho Marx

    by DemFromCT on Mon Nov 24, 2003 at 11:33:11 PM PDT

  •  Re: Statistics for Political Bloggers (none / 0)

    Thank you for posting this.  I was going to post something like this (I used to do survey research) but you have done a fantastic job. Your discussion of confidence intervals and sampling bias is particularly important to consider when evaluating some of the odder polling data out there.
  •  Re: Statistics for Political Bloggers (none / 0)

    This whole discussion is outstanding - I wish Kos could archive it in some special way, for handy reference.

    -- Rick Robinson

    The best fortress is to be found in the love of the people - Niccolo Machiavelli

    by al Fubar on Mon Nov 24, 2003 at 11:50:24 PM PDT

  •  Re: Statistics for Political Bloggers (none / 0)

    It's more comprehensible than anything I could have written.

    Though I am a mathematician, not a statistician, I think theoretical work in nonparametric statistics may also be relevant to sampling.  (I honestly don't know, since this isn't my field.)

    Moreover, I think that practical matters of choosing a sample are extremely tricky.  This is one reason the "survey" taken by AARP in support of the Medicare legislation was so suspect. Do you know anything about this?

    Anyway congratulations on raising the level of discussion.
