Tuesday, August 25, 2009

New Jersey and Virginia: A Diagnostic Comparison of the State of the Race(s)

Part of the reason FHQ wanted to examine the New Jersey and Virginia gubernatorial races this year was less the races themselves than what they represent: an opportunity to test a few things in terms of how we process new polling information as it comes in. To this point, though, we have essentially leaned on the graduated weighted averaging formula used with a fair amount of success in the presidential race last year. There's nothing wrong with that formula. It was far simpler than some of the alternatives out there and only missed North Carolina and Indiana in the electoral college categorization (Even then, North Carolina was essentially a tie in the average. But I digress...). The governors' races in New Jersey and Virginia, then, are being utilized with an eye toward 2012 and the electoral college.

First of all, one feature that would have been nice last year (for every state or at the very least the swing states) is a graph similar to the ones FHQ has appended to each New Jersey and Virginia polling update. As I said recently, though, the lines on those charts seem to be floating in space without some baseline for comparison (the actual raw polling data, for instance). But that got me to thinking: The graduated weighted average is constructed to give the most recent poll the most weight, but also to incorporate past polling data in a way that guards against a shock to the system, an anomaly. And all that really is is a poor man's regression line. My question then was, how does the graduated weighted average stack up against a simple regression projection based on the polling data we have at our disposal? Sure, I could take an "everything but the kitchen sink" approach and add seemingly relevant variables to my heart's delight, but let's see how a simple bivariate regression does as a start. Remember, Virginia and New Jersey are test cases for the 2012 electoral college model.
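To make the idea concrete, here is a minimal sketch of a recency-weighted average. To be clear, this is not the actual graduated weighted averaging formula; the geometric `decay` weighting, the function name and the poll numbers are all assumptions made up for illustration. The point is only to show how the newest poll can count the most while older polls still cushion an anomaly.

```python
def recency_weighted_average(shares, decay=0.8):
    """Average poll shares (oldest first), weighting newer polls more heavily.

    `decay` is a hypothetical parameter: each step back in time multiplies
    a poll's weight by this factor. It is NOT FHQ's actual weighting scheme.
    """
    if not shares:
        raise ValueError("need at least one poll")
    n = len(shares)
    # The newest poll gets weight 1, the one before it `decay`, then decay**2, ...
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, shares)) / sum(weights)


# Invented shares (percent), oldest first; the last poll is an outlier.
polls = [42.0, 43.0, 42.0, 50.0]
simple_mean = sum(polls) / len(polls)

print(round(simple_mean, 2))
print(round(recency_weighted_average(polls), 2))
```

With these invented numbers the weighted figure lands above the simple mean but well short of the outlying final poll, which is the "guard against a shock" behavior described above.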

So all I did was regress each candidate's share of support in the polls conducted so far on the time elapsed in the campaign (measured as the number of days in the campaign*). All that basically does is provide us with a trendline based on the hypothesis that over time there will be some changes to a candidate's level of support. Yes, that is ambiguous, so let me be a bit more specific. Most clearly, we can hypothesize that over time, the undecided share will decrease and, in this particular instance, that the Republican share will increase. Indeed, in both states, the time component explained a surprising amount of the variation across polls in the undecided share as well as in both Chris Christie's and Bob McDonnell's support (between 30 and 50%).
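For anyone who wants to see the mechanics, a bivariate regression of this sort can be sketched in a few lines. This is an ordinary least squares fit written out by hand; the function name and the data are invented for illustration and are not the polls used in this post. The third value returned is the R-squared, the share of variation in support that time accounts for.

```python
def fit_trendline(days, shares):
    """Ordinary least squares fit of share = intercept + slope * day.

    Returns (intercept, slope, r_squared).
    """
    n = len(days)
    if n < 2 or n != len(shares):
        raise ValueError("need two or more matched observations")
    mean_x = sum(days) / n
    mean_y = sum(shares) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, shares))
    syy = sum((y - mean_y) ** 2 for y in shares)
    if sxx == 0:
        raise ValueError("all polls fall on the same day")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in support accounted for by the line.
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r_squared


# Invented data: day of the campaign and one candidate's share (percent).
days = [10, 40, 70, 100, 130]
shares = [40.0, 43.0, 41.0, 45.0, 46.0]

intercept, slope, r_squared = fit_trendline(days, shares)
print(round(slope, 4), round(r_squared, 2))
```

A positive slope corresponds to a candidate gaining over the campaign; a flat series of polls would produce a slope near zero and a low R-squared.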

But the two Democratic models performed far worse. In both cases, less than 10% of the variation in Deeds' and Corzine's shares was accounted for by time. Why? Well, in neither case is there much change to speak of. There's more change in the Deeds case than for Corzine, but not by much.

Fine, what does any of this mean and what does it leave us with graphically? Good questions. I'll take the second one first and then use the two figures below to illustrate the first. Graphically, as you can see below, it leaves us with a bit of a mess. Nine separate lines are a lot to take in. However, there is a wealth of information in these two figures. The most volatile lines are the raw polling data (referred to there as actual), while the two smoother lines, which are based upon the raw data and around which it hovers, are the graduated weighted average (average) and the regression projection (predicted).

The raw data are nice, but let's focus on the other two lines, as this post is supposed to be about comparing two different projections of the state of each of these races.

[Figure: New Jersey polling lines (actual, average, predicted)]

In the New Jersey example, we see that the graduated weighted average and the regression line track each other almost exactly in the case of Jon Corzine. Again, there isn't terribly much change, relatively speaking, in the Corzine numbers and that keeps the lines closer together. Where there is more volatility, there is more divergence between the two measures. This is most clear among the undecideds. Since June, the graduated weighted average has put the share of New Jersey voters yet to be claimed by either campaign at a consistently higher level than the regression has. The same sort of phenomenon can be seen in Chris Christie's numbers. However, in this case the graduated weighted average comes in below where the regression finds the Republican candidate's support across these polls. And on the whole, the difference between the two measures appears to be growing over time. If the regression prediction is the more reliable measure (and that is an arguable point given the simplicity of the model), then the graduated weighted average is losing predictive power over time with regard to Chris Christie's share of support in this race. That isn't really the best trajectory to be on if you are attempting to use polling information as a means of forecasting the results of an election.
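One rough way to put a number on "the difference appears to be growing" is to take the gap between the two measures at each poll and compare the early gaps to the later ones. The sketch below does just that. The function names are hypothetical, the half-versus-half comparison is a crude check I am assuming for illustration, and the values are invented rather than read off the New Jersey figure.

```python
def gap_series(average, predicted):
    """Point-by-point difference: weighted average minus regression line."""
    if len(average) != len(predicted):
        raise ValueError("series must be the same length")
    return [a - p for a, p in zip(average, predicted)]


def gap_is_widening(gaps):
    """Crude check: is the mean absolute gap larger in the later half?"""
    if len(gaps) < 2:
        raise ValueError("need at least two points")
    half = len(gaps) // 2
    early = [abs(g) for g in gaps[:half]]
    late = [abs(g) for g in gaps[half:]]
    return sum(late) / len(late) > sum(early) / len(early)


# Invented values (percent) for one candidate at five successive polls.
average = [40.0, 40.5, 41.0, 41.5, 42.0]
predicted = [40.2, 41.0, 42.0, 43.0, 44.0]

gaps = gap_series(average, predicted)
print([round(g, 1) for g in gaps])
print(gap_is_widening(gaps))
```

Negative gaps here mean the average sits below the regression line, the pattern described for Christie; positive gaps would match the undecided case.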

But New Jersey is just one case. How do things look further south in the Old Dominion?

[Figure: Virginia polling lines (actual, average, predicted)]

Things are a bit more muddled in the Virginia example and that is largely attributable to the differences across the two races.

First of all, there are far fewer polls that have been conducted in the Virginia race. Still, given the window of time that is being considered in each race, each state is averaging a poll every seven or eight days. Regardless, fewer polls overall in the Virginia case translates to more volatility in the graduated weighted average.

But also, there have been different dynamics at work in the two races. In New Jersey, Jon Corzine has been stuck in a holding pattern in the polls while Chris Christie has, on the whole, gained over time. The Virginia case is quite different. The polls showed a close race early, but over time that has yielded to a seemingly comfortable McDonnell lead.

The smaller window of time in Virginia means that there has been less time for past polling results to decay and fewer new polls to outweigh them. As a consequence, the graduated weighted average is stuck to some degree, overvaluing some of those past results that were more Deeds-heavy. Well wait, doesn't that really mean that this graduated weighted averaging methodology is biased against Republicans? It happens to be in this case. But what the average is really biased against is rapid change. And in 2009, both Republican candidates are the ones who are moving in the polls, at least as compared to their Democratic counterparts. Which brings us to the crux of the matter: The issue with the average has always been whether the past polls are over- or undervalued. In this comparison, it seems as if those past polls are still being overvalued, potentially at the expense of gleaning the true state of each race.
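The bias against change, rather than against a party, is easy to demonstrate with a toy example. The sketch below assumes a simple geometric recency weighting (a stand-in, not the actual graduated weighted average) and three invented series of polls: one rising steadily, one flat and one falling. The function names and the `decay` parameter are hypothetical.

```python
def weighted_average(shares, decay=0.8):
    """Recency-weighted mean of shares (oldest first); `decay` is assumed."""
    n = len(shares)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, shares)) / sum(weights)


def lag_behind_latest(shares, decay=0.8):
    """How far the weighted average sits below the most recent poll."""
    return shares[-1] - weighted_average(shares, decay)


# Invented series (percent), oldest first.
rising = [40.0 + i for i in range(10)]   # gains a point per poll
flat = [45.0] * 10                       # no movement at all
falling = rising[::-1]                   # loses a point per poll

for name, series in [("rising", rising), ("flat", flat), ("falling", falling)]:
    print(name, round(lag_behind_latest(series), 2))
```

The average trails the candidate who is gaining, matches the one who is not moving and flatters the one who is slipping, whichever party each happens to belong to.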

But let's return to those Virginia results for a moment. With the above caveats in mind, we would expect the average to underperform the regression in the McDonnell model, while for Deeds the two lines would remain rather close to each other (with the average slightly overperforming). All that means is that the status quo from poll to poll is protected more in the case of the average than with the progression of the regression trendline.

What does this mean for the graduated weighted average? I'm not apt to scrap it just yet. This exercise is helpful in determining the usefulness of the measure in settings other than the electoral college (and even for the electoral college, truth be told). Again, FHQ even examining these races in the first place is a function of tweaking the measure with 2012 on the distant horizon. Better to do it while 2012 is still distant than once it is on top of us, during or after the midterms next year.

As for what this means for what you'll see in subsequent iterations of these polling updates, you'll continue to see the Actual vs. Average trend and will likely see occasional (and perhaps more advanced) regression model predictions. So, be on the lookout for that.

*For New Jersey, that means the number of days since the first of the year (as Christie was the clear-cut Republican frontrunner to challenge Corzine) and in Virginia, the time since the Washington Post's endorsement of Creigh Deeds in the Democratic primary race (it was at that point that Deeds was really first seen as a legitimate candidate in the race, primary or general election).
