The article “What the Latest Nate Silver Controversy Teaches Us About Big Data” from fortune.com analyzes the meaning behind the famous poll analyst’s Twitter response to the Huffington Post’s criticism of his presidential polling model. As of November 6, the politically left-leaning Huffington Post gave Hillary Clinton a 98% chance of winning the election over Donald Trump, while Silver’s model gave her a 64.9% chance. Part of this discrepancy came from a technique Silver uses known as trend line adjustment, which Huffington Post Bureau Chief Ryan Grim called “merely political punditry dressed up as sophisticated mathematical thinking.”
A major reason why Silver’s model is far less certain of a Clinton victory is that it treats state polling errors as correlated. He has noted that if one state’s polls overstate a candidate’s popularity, then that candidate’s popularity is likely to be overstated in states with similar demographics as well. This means that models like the Huffington Post’s, which give less weight to this correlation, may substantially understate the effect polling errors could have on the election.
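The effect of correlated errors can be illustrated with a small simulation. This is only a rough sketch and not Silver’s or the Huffington Post’s actual model: the nine equally weighted states, the 2-point polling lead, and the 3-point error size are all made-up numbers, and `win_probability` is a hypothetical helper written for this example. It compares the case where every state’s polling error is independent with the case where all states share the same error.

```python
import random

def win_probability(margins, shared_sd, state_sd, trials=20000, seed=0):
    """Estimate how often the polling leader carries a majority of states.

    margins   -- hypothetical polled leads, in percentage points, one per state
    shared_sd -- std. dev. of a polling error common to every state (correlated part)
    state_sd  -- std. dev. of an error drawn separately for each state (independent part)
    """
    rng = random.Random(seed)
    needed = len(margins) // 2 + 1
    wins = 0
    for _ in range(trials):
        # One draw that shifts every state in the same direction.
        shared_error = rng.gauss(0, shared_sd)
        carried = sum(
            1 for m in margins
            if m + shared_error + rng.gauss(0, state_sd) > 0
        )
        if carried >= needed:
            wins += 1
    return wins / trials

# Made-up scenario: the leader is ahead by 2 points in each of 9 equal states.
margins = [2.0] * 9

# Same 3-point error size per state, but independent vs. fully shared.
independent = win_probability(margins, shared_sd=0.0, state_sd=3.0)
correlated = win_probability(margins, shared_sd=3.0, state_sd=0.0)

print(f"independent errors: {independent:.2f}")
print(f"correlated errors:  {correlated:.2f}")
```

With independent errors, a miss in one state tends to be offset by the others, so the leader looks very safe. When the errors move together, one bad miss flips many states at once and the leader’s chances drop noticeably, even though the polls themselves have not changed.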
Another reason Silver gives Trump a better chance of winning this election than most analysts do is that only 11 presidential elections have taken place since 1972, the first election year in which a significant number of state polls were recorded. This small number has two implications.
First, because there is such a limited amount of data, it is important to calibrate the model on polls and results going all the way back to 1972. Many models include only more recent data, since polling from 2000 to 2012 was much more accurate than polling in earlier years. This is a mistake, because that period covers only four elections, too small a sample to conclude that polling has become significantly better.
Second, even the full record going back to 1972 is too small a sample to support very confident conclusions from polls. Eleven elections are not enough to know how much to adjust poll results based on state demographics.
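One rough way to see how little four or eleven elections can prove is to ask how unreliable a forecasting method could be and still get every one of them right by luck. The sketch below is not taken from Silver’s model or from the article; it assumes each election is an independent trial with the same true accuracy, and `min_plausible_accuracy` is a hypothetical helper. It finds the lowest accuracy p for which a perfect streak of n elections still has at least a 5% chance of happening, that is, p^n >= 0.05.

```python
def min_plausible_accuracy(n_correct_in_a_row, alpha=0.05):
    """Lowest true accuracy p that still gives a perfect streak of
    n_correct_in_a_row calls with probability at least alpha.

    Solves p ** n = alpha for p. Assumes independent elections with a
    constant accuracy, which is a simplification made for illustration.
    """
    return alpha ** (1.0 / n_correct_in_a_row)

for n in (4, 11):
    print(n, "elections:", round(min_plausible_accuracy(n), 3))
# 4 elections: 0.473
# 11 elections: 0.762
```

Under these assumptions, even a flawless record over all 11 elections would be consistent with a method that is right only about three times out of four, and a flawless record over four elections says almost nothing.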
The thing to notice about Silver’s reasoning is that he does not claim Trump’s support is being undervalued in the polls. Rather, he simply doesn’t know whether it is or isn’t, and he argues that no one at this moment can be sure. That is why giving Hillary a 98% chance of winning is outrageous: not because she’s unlikely to win, but because it’s impossible to be that sure she will.
This is what can sometimes be the problem with big data. People want it to give them absolute answers that take human error out of decision-making. But the amount of data required for that level of certainty just doesn’t exist yet in any field, be it politics, sports, economics, healthcare or anything else. Big data is extremely useful as a predictor, but it isn’t perfect.