Over a period of 10 years, databright has been a major contributor to the CalculatedPolitics website, which uses data analytics to predict the outcomes of Canadian federal and provincial elections.
The projection is built on statistical models that incorporate polling data, previous election results, candidate incumbency, and several other contributing factors. A package developed in R executes thousands of simulations to arrive at a final projection. During the 2015 Federal Election, the projection correctly predicted the winning candidate in over 75% of ridings.
The predictive model was initially developed out of curiosity, as a way to present the possible outcome of an election at any point during the campaign, and it quickly became a success. Over the years, it has been continuously enhanced and has become increasingly rigorous in its application of statistical techniques.
Some of the challenges encountered included:
- Determining the key variables that are predictive of election outcomes, and the degree to which their weights change as an election approaches.
- Employing an approach in which models can be updated quickly, with relatively little manual intervention. This matters most as election day approaches and the frequency of polls increases.
- Simulating thousands of elections in order to develop a distribution of results that accounts for variability in the underlying variables and assumptions.
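The first two challenges, weighting inputs differently over time and updating quickly as new polls arrive, can be illustrated with a recency-weighted polling average. The following is a minimal sketch in Python rather than the R used for the actual package. The exponential decay rule, the `half_life_days` parameter, the `Poll` structure, and the sample figures are all illustrative assumptions, not the weighting used on CalculatedPolitics.

```python
from dataclasses import dataclass


@dataclass
class Poll:
    days_old: int      # days between the poll's field date and today
    sample_size: int   # number of respondents
    support: dict      # party name -> share of decided voters (0..1)


def weighted_average(polls, half_life_days=7.0):
    """Combine polls, weighting each by sample size and recency.

    A poll's weight halves every `half_life_days` days. This decay rule is
    an assumption made for illustration. Every poll is assumed to report
    the same set of parties.
    """
    totals = {}
    weight_sum = 0.0
    for poll in polls:
        weight = poll.sample_size * 0.5 ** (poll.days_old / half_life_days)
        weight_sum += weight
        for party, share in poll.support.items():
            totals[party] = totals.get(party, 0.0) + weight * share
    return {party: total / weight_sum for party, total in totals.items()}


# Hypothetical polls with made-up numbers, for illustration only.
polls = [
    Poll(days_old=0, sample_size=1000, support={"A": 0.40, "B": 0.35, "C": 0.25}),
    Poll(days_old=14, sample_size=1000, support={"A": 0.30, "B": 0.45, "C": 0.25}),
]
average = weighted_average(polls)
```

Because the average is recomputed from the full list of polls, adding a newly released poll only requires appending it to the list and rerunning the function, which is one way to keep manual intervention low.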
The resulting product, which is available on CalculatedPolitics, incorporates the following components:
- Many different models were backtested against previous elections in order to determine which variables best explained election outcomes.
- Multiple linear regression was used to fit the models, with different models being developed for different regions.
- Polling data is often only available on a macro scale, so additional modeling was done to translate it to a more regional level.
- A package was developed within the R language in order to format data, execute models and generate outputs in PHP format.
- Software was developed in a modular fashion so that components could be reused in other applications (e.g., simulation engine).
- Structure was implemented in such a way that the software could be used for any election in Canada.
- Variance in underlying data is a concern in any model, so a simulation engine was developed to run the election in each riding thousands of times. To run a projection for Canada, we perform over 1.5 million simulations!
- A Monte Carlo approach is used to provide randomization and to better simulate a true election by introducing variability.
- Based on the simulations, a distribution is developed that allows us to calculate the probability of a given candidate winning.
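Two of the components above, translating macro-level polling to individual ridings and simulating each riding many times, can be sketched as follows. This is a simplified Python illustration, not the R package itself. The uniform-swing rule in `project_riding`, the independent Gaussian noise with standard deviation `sd` in `simulate_riding`, the function names, and all numbers are assumptions chosen for the example. The actual model uses regression-based projections whose details are not described here.

```python
import random


def project_riding(previous, regional_then, regional_now):
    """Apply a uniform regional swing to a riding's previous result.

    Each party's share moves by the change in its regional polling, is
    floored at zero, and the shares are rescaled to sum to 1.
    """
    shifted = {
        party: max(share + regional_now[party] - regional_then[party], 0.0)
        for party, share in previous.items()
    }
    total = sum(shifted.values())
    return {party: share / total for party, share in shifted.items()}


def simulate_riding(projected, n_sims=10_000, sd=0.04, seed=None):
    """Estimate each party's win probability by Monte Carlo simulation.

    Each simulated election adds independent Gaussian noise to every
    party's projected share; the party with the highest share wins.
    """
    rng = random.Random(seed)
    wins = {party: 0 for party in projected}
    for _ in range(n_sims):
        draw = {party: rng.gauss(share, sd) for party, share in projected.items()}
        winner = max(draw, key=draw.get)
        wins[winner] += 1
    return {party: count / n_sims for party, count in wins.items()}


# Hypothetical riding and regional figures, for illustration only.
previous = {"A": 0.42, "B": 0.38, "C": 0.20}
regional_then = {"A": 0.40, "B": 0.35, "C": 0.25}
regional_now = {"A": 0.37, "B": 0.39, "C": 0.24}

projected = project_riding(previous, regional_then, regional_now)
probabilities = simulate_riding(projected, n_sims=10_000, seed=1)
```

In this example the point projection puts party B narrowly ahead of party A, but the simulation still assigns A a substantial share of wins. Reporting that probability, rather than only the projected winner, is what the distribution of simulated results makes possible. Repeating the same loop for every riding is how the total number of simulations grows so large.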
Results & Recommendations
Some of the key findings and recommendations associated with this project are as follows:
- Using a Monte Carlo approach incorporates variability in the underlying variables, which generates a more robust solution than simply employing the models directly without simulation.
- The modular design provides a portable solution that can be quickly deployed for similar projections.
- The design allows for implementing additional variables (inputs) as required in a relatively simple manner.
The development of this solution has provided databright with a platform that can be used on other projects, and provides some fundamental tools and skills for future work.