Data Analytics and Horse Racing

I have always been fascinated by horse racing, and I became involved at a very young age. The unpredictability of animals and their behaviour, coupled with the task of predicting outcomes, has always interested me.

More recently, I became excited by the challenge of using data analytics and predictive modelling to predict these outcomes. I wondered whether machine learning could make accurate and reliable predictions, so I set out to explore this as a project.

Exploratory Data Analysis

I began with many horse racing datasets obtained from various sources. Specifically, the data included race information (track, distance, class, etc.) and horse information (everything about the horse including its age, sex, win strike rate and details of its past performances).

Data must be reliable and of high quality; otherwise it will almost always lead to wrong outcomes or predictions. This is why I spent a lot of time gathering as much data as I could on the different aspects of horse racing. I then cleaned the data to remove unwanted records, missing values, duplicate values, etc.
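A minimal sketch of this kind of cleaning step might look like the following. The column names (race_id, horse, barrier, distance_m, finish_pos) and the sample rows are hypothetical, chosen only for illustration, and the rule used to spot duplicates is an assumption rather than a description of the actual pipeline.

```python
# Hypothetical column names; the real datasets may be laid out differently.
REQUIRED = ("race_id", "horse", "barrier", "distance_m", "finish_pos")


def clean_rows(rows):
    """Drop rows with missing required values and repeated (race, horse) entries."""
    seen = set()
    cleaned = []
    for row in rows:
        # Skip rows where any required field is absent or blank.
        if any(row.get(col) in (None, "") for col in REQUIRED):
            continue
        # Assumption: a horse should appear at most once per race.
        key = (row["race_id"], row["horse"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Toy rows for illustration only, not real racing data.
raw_rows = [
    {"race_id": "R1", "horse": "A", "barrier": 1, "distance_m": 1000, "finish_pos": 1},
    {"race_id": "R1", "horse": "A", "barrier": 1, "distance_m": 1000, "finish_pos": 1},
    {"race_id": "R1", "horse": "B", "barrier": None, "distance_m": 1000, "finish_pos": 2},
    {"race_id": "R2", "horse": "A", "barrier": 4, "distance_m": 2400, "finish_pos": 3},
]
cleaned = clean_rows(raw_rows)
```

Here the second row is dropped as a duplicate and the third because its barrier is missing, leaving two usable rows.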

After cleaning the data, the challenge was to understand which variables would be informative enough to include in my model. To that end, I performed an exploratory data analysis.

Findings

Firstly, there is a strong correlation between a horse’s barrier position and its win rate over sprint distances (usually up to 1200m). Win rates are highest from barrier position 1: inside barriers perform best, and wins drop off as the barrier position gets wider.

However, over staying distances (typically greater than 2400m), there does not appear to be a strong correlation between a horse’s barrier position and its win rate.

This suggests that a horse’s barrier position matters more over sprint distances than over staying distances.
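The comparison above can be sketched roughly as follows: group runs by distance band and barrier, then divide wins by starts. The 1200m and 2400m cut-offs come from the text; the "middle" band, the field names and the sample runs are assumptions made for this illustration and say nothing about the real results.

```python
from collections import defaultdict


def distance_band(distance_m):
    """Label a race distance using the cut-offs mentioned in the text."""
    if distance_m <= 1200:
        return "sprint"
    if distance_m > 2400:
        return "staying"
    return "middle"  # assumed label for everything in between


def win_rate_by_barrier(runs):
    """Return {(band, barrier): wins / starts} for every observed combination."""
    counts = defaultdict(lambda: [0, 0])  # (band, barrier) -> [wins, starts]
    for run in runs:
        key = (distance_band(run["distance_m"]), run["barrier"])
        counts[key][1] += 1
        if run["finish_pos"] == 1:
            counts[key][0] += 1
    return {key: wins / starts for key, (wins, starts) in counts.items()}


# Toy runs for illustration only, not real racing data.
runs = [
    {"distance_m": 1000, "barrier": 1, "finish_pos": 1},
    {"distance_m": 1200, "barrier": 1, "finish_pos": 3},
    {"distance_m": 1100, "barrier": 10, "finish_pos": 5},
    {"distance_m": 3200, "barrier": 1, "finish_pos": 2},
    {"distance_m": 2500, "barrier": 10, "finish_pos": 1},
]
rates = win_rate_by_barrier(runs)
```

With a real dataset, comparing the sprint entries against the staying entries barrier by barrier would show whether the drop-off for wide barriers appears in one band and not the other.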

Secondly, win rates vary significantly from one jockey to another and from one trainer to another. The win rate shown below for each jockey and trainer is adjusted by a number of factors. This variation suggests that the jockey and/or trainer may have predictive power for the outcome of a race, so both are included in the model.
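As a starting point, a raw (unadjusted) win rate per jockey or trainer could be computed along these lines. The adjustment factors mentioned above are not specified here, so this sketch does not attempt them; the min_starts threshold, the field names and the sample data are all assumptions for illustration.

```python
from collections import Counter


def win_rates(runs, key, min_starts=1):
    """Raw win rate per value of `key` (e.g. "jockey" or "trainer").

    Entries with fewer than `min_starts` runs are left out, since a rate
    based on one or two rides says very little.
    """
    starts = Counter(run[key] for run in runs)
    wins = Counter(run[key] for run in runs if run["finish_pos"] == 1)
    return {name: wins[name] / n for name, n in starts.items() if n >= min_starts}


# Toy runs for illustration only, not real racing data.
sample_runs = [
    {"jockey": "J1", "trainer": "T1", "finish_pos": 1},
    {"jockey": "J1", "trainer": "T2", "finish_pos": 4},
    {"jockey": "J2", "trainer": "T1", "finish_pos": 1},
    {"jockey": "J2", "trainer": "T1", "finish_pos": 1},
    {"jockey": "J3", "trainer": "T2", "finish_pos": 6},
]
jockey_rates = win_rates(sample_runs, "jockey", min_starts=2)
trainer_rates = win_rates(sample_runs, "trainer")
```

Using one function with a key argument keeps the jockey and trainer calculations identical, so any difference between the two comes from the data rather than the method.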

You can see the full analysis and model in my GitHub repository here.