My Sports Betting Algorithm Is Finally Ready
It's been a journey, but the algorithm is finally ready to dive into action and put some money on the table.
Picking up from the last post on the algorithm, we ran into the problem of the algorithm being accurate but overwhelmed by the sportsbook’s commission. After a few weeks of further tweaks, I found a solution to this problem, re-designed the system, and got it ready to start betting real money.
Before diving in, I recommend refreshing on the last post discussing the system, as I assume a knowledge of some terminology:
Let’s first go over how this new system works:
In a typical MLB season, as many as 16 games can happen in one day, resulting in ~2,340 games played each season. So naturally, the first problem is weeding through these games and finding the best ones to add to the betting portfolio.
Betting every single day gets tedious and frustrating, so this approach will ingest all games of the week, get the 3 most confident, then, it will bet them all at the start of the week.
To do this, we take a sorting approach as follows:
Starting on Sunday, we get a schedule of all games to be played in the following week (Mon.-Sat.).
From this dataset, we drop the games that feature a game with a team that already played that week. For example, if Team A played against Team B on Monday, then on Wednesday, Team A played against Team C, we would remove that later game from our universe. We do this because we use Team A’s historical game data to make a prediction, thus we would need the stats of their Monday game to help determine whether or not they would win the Wednesday game.
For each remaining game, we assign a score rating based on the confidence of the pick. Once this is sorted from largest to smallest, we select the top 3 games. These are the bets that will be made for the next week.
Here’s an example of what that looks like for a day:
Our run differential approach gave us statistically significant accuracy in almost every season from 1998, but I wondered if adding a bit more complexity could help take the accuracy from the 60s, to the 80-90s.
To find this, I read through the most recent sports betting research papers, and stumbled upon one that caught my attention: “Use of Machine Learning and Deep Learning to Predict the Outcomes of Major League Baseball Matches”
By using 40 variables in a traditional machine learning model, the author claimed to reach a 94% accuracy!
Those numbers were higher than anything I’d ever thought possible, so naturally, I tried re-creating it to see the results for myself:
Unlike in the paper, I didn’t have much experience with Neural Networks, so I tried just using the variables in a comparative way instead. The idea is that the variables are what lead to the high accuracy, not the type of math applied on them. So, I first gathered data for the following variables, excluding a few that were too hard to get:
Once I had those stats for each team, I repeated the same sorting approach, but with the sum of the variables being the new “score”. For each matchup, it would bet that the team with the higher score would be the winner. Let’s see what happened: