The following is a brief description of my NBA draft model, which uses machine learning to predict which players will succeed in the NBA. I started dabbling in this area a few years ago and have worked on steadily improving the model ever since (as much as time allows). The biggest new feature in the 2016 version is the ability to factor in individual game performances. This was in response to some draft model hate after D'Angelo Russell was consistently loved by most draft models last year.
The criticism was that "Russell was pretty much awful against good defenses. The stats are aggregations." I took that feedback to heart and decided to improve in this area by incorporating individual game logs. Game logs better capture how well a player performs against different levels of competition and different playing styles, information that can get lost in the noise of season averages (even when those averages are adjusted for strength of schedule and pace).
By incorporating this information into the model, I hope to tease out insight from big games (such as Russell's 3-19 shooting performance in the NCAA tourney against Arizona's strong defense). This idea was reinforced this year by Jaylen Brown's disastrous close to the season, in which he finished with shooting performances of 1-6, 3-17, and 1-6 along with a combined 14 turnovers in his only post-season collegiate games. At first glance, the addition of this new information didn't seem to hurt Russell's rank much, but it may be playing a role in pushing Jaylen even further down my board. At some point I'll do some analysis on how much this new information contributes to the final ranking of different players.
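To make the game-log idea concrete, here is a minimal sketch of one kind of feature that could be built from individual games: shooting efficiency split by opponent defensive strength. This is not the model's actual feature code. The field names (`fgm`, `fga`, `opp_def_rating`), the cutoff value, and the game logs themselves are all made up for illustration.

```python
# Hypothetical game logs for one player. Field names and numbers are
# invented for illustration; lower opp_def_rating = stronger defense.
game_logs = [
    {"fgm": 8, "fga": 15, "opp_def_rating": 102.0},
    {"fgm": 7, "fga": 14, "opp_def_rating": 99.5},
    {"fgm": 3, "fga": 19, "opp_def_rating": 90.0},
    {"fgm": 4, "fga": 12, "opp_def_rating": 93.0},
]


def split_fg_pct(logs, strong_defense_cutoff=95.0):
    """Field-goal percentage against strong defenses vs. everyone else.

    The cutoff is an assumed threshold, not a value from the model.
    """
    strong = [g for g in logs if g["opp_def_rating"] <= strong_defense_cutoff]
    other = [g for g in logs if g["opp_def_rating"] > strong_defense_cutoff]

    def fg_pct(games):
        attempts = sum(g["fga"] for g in games)
        # None signals "no attempts in this bucket" rather than 0%.
        return sum(g["fgm"] for g in games) / attempts if attempts else None

    return {
        "fg_pct_vs_strong": fg_pct(strong),
        "fg_pct_vs_other": fg_pct(other),
    }


features = split_fg_pct(game_logs)
```

A season average would blend these two buckets into a single number, while the split keeps the gap between them visible to the model.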
My "Longevity" draft model optimizes for "long-term value" as defined by a player's max five-year "Value over Replacement Player" (VORP). VORP is based on the stat Box Plus/Minus (BPM) and accounts for playing time, which allows injuries, durability, and coaching preferences to be factored in. That matters when the goal is to measure longevity. For active players, max VORP values are projected based on age, VORP trajectory, playing time trajectory, etc.
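As a rough illustration of the target, the sketch below reads "max five-year VORP" as the best total over any five consecutive seasons. That reading is my assumption for the example, as are the function name and the sample career. A player with no NBA seasons gets 0.0, since VORP is measured relative to replacement level.

```python
def max_five_year_vorp(season_vorps, window=5):
    """Best total VORP over any `window` consecutive seasons.

    season_vorps: per-season VORP values in chronological order.
    Assumption: careers shorter than the window just use their full total.
    """
    if not season_vorps:
        # Never played in the NBA: treated as replacement level.
        return 0.0
    if len(season_vorps) <= window:
        return float(sum(season_vorps))
    return float(
        max(
            sum(season_vorps[i:i + window])
            for i in range(len(season_vorps) - window + 1)
        )
    )


# Hypothetical seven-season career (values invented for illustration).
career = [0.5, 1.5, 2.0, 3.0, 2.5, 1.0, 0.5]
best_stretch = max_five_year_vorp(career)
```

Projecting the remaining seasons for active players would happen before this step and is not shown here.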
The "Longevity" model incorporates individual and team performance (traditional and advanced stats), measurables (age, height, weight, etc.), athletic abilities (NBA combine data), situation (teammate quality, competition, pace, position, playing time, era), and scouting (high school rank, actual/expected draft rank). Additionally, as mentioned above, the newest iteration of my model now includes metrics built from individual game logs.
The model is trained on a data set of every college player over the last 25 years, reduced to players with any NBA potential (as determined by NBA probability estimates, which are based on basic performance statistics). Players who never made the NBA are assumed to have replacement player value. Since playing styles have shifted greatly over the last 25 years, a player's performance in a given area is also measured relative to his peers from that season, which helps make effectiveness in certain areas (e.g. 3-pointers) more comparable across time. Lastly, the final model is actually a blend of many different individual models. The individual models consist of various machine learning algorithms (both linear and non-linear), all tuned in different ways.
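Two of the ideas above lend themselves to a short sketch: measuring a stat relative to same-season peers, and blending predictions from several models. The version below uses a per-season z-score and a weighted average. Both are stand-ins I picked for illustration, not necessarily what the model does, and the column names and numbers are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev


def season_relative(rows, stat):
    """Add a `<stat>_z` field: the stat standardized within each season."""
    by_season = defaultdict(list)
    for r in rows:
        by_season[r["season"]].append(r[stat])

    out = []
    for r in rows:
        vals = by_season[r["season"]]
        mu, sigma = mean(vals), pstdev(vals)
        # Guard against a season where every player has the same value.
        z = (r[stat] - mu) / sigma if sigma > 0 else 0.0
        out.append({**r, stat + "_z": z})
    return out


def blend(model_predictions, weights=None):
    """Weighted average of per-player predictions from several models.

    model_predictions: one list of predictions per model, same player order.
    weights: optional per-model weights; defaults to equal weighting.
    """
    n = len(model_predictions)
    weights = weights or [1.0 / n] * n
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, preds)) / total
        for preds in zip(*model_predictions)
    ]


# Hypothetical 3-point attempts per game in a low-volume and a high-volume era.
players = [
    {"season": 1995, "three_pa": 1.0},
    {"season": 1995, "three_pa": 2.0},
    {"season": 1995, "three_pa": 3.0},
    {"season": 2015, "three_pa": 4.0},
    {"season": 2015, "three_pa": 6.0},
    {"season": 2015, "three_pa": 8.0},
]
scaled = season_relative(players, "three_pa")

# Two made-up models scoring the same two players.
blended = blend([[1.0, 3.0], [3.0, 5.0]])
```

In this toy data, 3 attempts per game in 1995 and 8 attempts per game in 2015 end up with the same relative score, because each leads its own season by the same margin in standard-deviation terms.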
While this post is brief, a lot goes into one of these models (data cleaning, feature engineering, modeling techniques, etc.). Please contact me if you want to discuss further. Otherwise, check out results from the model in our draft board tool!