PREDICTING BASKETBALL GAME RESULTS USING MACHINE LEARNING
Top-5 finish out of 155 participants in a Kaggle competition to predict the results of a basketball season
using historical results as data sources
Dataset: Anonymous Basketball Data (Kaggle)
Technologies and Methods Used: Python, R, Machine Learning, Ensemble Modeling, Feature Engineering
INTRODUCTION
The Kaggle competition asked participants to predict the outcomes of basketball games from an anonymized league and season. It tested participants' skills in data analysis and processing, as well as their intuition for choosing the right machine learning methods. Over 150 participants enrolled, and final rankings were based on the percentage of game outcomes each team predicted correctly. My team finished 4th in the competition, with an accuracy of 69%.
DATA PREPROCESSING
The dataset provided by Kaggle was highly complex and multi-dimensional, with over 200 variables. Our first job was to sift through them and filter out data unlikely to matter, such as jersey color. We also filtered out N/A columns and summarized the data to check for outliers. My team was resourceful: we researched advanced basketball statistics, suspecting there was more signal in the box-score stats than the raw dataset exposed. We then engineered roughly 20 new features that we judged might affect the outcome and added them to our dataset (a sketch of the key calculations follows the list below). Some of the key variables we created are:
- Offensive Rating: an estimate of the points a team scored per 100 possessions (a pace-normalized measure)
- Defensive Rating: an estimate of the points a team allowed per 100 possessions (a pace-normalized measure)
- Possessions: an estimate of the number of possessions a team had during a game
- The difference between the home team's and away team's average points scored over the whole season
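These ratings follow the standard box-score estimates used in basketball analytics. The sketch below shows how such features might be computed in R; the column names (fga, fta, orb, tov, pts, opp_pts) are hypothetical stand-ins, since the competition data were anonymized and the real variable names differed.

    # Possession-based features using the common box-score approximations.
    # Column names here are hypothetical; the anonymized data used different ones.
    library(dplyr)

    add_rating_features <- function(games) {
      games %>%
        mutate(
          # Widely used possession estimate: FGA + 0.44 * FTA - ORB + TOV
          possessions = fga + 0.44 * fta - orb + tov,
          # Points scored/allowed per 100 possessions (pace-normalized)
          off_rating = 100 * pts / possessions,
          def_rating = 100 * opp_pts / possessions
        )
    }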
VARIABLE SELECTION AND MODELING
We then performed feature selection using Lasso regression (L1 regularization). Lasso identifies the features with the greatest impact on the target variable (Home Team Wins) and shrinks the coefficients of irrelevant or less important ones to zero, making the model more interpretable and potentially improving performance.
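A minimal sketch of this step with the glmnet package, where X is the numeric feature matrix and home_win is the binary target (both names are illustrative):

    # Lasso feature selection via cross-validated glmnet.
    # X: numeric feature matrix; home_win: factor with levels "No"/"Yes".
    library(glmnet)

    set.seed(42)
    cv_fit <- cv.glmnet(X, home_win, family = "binomial", alpha = 1)  # alpha = 1 -> Lasso

    # Coefficients at the cross-validated lambda; features shrunk exactly
    # to zero are dropped from further modeling.
    coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))
    selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")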
As an additional step, we identified highly correlated features using the "findCorrelation" function from the "caret" package, since highly correlated features can cause multicollinearity issues that hurt model interpretability and stability.
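That filtering step looks roughly like the following; the 0.9 cutoff is an illustrative value rather than the exact threshold we used.

    # Drop one feature from each highly correlated pair.
    library(caret)

    cor_matrix <- cor(X)
    high_cor <- findCorrelation(cor_matrix, cutoff = 0.9)  # column indices to remove
    if (length(high_cor) > 0) X <- X[, -high_cor]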
We built three models:
a. Lasso Logistic Regression: a logistic regression model with L1 regularization, with the lambda value (penalty strength) that gives the best model performance selected by cross-validation.
b. Multiple Logistic Regression: a second logistic regression model fit on the subset of features selected by Lasso, allowing a direct comparison with the Lasso model.
c. Ensemble Model: an ensemble combining the predictions of the two logistic regression models (model1 and model2). When both models agree on the outcome (both predict "Yes" or both predict "No"), the ensemble predicts accordingly; when they disagree, the tie is broken by the two models' averaged predicted probability.
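A sketch of the ensemble logic, assuming p1 and p2 hold the two models' predicted probabilities of a home win; the averaged-probability tie-break is one reasonable reading of the rule, not necessarily our exact code.

    # Agreement-based ensemble of two classifiers.
    ensemble_predict <- function(p1, p2, threshold = 0.5) {
      vote1 <- p1 >= threshold
      vote2 <- p2 >= threshold
      # Where the models agree, keep the shared vote; where they disagree,
      # fall back to the averaged probability (illustrative tie-break).
      combined <- ifelse(vote1 == vote2, vote1, (p1 + p2) / 2 >= threshold)
      factor(ifelse(combined, "Yes", "No"), levels = c("No", "Yes"))
    }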
Model Evaluation: We employed repeated random sampling to split the training dataset into training and testing sets, iteratively building and testing the models. The final accuracy score is the average prediction success rate across all iterations, as outlined below.
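In outline, the evaluation loop looks like this; the 80/20 split, the 100 repetitions, and a single refit logistic regression standing in for the full pipeline are all illustrative choices, and df and home_win are hypothetical names.

    # Repeated random train/test splits of the labeled data.
    set.seed(42)
    n_reps <- 100
    acc <- numeric(n_reps)

    for (i in seq_len(n_reps)) {
      idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # 80/20 split
      train <- df[idx, ]
      test <- df[-idx, ]

      # Refit on this split and score on the held-out games.
      fit <- glm(home_win ~ ., data = train, family = binomial)
      prob <- predict(fit, newdata = test, type = "response")
      pred <- ifelse(prob >= 0.5, "Yes", "No")
      acc[i] <- mean(pred == test$home_win)
    }

    mean(acc)  # final score: average accuracy across iterations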
INTERPRETATION AND LIMITATIONS
Interpretation
The final ensemble model predicted game outcomes with 69% accuracy.
Limitations
The main limitation is the risk of overfitting: the same dataset is used for feature selection, model training, and testing, so the reported accuracy may be optimistically biased. Performing feature selection inside each resampling iteration would give a more reliable estimate.