Training an XGBoost model for Pricing Analysis using AWS SageMaker
This project is part of the Full Stack ML in AWS course by AICamp, and is based on the data preparation steps outlined in the previous article on churn prediction. Let’s have a look at some definitions first.
How does the XGBoost algorithm work?
XGBoost is a variation of gradient boosted decision trees. These algorithms build on a much simpler model, the decision tree. In a decision tree, new data points are assigned predictions based on successive splits at the tree's nodes. To decide where to split, the tree chooses the split with the greatest information gain: for classification, the split that most reduces entropy; for regression, the split that most reduces the variance within each new subset. Other decision tree-based algorithms include Random Forests and Gradient Boosting.
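To make this concrete, here is a minimal local sketch of gradient boosted regression trees using the open-source xgboost package rather than SageMaker; the synthetic data and parameter values are illustrative only:

```python
# Minimal gradient boosted trees sketch with the open-source xgboost package.
# The synthetic "housing" data below is purely illustrative.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))  # e.g. bedrooms, bathrooms, home size
y = 50_000 * X[:, 0] + 30_000 * X[:, 1] + 10_000 * X[:, 2] + rng.normal(0, 5_000, 500)

# Each boosting round fits a new shallow tree to the residuals of the ensemble so far.
model = xgb.XGBRegressor(n_estimators=10, max_depth=3, objective="reg:squarederror")
model.fit(X, y)
print(model.predict(X[:5]))
```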
AWS SageMaker provides a built-in XGBoost algorithm, so we can set hyperparameters and run training jobs directly from the platform's UI.
What is Pricing Analysis?
Pricing Analysis is the process of setting a product price, which involves studying historical data and assessing which features of the product drive the price. In the case of housing (the dataset used in this problem), many different factors can influence prices. Number of bedrooms and bathrooms, home size, and lot size are some obvious ones, but location, school availability, and commute times also play a role. Pricing analysis is a regression problem: the aim is to predict a continuous value, which in this dataset is the price in USD.
What metrics can we use to assess performance in this problem?
In regression problems, we want to measure quantitatively how close our predictions are to the ground truth. One way is to calculate the absolute difference between each prediction and its true value and average over all predictions: this is the Mean Absolute Error (MAE). Another is to average the squared errors, giving the Mean Squared Error (MSE), or to take the square root of that average, giving the Root Mean Squared Error (RMSE). MAE weights all errors equally, whereas MSE and RMSE weight larger errors more heavily, which is why they are preferable most of the time, and especially when large errors are costly.
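As a quick sketch, here is how the three metrics are computed with NumPy; the toy prices below are placeholders:

```python
import numpy as np

y_true = np.array([250_000.0, 310_000.0, 180_000.0])  # ground-truth prices (USD)
y_pred = np.array([240_000.0, 330_000.0, 175_000.0])  # model predictions

errors = y_pred - y_true
mae = np.mean(np.abs(errors))  # every error weighted equally
mse = np.mean(errors ** 2)     # large errors dominate
rmse = np.sqrt(mse)            # same units as the target (USD)
print(mae, mse, rmse)
```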
Training XGBoost
To preprocess the data and upload it to S3, follow the steps in the k-NN post.
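If you prefer a scripted upload, a short boto3 sketch like the following works; the bucket and key names here are hypothetical, so substitute the ones from the k-NN post:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-sagemaker-pricing-bucket"  # hypothetical bucket name

# Upload the prepared CSV splits to the locations SageMaker will read from.
for name in ("train.csv", "validation.csv"):
    s3.upload_file(name, bucket, f"xgboost/data/{name}")
```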
Go to your SageMaker dashboard, and click on “create training job”. Choose a name for your training job. The IAM role can be the same as the one created for the k-NN problem. This time choose XGBoost as the algorithm.
In the hyperparameters section, set the following parameters:
- num_round = 10
- objective = reg:linear
- eval_metric = rmse (note the lowercase: XGBoost metric names are case-sensitive)
Specify your train and validation channels as done for the k-NN training job. Note that while the k-NN algorithm names its channels “train” and “test”, XGBoost uses “train” and “validation”. This is just a SageMaker convention. Add the output location and click on “create training job”.
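The same training job can also be launched from the SageMaker Python SDK. This is a sketch under assumptions: the role ARN is hypothetical and the bucket name comes from the upload snippet above. Note that reg:linear is deprecated in newer XGBoost versions in favor of reg:squarederror.

```python
import sagemaker
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN
bucket = "my-sagemaker-pricing-bucket"                 # hypothetical bucket name

# Retrieve the built-in XGBoost container image for this region.
image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.0-1")

estimator = sagemaker.estimator.Estimator(
    image, role, instance_count=1, instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/xgboost/output", sagemaker_session=session,
)
estimator.set_hyperparameters(num_round=10, objective="reg:linear", eval_metric="rmse")

train_input = TrainingInput(f"s3://{bucket}/xgboost/data/train.csv", content_type="text/csv")
validation_input = TrainingInput(f"s3://{bucket}/xgboost/data/validation.csv", content_type="text/csv")
estimator.fit({"train": train_input, "validation": validation_input})
```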
Once training is finished, you can visualize the performance in CloudWatch. The validation error is much higher than the training error, a sign of overfitting, so we will want to do some further training and tune the hyperparameters.
After training is completed, you can find the model artifact in your output folder in S3.
As the model still has a high RMSE, we want to reduce it by tuning the hyperparameters. Some important hyperparameters are:
- num_round: the number of boosting rounds; corresponds to the n_estimators parameter of the scikit-learn-style XGBRegressor.
- alpha: the L1 regularization term on the weights; a larger value makes the model more conservative, which can improve generalization (less overfitting).
- booster: the gradient boosting method; options are gbtree, gblinear, or dart.
- early_stopping_rounds: stops the training job when the validation score has not improved for the given number of rounds.
- max_depth: the maximum depth of each tree; the deeper the tree, the higher the risk of overfitting.
Hyperparameter tuning
Hyperparameter tuning searches for the set of hyperparameter values for which the model performs best.
Start from the hyperparameters of your first model, which you can find by selecting the training job you just ran.
Go to “hyperparameter tuning jobs” and select “create hyperparameter tuning job”. Choose a name for your tuning job, enable early stopping, and select a tuning strategy, random or Bayesian.
After setting up the hyperparameter ranges, you'll be asked to define the input data, for which you can follow the same steps as in the training job configuration. Next, you can configure resource limits to keep the job from running too long.
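The console steps above map to a HyperparameterTuner in the SDK; this is a sketch with illustrative ranges, not the exact values used in this project:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator,  # the Estimator defined earlier
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "alpha": ContinuousParameter(0, 100),
        "max_depth": IntegerParameter(2, 10),
        "num_round": IntegerParameter(10, 200),
    },
    strategy="Bayesian",         # or "Random"
    max_jobs=20,                 # resource limits
    max_parallel_jobs=2,
    early_stopping_type="Auto",
)
tuner.fit({"train": train_input, "validation": validation_input})
```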
Finally, you can see a summary of your settings.
When the hyperparameter tuning job is completed, you can visualize the results and select your best training job.
As you can see, there is already an improvement of ~7000 in the RMSE value, and the hyperparameter tuning job only ran for ~7 minutes.
To improve your results even further, you can create a new hyperparameter tuning job, this time with “warm start”, meaning you'll initialize the search with the best results from the previous hyperparameter tuning job.
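With the SDK, warm start is configured through a WarmStartConfig; the parent tuning job name below is hypothetical:

```python
from sagemaker.tuner import WarmStartConfig, WarmStartTypes

warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"my-first-xgboost-tuning-job"},  # hypothetical previous job name
)
tuner_v2 = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "alpha": ContinuousParameter(0, 100),
        "max_depth": IntegerParameter(2, 10),
    },
    warm_start_config=warm_start,
)
tuner_v2.fit({"train": train_input, "validation": validation_input})
```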
Clean-up
Finally, to avoid extra charges, remember to delete the files and resources you no longer need:
- Open the Amazon SageMaker console. Under Inference, choose “Models”. Choose the model that you created in this example, then from “Actions” click “Delete”. There is no way to delete training or hyperparameter tuning jobs; however, once the jobs are completed they do not incur extra costs.
- Open the Amazon S3 console and delete the bucket that you created for storing model artifacts and the training dataset.
- Open the Amazon CloudWatch console and delete all of the log groups that have names starting with /aws/sagemaker/.
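The same clean-up can be scripted with boto3; all names below are hypothetical placeholders, and the log-group loop deletes every SageMaker log group in the account, so review it before running:

```python
import boto3

# Delete the model created in this example (skip if you never created one).
sm = boto3.client("sagemaker")
sm.delete_model(ModelName="my-xgboost-model")  # hypothetical model name

# Empty and delete the S3 bucket holding the dataset and model artifacts.
bucket = boto3.resource("s3").Bucket("my-sagemaker-pricing-bucket")
bucket.objects.all().delete()
bucket.delete()

# Remove the SageMaker log groups from CloudWatch.
logs = boto3.client("logs")
for group in logs.describe_log_groups(logGroupNamePrefix="/aws/sagemaker/")["logGroups"]:
    logs.delete_log_group(logGroupName=group["logGroupName"])
```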