Training a k-NN model to Predict Bank Customer Churn using AWS
This article is a summary of one of the case studies in the Full Stack ML course by AICamp. The problem is that of predicting customer churn, which is the fraction of customers lost by a business.
Feature engineering
The first step towards data preparation is to gather all your data in one single table, and apply feature engineering (the set of techniques used to transform the raw data) to obtain features in a format that can be used by the model.
For AWS SageMaker specifically, you need to prepare the data in a specific format: the first column should contain the labels, and there should be no headers.
You can also apply other feature engineering techniques to prepare your dataset:
- One hot encoding
- Numerical encoding (when categories are hierarchical)
- Removing unnecessary columns that do not contain useful data
- Aggregating columns
- Normalization
- Missing values replacement (interpolation, frequency, removal…)
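As a rough illustration of a few of the techniques listed above, here is a minimal pandas sketch. The toy table and its column names (Geography, Tier, Balance, RowId) are hypothetical and are not taken from the bank churn dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany"],
    "Tier": ["low", "high", "medium", "low"],
    "Balance": [0.0, 1500.0, None, 3000.0],
    "RowId": [1, 2, 3, 4],
})

# Remove a column that carries no useful information for the model.
df = raw.drop(columns=["RowId"])

# Numerical encoding for a category with a natural order.
tier_order = {"low": 0, "medium": 1, "high": 2}
df["Tier"] = df["Tier"].map(tier_order)

# Missing values replacement using the column median.
df["Balance"] = df["Balance"].fillna(df["Balance"].median())

# Min-max normalization of a numeric column to the [0, 1] range.
balance = df["Balance"]
df["Balance"] = (balance - balance.min()) / (balance.max() - balance.min())

# One hot encoding for a category without a natural order.
df = pd.get_dummies(df, columns=["Geography"], dtype=int)
```

Which techniques apply depends on the dataset; the order of the steps here is just one reasonable choice.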
The Garbage In, Garbage Out principle
When deploying models, we need to make sure that the feature engineering process is integrated in the production pipeline too. Data entering the production pipeline will be in raw form, and it is the responsibility of the ML pipeline to transform it. It is very important to monitor the data constantly, as even minor changes upstream can silently corrupt predictions. The pipeline might not break as a result of those changes, but the predictions will be wrong: as long as the model receives input of the expected shape, it will make a prediction, and the model itself won’t check whether the input data is accurate. For example, if two columns are accidentally swapped in the input dataset, the number of features is still the same, so the model will make a prediction, but the results will be unreliable. This is the concept of GIGO: Garbage In, Garbage Out.
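One way to guard against this kind of silent failure is to validate the input before it reaches the model. The sketch below is only an illustration: the column names and value ranges are hypothetical, and a real pipeline would capture its own schema at training time.

```python
import pandas as pd

# Hypothetical schema captured at training time: column order and plausible ranges.
EXPECTED_COLUMNS = ["CreditScore", "Age", "Balance"]
EXPECTED_RANGES = {"CreditScore": (300, 850), "Age": (18, 120)}


def validate_input(df, expected_columns=EXPECTED_COLUMNS, expected_ranges=EXPECTED_RANGES):
    """Raise ValueError if the columns or their values do not match the schema."""
    actual = list(df.columns)
    # Catches missing, extra, or reordered columns when headers are available.
    if actual != list(expected_columns):
        raise ValueError(f"Unexpected columns: got {actual}, expected {list(expected_columns)}")
    # Catches swapped or corrupted values even when the headers look right.
    # Missing values also fail this check, because between() returns False for NaN.
    for column, (low, high) in expected_ranges.items():
        out_of_range = ~df[column].between(low, high)
        if out_of_range.any():
            raise ValueError(f"{int(out_of_range.sum())} value(s) of {column} outside [{low}, {high}]")
    return df
```

Calling validate_input on every incoming batch turns a silent GIGO situation into an explicit error that can be logged and monitored.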
Data preparation
For the bank churn exercise, we have a dataset with a total of 10000 entries (rows) and 8 features plus the label (columns):
The label is our “Exited” column, as we want to predict whether a customer will exit or not.
A useful step for transforming the dataset to the SageMaker format is to check whether the label is in the first column, and if not, swap it:
label = "Exited"

# Rearrange the dataset columns
cols = data.columns.tolist()
colIdx = data.columns.get_loc(label)

# Do nothing if the label is in the 0th position
# Otherwise, change the order of columns to move label to 0th position
if colIdx != 0:
    cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]

# Change the order of data so that label is in the 0th column
modified_data = data[cols]
In this case there are no categorical columns, so we will not do any categorical or one hot encoding.
The first feature engineering step we perform is filling in missing values. The scikit-learn Python library has a few built-in tools for replacing missing values. One is SimpleImputer, which by default replaces missing values with the mean calculated along each column. Here we set its strategy to the median instead:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Remove label so that it is not transformed
data_without_label = modified_data.drop([label], axis=1)

# Initialize Simple Imputer and fill missing values with the median value
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))])
numeric_features = data_without_label.select_dtypes(include=['int64', 'float64']).columns

# Create the column transformer
preprocessor_cols = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

# Create a pipeline with the column transformer, note that
# more things can be added to this pipeline in the future
preprocessor = Pipeline(steps=[('preprocessor', preprocessor_cols)])
preprocessor.fit(data_without_label)
modified_data_without_label = preprocessor.transform(data_without_label)
We also need to split the dataset into a train and a test dataset. There are different techniques for making this split. In this case we use the train_test_split function in scikit-learn, which randomly samples data points to create two new datasets in a given proportion for train and test (here we select an 80:20 split):
import numpy as np
from sklearn.model_selection import train_test_split

# Put the label back in the first column, next to the imputed features
modified_data_array = np.concatenate(
    (np.array(modified_data[label]).reshape(-1, 1), modified_data_without_label),
    axis=1)

# Split the file into train and test (80% train and 20% test)
train, test = train_test_split(modified_data_array, test_size=0.2)
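Before uploading, the two arrays need to be written to files in the format described earlier: label in the first column and no header row. A minimal sketch of that step follows, assuming CSV files; the helper name save_without_header and the file names are illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd


def save_without_header(array, path):
    """Write a 2-D array as CSV with no header row and no index column.

    Assumes the label already sits in column 0 of the array.
    """
    pd.DataFrame(np.asarray(array)).to_csv(path, header=False, index=False)


# Example usage with the arrays produced by the split above
# (file names are hypothetical):
# save_without_header(train, "train.csv")
# save_without_header(test, "test.csv")
```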
Storing the dataset in AWS S3
Now that the dataset preparation phase is done, we need to upload our dataset to our S3 bucket. To create the S3 bucket, go to the AWS Management Console and search for S3.
If it’s your first time creating buckets, you should see something like this:
Go ahead and select “create bucket”. You’ll be asked to input a name for your bucket and select a region. It is preferable to use a region close to where you are located, but most importantly you should use the same region throughout your project (a training job created in one region cannot access data from another region, even if you’ve uploaded it using the same account).
Other useful options are public access and bucket versioning. The first controls who can access the bucket, while the second keeps every version of the datasets you upload.
Click on “create bucket” at the bottom of the page. Now you can click on your bucket (I called mine ‘francesca-aicamp’) and you should see these options:
Click “upload” and add your modified dataset as produced in the feature engineering step.
k-NN model training
We now want to train our model using k-nearest neighbors (k-NN). To classify a new data point, the algorithm finds the k data points in the training dataset that are closest to it in the feature space, where k is a parameter we choose. In a classification problem like this one, the new data point is assigned to the class most common among those k neighbors. k-NN can also be used in regression problems, where the prediction is the average of the neighbors’ values.
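To make the idea concrete, here is a minimal NumPy sketch of k-NN for both classification and regression. It is an illustration only: it uses Euclidean distance and a brute-force search, and it is not how the SageMaker built-in algorithm is implemented.

```python
from collections import Counter

import numpy as np


def nearest_indices(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_new, k=5):
    """Assign x_new to the most common class among its k nearest neighbors."""
    neighbors = nearest_indices(X_train, x_new, k)
    votes = Counter(y_train[neighbors].tolist())
    # On a tie, the class of the nearest tied neighbor wins.
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k=5):
    """Predict the average target value of the k nearest neighbors."""
    neighbors = nearest_indices(X_train, x_new, k)
    return float(np.mean(y_train[neighbors]))


# Toy data: two well-separated groups of points.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

label_near_origin = knn_classify(X_toy, y_class, np.array([0.5, 0.5]), k=3)  # 0
label_far_away = knn_classify(X_toy, y_class, np.array([5.5, 5.5]), k=3)     # 1
value_near_origin = knn_regress(X_toy, y_value, np.array([0.5, 0.5]), k=3)   # 2.0
```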
To launch our training job, we use the k-NN algorithm provided by SageMaker. Go back to the AWS console and search for SageMaker.
In the sidebar, click on training jobs:
Click on “create training job” and choose a name for your training job. I chose AICamp-KNN-churn-Dec03.
You’ll also be asked to select an IAM role, the set of permissions that controls access between AWS services. Here we give permission to SageMaker to access S3. Create a new IAM role or use an existing one that gives access to any S3 bucket:
In “algorithm options”, choose Amazon SageMaker built-in algorithm and select k-NN. Leave the remaining options at their defaults and jump to the Hyperparameters section. For this exercise, we are using the following parameters:
- Feature dimension (the number of features we have in the dataset): 8
- Mini batch size (the number of data points processed at each iteration): 100
- K value (how many neighbors to consider): 5
- Predictor type: classifier
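The feature dimension has to match the number of feature columns in the training file, so it can help to derive it from the data rather than typing it by hand. The sketch below assumes the label is in column 0 and that the keys feature_dim, mini_batch_size, k and predictor_type correspond to the console fields listed above; the helper name is hypothetical.

```python
import numpy as np


def build_knn_hyperparameters(train_array, k=5, mini_batch_size=100):
    """Collect the hyperparameters used in this exercise in one dictionary.

    Assumes train_array holds the label in column 0 and features in the rest.
    """
    n_features = train_array.shape[1] - 1  # every column except the label
    return {
        "feature_dim": n_features,
        "mini_batch_size": mini_batch_size,
        "k": k,
        "predictor_type": "classifier",
    }


# A placeholder array with 1 label column and 8 feature columns.
example_params = build_knn_hyperparameters(np.zeros((10, 9)))
```

If the derived feature_dim is not 8 for the real training array, the file does not have the layout the training job expects.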
Next, we need to specify the location of our input data. To specify the S3 location, go to S3 in a new tab (to avoid losing the parameters you configured so far). Go to the location of your file and copy the URI:
Add another channel and repeat for the test dataset, specifying the name “test” for the channel (SageMaker expects two channels for the k-NN algorithm, specifically “train” and “test”).
Specify where to save your model output. This can be the S3 bucket that you just created, for instance:
You’ve now finished configuring the training job! Click on “create training job”.
It might take 3–5 minutes for the job to be completed.
Once the job is completed, click on it to see more details. Scroll to the “Monitor” section and click on “view algorithm metrics”. Of the metrics listed there, click on “test accuracy”.
The CloudWatch console will open, where you can plot different values monitored during training. For the parameters used in this problem, the accuracy is 0.77:
To test a different value of k, we can clone the training job and modify the hyperparameter k. Go back to the training jobs dashboard, select the job, then go to “actions” and “clone”. You’ll see a copy of your training job:
You can see the accuracy slightly increased to 0.78:
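The same kind of comparison can be tried locally before launching several training jobs. The sketch below uses scikit-learn's KNeighborsClassifier on synthetic data with 8 features. It is not the SageMaker built-in algorithm and does not use the bank churn dataset, so its accuracies will not match the values reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 8 features (not the bank churn dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)


def accuracy_by_k(k_values):
    """Train one k-NN classifier per value of k and return its test accuracy."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)
        scores[k] = model.score(X_test, y_test)
    return scores


scores = accuracy_by_k([1, 3, 5, 7, 9])
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

A sweep like this only narrows down the range of k worth trying; the final comparison should still use the metrics reported by the actual training jobs.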
Finally, you can go back to your output directory in S3: you’ll find the trained model saved in a tar.gz format. Once you’re happy with the level of accuracy reached with a certain set of hyperparameters, you have finished your training, and can use the model to make predictions on new data!
Full code for data preprocessing:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# Please change the file location as needed
file_location = "bank_churn_project_1.csv"
data = pd.read_csv(file_location)
label = "Exited"

# Rearrange the dataset columns
cols = data.columns.tolist()
colIdx = data.columns.get_loc(label)

# Do nothing if the label is in the 0th position
# Otherwise, change the order of columns to move label to 0th position
if colIdx != 0:
    cols = cols[colIdx:colIdx+1] + cols[0:colIdx] + cols[colIdx+1:]

# Change the order of data so that label is in the 0th column
modified_data = data[cols]

# Remove label so that it is not encoded
data_without_label = modified_data.drop([label], axis=1)

# Initialize Simple Imputer and fill missing values with the median value
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))])
numeric_features = data_without_label.select_dtypes(include=['int64', 'float64']).columns

# Create the column transformer
preprocessor_cols = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

# Create a pipeline with the column transformer, note that
# more things can be added to this pipeline in the future
preprocessor = Pipeline(steps=[('preprocessor', preprocessor_cols)])
preprocessor.fit(data_without_label)
modified_data_without_label = preprocessor.transform(data_without_label)

modified_data_array = np.concatenate(
    (np.array(modified_data[label]).reshape(-1, 1), modified_data_without_label),
    axis=1)

# Split the file into train and test (80% train and 20% test)
train, test = train_test_split(modified_data_array, test_size=0.2)