Building Customer Churn Prediction Model With Imbalance Dataset


The digital revolution is moving at a very fast pace. The volume of data generated in the cloud keeps growing, and because data is king in the digital era, it can be a decisive advantage. Yet a large amount of data is of little use unless we can find meaningful information in it to drive business decisions. Data analysts and scientists spend almost 70 percent of their time forming quality data or refining it [...] and the other class has almost zero presence in the data, known as a class imbalance problem. There are [...]

Learning Objectives

Why is the quality of data important in Machine Learning?

What is an imbalanced dataset, and how to deal with class imbalance?

Use different Models to control Internally the Imbalance Dataset

Measure model performance on an imbalanced dataset with an appropriate performance metric.

Build Customer Churn Prediction model with decent accuracy.

 This article was published as a part of the Data Science Blogathon.

Table Of Contents

Brief Introduction to Imbalance Datasets

Describing the Dataset

Small Talk on Churn Analysis

Choosing the Best Performance Metrics

Explore Numerical and Categorical Features

Applying XGBoost Algorithm

Comparison of Different Models


Brief Introduction to Imbalance Datasets

Machine learning operates completely on the data. If the data is improper, model performance will suffer. Consider the example below, where No denotes a transaction that is not fraudulent and Yes denotes a fraudulent transaction.

Source: Canva

In the above example, No accounts for 80 percent of the samples and Yes for 20 percent. This kind of scenario is called an imbalanced class problem, because the model becomes biased toward the class that carries more weight. Our interest is in understanding the fraudulent transactions, but since we do not have enough samples of them, machine learning models will be unable to learn their pattern. This is a very common problem in the IT industry when dealing with data. Here we have seen an 80:20 split, but you can meet other ratios such as 90:10, 95:5, or even 99:1.

Describing the Data

The dataset we will use is the Customer churn prediction dataset of 2023. Churn describes customers leaving the business; here, the task is to predict whether a customer will change telecommunication providers. The dataset contains 4250 samples. Each sample has 19 input features and 1 boolean target variable, which indicates the class of the sample.

The dataset is imbalanced: 86 percent of the samples are not churned, and only 14 percent represent churn. Our target is to handle this imbalance and develop a generalized model with good performance.

Small Talk on Churn Analysis

Churn analysis describes the company's customer loss rate. In simple words, churn means attrition, which occurs in two forms: customer attrition and employee attrition. When attrition is high, the company's growth curve starts to decline and the company suffers heavy losses. Churn can be minimized by analyzing the company's work environment, product growth, market conditions, dealer connections, etc. Even a one-point increase in churn directly hurts the business. High churn rates compound very fast and can cause a massive loss to the company, so it is important to calculate churn regularly.

Source: Cleartap

Choosing the Best Performance Metric

Since our target variable is imbalanced, we will not use the accuracy score: a model that always predicts the majority class would already score about 86 percent on this dataset while missing every churned customer. Recall is used in an imbalanced class problem with two classes; it quantifies the number of correct positive predictions made out of all actual positive samples. Recall is the number of true positives divided by the sum of true positives and false negatives.

Precision is another choice of metric; it measures how many of the samples predicted as positive (the minority class) are actually positive. Maximizing precision reduces the false positives, while maximizing recall reduces the false negatives. According to our problem statement, we must focus on reducing the false negatives. Hence we will go with recall as the performance metric.
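As a minimal sketch of the two formulas above, the snippet below computes recall and precision from a confusion matrix with scikit-learn. The labels are hypothetical values made up for illustration, not results from the churn dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration: 1 = churned (minority), 0 = not churned.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)     # share of actual churners that were caught
precision = tp / (tp + fp)  # share of predicted churners that really churned

print(f"recall={recall:.2f}, precision={precision:.2f}")

# The library helpers give the same numbers.
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
```

Every missed churner adds to the false negatives and therefore lowers recall, which is why recall fits this problem.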

Source: Medium

Performing Exploratory Data Analysis

The first step is to import all the libraries and load the data to explore the data and understand some insights and relationships to take further action on the dataset. We have imported data analytics libraries and data visualization libraries.

Python Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import plotly and cufflinks in offline mode
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import plotly
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Widen the pandas display limits
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 900)
pd.set_option('display.max_colwidth', 200)

# The dataset is then loaded into a DataFrame named df1 (the loading step is not shown here).

If you check for null or duplicate values, the result is 0 for each column. From this preliminary analysis, we can plan the next steps.

We will create models with the famous trio XGBoost, LightGBM, and CatBoost to predict customer churn, so that the business can focus on retaining customers.

For CatBoost, integer columns will be converted to float type.

We have to look at the cardinality of the categorical variables.

And finally, we will convert the churn target variable to numeric using label encoding.

Explore Target Variable

We will analyze the target variable to find the percentage presence of each class data and plot a histogram of the churn variable to visualize the availability of each class data.

y = df1['churn']

We will convert the target variable from categorical to numerical values using a Label encoder, and the value is 0 or 1.
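The two steps described here can be sketched as follows. The small frame is a hypothetical stand-in for df1, and the "yes"/"no" values of the churn column are an assumption about how the target is stored.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the article's df1; the "yes"/"no" values are assumed.
df_demo = pd.DataFrame({"churn": ["no", "no", "yes", "no", "no", "no", "yes", "no"]})

# Share of each class, in percent
class_share = df_demo["churn"].value_counts(normalize=True) * 100
print(class_share.round(1))

# LabelEncoder sorts the classes alphabetically, so "no" becomes 0 and "yes" becomes 1
encoder = LabelEncoder()
df_demo["churn"] = encoder.fit_transform(df_demo["churn"])
print(list(encoder.classes_))
print(df_demo["churn"].tolist())
```

Checking encoder.classes_ after fitting is a cheap way to confirm which class ended up as the positive label 1.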

Explore Numerical and Categorical Features

We have different numerical and categorical features, and to get good data insight, we can analyze them separately and find relationships. In simple words, we will perform a univariate and bivariate analysis. So for ease of usage, we will separate the numerical and categorical columns in a separate list.

First, we will go with numerical columns where we first convert all the integer type columns to float type.

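# Separate numerical and categorical column names
# (these definitions are assumed; the original lists are not shown)
numerical = df1.select_dtypes(include='number').columns.tolist()
categorical = df1.select_dtypes(exclude='number').columns.tolist()

# Pick int columns (except the target) and convert them to float type
col_int = [col for col in numerical if df1[col].dtype == "int64" and col != "churn"]
print(col_int)

for col in col_int:
    df1[col] = df1[col].astype(float)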

To get insights about the relationship between numerical columns, you can display the summary statistics using the describe function, and we will plot a correlation heatmap. The parameters we pass to the heatmap are the correlation matrix of the data, annot=True to print the values in the cells, a format of two decimal places, and the coolwarm color map.
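A minimal sketch of this step is shown below. The data is synthetic and the column names are hypothetical; the "charge" column is built as a fixed multiple of "minutes" so that the pair is perfectly correlated.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical numeric data: "charge" is a fixed multiple of "minutes",
# so the two columns are perfectly correlated.
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 300, size=200)
df_demo = pd.DataFrame({
    "minutes": minutes,
    "charge": minutes * 0.17,
    "calls": rng.integers(0, 20, size=200).astype(float),
})

# Summary statistics of the numerical columns
print(df_demo.describe())

# Correlation heatmap: values printed in the cells with 2 decimals, coolwarm colors
corr = df_demo.corr()
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
fig.tight_layout()
```

A cell close to 1 or -1 away from the diagonal signals a redundant pair of columns.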

Observation: From the above heatmap, we have some of the columns that are highly correlated with each other, indicating multicollinearity in the data. So, we need to drop one of each highly correlated column pair.

# Remove the multicollinear features
drop_col = ['total_day_charge', 'total_eve_charge', 'total_night_charge', 'total_intl_charge']
df1 = df1.drop(drop_col, axis=1)
df1.shape

We got rid of the multicollinear columns, and you can plot the correlation plot again to check the relationships. Now it is time to explore the categorical features against the target with bivariate analysis. The good news is that we do not have high cardinality or zero variance issues in the categorical columns. So we jump directly to bivariate analysis, where we first plot state versus churn to check the state-wise churn report.

In the same way, you can analyze other columns in the dataset, for example area code, international plan, and voice mail plan, against churn and use the relationships to draw better insights.
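One simple way to run such a bivariate check is a group-wise churn rate. This is a sketch on made-up rows; the column name international_plan is an assumed spelling of the "international plan" feature mentioned above.

```python
import pandas as pd

# Hypothetical rows; the column name is an assumed spelling of the feature.
df_demo = pd.DataFrame({
    "international_plan": ["yes", "no", "no", "yes", "no", "no"],
    "churn": [1, 0, 0, 1, 1, 0],
})

# With churn encoded as 0/1, the group mean is the churn rate of that group.
churn_rate = (
    df_demo.groupby("international_plan")["churn"]
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate)
```

The same pattern works for state, area code, or voice mail plan by changing the grouping column.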

Learning to Use Famous Trio Models with Class Imbalance Dataset

Now, let’s look at Catboost, XGBoost, and Light GBM to see how they handle the imbalanced dataset internally. By giving a chance to focus more on the minority class and fine-tune the training, they do a good job even on the imbalanced datasets.

CatBoost, XGBoost, and LightGBM use scale_pos_weight hyperparameter to fine-tune the training algorithm for the imbalanced data. By default, scale_pos_weight is 1.

The value of scale_pos_weight is calculated as:

scale_pos_weight = number of majority-class (negative) samples / number of minority-class (positive) samples

For example, with the following counts:

Number of non-churned (majority) customers: 5174

Number of churned (minority) customers: 1869

scale_pos_weight = 5174 / 1869 ≈ 2.77, or almost 3

With this weight, errors made on the minority class get roughly 3 times more impact and 3 times more correction than errors made on the majority class.
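The calculation above can be sketched as a small helper. The function name compute_scale_pos_weight is hypothetical, and the labels are rebuilt from the counts quoted above rather than read from a file.

```python
import numpy as np

def compute_scale_pos_weight(y):
    """Return (number of negative samples) / (number of positive samples) for 0/1 labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos

# Labels rebuilt from the counts quoted above: 5174 majority, 1869 minority.
y_demo = np.array([0] * 5174 + [1] * 1869)
weight = compute_scale_pos_weight(y_demo)
print(round(weight, 2))  # 2.77
```

Computing the ratio on the training labels only keeps information from the test split out of the setting.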

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# ConfusionMatrixDisplay and RocCurveDisplay replace them.
from sklearn.metrics import (accuracy_score, roc_curve, recall_score, confusion_matrix,
                             roc_auc_score, precision_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)
import optuna
import lightgbm as lgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

1. Apply Cat Boost Algorithm

accuracy = []
recall = []
roc_auc = []
precision = []

X = df1.drop('churn', axis=1)
y = df1['churn']
# Every column that is not float is treated as categorical by CatBoost
categorical_features_indices = np.where(X.dtypes != np.float64)[0]

# Separate training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# With scale_pos_weight=5, errors on the minority class get 5 times more impact
# and 5 times more correction than errors made on the majority class.
catboost_5 = CatBoostClassifier(verbose=False, random_state=0, scale_pos_weight=5)

# Train the model
catboost_5.fit(X_train, y_train, cat_features=categorical_features_indices,
               eval_set=(X_test, y_test))

# Take predictions
y_pred = catboost_5.predict(X_test)

# Calculate metrics
accuracy.append(round(accuracy_score(y_test, y_pred), 4))
recall.append(round(recall_score(y_test, y_pred), 4))
roc_auc.append(round(roc_auc_score(y_test, y_pred), 4))
precision.append(round(precision_score(y_test, y_pred), 4))

model_names = ['Catboost_adjusted_weight_5']
result_df1 = pd.DataFrame({'Accuracy': accuracy, 'Recall': recall, 'Roc_Auc': roc_auc,
                           'Precision': precision}, index=model_names)
result_df1

We got a recall of 0.85. We can plot the confusion matrix to clearly see the false positives, false negatives, true positives, and true negatives.

Optuna for Hyperparameter Tuning

Optuna automates the search for the best hyperparameters. It supports different search methods such as grid search, Bayesian optimization, random search, and evolutionary algorithms. All of these methods return tuned hyperparameters, but the resulting accuracy differs between them. So let's do the hyperparameter optimization of CatBoost with Optuna. The code below runs up to 100 trials, capped at 600 seconds by the timeout, so it will take a little time.

def objective(trial):
    param = {
        "objective": "Logloss",
        "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 1, 12),
        "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
        "bootstrap_type": trial.suggest_categorical(
            "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]
        ),
        "used_ram_limit": "3gb",
    }

    # Some parameters only apply to a specific bootstrap type
    if param["bootstrap_type"] == "Bayesian":
        param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
    elif param["bootstrap_type"] == "Bernoulli":
        param["subsample"] = trial.suggest_float("subsample", 0.1, 1)

    cat_cls = CatBoostClassifier(verbose=False, random_state=0, scale_pos_weight=1.2, **param)
    cat_cls.fit(X_train, y_train, eval_set=[(X_test, y_test)],
                cat_features=categorical_features_indices, verbose=0,
                early_stopping_rounds=100)

    preds = cat_cls.predict(X_test)
    pred_labels = np.rint(preds)
    # The study maximizes accuracy; return recall_score(y_test, pred_labels) to tune for recall instead
    accuracy = accuracy_score(y_test, pred_labels)
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100, timeout=600)

    print("Number of finished trials: {}".format(len(study.trials)))
    print("Best trial:")
    trial = study.best_trial
    print(" Value: {}".format(trial.value))
    print(" Params: ")
    for key, value in trial.params.items():
        print(" {}: {}".format(key, value))

Build Cat Boost Classifier Model with New Parameters

We are done with the hyperparameter search, and Optuna has returned the best parameters it found, so let's train the model with the new hyperparameters.

accuracy = []
recall = []
roc_auc = []
precision = []

# Only the Optuna-selected parameters are passed here; scale_pos_weight is left at its default
catboost_5 = CatBoostClassifier(verbose=False, random_state=0,
                                colsample_bylevel=0.09928058251743176,
                                depth=9,
                                boosting_type="Ordered",
                                bootstrap_type="MVS")

catboost_5.fit(X_train, y_train, cat_features=categorical_features_indices,
               eval_set=(X_test, y_test), early_stopping_rounds=100)

y_pred = catboost_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred), 4))
recall.append(round(recall_score(y_test, y_pred), 4))
roc_auc.append(round(roc_auc_score(y_test, y_pred), 4))
precision.append(round(precision_score(y_test, y_pred), 4))

model_names = ['Catboost_adjusted_weight_5_optuna']
result_df2 = pd.DataFrame({'Accuracy': accuracy, 'Recall': recall, 'Roc_Auc': roc_auc,
                           'Precision': precision}, index=model_names)
result_df2

With Optuna Hyperparameters, we can increase the accuracy score by 2 percent.

2. Apply Light GBM Classifier

LightGBM is a fast gradient-boosting algorithm based on decision trees that delivers high performance on tasks such as classification and ranking. It was developed by Microsoft, has helped many competitors win popular data science hackathons, and has become one of the best-performing algorithms on different types of data.

accuracy = []
recall = []
roc_auc = []
precision = []

# Creating data for LightGBM
independent_features = df1.drop('churn', axis=1)
dependent_feature = df1['churn']

# LightGBM handles pandas 'category' columns natively
for col in independent_features.columns:
    col_type = independent_features[col].dtype
    if col_type == 'object' or col_type.name == 'category':
        independent_features[col] = independent_features[col].astype('category')

X_train, X_test, y_train, y_test = train_test_split(independent_features, dependent_feature,
                                                    test_size=0.3, random_state=42)

# Create LightGBM classifier
# (verbose=-1 silences training logs; LightGBM 4.x no longer accepts verbose in fit())
lgbmc_5 = LGBMClassifier(random_state=0, scale_pos_weight=5, verbose=-1)

# Train the model
lgbmc_5.fit(X_train, y_train, categorical_feature='auto',
            eval_set=[(X_test, y_test)], feature_name='auto')

# Make predictions
y_pred = lgbmc_5.predict(X_test)

# Calculate metrics
accuracy.append(round(accuracy_score(y_test, y_pred), 4))
recall.append(round(recall_score(y_test, y_pred), 4))
roc_auc.append(round(roc_auc_score(y_test, y_pred), 4))
precision.append(round(precision_score(y_test, y_pred), 4))

# Create DF of metrics
model_names = ['LightGBM_adjusted_weight_5']
result_df3 = pd.DataFrame({'Accuracy': accuracy, 'Recall': recall, 'Roc_Auc': roc_auc,
                           'Precision': precision}, index=model_names)
result_df3

With the help of default parameters, we can get 96 percent accuracy, and recall is slightly down compared to the cat boost.

Hyperparameter Tuning of Light GBM using Optuna

Now you know how to use Optuna to find the best Hyperparameters, so let us use Optuna to find the best parameters for Light GBM. You can add more parameters from the official Light GBM Documentation.

def objective(trial):
    param = {
        "objective": "binary",
        "metric": "binary_logloss",
        "verbosity": -1,
        "boosting_type": "dart",
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        # num_leaves was listed twice in the original; the later range (2 to 256) is kept
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

    lgbmc_adj = lgb.LGBMClassifier(random_state=0, scale_pos_weight=5, **param)
    # LightGBM 4.x takes early stopping as a callback instead of a fit() argument;
    # note that LightGBM does not apply early stopping with boosting_type="dart"
    lgbmc_adj.fit(X_train, y_train, categorical_feature='auto',
                  eval_set=[(X_test, y_test)], feature_name='auto',
                  callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])

    preds = lgbmc_adj.predict(X_test)
    pred_labels = np.rint(preds)
    accuracy = accuracy_score(y_test, pred_labels)
    return accuracy


if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

    print("Number of finished trials: {}".format(len(study.trials)))
    print("Best trial:")
    trial = study.best_trial
    print(" Value: {}".format(trial.value))
    print(" Params: ")
    for key, value in trial.params.items():
        print(" {}: {}".format(key, value))

The parameters provided by Optuna do not increase the accuracy much. You can now rebuild the model with the parameters received from Optuna.

3. Applying XGBoost Classifier

XGBoost stands for Extreme Gradient Boosting, an ensemble machine learning algorithm that combines the results of multiple decision trees. The trees are added sequentially, each new tree correcting the errors of the previous ones, while the construction of each individual tree is parallelized. It is a leading algorithm for regression and classification tasks.

accuracy = []
recall = []
roc_auc = []
precision = []

# Since XGBoost does not handle categorical values itself, we use get_dummies
# to convert categorical variables into numeric variables.
df1 = pd.get_dummies(df1)

X = df1.drop('churn', axis=1)
y = df1['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Default parameters (scale_pos_weight is not set here)
xgbc_5 = XGBClassifier(random_state=0)
xgbc_5.fit(X_train, y_train)

y_pred = xgbc_5.predict(X_test)

accuracy.append(round(accuracy_score(y_test, y_pred), 4))
recall.append(round(recall_score(y_test, y_pred), 4))
roc_auc.append(round(roc_auc_score(y_test, y_pred), 4))
precision.append(round(precision_score(y_test, y_pred), 4))

model_names = ['XGBoost_adjusted_weight_5']
result_df5 = pd.DataFrame({'Accuracy': accuracy, 'Recall': recall, 'Roc_Auc': roc_auc,
                           'Precision': precision}, index=model_names)
result_df5

With the default parameters, we can grab 95 percent accuracy. Now you can optimize the XGBoost model with Optuna to find the optimal parameters and rebuild the model.

Comparison of Different Models

We have created multiple models using default parameters and used Optuna to find the best hyperparameters. We have 6 resultant data frames, so we can concatenate them to compare all the results and plot a bar graph.

# result_df4 and result_df6 come from rebuilding LightGBM and XGBoost with their Optuna parameters
result_final = pd.concat([result_df1, result_df2, result_df3, result_df4, result_df5, result_df6], axis=0)
result_final

Conclusion

Imbalanced datasets are common in data science: whenever you acquire data, one class tends to be overrepresented while the other is in the minority. In this article, we have learned how to deal with an imbalanced dataset while building a model, as an alternative to separate techniques such as resampling, SMOTE, or ensembles. Let us go over the key learning points to remember.

We have learned the Importance of data quality to get a good performance of machine learning models.

If we use extreme values for scale pos weight, we can overfit the minority class, and the model could make worse predictions, so the value of scale pos weight should be optimal.

While Cat boost and Light GBM can handle the categorical features, XGBoost cannot. You have to convert categorical features before creating the model.

Scale pos weight value is 1 by default, so both the majority and minority classes get the same weight.

Optuna is the framework we can use to find the optimal parameters to fine-tune the machine-learning model automatically.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.



Heart Disease Prediction Using Logistic Regression On Uci Dataset

This article was published as a part of the Data Science Blogathon.



Hi everyone!

In this article, we study, in detail, the hyperparameters, code and libraries used for heart disease prediction using logistic regression on the UCI heart disease dataset.

Importing Libraries

# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Numpy: Numpy is an open-source python library for handling n-dimensional arrays, written in the C programming language. Python is also written in the C programming language. Loading Numpy in the memory enables the Python interpreter to work with array computing in a fast and efficient manner. Numpy offers the implementation of various mathematical functions, algebraic routines and Fourier transforms. Numpy supports different hardware and computing technologies and is well suited for GPU and distributed computing. The high-level language used provides ease of use with respect to the various Numpy functionalities.

Pandas: Pandas is a fast open-source data analysis tool built on top of Python. Pandas allows various data manipulation activities using Pandas DataFrame objects. The different Pandas methods used in this study will be explained in detail later.

Matplotlib: Matplotlib is a Python library that enables plotting publication-quality graphs, static and interactive graphs using Python. Matplotlib plots can be exported to various file formats, can work with third-party packages and can be embedded in Jupyter notebooks. Matplotlib methods used are explained in detail as we encounter them.

Seaborn: Seaborn is a statistical data visualization tool for Python built over Matplotlib. The library enables us to create high-quality visualizations in Python.

Data Exploration and Visualization

dataframe = pd.read_csv('heart_disease_dataset_UCI.csv')

The read_csv method from the Pandas library reads the heart disease dataset published by UCI, stored in the *.csv (comma-separated values) file format, into a dataframe. The DataFrame object is the primary Pandas data structure: a two-dimensional table with labelled axes along both rows and columns. Various data manipulation operations can be applied to the Pandas dataframe along rows and columns.

The Pandas dataframe head(10) method enables us to get a peek at the top 10 rows of the dataframe. This helps us in gaining an insight into the various columns and an insight into the type and values of data being stored in the dataframe.

The Pandas dataframe info() method provides information on the number of row-entries in the dataframe and the number of columns in the dataframe. Count of non-null entries per column, the data type of each column and the memory usage of the dataframe is also provided.

The Pandas dataframe isna().sum() methods provide the count of null values in each column.
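The three inspection calls described above can be sketched as follows. The miniature table and its column names are hypothetical and are not taken from the UCI file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the column names are illustrative only.
demo = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, 250.0, np.nan, 236.0],
    "target": [1, 1, 1, 0],
})

print(demo.head(10))   # at most the first 10 rows
demo.info()            # row count, dtypes, non-null counts, memory usage

# Count of null values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

A non-zero entry in the isna().sum() result marks a column that needs imputation or row removal before modelling.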

The Matplotlib.figure API implements the Figure class which is the top-level class for all plot elements. Figsize = (15,10) defines the plot size as 15 inches wide and 10 inches high.

The Seaborn heatmap API provides a colour-encoded plot for 2-D matrix data. The Pandas dataframe corr() method provides the pairwise correlation (movement of two variables in relation to each other) of columns in the dataframe. NA or null values are excluded by this method. The method allows us to find positive and negative correlations, and strong and weak correlations, between the various columns and the target variable. This can help us in feature selection: weakly correlated features can be neglected. Positive and negative correlations can be used to describe model predictions. A positive correlation implies that as the value of one variable goes up, the value of the other variable also goes up. A negative correlation implies that as the value of one variable goes up, the value of the other variable goes down. Zero correlation implies that there is no linear relationship between the variables. linewidths gives the width of the lines that divide the cells in the heatmap. Setting annot to True labels each cell with the corresponding correlation value. The cmap value defines the mapping of the data values to the colorspace.


The Pandas dataframe hist method plots the histogram of the different columns, with figsize equal to 12 inches wide and 12 inches high.

Standard Scaling

X = dataframe.iloc[:,0:13]
y = dataframe.iloc[:,13]

Next,  we split our dataframe into features (X) and target variable (y) by using the integer-location based indexing ‘iloc’ dataframe property. We select all the rows and the first 13 columns as the X variable and all the rows and the 14th column as the target variable.

X = X.values
y = y.values

We extract and return a Numpy representation of X and y values using the dataframe values property for our machine learning study.

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

We use the scikit-learn (sklearn) library for our machine learning studies. The scikit-learn library is an open-source Python library for predictive data analysis and machine learning and is built on top of Numpy, SciPy and Matplotlib. The SciPy ecosystem is used for scientific computing and provides optimized modules for Linear Algebra, Calculus, ODE solvers and Fast Fourier transforms among others. The sklearn preprocessing module implements function like scaling, normalizing and binarizing data. The StandardScaler standardizes the features by making the mean equal to zero and variance equal to one. The fit_transform() method achieves the dual purpose of (i) the fit() method by fitting a scaling algorithm and finding out the parameters for scaling (ii) the transform method, where the actual scaling transformation is applied by using the parameters found in the fit() method. Many machine learning algorithms are designed based on the assumption of expecting normalized/scaled data and standard scaling is thus one of the methods that help in improving the accuracy of machine learning models.
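The relationship between fit(), transform() and fit_transform() described above can be checked on a tiny array. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with two columns on very different scales
X_demo = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])

scaler = StandardScaler()
scaler.fit(X_demo)                    # learns the per-column mean and standard deviation
X_scaled = scaler.transform(X_demo)   # applies (x - mean) / std

# fit_transform does both steps in one call and gives the same result
X_one_call = StandardScaler().fit_transform(X_demo)

print(scaler.mean_)            # per-column means learned in fit()
print(X_scaled.mean(axis=0))   # approximately 0 for each column
print(X_scaled.std(axis=0))    # approximately 1 for each column
```

Keeping the fitted scaler object around makes it possible to apply the same transformation to new data later.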

Train-Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.25, random_state=40)

The sklearn model_selection module implements different data splitters (split into train and test sets, KFold train and test sets etc.), hyper-parameter optimizers (search over a grid to find optimal hyperparameters) and model validation functionalities (evaluate the metrics of the cross-validated model etc).

N.B. – KFold (K=10) cross-validation means splitting the train set into 10 parts. 9 parts are used for training while the last part is used for testing. Next, another set of 9 parts (different from the previous set) is used for training while the remaining one part is used for testing. This process is repeated until each part forms one test set. The average of the 10 accuracy scores on 10 test sets is the KFold cross_val_score.
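The 10-fold procedure in the note above can be sketched with KFold and cross_val_score. The data is synthetic and stands in for the training set, and the shuffle and max_iter settings are choices made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data stands in for the training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 10 folds: each fold serves as the test part exactly once
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(len(scores))     # one accuracy score per fold
print(scores.mean())   # the averaged cross-validation score
```

The spread of the ten scores is also informative: a large gap between folds suggests the estimate depends heavily on which samples land in the test part.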

The train_test_split function from the sklearn model_selection module is used to split our features (X) and targets (y) into training and test sets. test_size = 0.25 specifies that 25% of the data is kept in the test set, while setting random_state = 40 ensures that the same training and test sets are generated every time the code is run. Many machine learning steps involve randomness, and setting a random_state ensures that the results are reproducible.

Model Fitting and Prediction

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# multi_class is kept because it is discussed below; recent scikit-learn releases deprecate it
lr = LogisticRegression(C=1.0, class_weight='balanced', dual=False, fit_intercept=True,
                        intercept_scaling=1, max_iter=100, multi_class='auto', n_jobs=None,
                        penalty='l2', random_state=1234, solver='lbfgs', tol=0.0001,
                        verbose=0, warm_start=False)
model1 = lr.fit(X_train, y_train)
prediction1 = model1.predict(X_test)

cm = confusion_matrix(y_test, prediction1)
sns.heatmap(cm, annot=True, cmap='winter', linewidths=0.3,
            linecolor='black', annot_kws={"size": 20})

# scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]]
TN = cm[0][0]
TP = cm[1][1]
FN = cm[1][0]
FP = cm[0][1]
print('Testing Accuracy for Logistic Regression:', (TP+TN)/(TP+TN+FN+FP))

The sklearn.metrics module includes score functions, performance metrics and distance metrics among others. The confusion_matrix function summarizes the classification results as a matrix of true versus predicted classes, from which the accuracy can be computed.

The sklearn linear_model module implements a variety of linear models like Linear regression, Logistic regression, Ridge regression, Lasso regression etc. We import the LogisticRegression class for our classification studies. A LogisticRegression object is instantiated.

The parameter C specifies the inverse of the regularization strength: smaller values of C mean stronger regularization. Regularization penalizes large coefficients in order to reduce overfitting. C=1.0 is the default value for LogisticRegression in the sklearn library.

The class_weight='balanced' setting assigns weights to the classes. If class_weight is left unspecified (None), every class gets a weight of 1. class_weight = 'balanced' assigns class weights by using the formula n_samples/(n_classes*np.bincount(y)). E.g. if n_samples = 100, n_classes = 2 and there are 50 samples belonging to each of the 0 and 1 classes, class_weight = 100/(2*50) = 1.
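For an unbalanced case, the same formula can be evaluated by hand and compared with scikit-learn's compute_class_weight helper. The 80/20 label split below is a hypothetical example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 80 samples of class 0 and 20 samples of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * np.bincount(y))
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# scikit-learn helper that applies the same rule
auto = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(manual)  # 0.625 for class 0 and 2.5 for class 1
print(auto)
```

The minority class receives the larger weight, so its misclassifications cost more during training.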

N.B. The liblinear solver uses the coordinate-descent algorithm instead of a gradient descent algorithm to find the optimal parameters of the logistic regression model. In gradient descent, we update all the parameters at once, while coordinate descent updates only one parameter at a time. In coordinate descent, we first initialize the parameter vector $\theta = [\theta_0, \theta_1, \dots, \theta_n]$. In the $k$-th iteration, only $\theta_i^{k}$ is updated, while $\theta_0^{k}, \dots, \theta_{i-1}^{k}$ and $\theta_{i+1}^{k-1}, \dots, \theta_n^{k-1}$ are held fixed.
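To make the update order concrete, here is a minimal coordinate-descent sketch. It is an illustration on a least-squares objective, not liblinear's actual routine for logistic regression, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def coordinate_descent_least_squares(X, y, n_sweeps=200):
    """Minimise 0.5 * ||X @ theta - y||^2 by updating one coordinate at a time."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)
    col_sq_norm = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for i in range(n_features):
            # Residual with the contribution of coordinate i removed.
            # Coordinates before i already hold this sweep's values,
            # coordinates after i still hold the previous sweep's values.
            partial_residual = y - X @ theta + X[:, i] * theta[i]
            # Exact minimiser along coordinate i; all other coordinates stay fixed
            theta[i] = X[:, i] @ partial_residual / col_sq_norm[i]
    return theta

# Synthetic, noise-free data so the true parameters are known
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_theta

theta_hat = coordinate_descent_least_squares(X_demo, y_demo)
print(theta_hat)
```

Each inner step solves a one-dimensional problem exactly, which is what makes coordinate descent cheap per update.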

fit_intercept = True The default value is True. Specifies if a constant should be added to the decision function.

intercept_scaling = 1 The default value is 1. Is applicable only when the solver is liblinear and fit_intercept = True. [X] becomes [X, intercept_scaling]. A synthetic feature with constant value = intercept_scaling is appended to [X]. The intercept becomes, intercept scaling * synthetic feature weight. Synthetic feature weight is modified by L1/L2 regularizations. To lessen the effect of regularization on synthetic feature weights, high intercept_scaling value must be chosen.

max_iter = 100 (default). The maximum number of iterations allowed for the solver to converge.

multi_class = 'ovr', 'multinomial' or auto (default). auto selects 'ovr', i.e. a binary problem is fit for each label, if the data is binary or if the solver is liblinear. Otherwise auto selects multinomial, which minimises the multinomial loss function even when the data is binary.

n_jobs (default = None). The number of CPU cores used when parallelizing computations for multi_class='ovr'. None means 1 core is used. -1 means all cores are used. Ignored when the solver is set to liblinear.

penalty: specify the penalty norm (default = L2).

random_state = set random state so that the same results are returned every time the model is run.

solver = the choice of the optimization algorithm (default = ‘lbfgs’)

tol = Tolerance for stopping criteria (default = 1e-4)

verbose = 0 (for suppressing information during the running of the algorithm)

warm_start (default = False). When set to True, the solution from the previous call to fit is used as the initialization for the present call. This is not applicable for the liblinear solver.

Next, we call the fit method on the logistic regressor object using (X_train, y_train) to find the parameters of our logistic regression model. We then call the predict method on the logistic regressor object with X_test, which uses the parameters learned earlier by fit().

We can calculate the confusion matrix to measure the accuracy of the model using the predicted values and y_test.
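As a rough illustration of how accuracy is read off a confusion matrix, here is a small plain-Python sketch. The labels below are made-up values, not the heart disease data, and the helper names are assumptions; the matrix layout (rows are true classes, columns are predicted classes) follows the convention scikit-learn uses.

```python
def confusion_matrix_binary(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return [[tn, fp], [fn, tp]]


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)


# Hypothetical test labels and predictions, for illustration only.
y_test = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
cm = confusion_matrix_binary(y_test, y_pred)
print(cm, accuracy(cm))
```

The diagonal entries count correct predictions, so accuracy is their sum divided by the number of test samples.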

The parameters for the sns (seaborn) heatmap have been explained earlier. The linecolor parameter specifies the colour of the lines that will divide each cell. The annot_kws parameter passes keyword arguments to the matplotlib method – fontsize in this case.

Accuracy on the test samples = 89.47%.



This brings us to the end of the article. In this article, we developed a logistic regression model for heart disease prediction using a dataset from the UCI repository. We focused on gaining an in-depth understanding of the hyperparameters, libraries and code used when defining a logistic regression model through the scikit-learn library.



The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.


Building A Machine Learning Model In Bigquery


Google’s BigQuery is a powerful cloud-based data warehouse that provides fast, flexible, and cost-effective data storage and analysis capabilities. One of its unique features is the ability to build and run machine learning models directly inside the database without extracting the data and moving it to another platform.

BigQuery was created to analyse data with billions of rows using SQL-like syntax. It is hosted on Google Cloud infrastructure and is accessible via a REST-oriented application programming interface (API).

Learning Objectives

In this article, we will:

Walk through the process of building a machine learning model in BigQuery.

Learn the key steps of ETL, feature selection and preprocessing, model creation, performance evaluation, and prediction.

This article was published as a part of the Data Science Blogathon.

Advantages of BigQuery

Scalability: BigQuery is a fully managed, cloud-native data warehouse that can easily handle petabyte-scale datasets. This makes it an ideal platform for machine learning, as it can handle large amounts of data and provide fast, interactive results.

Cost-effectiveness: BigQuery is designed to be cost-effective, with a flexible pricing model that allows you to only pay for what you use. This makes it an affordable option for machine learning, even for large and complex projects.

Integration with other Google Cloud services: BigQuery integrates seamlessly with other Google Cloud services, such as Google Cloud Storage, Google Cloud AI Platform, and Google Cloud Data Studio, providing a complete machine learning solution that is easy to use and scalable.

SQL Support: BigQuery supports standard SQL, which makes it easy for data analysts and developers to work with data and build machine learning models, even if they do not have a background in machine learning.

Security and Privacy: BigQuery implements the highest levels of security and privacy, with support for encryption at rest and in transit and strict access controls to ensure your data is secure and protected.

Real-time Analytics: BigQuery provides real-time analytics capabilities, allowing you to run interactive queries on large datasets and get results in seconds. This makes it an ideal platform for machine learning, enabling you to test and iterate your models quickly and easily.

Open-source Integrations: BigQuery supports several open-source integrations, such as TensorFlow, Pandas, and scikit-learn, which makes it easy to use existing machine learning libraries and models with BigQuery.

Step-by-step Tutorial to Build a Machine Learning Model in BigQuery

This guide will provide a step-by-step tutorial on how to build a machine-learning model in BigQuery, covering the following five main stages:

Extract, Transform, and Load (ETL) Data into BigQuery

Select and Preprocess Features

Create the Model Inside BigQuery

Evaluate the Performance of the Trained Model

Use the Model to Make Predictions

Step 1. Extract, Transform, and Load (ETL) Data into BigQuery

The first step in building a machine learning model in BigQuery is to get the data into the database. BigQuery supports loading data from various sources, including cloud storage (such as Google Cloud Storage or AWS S3), local files, or even other databases (such as Google Cloud SQL).

For the purposes of this tutorial, we will assume that we have a dataset in Google Cloud Storage that we want to load into BigQuery. The data in this example will be a simple CSV file with two columns: ‘age’ and ‘income’. Our goal is to build a model that predicts income based on age.

To load the data into BigQuery, we first need to create a new table in the database. This can be done using the BigQuery web interface or the command line tool.

Once the table is created, we can use the `bq` command line tool to load the data from our Cloud Storage bucket:

bq load --source_format=CSV mydataset.mytable gs://analyticsvidya/myfile.csv age:INTEGER,income:FLOAT

This command specifies that the data is in CSV format and that the columns ‘age’ and ‘income’ should be treated as integer and float values, respectively.

Step 2. Select and Preprocess Features

The next step is to select and preprocess the features we want to use in our model. In this case, we only have two columns in our dataset, ‘age’ and ‘income’, so there’s not much to do here. However, in real-world datasets, there may be many columns with various data types, missing values, or other issues that need to be addressed.

One common preprocessing step is to normalize the data. Normalization is the process of scaling the values of a column to a specific range, such as [0,1]. This is useful for avoiding biases in the model due to differences in the scales of different columns.

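BigQuery's built-in `NORMALIZE` function performs Unicode normalization of strings, not numeric scaling. To scale a numeric column to [0,1], we can instead compute min-max scaling with window functions:

WITH data AS ( SELECT age, income FROM mydataset.mytable ) SELECT age, SAFE_DIVIDE(income - MIN(income) OVER (), MAX(income) OVER () - MIN(income) OVER ()) AS income_norm FROM data

This query returns the age column together with the normalized income values. To persist the result as a new table, wrap the query in a `CREATE TABLE ... AS` statement.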
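To make the [0,1] scaling described above concrete, here is a plain-Python sketch of min-max normalization. It runs locally rather than in BigQuery, and the income values and the function name `min_max_scale` are assumptions for illustration.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical income column.
incomes = [30000.0, 45000.0, 60000.0, 90000.0]
income_norm = min_max_scale(incomes)
print(income_norm)  # [0.0, 0.25, 0.5, 1.0]
```

Because every value is divided by the column's range, columns measured on very different scales end up comparable.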

Step 3. Create the Model Inside BigQuery

Once the data is preprocessed, we can create the machine learning model. BigQuery supports a variety of machine learning models, including linear regression, logistic regression, decision trees, and more.

For this example, we will use a simple linear regression model, which predicts a continuous target variable based on one or more independent variables. In our case, the target variable is income, and the independent variable is age.

To create the linear regression model in BigQuery, we use the `CREATE MODEL` statement:

CREATE MODEL mydataset.mymodel OPTIONS (model_type='linear_reg', input_label_cols=['income']) AS SELECT age, income FROM `mydataset.mytable`

This statement creates a new model called `mymodel` in the `mydataset` dataset. The `OPTIONS` clause specifies that the model type is a linear regression model and that the input label column is `income`. The `AS` clause specifies the query that returns the data that will be used to train the model.
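For intuition about what a linear regression of income on age estimates, here is a small sketch of ordinary least squares with a single feature. It is only an illustration of the underlying idea, not BigQuery's training procedure, and the data points and function name are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (age, income) pairs that lie exactly on a line.
ages = [25, 35, 45, 55]
incomes = [30000.0, 40000.0, 50000.0, 60000.0]
slope, intercept = fit_simple_linear_regression(ages, incomes)
print(slope, intercept)  # 1000.0 5000.0
```

With these numbers, each additional year of age adds 1000 to the predicted income.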

Following are the different models supported by BigQuery

Linear Regression: This is a simple and widely used model for predicting a continuous target variable based on one or more independent variables.

Logistic Regression: This type of regression model predicts a binary target variable based on one or more independent variables, such as yes/no or 0/1.

K-Means Clustering: This is an unsupervised learning algorithm that is used to divide a set of data points into K clusters, where each cluster is represented by its centroid.

Time-series Models: These models are used to forecast future values based on past values of time-series data, such as sales data or stock prices.

Random Forest: This ensemble learning method combines the predictions of multiple decision trees to create a more accurate and robust prediction.

Gradient Boosted Trees: This is another ensemble learning method that combines the predictions of multiple decision trees but uses a gradient-based optimization approach to create a more accurate and robust prediction.

Neural Networks: This is a machine learning model inspired by the structure and function of the human brain. Neural networks are used for various tasks, such as image classification, natural language processing, and speech recognition.

In addition to these models, BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.

Step 4. Evaluate the Performance of the Trained Model

Once the model is created, we need to evaluate its performance to see how well it can predict the target variable based on the independent variable. This can be done by splitting the data into a training set and a test set and using the training set to train the model and the test set to evaluate its performance.

In BigQuery, we can use the `ML.EVALUATE` function to evaluate the performance of the model:

SELECT * FROM ML.EVALUATE(MODEL mydataset.mymodel, ( SELECT age, income FROM `mydataset.mytable`))

This query returns various metrics, including mean absolute error, mean squared error, median absolute error, and R-squared, all of which measure how well the model can predict the target variable.
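The sketch below shows how three of these regression metrics are commonly defined, using plain Python and made-up numbers. The dictionary keys mirror typical metric names and the function name is an assumption; this is not BigQuery's implementation.

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    ss_res = sum(e * e for e in errors)              # unexplained variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mean_absolute_error": mae, "mean_squared_error": mse, "r2_score": r2}


# Hypothetical actual and predicted values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Lower error values and an R-squared closer to 1 indicate a better fit.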

Step 5. Use the Model to Make Predictions

Finally, once we have evaluated the model’s performance, we can use it to make predictions on new data. In BigQuery, we can use the `ML.PREDICT` function to make predictions:

SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, ( SELECT age FROM `mydataset.mytable`))

This query returns the input data along with the predicted income values in a `predicted_income` column.


Key Takeaways from this article:

BigQuery supports several open-source integrations, such as TensorFlow, Pandas, and scikit-learn, which makes it easy to use existing machine learning libraries and models with BigQuery.

BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.

In the example we studied, the `ML.PREDICT` query returns the original data along with the predicted income values.



How To Train A Custom Dataset With Yolov5?


We have seen some fancy terms for AI and deep learning, such as pre-trained models, transfer learning, etc. Let me introduce you to one of the most widely used and effective of these techniques: transfer learning with YOLOv5.

You Only Look Once, or YOLO is one of the most extensively used deep learning-based object identification methods. Using a custom dataset, this article will show you how to train one of its most recent variations, YOLOv5.

Learning Objectives 

This article will focus mainly on training the YOLOv5 model on a custom dataset implementation.

We will see what pre-trained models are and what transfer learning is.

We will understand what YOLOv5 is and why we are using version 5 of YOLO.

So, without wasting time, let’s get started with the process

Table of Content

Pre-Trained Models

Transfer Learning

What and Why YOLOv5?

Steps Involved In Transfer Learning


Some Challenges That You Can Face


Pre-trained Models

You might have heard data scientists use the term “pre-trained model” widely. Before explaining the term, let me explain what a deep learning model/network does. A deep learning model contains various layers stacked together to serve a single purpose, such as classification or detection. Deep learning networks learn by discovering complicated structures in the data fed to them and saving the learned weights in a file, which is later used to perform similar tasks. Pre-trained models are deep learning models that have already been trained on a huge dataset, often containing millions of images.

Here is how the TensorFlow website defines pre-trained models: A pre-trained model is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task.

Some highly optimized and extraordinarily efficient pre-trained models are available on the internet. Different models are used to perform different tasks. Some of the pre-trained models are VGG-16, VGG-19, YOLOv5, YOLOv3, and ResNet 50.

Which model to use depends on the task you want to perform. For example, if I want to perform an object detection task, I will use the YOLOv5 model.

Transfer Learning

Transfer learning is one of the most important techniques for easing the task of a data scientist. Training a model is a hefty and time-consuming task; if a model is trained from scratch, it usually does not give very good results. Even if we train a model similar to a pre-trained model, it will not perform as effectively, and it can take weeks for a model to train. Instead, we can take a pre-trained model and reuse its already learned weights by training it on a custom dataset to perform a similar task. These models are highly efficient and refined in terms of architecture and performance, and they have made their way to the top by performing better in different contests. These models are trained on very large amounts of data, making them more diverse in knowledge.

So transfer learning basically means transferring knowledge gained by training the model on previous data to help the model learn better and faster to perform a different but similar task.

For example, we can use YOLOv5 for object detection even when the objects to detect differ from those in the data the model was originally trained on.

What and Why YOLOv5?

YOLOv5, which stands for You Only Look Once version 5, is a pre-trained model used for real-time object detection, and it has proven to be highly efficient in terms of accuracy and inference time. There are other versions of YOLO, but as one would predict, YOLOv5 performs better than the earlier versions. YOLOv5 is fast and easy to use. It is based on the PyTorch framework, which has a larger community than the Darknet framework used by YOLOv4.

We will now look at the architecture of YOLOv5.

The structure may look confusing, but that does not matter: we do not have to study the architecture and can instead use the model and its weights directly.

In transfer learning, we use the custom dataset i.e., the data the model has never seen before OR the data on which the model is not trained. Since the model is already trained on a large dataset, we already have the weights. We can now train the model for a number of epochs on the data we want to work on. Training is required as the model has seen the data for the first time and will require some knowledge in order to perform the task.

Steps Involved in Transfer Learning

Transfer learning is a simple process, and we can do it in a few simple steps:

Data preparation

The right format for the annotations

Change a few layers if you want to

Retrain the model for a few iterations


Data Preparation

Data preparation can be time-consuming if your chosen dataset is large. Data preparation means annotating the images, which is a process where you label the images by drawing a box around the object in the image. By doing this, the coordinates of the marked object will be saved in a file, which will then be fed to the model for training. There are a few websites that can help you label the data.

Here is how you can annotate the data for the YOLOv5 model:

4. After loading the images, you will be asked to create labels for your dataset’s different classes.

5. Start creating a bounding box around the object in the image. This exercise may be a bit fun initially, but with very large data, it can be tiring.

6. After annotating all the images, you need to save the file, which will contain the coordinates of the bounding boxes along with the class.

7. This is a significant step, so follow it carefully.

To give you an idea of the folder, I have created a folder named ‘CarsData’ and in that folder made two folders – ‘images’ and ‘labels.’

Inside each of the two folders, you have to make two more folders named ‘train’ and ‘val’. In the images folder, you can split the images however you like, but you have to be careful while splitting the labels, as the labels must match the images you have split.

8. Now make a zip file of the folder and upload it to the drive so that we can use it in Colab.
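The folder layout from step 7 can also be produced with a short script, which keeps every label next to its matching image. This is only a sketch: the function name `split_dataset`, the 80/20 split, and the assumption that each image `name.jpg` has a label file `name.txt` are illustrative choices, not requirements stated in this article.

```python
import random
import shutil
from pathlib import Path


def split_dataset(image_dir, label_dir, out_dir, val_fraction=0.2, seed=0):
    """Copy images and their label files into
    out_dir/images/{train,val} and out_dir/labels/{train,val}."""
    image_dir, label_dir, out_dir = Path(image_dir), Path(label_dir), Path(out_dir)
    images = sorted(
        p for p in image_dir.iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
    )
    random.Random(seed).shuffle(images)  # reproducible shuffle
    n_val = int(len(images) * val_fraction)
    splits = {"val": images[:n_val], "train": images[n_val:]}
    for split, files in splits.items():
        img_out = out_dir / "images" / split
        lbl_out = out_dir / "labels" / split
        img_out.mkdir(parents=True, exist_ok=True)
        lbl_out.mkdir(parents=True, exist_ok=True)
        for img in files:
            label = label_dir / (img.stem + ".txt")
            if not label.exists():
                continue  # skip images that have no annotation file
            # The image and its label always land in the same split.
            shutil.copy(img, img_out / img.name)
            shutil.copy(label, lbl_out / label.name)


# Example call with hypothetical source folders:
# split_dataset("raw_images", "raw_labels", "CarsData")
```

Because the label path is derived from the image name, an image and its label can never end up in different splits.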


We will now come to the implementation part, which is very simple but tricky. If you don’t know what files to change exactly, you won’t be able to train the model on the custom dataset. 

So here are the commands that you should follow to train the YOLOv5 model on a custom dataset.

I recommend you use Google Colab for this tutorial, as it also provides a GPU, which makes the computations faster.

This will make a copy of the YOLOv5 repository which is a GitHub repository created by ultralytics.

This is a command-line shell command used to change the current working directory to the YOLOv5 directory.

This command will install all the packages and libraries used in training the model.

Unzipping the folder that contains images and labels in google colab

Here comes the most important step…

You have now performed almost all the steps and need to write one more line of code that will train the model, but, before that, you need to perform a few more steps and change some directories in order to give the path of your custom dataset and train your model on that data.

Here is what you need to do.

Go ahead and download this folder.

After the folder is downloaded, you need to make a few changes to it and upload it back to the same folder you downloaded it from.

Let’s now look at the content of the file we have downloaded, and it will look something like this.

We are going to customize this file according to our dataset and annotations.

We have already unzipped the dataset on Colab, so we are going to copy the paths of our train and validation images. After copying the path of the train images, which will be in the dataset folder and looks something like this ‘/content/yolov5/CarsData/images/train’, paste it into the .yaml file that we just downloaded.

Do the same with the test and validation images.

Save this file with any name you want. I have saved the file with the name customPath.yaml. Now upload this file back to Colab, to the same place where the original file was.

Now we are done with the editing part and ready to train the model.

Run the following command to train your model for a few epochs on your custom dataset.

Do not forget to change the name of the file you have uploaded (‘customPath.yaml’). You can also change the number of epochs for which the model is trained. In this case, I am going to train the model for only a few epochs.

5. !python train.py --img 640 --batch 16 --epochs 10 --data customPath.yaml --weights

Keep in mind the path where you upload the folder. If the path is changed, then the commands will not work at all.

After you run this command, your model should start training and you will see something like this on your screen.

After all the epochs are completed, your model can be tested on any image.

You can do some more customization in the detect.py file to control what you want to save and what you don’t, such as the detections where the license plates are detected.

6. !python detect.py --weights path_of_the_trained_weights --source path_of_the_image

You can use this command to test the prediction of the model on some of the images.

Some Challenges That You can Face

Although the steps explained above are correct, there are some problems you can face if you don’t follow them exactly. 

Wrong path: This can be a real headache. If you have entered the wrong path somewhere while setting up training, it can be hard to identify, and you will not be able to train the model.

Wrong format of labels: This is a widespread problem faced by people while training a YOLOv5 model. The model only accepts a format in which every image has its own text file with the desired format inside. Often, an XLS format file or a single CSV file is fed to the network, resulting in an error. If you are downloading the data from somewhere instead of annotating each and every image yourself, the labels may be saved in a different file format. In that case, the labels need to be converted to the YOLO format first.

Not naming the files correctly: Naming a file or folder incorrectly will again lead to an error. Pay attention to the steps while naming the folders to avoid this error.
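To illustrate the label format mentioned above, here is a small sketch that converts a pixel-coordinate bounding box into one line of a YOLO label file. Each line holds the class index followed by the box centre, width, and height, all normalised by the image size. The function name and the example box are assumptions for illustration.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel box (corners) to 'class x_center y_center width height',
    with all four box values normalised to the range [0, 1]."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical box for class 0 in a 640x480 image.
line = to_yolo_line(0, 100, 200, 300, 400, 640, 480)
print(line)  # 0 0.312500 0.625000 0.312500 0.416667
```

One such line is written per object, and all lines for an image go into a text file that shares the image's base name.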


In this article, we learned what transfer learning and pre-trained models are. We learned when and why to use the YOLOv5 model and how to train the model on a custom dataset. We went through every step of the implementation, from preparing the dataset to changing the paths and finally feeding the data to the network. We also looked at common problems faced while training a YOLOv5 model and their solutions. I hope this article helped you train your first YOLOv5 model on a custom dataset and that you liked the article.


Transforming Business Models With Integrated Customer 360 Views

In an operating environment marked by transparent markets and rising global competition, the customer has undoubtedly become king. A 2023 Salesforce survey revealed that the two biggest challenges faced by organizations today are meeting customer expectations and reacting appropriately to shifts in market demand. For the most part, businesses have responded to these issues by implementing varying levels of digital transformation across their front and backend processes. These strategies encompass everything from setting up ERP and WMS systems within supply chain functions to leveraging social media platforms and mobile applications to make their marketing and sales more accessible for users.

Technology Alone Isn’t a Competitive Differentiator

While technological investments may help establish your brand amongst a growing stable of forward-thinking organizations, they can only produce lasting ROI when accompanied by a comprehensive reporting and analytics framework. Take a look at industry leaders like Netflix, Apple, Amazon, and Tesla. You’ll see that they all share an obsessive dedication to understanding their audience and creating personalized offerings that seamlessly meet the demands of each individual. This outward-in strategy is built around a robust reporting and analytics framework that draws data from varied sources both within and outside the organization. From real-time user behaviors tracked through website analytics to lead lists and sales histories gathered from CRM systems to sentiment analyses drawn from customer service channels and review aggregators, each touchpoint creates a wealth of data that can be used to create end-to-end customer experiences. Unfortunately, insights are all-too-often hidden behind siloed systems that are only accessible to specific functions. If they want to create cohesive business strategies that translate across multiple channels, businesses need to consolidate their disparate datasets and move them into a centralized repository that can provide a complete view of prospective and current customers across all targeted market segments.  

Creating a 360 Degree Customer View

Traditionally, these objectives have been served through a system of records (SOR). This architecture served as a hub for all data relating to a specific business function. Once consolidated, inputs were combined to provide a more insightful look into the underlying process. Unfortunately, the SOR of old was often limited to a single platform, e.g., Salesforce for sales data, HubSpot for marketing information, or a 3PL system for logistics and supply chain. While accurate, the picture provided by these systems often lacked a greater business context. As a result, a far more holistic approach was required to achieve the ever-elusive 360 view. That’s where the concept of a golden record becomes so essential. A golden record consists of cleansed, validated, and merged data collected from disparate platforms both within and outside the enterprise. Data extraction from each of these sources is automated to ensure minimum lag time between the receipt of information into each operational store and its delivery into the golden record.

Matching data from different sources removes duplicated and contradictory data so that a definitive customer profile is available for decision-makers across the enterprise.

A 360-degree architecture is time-sensitive and touchpoint-sensitive; it provides an account of when and how customers interact with your brand, which channels they prefer to use, and where their last-known engagement occurred. With this information at hand, organizations can build a complete map of their buyer’s journey.

As new data filters in from each source platform, it is updated dynamically in the golden record, so the customer 360 is always as current as possible from every possible perspective.

The 360-view encompasses external data such as market trends, 3rd party reports, and public statistics to supplement internal sources. These inputs provide context for customer decisions and help organizations make macro-level decisions with far greater certainty.

This system is designed for enterprise-wide accessibility. The goal is to allow the workforce to derive insights from data based on their functional concerns.
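As a rough sketch of the matching and merging idea behind a golden record, the snippet below merges customer records from several sources into one profile per customer. The choice of a lower-cased email address as the matching key, the rule that the most recent non-empty value wins, and all field names and records are assumptions made for illustration; real implementations typically use more elaborate matching rules.

```python
def build_golden_records(records):
    """Merge source records into one profile per customer.

    Records are matched on a normalised email address. Older records are
    applied first, so the most recent non-empty value of each field wins.
    """
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated_at"]):
        key = rec["email"].strip().lower()
        merged = golden.setdefault(key, {})
        for field, value in rec.items():
            if value not in (None, ""):
                merged[field] = value  # newer non-empty values overwrite older ones
        merged["email"] = key
    return golden


# Hypothetical records from three source systems (ISO dates sort as strings).
records = [
    {"email": "Ana@Example.com", "name": "Ana", "phone": "",
     "source": "crm", "updated_at": "2023-01-10"},
    {"email": "ana@example.com", "name": "Ana Lopez", "phone": "555-0100",
     "source": "support", "updated_at": "2023-03-02"},
    {"email": "bo@example.com", "name": "Bo", "phone": None,
     "source": "web", "updated_at": "2023-02-01"},
]
golden = build_golden_records(records)
print(golden["ana@example.com"])
```

The two records for the same customer collapse into a single profile, with the empty phone number filled in from the newer record.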


Automating Data Integration in Your Customer 360

Ultimately, the biggest obstacle to creating the 360 view isn’t data extraction from disparate sources, the subsequent validation and cleansing of data, or even consolidation in an accessible repository – it’s the time, resources, and expertise required to design and deploy these processes from start to finish. Here are just some of the challenges that can arise during a manual implementation:

Each siloed system will house data in a unique format and layout. Connecting to these sources and linking extracted datasets with those from other functions requires a considerable amount of technical skill.

A data steward will need to be appointed to oversee the transfer of data in line with applicable governance policies.

Processes will need to be put into place to ensure updates and changes to source systems are transferred to the 360 view as quickly as possible. Otherwise, records will become outdated. Again this will require significant intervention from your IT team.

Data may be duplicated in different organizational silos, and there could be inconsistencies between these records, or they may be formatted differently. In these cases, the master record will need to be identified and missing or erroneous entries removed.

The timeliness of data may differ from system to system, which will hamper the accuracy of the 360 view. So, the duration of updates for each silo will need to be verified, and these details will need to be accounted for when designing manual integration flows.

Exploring Recommendation System (With An Implementation Model In R)


We are subconsciously exposed to recommendation systems when we visit websites such as Amazon, Netflix, IMDb and many more. They have become an integral part of online marketing (pushing products online). Let’s learn more about them here.

In this article, I’ve explained the working of a recommendation system using a real-life example, just to show you that it is not limited to online marketing; it is being used by all industries. We’ll also learn about its various types, followed by a practical exercise in R. The terms ‘recommendation engine’ and ‘recommendation system’ have been used interchangeably. Don’t get confused!

Recommendation System in Banks – Example

Today, every industry is making full use of recommendation systems with its own tailored versions. Let’s take the banking industry as an example.

Bank X wants to make use of transaction information and customize the offers it provides to its existing credit and debit card users accordingly. Here is what the end state of such an analysis looks like:

Customer Z walks into a Pizza Hut. He pays the food bill with Bank X’s card. Using all the past transaction information, Bank X knows that Customer Z likes to have an ice cream after his pizza. Using the transaction at Pizza Hut, the bank has pinpointed the location of the customer. Next, it finds 5 ice cream stores which are close enough to the customer, 3 of which have ties with Bank X.

This is the interesting part. Now, here are the deals with these ice-cream stores:

Store 5 : Bank profit – $4, Customer expense – $11, Propensity of customer to respond – 20%

Let’s assume the marked price is proportional to the customer’s desire to have that ice cream. Hence, the customer struggles with the trade-off between fulfilling his desire at the extra cost and buying the cheaper ice cream. Bank X wants the customer to go to store 3, 4 or 5 (higher profits). It can increase the propensity of the customer to respond if it gives him a reasonable deal. Let’s assume that discounts are always whole numbers. Without any discount, the expected value is:

Expected value = 20%*{2 + 2 + 5 + 6 + 4 } = $ 19/5 = $3.8

Can we increase the expected value by giving out discounts? Here is how the propensity varies at stores 3, 4 and 5:

Store 3 : Discount of $1 increases propensity by 5%, a discount of $2 by 7.5% and a discount of $3 by 10%

Store 4 : Discount of $1 increases propensity by 25%, a discount of $2 by 30%, a discount of $3 by 35% and a discount of $4 by 80%

Store 5 : No change with any discount

Banks cannot give multiple offers with competing merchants at the same time. You need to assume that an increase in the propensity for one store produces an equal total percentage-point decrease, spread evenly across the propensities for all the other stores. Here is the calculation for the most intuitive case: give a discount of $2 at store 4.

Expected value = 50%/4 * (2 +  2 + 5 + 4) + 50% * 5 = $ 13/8 + $2.5 = $1.6 + $2.5 = $4.1 
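The two expected-value calculations above can be reproduced with a few lines of Python. This is only a sketch: the function name is an assumption, and the store-4 profit of $5 under the offer is taken directly from the article's calculation rather than derived here.

```python
def expected_profit(profits, propensities):
    """Expected bank profit: sum of profit times response propensity per store."""
    return sum(p * q for p, q in zip(profits, propensities))


# Baseline: bank profit per store (stores 1 to 5), each with a 20% propensity.
profits = [2, 2, 5, 6, 4]
baseline = expected_profit(profits, [0.20] * 5)

# Offer at store 4: its propensity rises by 30 points to 50%, and the other
# four stores share the remaining 50% equally (12.5% each).
boost = 0.30
propensities = [0.20 - boost / 4] * 5
propensities[3] = 0.20 + boost
# Store-4 profit of 5 is the figure used in the article's calculation.
profits_with_offer = [2, 2, 5, 5, 4]
with_offer = expected_profit(profits_with_offer, propensities)

print(round(baseline, 3), round(with_offer, 3))  # 3.8 4.125
```

The unrounded result for the offer is $4.125, which the calculation above rounds to $4.1.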

Think Box : Is there any better option available which can give bank a higher profit? I’d be interested to know!

You see, making recommendations isn’t about extracting data, writing code and being done with it. Instead, it requires mathematics, logical thinking and a flair for using a programming language. Trust me, the third one is the easiest of all. Feeling confident? Let’s proceed.

What exactly is the work of a recommendation engine?

The previous example should have given you a fair idea. It’s time to make it crystal clear. Let’s understand what a recommendation engine can do in the context of the previous example (Bank X):

It finds the merchants/items that a customer might be interested in after buying something else.

It estimates the profit and loss when many competing items can be recommended to the customer. Then, based on the profile of the customer, it recommends a customer-centric or product-centric offering. For a high-value customer, whose wallet share other banks are also interested in gaining, you might want to bring out the best of your offers.

It can enhance customer engagement by providing offers that are highly appealing to the customer. The customer might have purchased the item anyway, but with an additional offer, the bank can win that customer’s interest.

What are the types of Recommender Engines ?

There are broadly two types of recommender engines, and the choice between them depends on the industry. We have explained each of these algorithms in our previous articles, but here I try to give a practical explanation to help you understand them easily.

I’ve explained these algorithms in context of the industry they are used in and what makes them apt for these industries.

Context based algorithms:

As the name suggests, these algorithms (more commonly called content-based algorithms) rely on deriving the context of each item. Once you have gathered this context-level information on items, you look for look-alike items and recommend them. For instance, on YouTube you can find the genre, language, and cast of a video. Based on this information, we can find look-alike (related) videos. Once we have the look-alikes, we simply recommend them to a customer who has so far seen only the first video. Such algorithms are very common on online video channels, online music stores, and similar platforms. A plausible reason is that context-level information is far easier to get when the product or item can be described with a few dimensions.
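The look-alike idea can be sketched in a few lines of Python. Everything here is illustrative: the catalogue, the attribute tags, and the choice of Jaccard overlap as the similarity measure are assumptions, not something the article prescribes.

```python
def jaccard(a: set, b: set) -> float:
    """Share of attributes two items have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical catalogue: each video is described by a few context attributes
# (genre, language, cast).
videos = {
    "video_a": {"comedy", "english", "actor_x"},
    "video_b": {"comedy", "english", "actor_y"},
    "video_c": {"drama", "hindi", "actor_z"},
    "video_d": {"comedy", "hindi", "actor_x"},
}

def look_alikes(watched: str, catalogue: dict, top_n: int = 2) -> list:
    """Rank the other items by attribute overlap with the watched item."""
    scores = [
        (other, jaccard(catalogue[watched], attrs))
        for other, attrs in catalogue.items()
        if other != watched
    ]
    # Highest overlap first; ties keep catalogue order (the sort is stable).
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

recommendations = look_alikes("video_a", videos)
print(recommendations)
```

Note that no transaction data is needed here: the recommendation for someone who watched `video_a` comes purely from the item attributes.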

Collaborative filtering algorithms:

This is one of the most commonly used families of algorithms because it does not depend on any additional information. All you need is transaction-level data. For instance, e-commerce players like Amazon and banks like American Express often use these algorithms to make merchant or product recommendations. There are several types of collaborative filtering algorithms:

User-User Collaborative filtering: Here we find look-alike customers for every customer and offer the products that those look-alikes have chosen in the past. This algorithm is very effective but takes a lot of time and resources, since it requires computing information for every customer pair. Therefore, for platforms with a large customer base, it is hard to implement without a strongly parallelized system.

Item-Item Collaborative filtering: This is quite similar to the previous algorithm, but instead of finding customer look-alikes, we find item look-alikes. Once we have the item look-alike matrix, we can easily recommend similar items to any customer who has purchased an item from the store. This algorithm consumes far fewer resources than user-user collaborative filtering. For a new customer, it also takes far less time than the user-user approach, because we don’t need the similarity scores between all customers. And with a fixed number of products, the product-product look-alike matrix stays fixed over time.

Other simpler algorithms: There are other approaches, such as market basket analysis, which generally have lower predictive power than the algorithms above.
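To see why the user-user variant gets expensive, here is a minimal Python sketch. The purchase history and the overlap measure are hypothetical; the point is only that one score is needed for every customer pair.

```python
from itertools import combinations

# Hypothetical purchase history: customer -> set of merchants visited.
purchases = {
    "cust_1": {"m1", "m2", "m3"},
    "cust_2": {"m1", "m2", "m4"},
    "cust_3": {"m5"},
}

def overlap(a: set, b: set) -> float:
    """Jaccard overlap between two customers' merchant sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# User-user filtering needs one score per customer pair: n * (n - 1) / 2.
pair_scores = {
    (u, v): overlap(purchases[u], purchases[v])
    for u, v in combinations(sorted(purchases), 2)
}

def recommend_for(customer: str) -> list:
    """Offer what the closest look-alike bought and this customer has not."""
    others = [c for c in purchases if c != customer]
    best = max(others, key=lambda c: overlap(purchases[customer], purchases[c]))
    return sorted(purchases[best] - purchases[customer])

user_recs = recommend_for("cust_1")
print(len(pair_scores), user_recs)
```

With three customers there are only three pairs, but the pair count grows quadratically with the customer base, which is the resource problem described above.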

How do we decide the performance metric of such engines?

Good question! Performance metrics are strongly driven by business objectives. Generally, there are three possible metrics you might want to optimise:

Based on dollar value:

If your overall aim is to increase profit or revenue using the recommendation engine, then your evaluation metric should be the incremental revenue, profit, or sales at each recommended rank. The ranks should show an unbroken order (a better rank should never perform worse than the rank below it), and the average revenue, profit, or sales should exceed the cost of the offer.

Based on propensity to respond:

If the aim is just to activate customers, or to make customers explore new items or merchants, this metric can be very helpful. Here you need to track the customers’ response rate at each rank.

Based on number of transactions:

Sometimes you are interested in the customer’s activity. For higher activity, the customer needs to make a higher number of transactions, so we track the number of transactions for the recommended ranks.

Other metrics:

There are other metrics you might be interested in, such as satisfaction rate or the number of calls to the call centre. These metrics are rarely used, as they generally give you results only for a sample rather than for the entire portfolio.
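Tracking a metric by recommended rank can be sketched as follows. The campaign log is made up, and the helper names `metrics_by_rank` and `is_unbroken` are hypothetical; the sketch only illustrates computing response rate and average revenue per rank and checking that the order is unbroken.

```python
# Hypothetical campaign log: (recommended rank, responded?, revenue in $).
offers = [
    (1, True, 12.0), (1, True, 8.0), (1, False, 0.0),
    (2, True, 9.0), (2, False, 0.0), (2, False, 0.0),
    (3, False, 0.0), (3, False, 0.0), (3, True, 6.0), (3, False, 0.0),
]

def metrics_by_rank(log):
    """Response rate and average revenue per offer sent, for each rank."""
    totals = {}
    for rank, responded, revenue in log:
        sent, hits, money = totals.get(rank, (0, 0, 0.0))
        totals[rank] = (sent + 1, hits + int(responded), money + revenue)
    return {
        rank: {"response_rate": hits / sent, "avg_revenue": money / sent}
        for rank, (sent, hits, money) in sorted(totals.items())
    }

def is_unbroken(values):
    """True when the metric never increases as the rank gets worse."""
    return all(a >= b for a, b in zip(values, values[1:]))

by_rank = metrics_by_rank(offers)
rates = [m["response_rate"] for m in by_rank.values()]
revenues = [m["avg_revenue"] for m in by_rank.values()]
print(by_rank)
print(is_unbroken(rates), is_unbroken(revenues))
```

The same table works for the transaction-count metric: replace the revenue column with the number of transactions each customer made after the offer.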

Building an Item-Item collaborative filtering Recommendation Engine using R

Let’s get some hands-on experience building a recommendation engine. Here, I demonstrate building an item-item collaborative filtering recommendation engine. The data contains just two columns: individual_merchant and individual_customer. The data is available for download.
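As a rough illustration of the item-item approach on data with this two-column layout, here is a minimal sketch. It is written in Python rather than R, the rows are made up rather than taken from the downloadable data, and cosine similarity on binary customer vectors is an assumed choice of look-alike measure.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical rows in the two-column layout described above:
# (individual_merchant, individual_customer).
rows = [
    ("m1", "c1"), ("m2", "c1"),
    ("m1", "c2"), ("m2", "c2"), ("m3", "c2"),
    ("m3", "c3"), ("m4", "c3"),
    ("m1", "c4"),
]

customers_of = defaultdict(set)   # merchant -> customers who transacted there
merchants_of = defaultdict(set)   # customer -> merchants they transacted at
for merchant, customer in rows:
    customers_of[merchant].add(customer)
    merchants_of[customer].add(merchant)

def cosine(a: set, b: set) -> float:
    """Cosine similarity between two merchants' binary customer vectors."""
    return len(a & b) / sqrt(len(a) * len(b))

# Item-item look-alike matrix, computed once for all merchant pairs.
similarity = {
    m: {n: cosine(customers_of[m], customers_of[n])
        for n in customers_of if n != m}
    for m in customers_of
}

def recommend(customer: str, top_n: int = 2) -> list:
    """Score unseen merchants by summed similarity to merchants already used."""
    seen = merchants_of[customer]
    scores = defaultdict(float)
    for m in seen:
        for n, sim in similarity[m].items():
            if n not in seen:
                scores[n] += sim
    # Highest score first; merchant name breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [merchant for merchant, _ in ranked[:top_n]]

item_recs = recommend("c4")
print(item_recs)
```

Customer `c4` has only ever used `m1`, yet still gets recommendations, because the look-alike matrix is built from merchants rather than from customer pairs.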


End Notes

Recommendation engines have become extremely common because they solve a business case found in almost every industry. Substitutes for these engines are hard to come by because they predict for multiple items or merchants at the same time, while classification algorithms struggle to handle so many classes in the output variable.

In this article, we learnt about the use of recommendation systems in banks. We also looked at implementing a recommendation engine in R. No doubt, they are being used across all sectors of industry, with the common aim of enhancing customer experience.


