Exploring Recommendation System (With An Implementation Model In R)



We are subconsciously exposed to recommendation systems when we visit websites such as Amazon, Netflix, IMDb and many more. They have become an integral part of online marketing (pushing products online). Let’s learn more about them here.

In this article, I’ve explained the working of a recommendation system using a real-life example, just to show you that it is not limited to online marketing; it is being used by all industries. Also, we’ll learn about its various types, followed by a practical exercise in R. The terms ‘recommendation engine’ and ‘recommendation system’ are used interchangeably here. Don’t get confused!

Recommendation System in Banks – Example

Today, every industry is making full use of recommendation systems with its own tailored version. Let’s take the banking industry as an example.

Bank X wants to make use of transaction information and accordingly customize the offers it provides to its existing credit and debit card users. Here is what the end state of such an analysis looks like:

Customer Z walks into a Pizza Hut. He pays the food bill with his bank X card. Using all the past transaction information, bank X knows that customer Z likes to have an ice cream after his pizza. Using the transaction at Pizza Hut, the bank has located the customer’s exact position. Next, it finds 5 ice cream stores which are close enough to the customer, 3 of which have ties with bank X.

This is the interesting part. Now, here are the deals with these ice-cream stores:

Store 5 : Bank profit – $4, Customer expense – $11, Propensity of customer to respond – 20%

Let’s assume the marked price is proportional to the customer’s desire for that ice cream. Hence, the customer faces a trade-off: fulfil his desire at the extra cost, or buy the cheaper ice cream. Bank X wants the customer to go to store 3, 4 or 5 (higher profits). It can increase the customer’s propensity to respond by giving him a reasonable deal. Let’s assume that discounts are always whole numbers. For now, the expected value is:

Expected value = 20% × (2 + 2 + 5 + 6 + 4) = 20% × $19 = $3.8

Can we increase the expected value by giving out discounts? Here is how the propensity varies at stores 3, 4 and 5:

Store 3 : Discount of $1 increases propensity by 5%, a discount of $2 by 7.5% and a discount of $3 by 10%

Store 4 : Discount of $1 increases propensity by 25%, a discount of $2 by 30%, a discount of $3 by 35% and a discount of $4 by 80%

Store 5 : No change with any discount

Banks cannot give multiple offers at the same time with competing merchants, and you need to assume that an increase in one store’s propensity causes an equal percentage-point decrease across all the other propensities. Here is the calculation for the most intuitive case: give a discount of $2 at store 4.

Expected value = 50%/4 × (2 + 2 + 5 + 4) + 50% × 5 = $13/8 + $2.5 = $1.6 + $2.5 = $4.1
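The expected-value arithmetic above is easy to play with in a few lines of R. The store profits ($2, $2, $5, $6, $4) and the 20% base propensity come from the example; the `expected_value` and `with_discount` helpers are names I’ve made up for illustration, and the convention of netting the discount out of the bank’s profit is an assumption:

```r
# Profits bank X earns at the five ice-cream stores (from the example above)
profits <- c(2, 2, 5, 6, 4)

# Expected profit given one response propensity per store
expected_value <- function(profits, propensities) {
  sum(profits * propensities)
}

# Base case: every store has a 20% propensity to respond
base <- expected_value(profits, rep(0.20, 5))
base  # 3.8, matching the $3.8 computed above

# Exploring a discount: boost one store's propensity, spread an equal
# percentage-point drop across the remaining stores, and (by assumption)
# subtract the discount from that store's profit
with_discount <- function(profits, store, discount, boost) {
  p <- rep(0.20, length(profits))
  p[store] <- p[store] + boost
  p[-store] <- p[-store] - boost / (length(profits) - 1)
  profits[store] <- profits[store] - discount
  expected_value(profits, p)
}
```

You can plug the deals at stores 3, 4 and 5 into `with_discount` to attack the Think Box question below yourself.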

Think Box : Is there any better option available which can give bank a higher profit? I’d be interested to know!

You see, making recommendations isn’t about extracting data, writing code and being done with it. Instead, it requires mathematics (apparently), logical thinking and a flair for a programming language. Trust me, the third one is the easiest of all. Feeling confident? Let’s proceed.

What exactly is the work of a recommendation engine?

The previous example should have given you a fair idea. It’s time to make it crystal clear. Let’s understand what a recommendation engine can do in the context of the previous example (bank X):

It finds out the merchants/items a customer might be interested in after buying something else.

It estimates the profit & loss when many competing items can be recommended to the customer. Based on the profile of the customer, it then recommends a customer-centric or product-centric offering. For a high-value customer, whose wallet share other banks are also keen to win, you might want to bring out the best of your offers.

It can enhance customer engagement by providing offers that are highly appealing to the customer. The customer might have purchased the item anyway, but with an additional offer the bank can win the interest and loyalty of such a customer.

What are the types of Recommender Engines?

There are broadly two types of recommender engines, and the choice between them depends on the industry. We have explained each of these algorithms in our previous articles, but here I try to give a practical explanation to help you understand them easily.

I’ve explained these algorithms in the context of the industries they are used in and what makes them apt for those industries.

Context based algorithms:

As the name suggests, these algorithms rely heavily on deriving the context of the item. Once you have gathered this context-level information on items, you try to find look-alike items and recommend them. For instance, on YouTube you can find the genre, language and cast of a video. Based on this information, we can find look-alikes (related videos). Once we have the look-alikes, we simply recommend these videos to a customer who originally watched only the first video. Such algorithms are very common for online video channels, online song stores etc. The plausible reason is that such context-level information is far easier to get when the product/item can be described with a few dimensions.
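As a minimal sketch of the idea in R — the three videos, their binary context features (genre/language/cast flags) and the cosine-similarity helper are all invented for illustration:

```r
# Toy context matrix: rows = videos, columns = context features (all made up)
items <- matrix(
  c(1, 0, 1, 0,    # video A
    1, 0, 1, 1,    # video B (shares genre & language with A)
    0, 1, 0, 1),   # video C
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("comedy", "drama", "english", "star_x"))
)

# Cosine similarity between two feature vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of every video to video A
sims <- apply(items, 1, cosine, b = items["A", ])
sort(sims[names(sims) != "A"], decreasing = TRUE)  # B ranks above C
```

A customer who watched video A would then be recommended B before C, because B shares more context features with A.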

Collaborative filtering algorithms:

This is one of the most commonly used algorithms because it does not depend on any additional information. All you need is the transaction-level information of the industry. For instance, e-commerce players like Amazon and banks like American Express often use these algorithms to make merchant/product recommendations. Further, there are several types of collaborative filtering algorithms:

User-User Collaborative filtering: Here we find a look-alike customer for every customer and offer products which the first customer’s look-alike has chosen in the past. This algorithm is very effective but takes a lot of time and resources, since it requires computing information for every customer pair. Therefore, for platforms with a large customer base, this algorithm is hard to implement without a very strong parallelizable system.

Item-Item Collaborative filtering: It is quite similar to the previous algorithm, but instead of finding customer look-alikes, we try to find item look-alikes. Once we have the item look-alike matrix, we can easily recommend similar items to a customer who has purchased any item from the store. This algorithm is far less resource-consuming than user-user collaborative filtering. Hence, for a new customer it takes far less time, as we don’t need all the similarity scores between customers. And with a fixed number of products, the product-product look-alike matrix is fixed over time.
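The core of item-item collaborative filtering can be sketched in base R. The tiny purchase matrix below is invented, and cosine similarity is one common choice of similarity measure (a production system would typically use a dedicated package):

```r
# Rows = customers, columns = items; 1 means the customer bought the item
purchases <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    0, 1, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("cust", 1:4), c("item1", "item2", "item3"))
)

# Item-item cosine similarity: compare columns, not rows
norms <- sqrt(colSums(purchases^2))
item_sim <- crossprod(purchases) / outer(norms, norms)
diag(item_sim) <- 0   # an item shouldn't recommend itself

# Recommend the item most similar to what a customer already owns
recommend <- function(customer) {
  owned  <- purchases[customer, ] == 1
  scores <- colSums(item_sim[owned, , drop = FALSE])
  scores[owned] <- -Inf                 # don't re-recommend owned items
  names(which.max(scores))
}
recommend("cust4")   # cust4 owns only item1; item2 is its closest neighbour
```

Note that the expensive part, `item_sim`, depends only on the items, which is why it stays fixed as new customers arrive.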

Other simpler algorithms: There are other approaches, like market basket analysis, which generally have lower predictive power than the algorithms above.

How do we decide the performance metric of such engines?

Good question! We must know that performance metrics are strongly driven by business objectives. Generally, there are three possible metrics which you might want to optimise:

Based on dollar value:

If your overall aim is to increase profit/revenue using the recommendation engine, then your evaluation metric should be the incremental revenue/profit/sales at each recommended rank. The ranks should be in unbroken order, and the expected benefit at each rank should exceed the offer cost.

Based on propensity to respond:

If the aim is just to activate customers, or to make customers explore new items/merchants, this metric might be very helpful. Here you need to track the response rate of customers at each rank.

Based on number of transactions:

Sometimes you are interested in the activity of the customer. Higher activity means a higher number of transactions, so we track the number of transactions for the recommended ranks.

Other metrics:

There are other metrics you might be interested in, like satisfaction rate or number of calls to the call centre. These metrics are rarely used, as they generally give you results only for a sample rather than the entire portfolio.

Building an Item-Item collaborative filtering Recommendation Engine using R

Let’s get some hands-on experience building a recommendation engine. Here, I’ve demonstrated building an item-item collaborative filtering recommendation engine. The data contains just 2 columns, namely individual_merchant and individual_customer. The data is available to download – Download Now.
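The article’s original code walks through the downloaded data; as a hedged stand-in, here is a base-R sketch of the same idea for a two-column transaction table with the columns described above. The sample transactions are invented, and this version substitutes plain matrix algebra for whatever package the original implementation used:

```r
# Toy transaction log with the two columns described above (invented data)
txns <- data.frame(
  individual_customer = c("c1", "c1", "c2", "c2", "c3", "c3", "c4"),
  individual_merchant = c("m1", "m2", "m1", "m3", "m2", "m3", "m1")
)

# Pivot to a customer x merchant 0/1 matrix
mat <- table(txns$individual_customer, txns$individual_merchant)
mat <- as.matrix((mat > 0) * 1L)

# Merchant-merchant cosine similarity
norms <- sqrt(colSums(mat^2))
sim <- crossprod(mat) / outer(norms, norms)
diag(sim) <- 0

# Most similar merchant to one a customer already visits
top_similar <- function(merchant) names(which.max(sim[merchant, ]))
top_similar("m2")
```

Each row of `sim` is the “look-alike” list for one merchant; recommendations for a customer are the top entries for the merchants they already transact with.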


End Notes

Recommendation engines have become extremely common because they solve a business case commonly found across all industries. Substitutes for these recommendation engines are hard to find, because they predict for multiple items/merchants at the same time; classification algorithms struggle with so many classes as the output variable.

In this article, we learnt about the use of recommendation systems in banks. We also looked at implementing a recommendation engine in R. No doubt, they are being used across all sectors of industry, with a common aim to enhance customer experience.

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.



How To Create Polynomial Regression Model In R?

A polynomial regression model is a model in which the dependent variable does not have a linear relationship with the independent variables; rather, they have an nth-degree relationship. For example, a dependent variable y can depend on the square of an independent variable x. There are two ways to create a polynomial regression in R: the first uses the polym function and the second uses the I() function.

Example 1

```r
set.seed(322)
x1 <- rnorm(20, 1, 0.5)
x2 <- rnorm(20, 5, 0.98)
y1 <- rnorm(20, 8, 2.15)
```

Method 1

```r
Model1 <- lm(y1 ~ polym(x1, x2, degree = 2, raw = TRUE))
summary(Model1)
```

Output

```
Call:
lm(formula = y1 ~ polym(x1, x2, degree = 2, raw = TRUE))

Residuals:
    Min      1Q  Median      3Q     Max
-4.2038 -0.7669 -0.2619  1.2505  6.8684

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                                11.2809    17.0298   0.662    0.518
polym(x1, x2, degree = 2, raw = TRUE)1.0   -2.9603     6.5583  -0.451    0.659
polym(x1, x2, degree = 2, raw = TRUE)2.0    1.9913     1.9570   1.017    0.326
polym(x1, x2, degree = 2, raw = TRUE)0.1   -1.3573     6.1738  -0.220    0.829
polym(x1, x2, degree = 2, raw = TRUE)1.1   -0.5574     1.2127  -0.460    0.653
polym(x1, x2, degree = 2, raw = TRUE)0.2    0.2383     0.5876   0.406    0.691

Residual standard error: 2.721 on 14 degrees of freedom
Multiple R-squared: 0.205,  Adjusted R-squared: -0.0789
F-statistic: 0.7221 on 5 and 14 DF,  p-value: 0.6178
```

Method 2

```r
Model_1_M2 <- lm(y1 ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2)
summary(Model_1_M2)
```

Output

```
Call:
lm(formula = y1 ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2)

Residuals:
    Min      1Q  Median      3Q     Max
-4.2038 -0.7669 -0.2619  1.2505  6.8684

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.2809    17.0298   0.662    0.518
x1           -2.9603     6.5583  -0.451    0.659
x2           -1.3573     6.1738  -0.220    0.829
I(x1^2)       1.9913     1.9570   1.017    0.326
I(x2^2)       0.2383     0.5876   0.406    0.691
x1:x2        -0.5574     1.2127  -0.460    0.653

Residual standard error: 2.721 on 14 degrees of freedom
Multiple R-squared: 0.205,  Adjusted R-squared: -0.0789
F-statistic: 0.7221 on 5 and 14 DF,  p-value: 0.6178
```

Example 2

Third-degree polynomial regression model:

```r
Model1_3degree <- lm(y1 ~ polym(x1, x2, degree = 3, raw = TRUE))
summary(Model1_3degree)
```

Output

```
Call:
lm(formula = y1 ~ polym(x1, x2, degree = 3, raw = TRUE))

Residuals:
    Min      1Q  Median      3Q     Max
-4.4845 -0.8435 -0.2514  0.8108  6.7156

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                                63.0178   115.9156   0.544    0.599
polym(x1, x2, degree = 3, raw = TRUE)1.0   33.3374    83.3353   0.400    0.698
polym(x1, x2, degree = 3, raw = TRUE)2.0  -10.2012    42.4193  -0.240    0.815
polym(x1, x2, degree = 3, raw = TRUE)3.0   -1.4147     6.4873  -0.218    0.832
polym(x1, x2, degree = 3, raw = TRUE)0.1  -42.6725    72.9322  -0.585    0.571
polym(x1, x2, degree = 3, raw = TRUE)1.1   -8.9795    22.7650  -0.394    0.702
polym(x1, x2, degree = 3, raw = TRUE)2.1    2.8923     7.6277   0.379    0.712
polym(x1, x2, degree = 3, raw = TRUE)0.2    9.6863    14.2095   0.682    0.511
polym(x1, x2, degree = 3, raw = TRUE)1.2    0.2289     2.6744   0.086    0.933
polym(x1, x2, degree = 3, raw = TRUE)0.3   -0.6544     0.8341  -0.785    0.451

Residual standard error: 3.055 on 10 degrees of freedom
Multiple R-squared: 0.2841,  Adjusted R-squared: -0.3602
F-statistic: 0.441 on 9 and 10 DF,  p-value: 0.8833
```

Example 3

```r
Model1_4degree <- lm(y1 ~ polym(x1, x2, degree = 4, raw = TRUE))
summary(Model1_4degree)
```

Output

```
Call:
lm(formula = y1 ~ polym(x1, x2, degree = 4, raw = TRUE))

Residuals:
       1        2        3        4        5        6        7        8
 4.59666 -0.41485 -0.62921 -0.62414 -0.49045  2.15614 -0.42311 -0.12903
       9       10       11       12       13       14       15       16
 2.27613  0.60005 -1.94649  1.79153  0.01765  0.03866 -1.40706  0.85596
      17       18       19       20
 0.51553 -3.71274  0.05606 -3.12731

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                               -1114.793   2124.374  -0.525    0.622
polym(x1, x2, degree = 4, raw = TRUE)1.0   -263.858   2131.701  -0.124    0.906
polym(x1, x2, degree = 4, raw = TRUE)2.0   -267.502   1250.139  -0.214    0.839
polym(x1, x2, degree = 4, raw = TRUE)3.0    317.739    433.932   0.732    0.497
polym(x1, x2, degree = 4, raw = TRUE)4.0     -6.803     40.546  -0.168    0.873
polym(x1, x2, degree = 4, raw = TRUE)0.1    967.989   2009.940   0.482    0.650
polym(x1, x2, degree = 4, raw = TRUE)1.1    256.227    869.447   0.295    0.780
polym(x1, x2, degree = 4, raw = TRUE)2.1   -125.888    473.845  -0.266    0.801
polym(x1, x2, degree = 4, raw = TRUE)3.1    -59.450     70.623  -0.842    0.438
polym(x1, x2, degree = 4, raw = TRUE)0.2   -314.183    674.159  -0.466    0.661
polym(x1, x2, degree = 4, raw = TRUE)1.2    -18.033    112.576  -0.160    0.879
polym(x1, x2, degree = 4, raw = TRUE)2.2     34.781     57.232   0.608    0.570
polym(x1, x2, degree = 4, raw = TRUE)0.3     41.854     91.862   0.456    0.668
polym(x1, x2, degree = 4, raw = TRUE)1.3     -4.360      9.895  -0.441    0.678
polym(x1, x2, degree = 4, raw = TRUE)0.4     -1.763      4.178  -0.422    0.690

Residual standard error: 3.64 on 5 degrees of freedom
Multiple R-squared: 0.4917,  Adjusted R-squared: -0.9315
F-statistic: 0.3455 on 14 and 5 DF,  p-value: 0.9466
```

How To Split Comma Separated Values In An R Vector?

The splitting of comma-separated values in an R vector can be done by unlisting the elements of the vector after using the strsplit function for splitting. For example, if we have a vector, say x, that contains comma-separated values, then those values can be split with the command unlist(strsplit(x, ",")).



```r
x1 <- sample(c("a,b,c,d,e,f,g,h"), 30, replace = TRUE)
x1
```

Output

```
 [1] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
 [5] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
 [9] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
[13] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
[17] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
[21] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
[25] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
[29] "a,b,c,d,e,f,g,h" "a,b,c,d,e,f,g,h"
```


```r
unlist(strsplit(x1, ","))
```

Output

```
  [1] "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b"
 [19] "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d"
 [37] "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f"
 [55] "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h"
 [73] "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b"
 [91] "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d"
[109] "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f"
[127] "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h"
[145] "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b"
[163] "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d"
[181] "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f"
[199] "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h"
[217] "a" "b" "c" "d" "e" "f" "g" "h" "a" "b" "c" "d" "e" "f" "g" "h" "a" "b"
[235] "c" "d" "e" "f" "g" "h"
```

Example


```r
x2 <- sample(c("india,russia,canada,china,united kingdom,egypt"), 20, replace = TRUE)
x2
```

Output

```
 [1] "india,russia,canada,china,united kingdom,egypt"
 [2] "india,russia,canada,china,united kingdom,egypt"
 [3] "india,russia,canada,china,united kingdom,egypt"
 [4] "india,russia,canada,china,united kingdom,egypt"
 [5] "india,russia,canada,china,united kingdom,egypt"
 [6] "india,russia,canada,china,united kingdom,egypt"
 [7] "india,russia,canada,china,united kingdom,egypt"
 [8] "india,russia,canada,china,united kingdom,egypt"
 [9] "india,russia,canada,china,united kingdom,egypt"
[10] "india,russia,canada,china,united kingdom,egypt"
[11] "india,russia,canada,china,united kingdom,egypt"
[12] "india,russia,canada,china,united kingdom,egypt"
[13] "india,russia,canada,china,united kingdom,egypt"
[14] "india,russia,canada,china,united kingdom,egypt"
[15] "india,russia,canada,china,united kingdom,egypt"
[16] "india,russia,canada,china,united kingdom,egypt"
[17] "india,russia,canada,china,united kingdom,egypt"
[18] "india,russia,canada,china,united kingdom,egypt"
[19] "india,russia,canada,china,united kingdom,egypt"
[20] "india,russia,canada,china,united kingdom,egypt"
```


```r
unlist(strsplit(x2, ","))
```

Output

```
  [1] "india"          "russia"         "canada"         "china"
  [5] "united kingdom" "egypt"          "india"          "russia"
  [9] "canada"         "china"          "united kingdom" "egypt"
 [13] "india"          "russia"         "canada"         "china"
 [17] "united kingdom" "egypt"          "india"          "russia"
 [21] "canada"         "china"          "united kingdom" "egypt"
 [25] "india"          "russia"         "canada"         "china"
 [29] "united kingdom" "egypt"          "india"          "russia"
 [33] "canada"         "china"          "united kingdom" "egypt"
 [37] "india"          "russia"         "canada"         "china"
 [41] "united kingdom" "egypt"          "india"          "russia"
 [45] "canada"         "china"          "united kingdom" "egypt"
 [49] "india"          "russia"         "canada"         "china"
 [53] "united kingdom" "egypt"          "india"          "russia"
 [57] "canada"         "china"          "united kingdom" "egypt"
 [61] "india"          "russia"         "canada"         "china"
 [65] "united kingdom" "egypt"          "india"          "russia"
 [69] "canada"         "china"          "united kingdom" "egypt"
 [73] "india"          "russia"         "canada"         "china"
 [77] "united kingdom" "egypt"          "india"          "russia"
 [81] "canada"         "china"          "united kingdom" "egypt"
 [85] "india"          "russia"         "canada"         "china"
 [89] "united kingdom" "egypt"          "india"          "russia"
 [93] "canada"         "china"          "united kingdom" "egypt"
 [97] "india"          "russia"         "canada"         "china"
[101] "united kingdom" "egypt"          "india"          "russia"
[105] "canada"         "china"          "united kingdom" "egypt"
[109] "india"          "russia"         "canada"         "china"
[113] "united kingdom" "egypt"          "india"          "russia"
[117] "canada"         "china"          "united kingdom" "egypt"
```

Example


```r
x3 <- sample(c("male,female"), 20, replace = TRUE)
x3
```

Output

```
 [1] "male,female" "male,female" "male,female" "male,female" "male,female"
 [6] "male,female" "male,female" "male,female" "male,female" "male,female"
[11] "male,female" "male,female" "male,female" "male,female" "male,female"
[16] "male,female" "male,female" "male,female" "male,female" "male,female"
```


```r
unlist(strsplit(x3, ","))
```

Output

```
 [1] "male"   "female" "male"   "female" "male"   "female" "male"   "female"
 [9] "male"   "female" "male"   "female" "male"   "female" "male"   "female"
[17] "male"   "female" "male"   "female" "male"   "female" "male"   "female"
[25] "male"   "female" "male"   "female" "male"   "female" "male"   "female"
[33] "male"   "female" "male"   "female" "male"   "female" "male"   "female"
```

Practicing Machine Learning Techniques In R With The MLR Package


In R, we often use multiple packages for doing various machine learning tasks. For example: we impute missing values using one package, then build a model with another, and finally evaluate performance using a third package.

The problem is, every package has its own set of specific parameters. While working with many packages, we end up spending a lot of time figuring out which parameters are important. Don’t you think?

To solve this problem, I researched and came across an R package named MLR, which is absolutely incredible at performing machine learning tasks. This package includes all of the ML algorithms which we use frequently.

In this tutorial, I’ve taken up a classification problem and tried improving its accuracy using machine learning. I haven’t explained the ML algorithms theoretically; the focus is kept on their implementation. By the end of this article, you are expected to become proficient at implementing several ML algorithms in R. But only if you practice alongside.

Note: This article is meant only for beginners and early starters with Machine Learning in R. Basic statistics knowledge is required.

Table of Contents

Getting Data

Exploring Data

Missing Value Imputation

Feature Engineering

Outlier Removal by Capping

New Features

Machine Learning

Feature Importance


Logistic Regression

Cross Validation

Decision Tree

Cross Validation

Parameter Tuning using Grid Search

Random Forest


GBM (Gradient Boosting)

Cross Validation

Parameter Tuning using Random Search (Faster)

XGBoost (Extreme Gradient Boosting)

Feature Selection

Machine Learning with MLR Package

Until now, R didn’t have any package/library similar to Scikit-Learn from Python, wherein you could get all the functions required to do machine learning. But since February 2024, R users have had the mlr package, with which they can perform most of their ML tasks.

Let’s now understand the basic concept of how this package works. If you get it right here, understanding the whole package would be a mere cakewalk.

The entire structure of this package relies on this premise:

Create a Task. Make a Learner. Train Them.

Creating a task means loading data into the package. Making a learner means choosing an algorithm (learner) which learns from the task (or data). Finally, train it.
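A minimal sketch of this premise, assuming the mlr package is installed and reusing this tutorial's loan data (the `cd_train` data frame and `Loan_Status` target appear later in the tutorial):

```r
library(mlr)

# 1. Create a task: load the data into the package
trainTask <- makeClassifTask(data = cd_train, target = "Loan_Status")

# 2. Make a learner: choose an algorithm that will learn from the task
logistic.learner <- makeLearner("classif.logreg", predict.type = "response")

# 3. Train it
fmodel <- train(logistic.learner, trainTask)
```

Every model in the rest of this tutorial follows this same task / learner / train pattern; only the learner string changes.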

The MLR package has several algorithms in its bouquet. These algorithms have been categorized into regression, classification, clustering, survival, multilabel and cost-sensitive classification. Let’s look at some of the available algorithms for classification problems:

22 classif.xgboost                 xgboost

And, there are many more. Let’s start working now!

1. Getting Data

For this tutorial, I’ve taken up one of the popular ML problems from DataHack (a one-time login will be required to get the data): Download Data.

After you’ve downloaded the data, let’s quickly get done with initial commands such as setting the working directory and loading data.

2. Exploring Data

Once the data is loaded, you can access it using:

Loan_Status      factor   0      NA         0.3127036     NA        NA    192    422    2

This function gives a much more comprehensive view of the data set than the base str() function. Shown above are the last rows of the result. Similarly, you can do this for the test data too:

From these outputs, we can make the following inferences:

In the data, we have 12 variables, out of which Loan_Status is the dependent variable and rest are independent variables.

Train data has 614 observations. Test data has 367 observations.

In train and test data, 6 variables have missing values (can be seen in na column).

ApplicantIncome and CoapplicantIncome are highly skewed variables. How do we know that? Look at their min, max and median values. We’ll have to normalize these variables.

LoanAmount, ApplicantIncome and CoapplicantIncome have outlier values, which should be treated.

Credit_History is an integer-type variable. But, being binary in nature, we should convert it to a factor.

Also, you can check the presence of skewness in variables mentioned above using a simple histogram.

As you can see in the charts above, skewness is nothing but a concentration of the majority of the data on one side of the chart. What we see is a right-skewed graph. To visualize outliers, we can use a boxplot:

Similarly, you can create a boxplot for CoapplicantIncome and LoanAmount as well.

Let’s change the class of Credit_History to factor. Remember, the class factor is always used for categorical variables.

To check the changes, you can do:

[1] "factor"

You can further scrutinize the data using:

We find that the variable Dependents has a level 3+, which should be treated too. It’s quite simple to rename the levels of a factor variable. It can be done as:

3. Missing Value Imputation

Not just beginners, even good R analysts struggle with missing value imputation. The MLR package offers a nice and convenient way to impute missing values using multiple methods. After we are done with the much-needed modifications to the data, let’s impute the missing values.

In our case, we’ll use basic mean and mode imputation to impute data. You can also use any ML algorithm to impute these values, but that comes at the cost of computation.

This function is convenient because you don’t have to specify each variable name to impute. It selects variables on the basis of their classes. It also creates new dummy variables for missing values. Sometimes, these (dummy) features contain a trend which can be captured using this function. dummy.classes says for which classes a dummy variable should be created, and dummy.type says what the class of the new dummy variables should be.

The $data attribute of the imp object contains the imputed data.
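A sketch of such an imputation call, assuming mlr is loaded and `train` is the loan data frame from above (mean imputation for integer columns, mode imputation for factors, with numeric dummy flags for missingness):

```r
# Impute by column class, and flag where values were missing
imp <- impute(
  train,
  classes = list(factor = imputeMode(), integer = imputeMean()),
  dummy.classes = c("integer", "factor"),
  dummy.type = "numeric"
)
imp_train <- imp$data   # the imputed data lives in $data
```

The same call is repeated on the test data so that both sets get identical treatment.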

Now, we have the complete data. You can check the new variables using:

Did you notice a disparity between the two data sets? No? Look again. The answer is that the Married.dummy variable exists only in imp_train and not in imp_test. Therefore, we’ll have to remove it before the modeling stage.

Optional: You might be excited or curious to try imputing missing values using an ML algorithm. In fact, there are some algorithms which don’t require you to impute missing values: you can simply supply them the missing data and they take care of it on their own. Let’s see which algorithms these are:

8 classif.rpart                  rpart


4. Feature Engineering

Feature Engineering is the most interesting part of predictive modeling. Feature engineering has two aspects: feature transformation and feature creation. We’ll try to work on both aspects here.

At first, let’s remove outliers from variables like ApplicantIncome, CoapplicantIncome and LoanAmount. There are many techniques to remove outliers; here, we’ll cap all the large values in these variables at a threshold value, as shown below:

I’ve chosen the threshold values at my discretion, after analyzing the variable distributions. To check the effect, you can run summary(cd_train$ApplicantIncome) and see that the maximum value is capped at 33000.

In both data sets, we see that all dummy variables are numeric in nature. Being binary in form, they should be categorical. Let’s convert their classes to factor. This time, we’ll use simple for and if loops.



These loops say: ‘for every column from number 14 to 20 of the cd_train / cd_test data frame, if the class of the variable is numeric, take the unique values of that column as levels and convert it into a factor (categorical) variable’.
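The loop described above can be sketched as follows. A small stand-in data frame with two numeric dummy columns is used here for illustration; in the tutorial the loop runs over columns 14 to 20 of cd_train and cd_test:

```r
# Stand-in for the dummy columns of cd_train (invented data)
cd_demo <- data.frame(a.dummy = c(0, 1, 1, 0), b.dummy = c(1, 1, 0, 0))

# For every numeric column, use its unique values as levels of a factor
for (f in names(cd_demo)) {
  if (class(cd_demo[[f]]) == "numeric") {
    levels <- unique(cd_demo[[f]])
    cd_demo[[f]] <- factor(cd_demo[[f]], levels = levels)
  }
}
class(cd_demo$a.dummy)  # now "factor"
```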

Let’s create some new features now.

While creating new features (if they are numeric), we must check their correlation with existing variables, as there is often a high chance of correlation. Let’s see if our new variable too happens to be correlated:

As we can see, Total_Income has a very high correlation with ApplicantIncome. It means that the new variable isn’t providing any new information, and thus isn’t helpful for modeling.

Now we can remove the variable.

There is still enough potential left to create new variables. Before proceeding, I want you to think more deeply about this problem and try creating new variables. After making so many modifications to the data, let’s check the data again:

5. Machine Learning

Until here, we’ve performed all the important transformation steps except normalizing the skewed variables. That will be done after we create the task.

As explained at the beginning, for mlr a task is nothing but the data set on which a learner learns. Since it’s a classification problem, we’ll create a classification task. The task type solely depends on the type of problem at hand.

Let’s check trainTask

Positive class: N

As you can see, it provides a description of the cd_train data. However, an evident problem is that it considers the positive class to be N, whereas it should be Y. Let’s modify it:

For a deeper view, you can check your task data using str(getTaskData(trainTask)).

Now, we will normalize the data. For this step, we’ll use the normalizeFeatures function from the mlr package. By default, this function normalizes all the numeric features in the data. Thankfully, only the 3 variables we have to normalize are numeric; the rest have classes other than numeric.

Before we start applying algorithms, we should remove the variables which are not required.

The MLR package has an inbuilt function which returns the important variables from the data. Let’s see which variables are important. Later, we can use this knowledge to subset the input predictors for model improvement. While running this code, R might prompt you to install the ‘FSelector’ package, which you should do.

If you are still wondering about information.gain, let me give a simple explanation. Information gain is generally used in the context of decision trees. Every node split in a decision tree is based on information gain. In general, it tries to find the variables which carry the maximum information, using which the target class is easier to predict.
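Concretely, information gain is the drop in entropy of the target after splitting on a variable. A quick base-R illustration (the toy target and feature vectors are invented):

```r
# Shannon entropy of a discrete vector
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Information gain of feature f with respect to target y:
# entropy(y) minus the weighted entropy of y within each level of f
info_gain <- function(y, f) {
  w <- table(f) / length(f)
  cond <- sapply(split(y, f), entropy)
  entropy(y) - sum(w * cond)
}

y <- c("Y", "Y", "N", "N")   # toy target
f <- c("a", "a", "b", "b")   # perfectly informative feature
info_gain(y, f)              # 1 bit: knowing f fully determines y
```

Variables with higher information gain make the target class easier to predict, which is exactly how the filter ranks them.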

Let’s start modeling now. I won’t explain these algorithms in detail, but I’ve provided links to helpful resources. We’ll take up the simpler algorithms first and end this tutorial with the more complex ones.

With MLR, we can choose & set algorithms using makeLearner. This learner will train on trainTask and try to make predictions on testTask.

1. Quadratic Discriminant Analysis (QDA).

In general, QDA is a parametric algorithm. Parametric means that it makes certain assumptions about the data. If the data actually follows those assumptions, such algorithms sometimes outperform several non-parametric algorithms. Read More.

Upload this submission file and check your leaderboard rank (it won’t be good). Our accuracy is ~71.5%. I understand this submission might not put you near the top of the leaderboard, but there’s a long way to go. So, let’s proceed.

2. Logistic Regression

This time, let’s also check the cross-validation accuracy. Higher CV accuracy indicates that our model does not suffer from high variance and generalizes well on unseen data.

Similarly, you can perform CV for any learner. Isn’t it incredibly easy? I’ve used stratified sampling with 3-fold CV. I’d always recommend using stratified sampling in classification problems, since it maintains the proportion of the target class across the n folds. We can check the CV accuracy with:

This is the average accuracy calculated over the 3 folds. To see the accuracy of each fold, we can do this:

3  3    0.7598039

Now, we’ll train the model and check the prediction accuracy on test data.

Woah! This algorithm gave us a significant boost in accuracy. Moreover, this is a stable model, since our CV score and leaderboard score match closely. This submission gives an accuracy of 79.16%. Good, we are improving. Let’s move on to the next algorithm.

3. Decision Tree

A decision tree is said to capture non-linear relations better than a logistic regression model. Let’s see if we can improve our model further. This time we’ll tune the tree hyperparameters to achieve optimal results. To get the list of parameters for any algorithm, simply write (in this case, for rpart):

This will return a long list of tunable and non-tunable parameters. Let’s build a decision tree now. Make sure you have installed the rpart package before creating the tree learner:
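A sketch of the tree learner and the CV plan (assuming trainTask as before; names are illustrative):

```r
library(mlr)
tree.learner <- makeLearner("classif.rpart", predict.type = "response")

# Stratified 3-fold CV, reused later during parameter tuning
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
```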

I’m doing a 3-fold CV because we have relatively little data. Now, let’s set the tunable parameters:
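A sketch of the parameter set and the grid search itself; the bounds below are illustrative, not the article’s exact values:

```r
library(mlr)
# Search space for the three rpart parameters discussed below
gs <- makeParamSet(
  makeIntegerParam("minsplit",  lower = 10L, upper = 50L),
  makeIntegerParam("minbucket", lower = 5L,  upper = 50L),
  makeNumericParam("cp",        lower = 0.001, upper = 0.2)
)

# Grid search over the space, scored by CV accuracy
gscontrol <- makeTuneControlGrid()
tree.learner <- makeLearner("classif.rpart", predict.type = "response")
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)

stune <- tuneParams(learner = tree.learner, task = trainTask,
                    resampling = set_cv, measures = acc,
                    par.set = gs, control = gscontrol)
stune$x   # best parameter values found
```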


As you can see, I’ve set 3 parameters. minsplit represents the minimum number of observations a node must have for a split to take place. minbucket sets the minimum number of observations to keep in terminal nodes. cp is the complexity parameter: the smaller it is, the more specific the relations the tree will learn from the data, which might result in overfitting.

You may go and take a walk until the parameter tuning completes. Maybe go catch some Pokémon! It took 15 minutes to run on my machine (8 GB RAM, Intel i5, Windows).

# [1] 0.001

It returns a list of best parameters. You can check the CV accuracy with:


Using setHyperPars function, we can directly set the best parameters as modeling parameters in the algorithm.
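A sketch, assuming stune holds the tuning result from the previous step:

```r
# Plug the tuned parameters into the learner, then train and predict
tree.learner <- makeLearner("classif.rpart", predict.type = "response")
best.tree    <- setHyperPars(tree.learner, par.vals = stune$x)

tree.model   <- train(best.tree, trainTask)
tree.predict <- predict(tree.model, testTask)
```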


Decision Tree is doing no better than logistic regression. This algorithm has returned practically the same accuracy, 79.14%, as logistic regression. So, one tree isn’t enough. Let’s build a forest now.

4. Random Forest

Random Forest is a powerful algorithm known to produce astonishing results. Its predictions derive from an ensemble of trees: it averages the predictions given by the trees and produces a generalized result. From here, most of the steps are similar to those followed above, but this time I’ve done random search instead of grid search for parameter tuning, because it’s faster.
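A sketch of the forest learner with random search; the parameter bounds and iteration count are illustrative:

```r
library(mlr)
rf.learner <- makeLearner("classif.randomForest", predict.type = "response")

rf_param <- makeParamSet(
  makeIntegerParam("ntree",    lower = 50L, upper = 500L),
  makeIntegerParam("mtry",     lower = 3L,  upper = 10L),
  makeIntegerParam("nodesize", lower = 10L, upper = 50L)
)

# Random search: evaluate a fixed number of random parameter combinations
rancontrol <- makeTuneControlRandom(maxit = 50L)
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)

rf_tune <- tuneParams(learner = rf.learner, task = trainTask,
                      resampling = set_cv, measures = acc,
                      par.set = rf_param, control = rancontrol)
```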



Though random search is faster than grid search, it sometimes turns out to be less efficient. In grid search, the algorithm tunes over every possible combination of the parameters provided. In random search, we specify the number of iterations, and it randomly passes over parameter combinations. In this process, it might miss some important combination of parameters which could have returned maximum accuracy, who knows.

Now, we have the final parameters. Let’s check the list of parameters and CV accuracy.


[1] 168

[1] 6

[1] 29

Let’s build the random forest model now and check its accuracy.
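A sketch, assuming rf_tune holds the random search result:

```r
rf.learner <- makeLearner("classif.randomForest", predict.type = "response")
rf.final   <- setHyperPars(rf.learner, par.vals = rf_tune$x)

rf.model   <- train(rf.final, trainTask)
rf.predict <- predict(rf.model, testTask)
```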

5. Support Vector Machines (SVM)

Support Vector Machines (SVM) is also a supervised learning algorithm used for regression and classification problems. In general, it creates a hyperplane in n-dimensional space to classify the data based on the target class. Let’s step away from tree algorithms for a while and see if this algorithm can bring us some improvement.

Since most of the steps are similar to those performed above, understanding these code snippets shouldn’t be a challenge for you anymore.
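A sketch of the SVM step via kernlab’s ksvm learner in mlr; the C and sigma grids are illustrative assumptions:

```r
library(mlr)
svm.learner <- makeLearner("classif.ksvm", predict.type = "response")

# Illustrative grids for cost C and kernel width sigma
svm_param <- makeParamSet(
  makeDiscreteParam("C",     values = 2^c(-8, -4, 0, 4, 8)),
  makeDiscreteParam("sigma", values = 2^c(-8, -4, 0, 4))
)

ctrl   <- makeTuneControlGrid()
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
svm_tune <- tuneParams(svm.learner, task = trainTask, resampling = set_cv,
                       measures = acc, par.set = svm_param, control = ctrl)

svm.final   <- setHyperPars(svm.learner, par.vals = svm_tune$x)
svm.model   <- train(svm.final, trainTask)
svm.predict <- predict(svm.model, testTask)
```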



This model returns an accuracy of 77.08%. Not bad, but lower than our highest score. Don’t feel hopeless here. This is core machine learning: ML doesn’t work unless it gets some good variables. Maybe you should think longer about the feature engineering aspect and create more useful variables. Let’s do boosting now.

6. GBM

Now you are entering the territory of boosting algorithms. GBM performs sequential modeling, i.e., after one round of prediction, it checks for incorrect predictions, assigns them relatively more weight, and predicts them again until they are predicted correctly.
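A sketch of the GBM learner with a small random search; the parameter bounds are illustrative, not the article’s exact values:

```r
library(mlr)
gbm.learner <- makeLearner("classif.gbm", predict.type = "response")

# Search space for the key gbm parameters (bounds are illustrative)
gbm_param <- makeParamSet(
  makeIntegerParam("n.trees",           lower = 100L, upper = 1000L),
  makeIntegerParam("interaction.depth", lower = 2L,   upper = 10L),
  makeIntegerParam("n.minobsinnode",    lower = 10L,  upper = 80L),
  makeNumericParam("shrinkage",         lower = 0.01, upper = 1)
)

rancontrol <- makeTuneControlRandom(maxit = 50L)
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
gbm_tune <- tuneParams(gbm.learner, task = trainTask, resampling = set_cv,
                       measures = acc, par.set = gbm_param, control = rancontrol)
```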


n.minobsinnode refers to the minimum number of observations in a tree node. shrinkage is the regularization parameter, which dictates how fast or slow the algorithm learns.

The accuracy of this model is 78.47%. GBM performed better than SVM but couldn’t exceed random forest’s accuracy. Finally, let’s test XGBoost as well.

7. XGBoost

XGBoost is considered better than GBM because of its built-in properties, including first- and second-order gradients, parallel processing, and the ability to prune trees. A general implementation of xgboost requires you to convert the data into a matrix. With mlr, that is not required.

As I said in the beginning, a benefit of using this (MLR) package is that you can follow same set of commands for implementing different algorithms.
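A sketch of the xgboost step through the same mlr interface; it assumes a binary target, and the fixed values and search bounds are illustrative:

```r
library(mlr)
xgb.learner <- makeLearner("classif.xgboost", predict.type = "response")
xgb.learner$par.vals <- list(objective = "binary:logistic",
                             eval_metric = "error", nrounds = 250)

# Search space (bounds are illustrative)
xgb_param <- makeParamSet(
  makeIntegerParam("max_depth",        lower = 3L,   upper = 20L),
  makeNumericParam("eta",              lower = 0.01, upper = 0.5),
  makeNumericParam("subsample",        lower = 0.5,  upper = 1),
  makeNumericParam("min_child_weight", lower = 1,    upper = 5)
)

rancontrol <- makeTuneControlRandom(maxit = 50L)
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
xgb_tune <- tuneParams(xgb.learner, task = trainTask, resampling = set_cv,
                       measures = acc, par.set = xgb_param, control = rancontrol)
```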



Terrible, XGBoost. This model returns an accuracy of 68.5%, even lower than QDA. What could have happened? Overfitting. The model returned a CV accuracy of ~80%, but the leaderboard score declined drastically because the model couldn’t predict correctly on unseen data.

What can you do next? Feature Selection?

For improvement, let’s do this. Until now, we’ve used trainTask for model building. Let’s use our knowledge of the important variables: take the first 6 important variables and train the models on them. You can expect some improvement. To create a task selecting the important variables, do this:
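A sketch of one way to do this in mlr (the filter method is an illustrative choice and requires the FSelector package):

```r
library(mlr)
# Rank features by information gain, then keep only the top 6 in a new task
im_feat  <- generateFilterValuesData(trainTask, method = "information.gain")
top_task <- filterFeatures(trainTask, method = "information.gain", abs = 6L)
```

You can then retrain any of the learners above on top_task instead of trainTask.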

Also, try to create more features. The current leaderboard winner is at ~81% accuracy. If you have followed me till here, don’t give up now.

End Notes

The motive of this article was to get you started with machine learning techniques. These techniques are commonly used in industry today, so make sure you understand them well. Don’t use these algorithms as black-box approaches; I’ve provided links to resources to help you learn the theory behind them.

What happened above, happens a lot in real life. You’d try many algorithms but wouldn’t get improvement in accuracy. But, you shouldn’t give up. Being a beginner, you should try exploring other ways to achieve accuracy. Remember, no matter how many wrong attempts you make, you just have to be right once.

You might have to install packages while loading these models, but that’s one time only. If you followed this article completely, you are ready to build models. All you have to do is, learn the theory behind them.



Dplyr Tutorial: Merge And Join Data In R With Examples

Introduction to Data Analysis

Data analysis can be divided into three parts:

Extraction: First, we need to collect the data from many sources and combine them.

Transform: This step involves the data manipulation. Once we have consolidated all the sources of data, we can begin to clean the data.

Visualize: The last step is to visualize our data to check for irregularities.

Data Analysis Process

One of the most significant challenges faced by data scientists is data manipulation. Data is never available in the desired format. Data scientists need to spend at least half of their time cleaning and manipulating the data; it is one of the most critical assignments in the job. If the data manipulation process is not complete, precise, and rigorous, the model will not perform correctly.

In this tutorial, you will learn:

R Dplyr

R has a library called dplyr to help with data transformation. The dplyr library is fundamentally built around five verbs to manipulate the data, while the companion tidyr library provides four functions to clean the data. After that, we can use the ggplot library to analyze and visualize the data.

We will learn how to use the dplyr library to manipulate a Data Frame.

Merge Data with R Dplyr

dplyr provides a nice and convenient way to combine datasets. We may have many sources of input data, and at some point, we need to combine them. A join with dplyr adds variables to the right of the original dataset.

Dplyr Joins

Following are four important types of joins used in dplyr to merge two datasets:

left_join(): Merge two datasets, keeping all observations from the origin table. Arguments: data, origin, destination, by = "ID".

right_join(): Merge two datasets, keeping all observations from the destination table. Arguments: data, origin, destination, by = "ID".

inner_join(): Merge two datasets, excluding all unmatched rows. Arguments: data, origin, destination, by = "ID".

full_join(): Merge two datasets, keeping all observations. Arguments: data, origin, destination, by = "ID".

For multiple keys, pass by = c("ID", "ID2") in any of the four functions.

We will study all the joins types via an easy example.

First of all, we build two datasets. Table 1 contains two variables, ID and y, whereas Table 2 gathers ID and z. In each situation, we need a key-pair variable; in our case, ID is our key variable. The function will look for identical values in both tables and bind the returned values to the right of table 1.

library(dplyr)
df_primary <- tribble(
  ~ID, ~y,
  "A", 5,
  "B", 5,
  "C", 8,
  "D", 0,
  "F", 9)
df_secondary <- tribble(
  ~ID, ~z,
  "A", 30,
  "B", 21,
  "C", 22,
  "D", 25,
  "E", 29)

Dplyr left_join()

The most common way to merge two datasets is to use the left_join() function. We can see from the picture below that the key-pair matches rows A, B, C, and D from both datasets perfectly. However, E and F are left over. How do we treat these two observations? With left_join(), we keep all the rows from the origin table and drop the rows of the destination table that have no key match. In our example, the ID E does not exist in table 1, so that row will be dropped. The ID F comes from the origin table; it will be kept after the left_join() and returns NA in the column z. The figure below reproduces what will happen with a left_join().

Example of dplyr left_join()

left_join(df_primary, df_secondary, by ='ID')


## # A tibble: 5 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
## 5 F         9    NA

Dplyr right_join()

The right_join() function works exactly like left_join(). The only difference is the rows dropped. The ID E, which is available in the destination data frame, exists in the new table and takes the value NA for the column y.

Example of dplyr right_join()

right_join(df_primary, df_secondary, by = 'ID')


## # A tibble: 5 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
## 5 E        NA    29

Dplyr inner_join()

When we are not sure that the two datasets fully match, one option is to return only the rows existing in both datasets. This is useful when we need a clean dataset and don’t want to impute missing values with the mean or median.

The inner_join() function comes to help: it excludes the unmatched rows.

Example of dplyr inner_join()

inner_join(df_primary, df_secondary, by ='ID')


## # A tibble: 4 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25

Dplyr full_join()

Finally, the full_join() function keeps all observations and replaces missing values with NA.

Example of dplyr full_join()

full_join(df_primary, df_secondary, by = 'ID')


## # A tibble: 6 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
## 5 F         9    NA
## 6 E        NA    29

Multiple Key pairs

Last but not least, we can have multiple keys in our datasets. Consider the following dataset, where we have years and a list of products bought by the customer.

If we try to merge both tables, R throws an error. To remedy the situation, we can pass two key-pair variables: ID and year, which appear in both datasets. We can use the following code to merge table 1 and table 2:

df_primary <- tribble(
  ~ID, ~year, ~items,
  "A", 2024, 3,
  "A", 2024, 7,
  "A", 2023, 6,
  "B", 2024, 4,
  "B", 2024, 8,
  "B", 2023, 7,
  "C", 2024, 4,
  "C", 2024, 6,
  "C", 2023, 6)
df_secondary <- tribble(
  ~ID, ~year, ~prices,
  "A", 2024, 9,
  "A", 2024, 8,
  "A", 2023, 12,
  "B", 2024, 13,
  "B", 2024, 14,
  "B", 2023, 6,
  "C", 2024, 15,
  "C", 2024, 15,
  "C", 2023, 13)
left_join(df_primary, df_secondary, by = c('ID', 'year'))


## # A tibble: 9 x 4
##   ID    year items prices
## 1 A     2024     3      9
## 2 A     2024     7      8
## 3 A     2023     6     12
## 4 B     2024     4     13
## 5 B     2024     8     14
## 6 B     2023     7      6
## 7 C     2024     4     15
## 8 C     2024     6     15
## 9 C     2023     6     13

Data Cleaning Functions in R

Following are the four important functions to tidy (clean) the data:

gather(): Transform the data from wide to long. Arguments: (data, key, value, na.rm = FALSE).

spread(): Transform the data from long to wide. Arguments: (data, key, value).

separate(): Split one variable into two. Arguments: (data, col, into, sep = "", remove = TRUE).

unite(): Unite two variables into one. Arguments: (data, col, conc, sep = "", remove = TRUE).

If not installed already, enter the following command to install tidyr:

install.packages("tidyr")

The objective of the gather() function is to transform the data from wide to long.


gather(data, key, value, na.rm = FALSE)

Arguments:
-data: The data frame used to reshape the dataset
-key: Name of the new column created
-value: Select the columns used to fill the key column
-na.rm: Remove missing values. FALSE by default


Below, we can visualize the concept of reshaping wide to long. We want to create a single column named growth, filled by the rows of the quarter variables.

library(tidyr)
# Create a messy dataset
messy <- data.frame(
  country = c("A", "B", "C"),
  q1_2024 = c(0.03, 0.05, 0.01),
  q2_2024 = c(0.05, 0.07, 0.02),
  q3_2024 = c(0.04, 0.05, 0.01),
  q4_2024 = c(0.03, 0.02, 0.04))
messy


##   country q1_2024 q2_2024 q3_2024 q4_2024
## 1       A    0.03    0.05    0.04    0.03
## 2       B    0.05    0.07    0.05    0.02
## 3       C    0.01    0.02    0.01    0.04

# Reshape the data
tidier <- gather(messy, quarter, growth, q1_2024:q4_2024)
tidier


##    country quarter growth
## 1        A q1_2024   0.03
## 2        B q1_2024   0.05
## 3        C q1_2024   0.01
## 4        A q2_2024   0.05
## 5        B q2_2024   0.07
## 6        C q2_2024   0.02
## 7        A q3_2024   0.04
## 8        B q3_2024   0.05
## 9        C q3_2024   0.01
## 10       A q4_2024   0.03
## 11       B q4_2024   0.02
## 12       C q4_2024   0.04

In the gather() function, we create two new variables, quarter and growth, because our original dataset has one group variable (country) and key-value pairs spread across the quarter columns.

The spread() function does the opposite of gather.


spread(data, key, value)

Arguments:
-data: The data frame used to reshape the dataset
-key: Column to reshape from long to wide
-value: Rows used to fill the new column


We can reshape the tidier dataset back to messy with spread()

# Reshape the data
messy_1 <- spread(tidier, quarter, growth)
messy_1


##   country q1_2024 q2_2024 q3_2024 q4_2024
## 1       A    0.03    0.05    0.04    0.03
## 2       B    0.05    0.07    0.05    0.02
## 3       C    0.01    0.02    0.01    0.04

The separate() function splits a column into two according to a separator. It is helpful in situations where the variable is a date: our analysis may require focusing on month and year, so we want to split the column into two new variables.


separate(data, col, into, sep = "", remove = TRUE)

Arguments:
-data: The data frame used to reshape the dataset
-col: The column to split
-into: The names of the new variables
-sep: Indicates the symbol that separates the variable, i.e.: "-", "_", "&"
-remove: Remove the old column. TRUE by default.


We can split the quarter from the year in the tidier dataset by applying the separate() function.

separate_tidier <- separate(tidier, quarter, c("Qrt", "year"), sep = "_")
head(separate_tidier)


##   country Qrt year growth
## 1       A  q1 2024   0.03
## 2       B  q1 2024   0.05
## 3       C  q1 2024   0.01
## 4       A  q2 2024   0.05
## 5       B  q2 2024   0.07
## 6       C  q2 2024   0.02

The unite() function concatenates two columns into one.


unite(data, col, conc, sep = "", remove = TRUE)

Arguments:
-data: The data frame used to reshape the dataset
-col: Name of the new column
-conc: Names of the columns to concatenate
-sep: Indicates the symbol that unites the variables, i.e: "-", "_", "&"
-remove: Remove the old columns. TRUE by default


In the above example, we separated quarter from year. What if we want to merge them back? We use the following code:

unit_tidier <- unite(separate_tidier, Quarter, Qrt, year, sep = "_")
head(unit_tidier)


##   country Quarter growth
## 1       A q1_2024   0.03
## 2       B q1_2024   0.05
## 3       C q1_2024   0.01
## 4       A q2_2024   0.05
## 5       B q2_2024   0.07
## 6       C q2_2024   0.02

Summary

Data analysis can be divided into three parts: Extraction, Transform, and Visualize.

R has a library called dplyr to help with data transformation. The dplyr library is fundamentally built around five verbs to manipulate the data, while the tidyr library provides four functions to clean it.

dplyr provides a nice and convenient way to combine datasets. A join with dplyr adds variables to the right of the original dataset.

The beauty of dplyr is that it handles four types of joins similar to SQL:

left_join() – To merge two datasets and keep all observations from the origin table.

right_join() – To merge two datasets and keep all observations from the destination table.

inner_join() – To merge two datasets and exclude all unmatched rows.

full_join() – To merge two datasets and keep all observations.

Using the tidyr Library, you can transform a dataset using following functions:

gather(): Transform the data from wide to long.

spread(): Transform the data from long to wide.

separate(): Split one variable into two.

unite(): Unite two variables into one.

A Comprehensive Guide On Recommendation Engines In 2023

This article was published as a part of the Data Science Blogathon.


The global market for the usage of Recommendation Engine was valued at USD 2.69 billion in 2023. It is anticipated to surpass USD 15.10 billion by 2026, reporting a CAGR of 37.79% during 2023-2026.

The recommendations that companies give you sometimes use data analysis techniques to identify items that match your taste and preferences. With the rapidly growing data over the internet, it is no surprise to say that Netflix knows which movie you’re going to want to watch next or the top news article you would like to read on your Twitter.

With that being said, in today’s guide, we will discuss Recommendation engines, their importance, challenges faced, working principles, different techniques, applications, and top companies using them, and lastly, how to build your own recommendation engine in Python.


What is a Recommendation Engine?

Why Are Recommendation Engines Important In Machine Learning?

Different Techniques Of Recommender Engines

Working Of Recommendation Engines

Challenges Of Recommendation Engines

How To Build A Recommendation Engine

Applications & Top Companies Using Recommendation Engines


What is a Recommendation Engine?

A recommendation engine is a data filtering system that operates on different machine learning algorithms to recommend products, services, and information to users based on data analysis. It works on the principle of finding patterns in customer behavior data employing a variety of factors such as customer preferences, past transaction history, attributes, or situational context.

The data used for finding insights can be collected implicitly or explicitly. Companies usually feed petabytes of data into their recommendation engines, covering users’ experiences, behaviors, preferences, and interests.

In this ever-evolving market of information density and product overload, each company uses recommendations engines for slightly different purposes. Still, all have the same goal of driving more sales, boosting customer engagement and retention, and providing consumers with a piece of personalized knowledge and solutions.

Why are Recommendation Engines important in ML?

Let us take Netflix as an example.

There are thousands of movies and multiple categories of shows to watch. Still, Netflix offers you a much more opinionated selection of movies and shows you are most likely to enjoy. With this strategy, Netflix achieves lower cancellation rates, saves a billion dollars a year, saves your time, and delivers a better user experience.

This is why recommendation engines are essential, and exactly how many businesses are boosting engagement with their products by offering a more significant influx of cross-selling opportunities.

Different Techniques of Recommendation Engines

There are three different types of recommender engines known in machine learning, and they are:

1. Collaborative Filtering

The collaborative filtering method collects and analyzes data on user behavior, online activities, and preferences to predict what a user will like, based on similarity with other users. It uses a matrix-style formula to plot and calculate these similarities.



If user X likes Book A, Book B, and Book C while user Y likes Book A, Book B, and Book D, they have similar interests. So, it is quite possible that user X would select Book D and user Y would enjoy reading Book C. This is how collaborative filtering happens.
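The book example above can be sketched numerically. The following is a toy illustration (the ratings matrix and the "liked = 1" encoding are made up for this example, not part of the original article): we measure user similarity with cosine similarity, then recommend to X the items that the similar user Y liked but X hasn't seen.

```python
import numpy as np

# Toy user-item matrix (columns: Books A, B, C, D; 1 = liked, 0 = not seen)
ratings = np.array([
    [1, 1, 1, 0],   # user X: likes Books A, B, C
    [1, 1, 0, 1],   # user Y: likes Books A, B, D
])

def cosine_sim(u, v):
    """Cosine similarity between two rating vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine_sim(ratings[0], ratings[1])   # 2 / 3, a high similarity

# Recommend to X the items the similar user Y liked but X hasn't seen
unseen_by_x = (ratings[0] == 0) & (ratings[1] == 1)
recommended = np.where(unseen_by_x)[0]
print(sim, recommended)   # item index 3 corresponds to Book D
```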

2. Content-Based Filtering

Content-based filtering works on the principle of describing a product and a profile of the user’s desired choices. It assumes that you will also like this other item if you like a particular item. Products are defined using keywords (genre, product type, color, word length) to make recommendations. A user profile is created to describe the kind of item this user enjoys. Then the algorithm evaluates the similarity of items using cosine and Euclidean distances.



Suppose a user X likes to watch action movies like Spider-Man. In that case, this recommender technique would only recommend movies of the action genre or films starring Tom Holland.
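The keyword-profile idea can also be sketched in a few lines. Here the movie titles and keyword vectors are invented for illustration; the point is that each item is described by feature keywords, and cosine distance between profiles drives the recommendation:

```python
import numpy as np

# Toy item profiles over keyword features [action, comedy, romance, superhero]
movies = {
    "Spider-Man":   np.array([1.0, 0.0, 0.0, 1.0]),
    "The Avengers": np.array([1.0, 0.0, 0.0, 1.0]),
    "Notting Hill": np.array([0.0, 1.0, 1.0, 0.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

liked = "Spider-Man"
# Score every other item by how close its profile is to the liked item
scores = {t: cosine(movies[liked], vec) for t, vec in movies.items() if t != liked}
best = max(scores, key=scores.get)
print(best)   # the item whose keyword profile is closest to Spider-Man
```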

3. Hybrid Model

In hybrid recommendation systems, both the meta (collaborative) data and the transactional (content-based) data are used simultaneously to suggest a broader range of items to the users. In this technique, natural language processing tags can be allocated for each object (movie, song), and vector equations calculate the similarity. A collaborative filtering matrix can then suggest things to users, depending on their behaviors, actions, and intentions.


This recommendation system is up-and-coming and is said to outperform both of the above methods in terms of accuracy.


Netflix uses a hybrid recommendation engine. It makes recommendations by analyzing the user’s interests (collaborative) and recommending such shows/movies that share similar attributes with those rated highly by the user(content-based).

Working Of Recommendation Engines

Data is the most vital element in constructing a recommendation engine. It is the building block from which patterns are derived by algorithms. The more details it has, the more accurately and practically it will deliver appropriate revenue-generating recommendations. Basically, a recommendation engine works using a combination of data and machine learning algorithms in four phases. Let’s understand them in detail now:

1. Data Collection

Each user’s data profile will become more distinctive over time; hence it is also crucial to collect customer attribute data such as:

demographics (age, gender)

Psychographics (interests, values) to identify similar customers

feature data (genre, object type) to determine similar products likeness.

2. Data Storage

Once you have collected the data, the next step is to store it efficiently. As you collect more data, ample, scalable storage must be available. Several storage options exist depending on the type of data you collect, such as a standard SQL database, a NoSQL database like MongoDB, or cloud storage on AWS.

When choosing the best storage options, one should consider some factors: ease of implementation, data storage size, integration, and portability.

3. Analyze The Data

After collecting the data, you need to analyze the data. The data must then be drilled and analyzed to offer immediate recommendations. The most prevalent methods in which you can analyze data are:

Real-time Analysis, in which the system uses tools that evaluate and analyze events as they are created. This technique is mainly implemented when we want to provide instant recommendations.

Batch Analysis, in which processing and analyzing of data are done periodically. This technique is mainly implemented when we want to send emails with recommendations.

Near-real-time Analysis, in which you analyze and process data in minutes instead of seconds as you don’t need it immediately. This technique is mainly implemented when we provide recommendations while the user is still on the site.

4. Filtering The Data

Once you analyze the data, the final step is to accurately filter the data to provide valuable recommendations. Different matrixes, mathematical rules, and formulas are applied to the data to provide the right suggestion. You must choose the appropriate algorithm, and the outcome of this filtering is the recommendations.

Challenges of Recommendation Engines

Perfection simply doesn’t exist. The English theoretical physicist Stephen Hawking once said:

“One of the basic rules of the universe is that nothing is perfect.”

Similarly, there are some challenges companies have to overcome to build an effective recommender system. Here are some of them:

1. The COLD START Problem

This problem arises when a new user joins the system or adds new items to the record. The recommender system cannot initially suggest this new item or user because it does not have any rating or reviews. Hence, it gets challenging for the engine to predict the preference or priorities of the new user, or the rating of the new items, leading to less precise recommendations.

For instance, a new movie on Netflix cannot be recommended until it gets some views and ratings.

However, a deep learning-based model can solve the cold start problem because these models do not heavily depend on user behavior to make predictions. It can optimize the correlations between the user and the item by examining product context and user details like product descriptions, images, and user behaviors.

2. Data Sparsity Problem

As we all know, recommendation engines depend hugely on data. In some situations, users do not give ratings or reviews of the items they purchased. Without high-quality data, the rating matrix becomes very sparse, leading to the data sparsity problem.

This problem makes it challenging for the algorithm to find users with similar ratings or interests.

To ensure the best quality data and to be able to make the most of the recommendation engine, ask yourself four questions:

How recent is the data?

How noisy is the information?

How diverse is the information?

How quickly can you feed new data to your recommender system model?

The above questions will ensure that your business meets the complex data analytics requirements.

3. Changing User Preferences Problem

User-item interactions, such as ratings and reviews, generate massive, constantly changing data.

For instance, I might be on Netflix today to watch a Romantic movie with my girlfriend. But tomorrow, I might have a different mood, and a classic psychological thriller is what I would like to watch.

How to Build a Recommendation Engine in Python?

This section of the guide will help you build a basic recommendation system in Python. We will focus on recommending items that are most similar to a specific item, in our case, movies. Keep in mind, this is not an exact, robust recommendation engine; it just suggests what movies/items are most similar to your movie preference.

You can find the code and data files at the end of this section. So let’s get started:

Note: It is highly suggested to run this code on Google Colab or a Jupyter notebook.

#1. Import the required libraries.

Import numpy and pandas machine learning libraries, as we will use them for data frames and evaluating correlations.


import numpy as np import pandas as pd

#2. Get the Data

Define the column names, read the csv file for the movies and reviews dataset and print the first 5 rows.


column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('', sep='\t', names=column_names)
df.head()


As you can see above, we have four columns: user id, which is unique to each user. Item id is unique to each movie, ratings of the film, and their timestamp.

Now let’s get the movie titles:


movie_titles = pd.read_csv("Movie_Id_Titles") movie_titles.head()


Read the data using the pandas’ library and print the top 5 rows from the dataset. We have the id and title for each film.

We can now join the two dataframes:


df = pd.merge(df,movie_titles,on='item_id') df.head()


We now have the combined dataframe, which we will use next for Exploratory Data Analysis (EDA).

#3. Exploratory Data Analysis

Let’s examine the data a bit and get a peek at some of the best-rated films.

Visualization imports will be our first step in EDA.


import matplotlib.pyplot as plt import seaborn as sns sns.set_style('white') %matplotlib inline

Next, we will create a rating dataframe with average rating and number of ratings as our two columns:








ratings = pd.DataFrame(df.groupby('title')['rating'].mean()) ratings.head()


Next, add a column with the number of ratings right next to the mean rating:


ratings['num of ratings'] = pd.DataFrame(df.groupby('title')['rating'].count()) ratings.head()


Plot a few histograms to check several ratings visually:


plt.figure(figsize=(10,4)) ratings['num of ratings'].hist(bins=70)



plt.figure(figsize=(10,4)) ratings['rating'].hist(bins=70)



sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)


Okay! Now that we have a comprehensive view of what the data looks like, let’s move on to constructing a simple recommendation system in Python:

#4. Recommending Similar Movies

Now let’s construct a matrix with the user IDs and the movie title. Each cell will then consist of the user’s rating of that movie.

Note: There will be many NaN values, because most people have not seen most of the films.


moviemat = df.pivot_table(index='user_id',columns='title',values='rating') moviemat.head()


Print the most rated films:


ratings.sort_values('num of ratings',ascending=False).head(10)


Let’s pick two movies: Star Wars, a science fiction movie. And the other is Liar Liar, which is a comedy. The next step is to grab the user ratings for those two movies:


starwars_user_ratings = moviemat['Star Wars (1977)'] liarliar_user_ratings = moviemat['Liar Liar (1997)'] starwars_user_ratings.head()


We can then use the corrwith() method to get correlations between two pandas series:


similar_to_starwars = moviemat.corrwith(starwars_user_ratings) similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)


There are still many null values that can be cleaned by removing NaN values. So we use a DataFrame instead of a series:


corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation']) corr_starwars.dropna(inplace=True) corr_starwars.head()


Now, suppose we sort the dataframe by correlation. In that case, we should get the most comparable films; however, note that we get a few movies that don’t really make sense.

This is because there are a lot of films only watched once by users who also watched star wars.




We can fix this by filtering out films with fewer than 100 reviews. We can determine this cutoff from the histogram we plotted in the EDA section earlier.


corr_starwars = corr_starwars.join(ratings['num of ratings']) corr_starwars.head()


Now sort the values and see how the titles make a lot more sense:
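The original code for this step isn't reproduced above, so here is a self-contained sketch of the filter-then-sort idea on a small stand-in frame; the titles, correlations, and rating counts below are invented for illustration, not results from the real dataset:

```python
import pandas as pd

# Illustrative stand-in for the corr_starwars frame built above
corr_starwars = pd.DataFrame({
    "Correlation":    [1.00, 0.99, 0.75, 0.31],
    "num of ratings": [584, 2, 368, 485],
}, index=["Star Wars (1977)", "Obscure Film (1996)",
          "Return of the Jedi (1983)", "Liar Liar (1997)"])

# Keep only films with more than 100 ratings, then sort by correlation
popular = corr_starwars[corr_starwars["num of ratings"] > 100]
print(popular.sort_values("Correlation", ascending=False))
```

The rarely rated film with a spuriously high correlation drops out, and the remaining titles rank sensibly.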



Now we do the same for the comedy Liar Liar:


corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation']) corr_liarliar.dropna(inplace=True) corr_liarliar = corr_liarliar.join(ratings['num of ratings'])


Great job, you have made a movie recommendation engine of your own.

Note: Access the google notebook here.

Applications & Top Companies using Recommendation Engines

Many industries employ recommendation engines to boost user interaction and enhance shopping prospects. As we all saw, recommendation engines can change the way businesses communicate with users and maximize their return-on-investment (ROI) based on the information they can gather.

We will see how almost every business uses a recommendation engine to stand a chance to profit.

1. E-Commerce

E-commerce is an industry where recommendation engines were first widely employed. E-commerce businesses are best suited to provide accurate recommendations with millions of customers and data on their online database.

2. Retail

Shopping data is the most valuable information for a machine learning algorithm. It is the most precise data point on a user’s intent. Retailers with troves of shopping data are at the forefront of enterprises generating concrete recommendations for their customers.

3. Media

Like e-commerce, media companies were among the first to adopt recommendation engines. It is hard to find a news site without a recommendation engine at play.

4. Banking

Banking is a mass-market industry used digitally by millions of people, making it a prime candidate for recommendations. Knowing a customer's exact financial situation and past choices, correlated with data from thousands of comparable users, is a powerful advantage.

5. Telecom

This industry shares similar dynamics with banking. Telcos hold data on millions of customers whose every action is documented. Their product range is also fairly narrow compared to other sectors, which makes recommendations in telecom a more manageable problem.

6. Utilities

The dynamics are similar to telecom, but utilities have an even narrower product range, which makes recommendations comparatively easy to deploy.

Top Companies Using Recommendation Engines Include













Final Thoughts

Recommendation engines are a powerful marketing tool that can help you up-sell, cross-sell, and grow your business. The field is evolving quickly, and every company has to keep up with the technology to provide the most satisfying recommendations to its users.

Here we reach the end of this guide. I hope all the topics and explanations are helpful enough to assist you in starting your journey in the recommendation engines in machine learning.

Read more articles on our blog about Recommendation Engines. 

If you still have any doubt, reach out to me on my social media profiles, and I will be happy to help you out. You can read more about me below:

I am a Data Scientist with a Bachelor's degree in computer science, specializing in Machine Learning, Artificial Intelligence, and Computer Vision. I am also a freelance blogger, author, and geek with five years of experience. Having worked across most areas of computer science, I am currently pursuing a Master's in Applied Computing with a specialization in AI at the University of Windsor, and I work as a freelance content writer and content analyst.

Read More On Recommender Engines By Mrinal Walia:

1. Top 5 Open-Source Machine Learning Recommender System Projects With Resources

2. Must-Try Open-Source Deep Learning Projects for Computer Science Students

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion. 

