# Trending February 2024 # Linear Regression Vs Logistic Regression # Suggested March 2024 # Top 5 Popular

You are reading the article Linear Regression Vs Logistic Regression updated in February 2024 on the website Katfastfood.com. We hope that the information we have shared is helpful to you. If you find the content interesting and meaningful, please share it with your friends and continue to follow and support us for the latest updates. Suggested March 2024 Linear Regression Vs Logistic Regression

Difference Between Linear Regression and Logistic Regression

Linear regression is an algorithm that is based on the supervised learning domain of machine learning. It inherits a linear relationship between its input variables and the single output variable where the output variable is continuous in nature. It is used to predict the value of output let’s say Y from the inputs let’s say X. When only single input is considered it is called simple linear regression. Logistic Regression  is a form of regression which allows the prediction of discrete variables by a mixture of continuous and discrete predictors. It results in a unique transformation of dependent variables which impacts not only the estimation process but also the coefficients of independent variables. It addresses the same question which multiple regression does but with no distributional assumptions on the predictors. In logistic regression the outcome variable is binary. The purpose of the analysis is to assess the effects of multiple explanatory variables, which can be numeric or categorical or both.

Start Your Free Data Science Course

Hadoop, Data Science, Statistics & others

Categories of Linear Regression

It can be classified into two main categories:

1. Simple Regression

Y = β0 + β1 X

Where,

β represents the features

β0 represents the intercept

β1 represents the coefficient of feature X

2. Multivariable Regression

It is used to predict a correlation between more than one independent variable and one dependent variable. Regression with more than two independent variable is based on fitting shape to the constellation of data on a multi-dimensional graph. The shape of regression should be such that it minimizes the distance of the shape from every data point.

A linear relationship model can be represented mathematically as below:

Y = β0 + β1 X1 + β2X2 + β3X3 + ....... + βnXn

Where,

β represents the features

β0 represents the intercept

β1 represents the coefficient of feature X1

βn represents the coefficient of feature Xn

Due to its simplicity, it is widely used modeling for predictions and inferences.

It focuses on data analysis and data preprocessing. So, it deals with different data without bothering about the details of the model.

It works efficiently when the data are normally distributed. Thus for efficient modeling, the collinearity must be avoided.

Types of Logistic Regression

Below are the 2 types of Logistic Regression:

1. Binary Logistic Regression

It is used when the dependent variable is dichotomous i.e. like a tree with two branches. It is used when the dependent variable is non-parametric.

Used when

If there is no linearity

There are only two levels of the dependent variable.

If multivariate normality is doubtful.

2. Multinomial Logistic Regression

Multinomial logistic regression analysis requires that the independent variables be metric or dichotomous. It does not make any assumptions of linearity, normality, and homogeneity of variance for the independent variables.

It is used when the dependent variable has more than two categories. It is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables, then compares multiple groups through a combination of binary logistic regressions. In the end, it provides a set of coefficients for each of the two comparisons. The coefficients for the reference group are taken to be all zeros. Finally, prediction is done based on the highest resultant probability.

Advantage of logistic regression: It is a very efficient and widely used technique as it doesn’t require many computational resources and doesn’t require any tuning.

Key Difference Between The Linear Regression and Logistic Regression

Let us discuss some of the major key differences between Linear Regression vs Logistic Regression

Linear Regression

It is a linear approach

It uses a straight line

It can’t take categorical variables

It has to ignore observations with missing values of the numeric independent variable

Output Y is given as

1 unit increase in x increases Y by α

Applications

Predicting the price of a product

Predicting score in a match

Logistic Regression

It is a statistical approach

It uses a sigmoid function

It can take categorical variables

It can take decisions even if observations with missing values are present

Output Y is given as, where z is given as

1 unit increase in x increases Y by log odds of α

If P is the probability of an event, then (1-P) is the probability of it not occurring. Odds of success = P / 1-P

Applications

Predicting whether today it will rain or not.

Predicting whether an email is a spam or not.

Linear Regression vs Logistic Regression Comparison Table

Let’s discuss the top comparison between Linear Regression vs Logistic Regression

Linear Regression

Logistic Regression

It is used to solve regression problems It is used to solve classification problems

It models the relationship between a dependent variable and one or more independent variable It predicts the probability of an outcome that can only have two values at the output either 0 or 1

The predicted output is a continuous variable The predicted output is a discrete variable

Predicted output Y can exceed 0 and 1 range Predicted output Y lies within 0 and 1 range

Predicted output Y can exceed 0 and 1 range Predicted output

Conclusion

If features doesn’t contribute to prediction or if they are very much correlated with each other then it adds noise to the model. So, features which doesn’t contribute enough to the model must be removed. If independent variables are highly correlated it may cause a problem of multi-collinearity, which can be solved by running separate models with each independent variable.

Recommended Articles

This has been a guide to Linear Regression vs Logistic Regression . Here we discuss the Linear Regression vs Logistic Regression key differences with infographics, and comparison table. You may also have a look at the following articles to learn more–

You're reading Linear Regression Vs Logistic Regression

## Heart Disease Prediction Using Logistic Regression On Uci Dataset

This article was published as a part of the Data Science Blogathon.

Overview

Hi everyone!

In this article, we study, in detail, the hyperparameters, code and libraries used for heart disease prediction using logistic regression on the UCI heart disease dataset.

Importing Libraries #Importing libraries import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns

Numpy: Numpy is an open-source python library for handling n-dimensional arrays, written in the C programming language. Python is also written in the C programming language. Loading Numpy in the memory enables the Python interpreter to work with array computing in a fast and efficient manner. Numpy offers the implementation of various mathematical functions, algebraic routines and Fourier transforms. Numpy supports different hardware and computing technologies and is well suited for GPU and distributed computing. The high-level language used provides ease of use with respect to the various Numpy functionalities.

Pandas: Pandas is a fast open-source data analysis tool built on top of Python. Pandas allow various data manipulation activities using Pandas DataFrame objects. The different Pandas methods used in this study will be explained in detail later.

Matplotlib: Matplotlib is a Python library that enables plotting publication-quality graphs, static and interactive graphs using Python. Matplotlib plots can be exported to various file formats, can work with third-party packages and can be embedded in Jupyter notebooks. Matplotlib methods used are explained in detail as we encounter them.

Seaborn: Seaborn is a statistical data visualization tool for Python built over Matplotlib. The library enables us to create high-quality visualizations in Python.

Data Exploration and Visualization dataframe = pd.read_csv('heart_disease_dataset_UCI.csv')

The read_csv method from the Pandas library enables us to read the *.csv (comma-separated value) file format heart disease dataset published by UCI into the dataframe. The DataFrame object is the primary Pandas data structure which is a two-dimensional table with labelled axes – along rows and along with columns. Various data manipulation operations can be applied to the Pandas dataframe along rows and columns.

The Pandas dataframe head(10) method enables us to get a peek at the top 10 rows of the dataframe. This helps us in gaining an insight into the various columns and an insight into the type and values of data being stored in the dataframe.

The Pandas dataframe info() method provides information on the number of row-entries in the dataframe and the number of columns in the dataframe. Count of non-null entries per column, the data type of each column and the memory usage of the dataframe is also provided.

The Pandas dataframe isna().sum() methods provide the count of null values in each column.

The Matplotlib.figure API implements the Figure class which is the top-level class for all plot elements. Figsize = (15,10) defines the plot size as 15 inches wide and 10 inches high.

The Seaborn heatmap API provides the colour encoded plot for 2-D matrix data. The Pandas dataframe corr() method provides pairwise correlation (movement of the two variables in relation to each other) of columns in the dataframe. NA or null values are excluded by this method. The method allows us to find positive and negative correlations and strong and weak correlations between the various columns and the target variable. This can help us in feature selection. Weakly correlated features can be neglected. Positive and negative correlations can be used to describe model predictions. Positive correlation implies that as the value of one variable goes up, the value of the other variable also goes up. A negative correlation implies that as the value of one variable goes down, the value of the other variable also goes down. Zero correlation implies that there is no linear relationship between the variables. linewidth gives the width of the line that divides each cell in the heatmap. Setting can not to True, labels each cell with the corresponding correlation value. cmap value defines the mapping of the data value to the colorspace.

dataframe.hist(figsize=(12,12))

The Pandas dataframe hist method plots the histogram of the different columns, with figsize equal to 12 inches wide and 12 inches high.

Standard Scaling X = dataframe.iloc[:,0:13] y = dataframe.iloc[:,13]

Next,  we split our dataframe into features (X) and target variable (y) by using the integer-location based indexing ‘iloc’ dataframe property. We select all the rows and the first 13 columns as the X variable and all the rows and the 14th column as the target variable.

X = X.values y = y.values

We extract and return a Numpy representation of X and y values using the dataframe values property for our machine learning study.

from sklearn.preprocessing import StandardScaler X_std=StandardScaler().fit_transform(X)

We use the scikit-learn (sklearn) library for our machine learning studies. The scikit-learn library is an open-source Python library for predictive data analysis and machine learning and is built on top of Numpy, SciPy and Matplotlib. The SciPy ecosystem is used for scientific computing and provides optimized modules for Linear Algebra, Calculus, ODE solvers and Fast Fourier transforms among others. The sklearn preprocessing module implements function like scaling, normalizing and binarizing data. The StandardScaler standardizes the features by making the mean equal to zero and variance equal to one. The fit_transform() method achieves the dual purpose of (i) the fit() method by fitting a scaling algorithm and finding out the parameters for scaling (ii) the transform method, where the actual scaling transformation is applied by using the parameters found in the fit() method. Many machine learning algorithms are designed based on the assumption of expecting normalized/scaled data and standard scaling is thus one of the methods that help in improving the accuracy of machine learning models.

Train-Test Split from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test=train_test_split(X_std,y,test_size=0.25,random_state=40)

The sklearn model_selection class implements different data splitter classes (split into train and test sets, KFold train and test sets etc.), Hyper-parameter optimizers (search over a grid to find optimal hyperparameters) and model validation functionalities (evaluate the metrics of the cross-validated model etc).

N.B. – KFold (K=10) cross-validation means splitting the train set into 10 parts. 9 parts are used for training while the last part is used for testing. Next, another set of 9 parts (different from the previous set) is used for training while the remaining one part is used for testing. This process is repeated until each part forms one test set. The average of the 10 accuracy scores on 10 test sets is the KFold cross_val_score.

The train_test_split method from the sklearn model_selection class is used to split our features (X) and targets (y) into training and test sets. The test size = 0.25 specifies that 25 % of data is to be kept in the test set while setting a random_state = 40 ensures that the algorithm generates the same set of training and test data every time the algorithm is run. Machine learning algorithms are random by nature and setting a random_state ensures that the results are reproducible.

Model Fitting and Prediction from sklearn.linear_model import LogisticRegression from sklearn.metrics import confusion_matrix lr=LogisticRegression(C=1.0,class_weight='balanced',dual=False, fit_intercept=True, intercept_scaling=1,max_iter=100,multi_class='auto', n_jobs=None,penalty='l2',random_state=1234,solver='lbfgs',tol=0.0001, verbose=0,warm_start=False) model1=lr.fit(X_train,y_train) prediction1=model1.predict(X_test) cm=confusion_matrix(y_test,prediction1) sns.heatmap(cm,annot=True,cmap='winter',linewidths=0.3, linecolor='black',annot_kws={"size":20}) TP=cm[0][0] TN=cm[1][1] FN=cm[1][0] FP=cm[0][1] print('Testing Accuracy for Logistic Regression:',(TP+TN)/(TP+TN+FN+FP))

The sklearn.metrics module includes score functions, performance metrics and distance metrics among others. The confusion_matrix method provides the accuracy of classification in a matrix format.

The sklearn linear_model class implements a variety of linear models like Linear regression, Logistic regression, Ridge regression, Lasso regression etc. We import the LogisticRegression class for our classification studies. A LogisticRegression object is instantiated.

The parameter C specifies regularization strength. Regularization implies penalizing the model for overfitting. C=1.0 is the default value for LogisticRegressor in the sklearn library.

The class_weight=’balanced’ method provides weights to the classes. If unspecified, the default class_weight is = 1. Class weight = ‘balanced’ assigns class weights by using the formula (n_samples/(n_classes*np.bin_count(y))). e.g. if n_samples =100, n_classes=2 and there are 50 samples belonging to each of the 0 and 1 classes, class_weight = 100/(2*50) = 1

N.B. Liblinear solver utilizes the coordinate-descent algorithm instead of the gradient descent algorithms to find the optimal parameters for the logistic regression model. E.g. in the gradient descent algorithms, we optimize all the parameters at once. While coordinate descent optimizes only one parameter at a time. In coordinate descent, we first initialize the parameter vector (theta = [theta0, theta1 …….. thetan]). In the kth iteration, only thetaik is updated while (theta0k… thetai-1k and thetai+1k-1…. thetank-1) are fixed.

fit_intercept = True The default value is True. Specifies if a constant should be added to the decision function.

intercept_scaling = 1 The default value is 1. Is applicable only when the solver is liblinear and fit_intercept = True. [X] becomes [X, intercept_scaling]. A synthetic feature with constant value = intercept_scaling is appended to [X]. The intercept becomes, intercept scaling * synthetic feature weight. Synthetic feature weight is modified by L1/L2 regularizations. To lessen the effect of regularization on synthetic feature weights, high intercept_scaling value must be chosen.

max_iter = 100 (default). A maximum number of iterations is taken for the solvers to converge.

multi_class = ‘ovr’, ‘multinomial’ or auto(default). auto selects ‘ovr’ i.e. binary problem if the data is binary or if the solver is liblinear. Otherwise auto selects multinomial which minimises the multinomial loss function even when the data is binary.

n_jobs (default = None). A number of CPU cores are utilized when parallelizing computations for multi_class=’ovr’. None means 1 core is used. -1 means all cores are used. Ignored when the solver is set to liblinear.

penalty: specify the penalty norm (default = L2).

random_state = set random state so that the same results are returned every time the model is run.

solver = the choice of the optimization algorithm (default = ‘lbfgs’)

tol = Tolerance for stopping criteria (default = 1e-4)

verbose = 0 (for suppressing information during the running of the algorithm)

warm_start = (default = False). when set to True, use the solution from the previous step as the initialization for the present step. This is not applicable for the liblinear solver.

Next, we call the fit method on the logistic regressor object using (X_train, y_train) to find the parameters of our logistic regression model. We call the predict method on the logistic regressor object utilizing X_test and the parameters predicted using the fit() method earlier.

We can calculate the confusion matrix to measure the accuracy of the model using the predicted values and y_test.

The parameters for the sns (seaborn) heatmap have been explained earlier. The linecolor parameter specifies the colour of the lines that will divide each cell. The annot_kws parameter passes keyword arguments to the matplotlib method – fontsize in this case.

of test samples = 89.47%.

Conclusion

This brings us to the end of the article. In this article, we developed a logistic regression model for heart disease prediction using a dataset from the UCI repository. We focused on gaining an in-depth understanding of the hyperparameters, libraries and code used when defining a logistic regression model through the scikit-learn library.

Thanks

Sources

Related

## Conceptual Understanding Of Logistic Regression For Data Science Beginners

This article was published as a part of the Data Science Blogathon

What is Logistic Regression? How is it different from Linear Regression? Why regression word is used here if this is a classification problem? What is the use of MLE in Logistic regression? From where did the Loss function come? How does Gradient Descent work in Logistic Regression? What is an odd’s ratio?

Well, these were a few of my doubts when I was learning Logistic Regression. To find the math behind this, I plunged deeper into this topic only to find myself a better understanding of the Logistic Regression model. And in this article, I will try to answer all the doubts you are having right now on this topic. I will tell you the math behind this regression model.

Contents

1) What is Logistic Regression

2) Why do we use Logistic regression rather than Linear Regression?

3) Logistic Function

How Linear regression is similar to logistic regression?

Derivation of the sigmoid function

What are odds?

4) Cost function in Logistic regression

5) What is the use of MLE in logistic regression?

Derivation of the Cost function

Why do we take the Negative log-likelihood function?

Derivative of the Cost function

Derivative of the sigmoid function

7) Endnotes

What is Logistic Regression?

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

I found this definition on google and now we’ll try to understand it. Logistic Regression is another statistical analysis method borrowed by Machine Learning. It is used when our dependent variable is dichotomous or binary. It just means a variable that has only 2 outputs, for example, A person will survive this accident or not, The student will pass this exam or not. The outcome can either be yes or no (2 outputs). This regression technique is similar to linear regression and can be used to predict the Probabilities for classification problems.

Why do we use Logistic Regression rather than Linear Regression?

If you have this doubt, then you’re in the right place, my friend. After reading the definition of logistic regression we now know that it is only used when our dependent variable is binary and in linear regression this dependent variable is continuous.

The second problem is that if we add an outlier in our dataset, the best fit line in linear regression shifts to fit that point.

Now, if we use linear regression to find the best fit line which aims at minimizing the distance between the predicted value and actual value, the line will be like this:

Here the threshold value is 0.5, which means if the value of h(x) is greater than 0.5 then we predict malignant tumor (1) and if it is less than 0.5 then we predict benign tumor (0). Everything seems okay here but now let’s change it a bit, we add some outliers in our dataset, now this best fit line will shift to that point. Hence the line will be somewhat like this:

Do you see any problem here? The blue line represents the old threshold and the yellow line represents the new threshold which is maybe 0.2 here. To keep our predictions right we had to lower our threshold value. Hence we can say that linear regression is prone to outliers. Now here if h(x) is greater than 0.2 then only this regression will give correct outputs.

Another problem with linear regression is that the predicted values may be out of range. We know that probability can be between 0 and 1, but if we use linear regression this probability may exceed 1 or go below 0.

To overcome these problems we use Logistic Regression, which converts this straight best fit line in linear regression to an S-curve using the sigmoid function, which will always give values between 0 and 1. How does this work and what’s the math behind this will be covered in a later section?

If you want to know the difference between logistic regression and linear regression then you refer to this article.

Logistic Function

You must be wondering how logistic regression squeezes the output of linear regression between 0 and 1. If you haven’t read my article on Linear Regression then please have a look at it for a better understanding.

Well, there’s a little bit of math included behind this and it is pretty interesting trust me.

Let’s start by mentioning the formula of logistic function:

How similar it is too linear regression? If you haven’t read my article on Linear Regression, then please have a look at it for a better understanding.

We all know the equation of the best fit line in linear regression is:

Let’s say instead of y we are taking probabilities (P). But there is an issue here, the value of (P) will exceed 1 or go below 0 and we know that range of Probability is (0-1). To overcome this issue we take “odds” of P:

Do you think we are done here? No, we are not. We know that odds can always be positive which means the range will always be (0,+∞ ). Odds are nothing but the ratio of the probability of success and probability of failure. Now the question comes out of so many other options to transform this why did we only take ‘odds’? Because odds are probably the easiest way to do this, that’s it.

The problem here is that the range is restricted and we don’t want a restricted range because if we do so then our correlation will decrease. By restricting the range we are actually decreasing the number of data points and of course, if we decrease our data points, our correlation will decrease. It is difficult to model a variable that has a restricted range. To control this we take the log of odds which has a range from (-∞,+∞).

If you understood what I did here then you have done 80% of the maths. Now we just want a function of P because we want to predict probability right? not log of odds. To do so we will multiply by exponent on both sides and then solve for P.

Now we have our logistic function, also called a sigmoid function. The graph of a sigmoid function is as shown below. It squeezes a straight line into an S-curve.

Cost Function in Logistic Regression

In linear regression, we use the Mean squared error which was the difference between y_predicted and y_actual and this is derived from the maximum likelihood estimator. The graph of the cost function in linear regression is like this:

In logistic regression Yi is a non-linear function (Ŷ=1​/1+ e-z). If we use this in the above MSE equation then it will give a non-convex graph with many local minima as shown

The problem here is that this cost function will give results with local minima, which is a big problem because then we’ll miss out on our global minima and our error will increase.

In order to solve this problem, we derive a different cost function for logistic regression called log loss which is also derived from the maximum likelihood estimation method.

In the next section, we’ll talk a little bit about the maximum likelihood estimator and what it is used for. We’ll also try to see the math behind this log loss function.

What is the use of Maximum Likelihood Estimator?

The main aim of MLE is to find the value of our parameters for which the likelihood function is maximized. The likelihood function is nothing but a joint pdf of our sample observations and joint distribution is the multiplication of the conditional probability for observing each example given the distribution parameters. In other words, we try to find such that plugging these estimates into the model for P(x), yields a number close to one for people who had a malignant tumor and close to 0 for people who had a benign tumor.

Let’s start by defining our likelihood function. We now know that the labels are binary which means they can be either yes/no or pass/fail etc. We can also say we have two outcomes success and failure. This means we can interpret each label as Bernoulli random variable.

A random experiment whose outcomes are of two types, success S and failure F, occurring with probabilities p and q respectively is called a Bernoulli trial. If for this experiment a random variable X is defined such that it takes value 1 when S occurs and 0 if F occurs, then X follows a Bernoulli Distribution.

Where P is our sigmoid function

where σ(θ^T*x^i) is the sigmoid function. Now for n observations,

We need a value for theta which will maximize this likelihood function. To make our calculations easier we multiply the log on both sides. The function we get is also called the log-likelihood function or sum of the log conditional probability

In machine learning, it is conventional to minimize a loss(error) function via gradient descent, rather than maximize an objective function via gradient ascent. If we maximize this above function then we’ll have to deal with gradient ascent to avoid this we take negative of this log so that we use gradient descent. We’ll talk more about gradient descent in a later section and then you’ll have more clarity. Also, remember,

max[log(x)] = min[-log(x)]

The negative of this function is our cost function and what do we want with our cost function? That it should have a minimum value. It is common practice to minimize a cost function for optimization problems; therefore, we can invert the function so that we minimize the negative log-likelihood (NLL). So in logistic regression, our cost function is:

Here y represents the actual class and log(σ(θ^T*x^i) ) is the probability of that class.

p(y) is the probability of 1.

1-p(y) is the probability of 0.

Let’s see what will be the graph of cost function when y=1 and y=0

If we combine both the graphs, we will get a convex graph with only 1 local minimum and now it’ll be easy to use gradient descent here.

The red line here represents the 1 class (y=1), the right term of cost function will vanish. Now if the predicted probability is close to 1 then our loss will be less and when probability approaches 0, our loss function reaches infinity.

The black line represents 0 class (y=0), the left term will vanish in our cost function and if the predicted probability is close to 0 then our loss function will be less but if our probability approaches 1 then our loss function reaches infinity.

This cost function is also called log loss. It also ensures that as the probability of the correct answer is maximized, the probability of the incorrect answer is minimized. Lower the value of this cost function higher will be the accuracy.

In this section, we will try to understand how we can utilize Gradient Descent to compute the minimum cost.

Gradient descent changes the value of our weights in such a way that it always converges to minimum point or we can also say that, it aims at finding the optimal weights which minimize the loss function of our model. It is an iterative method that finds the minimum of a function by figuring out the slope at a random point and then moving in the opposite direction.

The intuition is that if you are hiking in a canyon and trying to descend most quickly down to the river at the bottom, you might look around yourself 360 degrees, find the direction where the ground is sloping the steepest, and walk downhill in that direction.

At first gradient descent takes a random value of our parameters from our function. Now we need an algorithm that will tell us whether at the next iteration we should move left or right to reach the minimum point. The gradient descent algorithm finds the slope of the loss function at that particular point and then in the next iteration, it moves in the opposite direction to reach the minima. Since we have a convex graph now we don’t need to worry about local minima. A convex curve will always have only 1 minima.

We can summarize the gradient descent algorithm as:

Here alpha is known as the learning rate. It determines the step size at each iteration while moving towards the minimum point. Usually, a lower value of “alpha” is preferred, because if the learning rate is a big number then we may miss the minimum point and keep on oscillating in the convex curve

Now the question is what is this derivative of cost function? How do we do this? Don’t worry, In the next section we’ll see how we can derive this cost function w.r.t our parameters.

Derivation of Cost Function:

Before we derive our cost function we’ll first find a derivative for our sigmoid function because it will be used in derivating the cost function.

Now, we will derive the cost function with the help of the chain rule as it allows us to calculate complex partial derivatives by breaking them down.

Step-1: Use chain rule and break the partial derivative of log-likelihood.

Step-2: Find derivative of log-likelihood w.r.t p Step-3: Find derivative of ‘p’ w.r.t ‘z’ Step-4: Find derivate of z w.r.t

θ

Step-5: Put all the derivatives in equation 1

Hence the derivative of our cost function is:

Now since we have our derivative of the cost function, we can write our gradient descent algorithm as:

If the slope is negative (downward slope) then our gradient descent will add some value to our new value of the parameter directing it towards the minimum point of the convex curve. Whereas if the slope is positive (upward slope) the gradient descent will minus some value to direct it towards the minimum point.

Endnote

To summarise, in this article we learned why linear regression doesn’t work in the case of classification problems.  Also, how MLE is used in logistic regression and how our cost function is derived.

In the next article, I will explain all the interpretations of logistic regression. And how we can check the accuracy of our logistic model.

I am an undergraduate student currently in my last year majoring in Statistics (Bachelors of Statistics) and have a strong interest in the field of data science, machine learning, and artificial intelligence. I enjoy diving into data to discover trends and other valuable insights about the data. I am constantly learning and motivated to try new things.

I am open to collaboration and work.

For any doubt and queries, feel free to contact me on Email

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Related

## Lasso & Ridge Regression

37 minutes

⭐⭐⭐⭐⭐

Rating: 5 out of 5.

Introduction

Regression models work like magic in predicting the future using machine learning. Using these, businesses can predict future purchases and make better-informed decisions and future plans. In this article, I will tell you everything you need to know about regression models and how they can be used to solve prediction problems. We will extensively cover the basics of linear, lasso, and ridge regression models and learn how they are implemented in Python and R.

Learning Objectives

Understand and implement linear regression techniques for predictive modeling.

Understand regularization in regression models.

Build Lasso, Ridge, and Elastic Net regression models.

Are you a beginner looking for a place to start your data science journey? Presenting a comprehensive course full of knowledge and data science learning curated just for you!

Learning Example

Take a moment to list down all those factors you can think, on which the sales of a store will be dependent on. For each factor create an hypothesis about why and how that factor would influence the sales of various products. For example – I expect the sales of products to depend on the location of the store, because the local residents in each area would have different lifestyle. The amount of bread a store will sell in Ahmedabad would be a fraction of similar store in Mumbai.

Similarly list down all possible factors you can think of.

How many factors were you able to think of? If it is less than 15, give it more time and think again! A seasoned data scientist working on this problem would possibly think of tens and hundreds of such factors.

With that thought in mind, I am providing you with one such data set – The Big Mart Sales. In the data set, we have product wise Sales for Multiple outlets of a chain.

Let’s us take a snapshot of the dataset:

In the dataset, we can see characteristics of the sold item (fat content, visibility, type, price) and some characteristics of the outlet (year of establishment, size, location, type) and the number of the items sold for that particular item. Let’s see if we can predict sales using these features.

Simple Models for Prediction

Let us start with making predictions using a few simple ways to start with. If I were to ask you, what could be the simplest way to predict the sales of an item, what would you say?

Model 1 – Mean sales:

Even without any knowledge of machine learning, you can say that if you have to predict sales for an item – it would be the average over last few days . / months / weeks.

It is a good thought to start, but it also raises a question – how good is that model?

Turns out that there are various ways in which we can evaluate how good is our model. The most common way is Mean Squared Error. Let us understand how to measure it.

Prediction error

To evaluate how good is a model, let us understand the impact of wrong predictions. If we predict sales to be higher than what they might be, the store will spend a lot of money making unnecessary arrangement which would lead to excess inventory. On the other side if I predict it too low, I will lose out on sales opportunity.

So, the simplest way of calculating error will be, to calculate the difference in the predicted and actual values. However, if we simply add them, they might cancel out, so we square these errors before adding. We also divide them by the number of data points to calculate a mean error since it should not be dependent on number of data points.

This is known as the mean squared error.

Here e1, e2 …. , en are the difference between the actual and the predicted values.

So, in our first model what would be the mean squared error? On predicting the mean for all the data points, we get a mean squared error = 29,11,799. Looks like huge error. May be its not so cool to simply predict the average value.

Let’s see if we can think of something to reduce the error. Here is a live coding window to predict target using mean.

﻿

Model 2 – Average Sales by Location:

We know that location plays a vital role in the sales of an item. For example, let us say, sales of car would be much higher in Delhi than its sales in Varanasi. Therefore let us use the data of the column ‘Outlet_Location_Type’.

So basically, let us calculate the average sales for each location type and predict accordingly.

On predicting the same, we get mse = 28,75,386, which is less than our previous case. So we can notice that by using a characteristic[location], we have reduced the error.

Now, what if there are multiple features on which the sales would depend on. How would we predict sales using this information? Linear regression comes to our rescue.

What Is Linear Regression?

Linear regression is the simplest and most widely used statistical technique for predictive modeling. It basically gives us an equation, where we have our features as independent variables, on which our target variable [sales in our case] is dependent upon.

So what does the equation look like? Linear regression equation looks like this:

Here, we have Y as our dependent variable (Sales), X’s are the independent variables and all thetas are the coefficients. Coefficients are basically the weights assigned to the features, based on their importance. For example, if we believe that sales of an item would have higher dependency upon the type of location as compared to size of store, it means that sales in a tier 1 city would be more even if it is a smaller outlet than a tier 3 city in a bigger outlet. Therefore, coefficient of location type would be more than that of store size.

So, firstly let us try to understand linear regression with only one feature, i.e., only one independent variable. Therefore our equation becomes,

This equation is called a simple linear regression equation, which represents a straight line, where ‘Θ0’ is the intercept, ‘Θ1’ is the slope of the line. Take a look at the plot below between sales and MRP.

Surprisingly, we can see that sales of a product increases with increase in its MRP. Therefore the dotted red line represents our regression line or the line of best fit. But one question that arises is how you would find out this line?

How to Find the Line of Best Fit

As you can see below there can be so many lines which can be used to estimate Sales according to their MRP. So how would you choose the best fit line or the regression line?

The main purpose of the best fit line is that our predicted values should be closer to our actual or the observed values, because there is no point in predicting values which are far away from the real values. In other words, we tend to minimize the difference between the values predicted by us and the observed values, and which is actually termed as error. Graphical representation of error is as shown below. These errors are also called as residuals. The residuals are indicated by the vertical lines showing the difference between the predicted and actual value.

Okay, now we know that our main objective is to find out the error and minimize it. But before that, let’s think of how to deal with the first part, that is, to calculate the error. We already know that error is the difference between the value predicted by us and the observed value. Let’s just consider three ways through which we can calculate error:

Sum of residuals (∑(Y – h(X))) – it might result in cancelling out of positive and negative errors.

Sum of square of residuals ( ∑ (Y-h(X))2) – it’s the method mostly used in practice since here we penalize higher error value much more as compared to a smaller one, so that there is a significant difference between making big errors and small errors, which makes it easy to differentiate and select the best fit line.

Therefore, sum of squares of these residuals is denoted by:

where, h(x) is the value predicted by us,  h(x) =Θ1*x +Θ0 , y is the actual values and m is the number of rows in the training set.

The cost Function

So let’s say, you increased the size of a particular shop, where you predicted that the sales would be higher. But despite increasing the size, the sales in that shop did not increase that much. So the cost applied in increasing the size of the shop, gave you negative results.

So, we need to minimize these costs. Therefore we introduce a cost function, which is basically used to define and measure the error of the model.

If you look at this equation carefully, it is just similar to sum of squared errors, with just a factor of 1/2m is multiplied in order to ease mathematics.

So in order to improve our prediction, we need to minimize the cost function. For this purpose we use the gradient descent algorithm. So let us understand how it works.

Let us consider an example, we need to find the minimum value of this equation,

Y= 5x + 4x^2. In mathematics, we simple take the derivative of this equation with respect to x, simply equate it to zero. This gives us the point where this equation is minimum. Therefore substituting that value can give us the minimum value of that equation.

Gradient descent works in a similar manner. It iteratively updates Θ, to find a point where the cost function would be minimum. If you wish to study gradient descent in depth, I would highly recommend going through this article.

Using Linear Regression for Prediction

Now let us consider using Linear Regression to predict Sales for our big mart sales problem.

Model 3 – Enter Linear Regression:

From the previous case, we know that by using the right features would improve our accuracy. So now let us use two features, MRP and the store establishment year to estimate sales.

Now, let us built a linear regression model in python considering only these two features.

In this case, we got mse = 19,10,586.53, which is much smaller than our model 2. Therefore predicting with the help of two features is much more accurate.

Let us take a look at the coefficients of this linear regression model.

Therefore, we can see that MRP has a high coefficient, meaning items having higher prices have better sales.

How accurate do you think the model is? Do we have any evaluation metric, so that we can check this? Actually we have a quantity, known as R-Square.

R-Square: It determines how much of the total variation in Y (dependent variable) is explained by the variation in X (independent variable). Mathematically, it can be written as:

The value of R-square is always between 0 and 1, where 0 means that the model does not model explain any variability in the target variable (Y) and 1 meaning it explains full variability in the target variable.

Now let us check the r-square for the above model.

In this case, R² is 32%, meaning, only 32% of variance in sales is explained by year of establishment and MRP. In other words, if you know year of establishment and the MRP, you’ll have 32% information to make an accurate prediction about its sales.

Now what would happen if I introduce one more feature in my model, will my model predict values more closely to its actual value? Will the value of R-Square increase?

Let us consider another case.

Model 4 – Linear regression with more variables

We learnt, by using two variables rather than one, we improved the ability to make accurate predictions about the item sales.

So, let us introduce another feature ‘weight’ in case 3. Now let’s build a regression model with these three features.

ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).

It produces an error, because item weights column have some missing values. So let us impute it with the mean of other non-null entries.

train['Item_Weight'].fillna((train['Item_Weight'].mean()), inplace=True)

Let us try to run the model again.

Therefore we can see that the mse is further reduced. There is an increase in the value R-square, does it mean that the addition of item weight is useful for our model?

The only drawback of R2 is that if new predictors (X) are added to our model, R2 only increases or remains constant but it never decreases. We can not judge that by increasing complexity of our model, are we making it more accurate?

That is why, we use “Adjusted R-Square”.

The Adjusted R-Square is the modified form of R-Square that has been adjusted for the number of predictors in the model. It incorporates model’s degree of freedom. The adjusted R-Square only increases if the new term improves the model accuracy.

where

R2 = Sample R square

p = Number of predictors

N = total sample size

Using All the Features for Prediction

Now let us built a model containing all the features. While building the regression models, I have only used continuous features. This is because we need to treat categorical variables differently before they can used in linear regression model. There are different techniques to treat them, here I have used one hot encoding(convert each class of a categorical variable as a feature). Other than that I have also imputed the missing values for outlet size.

Data pre-processing steps for regression model Building the model

Clearly, we can see that there is a great improvement in both mse and R-square, which means that our model now is able to predict much closer values to the actual values.

Selecting the right features for your model

When we have a high dimensional data set, it would be highly inefficient to use all the variables since some of them might be imparting redundant information. We would need to select the right set of variables which give us an accurate model as well as are able to explain the dependent variable well. There are multiple ways to select the right set of variables for the model. First among them would be the business understanding and domain knowledge. For instance while predicting sales we know that marketing efforts should impact positively towards sales and is an important feature in your model. We should also take care that the variables we’re selecting should not be correlated among themselves.

Instead of manually selecting the variables, we can automate this process by using forward or backward selection. Forward selection starts with most significant predictor in the model and adds variable for each step. Backward elimination starts with all predictors in the model and removes the least significant variable for each step. Selecting criteria can be set to any statistical measure like R-square, t-stat etc.

Interpretation of Regression Plots

Take a look at the residual vs fitted values plot.

We can see a funnel like shape in the plot. This shape indicates Heteroskedasticity. The presence of non-constant variance in the error terms results in heteroskedasticity. We can clearly see that the variance of error terms(residuals) is not constant. Generally, non-constant variance arises in presence of outliers or extreme leverage values. These values get too much weight, thereby disproportionately influencing the model’s performance. When this phenomenon occurs, the confidence interval for out of sample prediction tends to be unrealistically wide or narrow.

We can easily check this by looking at residual vs fitted values plot. If heteroskedasticity exists, the plot would exhibit a funnel shape pattern as shown above. This indicates signs of non linearity in the data which has not been captured by the model. I would highly recommend going through this article for a detailed understanding of assumptions and interpretation of regression plots.

In order to capture this non-linear effects, we have another type of regression known as polynomial regression. So let us now understand it.

What Is Polynomial Regression?

Polynomial regression is another form of regression in which the maximum power of the independent variable is more than 1. In this regression technique, the best fit line is not a straight line instead it is in the form of a curve.

Quadratic regression, or regression with second order polynomial, is given by the following equation:

Y =Θ1 +Θ2*x +Θ3*x2

Now take a look at the plot given below.

Clearly the quadratic equation fits the data better than simple linear equation. In this case, what do you think will the R-square value of quadratic regression greater than simple linear regression? Definitely yes, because quadratic regression fits the data better than linear regression. While quadratic and cubic polynomials are common, but you can also add higher degree polynomials.

Below figure shows the behavior of a polynomial equation of degree 6.

So do you think it’s always better to use higher order polynomials to fit the data set. Sadly, no. Basically, we have created a model that fits our training data well but fails to estimate the real relationship among variables beyond the training set. Therefore our model performs poorly on the test data. This problem is called as over-fitting. We also say that the model has high variance and low bias.

Similarly, we have another problem called underfitting, it occurs when our model neither fits the training data nor generalizes on the new data.

Our model is underfit when we have high bias and low variance.

Bias and Variance in Regression Models

What does that bias and variance actually mean? Let us understand this by an example of archery targets.

Let’s say we have model which is very accurate, therefore the error of our model will be low, meaning a low bias and low variance as shown in first figure. All the data points fit within the bulls-eye. Similarly we can say that if the variance increases, the spread of our data point increases which results in less accurate prediction. And as the bias increases the error between our predicted value and the observed values increases.

Now how this bias and variance is balanced to have a perfect model? Take a look at the image below and try to understand.

As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., overfitting. So we need to find out one optimum point in our model where the decrease in bias is equal to increase in variance. In practice, there is no analytical way to find this point. So how to deal with high variance or high bias?

To overcome underfitting or high bias, we can basically add new parameters to our model so that the model complexity increases, and thus reducing high bias.

Now, how can we overcome Overfitting for a regression model?

Basically there are two methods to overcome overfitting,

Reduce the model complexity

Regularization

Here we would be discussing about Regularization in detail and how to use it to make your model more generalized.

Regularization of Models

You have your model ready, you have predicted your output. So why do you need to study regularization? Is it necessary?

Suppose you have taken part in a competition, and in that problem you need to predict a continuous variable. So you applied linear regression and predicted your output. Voila! You are on the leaderboard. But wait what you see is still there are many people above you on the leaderboard. But you did everything right then how is it possible?

“Everything should be made simple as possible, but not simpler – Albert Einstein”

What we did was simpler, everybody else did that, now let us look at making it simple. That is why, we will try to optimize our code with the help of regularization.

In regularization, what we do is normally we keep the same number of features, but reduce the magnitude of the coefficients j. How does reducing the coefficients will help us?

Let us take a look at the coefficients of feature in our above regression model.

We can see that coefficients of Outlet_Identifier_OUT027 and Outlet_Type_Supermarket_Type3(last 2) is much higher as compared to rest of the coefficients. Therefore the total sales of an item would be more driven by these two features.

How can we reduce the magnitude of coefficients in our model? For this purpose, we have different types of regression techniques which uses regularization to overcome this problem. So let us discuss them.

What Is Ridge Regression?

Let us first implement it on our above problem and check our results that whether it performs better than our linear regression model.

So, we can see that there is a slight improvement in our model because the value of the R-Square has been increased. Note that value of alpha, which is hyperparameter of Ridge, which means that they are not automatically learned by the model instead they have to be set manually.

Here we have consider alpha = 0.05. But let us consider different values of alpha and plot the coefficients for each case.

You can see that, as we increase the value of alpha, the magnitude of the coefficients decreases, where the values reaches to zero but not absolute zero.

But if you calculate R-square for each alpha, we will see that the value of R-square will be maximum at alpha=0.05. So we have to choose it wisely by iterating it through a range of values and using the one which gives us lowest error.

So, now you have an idea how to implement it but let us take a look at the mathematics side also. Till now our idea was to basically minimize the cost function, such that values predicted are much closer to the desired result.

Now take a look back again at the cost function for ridge regression.

Here if you notice, we come across an extra term, which is known as the penalty term. λ given here, is actually denoted by alpha parameter in the ridge function. So by changing the values of alpha, we are basically controlling the penalty term. Higher the values of alpha, bigger is the penalty and therefore the magnitude of coefficients are reduced.

Important Points:

It shrinks the parameters, therefore it is mostly used to prevent multicollinearity.

It reduces the model complexity by coefficient shrinkage.

It uses L2 regularization technique. (which I will discussed later in this article)

Now let us consider another type of regression technique which also makes use of regularization.

What Is Lasso Regression?

LASSO (Least Absolute Shrinkage Selector Operator), is quite similar to ridge, but lets understand the difference them by implementing it in our big mart problem.

As we can see that, both the mse and the value of R-square for our model has been increased. Therefore, lasso model is predicting better than both linear and ridge.

Again lets change the value of alpha and see how does it affect the coefficients.

So, we can see that even at small values of alpha, the magnitude of coefficients have reduced a lot. By looking at the plots, can you figure a difference between ridge and lasso?

We can see that as we increased the value of alpha, coefficients were approaching towards zero, but if you see in case of lasso, even at smaller alpha’s, our coefficients are reducing to absolute zeroes. Therefore, lasso selects the only some feature while reduces the coefficients of others to zero. This property is known as feature selection and which is absent in case of ridge.

Mathematics behind lasso regression is quiet similar to that of ridge only difference being instead of adding squares of theta, we will add absolute value of Θ.

Here too, λ is the hypermeter, whose value is equal to the alpha in the Lasso function.

Important Points:

It uses L1 regularization technique (will be discussed later in this article)

It is generally used when we have more number of features, because it automatically does feature selection.

Now that you have a basic understanding of ridge and lasso regression, let’s think of an example where we have a large dataset, lets say it has 10,000 features. And we know that some of the independent features are correlated with other independent features. Then think, which regression would you use, Rigde or Lasso?

Let’s discuss it one by one. If we apply ridge regression to it, it will retain all of the features but will shrink the coefficients. But the problem is that model will still remain complex as there are 10,000 features, thus may lead to poor model performance.

Instead of ridge what if we apply lasso regression to this problem. The main problem with lasso regression is when we have correlated variables, it retains only one variable and sets other correlated variables to zero. That will possibly lead to some loss of information resulting in lower accuracy in our model.

Then what is the solution for this problem? Actually we have another type of regression, known as elastic net regression, which is basically a hybrid of ridge and lasso regression. So let’s try to understand it.

﻿

What Is Elastic Net Regression?

Before going into the theory part, let us implement this too in big mart sales problem. Will it perform better than ridge and lasso? Let’s check!

So we get the value of R-Square, which is very less than both ridge and lasso. Can you think why? The reason behind this downfall is basically we didn’t have a large set of features. Elastic regression generally works well when we have a big dataset.

Note, here we had two parameters alpha and l1_ratio. First let’s discuss, what happens in elastic net, and how it is different from ridge and lasso.

Elastic net is basically a combination of both L1 and L2 regularization. So if you know elastic net, you can implement both Ridge and Lasso by tuning the parameters. So it uses both L1 and L2 penality term, therefore its equation look like as follows:

So how do we adjust the lambdas in order to control the L1 and L2 penalty term? Let us understand by an example. You are trying to catch a fish from a pond. And you only have a net, then what would you do? Will you randomly throw your net? No, you will actually wait until you see one fish swimming around, then you would throw the net in that direction to basically collect the entire group of fishes. Therefore even if they are correlated, we still want to look at their entire group.

Elastic regression works in a similar way. Let’ say, we have a bunch of correlated independent variables in a dataset, then elastic net will simply form a group consisting of these correlated variables. Now if any one of the variable of this group is a strong predictor (meaning having a strong relationship with dependent variable), then we will include the entire group in the model building, because omitting other variables (like what we did in lasso) might result in losing some information in terms of interpretation ability, leading to a poor model performance.

So, if you look at the code above, we need to define alpha and l1_ratio while defining the model. Alpha and l1_ratio are the parameters which you can set accordingly if you wish to control the L1 and L2 penalty separately. Actually, we have

Alpha = a + b           and     l1_ratio =  a / (a+b)

where, a and b weights assigned to L1 and L2 term respectively. So when we change the values of alpha and l1_ratio, a and b are set aaccordingly such that they control trade off between L1 and L2 as:

a * (L1 term) + b* (L2 term)

Let alpha (or a+b) = 1, and now consider the following cases:

If l1_ratio =1, therefore if we look at the formula of l1_ratio, we can see that l1_ratio can only be equal to 1 if a=1, which implies b=0. Therefore, it will be a lasso penalty.

Similarly if l1_ratio = 0, implies a=0. Then the penalty will be a ridge penalty.

For l1_ratio between 0 and 1, the penalty is the combination of ridge and lasso.

So let us adjust alpha and l1_ratio, and try to understand from the plots of coefficient given below.

Now, you have basic understanding about ridge, lasso and elasticnet regression. But during this, we came across two terms L1 and L2, which are basically two types of regularization. To sum up basically lasso and ridge are the direct application of L1 and L2 regularization respectively.

But if you still want to know, below I have explained the concept behind them, which is OPTIONAL but before that let us see the same implementation of above codes in R.

Implementation in R

Step 1: Linear regression with two variables “Item MRP” and “Item Establishment Year”.

View the code on Gist.

Output

Call: lm(formula = Y_train ~ Item_MRP + Outlet_Establishment_Year, data = train_2) Residuals: Min 1Q Median 3Q Max -4000.1 -769.4 -32.7 679.4 9286.7 Coefficients: (Intercept) 17491.6441 4328.9747 4.041 5.40e-05 *** Item_MRP 15.9085 0.2909 54.680 < 2e-16 *** Outlet_Establishment_Year -8.7808 2.1667 -4.053 5.13e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1393 on 5953 degrees of freedom Multiple R-squared: 0.3354, Adjusted R-squared: 0.3352 F-statistic: 1502 on 2 and 5953 DF, p-value: < 2.2e-16

Also, the value of r square is 0.3354391 and the MSE is 20,28,538.

Step 2: Linear regression with three variables “Item MRP”, “Item Establishment Year”, “Item Weight”.

View the code on Gist.

Output

Call: lm(formula = Y_train ~ Item_Weight + Item_MRP + Outlet_Establishment_Year, data = train_2) Residuals: Min 1Q Median 3Q Max -4000.7 -767.1 -33.2 680.8 9286.3 Coefficients: (Intercept) 17530.3653 4329.9774 4.049 5.22e-05 *** Item_Weight -2.0914 4.2819 -0.488 0.625 Item_MRP 15.9128 0.2911 54.666 < 2e-16 *** Outlet_Establishment_Year -8.7870 2.1669 -4.055 5.08e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1393 on 5952 degrees of freedom Multiple R-squared: 0.3355, Adjusted R-squared: 0.3351 F-statistic: 1002 on 3 and 5952 DF, p-value: < 2.2e-16

Also, the value of r square is 0.3354657 and the MSE is 20,28,692.

Step 3: Linear regression with all variables.

View the code on Gist.

Output

Also, the value of r square is 0.3354657 and the MSE is 14,38,692.

Step 4: Implementation of Ridge regression

View the code on Gist.

Output

(Intercept) (Intercept) Item_Weight -220.666749 0.000000 4.223814 Item_Fat_Contentlow fat Item_Fat_ContentLow Fat Item_Fat_Contentreg 450.322572 170.154291 -760.541684 -150.724872 589.452742 -80.478887 Item_TypeBreakfast Item_TypeCanned Item_TypeDairy 62.144579 638.515758 85.600397 Item_TypeFrozen Foods Item_TypeFruits and Vegetables Item_TypeHard Drinks 359.471616 -259.261220 -724.253049 Item_TypeHealth and Hygiene Item_TypeHousehold Item_TypeMeat -127.436543 -197.964693 -62.876064 Item_TypeOthers Item_TypeSeafood Item_TypeSnack Foods 120.296577 -114.541586 140.058051 Item_TypeSoft Drinks Item_TypeStarchy Foods Item_MRP -77.997959 1294.760824 9.841181 Outlet_IdentifierOUT013 Outlet_IdentifierOUT017 Outlet_IdentifierOUT018 20.653040 442.754542 252.958840 Outlet_IdentifierOUT019 Outlet_IdentifierOUT027 Outlet_IdentifierOUT035 -1044.171225 1031.745209 -213.353983 Outlet_IdentifierOUT045 Outlet_IdentifierOUT046 Outlet_IdentifierOUT049 -232.709629 249.947674 154.372830 Outlet_Establishment_Year Outlet_SizeHigh Outlet_SizeMedium 1.718906 18.560755 356.251406 Outlet_SizeSmall Outlet_Location_TypeTier 2 Outlet_Location_TypeTier 3 54.009414 50.276025 -369.417420 Outlet_TypeSupermarket Type1 649.021251

Step 5: Implementation of lasso regression

View the code on Gist.

Output

(Intercept) (Intercept) Item_Weight 550.39251 0.00000 0.00000 Item_Fat_Contentlow fat Item_Fat_ContentLow Fat Item_Fat_Contentreg 0.00000 22.67186 0.00000 0.00000 0.00000 0.00000 Item_TypeBreakfast Item_TypeCanned Item_TypeDairy 0.00000 0.00000 0.00000 Item_TypeFrozen Foods Item_TypeFruits and Vegetables Item_TypeHard Drinks 0.00000 -83.94379 0.00000 Item_TypeHealth and Hygiene Item_TypeHousehold Item_TypeMeat 0.00000 0.00000 0.00000 Item_TypeOthers Item_TypeSeafood Item_TypeSnack Foods 0.00000 0.00000 0.00000 Item_TypeSoft Drinks Item_TypeStarchy Foods Item_MRP 0.00000 0.00000 11.18735 Outlet_IdentifierOUT013 Outlet_IdentifierOUT017 Outlet_IdentifierOUT018 0.00000 0.00000 0.00000 Outlet_IdentifierOUT019 Outlet_IdentifierOUT027 Outlet_IdentifierOUT035 -580.02106 645.76539 0.00000 Outlet_IdentifierOUT045 Outlet_IdentifierOUT046 Outlet_IdentifierOUT049 0.00000 0.00000 0.00000 Outlet_Establishment_Year Outlet_SizeHigh Outlet_SizeMedium 0.00000 0.00000 260.63703 Outlet_SizeSmall Outlet_Location_TypeTier 2 Outlet_Location_TypeTier 3 0.00000 0.00000 -313.21402 Outlet_TypeSupermarket Type1 48.77124

For better understanding and more clarity on all the three types of regression, you can refer to this Free Course: Big Mart Sales In R.

Types of Regularization Techniques

Let’s recall, both in ridge and lasso we added a penalty term, but the term was different in both cases. In ridge, we used the squares of theta while in lasso we used absolute value of theta. So why these two only, can’t there be other possibilities?

Actually, there are different possible choices of regularization with different choices of order of the parameter in the regularization term, which is denoted by . This is more generally known as Lp regularizer.

Let us try to visualize some by plotting them. For making visualization easy, let us plot them in 2D space. For that we suppose that we just have two parameters. Now, let’s say if p=1, we have term as  . Can’t we plot this equation of line? Similarly plot for different values of p are given below.

In the above plots, axis denote the parameters(Θ1 and Θ2). Let us examine them one by one.

For p=0.5, we can only get large values of one parameter only if other parameter is too small. For p=1, we get sum of absolute values where the increase in one parameter Θ is exactly offset by the decrease in other. For p =2, we get a circle and for larger p values, it approaches a round square shape.

The two most commonly used regularization are in which we have p=1 and p=2, more commonly known as L1 and L2 regularization.

Look at the figure given below carefully. The blue shape refers the regularization term and other shape present refers to our least square error (or data term).

The first figure is for L1 and the second one is for L2 regularization. The black point denotes that the least square error is minimized at that point and as we can see that it increases quadratically as we move from it and the regularization term is minimized at the origin where all the parameters are zero .

Now the question is that at what point will our cost function be minimum? The answer will be, since they are quadratically increasing, the sum of both the terms will be minimized at the point where they first intersect.

Take a look at the L2 regularization curve. Since the shape formed by L2 regularizer is a circle, it increases quadratically as we move away from it. The L2 optimum(which is basically the intersection point) can fall on the axis lines only when the minimum MSE (mean square error or the black point in the figure) is also exactly on the axis. But in case of L1, the L1 optimum can be on the axis line because its contour is sharp and therefore there are high chances of interaction point to fall on axis. Therefore it is possible to intersect on the axis line, even when minimum MSE is not on the axis. If the intersection point falls on the axes it is known as sparse.

Therefore L1 offers some level of sparsity which makes our model more efficient to store and compute and it can also help in checking importance of feature, since the features that are not important can be exactly set to zero.

Conclusion

I hope now you understand the science behind the linear regression and how to implement it and optimize it further to improve your model.

“Knowledge is the treasure and practice is the key to it”

Therefore, get your hands dirty by solving some problems. You can also start with the Big mart sales problem and try to improve your model with some feature engineering.  If you face any difficulties while implementing it, feel free to write on our discussion portal.

Key Takeaways

We now understand how to evaluate a linear regression model using R-squared and Adjusted R-squared values.

We looked into the difference between Bias and Variance in regression models.

We also learned to implement various regression models in R.

Q1. Where are ridge and lasso regression used?

A. Ridge regression is used in cases where there are many parameters or predictors that affect the outcome. Whereas lasso regression works better when there are fewer significant parameters or predictors involved.

Q2. What is the full form of lasso?

A. The LASSO in lasso regression stands for Least Absolute Shrinkage and Selection Operator.

Q3. What is the difference between lasso and ridge regression in R?

A. The main difference between the two is that lasso regression can lower the coefficients to zero resulting in feature selection, while ridge regression cannot result in feature selection, as it only gets the coefficients close to zero.

Related

## How To Create Polynomial Regression Model In R?

A Polynomial regression model is the type of model in which the dependent variable does not have linear relationship with the independent variables rather they have nth degree relationship. For example, a dependent variable x can depend on an independent variable y-square. There are two ways to create a polynomial regression in R, first one is using polym function and second one is using I() function.

Example1 set.seed(322) x1<−rnorm(20,1,0.5) x2<−rnorm(20,5,0.98) y1<−rnorm(20,8,2.15) Method1 Model1<−lm(y1~polym(x1,x2,degree=2,raw=TRUE)) summary(Model1) Output Call: lm(formula = y1 ~ polym(x1, x2, degree = 2, raw = TRUE)) Residuals: Min 1Q Median 3Q Max −4.2038 −0.7669 −0.2619 1.2505 6.8684 Coefficients: (Intercept) 11.2809 17.0298 0.662 0.518 polym(x1, x2, degree = 2, raw = TRUE)1.0 −2.9603 6.5583 −0.451 0.659 polym(x1, x2, degree = 2, raw = TRUE)2.0 1.9913 1.9570 1.017 0.326 polym(x1, x2, degree = 2, raw = TRUE)0.1 −1.3573 6.1738 −0.220 0.829 polym(x1, x2, degree = 2, raw = TRUE)1.1 −0.5574 1.2127 −0.460 0.653 polym(x1, x2, degree = 2, raw = TRUE)0.2 0.2383 0.5876 0.406 0.691 Residual standard error: 2.721 on 14 degrees of freedom Multiple R−squared: 0.205, Adjusted R−squared: −0.0789 F−statistic: 0.7221 on 5 and 14 DF, p−value: 0.6178 Method2 Model_1_M2<−lm(y1 ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2) summary(Model_1_M2) Output Call: lm(formula = y1 ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2) Residuals: Min 1Q Median 3Q Max −4.2038 −0.7669 −0.2619 1.2505 6.8684 Coefficients: (Intercept) 11.2809 17.0298 0.662 0.518 x1 −2.9603 6.5583 −0.451 0.659 x2 −1.3573 6.1738 −0.220 0.829 I(x1^2) 1.9913 1.9570 1.017 0.326 I(x2^2) 0.2383 0.5876 0.406 0.691 x1:x2 −0.5574 1.2127 −0.460 0.653 Residual standard error: 2.721 on 14 degrees of freedom Multiple R−squared: 0.205, Adjusted R−squared: −0.0789 F−statistic: 0.7221 on 5 and 14 DF, p−value: 0.6178 Example2

Third degree polynomial regression model −

Model1_3degree<−lm(y1~polym(x1,x2,degree=3,raw=TRUE)) summary(Model1_3degree) Output Call: lm(formula = y1 ~ polym(x1, x2, degree = 3, raw = TRUE)) Residuals: Min 1Q Median 3Q Max −4.4845 −0.8435 −0.2514 0.8108 6.7156 Coefficients: (Intercept) 63.0178 115.9156 0.544 0.599 polym(x1, x2, degree = 3, raw = TRUE)1.0 33.3374 83.3353 0.400 0.698 polym(x1, x2, degree = 3, raw = TRUE)2.0 −10.2012 42.4193 −0.240 0.815 polym(x1, x2, degree = 3, raw = TRUE)3.0 −1.4147 6.4873 −0.218 0.832 polym(x1, x2, degree = 3, raw = TRUE)0.1 −42.6725 72.9322 −0.585 0.571 polym(x1, x2, degree = 3, raw = TRUE)1.1 −8.9795 22.7650 −0.394 0.702 polym(x1, x2, degree = 3, raw = TRUE)2.1 2.8923 7.6277 0.379 0.712 polym(x1, x2, degree = 3, raw = TRUE)0.2 9.6863 14.2095 0.682 0.511 polym(x1, x2, degree = 3, raw = TRUE)1.2 0.2289 2.6744 0.086 0.933 polym(x1, x2, degree = 3, raw = TRUE)0.3 −0.6544 0.8341 −0.785 0.451 Residual standard error: 3.055 on 10 degrees of freedom Multiple R−squared: 0.2841, Adjusted R−squared: −0.3602 F−statistic: 0.441 on 9 and 10 DF, p−value: 0.8833 Example3 Model1_4degree<−lm(y1~polym(x1,x2,degree=4,raw=TRUE)) summary(Model1_4degree) Output Call: lm(formula = y1 ~ polym(x1, x2, degree = 4, raw = TRUE)) Residuals: 1 2 3 4 5 6 7 8 4.59666 −0.41485 −0.62921 −0.62414 −0.49045 2.15614 −0.42311 −0.12903 9 10 11 12 13 14 15 16 2.27613 0.60005 −1.94649 1.79153 0.01765 0.03866 −1.40706 0.85596 17 18 19 20 0.51553 −3.71274 0.05606 −3.12731 Coefficients: (Intercept) −1114.793 2124.374 −0.525 0.622 polym(x1, x2, degree = 4, raw = TRUE)1.0 −263.858 2131.701 −0.124 0.906 polym(x1, x2, degree = 4, raw = TRUE)2.0 −267.502 1250.139 −0.214 0.839 polym(x1, x2, degree = 4, raw = TRUE)3.0 317.739 433.932 0.732 0.497 polym(x1, x2, degree = 4, raw = TRUE)4.0 −6.803 40.546 −0.168 0.873 polym(x1, x2, degree = 4, raw = TRUE)0.1 967.989 2009.940 0.482 0.650 polym(x1, x2, degree = 4, raw = TRUE)1.1 256.227 869.447 0.295 0.780 polym(x1, x2, degree = 4, raw = TRUE)2.1 −125.888 473.845 −0.266 0.801 polym(x1, x2, degree = 4, raw = TRUE)3.1 −59.450 70.623 −0.842 0.438 polym(x1, x2, degree = 4, raw = TRUE)0.2 −314.183 674.159 −0.466 0.661 polym(x1, x2, degree = 4, raw = TRUE)1.2 −18.033 112.576 −0.160 0.879 polym(x1, x2, degree = 4, raw = TRUE)2.2 34.781 57.232 0.608 0.570 polym(x1, x2, degree = 4, raw = TRUE)0.3 41.854 91.862 0.456 0.668 polym(x1, x2, degree = 4, raw = TRUE)1.3 −4.360 9.895 −0.441 0.678 polym(x1, x2, degree = 4, raw = TRUE)0.4 −1.763 4.178 −0.422 0.690 Residual standard error: 3.64 on 5 degrees of freedom Multiple R−squared: 0.4917, Adjusted R−squared: −0.9315 F−statistic: 0.3455 on 14 and 5 DF, p−value: 0.9466

## Creating Linear Model, It’s Equation And Visualization For Analysis

This article was published as a part of the Data Science Blogathon.

Introduction

Linear Regression:

Fig. 1.0: The Basic Linear Regression model Visualization

The Linear model (Linear Regression) was probably the first model you learned and created, using the model to predict the Target’s continuous values. You sure must have been happy that you’ve completed a model. You were probably also taught the theories behind its functionality– The Empirical Risks Minimization, The Mean Squared Loss, The Gradient descent, The Learning Rate among others.

Well, this is great and all of a sudden I was called to explain a model I created to the manager, all those terms were like jargons to him, and when he asked for the model visualization (as in fig 1.0) that is the model fit hyperplane(the red line) and the data points(the blue dots). I froze to my toes not knowing how to create that in python code.

Let’s begin using this random data:

X y

1 2

2 3

3 11

4 13

5 28

6 32

7 50

8 59

9 85

Method 1: Manual Formulation Importing our library and creating the Dataframe:

﻿

now at this stage, there are two ways to perform this visualization:

1.) Using  Mathematical knowledge

2.) Using the Linear_regression Attribute for scikit learns Linear_model.

Let’s get started with the Math😥😥.

just follow through it’s not that difficult, First we define the equation for a linear relationship between y(dependent variables/target) and X(independent variable/features) as :

y = mX + c

where y = Target

X = features

a = slope

b = y-intercept constant

To create the model’s equation we have to get the value of m and c , we can get this from the Y and X with the equations below:

The slope, a is interpreted as the product between the summation of the difference between each individual x value and its mean and the summation of the difference between each individual y point and its mean then divided by the summation of the square of each individual x and its mean.

The intercept is simply the mean of y  minus the product of the slope and mean of x

That is a lot to take in. probably read it over and over  till you get it, try reading with the picture

👆👆 that was the only challenge; if you’ve understood it congratulations let’s move on.

Now writing this in python code is ‘eazy-pizzy’  using the numpy library, check it out👇👇.

To blow your mind now, did you know that this is the model’s equation. and we just created a model without using scikit learn. we will confirm it now using the second method which is the scikit learn Linear Regression package

Method 2: Using scikit-learn’s Linear regression

We’ll be importing Linear regression from scikit learn, fit the data on the model then confirming the slope and the intercept. The steps are in the image below.

so you can see that there is almost no difference, now let us visualize this as in fig 1.

The red line is our line of best fit that will be used for the prediction and the blue point are our initial data. With this, I had something to report back to the manager. I particularly did this for each feature with the target to add more insight.

Now we have achieved our goal of creating a model and showing its plotted graph

This technique may be time-consuming when it comes to data with larger sizes and should only be used when visualizing your line of best fit with a particular feature for analysis purposes. It is not really necessary during modeling unless requested, Errors and calculation may suck up your time and computation resources especially if you’re working with 3D and higher data. But the insight gotten is worth it.

I hope you enjoyed the article if yes that great, you can also tell me how to improve in any way. I still have a lot to share especially on regressions(Linear, Logistics, and Polynomial ).