Practicing Machine Learning Techniques In R With MLR Package

Introduction

In R, we often use multiple packages for various machine learning tasks. For example: we impute missing values with one package, build a model with another, and finally evaluate its performance with a third.

The problem is that every package has its own set of parameters. While working with many packages, we end up spending a lot of time figuring out which parameters are important. Don’t you think?

To solve this problem, I researched and came across an R package named MLR, which is absolutely incredible at performing machine learning tasks. This package includes all of the ML algorithms we use frequently.

In this tutorial, I’ve taken up a classification problem and tried improving its accuracy using machine learning. I haven’t explained the ML algorithms theoretically; the focus is on their implementation. By the end of this article, you are expected to become proficient at implementing several ML algorithms in R, but only if you practice alongside.

Note: This article is meant only for beginners and early starters with machine learning in R. Basic statistics knowledge is required.

Table of Contents

Getting Data

Exploring Data

Missing Value Imputation

Feature Engineering

Outlier Removal by Capping

New Features

Machine Learning

Feature Importance

QDA

Logistic Regression

Cross Validation

Decision Tree

Cross Validation

Parameter Tuning using Grid Search

Random Forest

SVM

GBM (Gradient Boosting)

Cross Validation

Parameter Tuning using Random Search (Faster)

XGBoost (Extreme Gradient Boosting)

Feature Selection

Machine Learning with MLR Package

Until recently, R didn’t have any package / library similar to Python’s Scikit-Learn, wherein you could get all the functions required to do machine learning. But with the mlr package, R users can now perform most of their ML tasks from a single package.

Let’s now understand the basic concept of how this package works. If you get it right here, understanding the whole package would be a mere cakewalk.

The entire structure of this package relies on this premise:

Create a Task. Make a Learner. Train Them.

Creating a task means loading data into the package. Making a learner means choosing an algorithm (learner) which learns from the task (or data). Finally, train them.
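Put concretely, the three steps can be sketched like this (a minimal illustration, not the article's exact code; the data frame `train` and target `Loan_Status` are assumptions based on the data set used later):

```r
library(mlr)

# 1. Create a task: load the data into the package
task <- makeClassifTask(data = train, target = "Loan_Status")

# 2. Make a learner: choose an algorithm
learner <- makeLearner("classif.logreg")

# 3. Train the learner on the task
model <- train(learner, task)
```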

The MLR package has several algorithms in its bouquet. These algorithms have been categorized into regression, classification, clustering, survival, multilabel classification, and cost-sensitive classification. Let’s look at some of the available algorithms for classification problems:

22 classif.xgboost                 xgboost

And, there are many more. Let’s start working now!

1. Getting Data

For this tutorial, I’ve taken up one of the popular ML problems from DataHack (a one-time login will be required to get the data): Download Data.

After you’ve downloaded the data, let’s quickly get done with initial commands such as setting the working directory and loading data.

2. Exploring Data

Once the data is loaded, you can access it using:

Loan_Status      factor   0      NA         0.3127036     NA        NA    192    422    2

This function gives a much more comprehensive view of the data set than the base str() function. Shown above is the last row of the result. Similarly, you can do the same for the test data:

From these outputs, we can make the following inferences:

In the data, we have 12 variables, out of which Loan_Status is the dependent variable and rest are independent variables.

Train data has 614 observations. Test data has 367 observations.

In train and test data, 6 variables have missing values (can be seen in na column).

ApplicantIncome and CoapplicantIncome are highly skewed variables. How do we know that? Look at their min, max, and median values. We’ll have to normalize these variables.

LoanAmount, ApplicantIncome, and CoapplicantIncome have outlier values, which should be treated.

Credit_History is an integer type variable. But, being binary in nature, we should convert it to factor.

Also, you can check the presence of skewness in variables mentioned above using a simple histogram.

As you can see in the charts above, skewness is nothing but the concentration of the majority of the data on one side of the chart. What we see is a right-skewed graph. To visualize outliers, we can use a boxplot:

Similarly, you can create a boxplot for CoapplicantIncome and LoanAmount as well.

Let’s change the class of Credit_History to factor. Remember, the class factor is always used for categorical variables.
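The conversion is a one-liner in base R (a hedged sketch; `train` and `test` are the data frame names assumed throughout):

```r
# Convert the binary Credit_History variable to a factor in both data sets
train$Credit_History <- as.factor(train$Credit_History)
test$Credit_History  <- as.factor(test$Credit_History)
```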

To check the changes, you can do:

[1] "factor"

You can further scrutinize the data using:

We find that the variable Dependents has a level 3+, which shall be treated too. It’s quite simple to rename the levels of a factor variable. It can be done as:
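One way to do this in base R (a sketch; mapping 3+ to 3 is an illustrative choice):

```r
# Rename the "3+" level of Dependents in both data sets
levels(train$Dependents)[levels(train$Dependents) == "3+"] <- "3"
levels(test$Dependents)[levels(test$Dependents) == "3+"]   <- "3"
```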

3. Missing Value Imputation

Not just beginners, even good R analysts struggle with missing value imputation. The MLR package offers a nice and convenient way to impute missing values using multiple methods. Now that we are done with the much-needed modifications in the data, let’s impute the missing values.

In our case, we’ll use basic mean and mode imputation to impute data. You can also use any ML algorithm to impute these values, but that comes at the cost of computation.
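A hedged sketch of what this imputation call might look like with mlr’s impute function (object names are illustrative; the dummy.classes and dummy.type arguments are explained below):

```r
# Mean imputation for numeric/integer columns, mode imputation for factors
imp <- impute(train,
              classes = list(integer = imputeMean(), factor = imputeMode()),
              dummy.classes = c("integer", "factor"),
              dummy.type = "numeric")
imp_train <- imp$data
```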

This function is convenient because you don’t have to specify each variable name to impute. It selects variables on the basis of their classes. It also creates new dummy variables for missing values. Sometimes, these (dummy) features contain a trend which can be captured using this function. dummy.classes specifies the classes for which dummy variables should be created, and dummy.type specifies the class of the new dummy variables.

The $data attribute of the imputed object contains the imputed data.

Now, we have the complete data. You can check the new variables using:

Did you notice a disparity between the two data sets? No? Look again. The answer is that the Married.dummy variable exists only in imp_train and not in imp_test. Therefore, we’ll have to remove it before the modeling stage.

Optional: You might be excited or curious to try imputing missing values using an ML algorithm. In fact, some algorithms don’t require you to impute missing values at all. You can simply supply them the missing data; they take care of missing values on their own. Let’s see which algorithms they are:

8 classif.rpart                  rpart


4. Feature Engineering

Feature engineering is the most interesting part of predictive modeling. It has two aspects: feature transformation and feature creation. We’ll try to work on both aspects here.

At first, let’s remove outliers from variables like ApplicantIncome, CoapplicantIncome, and LoanAmount. There are many techniques to remove outliers. Here, we’ll cap all the large values in these variables at a threshold value, as shown below:
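The capping itself can be a simple base-R assignment. The 33000 threshold for ApplicantIncome comes from the text below; the other two thresholds are illustrative assumptions:

```r
# Cap large values at chosen thresholds
cd_train$ApplicantIncome[cd_train$ApplicantIncome > 33000]       <- 33000
cd_train$CoapplicantIncome[cd_train$CoapplicantIncome > 15000]   <- 15000
cd_train$LoanAmount[cd_train$LoanAmount > 500]                   <- 500
# repeat the same assignments for cd_test
```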

I’ve chosen the threshold values at my discretion, after analyzing the variable distributions. To check the effect, you can do summary(cd_train$ApplicantIncome) and see that the maximum value is capped at 33000.

In both data sets, we see that all dummy variables are numeric in nature. Being binary in form, they should be categorical. Let’s convert their classes to factor. This time, we’ll use simple for and if loops.


These loops say: ‘for every column from column number 14 to 20 of the cd_train / cd_test data frame, if the class of that variable is numeric, take the unique values of that column as levels and convert it into a factor (categorical) variable.’
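A sketch of those loops (the column range 14:20 is taken from the description above):

```r
# Convert numeric dummy columns 14 to 20 into factors
for (f in names(cd_train[, 14:20])) {
  if (class(cd_train[[f]]) == "numeric") {
    cd_train[[f]] <- factor(cd_train[[f]], levels = unique(cd_train[[f]]))
  }
}
# repeat the same loop for cd_test
```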

Let’s create some new features now.

While creating new features (if they are numeric), we must check their correlation with existing variables, as new features often turn out to be highly correlated. Let’s see if our new variable too happens to be correlated:
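For instance, creating a Total_Income feature and checking its correlation might look like this (a sketch; the feature definition is an assumption consistent with the discussion that follows):

```r
# New feature: total household income
cd_train$Total_Income <- cd_train$ApplicantIncome + cd_train$CoapplicantIncome
# Check correlation with an existing variable
cor(cd_train$Total_Income, cd_train$ApplicantIncome)
```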

As we see, there exists a very high correlation of Total_Income with ApplicantIncome. It means that the new variable isn’t providing any new information. Thus, this variable is not helpful for modeling data.

Now we can remove the variable.

There is still enough potential left to create new variables. Before proceeding, I want you to think deeper about this problem and try creating newer variables. After making so many modifications to the data, let’s check it again:

5. Machine Learning

Until here, we’ve performed all the important transformation steps except normalizing the skewed variables. That will be done after we create the task.

As explained in the beginning, for mlr a task is nothing but the data set on which a learner learns. Since it’s a classification problem, we’ll create a classification task. The task type solely depends on the type of problem at hand.
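A sketch of the task creation (for the test task, a placeholder Loan_Status column would need to be added to cd_test first, since the test data has no target):

```r
# Create the classification task on the training data
trainTask <- makeClassifTask(data = cd_train, target = "Loan_Status")
```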

Let’s check trainTask

Positive class: N

As you can see, it provides a description of cd_train data. However, an evident problem is that it is considering positive class as N, whereas it should be Y. Let’s modify it:
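One way to do this is to recreate the task with the positive class set explicitly (a sketch; makeClassifTask accepts a positive argument):

```r
# Set "Y" as the positive class
trainTask <- makeClassifTask(data = cd_train, target = "Loan_Status", positive = "Y")
```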

For a deeper view, you can check your task data using str(getTaskData(trainTask)).

Now, we will normalize the data. For this step, we’ll use the normalizeFeatures function from the mlr package. By default, this function normalizes all the numeric features in the data. Thankfully, the only 3 variables we have to normalize are numeric; the rest of the variables have classes other than numeric.
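The call might look like this (a hedged sketch; the choice of "standardize" as the method is illustrative):

```r
# Standardize numeric features on both tasks
trainTask <- normalizeFeatures(trainTask, method = "standardize")
testTask  <- normalizeFeatures(testTask, method = "standardize")
```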

Before we start applying algorithms, we should remove the variables which are not required.

The MLR package has an inbuilt function which returns the important variables from the data. Let’s see which variables are important. Later, we can use this knowledge to subset our input predictors for model improvement. While running this code, R might prompt you to install the ‘FSelector’ package, which you should do.
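A sketch of that importance check using mlr's filter interface (information.gain is the filter method discussed below; the object name is illustrative):

```r
# Compute information-gain importance for each feature and plot it
im_feat <- generateFilterValuesData(trainTask, method = "information.gain")
plotFilterValues(im_feat)
```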

If you are still wondering about information.gain, let me provide a simple explanation. Information gain is generally used in the context of decision trees: every node split in a decision tree is based on information gain. In general, it tries to find the variables which carry the maximum information, using which the target class is easier to predict.

Let’s start modeling now. I won’t explain these algorithms in detail, but I’ve provided links to helpful resources. We’ll take up the simpler algorithms first and end this tutorial with the more complex ones.

With MLR, we can choose & set algorithms using makeLearner. This learner will train on trainTask and try to make predictions on testTask.

1. Quadratic Discriminant Analysis (QDA).

In general, QDA is a parametric algorithm. Parametric means that it makes certain assumptions about the data. If the data actually follows those assumptions, such algorithms sometimes outperform several non-parametric algorithms. Read More.

Upload this submission file and check your leaderboard rank (it wouldn’t be good). Our accuracy is ~71.5%. I understand this submission might not put you near the top of the leaderboard, but there’s a long way to go. So, let’s proceed.

2. Logistic Regression

This time, let’s also check the cross-validation accuracy. Higher CV accuracy indicates that our model does not suffer from high variance and generalizes well on unseen data.

Similarly, you can perform CV for any learner. Isn’t it incredibly easy? I’ve used stratified sampling with 3-fold CV. I’d always recommend stratified sampling in classification problems, since it maintains the proportion of the target class across the n folds. We can check CV accuracy by:
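Such a CV run might look like this (a hedged sketch; the learner and result object names are illustrative):

```r
# Stratified 3-fold cross-validation for logistic regression
logistic.learner <- makeLearner("classif.logreg", predict.type = "response")
cv.logistic <- crossval(learner = logistic.learner, task = trainTask,
                        iters = 3L, stratify = TRUE, measures = acc,
                        show.info = FALSE)
cv.logistic$aggr           # average accuracy across folds
cv.logistic$measures.test  # per-fold accuracy
```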

This is the average accuracy calculated over the 3 folds. To see the accuracy of each fold, we can do this:

3  3    0.7598039

Now, we’ll train the model and check the prediction accuracy on test data.

Woah! This algorithm gave us a significant boost in accuracy. Moreover, this is a stable model since our CV score and leaderboard score matches closely. This submission returns accuracy of 79.16%. Good, we are improving now. Let’s get ahead to the next algorithm.

3. Decision Tree

A decision tree is said to capture non-linear relations better than a logistic regression model. Let’s see if we can improve our model further. This time we’ll tune the tree’s hyperparameters to achieve optimal results. To get the list of parameters for any algorithm, simply write (in this case rpart):

This will return a long list of tunable and non-tunable parameters. Let’s build a decision tree now. Make sure you have installed the rpart package before creating the tree learner:

I’m doing a 3 fold CV because we have less data. Now, let’s set tunable parameters:

)

As you can see, I’ve set 3 parameters. minsplit represents the minimum number of observations in a node for a split to take place. minbucket sets the minimum number of observations to keep in terminal nodes. cp is the complexity parameter: the smaller it is, the more specific relations the tree will learn from the data, which might result in overfitting.
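The parameter set described above can be sketched with mlr's makeParamSet (the value ranges are illustrative assumptions):

```r
# Tunable parameter set for rpart
gs <- makeParamSet(
  makeIntegerParam("minsplit", lower = 10L, upper = 50L),
  makeIntegerParam("minbucket", lower = 5L, upper = 50L),
  makeNumericParam("cp", lower = 0.001, upper = 0.2)
)
```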

You may go and take a walk until the parameter tuning completes. Maybe go catch some Pokémon! It took 15 minutes to run on my machine (Windows, Intel i5, 8 GB RAM).

# [1] 0.001

It returns a list of best parameters. You can check the CV accuracy with:

0.8127132

Using setHyperPars function, we can directly set the best parameters as modeling parameters in the algorithm.
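A sketch of that step (tp_dt is an illustrative name for the tuneParams result, whose $x slot holds the best parameters):

```r
# Apply the tuned parameters and train the final tree
best.rpart <- setHyperPars(makeLearner("classif.rpart"), par.vals = tp_dt$x)
t.rpart <- train(best.rpart, trainTask)
```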

getLearnerModel(t.rpart)

The decision tree is doing no better than logistic regression. This algorithm has returned roughly the same accuracy (79.14%) as logistic regression. So, one tree isn’t enough. Let’s build a forest now.

4. Random Forest

Random Forest is a powerful algorithm known to produce astonishing results. Its predictions derive from an ensemble of trees: it averages the prediction given by each tree and produces a generalized result. From here, most of the steps are similar to those followed above, but this time I’ve done random search instead of grid search for parameter tuning, because it’s faster.

)

)

Though random search is faster than grid search, it sometimes turns out to be less efficient. In grid search, the algorithm tunes over every possible combination of the parameters provided. In random search, we specify the number of iterations and it randomly passes over parameter combinations. In this process, it might miss some important combination of parameters which could have returned maximum accuracy, who knows.
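The random-search setup can be sketched with mlr's tuning control objects (the 50-iteration budget is an illustrative assumption):

```r
# Random search over the parameter space, evaluated with stratified 3-fold CV
rancontrol <- makeTuneControlRandom(maxit = 50L)
set_cv <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
```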

Now, we have the final parameters. Let’s check the list of parameters and CV accuracy.

0.8192571

[1] 168

[1] 6

[1] 29

Let’s build the random forest model now and check its accuracy.

5. Support Vector Machine (SVM)

Support Vector Machine (SVM) is also a supervised learning algorithm used for regression and classification problems. In general, it creates a hyperplane in n-dimensional space to classify the data based on the target class. Let’s step away from tree algorithms for a while and see if this algorithm can bring us some improvement.

Since most of the steps are similar to those performed above, I don’t think understanding this code will be a challenge for you anymore.

)

0.8062092

This model returns an accuracy of 77.08%. Not bad, but lower than our highest score. Don’t feel hopeless here. This is core machine learning: ML doesn’t work unless it gets some good variables. Maybe you should think longer about the feature engineering aspect and create more useful variables. Let’s do boosting now.

6. GBM (Gradient Boosting)

Now you are entering the territory of boosting algorithms. GBM performs sequential modeling, i.e., after one round of prediction, it checks for incorrect predictions, assigns them relatively more weight, and predicts them again until they are predicted correctly.

)

n.minobsinnode refers to the minimum number of observations in a tree node. shrinkage is the regularization parameter which dictates how fast / slow the algorithm should move.

The accuracy of this model is 78.47%. GBM performed better than SVM, but couldn’t exceed random forest’s accuracy. Finally, let’s test XGboost also.

7. XGBoost (Extreme Gradient Boosting)

XGBoost is considered to be better than GBM because of its inbuilt properties, including first- and second-order gradients, parallel processing, and the ability to prune trees. A general implementation of xgboost requires you to convert the data into a matrix. With mlr, that is not required.

As I said in the beginning, a benefit of using this (MLR) package is that you can follow same set of commands for implementing different algorithms.

)

)

Terrible, XGBoost. This model returns an accuracy of 68.5%, even lower than QDA. What could have happened? Overfitting. The model returned a CV accuracy of ~80%, but the leaderboard score declined drastically because the model couldn’t predict correctly on unseen data.

What can you do next? Feature Selection ?

For improvement, let’s do this: until here, we’ve used trainTask for model building. Now let’s use our knowledge of important variables. Take the first 6 important variables and train the models on them. You can expect some improvement. To create a task selecting only the important variables, do this:

Also, try to create more features. The current leaderboard winner is at ~81% accuracy. If you have followed me till here, don’t give up now.

End Notes

The motive of this article was to get you started with machine learning techniques. These techniques are commonly used in industry today, so make sure you understand them well. Don’t use these algorithms as black-box approaches; understand how they work. I’ve provided links to resources.

What happened above happens a lot in real life: you try many algorithms and don’t get an improvement in accuracy. But you shouldn’t give up. As a beginner, you should explore other ways to improve accuracy. Remember, no matter how many wrong attempts you make, you just have to be right once.

You might have to install packages while loading these models, but that’s one time only. If you followed this article completely, you are ready to build models. All you have to do is, learn the theory behind them.




Model Validation In Machine Learning

Introduction

Model validation is a technique where we validate a model that has been built by gathering, preprocessing, and feeding appropriate data to machine learning algorithms. We cannot simply feed the data to a model, train it, and deploy it. It is essential to validate the performance or results of a model to check whether it is performing as per our expectations. There are multiple model validation techniques used to evaluate and validate a model, according to the different types of models and their behaviors.

What is Model Validation?

Machine learning is all about the data: its quality, its quantity, and what we do with it. Most of the time, after collecting the data, we have to clean it, preprocess it, and then apply the appropriate algorithm to get the best-fit model out of it. But after getting a model, the task is not done; model validation is as important as training.

Directly training and then deploying a model would not work. In sensitive areas like healthcare, where real-life predictions have to be made, there is a huge amount of risk associated; in such cases, an error in the model can cost a lot.

Advantages of Model Validation

Quality of the Model

The Flexibility of the Model

Secondly, validating the model makes it easy to get an idea about its flexibility. Model validation also helps make the model more flexible.

Overfitting and Underfitting

Model validation helps identify whether the model is underfitted or overfitted. In the case of overfitting, the model gives high accuracy on the training data but performs poorly during the validation phase. In the case of underfitting, the model does not perform well during either the training or the validation phase.

There are many techniques available for validating the model; let us try to discuss them one by one.

Train Test Split

Train test split is one of the most basic and easy model validation techniques. Here we split the data into two parts, the training set and the testing set, and we can choose the ratio in which we want to split the data.
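A minimal base-R sketch of such a split (the data frame name `df` and the 80/20 ratio are illustrative assumptions):

```r
# Randomly split df into an 80% training set and a 20% testing set
set.seed(42)
idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train_set <- df[idx, ]
test_set  <- df[-idx, ]
```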

There is one problem associated with the train test split method: if some subset of the data (say, a category of a variable) is present in the testing set but not in the training set, the model can perform poorly, or even error out, on it.

Hold Out Approach

The hold-out approach is very similar to the train test split method; here, we just make one additional split of the data. With only a train test split, information from the test set can leak into our modeling decisions, due to which overfitting can take place. To overcome this issue, we split the data into one more part, called the hold-out or validation split.

So basically, here, we train our data on the big training set and then test the model on the testing set. Once the model performs well on both the training and testing set, we try the model on the final validation split to get an idea about the behavior of the model in unknown datasets.

K Fold Cross Validation

K-fold cross-validation is one of the most widely used and most reliable methods for splitting the data into training and testing points. Despite the shared letter, it has nothing to do with the KNN algorithm: here, K simply denotes the number of splits (folds) of the data.

In this method, instead of splitting the data a single time, we split it multiple times based on the value of K. Suppose the value of K is defined as 5. Then the model will split the dataset five times and choose different training and testing sets each time.
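The fold assignment can be sketched in base R like this (a hedged illustration; packages like caret or mlr provide this out of the box, and `df` is an assumed data frame):

```r
# 5-fold cross-validation: assign each row to one of k folds
k <- 5
set.seed(42)
folds <- sample(rep(1:k, length.out = nrow(df)))
for (i in 1:k) {
  test_set  <- df[folds == i, ]   # fold i is the testing set
  train_set <- df[folds != i, ]   # remaining folds form the training set
  # fit on train_set, evaluate on test_set, and store the score
}
```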

Leave One Out Method

Leave one out is also a variant of the K-fold cross-validation technique, where K is defined as n, the number of samples or data observations in our dataset. Here the model trains and tests on every data sample: each sample in turn is used as the testing set, with all the others forming the training set.

Although this method is not widely used, the hold-out and K-fold approaches solve most of the issues related to model validation.

Key Takeaways

Model validation is one of the most important tasks in machine learning which should be implemented for all models to be deployed.

Model validation gives us an idea about the behavior of the model, its performance on the data, problems like overfitting and underfitting, and the errors associated with the model.

Train test split and hold-out approaches are the easiest and most common methods for model validation, where the data is split into two or three parts and the model is validated on the testing set.

Leave one out is a variant of the K-fold approach where the model leaves one observation of the data out of the training set and uses it as the testing set.


Building A Machine Learning Model In Bigquery

Introduction

Google’s BigQuery is a powerful cloud-based data warehouse that provides fast, flexible, and cost-effective data storage and analysis capabilities. One of its unique features is the ability to build and run machine learning models directly inside the database without extracting the data and moving it to another platform.

BigQuery was created to analyse data with billions of rows using SQL-like syntax. It is hosted on the Google Cloud Storage infrastructure and is accessible via a REST-oriented application programming interface (API).

Learning Objectives

In this article, we will:

Understand the process of building a machine learning model in BigQuery.

Learn the key steps of ETL, feature selection and preprocessing, model creation, performance evaluation, and prediction.

This article was published as a part of the Data Science Blogathon.

Advantages of BigQuery

Scalability: BigQuery is a fully managed, cloud-native data warehouse that can easily handle petabyte-scale datasets. This makes it an ideal platform for machine learning, as it can handle large amounts of data and provide fast, interactive results.

Cost-effectiveness: BigQuery is designed to be cost-effective, with a flexible pricing model that allows you to only pay for what you use. This makes it an affordable option for machine learning, even for large and complex projects.

Integration with other Google Cloud services: BigQuery integrates seamlessly with other Google Cloud services, such as Google Cloud Storage, Google Cloud AI Platform, and Google Cloud Data Studio, providing a complete machine learning solution that is easy to use and scalable.

SQL Support: BigQuery supports standard SQL, which makes it easy for data analysts and developers to work with data and build machine learning models, even if they do not have a background in machine learning.

Security and Privacy: BigQuery implements the highest levels of security and privacy, with support for encryption at rest and in transit and strict access controls to ensure your data is secure and protected.

Real-time Analytics: BigQuery provides real-time analytics capabilities, allowing you to run interactive queries on large datasets and get results in seconds. This makes it an ideal platform for machine learning, enabling you to test and iterate your models quickly and easily.

Open-source Integrations: BigQuery supports several open-source integrations, such as TensorFlow, Pandas, and scikit-learn, which make it easy to use existing machine learning libraries and models with BigQuery.

Step-by-step Tutorial to Build a Machine Learning Model in BigQuery

This guide will provide a step-by-step tutorial on how to build a machine-learning model in BigQuery, covering the following five main stages:

Extract, Transform, and Load (ETL) Data into BigQuery

Select and Preprocess Features

Create the Model Inside BigQuery

Evaluate the Performance of the Trained Model

Use the Model to Make Prediction

Step 1. Extract, Transform, and Load (ETL) Data into BigQuery

The first step in building a machine learning model in BigQuery is to get the data into the database. BigQuery supports loading data from various sources, including cloud storage (such as Google Cloud Storage or AWS S3), local files, or even other databases (such as Google Cloud SQL).

For the purposes of this tutorial, we will assume that we have a dataset in Google Cloud Storage that we want to load into BigQuery. The data in this example will be a simple CSV file with two columns: ‘age’ and ‘income’. Our goal is to build a model that predicts income based on age.

To load the data into BigQuery, we first need to create a new table in the database. This can be done using the BigQuery web interface or the command line tool.

Once the table is created, we can use the `bq` command line tool to load the data from our Cloud Storage bucket:

bq load --source_format=CSV mydataset.mytable gs://analyticsvidya/myfile.csv age:INTEGER,income:FLOAT

This command specifies that the data is in CSV format and that the columns ‘age’ and ‘income’ should be treated as an integer and float values, respectively.

Step 2. Select and Preprocess Features

The next step is to select and preprocess the features we want to use in our model. In this case, we only have two columns in our dataset, ‘age’ and ‘income’, so there’s not much to do here. However, in real-world datasets, there may be many columns with various data types, missing values, or other issues that need to be addressed.

One common preprocessing step is to normalize the data. Normalization is the process of scaling the values of a column to a specific range, such as [0,1]. This is useful for avoiding biases in the model due to differences in the scales of different columns.

In BigQuery, we can min-max scale the data with standard SQL. (Note: BigQuery’s built-in NORMALIZE function performs Unicode string normalization, not numeric scaling, so we compute the scaling directly.)

WITH data AS ( SELECT age, income FROM mydataset.mytable ), stats AS ( SELECT MIN(income) AS min_income, MAX(income) AS max_income FROM data ) SELECT age, (income - min_income) / (max_income - min_income) AS income_norm FROM data, stats

This query creates a new table with the same data as the original table but with normalized income values.

Step 3. Create the Model Inside BigQuery

Once the data is preprocessed, we can create the machine learning model. BigQuery supports a variety of machine learning models, including linear regression, logistic regression, decision trees, and more.

For this example, we will use a simple linear regression model, which predicts a continuous target variable based on one or more independent variables. In our case, the target variable is income, and the independent variable is age.

To create the linear regression model in BigQuery, we use the `CREATE MODEL` statement:

CREATE MODEL mydataset.mymodel OPTIONS (model_type='linear_reg', input_label_cols=['income']) AS SELECT age, income FROM `mydataset.mytable`

This statement creates a new model called `mymodel` in the `mydataset` dataset. The `OPTIONS` clause specifies that the model type is linear regression and that the input label column is `income`. The `AS` clause specifies the query that returns the data used to train the model.

Following are the different models supported by BigQuery

Linear Regression: This is a simple and widely used model for predicting a continuous target variable based on one or more independent variables.

Logistic Regression: This type of regression model predicts a binary target variable based on one or more independent variables, such as yes/no or 0/1.

K-Means Clustering: This is an unsupervised learning algorithm that is used to divide a set of data points into K clusters, where each cluster is represented by its centroid.

Time-series Models: These models are used to forecast future values based on past values of time-series data, such as sales data or stock prices.

Random Forest: This ensemble learning method combines the predictions of multiple decision trees to create a more accurate and robust prediction.

Gradient Boosted Trees: This is another ensemble learning method that combines the predictions of multiple decision trees but uses a gradient-based optimization approach to create a more accurate and robust prediction.

Neural Networks: This is a machine learning model inspired by the structure and function of the human brain. Neural networks are used for various tasks, such as image classification, natural language processing, and speech recognition.

In addition to these models, BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.

Step 4. Evaluate the Performance of the Trained Model

Once the model is created, we need to evaluate its performance to see how well it can predict the target variable based on the independent variable. This can be done by splitting the data into a training set and a test set and using the training set to train the model and the test set to evaluate its performance.

In BigQuery, we can use the `ML.EVALUATE` function to evaluate the performance of the model:

SELECT * FROM ML.EVALUATE(MODEL mydataset.mymodel, ( SELECT age, income FROM `mydataset.mytable`))

This query returns various metrics, including mean squared error, root mean squared error, mean absolute error, and R-squared, all of which measure how well the model can predict the target variable.

Step 5. Use the Model to Make Predictions

Finally, once we have evaluated the model’s performance, we can use it to predict new data. In BigQuery, we can use the `ML.PREDICT` function to make predictions:

SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, ( SELECT age FROM `mydataset.mytable`))

This query returns the original data along with the predicted income values.

Conclusion

Key Takeaways from this article:

BigQuery supports several open-source integrations, such as TensorFlow, Pandas, and scikit-learn, which makes it easy to use existing machine learning libraries and models with BigQuery.

BigQuery also provides several pre-trained models and APIs that can be used for common machine learning tasks, such as sentiment analysis, entity recognition, and image labeling.

In the dataset we studied, the prediction query returns the original data along with the predicted income values.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


How Does Machine Learning Work In Paid Search Marketing?

All modern ad platforms now factor machine learning into their algorithms. Managing successful campaigns requires an understanding of the machine learning in each ad network.

This Ask the PPC question, from Chhote Lal in New Delhi, is an important one for account managers and those they report to:

“How does Google’s machine learning work in paid marketing?”

In this column, you’ll learn:

What is machine learning?

How does machine learning factor into paid search campaigns?

How to optimize for paid search machine learning.

Since the question was specifically about search, we’ll focus on search-first uses.

What is Machine Learning?

Through machine learning, algorithms are taught to process information. The more data an algorithm has, the faster it learns what to do with that information.

Different data points can carry different weights in the algorithm. It’s important to understand how data points are valued.

Data points can be completely objective, subjective, or a hybrid of human interaction and pure algorithmic learning.

Knowing what you can control is crucial to your success as you partner with ad network machine learning.

The other critical factor is the learning period (ensuring the algorithm is given enough time to process the data points).

How Does Machine Learning Factor Into Paid Search Campaigns?

Machine learning impacts almost all of paid search. Any major change can influence how the algorithm processes your campaign.

These changes include:

Bidding and Budgets: Drastic changes to budgets or changing bidding strategies.

Audiences: Changing targets or excluding targets.

Creative: Changing or adding creative creates a new version of the ad that won’t have access to the old ad’s stats.

Campaign status: Pausing campaigns resets the learning period.

It’s important to note that manual campaigns aren’t as impacted by these changes; however, it is increasingly hard to run purely manual campaigns.

Running a manual campaign means opting out of the 60+ signals ad networks leverage in their smart bidding.

These signals are used to adjust bids according to the bidding strategy chosen and the given budget.

Machine learning isn’t always an active choice. Keyword matching and audience tagging happen in the background and are based on historic data.

Native audiences (in-market, affinity, etc.) are based on the algorithm learning that people completing one action are likely to complete another action/have other linked traits.

When you ask the ad platform to find “similar” audiences to an uploaded list/website visitors, you’re using the seed audience to help the ad platform understand which prospects you find valuable and which ones you don’t.

Keyword matching and close variants are influenced by the likelihood of profitable outcomes, as well as real-time user behavior.

How to Optimize for Paid Search Machine Learning

It’s a lot easier to optimize when one has empathy for paid search’s machine learning.

The most important mechanic is honoring learning periods and avoiding accidental resets.

If you need to scale a campaign, for example, be sure to budget two weeks between each major budget increase.

If you need your campaign to slow down (or stop), lower the budget instead of pausing so you don’t reset the learning period.

Negative keywords and audiences can help ad platform algorithms understand which ideas and behaviors to match budget to (and which to avoid).

This is the most powerful way to influence machine learning and should be a part of all paid search accounts.

Conversions and conversion values are under-utilized machine learning tools. They are the easiest way to communicate with the paid search algorithm and allow you to see user behavior without asking the ad channel to value the action.

Takeaways

Machine learning impacts almost all elements of paid search and understanding how to teach the algorithm is crucial to PPC success.

More Resources:

Have a question about PPC? Submit via this form or tweet me @navahf with the #AskPPC hashtag. See you next month!

Featured image: Paulo Bobita/SearchEngineJournal

How Machine Learning Models Fail To Deliver In Real

This article was published as a part of the Data Science Blogathon.

Introduction

Yesterday, my brother broke an antique at home. I began to search for FeviQuick (a classic glue) to put it back together. Given that it’s one of the most misplaced items, I began to search for it in every possible drawer and every untouched corner of the house I hadn’t been to in the past 3 months. I gave up the search after an hour – the FeviQuick was nowhere to be found. I narrowed down the places to look and began to search again, only to find it pressed under three books!

But what does this story have to do with machine learning models? Let me explain how I thought through this versus how a machine learning model will think it through.

The Human Mind vs. The Machine – Problem Solving

I would make a list of all the possible places it could be (data). My brain would automatically prioritize which places to search thoroughly. This could be considered as assigning a probability to each of the places (prior). Without yielding any results after my first search, I started thinking about all the probable places. My mother thought she had last seen it in my bedroom (new data). I now thought about the likelihood of where the object could be found, given it was in my bedroom (likelihood).

In this whole thinking process, my mind is assigning new probabilities to each of these places (posterior). Basically, my value of probability is not constrained to one value, but a range of values. This is exactly how Bayesian probabilities work: not a single value, but a distribution of values.

Now for the machine. We input the data containing all the places where the object could possibly be. Each place would be equally likely for the model, as the machine knows no such thing as bias, nor does it have any prior training data to refer to.

Since the machine is unable to get any result, it assigns a value of zero to each of the probabilities. Even if we taught the machine to consider only a few places, it would give each place an equally likely probability, which would not help our case. Here, we can see that the probability is confined to a single value. The machine assigns zero if the object isn’t found, leaving no room for the possibility that it made an error (all machines are prone to error).

There’s no space for the ML model to incorporate uncertainty and probability!

Machine Learning in the Real World

Consider this being applied to a real-life problem, where there has been a kidnapping and we try to narrow down the possibilities where the kidnapper could’ve taken the person.

This is the prior knowledge we may have, depicting the likelihood of finding the kidnapper given that he is in that square.

As humans, we know that no place can be ruled out: the kidnapper need not be stationary, and there is always a possibility that he is hidden in a place we know nothing about. Hence, we assign a probability to each location based on the type of buildings present in each block. A higher probability has been given to areas that are shady and remotely located. This is called the likelihood.

Now, let’s say we get a tip from a person who says he clearly saw the kidnapper go straight ahead from block E13. Then our probabilities would change as follows:

The Posterior Probability of where the kidnapper could be.

Here, we can see that there has been a shift in the alignment of probabilities. The cells towards the right of block E12 have gotten a rise in the green shade. This is also sometimes referred to as the posterior probability. The likelihood has been multiplied by a value E such that the posterior probability has been updated in those regions. This value E is called the evidence.

Despite the information, it has not assigned zero probability to the blocks before E12, because it does take uncertainty into account. 

This hierarchy can be applied as a chain of events, with the evidence updated as soon as new information is fed in. The posterior from the previous situation becomes the prior for the updated situation. Let us give a formula to what we’ve been visualizing, which is Bayes’ rule: P(place | evidence) = P(evidence | place) × P(place) / P(evidence).
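The updating chain described here can be sketched numerically with Bayes’ rule (posterior ∝ likelihood × prior). The block probabilities below are invented purely for illustration:

```python
def bayes_update(prior, likelihood):
    """Posterior is proportional to likelihood times prior, renormalised to sum to 1."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)  # the normalising constant
    return [u / evidence for u in unnormalised]

# five hypothetical blocks, initially equally likely
prior = [0.2] * 5

# a tip points towards the right-hand blocks; the earlier blocks keep a
# small non-zero likelihood because the tip itself might be wrong
posterior = bayes_update(prior, [0.05, 0.05, 0.1, 0.4, 0.4])
print([round(p, 3) for p in posterior])  # [0.05, 0.05, 0.1, 0.4, 0.4]

# the posterior becomes the prior when the next clue arrives
posterior2 = bayes_update(posterior, [0.1, 0.1, 0.2, 0.3, 0.3])
```

Note that no block is ever forced to exactly zero: the left-hand blocks shrink but remain possible, which is precisely the uncertainty the frequentist assignment throws away.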

For a machine learning model, such as logistic regression, or even a multi-layer perceptron, this Bayesian reasoning is not built into training. Such a model would assign values that look something like this:

An equally likely probability assigned to each block

Now, let us see how a machine learning model would react to new information being added as before, i.e., the kidnapper had been sighted going towards the right of E12:

Here the model has very little representation of uncertainty and shows too much bias and dependency on the data

Can you see how biased the model has become? It has straightaway assigned a probability of zero to the places before E12. This is inconvenient, as it does not account for the possibility that the information provided might be wrong; the model might already be looking in the wrong direction and wasting time.

This is because the models follow the frequentist approach: if the item is not found, its probability will be zero regardless of any other conditions (the kidnapper may have traveled to the left of block E12, or the informant may be wrong).

End Notes

So, the Bayesian approach is one that incorporates a personal perspective. This can prove very important in the real world: the personal assumptions we make are all valid as long as they are expressed in a proper mathematical framework.

The frequentist approach, no matter how statistically strong it might be, cannot handle many real-life applications because it leaves no room for uncertainty or for the unpredictability of an outcome’s probability.

Since neural networks and machine learning models do not incorporate Bayesian probability, A/B tests and real-world situations often make use of Monte Carlo simulation and Markov models.
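As a small illustration of the Monte Carlo idea mentioned here, the sketch below estimates the probability that variant B of an A/B test outperforms variant A by sampling from Beta posteriors. The conversion counts are hypothetical:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1,1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # one posterior draw per variant: Beta(1 + successes, 1 + failures)
        ra = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rb > ra:
            wins += 1
    return wins / draws

# hypothetical A/B test: 120/1000 vs 150/1000 conversions
p = prob_b_beats_a(120, 1000, 150, 1000)
print(p)
```

With these counts the estimate lands near 0.97, i.e. variant B is very likely, but not certainly, the better one; a frequentist yes/no answer would hide that residual uncertainty.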


Top Machine Learning Jobs To Apply In November 2023

Apply to these top machine learning jobs.

Machine Learning Specialist at Standard Chartered Bank

Chennai, Tamil Nadu, India

Requirements

Experience: 0-6 years

A strong foothold on machine learning and deep learning concepts

Preference for work experience in unstructured data-based models using NLP and computer vision

Knowledge of full-stack machine learning on all phases of model design to deployment

Python/Django, API development, Git, TensorFlow, Numpy, Pandas, Jenkins

Ability to do fast prototypes and interest to work in cutting edge research areas such as explainable AI, federated learning, reinforcement learning etc.

Knowledge of cloud (AWS Preferred)

Github links/blogs that can showcase your work.

Microsoft Azure Machine Learning Application Lead at Accenture 

Bengaluru, Karnataka, India

Project Role: Application Lead

Project Role Description: Lead the effort to design, build and configure applications, acting as the primary point of contact.

Management Level: 9

Work Experience: 6-8 years

Work Location: Bengaluru

Must Have Skills: Microsoft Azure Machine Learning

Good To Have Skills: No Function Specialization

Key Responsibilities: Solely responsible for the machine learning-based software solution, working independently based on inputs from other departments; design, develop, troubleshoot and debug products/solutions in the AI/ML domain; work with partners within/outside the BU to develop and commercialize products/solutions; help create a cloud-based machine learning environment; support overall development, including firmware development / embedded systems.

Technical Experience: Strong knowledge of machine learning, deep learning, natural language processing, and neural networks; experience with any of Node.js, Python or Java; familiarity with ML tools and packages like OpenNLP, Caffe, Torch, TensorFlow, etc.; knowledge of SQL, Azure DevOps CI/CD, Docker, etc.

Professional Attributes: Must be a good team player with good analytical, communication and interpersonal skills; good work ethic, a can-do attitude, maturity and a professional attitude; should be able to understand the organizational and business goals and work with the team.

Machine Learning Engineer at Pratiti Technologies

Pune, Maharashtra, India

Job Profile:

Design and build machine learning models and pipelines.

Role Description: The role requires you to think critically and design with first principles. You should be comfortable with multiple moving parts, microservices architecture, and de-coupled services. Given you are constructing the foundation on which data and our global system will be built, you need to pay close attention to detail and maintain a forward-thinking outlook as well as scrappiness for the present needs. You are very comfortable learning new technologies and systems. You thrive in an iterative but heavily test-driven development environment. You obsess over model accuracy and performance and thrive on applying machine learning techniques to business problems.

You are a good fit if you:

Have strong experience building natural language processing (NLP) based systems, specifically in areas such as event and topic detection, relation extraction, summarization, entity recognition, document classification, and knowledge-based generation

Have experience with NLP and machine learning tools and libraries such as NumPy, Gensim, SpaCy, NLTK, scikit-learn, TensorFlow, Keras, etc.

Learn new ways of thinking about age-old problems

Are passionate about driving the performance of machine learning algorithms towards the state of the art and in challenging us to continually improve what is possible

Have experience in distributed training infrastructure and ML pipelines.

Senior Machine Learning Engineer – India at Bungee Tech

India (Remote)

Job Responsibilities:

Work with Business stakeholders to understand the customer requirements of our SaaS products and expand the product vision using the power of AI/ML

Provide clear, compelling analysis that shapes the direction of our business

Build neural net models that contribute to the enhancement of our image and text processing algorithms

Harness neural net-based natural language processing models to create new path-breaking capabilities in the retail business

Use machine learning, data mining, statistical techniques, etc. to create actionable, meaningful, and scalable solutions for business problems

Analyze and extract relevant information from large amounts of data and derive useful insights at a big data scale.

Work with software engineering teams, data engineers, and ML operations team (Data Labelers, Auditors) to deliver production systems with your deep learning models

Architecturally optimize the deep learning models for efficient inference, reduce latency, improve throughput, reduce memory footprint without sacrificing model accuracy

Establish scalable, efficient, automated processes for large-scale data analyses, model development, model validation, and model implementation

Create and enhance a model monitoring system that could alert real-time anomalies with the model.

Streamline ML operations by envisioning human-in-the-loop kind of workflows, collect necessary labels/audit information from these workflows/processes, that can feed into improved training and algorithm development process

Maintain multiple versions of the model and ensure the controlled release of models.

Machine Learning Ops Engineer – Analytics at Optimal Strategix Group, Inc.

Bengaluru, Karnataka, India (Hybrid)

Key Responsibilities:

Takes ownership for MLOPs for product development (on the cloud using microservices architecture)

Connect AI/ML modules to frontend and backend to build scalable and reproducible solutions

Setup pipelines, design and develop RESTful APIs for ML solution deployment

Ensuring code/solution delivery within schedule

Coordinating with external stakeholders and internal cross-functional teams in ensuring timelines are met and clear communication is established

Has an automation mindset and continuously focus on improving current processes

Proactively find issues and consult on possible solutions. Should be good at problem-solving.

Skills / Competencies:

Experience in NLP, AWS, and/or Azure infrastructure; comfortable with frameworks and tools like MLflow, Airflow, Git, Flask, Docker, and Spark.

Must have hands-on experience in Python, SQL, and Docker

Hands-on experience with AWS services like S3, VPC, EC2, Route 53, RDS, CloudFormation, CloudWatch, Lambda

Should have experience in architecting and developing solutions for end-to-end pipelines

Proficiency in Python programming – data wrangling, data visualization, machine learning, statistical analysis

Handling of unstructured data – Video/Image/Sound/Text

The ability to keep current with the constantly changing technology

Ability to design, develop, test, deploy, maintain, and improve ML Modules and infra

Ability to manage project priorities, deadlines, and deliverables

Design and build new data pipelines from scratch till deployment for projects

Mentor other MLOPs engineers

Excellent verbal and written communication skills

Should be able to articulate ideas clearly

Ability to think independently and take responsibility

Should show inquisitiveness in understanding business needs

Should be able to understand the business context

Good time management and organizational skills

Sense of urgency in completing project deliverables in time.

