Dplyr Tutorial: Merge And Join Data In R With Examples
Introduction to Data Analysis
Data analysis can be divided into three parts:
Extraction: First, we need to collect the data from many sources and combine them.
Transform: This step involves data manipulation. Once we have consolidated all the sources of data, we can begin to clean it.
Visualize: The last step is to visualize the data and check for irregularities.
Data Analysis Process
One of the most significant challenges faced by data scientists is data manipulation. Data is never available in the desired format, and data scientists need to spend at least half of their time cleaning and manipulating it. That is one of the most critical assignments in the job. If the data manipulation process is not complete, precise and rigorous, the model will not perform correctly.
R Dplyr
R has a library called dplyr to help in data transformation. The dplyr library is fundamentally created around four functions to manipulate the data and five verbs to clean the data. After that, we can use the ggplot library to analyze and visualize the data.
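To give a quick feel for those verbs before we move on to joins, here is a minimal sketch on a made-up toy data frame (the data and column names are invented purely for illustration):

library(dplyr)

toy <- tibble(id = c("A", "B", "C", "D"), score = c(5, 5, 8, 0), group = c(1, 1, 2, 2))

toy %>%
  filter(score > 0) %>%                  # keep rows that match a condition
  select(id, score, group) %>%           # keep only the listed columns
  mutate(double_score = score * 2) %>%   # create a new column
  arrange(desc(score)) %>%               # order the rows
  group_by(group) %>%
  summarise(mean_score = mean(score))    # collapse to one row per group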
We will learn how to use the dplyr library to manipulate a Data Frame.
Merge Data with R Dplyr
dplyr provides a nice and convenient way to combine datasets. We may have many sources of input data, and at some point, we need to combine them. A join with dplyr adds variables to the right of the original dataset.
Dplyr Joins
Following are four important types of joins used in dplyr to merge two datasets:
Function Objective Arguments Multiple keys
left_join() Merge two datasets. Keep all observations from the origin table data, origin, destination, by = “ID” origin, destination, by = c(“ID”, “ID2”)
right_join() Merge two datasets. Keep all observations from the destination table data, origin, destination, by = “ID” origin, destination, by = c(“ID”, “ID2”)
inner_join() Merge two datasets. Excludes all unmatched rows data, origin, destination, by = “ID” origin, destination, by = c(“ID”, “ID2”)
full_join() Merge two datasets. Keeps all observations data, origin, destination, by = “ID” origin, destination, by = c(“ID”, “ID2”)
We will study all the join types via an easy example.
First of all, we build two datasets. Table 1 contains two variables, ID and y, whereas Table 2 gathers ID and z. In each situation, we need a key-pair variable. In our case, ID is our key variable. The function will look for identical values in both tables and bind the returning values to the right of Table 1.
library(dplyr)
df_primary <- tribble(
  ~ID, ~y,
  "A", 5,
  "B", 5,
  "C", 8,
  "D", 0,
  "F", 9)
df_secondary <- tribble(
  ~ID, ~z,
  "A", 30,
  "B", 21,
  "C", 22,
  "D", 25,
  "E", 29)
Dplyr left_join()
The most common way to merge two datasets is to use the left_join() function. We can see from the picture below that the key-pair matches perfectly the rows A, B, C and D from both datasets. However, E and F are left over. How do we treat these two observations? With left_join(), we keep all the observations of the origin table and drop the observations that do not have a matching key in the destination table. In our example, the ID E does not exist in Table 1; therefore, that row is dropped. The ID F comes from the origin table, so it is kept after the left_join() and returns NA in the column z. The figure below reproduces what happens with a left_join().
Example of dplyr left_join()
left_join(df_primary, df_secondary, by = 'ID')
Output:
## # A tibble: 5 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
## 5 F         9    NA
Dplyr right_join()
The right_join() function works exactly like left_join(). The only difference is the row that gets dropped. The ID E, which is available in the destination data frame, exists in the new table and takes the value NA for the column y.
Example of dplyr right_join()
right_join(df_primary, df_secondary, by = 'ID')
Output:
## # A tibble: 5 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
## 5 E        NA    29
Dplyr inner_join()
When the two datasets do not match completely, we may prefer to return only the rows that exist in both datasets. This is useful when we need a clean dataset or when we don't want to impute missing values with the mean or median.
The inner_join() function comes to help here: it excludes all unmatched rows.
Example of dplyr inner_join()
inner_join(df_primary, df_secondary, by = 'ID')
Output:
## # A tibble: 4 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
Dplyr full_join()
Finally, the full_join() function keeps all observations and replaces missing values with NA.
Example of dplyr full_join()
full_join(df_primary, df_secondary, by = 'ID')
Output:
## # A tibble: 6 x 3
##   ID        y     z
## 1 A         5    30
## 2 B         5    21
## 3 C         8    22
## 4 D         0    25
## 5 F         9    NA
## 6 E        NA    29
Multiple Key Pairs
Last but not least, we can have multiple keys in our dataset. Consider the following dataset where, for each customer ID, we have several years and the number of items bought by the customer in each year.
If we try to merge both tables on ID alone, the merge is ambiguous because ID no longer identifies a single row (recent versions of dplyr warn about a many-to-many relationship). To remedy the situation, we can pass two key variables, ID and year, which appear in both datasets. We can use the following code to merge Table 1 and Table 2:
df_primary <- tribble(
  ~ID, ~year, ~items,
  "A", 2015, 3,
  "A", 2016, 7,
  "A", 2017, 6,
  "B", 2015, 4,
  "B", 2016, 8,
  "B", 2017, 7,
  "C", 2015, 4,
  "C", 2016, 6,
  "C", 2017, 6)
df_secondary <- tribble(
  ~ID, ~year, ~prices,
  "A", 2015, 9,
  "A", 2016, 8,
  "A", 2017, 12,
  "B", 2015, 13,
  "B", 2016, 14,
  "B", 2017, 6,
  "C", 2015, 15,
  "C", 2016, 15,
  "C", 2017, 13)
left_join(df_primary, df_secondary, by = c('ID', 'year'))
Output:
## # A tibble: 9 x 4
##   ID     year items prices
## 1 A      2015     3      9
## 2 A      2016     7      8
## 3 A      2017     6     12
## 4 B      2015     4     13
## 5 B      2016     8     14
## 6 B      2017     7      6
## 7 C      2015     4     15
## 8 C      2016     6     15
## 9 C      2017     6     13
Data Cleaning Functions in R
Following are the four important functions to tidy (clean) the data:
Function Objective Arguments
gather() Transform the data from wide to long (data, key, value, na.rm = FALSE)
spread() Transform the data from long to wide (data, key, value)
separate() Split one variable into two (data, col, into, sep = "", remove = TRUE)
unite() Unite two variables into one (data, col, conc, sep = "", remove = TRUE)
If not installed already, enter the following command to install tidyr:
To install tidyr:
install.packages("tidyr")
The objective of the gather() function is to transform the data from wide to long.
Syntax
gather(data, key, value, na.rm = FALSE)
Arguments:
- data: The data frame used to reshape the dataset
- key: Name of the new column created
- value: Select the columns used to fill the key column
- na.rm: Remove missing values. FALSE by default
Example
Below, we can visualize the concept of reshaping wide to long. We want to create a single column named growth, filled by the rows of the quarter variables.
library(tidyr)
# Create a messy dataset
messy <- data.frame(
  country = c("A", "B", "C"),
  q1_2023 = c(0.03, 0.05, 0.01),
  q2_2023 = c(0.05, 0.07, 0.02),
  q3_2023 = c(0.04, 0.05, 0.01),
  q4_2023 = c(0.03, 0.02, 0.04))
messy
Output:
##   country q1_2023 q2_2023 q3_2023 q4_2023
## 1       A    0.03    0.05    0.04    0.03
## 2       B    0.05    0.07    0.05    0.02
## 3       C    0.01    0.02    0.01    0.04
# Reshape the data
tidier <- gather(messy, quarter, growth, q1_2023:q4_2023)
tidier
Output:
##    country quarter growth
## 1        A q1_2023   0.03
## 2        B q1_2023   0.05
## 3        C q1_2023   0.01
## 4        A q2_2023   0.05
## 5        B q2_2023   0.07
## 6        C q2_2023   0.02
## 7        A q3_2023   0.04
## 8        B q3_2023   0.05
## 9        C q3_2023   0.01
## 10       A q4_2023   0.03
## 11       B q4_2023   0.02
## 12       C q4_2023   0.04
In the gather() function, we create two new variables, quarter and growth, because our original dataset has one grouping variable (country) plus the key-value pairs.
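Note that in current versions of tidyr, gather() is superseded by pivot_longer(). A roughly equivalent call for the example above, keeping the same column names, would be:

pivot_longer(messy, cols = q1_2023:q4_2023, names_to = "quarter", values_to = "growth")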
The spread() function does the opposite of gather.
Syntax
spread(data, key, value)
Arguments:
- data: The data frame used to reshape the dataset
- key: Column to reshape long to wide
- value: Rows used to fill the new column
Example
We can reshape the tidier dataset back to messy with spread()
# Reshape the data
messy_1 <- spread(tidier, quarter, growth)
messy_1
Output:
##   country q1_2023 q2_2023 q3_2023 q4_2023
## 1       A    0.03    0.05    0.04    0.03
## 2       B    0.05    0.07    0.05    0.02
## 3       C    0.01    0.02    0.01    0.04
The separate() function splits a column into two according to a separator. This function is helpful in situations where the variable is a date. Our analysis can require focusing on month and year, so we want to split the column into two new variables.
Syntax
separate(data, col, into, sep = "", remove = TRUE)
Arguments:
- data: The data frame used to reshape the dataset
- col: The column to split
- into: The names of the new variables
- sep: Indicates the symbol used to separate the variable, i.e.: "-", "_", "&"
- remove: Remove the old column. By default set to TRUE.
Example
We can split the quarter from the year in the tidier dataset by applying the separate() function.
separate_tidier <- separate(tidier, quarter, c("Qrt", "year"), sep = "_")
head(separate_tidier)
Output:
##   country Qrt year growth
## 1       A  q1 2023   0.03
## 2       B  q1 2023   0.05
## 3       C  q1 2023   0.01
## 4       A  q2 2023   0.05
## 5       B  q2 2023   0.07
## 6       C  q2 2023   0.02
The unite() function concatenates two columns into one.
Syntax
unite(data, col, conc, sep = "", remove = TRUE)
Arguments:
- data: The data frame used to reshape the dataset
- col: Name of the new column
- conc: Names of the columns to concatenate
- sep: Indicates the symbol used to unite the variables, i.e.: "-", "_", "&"
- remove: Remove the old columns. By default, set to TRUE
Example
In the above example, we separated quarter from year. What if we want to merge them back? We use the following code:
unit_tidier <- unite(separate_tidier, Quarter, Qrt, year, sep = "_")
head(unit_tidier)
Output:
##   country Quarter growth
## 1       A q1_2023   0.03
## 2       B q1_2023   0.05
## 3       C q1_2023   0.01
## 4       A q2_2023   0.05
## 5       B q2_2023   0.07
## 6       C q2_2023   0.02
Summary
Data analysis can be divided into three parts: Extraction, Transform, and Visualize.
R has a library called dplyr to help in data transformation. The dplyr library is fundamentally created around four functions to manipulate the data and five verbs to clean the data.
dplyr provides a nice and convenient way to combine datasets. A join with dplyr adds variables to the right of the original dataset.
The beauty of dplyr is that it handles four types of joins similar to SQL:
left_join() – To merge two datasets and keep all observations from the origin table.
right_join() – To merge two datasets and keep all observations from the destination table.
inner_join() – To merge two datasets and exclude all unmatched rows.
full_join() – To merge two datasets and keep all observations.
Using the tidyr library, you can transform a dataset with the following functions:
gather(): Transform the data from wide to long.
spread(): Transform the data from long to wide.
separate(): Split one variable into two.
unite(): Unite two variables into one.
What Is Data Lake? It’s Architecture: Data Lake Tutorial
What is Data Lake?
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It offers high data quantity to increase analytic performance and native integration.
A Data Lake is like a large container, very similar to real lakes and rivers. Just as a lake has multiple tributaries coming in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing through in real time.
Data Lake
The Data Lake democratizes data and is a cost-effective way to store all data of an organization for later processing. Research analysts can focus on finding meaningful patterns in the data, not on the data itself.
Unlike a hierarchical Data Warehouse, where data is stored in files and folders, a Data Lake has a flat architecture. Every data element in a Data Lake is given a unique identifier and tagged with a set of metadata information.
Why Data Lake?
The main objective of building a data lake is to offer an unrefined view of data to data scientists.
Reasons for using Data Lake are:
With the onset of storage engines like Hadoop, storing disparate information has become easy. There is no need to model data into an enterprise-wide schema with a Data Lake.
With the increase in data volume, data quality, and metadata, the quality of analyses also increases.
Data Lake offers business Agility
Machine Learning and Artificial Intelligence can be used to make profitable predictions.
There is no data silo structure. Data Lake gives 360 degrees view of customers and makes analysis more robust.
Data Lake Architecture
The figure shows the architecture of a Business Data Lake. The lower levels represent data that is mostly at rest, while the upper levels show real-time transactional data. This data flows through the system with little or no latency. Following are the important tiers in a Data Lake architecture:
Ingestion Tier: The tiers on the left side depict the data sources. The data could be loaded into the data lake in batches or in real-time
Insights Tier: The tiers on the right represent the research side where insights from the system are used. SQL, NoSQL queries, or even excel could be used for data analysis.
HDFS is a cost-effective solution for both structured and unstructured data. It is a landing zone for all data that is at rest in the system.
Distillation tier: takes data from the storage tier and converts it to structured data for easier analysis.
Processing tier: runs analytical algorithms and user queries in varying modes (real-time, interactive, batch) to generate structured data for easier analysis.
Unified operations tier: governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
Key Data Lake Concepts
Following are the key Data Lake concepts that one needs to understand in order to completely understand the Data Lake architecture.
Key Concepts of Data Lake
Data Ingestion
Data Ingestion allows connectors to get data from different data sources and load it into the Data Lake.
Data Ingestion supports:
All types of Structured, Semi-Structured, and Unstructured data.
Multiple ingestions like Batch, Real-Time, One-time load.
Many types of data sources like Databases, Webservers, Emails, IoT, and FTP.
Data Storage
Data storage should be scalable, offer cost-effective storage, and allow fast access to data exploration. It should support various data formats.
Data Governance
Data governance is a process of managing availability, usability, security, and integrity of data used in an organization.
Security
Security needs to be implemented in every layer of the Data Lake. It starts with storage, unearthing, and consumption. The basic need is to stop access by unauthorized users. The lake should also support different tools to access data, with easy-to-navigate GUIs and dashboards.
Authentication, Accounting, Authorization and Data Protection are some important features of data lake security.
Data Quality
Data quality is an essential component of Data Lake architecture. Data is used to extract business value. Extracting insights from poor quality data will lead to poor quality insights.
Data Discovery
Data Discovery is another important stage before you can begin preparing data or analysis. In this stage, a tagging technique is used to express the understanding of the data, by organizing and interpreting the data ingested in the Data Lake.
Data Auditing
The two major data auditing tasks are:
Tracking changes to important dataset elements
Capturing how, when, and by whom these elements were changed.
Data auditing helps to evaluate risk and compliance.
Data Lineage
This component deals with the data's origins: where it moves over time and what happens to it. It eases error correction in a data analytics process, from origin to destination.
Data Exploration
This is the beginning stage of data analysis. Identifying the right dataset is vital before starting data exploration.
All of the given components need to work together so that the Data Lake can be built, evolved, and explored easily.
Maturity Stages of Data Lake
The definition of Data Lake maturity stages differs from one textbook to another, though the crux remains the same. The following maturity-stage definitions are from a layman's point of view.
Maturity stages of Data Lake
Stage 1: Handle and ingest data at scale
This first stage of data maturity involves improving the ability to transform and analyze data. Here, business owners need to find the tools according to their skill set for obtaining more data and building analytical applications.
Stage 2: Building the analytical muscle
This is the second stage, which involves improving the ability to transform and analyze data. In this stage, companies use the tools most appropriate to their skill set. They start acquiring more data and building applications. Here, capabilities of the enterprise data warehouse and the data lake are used together.
Stage 3: EDW and Data Lake work in unison
This step involves getting data and analytics into the hands of as many people as possible. In this stage, the data lake and the enterprise data warehouse start to work in unison, both playing their part in analytics.
Stage 4: Enterprise capability in the lake
In this maturity stage of the data lake, enterprise capabilities are added to the Data Lake: adoption of information governance, information lifecycle management capabilities, and metadata management. Very few organizations reach this level of maturity, but the tally will increase in the future.
Best practices for Data Lake Implementation:
Architectural components, their interaction and identified products should support native data types
The design of a Data Lake should be driven by what is available instead of what is required. The schema and data requirements are not defined until the data is queried
Design should be guided by disposable components integrated with service API.
Data discovery, ingestion, storage, administration, quality, transformation, and visualization should be managed independently.
The Data Lake architecture should be tailored to a specific industry. It should ensure that capabilities necessary for that domain are an inherent part of the design
Faster on-boarding of newly discovered data sources is important
Data Lake helps customized management to extract maximum value
The Data Lake should support existing enterprise data management techniques and methods
Challenges of building a data lake:
In Data Lake, Data volume is higher, so the process must be more reliant on programmatic administration
It is difficult to deal with sparse, incomplete, volatile data
Wider scope of dataset and source needs larger data governance & support
Difference between Data Lakes and Data Warehouse
Parameters Data Lakes Data Warehouse
Data Data lakes store everything. Data Warehouse focuses only on Business Processes.
Processing Data are mainly unprocessed Highly processed data.
Type of Data It can be Unstructured, semi-structured and structured. It is mostly in tabular form & structure.
Task Share data stewardship Optimized for data retrieval
Agility Highly agile; configure and reconfigure as needed. Compared to a Data Lake, it is less agile and has a fixed configuration.
Users Data Lake is mostly used by Data Scientist Business professionals widely use data Warehouse
Storage Data Lakes are designed for low-cost storage. Expensive storage that gives fast response times is used.
Security Offers lesser control. Allows better control of the data.
Replacement of EDW A Data Lake can be a source for the EDW. Complementary to the EDW (not a replacement).
Schema Schema on reading (no predefined schemas) Schema on write (predefined schemas)
Data Processing Helps for fast ingestion of new data. Time-consuming to introduce new content.
Data Granularity Data at a low level of detail or granularity. Data at the summary or aggregated level of detail.
Tools Can use open source/tools like Hadoop/ Map Reduce Mostly commercial tools.
Benefits and Risks of using Data Lake
Here are some major benefits of using a Data Lake:
Offers cost-effective scalability and flexibility
Offers value from unlimited data types
Reduces long-term cost of ownership
Allows economic storage of files
Quickly adaptable to changes
Users from various departments, possibly scattered around the globe, can have flexible access to the data
Risk of Using Data Lake:
After some time, the Data Lake may lose relevance and momentum
There is a larger amount of risk involved while designing a Data Lake
Unstructured data may lead to ungoverned chaos, unusable data, and disparate & complex tools, and it makes enterprise-wide collaboration and a unified, consistent, common view of the data harder to achieve
It also increases storage & compute costs
There is no way to get insights from others who have worked with the data because there is no account of the lineage of findings by previous analysts
The biggest risk of data lakes is security and access control. Sometimes data can be placed into a lake without any oversight, even though some of the data may have privacy and regulatory needs
Summary:
A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data.
The main objective of building a data lake is to offer an unrefined view of data to data scientists.
Unified operations tier, Processing tier, Distillation tier and HDFS are important layers of Data Lake Architecture
Data Ingestion, Data storage, Data quality, Data Auditing, Data exploration, Data discover are some important components of Data Lake Architecture
Design of Data Lake should be driven by what is available instead of what is required.
Data Lake reduces long-term cost of ownership and allows economic storage of files
The biggest risk of data lakes is security and access control. Sometimes data can be placed into a lake without any oversight, even though some of the data may have privacy and regulatory needs.
Postgresql In, Not In With Examples
What is PostgreSQL IN?
The IN operator is used in a WHERE clause to check whether a value is present in a list of other values. The IN operator helps reduce the need for multiple OR conditions in SELECT, UPDATE, INSERT, or DELETE statements.
Syntax
The IN operator takes the following syntax:
value IN (value_1, value_2, ...)
The value is the value that you are checking for in the list.
The value_1, value_2… are the list values.
If the value is found in the list, the operator returns true.
The list can be a set of numbers or strings, or even the output result of a SELECT statement, as shown below:
value IN (SELECT value FROM table-name);
The statement placed inside the parentheses is known as a subquery.
With Character
Let us demonstrate how you can use the IN operator with character values.
Consider the following table:
Employees:
Let us run the following query against the above table:
SELECT * FROM Employees WHERE name IN ('James John', 'Mercy Bush', 'Kate Joel');
It returns the following:
We have a list of three names. We are searching for whether we can find any of these names in the name column of the Employees table. The name 'Kate Joel' was matched to one of the table's records, and its details were returned.
With Numeric
Now, let us see how we can use the IN operator with numeric values.
Consider the Price table given below:
Price:
We can run the following query against the table:
SELECT * FROM Price WHERE price IN (200, 308, 250, 550);
This returns the following:
We have created a list with 4 numeric values. We are checking whether we can match any of these values with the values contained in the price column of the Price table. Two values were matched, and their details were returned.
Using NOT operator
The IN operator can be used together with the NOT operator. It returns the values that are not found in the specified column. We will use the Price table to demonstrate this.
SELECT * FROM Price WHERE price NOT IN (200, 400, 190, 230);
This will return the following:
We have created a list with 4 numerical values. We are checking the price column of the Price table for values that are not part of the list. Two values, 250 and 300, were not in the list; hence their rows have been returned.
Using pgAdmin
Now let's see how the actions can be performed using pgAdmin.
With Character
To accomplish the same through pgAdmin, do this:
Step 1) Login to your pgAdmin account.
Step 2)
Step 3) Type the query in the query editor:
SELECT * FROM Employees WHERE name IN ('James John', 'Mercy Bush', 'Kate Joel');
It should return the following:
With Numeric
To accomplish the same through pgAdmin, do this:
Step 1) Login to your pgAdmin account.
Step 2)
Step 3) Type the query in the query editor:
SELECT * FROM Price WHERE price IN (200, 308, 250, 550);
It should return the following:
Using NOT operator
To accomplish the same through pgAdmin, do this:
Step 1) Login to your pgAdmin account.
Step 2)
Step 3) Type the query in the query editor:
SELECT * FROM Price WHERE price NOT IN (200, 400, 190, 230);
It should return the following:
Summary:
The IN operator is used with the WHERE clause. It allows checking whether a particular value is present in a specific list of values.
The IN operator helps in reducing the need for multiple OR operators in SELECT, UPDATE, INSERT, or DELETE statements.
When creating a character list to check for the presence of a value, each value in the list should be enclosed within single quotes.
The IN operator can also be used with numeric values.
When the IN operator is used together with the NOT operator, it returns all values that are not found in the specified column.
Download the Database used in this Tutorial
Github Integration With Selenium: Complete Tutorial
GitHub is a collaboration platform built on top of Git. It allows you to keep both local and remote copies of your project, a project which you can publish among your team members so that they can use and update it from there.
Advantages of Using Git Hub For Selenium.
When multiple people work on the same project they can update project details and inform other team members simultaneously.
Jenkins can help us to regularly build the project from the remote repository; this helps us to keep track of failed builds.
Before we start Selenium and GitHub integration, we need to install the following components.
Jenkins Installation.
Maven Installation.
Tomcat Installation.
You can find these installation steps in the following links:
Git Binaries Installation
Now let us start by installing "Git Binaries".
Step 2) Download the latest stable release.
Step 4) Go to the download location or icon and run the installer.
Another window will pop up,
Step 8) In this step,
Select the Directory where you want to install “Git Binaries” and
Step 11) In this step,
Select Use Git from the Windows Command Prompt to run Git from the command line and
Step 12) In this step,
Select "Use OpenSSH". It will help us to execute commands from the command line, and it will set the environmental path.
Step 13) In this step,
Select "Checkout Windows-style, commit Unix-style line endings" (this controls how Git should treat line endings in text files).
Step 14) In this step,
Select "Use MinTTY (the default terminal of MSYS2)" as the terminal for Git Bash.
Once git is installed successfully, you can access the git.
Open a command prompt, type "git", and hit "Enter". If you see the screen below, it means Git is installed successfully.
Jenkins Git Plugin Install
Now let's start with Jenkins Git Plugin Installation.
Step 1) Launch the Browser and navigate to your Jenkins.
Step 5) In this step,
Select GitHub plugin then
Now it will install the following plugins.
Once the installation is finished, restart your Tomcat server by calling the "shutdown.bat" file
After restarting Tomcat and Jenkins, we can see that the plugins are installed in the "Installed" tab.
Setting Up our Eclipse with GitHub Plugin
Now let's install GitHub Plugin for Eclipse.
Step 1) Launch Eclipse and then
Step 3) In this step,
Type the name “EGIT” and
Then restart the eclipse.
Building a repository on Git
Step 3) In this step,
Enter the name of the repository and
Testing Example Of Using Selenium with Git Hub
Step 1) Once we are done with the new repository, launch Eclipse
Step 2) In this step,
Select Maven Project and browse the location.
Step 3) In this step,
Select project name and location then
Step 5) In this step,
Enter Group Id and
Artifact Id and
Step 6)
Now let’s create a sample script
Let’s push the code/local repository to Git Hub.
Step 7) In this step,
Open eclipse and then navigate to the project
Select share project
In this step,
Select the local repository and
Now it’s time to push our code to Git Hub Repository
Step 9) In this step,
Step 10) In this step,
Enter a commit message and
Select the files which we want to send to Git Hub repository
Once you are done with it, you can see that the icons in the project have changed; this indicates that we have successfully pushed and committed our code to GitHub
We can verify on GitHub that our project has been successfully pushed into the repository
Now it’s time for executing our project from Git Hub in Jenkins
Step 11) Launch browser and open your Jenkins.
Step 13) In this step,
Enter Item name
Select Maven Project
Step 14) In this step, we will configure Git Hub in Jenkins
Enter the Repository URI
If you have multiple repositories in GitHub, you need to add the name of the repository in the Refspec field.
We can get the URI in Git Hub
Step 15) In this step,
Add the pom.xml file location in the textbox and
Specify the goals and options for Maven then
Select option on how to run the test
Finally, we can verify that our build is successfully completed/executed.
How To Merge Objects In Javascript
In JavaScript, object merging combines the properties and values of two or more objects into a single object. This is often useful when working with data in the form of key-value pairs, as it allows you to combine multiple objects into a single data structure.
Object merging is used in a variety of situations, including:
Combining the default values of an object with user-specified values
Merging the properties of multiple objects into a single object
Combining the results of multiple API calls into a single object
Joining the properties of an object with the properties of an array
In this post, we will explore the various use cases for object merging in JavaScript and provide examples of how to perform object merging using the (...) spread operator, the built-in Object.assign() method, as well as the merge() function from the popular Lodash library.
The (…) spread operator
The spread operator (...) is considered the most efficient and clean way to combine objects, compared to other alternatives, such as using Object.assign() or manually copying properties from one object to another.
To use the spread operator for object merging, you can use the following syntax:
const defaultOptions = { option1: 'value1', option2: 'value2' };
const userOptions = { option1: 'overriddenValue1', option3: 'newValue3' };
const mergedOptions = { ...defaultOptions, ...userOptions };
console.log(mergedOptions);
/*
{
  "option1": "overriddenValue1",
  "option2": "value2",
  "option3": "newValue3"
}
*/
In this example, mergedOptions would be a new object that contains all the properties from both defaultOptions and userOptions, with the properties from userOptions taking precedence in case of a conflict.
For comparison, here is a version of the code that uses Object.assign() to achieve the same result:
const defaultOptions = { option1: 'value1', option2: 'value2' };
const userOptions = { option1: 'overriddenValue1', option3: 'newValue3' };
const mergedOptions = Object.assign({}, defaultOptions, userOptions);
console.log(mergedOptions);
In this example, the console.log statement would output the same object as before:
{ option1: 'overriddenValue1', option2: 'value2', option3: 'newValue3' }
What about deep merges?
If you need to perform a deep merge, where you recursively merge object properties and arrays, the spread operator and Object.assign() are not suitable. These methods only perform a shallow merge, meaning that they only merge the top-level properties of the objects.
One possible way to perform a deep merge is to write a custom function that recursively merges the properties of the objects. This function can use a combination of the spread operator, Object.assign(), and other methods to merge the objects at different levels of depth.
Here is an example of a custom function that performs a deep merge of two objects:
function deepMerge(obj1, obj2) {
  const merged = { ...obj1, ...obj2 };
  for (const key of Object.keys(merged)) {
    if (typeof merged[key] === 'object' && merged[key] !== null) {
      merged[key] = deepMerge(obj1[key], obj2[key]);
    }
  }
  return merged;
}
To use this function, you can call it with the two objects you want to merge as arguments:
const defaultOptions = {
  option1: 'value1',
  option2: 'value2',
  nested: { subOption1: 'subValue1', subOption2: 'subValue2' }
};
const userOptions = {
  option1: 'overriddenValue1',
  option3: 'newValue3',
  nested: { subOption1: 'overriddenSubValue1', subOption3: 'newSubValue3' }
};
const mergedOptions = deepMerge(defaultOptions, userOptions);
console.log(mergedOptions);
In this example, the console.log statement would output the following object:
{
  nested: {
    subOption1: "overriddenSubValue1",
    subOption2: "subValue2",
    subOption3: "newSubValue3"
  },
  option1: "overriddenValue1",
  option2: "value2",
  option3: "newValue3"
}
Using Lodash
Lodash is a JavaScript library that provides utility functions for common programming tasks, including the ability to perform a deep merge of objects. To perform a deep merge with Lodash, you can use the merge() function.
The library is available as a browser script or through npm. It goes without saying that using this library is the best approach if you plan on doing production-ready merges!
Here is an example of the merge() function in Lodash:
const _ = require('lodash');

const obj1 = { a: 1, b: { c: 2, d: [3, 4, 5] } };
const obj2 = { a: 6, b: { c: 7, d: [8, 9] }, e: 10 };

const merged = _.merge(obj1, obj2);
console.log(merged);
The merge() function will recursively merge the properties of the two objects, including arrays, so that the resulting object contains the properties and values from both objects.
In the case of properties with the same key, the value from the second object (obj2 in this case) will take precedence.
Exploring Recommendation System (With An Implementation Model In R)
Introduction
We are sub-consciously exposed to recommendation systems when we visit websites such as Amazon, Netflix, IMDb and many more. Apparently, they have become an integral part of online marketing (pushing products online). Let's learn more about them here.
In this article, I've explained the working of a recommendation system using a real-life example, just to show you this is not limited to online marketing. It is being used by all industries. Also, we'll learn about its various types, followed by a practical exercise in R. The terms 'recommendation engine' and 'recommendation system' have been used interchangeably. Don't get confused!
Recommendation System in Banks – Example
Today, every industry is making full use of recommendation systems with its own tailored version. Let's take the banking industry as an example.
Bank X wants to make use of the transactions information and accordingly customize the offers they provide to their existing credit and debit card users. Here is what the end state of such analysis looks like:
Customer Z walks into a Pizza Hut. He pays the food bill through bank X's card. Using all the past transaction information, bank X knows that Customer Z likes to have an ice cream after his pizza. Using this transaction at Pizza Hut, the bank has located the exact location of the customer. Next, it finds 5 ice cream stores which are close enough to the customer, 3 of which have ties with bank X.
This is the interesting part. Now, here are the deals with these ice-cream stores:
Store 5 : Bank profit – $4, Customer expense – $11, Propensity of customer to respond – 20%
Let's assume the marked price is proportional to the customer's desire to have that ice cream. Hence, the customer struggles with the trade-off of whether to fulfil his desire at the extra cost or buy the cheaper ice cream. Bank X wants the customer to go to store 3, 4 or 5 (higher profits). It can increase the propensity of the customer to respond if it gives him a reasonable deal. Let's assume that discounts are always whole numbers. For now, the expected value was:
Expected value = 20% * (2 + 2 + 5 + 6 + 4) = $19/5 = $3.8
Can we increase the expected value by giving out discounts? Here is how the propensity varies at stores 3, 4 and 5:
Store 3 : Discount of $1 increases propensity by 5%, a discount of $2 by 7.5% and a discount of $3 by 10%
Store 4 : Discount of $1 increases propensity by 25%, a discount of $2 by 30%, a discount of $3 by 35% and a discount of $4 by 80%
Store 5 : No change with any discount
Banks cannot give multiple offers at the same time with competing merchants. You need to assume that an increase in one store's propensity gives an equal percentage-point decrease in all the other propensities. Here is the calculation for the most intuitive case: give a discount of $2 at store 4.
Expected value = 50%/4 * (2 + 2 + 5 + 4) + 50% * 5 = $ 13/8 + $2.5 = $1.6 + $2.5 = $4.1
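Both calculations follow the same pattern: the expected value is the propensity-weighted sum of the bank's profit at each store, i.e.

Expected value = sum over stores i of (propensity_i × profit_i)

where propensity_i is the customer's propensity to respond at store i and profit_i is the bank's profit there, net of whatever part of the discount the bank funds.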
Think Box : Is there any better option available which can give bank a higher profit? I’d be interested to know!
You see, making recommendations isn't about extracting data, writing code and being done with it. Instead, it requires mathematics (apparently), logical thinking and a flair for using a programming language. Trust me, the third one is the easiest of all. Feeling confident? Let's proceed.
What exactly is the work of a recommendation engine?
The previous example would have given you a fair idea. It's time to make it crystal clear. Let's understand what a recommendation engine can do in the context of the previous example (Bank X):
It finds out the merchants/items which a customer might be interested in after buying something else.
It estimates the profit & loss when many competing items can be recommended to the customer. Based on the profile of the customer, it then recommends a customer-centric or product-centric offering. For a high-value customer, whose wallet share other banks are also keen to gain, you might want to bring out the best of your offers.
It can enhance customer engagement by providing offers which are highly appealing to the customer. The customer might have purchased the item anyway, but with an additional offer the bank might win the interest of such a customer.
What are the types of Recommender Engines?
There are broadly two types of recommender engines, and the choice between them depends on the industry. We have explained each of these algorithms in our previous articles, but here I try to give a practical explanation to help you understand them easily.
I’ve explained these algorithms in context of the industry they are used in and what makes them apt for these industries.
Context based algorithms:
As the name suggests, these algorithms are strongly based on deriving the context of the item. Once you have gathered this context-level information on items, you try to find look-alike items and recommend them. For instance, on YouTube you can find the genre, language and cast of a video. Based on this information, we can find look-alikes (related videos) of these videos. Once we have the look-alikes, we simply recommend these videos to a customer who originally saw only the first video. Such algorithms are very common in online video channels, online song stores, etc. The plausible reason is that such context-level information is far easier to get when the product/item can be described with a few dimensions. A minimal sketch of the idea follows below.
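The sketch below is only illustrative: the video titles and attributes are made up, and the approach shown (one-hot encoding a couple of attributes and computing cosine similarity between items) is just one simple way to find context-based look-alikes.

# Hypothetical item attributes for a context-based look-alike search
videos <- data.frame(
  title    = c("V1", "V2", "V3"),
  genre    = c("comedy", "comedy", "drama"),
  language = c("en", "en", "fr"),
  stringsAsFactors = FALSE
)

# One-hot encode the attributes and compute cosine similarity between items
features <- model.matrix(~ genre + language - 1, data = videos)
norms <- sqrt(rowSums(features^2))
similarity <- (features %*% t(features)) / (norms %o% norms)
rownames(similarity) <- colnames(similarity) <- videos$title
similarity  # items with the highest off-diagonal values are the "look-alikes"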
Collaborative filtering algorithms:
This is one of the most commonly used algorithms because it is not dependent on any additional information. All you need is the transaction-level information of the industry. For instance, e-commerce players like Amazon and banks like American Express often use these algorithms to make merchant/product recommendations. Further, there are several types of collaborative filtering algorithms:
User-User Collaborative filtering: Here we find a look-alike customer for every customer and offer products which the first customer's look-alike has chosen in the past. This algorithm is very effective but takes a lot of time and resources, since it requires computing information for every customer pair. Therefore, for platforms with a big customer base, this algorithm is hard to implement without a very strong parallelizable system.
Item-Item Collaborative filtering: It is quite similar to the previous algorithm, but instead of finding customer look-alikes, we try finding item look-alikes. Once we have the item look-alike matrix, we can easily recommend alike items to a customer who has purchased any item from the store. This algorithm is far less resource-consuming than user-user collaborative filtering. Hence, for a new customer the algorithm takes far less time than user-user collaborative filtering, as we don't need all similarity scores between customers. And with a fixed number of products, the product-product look-alike matrix is fixed over time.
Other simpler algorithms: There are other approaches, like market basket analysis, which generally have lower predictive power than the algorithms above.
How do we decide the performance metric of such engines?
Good question! We must know that performance metrics are strongly driven by business objectives. Generally, there are three possible metrics which you might want to optimise:
Based on dollar value:
If your overall aim is to increase a profit/revenue metric using the recommendation engine, then your evaluation metric should be the incremental revenue/profit/sale at each recommended rank. The ranks should follow an unbroken order, and the expected benefit at each rank should be above the offer cost.
Based on propensity to respond:
If the aim is just to activate customers, or make customers explore new items/merchant, this metric might be very helpful. Here you need to track the response rate of the customer with each rank.
Based on number of transactions:
Some times you are interested in activity of the customer. For higher activity, customer needs to do higher number of transactions. So we track number of transaction for the recommended ranks.
Other metrics:
There are other metrics which you might be interested in, like satisfaction rate or number of calls to the call centre, etc. These metrics are rarely used as they generally won't give you results for the entire portfolio but only for a sample.
Building an Item-Item Collaborative Filtering Recommendation Engine using R
Let's get some hands-on experience building a recommendation engine. Here, I've demonstrated building an item-item collaborative filtering recommendation engine. The data contains just 2 columns, namely individual_merchant and individual_customer. The data is available to download – Download Now.
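The full script from the original article is not reproduced here. As a substitute, here is a minimal sketch of an item-item collaborative filter in R, assuming the transactions sit in a data frame txn with the two columns individual_customer and individual_merchant described above (the file name and the customer label in the final call are hypothetical):

library(dplyr)
library(tidyr)

txn <- read.csv("transactions.csv")  # hypothetical file with the two columns

# Build a binary customer x merchant matrix
ratings <- txn %>%
  distinct(individual_customer, individual_merchant) %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = individual_merchant, values_from = value, values_fill = 0)
mat <- as.matrix(ratings[, -1])
rownames(mat) <- ratings$individual_customer

# Item-item cosine similarity: sim(i, j) = (i . j) / (|i| * |j|)
cross <- t(mat) %*% mat
norms <- sqrt(diag(cross))
similarity <- cross / (norms %o% norms)
diag(similarity) <- 0

# For one customer, score unvisited merchants by their total similarity
# to the merchants the customer has already visited
recommend <- function(customer, n = 5) {
  visited <- mat[customer, ] == 1
  scores <- colSums(similarity[visited, , drop = FALSE])
  scores[visited] <- NA  # do not re-recommend merchants already visited
  head(names(sort(scores, decreasing = TRUE)), n)
}

recommend("CUSTOMER_1")  # hypothetical customer id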
End Notes
Recommendation engines have become extremely common because they solve one of the most commonly found business cases across industries. Substitutes for these recommendation engines are very difficult to build because they predict for multiple items/merchants at the same time. Classification algorithms struggle to take in so many classes as the output variable.
In this article, we learnt about the use of recommendation systems in Banks. We also looked at implementing a recommendation engine in R. No doubt, they are being used across all sectors of industry, with a common aim to enhance customer experience.