Learn Hive Query To Unlock The Power Of Big Data Analytics

Updated in December 2023.


Given the number of large datasets that data engineers handle on a daily basis, there is no doubt that a dedicated tool is required to process and analyze such data. Several tools, such as Pig, address this problem. One of the most widely used is Apache Hive, which is built on top of Hadoop.

Apache Hive is a data warehousing system built on top of Apache Hadoop. Using Apache Hive, you can query distributed data storage, including the data residing in Hadoop Distributed File System (HDFS), which is the file storage system provided in Apache Hadoop. Hive also supports ACID transactions, as relational databases do, on tables stored in the ORC file format, which is optimized for faster querying. But the real reason behind the prolific use of Hive for working with Big Data is that it provides an easy-to-use query language.

Apache Hive supports the Hive Query Language, or HQL for short. HQL is very similar to SQL, which is the main reason behind its extensive use in the data engineering domain. Not only that, but HQL makes it fairly easy for data engineers to work with transactions in Hive, so you can use the familiar insert, update, delete, and merge SQL statements on table data in Hive. In fact, the simplicity of HQL is one of the reasons why data engineers now use Hive instead of Pig to query Big Data.

So, in this article, we will be covering the most commonly used queries which you will find useful when querying data in Hive.

Learning Objectives

Get an overview of Apache Hive.

Get familiar with Hive Query Language.

Implement various functions in Hive, like aggregation functions, date functions, etc.

Hive Refresher

Hive is a data warehouse built on top of Apache Hadoop, which is an open-source distributed framework.

Hive architecture contains Hive Client, Hive Services, and Distributed Storage.

Hive Client provides various types of connectors, such as JDBC and ODBC connectors, which allow Hive to support applications written in different programming languages like Java, Python, etc.

Hive Services includes Hive Server, Hive CLI, Hive Driver, and Hive Metastore.

Hive CLI has been replaced by Beeline in HiveServer2.

Hive supports three different types of execution engines – MapReduce, Tez, and Spark.

Hive originally provided its own command line interface, known as Hive CLI, where programmers can write Hive queries directly.

Hive Metastore maintains the metadata about Hive tables.

Hive metastore can be used with Spark as well for storing the metadata.

Hive supports two types of tables – Managed tables and External tables.

The schema and data for Managed tables are stored in Hive.

In the case of External tables, only the schema is stored by Hive in the Hive metastore.

Hive uses the Hive Query Language (HQL) for querying data.

Hive translates HQL (also written HiveQL) queries into MapReduce jobs on Hadoop, so we do not have to write those jobs by hand.

Let’s look at some popular Hive queries.

Simple Selects

In Hive, data is queried with a SELECT statement. A SELECT statement has six key clauses, written in this order:

SELECT column names

FROM table-name

WHERE conditions

GROUP BY column names

HAVING conditions

ORDER BY column names

In practice, very few queries will have all of these clauses in them, simplifying many queries. On the other hand, conditions in the WHERE clause can be very complex, and if you need to JOIN two or more tables together, then more clauses (JOIN and ON) are needed.

All of the clause names above have been written in uppercase for clarity. HQL is not case-sensitive. Neither do you need to write each clause on a new line, but it is often clearer to do so for all but the simplest of queries.

Over here, we will start with the very simple ones and work our way up to the more complex ones.

Simple Selects ‐ Selecting Columns

Amongst all the hive queries, the simplest query is effectively one which returns the contents of the whole table. Following is the syntax to do that –

SELECT * FROM geog_all;

It is better practice, and generally more efficient, to explicitly list the column names that you want to be returned. This is one of the optimization techniques that you can use while querying in Hive.

SELECT anonid, fueltypes, acorn_type FROM geog_all;

Simple Selects – Selecting Rows

In addition to limiting the columns returned by a query, you can also limit the rows returned. The simplest case is to say how many rows are wanted using the LIMIT clause.

SELECT anonid, fueltypes, acorn_type FROM geog_all LIMIT 10;

This is useful if you just want to get a feel for what the data looks like. Usually, you will want to restrict the rows returned based on some criteria, i.e., certain values or ranges within one or more columns.

SELECT anonid, fueltypes, acorn_type FROM geog_all WHERE fueltypes = "ElecOnly";

The expression in the WHERE clause can be more complex and involve more than one column.

SELECT anonid, fueltypes, acorn_type FROM geog_all SELECT anonid, fueltypes, acorn_type FROM geog_all

Notice that the columns used in the conditions of the WHERE clause don’t have to appear in the Select clause. Other operators can also be used in the where clause. For complex expressions, brackets can be used to enforce precedence.
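The point about WHERE columns not having to appear in the SELECT list can be tried locally. The following is a minimal sketch, not a Hive session: it assumes SQLite (bundled with Python) as a stand-in engine, and the rows and the specific conditions are made up for illustration. String literals use single quotes, which Hive also accepts.

```python
import sqlite3

# Assumption: SQLite stands in for Hive so the sketch runs without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geog_all "
    "(anonid INTEGER, fueltypes TEXT, acorn_type INTEGER, nuts1 TEXT)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO geog_all VALUES (?, ?, ?, ?)",
    [
        (1, "ElecOnly", 40, "UKM"),
        (2, "ElecOnly", 45, "UKI"),
        (3, "Dual", 46, "UKM"),
        (4, "ElecOnly", 47, "UKJ"),
    ],
)

# nuts1 is used in the WHERE clause but is not in the SELECT list.
query = """
    SELECT anonid, fueltypes, acorn_type
    FROM geog_all
    WHERE fueltypes = 'ElecOnly'
      AND acorn_type > 42
      AND nuts1 <> 'UKI'
    ORDER BY anonid
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only the row with anonid 4 passes all three conditions, even though nuts1 never appears in the output.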

SELECT anonid, fueltypes, acorn_type, nuts1, ldz FROM geog_all WHERE fueltypes = "ElecOnly" AND acorn_type BETWEEN 42 AND 47 AND (nuts1 NOT IN ("UKM", "UKI") OR ldz = "--");

Creating New Columns

It is possible to create new columns in the output of the query. These columns can be built from combinations of the other columns using operators and/or built-in Hive functions.

SELECT anonid, eprofileclass, acorn_type, (eprofileclass * acorn_type) AS multiply, (eprofileclass + acorn_type) AS added FROM edrp_geography_data b;

A full list of the operators and functions available within Hive can be found in the documentation.

When you create a new column, it is usual to provide an ‘alias’ for the column. This is essentially the name you wish to give to the new column. The alias is given immediately after the expression to which it refers. Optionally you can add the AS keyword for clarity. If you do not provide an alias for your new columns, Hive will generate a name for you.

Although the term alias may seem a bit odd for a new column that has no natural name, aliases can also be used with any existing column to provide a more meaningful name in the output.

Tables can also be given an alias. This is particularly common in join queries involving multiple tables, where there is a need to distinguish between columns with the same name in different tables. In addition to using operators to create new columns, there are also many Hive built‐in functions that can be used.
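Column and table aliases can be seen at work in a small runnable sketch. It again assumes SQLite as a local stand-in for Hive, and the two sample rows are invented. The table alias b qualifies each column, and the AS aliases become the column names of the result.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edrp_geography_data "
    "(anonid INTEGER, eprofileclass INTEGER, acorn_type INTEGER)"
)
conn.executemany(
    "INSERT INTO edrp_geography_data VALUES (?, ?, ?)",
    [(1, 2, 10), (2, 3, 20)],
)

cursor = conn.execute("""
    SELECT b.anonid AS anonid,
           (b.eprofileclass * b.acorn_type) AS multiply,
           (b.eprofileclass + b.acorn_type) AS added
    FROM edrp_geography_data b
    ORDER BY b.anonid
""")
# The aliases are what the result set reports as column names.
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(column_names)
print(rows)
```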

Hive Functions

You can use various Hive functions for data analysis purposes. Following are the functions to do that.

Simple Functions

Let’s talk about the functions which are popularly used to query columns that contain string data type values.

Concat can be used to add strings together.

SELECT anonid, acorn_category, acorn_group, acorn_type, concat (acorn_category, ",", acorn_group, ",", acorn_type)  AS acorn_code FROM geog_all;

substr can be used to extract a part of a string

SELECT anon_id, advancedatetime, substr (advancedatetime, 1, 2) AS first_two_chars FROM elec_c;

Examples of length, instr, and reverse

SELECT anonid, acorn_code, length (acorn_code), instr (acorn_code, ',') AS a_catpos, instr (reverse (acorn_code), ',') AS reverse_a_typepos FROM geog_all;

Where needed, functions can be nested within each other. The following query also shows cast and type conversions.

SELECT anonid, substr (acorn_code, 7, 2) AS ac_type_string, cast (substr (acorn_code, 7, 2) AS INT) AS ac_type_int, substr (acorn_code, 7, 2) +1 AS ac_type_not_sure FROM geog_all;

Aggregation Functions

Aggregate functions are used to perform some kind of mathematical or statistical calculation across a group of rows. The rows in each group are determined by the different values in a specified column or columns. A list of all of the available functions can be found in the Apache Hive documentation.

SELECT anon_id,               count (eleckwh) AS total_row_count,               sum (eleckwh) AS total_period_usage,               min (eleckwh) AS min_period_usage,               avg (eleckwh) AS avg_period_usage,              max (eleckwh) AS max_period_usage        FROM elec_c GROUP BY anon_id;

In the above example, five aggregations were computed on eleckwh for each value of the single grouping column anon_id. It is possible to group by multiple columns by specifying them in both the SELECT and the GROUP BY clause. The grouping will take place based on the order of the columns listed in the GROUP BY clause. What is not allowed is specifying a non‐aggregated column in the SELECT clause that is not mentioned in the GROUP BY clause.

SELECT anon_id,               count (eleckwh) AS total_row_count,               sum (eleckwh) AS total_period_usage,               min (eleckwh) AS min_period_usage,               avg (eleckwh) AS avg_period_usage,               max (eleckwh) AS max_period_usage        FROM elec_c

Unfortunately, the GROUP BY clause will not accept aliases.

SELECT anon_id,               count (eleckwh) AS total_row_count,               sum (eleckwh) AS total_period_usage,               min (eleckwh) AS min_period_usage,               avg (eleckwh) AS avg_period_usage,               max (eleckwh) AS max_period_usage       FROM elec_c ORDER BY anon_id, reading_year;

But the Order by clause does.
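The rule about aliases can be sketched as follows: the grouping expression is written out in full in GROUP BY, while ORDER BY refers to it by its alias. This is an illustration under stated assumptions, not the article's original query: SQLite stands in for Hive, and the table elec_days, its 'yyyy-MM-dd' reading_date strings, and the rows are all hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for Hive. The table name elec_days, its
# schema, and the rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE elec_days (anon_id INTEGER, reading_date TEXT, eleckwh REAL)"
)
conn.executemany(
    "INSERT INTO elec_days VALUES (?, ?, ?)",
    [
        (1, "2008-03-01", 1.0),
        (1, "2008-07-15", 2.0),
        (1, "2009-01-10", 4.0),
        (2, "2008-05-20", 3.0),
    ],
)

# GROUP BY repeats the full expression; ORDER BY uses the alias reading_year.
query = """
    SELECT anon_id,
           substr(reading_date, 1, 4) AS reading_year,
           count(eleckwh) AS total_row_count,
           sum(eleckwh) AS total_period_usage
    FROM elec_days
    GROUP BY anon_id, substr(reading_date, 1, 4)
    ORDER BY anon_id, reading_year
"""
rows = conn.execute(query).fetchall()
print(rows)
```

SQLite itself would also accept the alias in GROUP BY, so the sketch cannot reproduce Hive's error. It only shows a form of the query that satisfies the restriction described above.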

The DISTINCT keyword returns the set of unique combinations of column values within a table, without any kind of aggregation.

SELECT DISTINCT eprofileclass, fueltypes FROM geog_all;

Date Functions

Hive provides a variety of date-related functions to allow you to convert strings into timestamps and to additionally extract parts of the Timestamp.

unix_timestamp returns the current date and time – as an integer!

from_unixtime takes an integer and converts it into a recognizable Timestamp string

SELECT unix_timestamp () AS currenttime FROM sample_07 LIMIT 1;

SELECT from_unixtime (unix_timestamp ()) AS currenttime FROM sample_07 LIMIT 1;

There are various date part functions that will extract the relevant parts from a Timestamp string.
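As a rough illustration of what these date part functions compute, the following plain Python sketch mirrors them for a string in the 'ddMMMyy' pattern (for example '05Jan09'). It is not Hive code: the function name date_parts and the sample value are hypothetical, and Python's '%d%b%y' format is assumed to be a close equivalent of that pattern.

```python
import calendar
from datetime import datetime, timedelta


def date_parts(reading_date: str, days_to_add: int = 10) -> dict:
    """Mirror, in plain Python, the values the Hive date functions would return."""
    # '%d%b%y' is assumed to correspond to the 'ddMMMyy' pattern, e.g. '05Jan09'.
    parsed = datetime.strptime(reading_date, "%d%b%y")
    # monthrange returns (weekday of first day, number of days in the month).
    last_day_number = calendar.monthrange(parsed.year, parsed.month)[1]
    return {
        "proper_date": parsed.strftime("%Y-%m-%d %H:%M:%S"),
        "full_year": parsed.year,
        "full_month": parsed.month,
        "full_day": parsed.day,
        "last_day_of_month": parsed.replace(day=last_day_number).strftime("%Y-%m-%d"),
        "added_days": (parsed + timedelta(days=days_to_add)).strftime("%Y-%m-%d"),
    }


parts = date_parts("05Jan09")
print(parts)
```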

SELECT anon_id,
       from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy')) AS proper_date,
       year (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))) AS full_year,
       month (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))) AS full_month,
       day (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))) AS full_day,
       last_day (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))) AS last_day_of_month,
       date_add ( (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))),10) AS added_days
FROM elec_days_c
ORDER BY proper_date;

Conclusion

In this article, we covered some basic Hive functions and queries. We saw that running queries on distributed data is not much different from running queries in MySQL. We covered basic queries such as selecting columns and rows, creating new columns, and working with simple, aggregation, and date functions in Hive.

Key Takeaways

Hive Query Language is the language supported by Hive.

HQL makes it easy for developers to query on Big data.

HQL is similar to SQL, making it easy for developers to learn this language.


Frequently Asked Questions

Q1. What queries are used in Hive?

A. Hive supports the Hive Querying Language(HQL). HQL is very similar to SQL. It supports the usual insert, update, delete, and merge SQL statements to query data in Hive.

Q2. What are the benefits of Hive?

A. Hive is built on top of Apache Hadoop. This makes it an apt tool for analyzing Big data. It also supports various types of connectors, making it easier for developers to query Hive data using different programming languages.

Q3. What is the difference between Hive and MapReduce?

A. Hive is a data warehousing system that provides SQL-like querying language called HiveQL, while MapReduce is a programming model and software framework used for processing large datasets in a distributed computing environment. Hive also provides a schema for data stored in Hadoop Distributed File System (HDFS), making it easier to manage and analyze large datasets.



How To Create Table In Hive With Query Examples?

Introduction to Hive Table

In Hive, tables consist of columns and rows and store related data in tabular format within the same database. Tables are broadly classified into two types: external tables and internal tables.


The default storage location of a table varies with the Hive version. HDP 3.0 ships Hive 3.0 or later, and from that version the default table locations changed: external tables are stored under “/warehouse/tablespace/external/hive/” and managed tables under “/warehouse/tablespace/managed/hive”.

In older versions of Hive, the default storage location of a Hive table is “/apps/hive/warehouse/”.


CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [database name.]table name [ROW FORMAT row format] [STORED AS file format]

How to create a Table in Hive?

Hive internal table

Hive external table

Note: Hive supports “hql” files. In an “hql” file, we can write the entire internal or external table DDL and load the data into the respective table directly.

1. Internal Table

The internal table is also called a managed table and is owned by Hive. Whenever we create a table without specifying the keyword “external”, it is created as an internal table in the default location.

If we drop an internal (managed) table, the table DDL, the metadata, and the table data are all lost, including the table data stored on HDFS. We should therefore be very careful when dropping any internal table.

DDL Code for Internal Table

create table emp.customer ( id int, first_name string, last_name string, gender string, company_name string, job_title string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n' location "/emp/table1" tblproperties ("skip.header.line.count"="1");

Note: To load data into the internal (managed) table, we use the “location” keyword in the DDL code. We have placed the CSV file at that location, and its data is loaded into the table.


2. External Table

Creating an external table is often considered best practice, and many organizations follow it. Hive does not manage the data of an external table, and the table is not created in the warehouse directory. We can store the external table data anywhere on HDFS.

External tables make it possible to recover the data: if we drop an external table, the table data on HDFS is not affected. Only the metadata associated with the table is dropped.
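The difference in drop behaviour between the two table types can be pictured with a toy model. The sketch below is plain Python, not Hive: the class ToyMetastore, its methods, and the temporary directories are hypothetical stand-ins for the metastore and HDFS, written only to show that dropping a managed table removes its data while dropping an external table removes only the metadata entry.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Toy stand-in for a metastore: table name -> (data location, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external):
        location.mkdir(parents=True, exist_ok=True)
        self.tables[name] = (location, external)

    def drop_table(self, name):
        # The metadata entry is always removed.
        location, external = self.tables.pop(name)
        # Only a managed (internal) table takes its data with it.
        if not external:
            shutil.rmtree(location)


root = Path(tempfile.mkdtemp())  # stands in for HDFS in this sketch
store = ToyMetastore()
store.create_table("customer", root / "managed" / "customer", external=False)
store.create_table("sales", root / "external" / "sales", external=True)
for name in ("customer", "sales"):
    (store.tables[name][0] / "data.csv").write_text("1,Ann\n")

managed_dir = store.tables["customer"][0]
external_dir = store.tables["sales"][0]
store.drop_table("customer")
store.drop_table("sales")

managed_data_survives = managed_dir.exists()
external_data_survives = (external_dir / "data.csv").exists()
print(managed_data_survives, external_data_survives)
shutil.rmtree(root)  # clean up the temporary directory
```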


DDL Code for External Table

create external table emp.sales ( id int, first_name string, last_name string, gender string, email_id string, city string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n' location "/emp/sales" tblproperties ("skip.header.line.count"="1");

Note: we can directly store the external table data on the cloud or any other remote machine in the network. It will depend on the requirement.


How to modify/alter the Table?

Hive lets us alter the attributes of an existing table. With the “alter” functionality, we can rename the table, add a column, drop a column, change a column name, and replace columns.

We can alter the below Table attributes.

1. Alter/rename the table name


ALTER TABLE [current table name] RENAME TO [new table name]

Query to Alter Table Name :

ALTER TABLE customer RENAME TO cust;



2. Alter/ add column in the table


ALTER TABLE [current table name] ADD COLUMNS (column spec[, col_spec ...])

Query to add Column :

ALTER TABLE cust ADD COLUMNS (dept string);

Here we are adding a new column, “dept” (department), to the table.

3. Alter/change the column name


ALTER TABLE [current table name] CHANGE [column name] [new name] [new type]

Query to change column name :

ALTER TABLE cust CHANGE first_name name string;



How to drop the Table?

Drop Internal or External Table


DROP TABLE [IF EXISTS] table name;

Drop Query:

drop table cust;




We have covered the concept of the Hive table with examples, explanations, syntax, and queries. Tables are useful for storing structured data, which supports analysis work such as BI, reporting, and data slicing and dicing. Hive manages the data of internal tables but not of external tables. We can choose the table type to create according to the requirement.


5 Data Analytics Trends Shaping The Future Of Analytics

As data analytics shapes business, the trends that shape data analytics become ever more important.

Clearly, data analytics software is now a core tool set for managing a business. Today, the constantly updated apps from Big Data companies are the very engine that runs the enterprise.

Given the importance of data analytics, it’s essential that business managers of every stripe understand the trends that are shaping it going forward.

To discuss this, I spoke with Jim Hare, research vice president at Gartner. Based on a Gartner report he authored, we discussed the five key trends shaping the evolution of data analytics.

See below: transcribed highlights of my discussion with Jim Hare.

Jim Hare: “[Augmented analytics] is making it easier for even business analysts to build and deploy these models, without even having to be programmers. So it’s really, really changing the landscape, both on the traditional analytic side but also these data science machine learning platforms.

“It’s helping in a couple ways, one of which is making it easier to prepare data, to find insights in the data, and then even how to communicate those insights and results.”

“Data literacy, digital ethics, privacy, enterprise and vendor data-for-good initiatives encompass digital culture,” says the Gartner report.

Gartner predicts that, by 2023, 60% of organizations with more than 20 data scientists will require a professional code of conduct incorporating ethical use of data analytics.

Jim Hare: “One of the challenges that we see a lot of organizations are facing is: not everybody understands how data analytics really works.

“So one of the things that’s critical for organizations that want to be data-driven organizations is to ‘up-level’ the knowledge and skills of the people even on the front line, who are starting to have these analytical insights.

“But they need to understand how to communicate in a way to best use that information and also what the limitations are so they don’t get themselves in trouble.”

Relationship analytics highlights the growing use of graph, location and social analytical techniques.

Jim Hare: “In some cases, [relationship analytics] is people-to-people, sometimes it’s people-to-things. But there’s a lot of information you can refer to, to deduce when you start combining multiple data sets.

“Today most of the analytic solutions you see, [they] look at these types of data in isolation. So you’re analyzing location, you’re looking at particular data points on a map, or social analytics, you’re maybe looking at the connectivity between individuals.

“What’s important is when you start piecing together these different types of data sources and use multiple analytic techniques together, you’re able to have a much more complete picture of whatever problem you’re trying to solve. And we really think that this notion of relationship analytics, which is the connected tissue between the data on those people, places or things. This is really going to be the next wave providing deeper insights and really helping organizations.”

Jim Hare: “[Decision intelligence] means that these decisions often are spanning multiple applications and even different functional groups.

“Case in point, if you talk to most organizations and ask them about customer experience, it’s very siloed. You talk to the sales organization, the marketing support, all of them have their own silos. All of them have their own view what the customer is, but no one’s really looking holistically across all of these different silos or pillars.

“So decision intelligence is really bringing that level of insights and using a combination of AI and automation to break down these barriers and really look at it holistically.”

More people want to engage with data, and more interactions and processes need analytics in order to automate and scale.

Jim Hare: “[Operationalization] is really a couple of things. Organizations are awash in too much data and they’re trying to figure out how to manage all that data to begin with.

“But then they are also trying to figure out, ‘Well, where else can we be using the data? How do we get this information, analyze it and get it in the hands of our users?’ And this requires not just looking at individual technologies or tools but taking a fundamental approach, where you’re really creating this data foundation.

“So that you’re able to handle, absorb and bring an increasing amount of data, organize it and make it useful to those who need to analyze it. And then the second part of it is: now that I’ve analyzed it, who can benefit from having those insights, and how do I contextualize that information for the different roles?”

How Big Data Analytics Is Redefining Bfsi Sector

Big data analytics is proliferating fast in almost all industries, including banking and securities, communications, media and entertainment, healthcare, and education, to name a few. Numerous organizations have included big data analytics as part of their growth strategies, and those businesses serve as role models for others. In this article, we focus specifically on the Banking, Financial Services and Insurance (BFSI) sector.

A few decades ago, banking processes were transformed by IT systems. These days, big data analytics is helping banks and financial businesses stay compliant, which puts them one step ahead of their competitors. Big data analytics helps to monitor enormous datasets to uncover market developments, consumer preferences, data interactions, and other insights that assist the strategic planning of BFSI organizations.

Big data analytics is helping the BFSI sector in the following ways:

1) Predictive Analytics

The past transaction records of any bank or financial institution can be used as an effective input for forecasting and future strategic planning. Big data can help companies track market developments and plan future targets. The analysis can also be interpreted to highlight the risks associated with the day-to-day work of an organization.

2) Faster Data Processing

For businesses with a large, dynamic customer database, traditional data management systems are not fully equipped for the job. The traditional system also struggles to handle the multi-dimensionality of big data. By switching to data analytics platforms, banks can handle gigantic quantities of data seamlessly.

3) Performance Analytics

Banks can customize big data analytics to monitor business and employee performance and then set budgets and employee KPIs based on previous accomplishments. Moreover, they can track the training and education of employees and monitor performance against targets in real time. As a result, banks can make their products more trustworthy to their customers while making maximum use of resources.

4) Fraud and Malicious Attack Protection

Increased use of technology has given rise to plentiful threats for the BFSI sector. Despite stringent security laws globally, organizations face attacks and threats on a regular basis. With big data analytics tools and techniques, banks are now able to recognize unusual patterns and take business actions accordingly. Big data analytics also supports biometrics, which is used to create a unique ID for every new user. At the same time, online transaction encryption is another benefit that is helping the industry.

5) Risk Analysis and Management

The banking industry is full of risk, and every single transaction needs to be watched carefully. Business intelligence (BI) and analytics tools can give banks new understanding of their structures, dealings, clients, and architecture to help them sidestep risks. Banks can evaluate the factors that cause risk when dealing with defaulted borrowers. BI can also make systems transparent so that management can identify internal or external dishonest activities and categorize history to prevent future risk.

6) Customer Analytics

Big data analytics tools and techniques provide the BFSI sector with dynamic and updated statistics on their most lucrative customers. This helps them chart out effective business strategies to attract their customers. Banks can also use evidence-based data to retain top-notch clients and market relevant products to them.

7) Better Compliance Monitoring and Reporting

The government frequently updates its policies and compliance procedures in different industries, and these new standards and rules are implemented periodically. If organizations use traditional ways to keep track of these compliance requirements, it can turn out to be risky. A big data platform can be used to track these developments so that the organization follows all governmental policies and rules.

Future at a Glance

Top 10 Online Big Data Analytics Courses To Enroll In 2023

Build a successful career in data analytics with these top 10 online big data analytics courses

Data analysts shouldn’t be confused with data scientists. Although both data analysts and data scientists work with data, what they do with that data differs. A data analyst helps business leaders with decision-making by finding answers to a set of given questions using data. If you are interested in pursuing this career, these top 10 online big data analytics courses will be perfect for you.

Data Analyst Nanodegree (Udacity)

Udacity’s Data Analyst Nanodegree will teach you all of the knowledge, skills, and tools needed to build a career in big data analytics. In addition to covering both theory and practice, the program also includes regular 1-on-1 mentor calls, an active student community, and one-of-a-kind career support services. This program is best suited for students who have working experience with Python (and in particular NumPy and Pandas) and SQL programming. However, don’t be discouraged if you lack these prerequisites. There’s also a similarly structured beginner-level Nanodegree, “Programming for Data Science “, that is the perfect pick for you if you don’t meet the prerequisites for this program. The beginner-level program covers just what you need: the basics of Python programming from a data science perspective.

Big Data Analytics with Tableau (Pluralsight)

Pluralsight’s Big Data Analytics with Tableau will not only give you a better understanding of big data but will also teach you how to access big data systems using Tableau Software. The course covers topics like big data analytics and how to access and visualize big data with Tableau. The course is taught by Ben Sullins, who has 15 years of industry experience and has offered consulting services to companies like Facebook, LinkedIn, and Cisco. Sullins passes on his knowledge to students through bite-sized chunks of content. As such, students can personalize their learning to suit their individual requirements. You can finish the course in just a day or take your time and complete it over the course of a week or two. This course isn’t necessarily for beginners. You’re expected to have some experience with big data analytics. If you’re a complete beginner, consider taking Data Analysis Fundamentals with Tableau, another course authored by Sullins.

The Data Science Course 2023: Complete Data Science Bootcamp (Udemy)

Available on Udemy, Data Science Course 2023: Complete Data Science Bootcamp is a comprehensive data science course that consists of 471 lectures. The lectures include almost 30 hours of on-demand video, 90 articles, and 154 downloadable resources. While the course is not new, it has been updated for 2023 with new learning materials. As part of this course, students can expect to learn in-demand data science skills, such as Python and its libraries (like Pandas, NumPy, Seaborn, and Matplotlib), machine learning, statistics, and Tableau. Although the course might seem a bit overwhelming at first glance, it’s actually well-structured and requires no prior experience. All you need to get started is access to Microsoft Excel. The course will set you back a few hundred dollars. However, since Udemy runs generous discounts fairly regularly, you can get the course for under $20. Either way, this course is a steal, especially considering that you get full lifetime access to it and any future updates.

Become a Data Analyst (LinkedIn Learning)

This particular path consists of seven courses: Learning big data analytics, Data Fluency: Exploring and Describing Data, Excel Statistics Essential Training: 1, Learning Excel: Data Analysis, Learning Data Visualization, Power BI Essential Training, and Tableau Essential Training (2023.1). Each course varies in length. However, most courses are between two and four hours long, so you can complete the entire path from start to finish in about 24 hours. There are no prerequisites to starting this learning path. In fact, you don’t even need to know what data analysis is. The course begins by defining data analysis before teaching you how to identify, interpret, clean, and visualize data. The curriculum is taught via video by six different instructors, all of whom are experts in the industry. Some courses include quizzes, and every course has a Q&A section where you can ask the lecturer questions about the course. The only downside? There are no hands-on projects.

Big data analytics Bootcamp (Springboard)

Data Analyst with R (DataCamp)

This program features bite-sized learning materials curated by data industry experts. It will help you get your dream job in data analysis regardless of how much free time you have to study. DataCamp’s Data Analyst with R Career Track consists of 19 data science analytics courses handpicked by industry experts to help you start a new career in data science. Since each course is about 4 hours long, the entire track should take about 77 hours to complete. At the end of this track, students should be able to manipulate and analyze data using R.

Big Data Analytics Immersion (Thinkful)

With a customized schedule, 1-on-1 mentorship, and 24/7 support from instructors, this course is as close to personalized learning as you can get. Thinkful’s big data analytics Immersion is an intensive full-time training program. Although one of the more expensive big data analytics courses out there (it costs $12,250), it promises to take you from beginner to expert in just four months. However, students are expected to spend between 50 and 60 hours a week studying. Once you sign up for the course, you receive a customized schedule to help you stay on track. The curriculum consists of seven areas: Excel Foundations, Storytelling with Data, SQL Foundation, Tableau, Business Research, Python Foundations, and Capstone Phase. During the Capstone Phase, students not only get to build a final project but also complete two culture fit interviews.

Data Science Specialization (Coursera)

Data Science Specialization offered by Coursera, together with the prestigious Johns Hopkins University, is a ten-course program that helps you understand the whole data science pipeline at a basic level. Although anyone can sign up for this course, students should have beginner-level experience in Python and some familiarity with regression. The curriculum is taught through videos and complementary readings. Student knowledge is tested via auto-graded practice quizzes and peer-graded assignments. The program culminates with a hands-on project that gives students a chance to create a usable data product.

Business Analytics Specialization (Coursera)

This five-course series aims to teach students how to use big data to make data-driven business decisions in the areas of finance, human resources, marketing, and operations. Created by the Wharton School of the University of Pennsylvania and hosted on Coursera, the Business Analytics Specialization is divided into four discipline-specific courses (customer, operations, people, and accounting analytics). The fifth and final course is dedicated to a Capstone Project designed in conjunction with Yahoo. The Specialization is taught through videos and readings. Your knowledge is tested via compulsory quizzes, and you can also participate in discussion forums. The entire Specialization takes about 40 hours to complete, which means that students can finish the program in just six months if they spend three hours a week learning.

Excel to MySQL: Analytic Techniques for Business Specialization (Coursera)

Easily Compare Multiple Tables In Power Query

Comparing table columns in Excel is a common task. You may need to identify items that are the same, different, or missing from these columns.

In Power Query, table columns are lists and you can compare these lists using table merges. But merging can only be done on two tables at a time. If you need to compare three or more table columns then using List Functions is the way to do it.

It does require some manual M coding, but if the thought of that puts you off, don’t worry: it’s not hard. And if you never get your hands dirty with some coding, you’re going to miss out on the real power of Power Query.

The three examples I’ll show you contain only two lines of code each and do what can take multiple steps using table merges or appends.

Source Data Tables

I’ve got four tables. The first contains the names of some imaginary staff members and an ID code assigned to them.

The other three tables show the names of staff that attended three different days of training that were arranged for them.

My tasks are to work out which staff:

Attended every day of training

Attended at least one day

Attended no days

I’ve already loaded three of these tables into PQ so let’s load the last one.

Who Attended Every Day of Training

I have to find out who appears in every table for Training Days 1, 2 and 3.

If you were using table merges then you’d do an Inner Join on the Training_1 table and the Training_2 table to return only matching rows.

You then have to do another Inner Join on the result of the first join and the Training_3 table.

If you are eagle-eyed, you may have noticed that in the Training_2 table the name agueda jonson is in lower case. So both of these joins have to be Fuzzy Joins set to ignore case so that the comparisons are case-insensitive.

You can do all of these steps in 1 line using the List.Intersect function.

Name the query Attended All Days.

If you remember your set theory, Intersect returns whatever is common between sets, or in this case, our Name columns.

Remember that table columns are lists which is why you can use list functions to do this.

Now in the formula bar type this, and then press Enter.

= List.Intersect( { Training_1[Name], Training_2[Name], Training_3[Name] } , Comparer.OrdinalIgnoreCase )

The result is this list of names

What List.Intersect is Doing

I’m passing two arguments to the function. The first is a list of the columns I want to compare, enclosed in { }.

The columns are of course the Name columns from the three tables of attendees at each day’s training.

The second argument Comparer.OrdinalIgnoreCase tells the function to ignore case in text comparisons. So agueda jonson will look the same as Agueda Jonson.
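If it helps to see the logic outside Power Query, here is a minimal Python sketch of a case-insensitive intersection. It is not M code and not how List.Intersect is implemented. The function name and every sample name other than Agueda Jonson are made up for illustration, duplicates are simply dropped, and Python's casefold() only approximates what Comparer.OrdinalIgnoreCase does.

```python
def intersect_ignore_case(lists):
    """Return names from the first list that appear in every other list, ignoring case."""
    if not lists:
        return []
    first, *rest = lists
    # One set of case-folded names per remaining list, for fast lookups.
    rest_keys = [{name.casefold() for name in other} for other in rest]
    seen = set()
    result = []
    for name in first:
        key = name.casefold()
        # Keep the spelling from the first list, once per person.
        if key not in seen and all(key in keys for keys in rest_keys):
            seen.add(key)
            result.append(name)
    return result


# Hypothetical sample data standing in for the three Name columns.
training_1 = ["Agueda Jonson", "Ben Ortiz", "Cara Lin"]
training_2 = ["agueda jonson", "Cara Lin", "Dev Patel"]
training_3 = ["Agueda Jonson", "Cara Lin", "Ben Ortiz"]

attended_all = intersect_ignore_case([training_1, training_2, training_3])
print(attended_all)  # ['Agueda Jonson', 'Cara Lin']
```

Without the case folding, the lower case agueda jonson in the second list would not match and that person would drop out of the result.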

Filtering the Staff Table

With the list of names of those who attended all days of training, I can use that to filter the Staff table and return all rows associated with those names.

With the new step still selected, type this into the formula bar

= Table.SelectRows( Staff , each List.ContainsAny( {[Name]} , Source ) )

So what’s happening here?

This line of code selects rows from the table called Staff where, for each row, the value in the Name column appears in the list Source.

If you filter rows by using the menus, Power Query uses the same function. I’m just manually calling it here to do what I want.

The result is the Staff table filtered to just the rows showing those people who attended all days of training.
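The same row filter can be sketched in Python, purely as an illustration of the idea. The staff rows, ID codes and the select_rows helper below are hypothetical, and the match is case sensitive, just as the M expression above is when no comparer is supplied.

```python
def select_rows(table, names):
    """Keep only the rows whose Name value appears in the given list of names."""
    wanted = set(names)  # exact, case-sensitive matching
    return [row for row in table if row["Name"] in wanted]


# Hypothetical Staff table: one dict per row.
staff = [
    {"Name": "Agueda Jonson", "ID": "A01"},
    {"Name": "Ben Ortiz", "ID": "A02"},
    {"Name": "Cara Lin", "ID": "A03"},
    {"Name": "Dev Patel", "ID": "A04"},
]

# Stands in for the list produced by the previous step (called Source in the query).
source = ["Agueda Jonson", "Cara Lin"]

attended_all_rows = select_rows(staff, source)
print([row["ID"] for row in attended_all_rows])  # ['A01', 'A03']
```

Because the match is exact, this only works if the names in the list are spelled the same way as in the Staff table.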

Who Attended At Least 1 Day of Training

This involves checking every day and seeing who was there at least once.

You could do this by combining the three tables (appending the queries) and then removing any duplicates.

Or you could just use the List.Union function

= List.Union( { Training_1[Name], Training_2[Name], Training_3[Name] } , Comparer.OrdinalIgnoreCase )

Using Set Theory again, a Union of the Name columns from the Training_1, Training_2 and Training_3 tables, using case insensitive text comparison because of Comparer.OrdinalIgnoreCase, gives a list of those who attended at least one day.
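As a rough Python illustration of that union (again, not M code, with made-up sample names and a hypothetical helper), the sketch below walks through the lists in order and keeps the first spelling it meets for each person.

```python
def union_ignore_case(lists):
    """Combine several lists of names, keeping one entry per person regardless of case."""
    seen = set()
    result = []
    for names in lists:
        for name in names:
            key = name.casefold()
            if key not in seen:
                # The first spelling encountered wins.
                seen.add(key)
                result.append(name)
    return result


# Hypothetical sample data standing in for the three Name columns.
training_1 = ["Agueda Jonson", "Ben Ortiz", "Cara Lin"]
training_2 = ["agueda jonson", "Cara Lin", "Dev Patel"]
training_3 = ["Agueda Jonson", "Cara Lin", "Ben Ortiz"]

attended_any = union_ignore_case([training_1, training_2, training_3])
print(attended_any)  # ['Agueda Jonson', 'Ben Ortiz', 'Cara Lin', 'Dev Patel']
```

This is the append-then-remove-duplicates approach in a single pass, with the duplicate check done on case-folded names.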

If you want to, you can then filter the Staff table exactly the same way as before using the Table.SelectRows function.

Who Attended No Training Days

To work this out you need to start with the full list of staff and then remove the names of those who attended on each of the 3 days.

Using table merges you can do this with Left Anti Joins, but you need to do 3 separate joins to get the final list.

Or you can do it in one line with List.Difference

= List.Difference( Staff[Name], List.Union( { Training_1[Name], Training_2[Name], Training_3[Name] } , Comparer.OrdinalIgnoreCase ) )

Filtering the Staff table using these names gives

Tip – Quick Sanity Check

The number of people who attended at least 1 day (16) added to the number of people who attended no days (3) should equal the total number of staff (19).
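Here is a small Python sketch of the difference step and the sanity check together. The data and the difference_ignore_case helper are hypothetical, and unlike the List.Difference call above, which is given no comparer, this sketch ignores case on both sides.

```python
def difference_ignore_case(full, to_remove):
    """Return names from `full` that do not appear in `to_remove`, ignoring case."""
    remove_keys = {name.casefold() for name in to_remove}
    return [name for name in full if name.casefold() not in remove_keys]


# Hypothetical data: the full staff list and those who attended at least one day.
staff_names = ["Agueda Jonson", "Ben Ortiz", "Cara Lin", "Dev Patel", "Elif Kaya"]
attended_any = ["Agueda Jonson", "Ben Ortiz", "Cara Lin", "Dev Patel"]

attended_none = difference_ignore_case(staff_names, attended_any)
print(attended_none)  # ['Elif Kaya']

# Sanity check: attendees plus non-attendees should account for every staff member.
assert len(attended_any) + len(attended_none) == len(staff_names)
```

The check only holds if everyone in the training lists is also in the Staff table and nobody is listed twice under different spellings.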


I’ve shown you how to use List functions to compare lists and filter tables. In the sample file I’ve created I’ve also included queries that do the same job but use table merges and appends. You can have a look for yourself at the different methods.

If you’re not sure about learning M functions like List.Intersect I’d encourage you to read through the Microsoft function definitions to become familiar with them and try them out.

Although I do have a programming background, I didn’t wake up one day and know these M functions. I had to put time in to learn what functions existed and how they worked.

When you first started using Excel did you know how to use VLOOKUP or SUMIF? No, you had to learn them. So if you’re at all hesitant about learning M functions, just dive in and get started.
