Easily Compare Multiple Tables In Power Query
Comparing table columns in Excel is a common task. You may need to identify items that are the same, different, or missing from these columns.
In Power Query, table columns are lists and you can compare these lists using table merges. But merging can only be done on two tables at a time. If you need to compare three or more table columns then using List Functions is the way to do it.
It does require some manual M coding, but if the thought of that puts you off, don't worry, it's not hard. And if you never get your hands dirty with some coding, you're going to miss out on the real power of Power Query.
The three examples I’ll show you contain only two lines of code each and do what can take multiple lines using table merges or appends.
Download Sample Excel Workbook
Excel Workbook. Note: this is a .xlsx file; please ensure your browser doesn't change the file extension on download.
Source Data Tables
I’ve got four tables. The first contains the names of some imaginary staff members and an ID code assigned to them.
The other three tables show the names of staff that attended three different days of training that were arranged for them.
My tasks are to work out what staff
Attended every day of training
Attended at least one day
Attended no days
I’ve already loaded three of these tables into PQ so let’s load the last one.
Who Attended Every Day of Training
I have to find out who appears in every table for Training Days 1, 2 and 3.
If you were using table merges then you’d do an Inner Join on the Training_1 table and the Training_2 table to return only matching rows.
You then have to do another Inner Join on the result of the first join and the Training_3 table.
If you are eagle-eyed you may have noticed that in the Training_2 table the name agueda jonson is lower case. So both of these joins have to be Fuzzy Joins set to ignore case, so that the comparisons are case insensitive.
You can do all of these steps in one line using the List.Intersect function.
Name the query Attended All Days.
If you remember your set theory, Intersect returns whatever is common between sets, or in this case, our Name columns.
Remember that table columns are lists which is why you can use list functions to do this.
Now in the formula bar type this, and then press Enter.
= List.Intersect( { Training_1[Name], Training_2[Name], Training_3[Name] } , Comparer.OrdinalIgnoreCase )
The result is this list of names
What List.Intersect is Doing
I'm passing two arguments to the function. The first is the list of columns I want to compare, enclosed in { }.
The columns are of course the Name columns from the three tables of attendees at each day's training.
The second argument Comparer.OrdinalIgnoreCase tells the function to ignore case in text comparisons. So agueda jonson will look the same as Agueda Jonson.
Filtering the Staff Table
With the list of names of those who attended all days of training, I can use that to filter the Staff table and return all rows associated with those names.
With the new step still selected, type this into the formula bar
= Table.SelectRows( Staff , each List.ContainsAny( {[Name]} , Source ) )
So what’s happening here?
This line of code selects rows from the table called Staff, keeping each row whose value in the Name column appears in the list Source.
If you filter rows by using the menus, Power Query uses the same function. I’m just manually calling it here to do what I want.
The result is the Staff table filtered to just the rows showing those people who attended all days of training.
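Putting the two steps together, the whole query looks something like this. It's a minimal sketch using the query and column names from above; the step name AttendedAllDays is just a label I've chosen:

let
    // Names common to all three training day tables, ignoring case
    Source = List.Intersect(
        { Training_1[Name], Training_2[Name], Training_3[Name] },
        Comparer.OrdinalIgnoreCase ),
    // Keep only the Staff rows whose Name appears in the Source list
    AttendedAllDays = Table.SelectRows( Staff, each List.ContainsAny( {[Name]}, Source ) )
in
    AttendedAllDays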
Who Attended At Least 1 Day of Training
This involves checking every day and seeing who was there at least once.
You could do this by combining the three tables (appending the queries) and then removing any duplicates.
Or you could just use the List.Union function
= List.Union( { Training_1[Name], Training_2[Name], Training_3[Name] } , Comparer.OrdinalIgnoreCase )
Using Set Theory again, a Union of the Name columns from the Training_1, Training_2 and Training_3 tables, using case insensitive text comparison because of Comparer.OrdinalIgnoreCase, gives a list of those who attended at least one day.
If you want to, you can then filter the Staff table exactly the same way as before using the Table.SelectRows function.
Who Attended No Training Days
To work this out you need to start with the full list of staff and then remove the names of those who attended on each of the 3 days.
Using table merges you can do this with Left Anti Joins, but you need to do 3 separate joins to get the final list.
Or you can do it in one line with List.Difference
= List.Difference( Staff[Name], List.Union( { Training_1[Name], Training_2[Name], Training_3[Name] } , Comparer.OrdinalIgnoreCase ) )
Filtering the Staff table using these names gives the rows for the staff who attended no training days.
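Again as a sketch, assuming the same table names, the complete query could look like this:

let
    // Staff names minus everyone who attended at least one day
    Source = List.Difference(
        Staff[Name],
        List.Union( { Training_1[Name], Training_2[Name], Training_3[Name] },
            Comparer.OrdinalIgnoreCase ) ),
    // Filter Staff down to the names that attended no days
    AttendedNoDays = Table.SelectRows( Staff, each List.ContainsAny( {[Name]}, Source ) )
in
    AttendedNoDays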
Tip – Quick Sanity Check
The number of people who attended at least 1 day (16) added to the number of people who attended no days (3) should equal the total number of staff (19).
Conclusion
I’ve shown you how to use List functions to compare lists and filter tables. In the sample file I’ve created I’ve also included queries that do the same job but use table merges and appends. You can have a look for yourself at the different methods.
If you’re not sure about learning M functions like List.Intersect I’d encourage you to read through the Microsoft function definitions to become familiar with them and try them out.
Although I do have a programming background, I didn’t wake up one day and know these M functions. I had to put time in to learn what functions existed and how they worked.
When you first started using Excel did you know how to use VLOOKUP or SUMIF? No, you had to learn them. So if you’re at all hesitant about learning M functions, just dive in and get started.
Learn Hive Query To Unlock The Power Of Big Data Analytics
Introduction
Given the number of large datasets that data engineers handle on a daily basis, it is no surprise that dedicated tools are required to process and analyze such data. Tools like Apache Pig exist for this purpose, but one of the most widely used is Apache Hive, which is built on top of Hadoop.
Apache Hive is a data warehousing system built on top of Apache Hadoop. Using Apache Hive, you can query distributed data storage, including data residing in the Hadoop Distributed File System (HDFS), the file storage system provided by Apache Hadoop. Hive also supports the ACID properties of relational databases with the ORC file format, which is optimized for faster querying. But the real reason behind the prolific use of Hive for working with Big Data is its easy-to-use querying language.
Apache Hive supports the Hive Query Language, or HQL for short. HQL is very similar to SQL, which is the main reason behind its extensive use in the data engineering domain. Not only that, but HQL makes it fairly easy for data engineers to support transactions in Hive. So you can use the familiar insert, update, delete, and merge SQL statements to query table data in Hive. In fact, the simplicity of HQL is one of the reasons why data engineers now use Hive instead of Pig to query Big data.
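As a rough sketch of those familiar statements in HQL, assuming a transactional (ORC, ACID-enabled) table named staff and a hypothetical staging table staff_updates:

-- assumes a transactional table: staff (id INT, name STRING)
INSERT INTO staff VALUES (1, 'Agueda Jonson');

UPDATE staff SET name = 'Agueda J.' WHERE id = 1;

DELETE FROM staff WHERE id = 1;

-- merge changes from a (hypothetical) staging table staff_updates
MERGE INTO staff AS t
USING staff_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name);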
So, in this article, we will be covering the most commonly used queries which you will find useful when querying data in Hive.
Learning Objectives
Get an overview of Apache Hive.
Get familiar with Hive Query Language.
Implement various functions in Hive, like aggregation functions, date functions, etc.
Hive Refresher
Hive is a data warehouse built on top of Apache Hadoop, which is an open-source distributed framework.
Hive architecture contains Hive Client, Hive Services, and Distributed Storage.
The Hive Client layer provides various connectors, like JDBC and ODBC, which allow Hive to support applications written in different programming languages like Java, Python, etc.
Hive Services includes Hive Server, Hive CLI, Hive Driver, and Hive Metastore.
Hive CLI has been replaced by Beeline in HiveServer2.
Hive supports three different types of execution engines – MapReduce, Tez, and Spark.
Hive supports its own command line interface known as Hive CLI, where programmers can directly write the Hive queries.
Hive Metastore maintains the metadata about Hive tables.
Hive metastore can be used with Spark as well for storing the metadata.
Hive supports two types of tables – Managed tables and External tables.
The schema and data for Managed tables are stored in Hive.
In the case of External tables, only the schema is stored by Hive in the Hive metastore.
Hive uses the Hive Query Language (HQL) for querying data.
Using HQL (HiveQL), we can easily implement MapReduce jobs on Hadoop.
Let’s look at some popular Hive queries.
Simple Selects

In Hive, querying data is performed by a SELECT statement. A select statement has six key components:

SELECT column names
FROM table-name
WHERE conditions
GROUP BY column names
HAVING conditions
ORDER BY column names
In practice, very few queries will have all of these clauses, which keeps most queries simple. On the other hand, conditions in the WHERE clause can be very complex, and if you need to JOIN two or more tables together, then more clauses (JOIN and ON) are needed.

All of the clause names above have been written in uppercase for clarity, but HQL is not case-sensitive. Nor do you need to write each clause on a new line, although it is often clearer to do so for all but the simplest of queries.

We will start with the very simple queries and work our way up to the more complex ones.
Simple Selects – Selecting Columns

Amongst all the Hive queries, the simplest is effectively one which returns the contents of the whole table. Following is the syntax to do that:

SELECT * FROM geog_all;

It is better practice, and generally more efficient, to explicitly list the column names that you want returned. This is one of the optimization techniques that you can use while querying in Hive.
SELECT anonid, fueltypes, acorn_type FROM geog_all;

Simple Selects – Selecting Rows

In addition to limiting the columns returned by a query, you can also limit the rows returned. The simplest case is to say how many rows are wanted using the LIMIT clause.

SELECT anonid, fueltypes, acorn_type FROM geog_all LIMIT 10;

This is useful if you just want to get a feel for what the data looks like. Usually, you will want to restrict the rows returned based on some criteria, i.e., certain values or ranges within one or more columns.

SELECT anonid, fueltypes, acorn_type FROM geog_all WHERE fueltypes = "ElecOnly";

The expression in the WHERE clause can be more complex and involve more than one column, as the next example shows. Notice that the columns used in the conditions of the WHERE clause don't have to appear in the SELECT clause. Other operators can also be used in the WHERE clause, and for complex expressions, brackets can be used to enforce precedence.
SELECT anonid, fueltypes, acorn_type, nuts1, ldz FROM geog_all WHERE fueltypes = "ElecOnly" AND acorn_type BETWEEN 42 AND 47 AND (nuts1 NOT IN ("UKM", "UKI") OR ldz = "--");

Creating New Columns

It is possible to create new columns in the output of the query. These columns can be combinations of other columns, using operators and/or built-in Hive functions.

SELECT anonid, eprofileclass, acorn_type, (eprofileclass * acorn_type) AS multiply, (eprofileclass + acorn_type) AS added FROM edrp_geography_data b;

A full list of the operators and functions available within Hive can be found in the documentation.
When you create a new column, it is usual to provide an 'alias' for it. This is essentially the name you wish to give to the new column. The alias is given immediately after the expression to which it refers. Optionally, you can add the AS keyword for clarity. If you do not provide an alias for your new columns, Hive will generate a name for you.

Although the term alias may seem a bit odd for a new column that has no natural name, aliases can also be used with any existing column to provide a more meaningful name in the output.

Tables can also be given an alias. This is particularly common in join queries involving multiple tables, where there is a need to distinguish between columns with the same name in different tables. In addition to using operators to create new columns, there are also many built-in Hive functions that can be used.
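For illustration, a hypothetical join using table aliases might look like this; the join condition g.anonid = e.anon_id is an assumption based on the column names used earlier:

-- g and e are table aliases used to qualify column names
SELECT g.anonid, g.fueltypes, e.eleckwh
FROM geog_all g
JOIN elec_c e ON (g.anonid = e.anon_id);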
Hive Functions

You can use various Hive functions for data analysis purposes. Following are some of the most useful.

Simple Functions

Let's talk about the functions which are popularly used to query columns that contain string data type values.
Concat can be used to add strings together.
SELECT anonid, acorn_category, acorn_group, acorn_type, concat (acorn_category, ",", acorn_group, ",", acorn_type) AS acorn_code FROM geog_all;

substr can be used to extract a part of a string:
SELECT anon_id, advancedatetime, substr (advancedatetime, 1, 9) AS reading_date FROM elec_c;

Examples of length, instr, and reverse:
SELECT anonid, acorn_code, length (acorn_code), instr (acorn_code, ',') AS a_catpos, instr (reverse (acorn_code), ",") AS reverse_a_typepos FROM geog_all;

Where needed, functions can be nested within each other, as the next example of cast and type conversion shows.
SELECT anonid, substr (acorn_code, 7, 2) AS ac_type_string, cast (substr (acorn_code, 7, 2) AS INT) AS ac_type_int, substr (acorn_code, 7, 2) + 1 AS ac_type_not_sure FROM geog_all;

Aggregation Functions

Aggregate functions are used to perform some kind of mathematical or statistical calculation across a group of rows. The rows in each group are determined by the different values in a specified column or columns. A list of all of the available functions is available in the Apache documentation.
SELECT anon_id, count (eleckwh) AS total_row_count, sum (eleckwh) AS total_period_usage, min (eleckwh) AS min_period_usage, avg (eleckwh) AS avg_period_usage, max (eleckwh) AS max_period_usage FROM elec_c GROUP BY anon_id;

In the above example, five aggregations were performed over the eleckwh column, with the rows grouped by anon_id. It is possible to aggregate over multiple columns by specifying them in both the select and the group by clause. The grouping will take place based on the order of the columns listed in the group by clause. What is not allowed is specifying a non-aggregated column in the select clause that is not mentioned in the group by clause.
Unfortunately, the group by clause will not accept aliases, but the order by clause does:

SELECT anon_id, count (eleckwh) AS total_row_count, sum (eleckwh) AS total_period_usage, min (eleckwh) AS min_period_usage, avg (eleckwh) AS avg_period_usage, max (eleckwh) AS max_period_usage FROM elec_c GROUP BY anon_id ORDER BY anon_id, total_period_usage;
The DISTINCT keyword provides the set of unique combinations of column values within a table without any kind of aggregation.
SELECT DISTINCT eprofileclass, fueltypes FROM geog_all;

Date Functions

Hive provides a variety of date-related functions to allow you to convert strings into timestamps and to extract parts of the Timestamp.
unix_timestamp returns the current date and time – as an integer!
from_unixtime takes an integer and converts it into a recognizable Timestamp string
SELECT unix_timestamp () AS currenttime FROM sample_07 LIMIT 1;

SELECT from_unixtime (unix_timestamp ()) AS currenttime FROM sample_07 LIMIT 1;

There are various date part functions that will extract the relevant parts from a Timestamp string.
SELECT anon_id, from_unixtime (unix_timestamp (reading_date, 'ddMMMyy')) AS proper_date, year (from_unixtime (unix_timestamp (reading_date, 'ddMMMyy'))) AS full_year, month (from_unixtime (unix_timestamp (reading_date, 'ddMMMyy'))) AS full_month, day (from_unixtime (unix_timestamp (reading_date, 'ddMMMyy'))) AS full_day, last_day (from_unixtime (unix_timestamp (reading_date, 'ddMMMyy'))) AS last_day_of_month, date_add ((from_unixtime (unix_timestamp (reading_date, 'ddMMMyy'))), 10) AS added_days FROM elec_days_c ORDER BY proper_date;

Conclusion

In this article, we covered some basic Hive functions and queries. We saw that running queries on distributed data is not much different from running queries in MySQL. We covered basic queries like selecting records, working with simple functions, and working with aggregation functions in Hive.
Key Takeaways
Hive Query Language is the language supported by Hive.
HQL makes it easy for developers to query on Big data.
HQL is similar to SQL, making it easy for developers to learn this language.
Frequently Asked Questions

Q1. What queries are used in Hive?

A. Hive supports the Hive Query Language (HQL). HQL is very similar to SQL. It supports the usual insert, update, delete, and merge SQL statements to query data in Hive.
Q2. What are the benefits of Hive?
A. Hive is built on top of Apache Hadoop. This makes it an apt tool for analyzing Big data. It also supports various types of connectors, making it easier for developers to query Hive data using different programming languages.
Q3. What is the difference between Hive and MapReduce?
A. Hive is a data warehousing system that provides SQL-like querying language called HiveQL, while MapReduce is a programming model and software framework used for processing large datasets in a distributed computing environment. Hive also provides a schema for data stored in Hadoop Distributed File System (HDFS), making it easier to manage and analyze large datasets.
Compare Two Directories In Linux?
Introduction
It's quite common to need to compare two directories. Many different situations make us want to find out whether there really is a difference between two directory trees. For example, when something goes wrong, we often want to figure out what has changed since a previous known-good state.
We’ll learn how we can use the command line to perform directory comparisons. There are different methods we can use to compare directory listings. We’ll also see some of the most commonly used commands and their options.
Setup

We'll create some sample directories inside the /temp directory for this tutorial.

Dir1
├── client.log
├── file01
├── file02
├── file03
├── server.log
├── subdir1
│   ├── file11
│   └── file12
├── subdir2
│   ├── file21
│   └── file22
└── subdir3
    ├── file31
    └── file32

Dir2
├── client.log
├── file01
├── file02
├── file03
├── file04
├── server.log
├── subdir1
│   ├── file11
│   └── file12
└── subdir2
    ├── file21
    ├── file22
    └── file23
The samples contain at least two kinds of files: identical ones (same file name, size, and contents) and different ones (different file names, sizes, or contents). That way we can easily compare the results from different comparison tools.
Command Line Utility

We can almost always use the venerable Unix utility diff to see how two files (or directories) differ. The diff utility can compare directories as well as files. It has lots of options, but two are most useful for our case: --brief, which reports only whether files differ, and --recursive, which descends into subdirectories:
diff --brief --recursive Dir1 Dir2
Files Dir1/client.log and Dir2/client.log differ
Files Dir1/file02 and Dir2/file02 differ
Files Dir1/file03 and Dir2/file03 differ
Only in Dir2: file04
Files Dir1/subdir1/file12 and Dir2/subdir1/file12 differ
Files Dir1/subdir2/file22 and Dir2/subdir2/file22 differ
Only in Dir2/subdir2: file23
Only in Dir1: subdir3
Another useful option is --exclude, which lets us filter out entries we're not interested in. To exclude all .log files from the example shown above, we would run the following command:
diff --brief --recursive Dir1 Dir2 --exclude '*.log'
Files Dir1/file02 and Dir2/file02 differ
Files Dir1/file03 and Dir2/file03 differ
Only in Dir2: file04
Files Dir1/subdir1/file12 and Dir2/subdir1/file12 differ
Files Dir1/subdir2/file22 and Dir2/subdir2/file22 differ
Only in Dir2/subdir2: file23
Only in Dir1: subdir3
We should remember that the diff utility compares files by their contents, which can cause a significant delay when comparing large numbers of files.
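If content comparison is too slow, one workaround is to compare the directory listings themselves, looking only at file names and sizes. Here is a rough sketch using bash process substitution and GNU find (the -printf option is GNU-specific):

# List each file's relative path and size, sort, and diff the listings.
# This never reads file contents, so it is fast, but it cannot detect
# same-size files whose contents differ.
diff <(cd Dir1 && find . -type f -printf '%P %s\n' | sort) \
     <(cd Dir2 && find . -type f -printf '%P %s\n' | sort)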
Terminal File Managers

Some terminal file managers also offer directory comparison features. To compare two directories in Midnight Commander, use the Command → Compare directories menu option or the Ctrl+x d keyboard combination. The following options are displayed when it is selected:

The comparison is not recursive, but we can choose from quick, size only, and thorough options, which compare by timestamps, size, and contents respectively.
GUI Approach

GUI diff tools offer a similar choice: we can compare by content or just by timestamps, which can significantly improve the speed of comparisons.
Conclusion

We've seen several different methods for comparing directory contents on Linux.
How To Update Two Tables In One Statement In Sql Server?
Introduction
In SQL Server, you may sometimes need to update data in multiple tables at the same time. This can be done using a single UPDATE statement, which allows you to update multiple tables in a single query.
To update two tables in one statement, you can use the UPDATE statement with a JOIN clause. The JOIN clause allows you to specify a relationship between the two tables that you want to update, based on a common column or set of columns.
Definition

The term "update two tables in one statement" refers to the process of using a single UPDATE statement in SQL Server to update data in two tables at the same time.
In SQL Server, the UPDATE statement is used to modify data in a table. By default, the UPDATE statement updates one table at a time. However, you can use a JOIN clause in the UPDATE statement to update two tables in one statement.
The JOIN clause allows you to specify a relationship between the two tables that you want to update, based on a common column or set of columns. This allows you to update data in both tables at the same time, based on the specified conditions.
For example, you can use an UPDATE statement with a JOIN clause to update the salary for all employees in a certain department, or update the address for all customers in a certain region.
Overall, the concept of updating two tables in one statement is useful when you need to update data in multiple tables at the same time, and the tables have a relationship based on a common column or set of columns. This can help you avoid the need to write multiple UPDATE statements or use other techniques such as cursors or loops.
Syntax

UPDATE table1
SET column1 = value1, column2 = value2, ...
FROM table1
JOIN table2 ON table1.common_column = table2.common_column
WHERE condition;

UPDATE table2
SET column1 = value1, column2 = value2, ...
FROM table1
JOIN table2 ON table1.common_column = table2.common_column
WHERE condition;

This will update both table1 and table2 using the common column specified in the ON clause of the JOIN. The WHERE clause is optional and can be used to specify additional conditions for each update.
Important Points to Consider
Make sure that the two tables have a common column or set of columns that you can use to join the tables. This common column will be used to specify the relationship between the two tables in the JOIN clause of the UPDATE statement.
Use the SET clause to specify the columns and values that you want to update in each table. You can update multiple columns at the same time by separating the assignments with commas.
Use the WHERE clause to specify any additional conditions for the update. This can be used to narrow down the rows that will be updated in each table.
Be careful when updating data in multiple tables at the same time. If you have a mistake in your UPDATE statement, you may end up updating more rows than intended, or updating the wrong values. It is always a good idea to test your UPDATE statement on a test database before applying it to your production database.
If you want to update multiple tables in one statement and the tables do not have a common column, you can use a subquery in the UPDATE statement to achieve the same effect. However, this technique can be more complex and may have worse performance compared to using a JOIN clause.
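For illustration, a hypothetical subquery-based update might look like the sketch below. The column names status and contact_email and the matching condition are assumptions, not part of the examples that follow:

-- Hypothetical: match rows through a subquery instead of a JOIN
UPDATE Table1
SET status = 'active'
WHERE EXISTS (
    SELECT 1
    FROM Table2
    WHERE Table2.contact_email = Table1.contact_email  -- assumed matching condition
);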
Example – 1 SQL Query

UPDATE Table1
SET name = 'John', country = 'USA'
FROM Table1
JOIN Table2 ON Table1.user_id = Table2.user_id
WHERE Table2.department = 'IT';
This UPDATE statement will update the name and country columns in Table1 for all rows that have a department of ‘IT’ in Table2. The JOIN clause specifies the relationship between the two tables based on the user_id column.
Example – 2 SQL Query

UPDATE Table2
SET salary = salary * 1.1
FROM Table1
JOIN Table2 ON Table1.user_id = Table2.user_id
WHERE Table1.country = 'USA';
This UPDATE statement will increase the salary column in Table2 by 10% for all rows that have a country of ‘USA’ in Table1. The JOIN clause specifies the relationship between the two tables based on the user_id column.
Example – 3 SQL Query

UPDATE Table1
SET name = 'John', country = 'USA'
FROM Table1
JOIN Table2 ON Table1.user_id = Table2.user_id
WHERE Table2.department = 'IT';

UPDATE Table2
SET salary = salary * 1.1
FROM Table1
JOIN Table2 ON Table1.user_id = Table2.user_id
WHERE Table1.country = 'USA';
This example combines the two previous examples into a single batch. It will update the name and country columns in Table1 for all rows that have a department of 'IT' in Table2, and increase the salary column in Table2 by 10% for all rows that have a country of 'USA' in Table1. In each statement, the JOIN clause specifies the relationship between the two tables based on the user_id column.
Conclusion

Updating two tables in one batch is useful when you need to modify data in multiple tables at the same time and the tables are related through a common column or set of columns.
How To Create Table In Hive With Query Examples?
Introduction to Hive Table
In Hive, tables consist of columns and rows and store related data in tabular format within the same database. Tables are broadly classified into two types: external tables and internal tables.
The default storage location of a table varies with the Hive version. From HDP 3.0, which uses Hive version 3.0 and later, the default table location changed: external tables are stored under "/warehouse/tablespace/external/hive/" and managed tables under "/warehouse/tablespace/managed/hive".

In older versions of Hive, the default storage location of a table is "/apps/hive/warehouse/".
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [database name.] table name [ROW FORMAT row format] [STORED AS file format]

How to create a Table in Hive?
Hive internal table
Hive external table
Note: Hive also supports "hql" script files. With the help of "hql" files, we can write the entire internal or external table DDL and load the data into the respective table directly.
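As a small sketch of that idea, the DDL and load could be saved in a script and run with hive -f. The script name, columns, and paths here are hypothetical:

-- create_customer.hql (hypothetical script name)
-- run from the shell with: hive -f create_customer.hql
CREATE TABLE IF NOT EXISTS emp.customer (
  id INT,
  first_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- load a CSV that already sits on HDFS (path assumed)
LOAD DATA INPATH '/emp/table1/customer.csv' INTO TABLE emp.customer;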
1. Internal Table

The internal table is also called a managed table and is owned by Hive. Whenever we create a table without specifying the keyword "external", the table is created in the default location.

If we drop an internal (managed) table, the table DDL, metadata information, and table data are lost. Because the table data lives on HDFS, it is removed as well. We should be very careful when dropping any internal or managed table.
DDL Code for Internal Table
create table emp.customer ( id int, first_name string, last_name string, gender string, company_name string, job_title string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n' location "/emp/table1" tblproperties ("skip.header.line.count"="1");

Note: To load the data into a Hive internal or managed table, we are using the "location" keyword in the DDL code. We have kept the CSV file in that same location and load its data into the table.
Output:
2. External Table

The best practice is to create an external table, and many organizations follow this practice. Hive does not manage the data of an external table, and the table is not created in the warehouse directory. We can store the external table data anywhere on HDFS.

External tables also make it possible to recover the data: if we delete or drop an external table, there is no impact on the table data present on HDFS. Dropping the table only removes the metadata associated with it.
DDL Code for External Table
create external table emp.sales ( id int, first_name string, last_name string, gender string, email_id string, city string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n' location "/emp/sales" tblproperties ("skip.header.line.count"="1");

Note: We can also store the external table data on the cloud or on any other remote machine in the network, depending on the requirement.
Output:
How to modify/alter the Table?

Hive gives us the facility to alter or modify the existing attributes of a table. With the help of the "alter" functionality, we can rename the table, add columns, drop columns, change a column's name, and replace columns.
We can alter the below Table attributes.
1. Alter/rename the table name

Syntax:

ALTER TABLE [current table name] RENAME TO [new table name];

Query to alter the table name:

ALTER TABLE customer RENAME TO cust;

Output:
Before alter
After alter
2. Alter/add a column in the table

Syntax:

ALTER TABLE [current table name] ADD COLUMNS (column spec[, col_spec ...]);

Query to add a column:

ALTER TABLE cust ADD COLUMNS (dept STRING COMMENT 'Department');

Output:
Sample view of the table
We are adding a new column, dept (department), to the table.
3. Alter/change the column name

Syntax:

ALTER TABLE [current table name] CHANGE [column name] [new name] [new type];

Query to change a column name:

ALTER TABLE cust CHANGE first_name name string;

Output:
Sample view of the customer table.
How to drop the Table?

Drop Internal or External Table

Syntax:

DROP TABLE [IF EXISTS] table name;

Drop query:

drop table cust;

Output:
Before drop query run
After the drop query runs on the "cust" table.
Conclusion

We have covered the concept of the Hive table with examples, explanations, syntax, and SQL queries with different outputs. Tables are useful for storing structured data, and the table data supports various analysis purposes like BI and reporting, and makes data slicing and dicing easy. Internal tables are managed by Hive, while external tables are not. We can choose the table type we need to create per the requirement.
Building Data Dimensions In Power Bi
In today’s blog post, I want to discuss how you can build additional data dimensions in Power BI.
These dimensions can help you filter your data in different ways. On top of that, these dimensions can help you come up with intuitive visualizations later on.
This is also the reason why you need to set up your data model in the best possible way. As I have mentioned in the previous tutorials, you should separate the lookup tables and the fact tables. This way, you can set up additional data dimensions when needed.
Now, I want to show you the scenarios where you need to add other dimensions.
First, I have this Product Name column in the Products table. The products listed here are filtered by their product names.
However, there will be times when you want to group the products based on revenue or margins. Because of that, you need to add a new column and place it in a table. This is where you need to create additional dimensions that you can use to run intermediary calculations.
If you look at the table more closely, you can see that the only product-related column is the Product Description Index.

So, create a new calculated column in the Products table and name it Product Sales. This column will show the total revenue for each product under the Product Name column.
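Purely as a sketch, assuming the Sales table carries a Total Revenue column and is related to the Products table (neither name is shown in this post), the calculated column could look like this:

Product Sales =
// Context transition: each product's row becomes a filter on Sales,
// so the sum only covers that product's sales rows
CALCULATE ( SUM ( Sales[Total Revenue] ) )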
Now that you have the total revenue, you need to add another dimension for the product groupings.
If the sales are greater than 10 million, you can classify them as Good Clients. If the sales are less than or equal to 10 million, you can classify them as Ok Clients.
Lastly, add BLANK in the last part to close the formula.
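Following that logic, the grouping column might look like the sketch below; the 10 million threshold, the column names, and the BLANK ending all come from the description above:

Client Groups =
IF ( Products[Product Sales] > 10000000, "Good Clients",
    IF ( Products[Product Sales] <= 10000000, "Ok Clients",
        BLANK () ) )  // BLANK closes the formula, as described above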
Since you have created the Client Groups column, you can now create a relationship to the Sales table.
You can now filter any calculation using the additional dimensions in the Products table. Without them, you will need to run your calculation using the thousands of rows in the Sales table.
Additional columns are important because they help you do more efficient calculations. They can also speed up Power BI’s performance compared to doing a calculation inside a huge table.
Another way to deal with your dimensions is to hide irrelevant columns in the report view.
In this example, you need to hide the Product Sales table because you only need to show the different client groups in your report. You should also hide the Index column because it’s not relevant in the report for client groups.
In the report view, you only have to show the Good Clients and Ok Clients data. This means you’ll have to utilize the Client Groups column as a filter and slicer.
In the Client Groups slicer, you can make the report dynamic by selecting either Good Clients or Ok Clients.
Since the other columns are hidden, you can't see them in the report view.

The hidden Index and Product Sales columns were only useful in creating the relationship and the other data dimensions; there's no need to show them in the visualization.
You can apply this technique to other similar scenarios that you will encounter when creating visualizations.
I have discussed a number of data modelling techniques that are important. I hope you get to master these techniques and apply them every time you work inside the data model area.
As I’ve said before, you need to build your data model in the best way possible so you won’t have any problems when doing your calculations.
Cheers!
Sam