Python Zip File With Example
Python allows you to quickly create zip/tar archives.
The following command will zip an entire directory:
shutil.make_archive(output_filename, 'zip', dir_name)
The following command gives you control over the individual files you want to archive:
ZipFile.write(filename)
Here are the steps to create a zip file in Python.
Step 1) To create an archive file from Python, make sure you have your import statement correct and in order. Here the import statement for the archive is from shutil import make_archive
Code Explanation
Import the make_archive function from the shutil module
Use path.split() to separate the directory and the file name from the path to the location of the text file (guru99)
Then we call shutil.make_archive("guru99 archive", "zip", root_dir) to create the archive file, which will be in zip format
We then pass in the root directory of the things we want zipped up, so everything in that directory will be zipped
When you run the code, you can see the archive zip file is created on the right side of the panel.
Now your guru99 archive.zip file will appear in your OS file explorer (Windows Explorer)
Step 4) In Python, we can have more control over the archive since we can define which specific files to include in it. In our case, we will include two files in the archive: "guru99.txt" and "guru99.txt.bak".
Code Explanation
Import the ZipFile class from Python's zipfile module. This module gives full control over creating zip files
We create a new ZipFile with the name "testguru99.zip" opened in write mode ("w")
Creating a new ZipFile object requires passing in the file mode; because it is a file we want to write information into, we open it in write mode
We used variable “newzip” to refer to the zip file we created
Using the write function on the “newzip” variable, we add the files “guru99.txt” and “guru99.txt.bak” to the archive
When you execute the code, you can see the file is created on the right side of the panel with the name "testguru99.zip"
Note: Here we don't give any explicit command to close the file, like newzip.close(), because we use a "with" statement (a context manager); when the program leaves this scope, the file is cleaned up and closed automatically.
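For comparison, here is a minimal sketch of the explicit-close equivalent (assuming the same two files exist in the working directory); the "with" form used above is preferable because the archive gets closed even if an exception occurs.

from zipfile import ZipFile

newzip = ZipFile("testguru99.zip", "w")
try:
    newzip.write("guru99.txt")
    newzip.write("guru99.txt.bak")
finally:
    newzip.close()  # must be closed explicitly when not using a with statement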
Here is the complete code
Python 2 Example
import os
import shutil
from zipfile import ZipFile
from os import path
from shutil import make_archive

def main():
    # Check if file exists
    if path.exists("guru99.txt"):
        # get the path to the file in the current directory
        src = path.realpath("guru99.txt")
        # rename the original file
        os.rename("career.guru99.txt", "guru99.txt")
        # now put things into a ZIP archive
        root_dir, tail = path.split(src)
        shutil.make_archive("guru99 archive", "zip", root_dir)
        # more fine-grained control over ZIP files
        with ZipFile("testguru99.zip", "w") as newzip:
            newzip.write("guru99.txt")
            newzip.write("guru99.txt.bak")

if __name__ == "__main__":
    main()

Python 3 Example
import os
import shutil
from zipfile import ZipFile
from os import path
from shutil import make_archive

# Check if file exists
if path.exists("guru99.txt"):
    # get the path to the file in the current directory
    src = path.realpath("guru99.txt")
    # rename the original file
    os.rename("career.guru99.txt", "guru99.txt")
    # now put things into a ZIP archive
    root_dir, tail = path.split(src)
    shutil.make_archive("guru99 archive", "zip", root_dir)
    # more fine-grained control over ZIP files
    with ZipFile("testguru99.zip", "w") as newzip:
        newzip.write("guru99.txt")
        newzip.write("guru99.txt.bak")

Summary
To zip an entire directory, use the command shutil.make_archive("name", "zip", root_dir)
To select the specific files to zip, use the command ZipFile.write(filename)
How To Create A Zip File Using Python?
ZIP is an archive file format used for lossless data compression. One or more directories or files are used to create a ZIP file. ZIP supports multiple compression algorithms, DEFLATE being the most common, and ZIP files use the .zip extension. In this article we are going to discuss how to create a ZIP file using Python.
Creating uncompressed ZIP file in Python

Using shutil.make_archive to create Zip file

Python has a standard library, shutil, which can be used to create uncompressed ZIP files. This method of creating a ZIP file should be used only to organize multiple files in a single file.
Syntax

Following is the syntax of shutil.make_archive −

shutil.make_archive('output file name', 'zip', 'directory name')

Example

Following is an example to create ZIP file using shutil.make_archive −
import shutil
import os.path

archived = shutil.make_archive('E:/Zipped file', 'zip', 'E:/Folder to be zipped')

if os.path.exists('E:/Zipped file.zip'):
    print(archived)
else:
    print("ZIP file not created")
Output

Following is an output of the above code −

E:/Zipped file.zip

Creating compressed ZIP file in Python

Compressed ZIP files reduce the size of the original directory by applying a compression algorithm. Compressed ZIP files result in faster file sharing over a network as the size of the ZIP file is significantly smaller than the original files.
The zipfile library in Python allows for the creation of compressed ZIP files using different methods. Note that ZipFile stores files uncompressed by default; to actually apply compression, pass compression=zipfile.ZIP_DEFLATED when creating the archive.
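As a minimal illustrative sketch (reusing the example paths from this article), compression can be enabled like this:

import zipfile

# ZIP_DEFLATED applies the DEFLATE algorithm; the default (ZIP_STORED) only stores files
with zipfile.ZipFile('E:/Zipped file.zip', 'w', compression=zipfile.ZIP_DEFLATED) as zip_object:
    zip_object.write('E:/Folder to be zipped/Greetings.txt')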
Creating ZIP file from multiple files

In this method, ZipFile() creates a ZIP file to which the files to be compressed are added. This is achieved by creating a ZipFile object using the with keyword and then writing the files using the .write() method.

Example

Following is an example to create ZIP file using multiple files −
import os
from zipfile import ZipFile

with ZipFile('E:/Zipped file.zip', 'w') as zip_object:
    zip_object.write('E:/Folder to be zipped/Greetings.txt')
    zip_object.write('E:/Folder to be zipped/Introduction.txt')

if os.path.exists('E:/Zipped file.zip'):
    print("ZIP file created")
else:
    print("ZIP file not created")
Output

Following is an output of the above code −

ZIP file created

Creating ZIP file from entire directory

In this method, a for loop is used to traverse the entire directory and then add all the files present in the directory to a ZIP file which is created using ZipFile.

Example

Following is an example to create ZIP file from entire directory −
import os
from zipfile import ZipFile

with ZipFile('E:/Zipped file.zip', 'w') as zip_object:
    for folder_name, sub_folders, file_names in os.walk('E:/Folder to be zipped'):
        for filename in file_names:
            file_path = os.path.join(folder_name, filename)
            zip_object.write(file_path, os.path.basename(file_path))

if os.path.exists('E:/Zipped file.zip'):
    print("ZIP file created")
else:
    print("ZIP file not created")
Output

Following is an output of the above code −

ZIP file created

Creating ZIP file from specific files in a directory

In this method, a lambda function is used to filter the files with specific extensions to be added to the ZIP file. The lambda function is passed as a parameter to a function in which the files are filtered based on their extension.

Example

Following is an example to create ZIP file using specific files in a directory −
import os
from zipfile import ZipFile

def zip_csv(directory_name, zip_file_name, filter):
    with ZipFile(zip_file_name, 'w') as zip_object:
        for folder_name, sub_folders, file_names in os.walk(directory_name):
            for filename in file_names:
                if filter(filename):
                    file_path = os.path.join(folder_name, filename)
                    zip_object.write(file_path, os.path.basename(file_path))

if __name__ == '__main__':
    zip_csv('E:/Folder to be zipped', 'E:/Zipped file.zip', lambda name: 'csv' in name)
    if os.path.exists('E:/Zipped file.zip'):
        print("ZIP file created with only CSV files")
    else:
        print("ZIP file not created")
Output

Following is an output of the above code −

ZIP file created with only CSV files

Sieve Of Eratosthenes Algorithm: Python, C++ Example
The Sieve of Eratosthenes is the simplest prime number sieve. It is a Prime number algorithm to search all the prime numbers in a given limit. There are several prime number sieves. For example- the Sieve of Eratosthenes, Sieve of Atkin, Sieve of Sundaram, etc.
The word “sieve” means a utensil that filters substances. Thus, the sieve algorithm in Python and other languages refers to an algorithm to filter out prime numbers.
This algorithm filters out the prime numbers in an iterative approach. The filtering process starts with the smallest prime number. A prime is a natural number that is greater than 1 and has only two divisors, viz., 1 and the number itself. The numbers that are not primes are called composite numbers.
In the Sieve of Eratosthenes method, a small prime number is selected first, and all of its multiples get filtered out. The process runs in a loop over a given range.
For example:
Let’s take the number range from 2 to 10.
After applying the Sieve of Eratosthenes, it will produce the list of prime numbers 2, 3, 5, 7
Algorithm Sieve of Eratosthenes
Here is the algorithm for the Sieve of Eratosthenes:
Step 1) Create a list of numbers from 2 to the given range n. We start with 2 as it is the smallest and first prime number.
Step 2) Select the smallest number on the list, x (initially x equals 2), traverse through the list, and filter the corresponding composite numbers by marking all the multiples of the selected numbers.
Step 3) Then choose the next prime or the smallest unmarked number on the list and repeat step 2.
Step 4) Repeat the previous step as long as the value of x is less than or equal to the square root of n (x ≤ √n).
Note: The mathematical reasoning is quite simple. The number n can be factorized as
n = a * b
Again, n = √n * √n = (factor smaller than √n) * (factor larger than √n)
So at least one of the two factors must be ≤ √n. Thus, traversing up to √n will be enough.
Step 5) After those four steps, the remaining unmarked numbers would be all the primes on that given range n.
Example:
Let’s take an example and see how it works.
For this example, we will find the list of prime numbers from 2 to 25. So, n=25.
Step 1) In the first step, we will take a list of numbers from 2 to 25 as we selected n=25.
Step 2) Then we select the smallest number on the list, x. Initially x=2 as it is the smallest prime number. Then we traverse through the list and mark the multiples of 2.
The multiples of 2 for the given value of n is: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24.
Note: Blue color denotes the selected number, and pink color denotes the eliminated multiples
Step 3) Then we choose the next smallest unmarked number, which is 3, and repeat the last step by marking the multiples of 3.
Step 4) We repeat step 3 in the same way until x = √25 = 5.
Step 5) The remaining non-marked numbers would be the prime numbers from 2 to 25.
Pseudo-Code

Begin
  Declare a boolean array of size n and initialize it to true
  For all numbers i : from 2 to sqrt(n)
    IF bool value of i is true THEN
      i is prime
      For all multiples of i (i < n)
        mark multiples of i as composite
  Print all unmarked numbers
End

Sieve of Eratosthenes C/C++ code Example

#include <iostream>
#include <cstring>
using namespace std;

void Sieve_Of_Eratosthenes(int n) {
    bool primeNumber[n + 1];
    memset(primeNumber, true, sizeof(primeNumber));
    for (int j = 2; j * j <= n; j++) {
        if (primeNumber[j] == true) {
            for (int k = j * j; k <= n; k += j)
                primeNumber[k] = false;
        }
    }
    for (int i = 2; i <= n; i++)
        if (primeNumber[i])
            cout << i << " ";
}

int main() {
    int n = 25;
    Sieve_Of_Eratosthenes(n);
    return 0;
}
Output:
2 3 5 7 11 13 17 19 23

Sieve of Eratosthenes Python Program Example

def SieveOfEratosthenes(n):
    # Create a boolean array
    primeNumber = [True for i in range(n + 2)]
    i = 2
    while (i * i <= n):
        if (primeNumber[i] == True):
            # Update all multiples of i as false
            for j in range(i * i, n + 1, i):
                primeNumber[j] = False
        i += 1
    for i in range(2, n):
        if primeNumber[i]:
            print(i)

n = 25
SieveOfEratosthenes(n)

Output:
2 3 5 7 11 13 17 19 23

Segmented Sieve

We have seen that the Sieve of Eratosthenes runs a loop through the whole number range. Thus, it needs O(n) memory space to store the numbers. The situation becomes complicated when we try to find primes in a huge range, as it is not feasible to allocate such a large memory space for a bigger n.
The algorithm can be optimized by introducing some new features. The idea is to divide the number range into smaller segments and compute prime numbers in those segments one by one. This is an efficient way to reduce space complexity. This method is called a segmented sieve.
The optimization can be achieved in the following manner:
Use a simple sieve to find the prime numbers from 2 to √n and store them in an array.
Divide the range [0…n-1] into multiple segments of size at most √n.
For every segment, iterate through the segment and mark the multiples of the prime numbers found in step 1. This step requires at most O(√n) memory per segment.
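The following is a minimal illustrative sketch of a segmented sieve in Python (not code from the original article): it first runs a simple sieve for the base primes up to √n and then marks composites one segment of size about √n at a time.

import math

def segmented_sieve(n, segment_size=None):
    """Return all primes up to n using a segmented Sieve of Eratosthenes."""
    limit = int(math.isqrt(n)) + 1
    if segment_size is None:
        segment_size = limit  # segments of size about sqrt(n)

    # Step 1: simple sieve for the base primes up to about sqrt(n)
    base = [True] * (limit + 1)
    base[0:2] = [False, False]
    for i in range(2, int(math.isqrt(limit)) + 1):
        if base[i]:
            for j in range(i * i, limit + 1, i):
                base[j] = False
    base_primes = [i for i, is_p in enumerate(base) if is_p]

    primes = [p for p in base_primes if p <= n]

    # Step 2: process the remaining range in segments
    low = limit + 1
    while low <= n:
        high = min(low + segment_size - 1, n)
        mark = [True] * (high - low + 1)
        for p in base_primes:
            # first multiple of p inside [low, high]
            start = max(p * p, ((low + p - 1) // p) * p)
            for multiple in range(start, high + 1, p):
                mark[multiple - low] = False
        primes.extend(low + i for i, is_p in enumerate(mark) if is_p)
        low = high + 1
    return primes

print(segmented_sieve(25))  # [2, 3, 5, 7, 11, 13, 17, 19, 23]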
The regular sieve requires O(n) auxiliary memory space, whereas the segmented sieve requires O(√n), which is a big improvement for a large n. The method also has a downside: it does not improve the time complexity.

Complexity Analysis
Space Complexity: O(√n) auxiliary space.
Time Complexity:
The time complexity of a regular sieve of eratosthenes algorithm is O(n*log(log(n))). The reasoning behind this complexity is discussed below.
For a given number n, the time required to mark a composite number (i.e., nonprime numbers) is constant. So, the number of times the loop runs is equal to-
n/2 + n/3 + n/5 + n/7 + ……∞
= n * (1/2 + 1/3 + 1/5 + 1/7 +…….∞)
The sum of the reciprocals of the primes can be shown to grow as log(log(n)):
(1/2 + 1/3 + 1/5 + 1/7 + ……) ≈ log(log(n))
So, the time complexity will be-
T(n) = n * (1/2 + 1/3 + 1/5 + 1/7 + ……∞)
= n * log(log(n))
Thus, the time complexity is O(n * log(log(n)))
Next, you’ll learn about Pascal’s Triangle
Summary:
The Sieve of Eratosthenes filters out the prime numbers up to a given upper limit.
Filtering a prime number starts from the smallest prime number, “2”. This is done iteratively.
The iteration is done up to the square root of n, where n is the given number range.
After these iterations, the numbers that remain are the prime numbers.
Binary Search Algorithm With Example
Before we learn Binary search, let’s learn:
What is Search?

Search is a utility that enables its user to find documents, files, media, or any other type of data held inside a database. Search works on the simple principle of matching the criteria with the records and displaying it to the user. In this way, the most basic search function works.
What is Binary Search?

Binary search is a searching technique that works on sorted data: it repeatedly compares the target value with the middle element of the remaining range and halves that range until the element is found or the range is empty. It is also known as a half-interval or logarithmic search.
How Binary Search Works?

The binary search works in the following manner:
The search process initiates by locating the middle element of the sorted array of data
After that, the key value is compared with the element
If the key value is smaller than the middle element, the search continues among the values below the middle element for comparison and matching
If the key value is greater than the middle element, the search continues among the values above the middle element for comparison and matching
Example Binary Search
Let us look at the example of a dictionary. If you need to find a certain word, no one goes through each word in a sequential manner but randomly locates the nearest words to search for the required word.
The example illustrates the following:
You have an array of 10 digits, and the element 59 needs to be found.
All the elements are marked with an index from 0 – 9. Now, the middle of the array is calculated. To do so, you take the sum of the leftmost and rightmost index values and divide it by 2. The result is 4.5, but we take the floor value, hence the middle index is 4.
The algorithm drops all the elements from the middle (4) to the lowest bound because 59 is greater than 24, and now the array is left with 5 elements only.
Now, 59 is greater than 45 and less than 63. The middle is 7. Hence the right index value becomes middle – 1, which equals 6, and the left index value remains the same as before, which is 5.
At this point, you know that 59 comes after 45. Hence, the left index, which is 5, becomes mid as well.
These iterations continue until the array is reduced to only one element, or the item to be found becomes the middle of the array.
Example 2

Let's look at the following example to understand how binary search works.
You have an array of sorted values ranging from 2 to 20 and need to locate 18.
The average of the lower and upper index limits is (l + r) / 2 = 4, so the mid index is 4. The value being searched (18) is greater than the value at the mid index.
The array values at and below the mid index are dropped from the search, and the values above the mid index are searched.
This is a recurrent dividing process until the actual item to be searched is found.
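The following is a minimal iterative binary search sketch in Python (an illustrative implementation, not code from the original article), assuming the input list is already sorted in ascending order:

def binary_search(arr, key):
    """Return the index of key in the sorted list arr, or -1 if not found."""
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2   # floor of the average of the bounds
        if arr[mid] == key:
            return mid
        elif key < arr[mid]:
            right = mid - 1         # continue in the lower half
        else:
            left = mid + 1          # continue in the upper half
    return -1

values = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(binary_search(values, 18))  # prints 8

Each iteration halves the interval [left, right], which is why the running time grows logarithmically with the array size.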
Why Do We Need Binary Search?

The following reasons make the binary search a better choice to be used as a search algorithm:
Binary search works efficiently on sorted data no matter the size of the data
Instead of performing the search by going through the data in a sequence, the binary algorithm randomly accesses the data to find the required element. This makes the search cycles shorter and more accurate.
Binary search compares the sorted data based on an ordering principle rather than using equality comparisons, which are slower and mostly inaccurate.
After every search cycle, the algorithm halves the size of the array; hence, in the next iteration it works only on the remaining half of the array
Next, learn our tutorial on Linear Search: Python, C++ Example
Summary
Binary search is commonly known as a half-interval search or a logarithmic search
It works by dividing the array in half on every iteration until the required element is found.
The binary algorithm takes the middle of the array by dividing the sum of the left and rightmost index values by 2. Now, the algorithm drops either the lower or upper bound of elements from the middle of the array, depending on the element to be found.
The algorithm randomly accesses the data to find the required element. This makes the search cycles shorter and more accurate.
Binary search compares the sorted data based on an ordering principle rather than using equality comparisons, which are slow and inaccurate.
A binary search is not suitable for unsorted data.
Verify File Integrity With Free File Integrity & Checksum Checkers
It is always important to verify whether the large file you downloaded is the file you expected to download. Files may change from the original during download, due to corruption or errors in the download process, and downloaded files may even possess security-compromising features or include unwanted malicious software such as a virus or malware.
Therefore, no matter what the reason is, you should check the integrity of the files first and confirm whether there is any change from the original form. There are various ways to do this. For instance, you can check whether a file has a digital signature, or check its hash value.
MD5 checksum

The MD5 checksum for a file is a 128-bit value which is unique to the file – something like a fingerprint. It can be used to compare files and to verify their integrity.
Read: MD5 Hash Checker Tools for Windows 10.
Check the hash value

Some web pages offering programs to download provide a long code, called an MD5 hash, near the program link. When you apply a cryptographic hash function to the downloaded file, a string value is returned which is only valid for that file in its current state. If you download the same file with changed data (even in only a few places) from some other website and apply this cryptographic function again, you will observe that the value has changed. So, you can easily determine whether the file is untouched or not.
Luckily, there are many programs available for Windows that help you calculate the hash value and check whether the hash value you have matches the file. I am covering some free File Integrity Checkers & File Checksum Integrity Verifier tools that check the integrity of downloaded files in Windows by computing MD5 or SHA1 cryptographic hashes for the files.
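If you prefer to compute the hashes yourself rather than use a dedicated tool, here is a minimal Python sketch using the standard hashlib module (the file name below is only a placeholder):

import hashlib

def file_hashes(path, chunk_size=8192):
    """Compute MD5 and SHA-1 hashes of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

md5_value, sha1_value = file_hashes("downloaded_installer.exe")  # placeholder file name
print("MD5 :", md5_value)
print("SHA1:", sha1_value)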
File Integrity & Checksum Checkers

Here are 5 File Integrity & Checksum Checkers that can help you check the hash value:
IgorWare Hasher
MultiHasher
MD5 & SHA-1 Checksum Utility
Microsoft File Checksum Integrity Verifier
Verify MD5 Checksum online.
1] IgorWare Hasher

IgorWare Hasher is a free SHA-1, MD5 and CRC32 hash generator for Windows. The program can generate a checksum for a single file and verify its integrity by using verification files (.sha, .md5 and .sfv) generated by Total Commander, with support for UTF-8 verification files. It also comes in a portable version and hence requires no installation.
IgorWare Hasher features:
Calculates SHA-1, MD5 and CRC32 hash for single file or text
Supports hash verification files (*.sha, *.md5, *.sfv) compatible with Total and Free Commander
Drag and Drop file support
Supports UTF8 verification files
Includes option to associate hasher with files in Windows Explorer
2] MultiHasher

This ingenious program provides support for up to five hash algorithms including MD5, SHA-1, SHA-256, SHA-384, and SHA-512. Plus, it can calculate one or more hashes for a single file simultaneously.
MultiHasher includes built-in integration with the well-known VirusTotal service, which lets a user know from the VirusTotal database if a file is infected with some kind of virus. When downloaded, the program integrates into Windows Explorer. The program contains no spyware or adware and is 100% freeware.
MultiHasher Features:
Ability to calculate one or more hash values for a single file at once
Ability to calculate hash values for multiple files and text string
Supports the following hash algorithms: CRC32, MD5, RIPEMD-160, SHA-1, SHA-256, SHA-384, SHA-512
Supports hash file verification such as MHX, SFV, MD5Sum, etc.
Unicode support
Localizable UI
Multiple language support
Built-in virus checker
3] MD5 & SHA-1 Checksum Utility

Unlike other programs that you may find bloated, this tool just gives you the hash value with no frills, which is sufficient. MD5 & SHA-1 Checksum Utility is completely freeware, portable, and compatible with Windows XP, Vista and 7.
MD5 & SHA-1 Checksum Utility features:
New and Simple Interface
Support Drag and Drop for File
Share hashes easily via Copy All button.
4] Microsoft File Checksum Integrity Verifier

File Checksum Integrity Verifier (FCIV) is an unsupported command-prompt utility that computes and verifies cryptographic hash values of files. FCIV can compute MD5 or SHA-1 cryptographic hash values, and these values can be displayed on the screen or saved in an XML file database for later use and verification.
TIP: You can verify MD5 checksum of files using the built-in command-line tool Certutil.
5] Verify MD5 Checksum online
Also, have a look at MD5 Check and Marixio File Checksum verifier.
Beginner’s Guide To Automl With An Easy Autogluon Example
This article was published as a part of the Data Science Blogathon
You will find two sections in this guide for easier understanding. The first section deals with the background information on AutoML while the second section covers an end-to-end example use case for AutoGluon – one of the AutoML frameworks.
Follow along this guide to familiarize yourself with the concepts, get to know some existing AutoML frameworks, and try out an example based on AutoGluon. Additionally, find the answers to some interesting questions such as-
Why is AutoML required?
What benefits does AutoML offer over the normal method of selecting conventional ML models?
Who can use AutoML?
Where can it be used?
How does AutoML work?
When to use and when to avoid using AutoML?
Understanding the Basics of AutoML

Need for AutoML

So, beginning with the first question: do we need AutoML, and why? Machine learning algorithms have become increasingly popular in recent years and their success in assisting the decision-making process in a wide range of applications is well-known. A typical machine learning workflow includes the following steps:
Data Collection
Data Preparation
Algorithm selection
Model Training
Model Evaluation
Parameter Tuning
Model Prediction and Interpretation
The figure below shows a generic representation of the steps carried out in a traditional Machine Learning process from Data collection to Model Predictions.
Steps in Traditional Machine Learning (Image Source: Author)
From the above image, we can guess why many businesses find it difficult to implement traditional ML models. This is due to the complexity and the amount of learning involved in implementing the machine learning model. It also requires extensive domain expertise to generate and compare multiple models before selecting the best one. AutoML promises to simplify these challenges.
Thus, in a broader sense, AutoML has been introduced for automating the process of building an entire Machine Learning pipeline, with minimal human intervention.
Benefits of using AutoML

In general, an AutoML model aims to automate all time-consuming operations like the selection of algorithms, writing the code, pipeline development, and hyperparameter tuning, thereby allowing data scientists to focus more on speedily resolving the business challenges at hand. Within the pipeline, the AutoML framework
Considers and selects multiple machine learning algorithms from the available ones like the random forest, k-Nearest Neighbor, SVMs, etc.
Performs data preprocessing steps like missing value imputation, feature scaling, feature selection, etc.
Optimization or the hyperparameter tuning for all of the models
Decides/Tries multiple ways to ensemble or stack the algorithms
Currently available AutoML frameworks

The AutoML technology and the AutoML frameworks are quite new. Currently, there are various AutoML frameworks that can work with a variety of data, available in both open-source and paid versions.
Some popular AutoML packages are:
AutoGluon (2023): This popular AutoML open-source toolkit developed by AWS helps in getting a strong predictive performance in various machine learning and deep learning models on text, image, and tabular data. Installation is supported for the Linux & Mac operating systems whereas Windows is not an officially supported OS for this toolkit.
MLBox (2023): Another well-known open-source Python-based AutoML library is MLBox. However, it is a powerful library that offers three sub-packages related to Pre-processing (to read and pre-process data), Optimization (to test and/or optimize the models) and Prediction (to predict the outcomes on a test dataset). Additionally, it can perform feature selection, hyper-parameter optimization, automatic model selection for classification and regression as well as predicting the target variables for selected models.
AutoWEKA (2013): Auto-WEKA uses a fully automated approach and leverages the Bayesian optimization to select a machine learning algorithm and set its hyperparameters. In other words, it helps in applying the best parameter settings for a given classification or regression task automatically to get a nice model.
Auto-sklearn (2023): This AutoML framework has been developed by Matthias Feurer, et al and according to the official documentation is based on Bayesian optimization, meta-learning, and ensemble construction. Algorithm selection and hyperparameter tuning are automated using this framework. This framework only supports sklearn based models i.e. this is not suitable for graphical models or sequence prediction problems.
Auto-PyTorch (2023): This framework has been developed by the AutoML Groups of the University of Freiburg and Hannover. It is based on the PyTorch deep learning framework and it supports the tabular data of classification and regression. Also, it can be applied to image data for classification.
Autokeras (2023): It is an open-source library used in deep learning for automating tasks. It helps you in getting good neural network models for classification and regression tasks. The AutoML Toolkit has been built on top of the deep learning framework Keras, developed by the Datalab team at Texas A&M University. Since Auto-Keras follows the classic Scikit-Learn API design, the syntax is similar and easy to use.
TPOT (2023): Tree-based Pipeline Optimization Tool (TPOT) is also one of the popular open-source AutoML frameworks which use sci-kit learn library as a part of its ML menu. According to the official documentation, it uses genetic programming to intelligently explore thousands of possible pipelines to find out a top-performing model pipeline for a given dataset. It is important to note here that TPOT does not perform any pre-processing of the dataset and hence, expects the fed dataset to be clean. However, it can perform feature processing, model selection, and hyperparameter optimization to return the best-performing model. It is a good choice for regression and classification problems but it is not suitable for NLP.
H2O AutoML (2023): The H2O 3 AutoML framework is an open-source toolkit best suited to both traditional neural networks and machine learning models. It can be used to automate the machine learning workflow i.e. model training and hyperparameter tuning of models within a specified time duration. It can perform data preprocessing, model selection, and hyperparameter tuning. Additionally, it returns with a leaderboard view of the model along with its performance. However, it is important to note that a java runtime environment is required since H2O AutoML has been developed in java.
MLJAR (2023): This toolkit works with tabular datasets and offers transparency to the users in each step of the AutoML training. All information about the trained models is saved in the hard drive and is accessible to the ML user.
There are some more AutoML libraries that are not mentioned in the article like –
Open-source: AdaNet, TransmogrifAI, Azure Machine Learning, Ludwig
Commercially available: Darwin, DataRobot, Google AutoML
Who can use AutoML and where can it be used?

In a broader sense, the idea is to reduce the need for human interaction with the help of AutoML. During training, AutoML focuses on optimizing not only the model weights but also the architecture. The goal is to automate the process of selecting an architecture, which is currently done by experienced Data Scientists. AutoML will perform all the above-mentioned tasks without asking many questions, and in a shorter time, thus making machine learning tasks easier.

Automated Machine Learning offers different processes and techniques to make machine learning easily available and simple for non-machine-learning experts. That is why most of the AutoML frameworks mentioned earlier in this guide have been developed by the tech giants for the greater good. Thus, AutoML can be used by Software Engineers to develop applications without needing to know the details of how ML algorithms work, Data Scientists to build ML pipelines in a low-code environment, ML Engineers to speed up their work, and last but not least AI Enthusiasts to explore the capabilities of AutoML.
How does AutoML work?

AutoML frameworks begin by connecting to the provided dataset. It is important that the selected dataset contains enough data to develop a supervised machine learning model for classification or regression. This dataset should include the target variable as well as any other data that will be used as features for the model to use as input for its predictions. It is possible to drop non-relevant attributes when feeding the dataset to the AutoML framework, and users also need to specify the target column when using an AutoML tool.

Once the input dataset has been set up, the AutoML framework produces a data profile similar to the outcome of an EDA. In this data profile we can find descriptive statistics for each variable in the dataset, such as mean, median, quartiles, and so on. As part of this profiling, the AutoML tool determines whether variables are numeric or categorical and counts missing values for each variable.

Next, AutoML tools experiment with multiple models and perform optimization as well. Most hyperparameter tuning begins with some random sampling, so most AutoML tools use a strategy for intelligently refining those samples. Such a trained and optimized model can then be deployed in a production environment using REST APIs.
When to use and when to skip using AutoML?

AutoML performs well for structured data, i.e., when the columns are clearly labeled and the data is well-formatted. As these tools perform imputation and normalization, they can easily handle missing values or skewness in the dataset.
AutoML performs extremely well when we need a quick assessment of the model. Next, small to medium size datasets would obviously be trained quicker using AutoML as compared to larger datasets. Choosing AutoML for larger and complex datasets might be a tough choice as it can prove expensive due to the use of more resources or it could be extremely slow due to multiple experiments for hyperparameter tuning and model optimization.
Steps covered in an AutoML framework (Image Source: Author)
Section II: End-to-End AutoML example using AutoGluon

In this section of the guide, we will explore AutoGluon's different features which automate the machine learning tasks. Additionally, we will experience the implementation of tabular prediction using AutoGluon along with the other prediction categories it supports. Further, we will try to find out how to get the best suitable model for a particular machine learning task when using AutoGluon. Let's start exploring this AutoML tool.
AutoGluon Logo (Image Source: Official Website)
The following are the benefits of using the ‘AutoGluon’ library:
Simplicity: Training of classification and regression models and deployment can be achieved with a few lines of code.
Robustness: Without doing any feature engineering or data manipulation, users should be able to use raw data.
Predictable-timing: Getting the best model under a specified time constraint.
Fault-tolerance: Training can be resumed even if interrupted and the users can inspect all the intermediate steps.
AutoGluon is designed for both beginners and experts in machine learning. Deep learning, automated stack ensembling, and real-world applications for text, image, and tabular data are covered by this tool.
Tool installation requirements

AutoGluon requires Python version 3.6 or higher. Currently, Linux and Mac are the only operating systems that are fully supported. All the details on the installation of AutoGluon and its different versions are here. AutoGluon can be used for the following categories:
Tabular Prediction
Image Prediction
Object Detection
Text Prediction
Multimodal Prediction
Example#1- TabularPrediction (Classification) with AutoGluon

In this example, we will use the Stroke prediction dataset. You can download the dataset from Kaggle.
We start by importing all the necessary packages
Python Code:
from sklearn.model_selection import train_test_split  # splitting the dataset
from autogluon.tabular import TabularDataset, TabularPredictor  # to handle tabular data and train models
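The snippets below assume the downloaded Kaggle CSV has already been read into a pandas DataFrame named df; the file name used here is only a placeholder for whatever the downloaded file is called:

import pandas as pd

# placeholder file name for the downloaded Kaggle stroke dataset
df = pd.read_csv("healthcare-dataset-stroke-data.csv")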
Next, we split the dataset into train and test sets.

# split into train and test sets
df_train, df_test = train_test_split(df, test_size=0.33, random_state=1)
df_train.shape, df_test.shape
df.head()
We need to drop the outcome column from the newly created test set
test_data = df_test.drop(['stroke'], axis=1)
test_data.head()
Now, we build a predictor to train for classifying whether an individual with a given set of conditions will probably be at risk of a stroke. For this, we specify the outcome column as ‘stroke’ and ask the predictor to fit the algorithms on the train dataset. Arguments (optional) ‘verbosity=2’ will display all the steps the predictor is taking to arrive at the best model while ‘presets= best quality’ will ensure that the best model is selected from the trained ones. There are other additional arguments mentioned in the official documentation which can be used to fine-tune the model.
predictor = TabularPredictor(label='stroke').fit(train_data=df_train, verbosity=2, presets='best_quality')

Let us scroll through the log from the above command. (As the log is quite long, only the required snippets have been included in this section.)
We can notice that even though we did not specify the type of problem, AutoGluon perfectly understands that this is a binary classification problem based on the two unique labels ‘0’ & ‘1’ in the outcome column.
Further, we can also see that AutoGluon aptly selects the ‘accuracy’ metric for this classification task.
Once the classifier training is complete, we can print a summary of the models it has trained using the following command
predictor.fit_summary()
In this case, AutoGluon trained 24 models but we would be more interested to find out which is the best model as selected by AutoGluon. To display this, simply use the leaderboard() command which ranks the trained models in order.
predictor.leaderboard(df_train, silent=True)
Additionally, we can also check for the feature importance using
predictor.feature_importance(data=df_train)
Here we can see that it has identified age and bmi to be the most important factors in the prediction of the outcome.
Next, we feed the test data to the classifier for prediction and we can store it in a DataFrame
y_pred = predictor.predict(test_data)
y_pred = pd.DataFrame(y_pred, columns=['stroke'])
y_pred  # print the DataFrame
To understand the evaluation metric ‘accuracy’, let us print the details for it.
predictor.evaluate(df_test)
Data preprocessing and Feature Engineering were carried out by AutoGluon. The trained model includes cross-validation as well. So, we got the trained classifier at 95% accuracy with just two lines of code (for the classifier to train and predict). Now, that’s impressive! If it were a traditional ML model, we would be spending a long time completing the entire process including EDA, data cleaning as well as coding to set up multiple models. AutoGluon made this quite simple for us.
Example#2- TabularPrediction (Regression) with AutoGluon

Let us try another example to explore how AutoGluon's TabularPrediction handles a regression problem. For this, we will use the 'Boston prices' dataset from the scikit-learn dataset library. We follow the same set of steps from the previous example.
# importing the dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
boston.keys()
print(boston.DESCR)  # use this command to know more about the dataset

Creating a DataFrame from the loaded dataset

# create a dataframe from the dataset
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
df.head()

Append the target column to the dataframe

# adding price column to the dataframe
df['PRICE'] = boston.target
df.head()
Splitting the dataset
# split into train and test sets
df_train, df_test = train_test_split(df, test_size=0.33, random_state=1)
df_train.shape, df_test.shape

We will drop the target column from the test dataset.

test_data = df_test.drop(['PRICE'], axis=1)
test_data.head()

Setup the predictor (regressor)
predictor = TabularPredictor(label='PRICE').fit(train_data=df_train, verbosity=2, presets='best_quality')
predictor.leaderboard(df_train, silent=True)

# Making predictions
y_pred = predictor.predict(test_data)
In this example too, AutoGluon correctly identified the type of problem as Regression based on the dtype=float for the column and the presence of multiple unique values. Next, it also aptly selected the evaluation metric as ‘root_mean_squared_error’
For the regression problem, AutoGluon trained 11 models and recommended kNN (KNeighborsDist_BAG_L1) as the best model followed by XGBoost (XGBoost_BAG_L1).
The syntax for the predictor is the same in both classification and regression problems. AutoGluon TabularPrediction task seems to work nicely on different datasets. To keep the tutorial simple we selected smaller datasets, but it would be interesting to see how it performs when a bigger dataset is used.
Other use cases in AutoGluon

Before we wrap up the guide, let us briefly look at the other available options in AutoGluon.
Image Prediction: Like Tabular prediction, AutoGluon uses a simple ‘fit()’ command for classifying images based on their content which automatically produces high-quality image classification models.
Object Detection: Object detection is an important task in computer vision involving the process of detecting and localizing objects in an image. Here too, AutoGluon gives an option of calling a simple ‘fit()’ command which will automatically generate a high-quality object detection model for identifying the presence and location of objects in images.
Text Prediction: Likewise for the prediction of text data in supervised learning, we can use a simple ‘fit()’ command to automatically generate high-quality text prediction models. Each training example in the data may be a sentence, a short paragraph, some additional numeric/categorical features present in the text. A single call to ‘predictor.fit()’ command can train highly accurate neural networks on the given text dataset where the target values or labels used to predict may be continuous values or individual categories. Even though the TextPredictor is designed for classification and regression tasks only, it can directly be used for other NLP tasks also if the data is properly formatted into a data table. The TextPredictor uses only Transformer neural network models. These are fit to the provided data via transfer learning from a pre-trained list of NLP models like BERT, ALBERT, and ELECTRA. It also allows training on multi-modal data tables which contain text, numeric and categorical columns, and the neural network hyperparameter which can be automatically tuned with Hyperparameter Optimization (HPO).
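The exact API differs between AutoGluon versions, but as a rough, illustrative sketch (assuming the autogluon.text sub-module is installed and that train_df and test_df are pandas DataFrames with a text column and a label column; these names are chosen here only for illustration):

from autogluon.text import TextPredictor  # sub-module path may differ by AutoGluon version

# 'label' is the name of the target column in this hypothetical DataFrame
predictor = TextPredictor(label='label').fit(train_data=train_df)
predictions = predictor.predict(test_df)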
Multimodal Prediction: Multimodal tabular data consisting of text, numeric, and categorical columns can also be handled by AutoGluon. Raw text data is observed as a first-class citizen of data tables in AutoGluon. It can help you train and match a wide variety of models including classical tabular models like LightGBM, RF, CatBoost as well as the pre-trained NLP model-based multimodal network.
Conclusion

It is quite interesting to see the effectiveness of AutoML frameworks. They can be used to reduce the time it takes to create production-ready ML models with remarkable simplicity and efficiency. This expedites the overall ML process, thereby freeing up time for data scientists so that they can focus on solving real-life problems. The biggest benefit of using AutoML could be attributed to its ability to train and test multiple existing machine learning algorithms on a variety of datasets autonomously. Further, it is to be noted that using AutoML does not remove the need for training and some basic understanding of data, data annotation, and the desired outcome. Thus, AutoML's success will likely depend on how soon it is accepted and adopted, and on the tangible benefits it brings to a certain industry. Nevertheless, we can say that AutoML is here to stay.
Author Bio:

Devashree has a degree in Information Technology from Germany and a Data Science background. As an Engineer, she enjoys working with numbers and uncovering hidden insights in diverse datasets from different sectors to build beautiful visualizations and solve interesting real-world machine learning problems.
In her spare time, she loves to cook, read & write, discover new Python-Machine Learning libraries or participate in coding competitions.
You can follow her on LinkedIn, GitHub, Kaggle, Medium, Twitter.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.