Download Financial Dataset Using Yahoo Finance In Python


This article was published as a part of the Data Science Blogathon


This article aims to empower you to create your own projects by learning how to build a data frame, collect stock market and crypto market data from the internet, and base your code on it. This will allow you to create your own ML models and experiment with real-world data.

In this article, I will demonstrate two methods, both of which use Yahoo Finance as the data source, since it is free and requires no registration. You can use any other data source like Quandl, Tiingo, IEX Cloud, and more.

Getting Ready

In the first approach, we will use the yfinance module in Python, which is very easy to work with. The other module we will talk about is yahoofinancials, which requires extra effort but returns a whole lot of extra information. We will discuss that later; for now, we begin by importing the required modules into our code.

Initial Setup:

We need to load the following libraries:

import pandas as pd
import yfinance as yf
from yahoofinancials import YahooFinancials

If you do not have these libraries, you can install them via pip:

!pip install yfinance

!pip install yahoofinancials

First Method: How to use yfinance

It was previously known as 'fix_yahoo_finance', but it later became a module of its own, though not an official one from Yahoo. The 'yfinance' module is now a very popular, Python-friendly library that can be used as a patch to pandas_datareader or as a standalone library. It has many potential uses, and many people use it to download stock prices and crypto prices. Without any further delay, let us execute the following code. We will begin by downloading the stock price of Apple.

Code :
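The code snippet for this step did not survive extraction. As a hedged sketch, the download call described in the surrounding text looks like the following; the date range is hypothetical, and the call is guarded so the script degrades gracefully when yfinance or network access is unavailable:

```python
try:
    import yfinance as yf

    # Hypothetical date range; yf.download also accepts an `interval`
    # argument, which defaults to one day ('1d').
    aapl_df = yf.download('AAPL',
                          start='2019-01-01',
                          end='2021-12-31',
                          progress=False)
except Exception:
    aapl_df = None  # yfinance not installed or no network access
```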

Output :

The data interval defaults to 1 day, but the interval can be explicitly specified with values like 1m, 5m, 15m, 30m, 60m, 1h, 1d, 1wk, 1mo, and more. The above command for downloading the data shows a start and an end date, but you can also simply download the data with the code given below:

Code :

aapl_df = yf.download('AAPL')

Output :

The download function has many parameters, which you can find in the documentation; start and end are among the most commonly used. Since the data was small, the progress bar was set to False, as showing it makes little sense here; it is more useful for high-volume downloads.

We can also download the prices of more than one asset at a time by providing a list of tickers (e.g. ['FB', 'MSFT', 'AAPL']) as the tickers argument. We can also pass the additional argument auto_adjust=True, so that all prices are adjusted for potential corporate actions such as splits.

Apart from the download function, we can also use the Ticker module; you can execute the code below to download the last 5 years of Apple's stock prices.

Code :

ticker = yf.Ticker('AAPL')
aapl_df = ticker.history(period="5y")
aapl_df['Close'].plot(title="APPLE's stock price")

Output :

info – This method returns a JSON-formatted output containing a lot of information about the company, from its full business name, summary, and industry to the exchanges it is listed on, with country and time zone, and more. It also includes the beta coefficient.

recommendations – This method returns a historical list of recommendations made by different analysts regarding the stock, such as buy, sell, and hold calls.

actions – This displays the actions like splits and dividends.

major_holders – This method displays the major holders of the share along with other relevant details.

institutional_holders – This method shows all the institutional holders of a particular share.

calendar – This method shows upcoming events, such as earnings dates, and you can even add these to your Google Calendar through code. Essentially, it shows the important earnings and dividend dates for a company.

If you still want to explore more regarding the working of the functions, you can check out this GitHub repository of yfinance.

Second Method: How to use yahoofinancials?

The second method is to use the yahoofinancials module which is a bit tougher to work with but it provides much more information than yfinance. We will begin by downloading Apple’s stock prices.

To do this, we will first create a YahooFinancials object by passing the 'AAPL' ticker name and then use a variety of methods to get out the required data. Here the returned data is in JSON format, so we apply some beautification to it so that it can be transformed into a DataFrame and displayed properly.

Code :

yahoo_financials = YahooFinancials('AAPL')
data = yahoo_financials.get_historical_price_data(start_date='2024-01-01',
                                                  end_date='2024-12-31',
                                                  time_interval='weekly')
aapl_df = pd.DataFrame(data['AAPL']['prices'])
aapl_df = aapl_df.drop('date', axis=1).set_index('formatted_date')
aapl_df.head()

Output :

On a technical level, obtaining historical stock prices takes a bit longer than with yfinance, mostly due to the larger volume of data returned. Now we move on to some of the important functions of yahoofinancials.

get_stock_quote_type_data() – This method returns a lot of generic information about a stock which is similar to the yfinance info() function. The output is something like this.

get_summary_data() – This method returns a summary of the whole company along with useful data like the beta value, price to book value, and more.

get_stock_earnings_data() – This method returns information on the quarterly and yearly earnings of the company, along with the next date when the company will report its earnings.

get_financial_stmts() – This is another useful method to retrieve the financial statements of a company, which is useful for the analysis of a stock.

get_historical_price_data() – This is a method similar to the download() or Ticker() function to get the prices of stock with start_date, end_date and interval ranges.

The above module can also be used to download data for multiple companies at once, like yfinance, and cryptocurrency data can be downloaded as shown in the following code.

Code :

yahoo_financials = YahooFinancials('BTC-USD')
data = yahoo_financials.get_historical_price_data("2024-07-10", "2024-05-30", "monthly")
btc_df = pd.DataFrame(data['BTC-USD']['prices'])
btc_df = btc_df.drop('date', axis=1).set_index('formatted_date')
btc_df.head()

Output :

For more details about the module, you can check out its GitHub Repository.


Thank you for reading till the end. Hope you are doing well and stay safe and are getting vaccinated soon or already are.

About the Author :

Arnab Mondal

Link to my other articles



Heart Disease Prediction Using Logistic Regression On UCI Dataset




Hi everyone!

In this article, we study, in detail, the hyperparameters, code and libraries used for heart disease prediction using logistic regression on the UCI heart disease dataset.

Importing Libraries

#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Numpy: Numpy is an open-source Python library for handling n-dimensional arrays, written in the C programming language (as is CPython, the reference Python interpreter). Loading Numpy enables the Python interpreter to work with array computing in a fast and efficient manner. Numpy offers implementations of various mathematical functions, algebraic routines and Fourier transforms. It supports different hardware and computing technologies and is well suited for GPU and distributed computing. Its high-level interface makes the various Numpy functionalities easy to use.

Pandas: Pandas is a fast open-source data analysis tool built on top of Python. Pandas allow various data manipulation activities using Pandas DataFrame objects. The different Pandas methods used in this study will be explained in detail later.

Matplotlib: Matplotlib is a Python library that enables plotting publication-quality graphs, static and interactive graphs using Python. Matplotlib plots can be exported to various file formats, can work with third-party packages and can be embedded in Jupyter notebooks. Matplotlib methods used are explained in detail as we encounter them.

Seaborn: Seaborn is a statistical data visualization tool for Python built over Matplotlib. The library enables us to create high-quality visualizations in Python.

Data Exploration and Visualization

dataframe = pd.read_csv('heart_disease_dataset_UCI.csv')

The read_csv method from the Pandas library enables us to read the *.csv (comma-separated value) heart disease dataset published by UCI into the dataframe. The DataFrame object is the primary Pandas data structure: a two-dimensional table with labelled axes, along rows and columns. Various data manipulation operations can be applied to the Pandas dataframe along rows and columns.

The Pandas dataframe head(10) method enables us to get a peek at the top 10 rows of the dataframe. This helps us in gaining an insight into the various columns and an insight into the type and values of data being stored in the dataframe.

The Pandas dataframe info() method provides information on the number of row-entries in the dataframe and the number of columns in the dataframe. Count of non-null entries per column, the data type of each column and the memory usage of the dataframe is also provided.

The Pandas dataframe isna().sum() methods provide the count of null values in each column.
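As a small self-contained illustration (on a made-up dataframe, not the UCI data), these inspection methods behave as follows:

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the heart-disease data
df = pd.DataFrame({
    'age': [63, 37, 41, np.nan],
    'chol': [233, 250, np.nan, 204],
    'target': [1, 1, 0, 0],
})

print(df.head(2))              # peek at the top rows
df.info()                      # row/column counts, dtypes, memory usage
null_counts = df.isna().sum()  # null count per column
print(null_counts)
```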

The Matplotlib.figure API implements the Figure class, which is the top-level class for all plot elements. figsize=(15,10) defines the plot size as 15 inches wide and 10 inches high.

The Seaborn heatmap API provides a colour-encoded plot for 2-D matrix data. The Pandas dataframe corr() method provides the pairwise correlation (movement of two variables in relation to each other) of columns in the dataframe; NA or null values are excluded. The method allows us to find positive and negative correlations, strong and weak, between the various columns and the target variable, which can help in feature selection: weakly correlated features can be neglected, while positive and negative correlations can be used to explain model predictions. Positive correlation implies that as the value of one variable goes up, the value of the other variable also goes up. Negative correlation implies that as the value of one variable goes up, the value of the other variable goes down. Zero correlation implies that there is no linear relationship between the variables. linewidth gives the width of the line that divides each cell in the heatmap. Setting annot to True labels each cell with the corresponding correlation value. cmap defines the mapping of data values to the colorspace.
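A quick sketch of corr() on toy data (values chosen so the relationships are exact) shows the matrix that the heatmap colour-encodes:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],   # perfectly positively correlated with 'a'
    'c': [8, 6, 4, 2],   # perfectly negatively correlated with 'a'
})
corr = df.corr()  # pairwise Pearson correlation of the columns
print(corr)
```

Passing this matrix to sns.heatmap(corr, annot=True) labels each cell with these correlation values.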


The Pandas dataframe hist method plots the histogram of the different columns, with figsize equal to 12 inches wide and 12 inches high.

Standard Scaling

X = dataframe.iloc[:,0:13]
y = dataframe.iloc[:,13]

Next,  we split our dataframe into features (X) and target variable (y) by using the integer-location based indexing ‘iloc’ dataframe property. We select all the rows and the first 13 columns as the X variable and all the rows and the 14th column as the target variable.

X = X.values
y = y.values

We extract and return a Numpy representation of X and y values using the dataframe values property for our machine learning study.
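On a toy dataframe, the feature/target split and conversion works like this (here with 3 feature columns instead of 13):

```python
import pandas as pd

df = pd.DataFrame({'f1': [1, 2], 'f2': [3, 4], 'f3': [5, 6], 'target': [0, 1]})

X = df.iloc[:, 0:3]  # all rows, first 3 columns (features)
y = df.iloc[:, 3]    # all rows, 4th column (target)

X = X.values         # NumPy array of shape (2, 3)
y = y.values         # NumPy array of shape (2,)
```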

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

We use the scikit-learn (sklearn) library for our machine learning studies. The scikit-learn library is an open-source Python library for predictive data analysis and machine learning, built on top of Numpy, SciPy and Matplotlib. The SciPy ecosystem is used for scientific computing and provides optimized modules for linear algebra, calculus, ODE solvers and fast Fourier transforms, among others. The sklearn preprocessing module implements functions for scaling, normalizing and binarizing data. The StandardScaler standardizes the features by making the mean equal to zero and the variance equal to one. The fit_transform() method achieves the dual purpose of (i) the fit() method, fitting a scaling algorithm and finding the parameters for scaling, and (ii) the transform() method, applying the actual scaling transformation using the parameters found in fit(). Many machine learning algorithms are designed on the assumption of normalized/scaled data, and standard scaling is thus one of the methods that help in improving the accuracy of machine learning models.
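What StandardScaler computes can be reproduced in plain NumPy (StandardScaler uses the population standard deviation, i.e. ddof=0):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# z = (x - mean) / std, column-wise -- the same transformation that
# StandardScaler().fit_transform(X) applies
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~0 per column
print(X_std.std(axis=0))   # 1 per column
```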

Train-Test Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.25, random_state=40)

The sklearn model_selection class implements different data splitter classes (split into train and test sets, KFold train and test sets etc.), Hyper-parameter optimizers (search over a grid to find optimal hyperparameters) and model validation functionalities (evaluate the metrics of the cross-validated model etc).

N.B. – KFold (K=10) cross-validation means splitting the train set into 10 parts. 9 parts are used for training while the last part is used for testing. Next, another set of 9 parts (different from the previous set) is used for training while the remaining one part is used for testing. This process is repeated until each part forms one test set. The average of the 10 accuracy scores on 10 test sets is the KFold cross_val_score.

The train_test_split method from the sklearn model_selection class is used to split our features (X) and targets (y) into training and test sets. test_size=0.25 specifies that 25% of the data is to be kept in the test set, while setting random_state=40 ensures that the algorithm generates the same training and test sets every time it is run. Many machine learning algorithms involve randomness, and setting a random_state ensures that the results are reproducible.
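Conceptually, a seeded 75/25 split can be sketched in plain NumPy; train_test_split does essentially this, shuffling X and y together before cutting:

```python
import numpy as np

X = np.arange(100).reshape(100, 1)  # 100 toy samples
y = np.arange(100) % 2

rng = np.random.default_rng(40)     # fixed seed -> reproducible split
perm = rng.permutation(len(X))
n_test = int(len(X) * 0.25)         # 25% held out for testing

test_idx, train_idx = perm[:n_test], perm[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```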

Model Fitting and Prediction

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

lr = LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                        fit_intercept=True, intercept_scaling=1, max_iter=100,
                        multi_class='auto', n_jobs=None, penalty='l2',
                        random_state=1234, solver='lbfgs', tol=0.0001,
                        verbose=0, warm_start=False)
model1 = lr.fit(X_train, y_train)
prediction1 = model1.predict(X_test)
cm = confusion_matrix(y_test, prediction1)
sns.heatmap(cm, annot=True, cmap='winter', linewidths=0.3,
            linecolor='black', annot_kws={"size": 20})
TP = cm[0][0]
TN = cm[1][1]
FN = cm[1][0]
FP = cm[0][1]
print('Testing Accuracy for Logistic Regression:', (TP+TN)/(TP+TN+FN+FP))

The sklearn.metrics module includes score functions, performance metrics and distance metrics among others. The confusion_matrix method provides the accuracy of classification in a matrix format.

The sklearn linear_model class implements a variety of linear models like Linear regression, Logistic regression, Ridge regression, Lasso regression etc. We import the LogisticRegression class for our classification studies. A LogisticRegression object is instantiated.


The parameter C specifies regularization strength. Regularization implies penalizing the model for overfitting. C=1.0 is the default value for LogisticRegressor in the sklearn library.

The class_weight='balanced' setting assigns weights to the classes using the formula n_samples / (n_classes * np.bincount(y)). If unspecified, every class gets a default weight of 1. For example, if n_samples = 100, n_classes = 2 and there are 50 samples in each of the 0 and 1 classes, each class weight = 100/(2*50) = 1.
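The 'balanced' formula is easy to check with np.bincount; note how the minority class receives the larger weight in an imbalanced case:

```python
import numpy as np

# Balanced case: 50 samples in each class -> both weights are 1.0
y_balanced = np.array([0] * 50 + [1] * 50)
w = len(y_balanced) / (2 * np.bincount(y_balanced))
print(w)  # [1. 1.]

# Imbalanced case: 90 vs 10 -> the rare class is weighted up
y_skewed = np.array([0] * 90 + [1] * 10)
w_skewed = len(y_skewed) / (2 * np.bincount(y_skewed))
print(w_skewed)  # approximately [0.556, 5.0]
```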

N.B. The liblinear solver uses a coordinate-descent algorithm instead of gradient descent to find the optimal parameters of the logistic regression model. Gradient descent updates all the parameters at once, while coordinate descent optimizes only one parameter at a time: we first initialize the parameter vector theta = [theta0, theta1, ..., thetan]; in the kth iteration, only theta_i is updated while all the other components are held fixed at their most recent values.

fit_intercept = True The default value is True. Specifies if a constant should be added to the decision function.

intercept_scaling = 1 (default). Applicable only when the solver is liblinear and fit_intercept = True. [X] becomes [X, intercept_scaling]: a synthetic feature with constant value equal to intercept_scaling is appended to [X], and the intercept becomes intercept_scaling * synthetic feature weight. The synthetic feature weight is subject to L1/L2 regularization like all other features; to lessen the effect of regularization on it, a high intercept_scaling value must be chosen.

max_iter = 100 (default). The maximum number of iterations for the solvers to converge.

multi_class = ‘ovr’, ‘multinomial’ or auto(default). auto selects ‘ovr’ i.e. binary problem if the data is binary or if the solver is liblinear. Otherwise auto selects multinomial which minimises the multinomial loss function even when the data is binary.

n_jobs (default = None). The number of CPU cores utilized when parallelizing computations for multi_class='ovr'. None means 1 core is used; -1 means all cores are used. Ignored when the solver is set to liblinear.

penalty: specify the penalty norm (default = L2).

random_state = set random state so that the same results are returned every time the model is run.

solver = the choice of the optimization algorithm (default = ‘lbfgs’)

tol = Tolerance for stopping criteria (default = 1e-4)

verbose = 0 (for suppressing information during the running of the algorithm)

warm_start = (default = False). when set to True, use the solution from the previous step as the initialization for the present step. This is not applicable for the liblinear solver.

Next, we call the fit method on the logistic regressor object using (X_train, y_train) to find the parameters of our logistic regression model. We call the predict method on the logistic regressor object utilizing X_test and the parameters predicted using the fit() method earlier.

We can calculate the confusion matrix to measure the accuracy of the model using the predicted values and y_test.
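The accuracy computation from the confusion matrix reduces to "correct predictions over all predictions". With hypothetical cell counts chosen to reproduce the article's 89.47% (68 correct out of 76 test samples):

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = actual, columns = predicted
cm = np.array([[30, 4],
               [4, 38]])

correct = np.trace(cm)  # diagonal = correctly classified samples
total = cm.sum()
accuracy = correct / total
print(round(accuracy * 100, 2))  # 89.47
```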

The parameters for the sns (seaborn) heatmap have been explained earlier. The linecolor parameter specifies the colour of the lines that will divide each cell. The annot_kws parameter passes keyword arguments to the matplotlib method – fontsize in this case.

Testing accuracy of the logistic regression model on the test samples = 89.47%.



This brings us to the end of the article. In this article, we developed a logistic regression model for heart disease prediction using a dataset from the UCI repository. We focused on gaining an in-depth understanding of the hyperparameters, libraries and code used when defining a logistic regression model through the scikit-learn library.



The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.


Cohort Analysis Using Python For Beginners


After understanding and working with this hands-on tutorial, you will be able to:

Understand what a cohort and cohort analysis are

Handle missing values

Extract the month from a date

Assign a cohort to each transaction

Assign a cohort index to each transaction

Calculate the number of unique customers in each group

Create a cohort table for the retention rate

Visualize the cohort table using a heatmap

Interpret the retention rate

What is Cohort and Cohort Analysis?

A cohort is a collection of users who have something in common. A traditional cohort, for example, divides people by the week or month in which they were first acquired. When referring to non-time-dependent groupings, the term segment is often used instead of cohort.

Cohort analysis is a descriptive analytics technique. Customers are divided into mutually exclusive cohorts, which are then tracked over time. Cohort analysis offers a deeper perspective than vanity metrics: it aids in the interpretation of high-level patterns by supplying metrics around the product and customer lifecycle.

Generally, there are three major types of Cohort:

Time cohorts: customers who signed up for a product or service during a particular time frame.

Behavior cohorts: customers who purchased a product or subscribed to service in the past.

Size cohorts: customers grouped by how much they purchase of the company's products or services.

However, we will be performing Cohort Analysis based on Time. Customers will be divided into acquisition cohorts depending on the month of their first purchase. The cohort index would then be assigned to each of the customer’s purchases, which will represent the number of months since the first transaction.


Finding the percentage of active customers compared to the total number of customers after each month (customer segmentation)

Interpreting the retention rate

Here’s the full code for this tutorial if you would like to follow along as you progress through the tutorial.

Steps involved in Cohort Retention Rate Analysis

1. Data Loading and Cleaning

2. Assign the cohorts and calculate the monthly offset

Step 2.1

Truncate the date object to the needed granularity (here we need the month, so we truncate transaction_date to the month)

Create a groupby object keyed on customer_id

Transform with the min() function to assign each customer's smallest transaction month

The result of this process is the acquisition-month cohort for each customer, i.e. we have assigned an acquisition cohort to each customer.

Step 2.2

Calculate Time offset by extracting integer values of the year, month, and day from a datetime() object.

Calculate the number of months between any transaction and the first transaction for each customer. We will use the TransactionMonth and CohortMonth values to do this.

The result will be the CohortIndex, i.e. the difference between TransactionMonth and CohortMonth in number of months, stored in a column called "CohortIndex".

Step 2.3

Create a groupby object with CohortMonth and CohortIndex.

Count the number of customers in each group by applying the pandas nunique() function.

Reset the index and create a pandas pivot with CohortMonth in the rows, CohortIndex in the columns, and customer_id counts as values.

The result of this will be the table that serves as the basis for calculating the retention rate and other metrics as well.

3. Calculate business metrics: retention rate.

Retention measures how many customers from each of the cohort have returned in the subsequent months.

Using the dataframe called cohort_counts, we select the first column (equal to the total number of customers in each cohort)

Calculate the ratio of how many of these customers came back in the subsequent months.

The result gives a retention rate.

4. Visualizing the  retention rate

5. Interpreting the retention rate

Retention rate monthly cohorts.


Let’s Begin:

Import Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import missingno as msno
from textwrap import wrap

Data loading and cleaning

# Loading dataset
transaction_df = pd.read_excel('transcations.xlsx')

# View data
transaction_df.head()

Checking and working with missing value

# Inspect missing values in the dataset
print(transaction_df.isnull().values.sum())

# Replace the ' 's with NaN
transaction_df = transaction_df.replace(" ", np.nan)

# Impute the missing values with mean imputation
transaction_df = transaction_df.fillna(transaction_df.mean())

# Count the number of NaNs in the dataset to verify
print(transaction_df.isnull().values.sum())

for col in transaction_df.columns:
    # Check if the column is of object type
    if transaction_df[col].dtypes == 'object':
        # Impute with the most frequent value
        transaction_df[col] = transaction_df[col].fillna(transaction_df[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print(transaction_df.isnull().values.sum())

Here, we can see that we have 1542 null values, which we imputed with the mean or the most frequent value depending on the datatype. Now that data cleaning and exploration are complete, we commence the cohort analysis.

Assigning the cohorts and calculating the monthly offset

# A function that truncates a date to a time-based cohort: day 1 of the month
def get_month(x):
    return dt.datetime(x.year, x.month, 1)

# Create a TransactionMonth column from transaction_date, truncated to the month
transaction_df['TransactionMonth'] = transaction_df['transaction_date'].apply(get_month)

# Grouping by customer_id and selecting the TransactionMonth value
grouping = transaction_df.groupby('customer_id')['TransactionMonth']

# Assigning the minimum TransactionMonth value to each customer
transaction_df['CohortMonth'] = grouping.transform('min')

# Printing top 5 rows
print(transaction_df.head())


Calculating time offset in Month as Cohort Index

Calculating the time offset for each transaction allows you to evaluate the metrics for each cohort in a comparable fashion.

First, we will create 6 variables that capture the integer value of years, months, and days for Transaction and Cohort Date using the get_date_int() function.

def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column]
    return year, month, day

# Getting the integers for date parts from the 'TransactionMonth' column
transcation_year, transaction_month, _ = get_date_int(transaction_df, 'TransactionMonth')

# Getting the integers for date parts from the 'CohortMonth' column
cohort_year, cohort_month, _ = get_date_int(transaction_df, 'CohortMonth')

Now we will calculate the difference between the transaction dates and cohort dates in years and months separately, then calculate the total difference in months between the two. This will be our month offset, or cohort index, which we will use in the next section to calculate the retention rate.

# Get the difference in years
years_diff = transcation_year - cohort_year

# Calculate difference in months
months_diff = transaction_month - cohort_month

# "+1" is added at the end so that the first month is marked as 1
# instead of 0 for easier interpretation
transaction_df['CohortIndex'] = years_diff * 12 + months_diff + 1

print(transaction_df.head(5))

Here, we first create a groupby object with CohortMonth and CohortIndex and store it as grouping.

Then, we call this object, select the customer_id column and count the unique customers.

We store the results as cohort_data. Then, we reset the index before calling the pivot function, to be able to access the columns currently stored as indices.

Finally, we create a pivot table by passing

CohortMonth to the index parameter,

CohortIndex to the columns parameter,

customer_id to the values parameter.

and rounding it up to 1 digit, and see what we get.

# Counting unique active customers from each cohort
grouping = transaction_df.groupby(['CohortMonth', 'CohortIndex'])

# Counting the number of unique customer ids falling in each group of CohortMonth and CohortIndex
cohort_data = grouping['customer_id'].apply(pd.Series.nunique)
cohort_data = cohort_data.reset_index()

# Pivoting the dataframe created above
cohort_counts = cohort_data.pivot(index='CohortMonth',
                                  columns='CohortIndex',
                                  values='customer_id')

# Printing top 5 rows of the dataframe
cohort_data.head()

Calculate business metrics: Retention rate

The percentage of active customers compared to the total number of customers after a specific time interval is called retention rate.

In this section, we calculate the retention count for each CohortMonth paired with each CohortIndex.

Now that we have the count of retained customers for each CohortMonth and CohortIndex, we will calculate the retention rate for each cohort.

We will create a pivot table for this purpose.

cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0)

# Converting the retention rate into a percentage and rounding off
retention.round(3) * 100

The retention rate dataframe represents Customer retained across Cohorts. We can read it as follows:

Index value represents the Cohort

Columns represent the number of months since the current Cohort

For instance: The value at CohortMonth 2024-01-01, CohortIndex 3 is 35.9 and represents 35.9% of customers from cohort 2024-01 were retained in the 3rd Month.

Also, you can see from the retention Rate DataFrame:

The retention rate at the 1st index, i.e. the 1st month, is 100%, as all the customers in a cohort signed up in their first month

The retention rate may increase or decrease in subsequent Indexes.

Values towards the bottom right have a lot of NaN values.
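The whole pipeline above can be condensed into a self-contained sketch on synthetic transactions (hypothetical customers and dates, not the tutorial's dataset):

```python
import pandas as pd

# Three hypothetical customers and their transaction dates
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3, 3, 3],
    'transaction_date': pd.to_datetime([
        '2021-01-05', '2021-02-10', '2021-01-20', '2021-03-15',
        '2021-02-01', '2021-03-03', '2021-04-09']),
})

# Truncate each transaction to the first day of its month
tx['TransactionMonth'] = tx['transaction_date'].apply(lambda d: d.replace(day=1))
# Acquisition cohort = month of each customer's first transaction
tx['CohortMonth'] = tx.groupby('customer_id')['TransactionMonth'].transform('min')
# Month offset (+1 so the acquisition month is index 1)
tx['CohortIndex'] = ((tx['TransactionMonth'].dt.year - tx['CohortMonth'].dt.year) * 12
                     + (tx['TransactionMonth'].dt.month - tx['CohortMonth'].dt.month) + 1)

# Unique customers per (CohortMonth, CohortIndex), pivoted into the cohort table
cohort_counts = (tx.groupby(['CohortMonth', 'CohortIndex'])['customer_id']
                   .nunique()
                   .reset_index()
                   .pivot(index='CohortMonth', columns='CohortIndex', values='customer_id'))

# Divide each row by the cohort size (first column) to get retention rates
retention = cohort_counts.divide(cohort_counts.iloc[:, 0], axis=0)
print(retention)
```

Here the January cohort has two customers, of whom one returns in month 2 and one in month 3, giving a retention of 1.0, 0.5, 0.5 across the three indexes.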

Visualizing the retention rate



Identifying Potential YouTube Influencers Using Python



Don’t push people to where you want to be; meet them where they are.

– Meghan Keaney Anderson, VP Marketing, Hubspot

In the past few years, we have witnessed a drastic transformation in the marketing strategies employed by organizations across the globe. Companies are constantly striving and coming up with innovative ideas to capture the attention of their customers. In today’s day and age of heightened activity across the digital space, it only makes sense that organizations have focussed their budgets and attention more on digital marketing as compared to some of the traditional forms. With several platforms at an organization’s disposal like YouTube, Instagram, Twitter, Facebook, etc., it becomes important to identify the right one that resonates the most with the organization’s target audience.

However, identification of the right platform is never the last step. A robust digital marketing strategy for the respective platform is important to ensure a wider reach, greater engagement, and more conversions. One such strategy that has gained prominence in recent times is collaboration with influencers on a particular platform. It is employed by many startups worldwide: they identify and tie up with influencers who create content related to the same field and have thousands or millions of followers. These influencers are new-age versions of brand ambassadors, and an organization can leverage the trust and authenticity they have built with their huge follower base to market its products.

In this article, we will try and make use of the capabilities of Python and web scraping techniques to identify potential influencers for collaboration on one such online platform – YouTube.


For a hypothetical scenario, let us assume that there is an Indian startup operating in the space of Crypto Trading. The organization is looking for means to reach out to a wider audience and is looking for collaboration opportunities with individuals creating Youtube content related to Cryptocurrencies. They want to collaborate with someone who

1. Creates content specifically related to cryptocurrencies

2. Caters primarily to an Indian audience

3. Has more than 10,000 subscribers

4. Has more than 100,000 views across all videos


Our approach to identifying a potential influencer with the above specifications would be as follows:

1. Search for all the listings on Youtube with different combinations of keywords related to cryptocurrency trading

2. Make a list of all the channels that get listed in our search results

3. For every channel identified, extract all the relevant information and prepare the data

4. Filter the list based on our required criteria mentioned above

WARNING! Please refer to the robots.txt of the respective website before scraping any data. In case the website does not allow scraping of what you want to extract, please send an email to the web administrator before proceeding.

Stage 1 – Search

We will start with importing the necessary libraries

To automatically search for the relevant listing and extract the details, we will use Selenium

import selenium
from selenium import webdriver as wb
from time import sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

For basic data wrangling, format conversion, and cleaning, we will use pandas and time

import pandas as pd
import time

We first create a variable called ‘keywords’, to which we assign all the search keywords we will be using on YouTube in the form of a list.

keywords = ['crypto trading','cryptocurrency','cryptocurrency trading']
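Since a YouTube search URL encodes spaces as ‘+’, these keywords map directly onto search URLs. A small sketch (the URL pattern below is the standard YouTube results endpoint, used here for illustration):

```python
keywords = ['crypto trading', 'cryptocurrency', 'cryptocurrency trading']

# Replace spaces with '+' and prepend the YouTube search endpoint
search_urls = ['https://www.youtube.com/results?search_query=' + k.replace(' ', '+')
               for k in keywords]
```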

Next, we would want to open YouTube using chromedriver and search for the above keywords in the search bar.

The following piece of code will present all the listings related to our keywords on the platform. However, the initial page shows just 3-4 results, and to load more results we have to scroll down. How many results we want is entirely up to us; accordingly, we can set the number of page-downs to perform for each search term. We will incorporate this into our script and, for our purpose, scroll down 500 times for each search result.

driver = wb.Chrome()  # assumes chromedriver is available on your PATH

for i in [i.replace(' ', '+') for i in keywords]:
    # open the search results page for the current keyword
    driver.get('https://www.youtube.com/results?search_query=' + i)
    time.sleep(5)
    driver.set_window_size(1024, 600)
    driver.maximize_window()
    elem = driver.find_element_by_tag_name("body")
    no_of_pagedowns = 500
    while no_of_pagedowns:
        elem.send_keys(Keys.PAGE_DOWN)
        time.sleep(0.2)
        no_of_pagedowns -= 1

For all the loaded video results, we will try and extract the details of the profile from which they have been uploaded. For this, we will extract the channel link and create a dataframe, df, where we map the channel link with the respective keyword that generated that channel’s listing. Since a search term can possibly list multiple videos from the same Youtube channel, we will delete all such duplicate entries and keep only unique channel names that appear with each keyword search. We finally have a list of 64 unique YouTube channels at our disposal.
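To illustrate the deduplication step in isolation, here is a toy sketch with made-up channel links:

```python
import pandas as pd

# Hypothetical scrape results: the same channel can surface under several videos
profile = pd.Series(['https://youtube.com/@chanA',
                     'https://youtube.com/@chanA',
                     'https://youtube.com/@chanB'])
df = pd.DataFrame({'Channel': profile, 'Keyword': 'crypto trading'})

# Keep only unique (Channel, Keyword) pairs
df = df.drop_duplicates()
```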

profile = [video.get_attribute('href') for video in driver.find_elements_by_xpath('//*[@id="text"]/a')]
profile = pd.Series(profile)
df = pd.DataFrame({'Channel': profile, 'Keyword': str(i)})
df = df.drop_duplicates()
df.head()

Stage 2 – Data Extraction

For every channel that we obtained above, we would next go to the respective channel’s page link. Each channel’s page has 6 sections – Home, Videos, Playlists, Community, Channels, and About.

Among these 6 sections, we are concerned with only the Videos and About sections as they would give us the information we require.

About – This section will help us with the channel description, date of joining, channel views, subscriber count, and base location of the channel’s owner

We will use the following code to extract all this information

df2 = pd.DataFrame()
for i in range(0, len(df)):
    # open the channel page
    driver.get(df.iloc[i]['Channel'])
    time.sleep(3)

    # upload frequency of the last three videos (from the Videos section)
    frequency = [x.text for x in driver.find_elements_by_xpath('//*[@id="metadata-line"]/span[2]')][0:3]
    frequency = pd.Series(frequency)
    frequency = ', '.join(map(str, frequency))

    # channel name
    name = driver.find_element_by_xpath('//*[@id="channel-name"]').text
    name = pd.Series(name)
    time.sleep(3)

    # channel description (from the About section)
    desc = driver.find_element_by_xpath('//*[@id="description-container"]').text
    desc = desc.split('\n')[1:]
    desc = ' '.join(map(str, desc))
    desc = pd.Series(desc)

    # date of joining
    DOJ = driver.find_element_by_xpath('//*[@id="right-column"]/yt-formatted-string[2]/span[2]').text
    DOJ = pd.Series(DOJ)

    # total channel views
    views = driver.find_element_by_xpath('//*[@id="right-column"]/yt-formatted-string[3]').text
    views = pd.Series(views)

    # subscriber count
    subs = driver.find_element_by_xpath('//*[@id="subscriber-count"]').text
    subs = pd.Series(subs)

    # base location
    location = driver.find_element_by_xpath('//*[@id="details-container"]/table/tbody/tr[2]/td[2]/yt-formatted-string').text
    location = pd.Series(location)

    link = pd.Series(df.iloc[i]['Channel'])
    tbl = pd.DataFrame({'Channel': name, 'Description': desc, 'DOJ': DOJ,
                        'Last_3_Vids_Freq': frequency, 'Views': views,
                        'Subscribers': subs, 'Location': location, 'Channel Link': link})
    df2 = df2.append(tbl)

# convert the Views and Subscribers columns to numbers
df2['Views'] = pd.to_numeric(df2['Views'].str.replace('views', '').str.replace(',', ''))
df2['Subscribers'] = (df2['Subscribers'].str.replace('subscribers', '')
                      .replace({'K': '*1e3', 'M': '*1e6'}, regex=True)
                      .map(pd.eval).astype(int))
df2.head()

Stage 3 – Data Filtering
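The ‘K’/‘M’ suffix conversion used for the subscriber counts deserves a closer look. A standalone sketch with made-up subscriber strings:

```python
import pandas as pd

subs = pd.Series(['12.5K subscribers', '1M subscribers', '900 subscribers'])

# Strip the label, turn 'K'/'M' into multiplication expressions,
# then evaluate each expression and cast to integers
cleaned = (subs.str.replace('subscribers', '', regex=False)
               .replace({'K': '*1e3', 'M': '*1e6'}, regex=True)
               .map(pd.eval)
               .astype(int))
```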

Until now, we have obtained a list of 64 YouTube channels that showed up among the results when we searched using our identified keywords. As we can see from the sample output, not all will be of use to us. We will go back to our list of criteria, based on which we would like to collaborate with a content creator and filter our data accordingly. The criteria were as follows

1. Creates content specifically related to cryptocurrencies

2. Caters primarily to an Indian audience

3. Has more than 10,000 subscribers

4. Has more than 100,000 views across all videos

We will first filter the list down to channels that specifically create content related to cryptocurrencies

df2 = df2[df2['Description'].str.contains('crypto',case=False)]

Next, we need channels that primarily cater to an Indian audience. For this, we will assume that a channel based in a particular location creates content primarily for the audience of that same location. Thus, we will keep only channels whose location is ‘India.’

df2 = df2[df2['Location'].str.contains('India',case=False)]

We will use our defined thresholds to further filter out the channels that meet our criteria regarding views count and subscribers.
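A sketch of that threshold filter on a toy table (the channel names and numbers below are made up for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({'Channel': ['A', 'B', 'C'],
                    'Subscribers': [25000, 8000, 50000],
                    'Views': [300000, 500000, 90000]})

# Keep channels with more than 10,000 subscribers AND more than 100,000 views
df2 = df2[(df2['Subscribers'] > 10000) & (df2['Views'] > 100000)]
```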

We have a curated list of 6 channels that meet all of our criteria – Crypto Crown, Ways2profit, Techno Vas, Crypto Marg, CryptoVel – The Cryptopreneur, and Mayank Kharayat. We can reach out to these six channels for potential collaboration and promote our products through them.


In this article, we used web scraping with Python to identify relevant influencers on YouTube who can be potential brand ambassadors for an organization dealing with cryptocurrencies. The process is very handy for small firms and startups as they look to increase their reach quickly. Collaborating with relevant influencers on digital platforms like YouTube, Instagram, Facebook, etc., can facilitate the process as it enables the organization to leverage the massive follower base of the influencer and widen its reach in a short span of time and at a minimal cost.

Key Takeaways

Collaboration with digital influencers is one of the most effective means to increase audience reach in today’s day and age

Web scraping can be a handy technique to extract a list of relevant influencers for brands, especially startups

One should carefully understand the audience and the reach of the identified list and, based on one’s preferred criteria, collaborate with the most suitable digital influencers

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


How To Find Common Keys In A List And Dictionary Using Python

In this article, we will learn how to find common keys in a list and dictionary in python.

Methods Used

The following are the various methods to accomplish this task −

Using the ‘in’ operator and List Comprehension

Using set(), intersection() functions

Using keys() function & in operator

Using the Counter() function


Assume we have taken an input dictionary and list. We will find the common elements in the input list and keys of a dictionary using the above methods.

Input

inputDict = {"hello": 2, "all": 4, "welcome": 6, "to": 8, "tutorialspoint": 10}
inputList = ["hello", "tutorialspoint", "python"]

Output

Resultant list: ['hello', 'tutorialspoint']

In the above example, ‘hello‘ and ‘tutorialspoint‘ are the common elements in the input list and keys of a dictionary. Hence they are printed.

Method 1: Using the ‘in’ operator and List Comprehension

List Comprehension

When you wish to build a new list based on the values of an existing list, list comprehension provides a shorter, more concise syntax.

Python ‘in’ keyword

The in keyword works in two ways −

The in keyword is used to determine whether a value exists in a sequence (list, range, string, etc).

It is also used to iterate through a sequence in a for loop
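Both uses of the `in` keyword can be seen in a short sketch:

```python
fruits = ['apple', 'banana', 'cherry']

# 1) Membership test: is a value present in the sequence?
has_apple = 'apple' in fruits

# 2) Iteration: loop over each element of the sequence
upper = []
for f in fruits:
    upper.append(f.upper())
```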

Algorithm (Steps)

Following are the algorithm/steps to be followed to perform the desired task −

Create a variable to store the input dictionary.

Create another variable to store the input list.

Traverse through the input list and check whether any input list element matches the keys of a dictionary using list comprehension.

Print the resultant list.


The following program returns common elements in the input list and dictionary keys using the ‘in’ operator and list comprehension –

# input dictionary
inputDict = {"hello": 2, "all": 4, "welcome": 6, "to": 8, "tutorialspoint": 10}
# printing input dictionary
print("Input dictionary:", inputDict)
# input list
inputList = ["hello", "tutorialspoint", "python"]
# printing input list
print("Input list:", inputList)
# checking whether any input list element matches the keys of a dictionary
outputList = [e for e in inputDict if e in inputList]
# printing the resultant list
print("Resultant list:", outputList)

Output

On executing, the above program will generate the following output –

Input dictionary: {'hello': 2, 'all': 4, 'welcome': 6, 'to': 8, 'tutorialspoint': 10}
Input list: ['hello', 'tutorialspoint', 'python']
Resultant list: ['hello', 'tutorialspoint']

Method 2: Using set(), intersection() functions

set() function − creates a set object. Since set items are unordered, they may appear in a random order. It also removes all duplicates.

intersection() function − A set containing the similarity between two or more sets is what the intersection() method returns.

That is, only items present in both sets (or in all sets, if more than two sets are being compared) are included in the returned set.


The following program returns common elements in the input list and dictionary keys using set() and intersection() –

# input dictionary
inputDict = {"hello": 2, "all": 4, "welcome": 6, "to": 8, "tutorialspoint": 10}
# printing input dictionary
print("Input dictionary:", inputDict)
# input list
inputList = ["hello", "tutorialspoint", "python"]
# printing input list
print("Input list:", inputList)
# Converting the input dictionary and input list to sets and
# getting the common elements using the intersection() function
outputList = set(inputList).intersection(set(inputDict))
# printing the resultant list
print("Resultant list:", outputList)

Output

On executing, the above program will generate the following output –

Input dictionary: {'hello': 2, 'all': 4, 'welcome': 6, 'to': 8, 'tutorialspoint': 10}
Input list: ['hello', 'tutorialspoint', 'python']
Resultant list: {'hello', 'tutorialspoint'}

Method 3: Using keys() function & in operator

keys() function − the dict.keys() method provides a view object that displays a list of all the keys in the dictionary, in order of insertion.


The following program returns common elements in the input list and dictionary keys using the keys() function and in operator–

# input dictionary
inputDict = {"hello": 2, "all": 4, "welcome": 6, "to": 8, "tutorialspoint": 10}
# printing input dictionary
print("Input dictionary:", inputDict)
# input list
inputList = ["hello", "tutorialspoint", "python"]
# printing input list
print("Input list:", inputList)
# empty list for storing the common elements in the list and dictionary keys
outputList = []
# getting the list of keys of a dictionary
keysList = list(inputDict.keys())
# traversing through the keys list
for k in keysList:
    # checking whether the current key is present in the input list
    if k in inputList:
        # appending that key to the output list
        outputList.append(k)
# printing the resultant list
print("Resultant list:", outputList)

Output

Input dictionary: {'hello': 2, 'all': 4, 'welcome': 6, 'to': 8, 'tutorialspoint': 10}
Input list: ['hello', 'tutorialspoint', 'python']
Resultant list: ['hello', 'tutorialspoint']

Method 4: Using the Counter() function

Counter() function − a dict subclass that counts hashable objects. When invoked on an iterable, it implicitly builds a hash table of element frequencies.

Here the Counter() function is used to get the frequency of input list elements.


The following program returns common elements in the input list and dictionary keys using the Counter() function –

# importing the Counter function from the collections module
from collections import Counter
# input dictionary
inputDict = {"hello": 2, "all": 4, "welcome": 6, "to": 8, "tutorialspoint": 10}
# printing input dictionary
print("Input dictionary:", inputDict)
# input list
inputList = ["hello", "tutorialspoint", "python"]
# printing input list
print("Input list:", inputList)
# getting the frequency of input list elements as a dictionary
frequency = Counter(inputList)
# empty list for storing the common elements of the list and dictionary keys
outputList = []
# getting the list of keys of a dictionary
keysList = list(inputDict.keys())
# traversing through the keys list
for k in keysList:
    # checking whether the current key is present in the input list
    if k in frequency.keys():
        # appending/adding that key to the output list
        outputList.append(k)
# printing the resultant list
print("Resultant list:", outputList)

Output

Input dictionary: {'hello': 2, 'all': 4, 'welcome': 6, 'to': 8, 'tutorialspoint': 10}
Input list: ['hello', 'tutorialspoint', 'python']
Resultant list: ['hello', 'tutorialspoint']

Conclusion

In this article, we studied four different methods for finding the common keys in a given list and dictionary. We also learned how to get the frequencies of an iterable’s elements as a dictionary using the Counter() function.

Incorporating Growth Mindset In Personal Finance Classes

When it comes to personal finance education, instructional goals tend to focus on teaching students strategies and providing them with tools to help with money management. Although these resources have their place, helping students develop a growth mindset increases the chances that they will use tools like budgets and investing apps or strategies like financial goal setting in the future.

In 2023, The Decision Lab conducted the Mind Over Money study in partnership with Capital One, which found that a simple change in perspective can make a big difference in one’s financial well-being.

Connecting Mindset and Money Management

Responsible money management begins with the right mindset. One way to help students establish a strong “why” for money management is to surface some of their underlying beliefs about money and then work with them on developing a growth mindset as it relates to finances.  

Helping students understand how to differentiate between a growth mindset and a fixed mindset can be helpful as they start to shift their mindset to positive beliefs about their financial well-being. A fun way to help students practice differentiating between growth and fixed mindsets is to play a game with various statements where students identify which mindset is represented.  

A growth mindset statement could be “I invest my money now so that I can become wealthy later.” A fixed mindset statement could be “I don’t have money to invest, so I’ll never become wealthy.” I’ve presented this activity to both middle and high school students in the form of a flash-card game, and they have a fun time playing together while learning how to differentiate between growth and fixed mindsets.     

Adopting an ‘I can’ Perspective

Even if students have had negative experiences with money management or have observed the adults in their lives make poor financial decisions, the future of their finances begins with adopting a positive “I can” attitude.

One way to help students adopt an “I can” attitude is to encourage them to approach financial challenges by asking the question “How can I?” instead of saying “I can’t.”

Approaching challenges with the question “How can I?” encourages students to think about options and solutions to their problems. You can help students practice this by presenting them with personal finance scenarios where they can work with peers to come up with potential solutions to financial challenges by brainstorming ideas to the question “How can I?” 

After working with a group of high school students on budgeting skills, I presented them with the following scenario: You want to go to a concert with your friends in six weeks. The concert costs $200. You make $50 per week walking your neighbors’ dogs. You save $50 per month, and your monthly expenses include daily lunch from your favorite deli ($100), your cell phone bill ($40), and a YouTube subscription ($10). Your parents are not willing to help you pay for the concert, and you only have $50 saved for an emergency. What are you going to do?  

Students who ask the question “How can I go to the concert?” work together to come up with a plan of action, which includes ideas like walking more dogs to earn more money, getting an after-school job that pays more money, or bringing lunch to school each day instead of buying lunch from the deli.

Some students ask themselves, “How can I pay myself first?” They say that they won’t go to the concert because they need to work on paying themselves first by saving up at least $500 for an emergency fund. 

There are some students who figure out how to pay for the concert as well as contribute to their emergency fund. Giving students an opportunity to work together on these scenarios helps develop their ability to think flexibly and interdependently while learning different ways to approach financial decision-making.

Learning from Past Mistakes

Part of developing a growth mindset is to embrace mistakes as learning opportunities. Taking the time to explore some common mistakes people make with money and engaging students in a conversation where they can suggest better solutions are great ways to get students to learn from failure. Exploring common mistakes also creates a space that invites students to make personal connections with and share their own experiences with money with their peers.  

Current events offer many real-life examples of individuals making bad financial decisions, and these provide great learning opportunities for students. Consider engaging students in a financial news conversation.

During these conversations, present a current events story that focuses on poor financial decision-making, and discuss the lessons that can be learned from the news story. These conversations can be about anything from the latest statistics for how much money Americans have saved for emergencies to the consequences that influencers face when they heavily invest in volatile assets.

Mindset is the foundation upon which healthy financial habits are built. Taking the time to help students understand how embracing a growth mindset can help them approach personal finance with a positive outlook, even when they make mistakes along the way, will increase the chances of their staying the course in the long term.

Financial wellness is a marathon, not a sprint, and only those who believe that they can finish the race make it to the end. The earlier we help students believe in their ability to be good stewards over their finances and provide them with the resources to do so, the better equipped they will be to finish the race well.
