bdc.steps package
Subpackages
bdc.steps.helpers package
Submodules
bdc.steps.analyze_emails module
- class bdc.steps.analyze_emails.AnalyzeEmails(force_refresh: bool = False)[source]
Bases:
Step
A pipeline step performing various preprocessing steps on the given email address. The following columns will be added on successful processing:
domain: The custom domain name/website if any
email_valid: Boolean result of email check
first_name_in_account: Boolean, True if the given first name is part of the email account name
last_name_in_account: Boolean, True if the given last name is part of the email account name
- name
Name of this step, used for logging
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing this step
- Type:
list[str]
- Added Columns:
domain (str): The custom domain name/website if any
email_valid (bool): Boolean result of email check
first_name_in_account (bool): True if the given first name is part of the email account name
last_name_in_account (bool): True if the given last name is part of the email account name
- added_cols: list[str] = ['domain', 'email_valid', 'first_name_in_account', 'last_name_in_account']
- finish()[source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- load_data()[source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = 'Analyze-Emails'
- required_cols: list[str] = ['Email', 'First Name', 'Last Name']
bdc.steps.analyze_reviews module
- class bdc.steps.analyze_reviews.GPTReviewSentimentAnalyzer(force_refresh: bool = False)[source]
Bases:
Step
A class that performs sentiment analysis on reviews using the GPT-4 model.
- name
The name of the step.
- Type:
str
- model
The GPT model to be used for sentiment analysis.
- Type:
str
- model_encoding_name
The encoding name of the GPT model.
- Type:
str
- MAX_PROMPT_TOKENS
The maximum number of tokens allowed for a prompt.
- Type:
int
- no_answer
The default value for no answer.
- Type:
str
- gpt_required_fields
The required fields for GPT analysis.
- Type:
dict
- system_message_for_sentiment_analysis
The system message for sentiment analysis.
- Type:
str
- user_message_for_sentiment_analysis
The user message for sentiment analysis.
- Type:
str
- extracted_col_name
The name of the column to store the sentiment scores.
- Type:
str
- added_cols
The list of additional columns to be added to the DataFrame.
- Type:
list
- gpt
The GPT instance for sentiment analysis.
- Type:
openai.OpenAI
- extract_text_from_reviews(reviews_list)[source]
Extracts text from reviews and removes newline characters.
- batch_reviews(reviews, max_tokens)[source]
Batches reviews into smaller batches based on token limit.
- Added Columns:
reviews_sentiment_score (float): The sentiment score of the reviews.
- MAX_PROMPT_TOKENS = 4096
- added_cols: list[str] = ['reviews_sentiment_score']
- batch_reviews(reviews, max_tokens=4096)[source]
Batches reviews into smaller batches based on token limit.
- Parameters:
reviews – The list of reviews.
max_tokens (int) – The maximum number of tokens allowed for a batch.
- Returns:
The list of batches.
- Return type:
list
- extract_text_from_reviews(reviews_list)[source]
Extracts text from reviews and removes newline characters.
- Parameters:
reviews_list – The list of reviews.
- Returns:
The list of formatted review texts.
- Return type:
list
- extracted_col_name = 'reviews_sentiment_score'
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- gpt = None
- gpt_calculate_avg_sentiment_score(reviews)[source]
Calculates the average sentiment score for a list of reviews using GPT.
- Parameters:
reviews (list) – A list of review texts.
- Returns:
The average sentiment score.
- Return type:
float
- gpt_required_fields = {'place_id': 'google_places_place_id'}
- gpt_sentiment_analyze_review(review_list)[source]
GPT calculates the sentiment score considering the reviews.
- Parameters:
review_list – The list of reviews.
- Returns:
The sentiment score calculated by GPT.
- Return type:
float
- model = 'gpt-4'
- model_encoding_name = 'cl100k_base'
- name: str = 'GPT-Review-Sentiment-Analyzer'
- no_answer = 'None'
- num_tokens_from_string(text: str)[source]
Returns the number of tokens in a text string.
- Parameters:
text (str) – The input text.
- Returns:
The number of tokens in the text.
- Return type:
int
- required_cols: list[str] = dict_values(['google_places_place_id'])
- run() DataFrame [source]
Runs the sentiment analysis on the reviews.
- Returns:
The DataFrame with the sentiment scores added.
- Return type:
DataFrame
- run_sentiment_analysis(place_id)[source]
Runs sentiment analysis on the reviews of the lead identified by the given place ID.
- Parameters:
place_id – The ID of the place.
- Returns:
The average sentiment score of the reviews.
- Return type:
float
- system_message_for_sentiment_analysis = "You are review sentiment analyzer, you being provided reviews of the companies. You analyze the review and come up with the score between range [-1, 1], if no reviews then just answer with 'None'"
- text_analyzer = <bdc.steps.helpers.text_analyzer.TextAnalyzer object>
- textblob_calculate_avg_sentiment_score(reviews)[source]
Calculates the average sentiment score for a list of reviews using TextBlob sentiment analysis.
- Parameters:
reviews (list) – A list of dictionaries containing review text and language information.
- Returns:
The average sentiment score for the reviews.
- Return type:
float
- user_message_for_sentiment_analysis = 'Sentiment analyze the reviews and provide me a score between range [-1, 1] : {}'
- class bdc.steps.analyze_reviews.SmartReviewInsightsEnhancer(force_refresh: bool = False)[source]
Bases:
Step
A step class that enhances review insights for smart review analysis.
- name
The name of the step.
- Type:
str
- required_fields
A dictionary of required fields for the step.
- Type:
dict
- language_tools
A dictionary of language tools for different languages.
- Type:
dict
- MIN_RATINGS_COUNT
The minimum number of ratings required to identify polarization.
- Type:
int
- RATING_DOMINANCE_THRESHOLD
The threshold for high or low rating dominance, as a decimal fraction.
- Type:
float
- added_cols
A list of added columns for the enhanced review insights.
- Type:
list
- _get_language_tool(lang)
Get the language tool for the specified language.
- _quantify_polarization(ratings)[source]
Analyzes and quantifies the polarization in a list of ratings.
- _determine_polarization_type(polarization_score, highest_rating_ratio, lowest_rating_ratio, threshold)[source]
Determines the type of polarization based on rating ratios and a threshold.
- _calculate_average_grammatical_score(reviews)[source]
Calculates the average grammatical score for a list of reviews.
- _grammatical_errors(text, lang)
Calculates the number of grammatical errors in a text.
- Added Columns:
review_avg_grammatical_score (float): The average grammatical score of the reviews.
review_polarization_type (str): The type of polarization in the reviews.
review_polarization_score (float): The score of polarization in the reviews.
review_highest_rating_ratio (float): The ratio of highest ratings in the reviews.
review_lowest_rating_ratio (float): The ratio of lowest ratings in the reviews.
review_rating_trend (float): The trend of ratings over time.
- MIN_RATINGS_COUNT = 1
- RATING_DOMINANCE_THRESHOLD = 0.4
- added_cols: list[str] = ['review_avg_grammatical_score', 'review_polarization_type', 'review_polarization_score', 'review_highest_rating_ratio', 'review_lowest_rating_ratio', 'review_rating_trend']
- name: str = 'Smart-Review-Insights-Enhancer'
- required_fields = {'place_id': 'google_places_place_id'}
- run() DataFrame [source]
Runs the step and enhances the review insights.
- Returns:
The enhanced DataFrame with the added review insights.
- Return type:
DataFrame
- text_analyzer = <bdc.steps.helpers.text_analyzer.TextAnalyzer object>
- bdc.steps.analyze_reviews.check_api_key(api_key, api_name)[source]
Checks if an API key is provided for a specific API.
- Parameters:
api_key (str) – The API key to be checked.
api_name (str) – The name of the API.
- Raises:
StepError – If the API key is not provided.
- Returns:
True if the API key is provided, False otherwise.
- Return type:
bool
- bdc.steps.analyze_reviews.is_review_valid(review)[source]
Checks if the review is valid (has text and original language).
- Parameters:
review (dict) – A dictionary representing a review.
- Returns:
True if the review is valid, False otherwise.
- Return type:
bool
- bdc.steps.analyze_reviews.log = <CustomLogger AMOS-APP (DEBUG)>
bdc.steps.google_places module
- class bdc.steps.google_places.GooglePlaces(force_refresh: bool = False)[source]
Bases:
Step
The GooglePlaces step will try to find the correct business entry in the Google Maps database. It saves basic information along with the place ID, which can be used to retrieve further detailed information, and a confidence score indicating how confident we are that the correct result was found. Confidence varies with the data source used to identify the business; when multiple sources are used and their results match, confidence is higher.
- name
Name of this step, used for logging and as a column prefix
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing this step
- Type:
list[str]
- Added Columns:
google_places_place_id (str): The place id of the business
google_places_business_status (str): The business status of the business
google_places_formatted_address (str): The formatted address of the business
google_places_name (str): The name of the business
google_places_user_ratings_total (int): The number of user ratings of the business
google_places_rating (float): The rating of the business
google_places_price_level (int): The price level of the business
google_places_candidate_count_mail (int): The number of candidates found by mail search
google_places_candidate_count_phone (int): The number of candidates found by phone search
google_places_place_id_matches_phone_search (bool): Whether the place id found by mail search matches the one found by phone search
google_places_confidence (float): A confidence score for the results
- added_cols: list[str] = ['google_places_place_id', 'google_places_business_status', 'google_places_formatted_address', 'google_places_name', 'google_places_user_ratings_total', 'google_places_rating', 'google_places_price_level', 'google_places_candidate_count_mail', 'google_places_candidate_count_phone', 'google_places_place_id_matches_phone_search', 'google_places_confidence']
- api_fields = ['place_id', 'business_status', 'formatted_address', 'name', 'user_ratings_total', 'rating', 'price_level']
- calculate_confidence(results_list, lead) float | None [source]
Calculate a confidence score representing how sure we are to have found the correct Google Place (using a super secret, patented AI algorithm :P).
- Parameters:
results_list – The list of place candidates found for the lead.
lead – The lead being processed.
- Returns:
The confidence score, or None if it could not be computed.
- df_fields = ['place_id', 'business_status', 'formatted_address', 'name', 'user_ratings_total', 'rating', 'price_level', 'candidate_count_mail', 'candidate_count_phone', 'place_id_matches_phone_search', 'confidence']
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- gmaps = None
- load_data() None [source]
Make sure that the API key for Google places is present and construct the API client
- name: str = 'Google_Places'
- required_cols: list[str] = ['Email', 'domain', 'first_name_in_account', 'last_name_in_account', 'number_formatted']
bdc.steps.google_places_detailed module
- class bdc.steps.google_places_detailed.GooglePlacesDetailed(force_refresh: bool = False)[source]
Bases:
Step
The GooglePlacesDetailed step will try to gather detailed information for a given Google business entry, identified by the place ID. This information could be the website link, the review texts and the business type. Reviews will be saved to a separate location based on the persistence settings; this could be local storage or AWS S3.
- name
Name of this step, used for logging
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing this step
- Type:
list[str]
- Added Columns:
google_places_detailed_website (str): The website of the company from google places
google_places_detailed_type (str): The type of the company from google places
- added_cols: list[str] = ['google_places_detailed_website', 'google_places_detailed_type']
- api_fields = ['website', 'type', 'reviews']
- api_fields_output = ['website', 'types']
- df_fields = ['website', 'type']
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- gmaps = None
- load_data() None [source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = 'Google_Places_Detailed'
- required_cols: list[str] = ['google_places_place_id']
bdc.steps.gpt_summarizer module
- class bdc.steps.gpt_summarizer.GPTSummarizer(force_refresh: bool = False)[source]
Bases:
Step
The GPTSummarizer step will attempt to download a business's website in raw HTML format and pass this information to OpenAI's GPT, which will then attempt to summarize the raw contents and extract valuable information for a salesperson.
- name
Name of this step, used for logging
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing this step
- Type:
list[str]
- Added Columns:
sales_person_summary (str): The summary of the company website for the salesperson using GPT
- added_cols: list[str] = ['sales_person_summary']
- client = None
- extracted_col_name_website_summary = 'sales_person_summary'
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- gpt_required_fields = {'place_id': 'google_places_place_id', 'website': 'google_places_detailed_website'}
- load_data() None [source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- model = 'gpt-4'
- name: str = 'GPT-Summarizer'
- no_answer = 'None'
- required_cols: list[str] = dict_values(['google_places_detailed_website', 'google_places_place_id'])
- run() DataFrame [source]
Perform the actual processing step. Will not be executed if verify() fails.
- Raises:
StepError
- summarize_the_company_website(website, place_id)[source]
Summarize the client website using GPT. Handles exceptions that might arise from the API call.
- system_message_for_website_summary = "You are html summarizer, you being provided the companies' htmls and you answer with the summary of three to five sentences including all the necessary information which might be useful for salesperson. If no html then just answer with 'None'"
- user_message_for_website_summary = 'Give salesperson a summary using following html: {}'
bdc.steps.hash_generator module
- class bdc.steps.hash_generator.HashGenerator(force_refresh: bool = False)[source]
Bases:
Step
A pipeline step computing the hashed value of a lead using the basic data that should be present for every lead. These data include:
First Name
Last Name
Company / Account
Phone
Email
- name
Name of this step, used for logging
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing this step
- Type:
list[str]
- added_cols: list[str] = ['lead_hash']
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- load_data() None [source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = 'Hash-Generator'
- required_cols: list[str] = ['First Name', 'Last Name', 'Company / Account', 'Phone', 'Email']
bdc.steps.preprocess_phonenumbers module
- class bdc.steps.preprocess_phonenumbers.PreprocessPhonenumbers(force_refresh: bool = False)[source]
Bases:
Step
The PreprocessPhonenumbers step will check if the provided phone numbers are valid and extract geo information if possible.
- name
Name of this step, used for logging
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing this step
- Type:
list[str]
- Added Columns:
number_formatted (str): The formatted phone number, e.g. +49 123 456789
number_country (str): The country of the phone number, e.g. Germany
number_area (str): The area of the phone number, e.g. Berlin
number_valid (bool): Whether the phone number is valid
number_possible (bool): Whether the phone number is possible
- added_cols: list[str] = ['number_formatted', 'number_country', 'number_area', 'number_valid', 'number_possible']
- finish()[source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- load_data()[source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = 'Preprocess-Phonenumbers'
- required_cols: list[str] = ['Phone']
bdc.steps.regionalatlas module
- class bdc.steps.regionalatlas.RegionalAtlas(force_refresh: bool = False)[source]
Bases:
Step
The RegionalAtlas step will query the RegionalAtlas database for location-based geographic and demographic information, based on the address that was found for a business (currently through the Google API) or the area derived from the phone number (preprocess_phonenumbers.py).
- name
Name of this step, used for logging
- Type:
str
- reagionalatlas_feature_keys
Dictionary to translate between the keys in the merged.geojson and the used column names in the df
- Type:
dict
- df_fields
the keys of the merged.geojson
- Type:
list[str]
- added_cols
List of fields that will be added to the main dataframe by executing this step
- Type:
list[str]
- required_cols
List of fields that are required in the input dataframe before performing this step
- Type:
list[str]
- regions_gdfs
dataframe that includes all keys/values from the merged.geojson
- empty_result
empty result that will be used in case there are problems with the data
- Type:
dict
- epsg_code_etrs
The EPSG code 25832 (ETRS89 / UTM zone 32N), the standard used by RegionalAtlas
- Type:
int
- Added Columns:
pop_density (float): Population density of the searched city
pop_development (float): Population development of the searched city
age_0 (float): Population age group 0-18 of the searched city
age_1 (float): Population age group 18-30 of the searched city
age_2 (float): Population age group 30-45 of the searched city
age_3 (float): Population age group 45-60 of the searched city
age_4 (float): Population age group 60+ of the searched city
pop_avg_age (float): Average age of the searched city
per_service_sector (float): Percentage of the service sector of the searched city
per_trade (float): Percentage of the trade sector of the searched city
employment_rate (float): Employment rate of the searched city
unemployment_rate (float): Unemployment rate of the searched city
per_long_term_unemployment (float): Percentage of long term unemployment of the searched city
investments_p_employee (float): Investments per employee of the searched city
gross_salary_p_employee (float): Gross salary per employee of the searched city
disp_income_p_inhabitant (float): Disposable income per inhabitant of the searched city
tot_income_p_taxpayer (float): Total income per taxpayer of the searched city
gdp_p_employee (float): GDP per employee of the searched city
gdp_development (float): GDP development of the searched city
gdp_p_inhabitant (float): GDP per inhabitant of the searched city
gdp_p_workhours (float): GDP per workhour of the searched city
pop_avg_age_zensus (float): Average age of the searched city (zensus)
unemployment_rate (float): Unemployment rate of the searched city (zensus)
regional_score (float): Regional score of the searched city
- added_cols: list[str] = ['regional_atlas_pop_density', 'regional_atlas_pop_development', 'regional_atlas_age_0', 'regional_atlas_age_1', 'regional_atlas_age_2', 'regional_atlas_age_3', 'regional_atlas_age_4', 'regional_atlas_pop_avg_age', 'regional_atlas_per_service_sector', 'regional_atlas_per_trade', 'regional_atlas_employment_rate', 'regional_atlas_unemployment_rate', 'regional_atlas_per_long_term_unemployment', 'regional_atlas_investments_p_employee', 'regional_atlas_gross_salary_p_employee', 'regional_atlas_disp_income_p_inhabitant', 'regional_atlas_tot_income_p_taxpayer', 'regional_atlas_gdp_p_employee', 'regional_atlas_gdp_development', 'regional_atlas_gdp_p_inhabitant', 'regional_atlas_gdp_p_workhours', 'regional_atlas_pop_avg_age_zensus', 'regional_atlas_regional_score']
- calculate_regional_score(lead) float | None [source]
Calculate a regional score for a lead based on information from the RegionalAtlas API.
This function uses population density, employment rate, and average income to compute the buying power of potential customers in the area in millions of euro.
- The score is computed as:
(population density * employment rate * average income) / 1,000,000
Possible extensions could include:
- Population age groups
- Parameters:
lead – Lead for which to compute the score
- Returns:
float | None - The computed score if the necessary fields are present for the lead. None otherwise.
- df_fields: list[str] = dict_values(['ai0201', 'ai0202', 'ai0203', 'ai0204', 'ai0205', 'ai0206', 'ai0207', 'ai0218', 'ai0706', 'ai0707', 'ai0710', 'ai_z08', 'ai0808', 'ai1001', 'ai1002', 'ai1601', 'ai1602', 'ai1701', 'ai1702', 'ai1703', 'ai1704', 'ai_z01'])
- empty_result: dict = {'ai0201': None, 'ai0202': None, 'ai0203': None, 'ai0204': None, 'ai0205': None, 'ai0206': None, 'ai0207': None, 'ai0218': None, 'ai0706': None, 'ai0707': None, 'ai0710': None, 'ai0808': None, 'ai1001': None, 'ai1002': None, 'ai1601': None, 'ai1602': None, 'ai1701': None, 'ai1702': None, 'ai1703': None, 'ai1704': None, 'ai_z01': None, 'ai_z08': None}
- epsg_code_etrs = 25832
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- get_data_from_address(row)[source]
Retrieve the regional features for every lead. Every column of reagionalatlas_feature_keys is added.
Based on the google places address or the phonenumber area. Checks if the centroid of the searched city is in a RegionalAtlas region.
Possible extensions could include: - More RegionalAtlas features
- Parameters:
row – Lead for which to retrieve the features
- Returns:
dict - The retrieved features if the necessary fields are present for the lead. Empty dictionary otherwise.
- load_data() None [source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = 'Regional_Atlas'
- reagionalatlas_feature_keys: dict = {'age_0': 'ai0203', 'age_1': 'ai0204', 'age_2': 'ai0205', 'age_3': 'ai0206', 'age_4': 'ai0207', 'disp_income_p_inhabitant': 'ai1601', 'employment_rate': 'ai0710', 'gdp_development': 'ai1702', 'gdp_p_employee': 'ai1701', 'gdp_p_inhabitant': 'ai1703', 'gdp_p_workhours': 'ai1704', 'gross_salary_p_employee': 'ai1002', 'investments_p_employee': 'ai1001', 'per_long_term_unemployment': 'ai0808', 'per_service_sector': 'ai0706', 'per_trade': 'ai0707', 'pop_avg_age': 'ai0218', 'pop_avg_age_zensus': 'ai_z01', 'pop_density': 'ai0201', 'pop_development': 'ai0202', 'tot_income_p_taxpayer': 'ai1602', 'unemployment_rate': 'ai_z08'}
- regions_gdfs = Empty GeoDataFrame (Columns: [], Index: [])
- required_cols: list[str] = ['google_places_formatted_address']
bdc.steps.search_offeneregister module
- class bdc.steps.search_offeneregister.SearchOffeneRegister(force_refresh: bool = False)[source]
Bases:
Step
This class represents a step in the sales lead qualification process that searches for company-related data using the OffeneRegisterAPI.
- name
The name of the step.
- Type:
str
- required_cols
The list of required columns in the input DataFrame.
- Type:
list
- added_cols
The list of columns to be added to the input DataFrame.
- Type:
list
- offeneregisterAPI
An instance of the OffeneRegisterAPI class, used to extract company-related data for a given lead.
- Type:
OffeneRegisterAPI
- Added Columns:
company_name (str): The name of the company from offeneregister.de
company_objective (str): The objective of the company from offeneregister.de
company_capital (float): The capital of the company from offeneregister.de
company_capital_currency (str): The currency of the company capital from offeneregister.de
company_address (str): The address of the company from offeneregister.de
- added_cols: list[str] = ['company_name', 'company_objective', 'company_capital', 'company_capital_currency', 'compan_address']
- finish()[source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- load_data()[source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = 'OffeneRegister'
- offeneregisterAPI = <bdc.steps.helpers.offeneregister_api.OffeneRegisterAPI object>
- required_cols: list[str] = ['Last Name', 'First Name']
bdc.steps.step module
- class bdc.steps.step.Step(force_refresh: bool = False)[source]
Bases:
object
Step is an abstract parent class for all steps of the data enrichment pipeline. Steps can be added to a list and then be passed to the pipeline for sequential execution.
- name
Name of this step, used for logging and as column prefix
- Type:
str
- added_cols
List of fields that will be added to the main dataframe by executing a step
- Type:
list[str]
- required_cols
List of fields that are required to be present in the input dataframe before performing a step
- Type:
list[str]
- added_cols: list[str] = []
- check_data_presence() bool [source]
Check whether the data this step collects is already present in the df. Can be forced to return False if self._force_execution is set to True.
- property df: DataFrame
- finish() None [source]
Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.
- load_data() None [source]
Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.
- name: str = None
- required_cols: list[str] = []