bdc.steps package

Submodules

bdc.steps.analyze_emails module

class bdc.steps.analyze_emails.AnalyzeEmails(force_refresh: bool = False)[source]

Bases: Step

A pipeline step performing various preprocessing operations on the given email address. The following columns will be added on successful processing:

  • domain: The custom domain name/website if any

  • email_valid: Boolean result of email check

  • first_name_in_account: Boolean, True if the given first name is part of the email account name

  • last_name_in_account: Boolean, True if the given last name is part of the email account name

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

domain (str): The custom domain name/website if any
email_valid (bool): Boolean result of email check
first_name_in_account (bool): Boolean, True if the given first name is part of the email account name
last_name_in_account (bool): Boolean, True if the given last name is part of the email account name

added_cols: list[str] = ['domain', 'email_valid', 'first_name_in_account', 'last_name_in_account']
finish()[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data()[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Analyze-Emails'
required_cols: list[str] = ['Email', 'First Name', 'Last Name']
run()[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify()[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.analyze_emails.analyze_email_account(lead) Series[source]
bdc.steps.analyze_emails.extract_custom_domain(email: str) Series[source]
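
A minimal sketch of what these helpers might look like (the free-mail provider list, the validity check, and the column order are illustrative assumptions, not the module's actual implementation):

    import pandas as pd

    # Illustrative set of common free-mail providers; the real list may differ.
    COMMON_PROVIDERS = {"gmail.com", "yahoo.com", "outlook.com", "gmx.de", "web.de"}

    def extract_custom_domain(email: str) -> pd.Series:
        # An address on a domain outside the free-mail providers is treated
        # as having a custom company domain.
        parts = email.split("@")
        if len(parts) != 2 or not parts[1]:
            return pd.Series([None, False])  # domain, email_valid
        domain = parts[1].lower()
        custom = domain if domain not in COMMON_PROVIDERS else None
        return pd.Series([custom, True])

    def analyze_email_account(lead) -> pd.Series:
        # Check whether first/last name occur in the local part of the address.
        account = str(lead["Email"]).split("@")[0].lower()
        first = str(lead["First Name"]).lower() in account
        last = str(lead["Last Name"]).lower() in account
        return pd.Series([first, last])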

bdc.steps.analyze_reviews module

class bdc.steps.analyze_reviews.GPTReviewSentimentAnalyzer(force_refresh: bool = False)[source]

Bases: Step

A class that performs sentiment analysis on reviews using the GPT-4 model.

name

The name of the step.

Type:

str

model

The GPT model to be used for sentiment analysis.

Type:

str

model_encoding_name

The encoding name of the GPT model.

Type:

str

MAX_PROMPT_TOKENS

The maximum number of tokens allowed for a prompt.

Type:

int

no_answer

The default value for no answer.

Type:

str

gpt_required_fields

The required fields for GPT analysis.

Type:

dict

system_message_for_sentiment_analysis

The system message for sentiment analysis.

Type:

str

user_message_for_sentiment_analysis

The user message for sentiment analysis.

Type:

str

extracted_col_name

The name of the column to store the sentiment scores.

Type:

str

added_cols

The list of additional columns to be added to the DataFrame.

Type:

list

gpt

The GPT instance for sentiment analysis.

Type:

openai.OpenAI

load_data()[source]

Loads the GPT model.

verify()[source]

Verifies the validity of the API key and DataFrame.

run()[source]

Runs the sentiment analysis on the reviews.

finish()[source]

Finishes the sentiment analysis step.

run_sentiment_analysis(place_id)[source]

Runs sentiment analysis on the reviews of a lead.

gpt_sentiment_analyze_review(review_list)[source]

Calculates the sentiment score using GPT.

extract_text_from_reviews(reviews_list)[source]

Extracts text from reviews and removes line characters.

num_tokens_from_string(text)[source]

Returns the number of tokens in a text string.

batch_reviews(reviews, max_tokens)[source]

Batches reviews into smaller batches based on token limit.

Added Columns:

reviews_sentiment_score (float): The sentiment score of the reviews.

MAX_PROMPT_TOKENS = 4096
added_cols: list[str] = ['reviews_sentiment_score']
batch_reviews(reviews, max_tokens=4096)[source]

Batches reviews into smaller batches based on token limit.

Parameters:
  • reviews – The list of reviews.

  • max_tokens (int) – The maximum number of tokens allowed for a batch.

Returns:

The list of batches.

Return type:

list
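
A minimal sketch of greedy token-budget batching using tiktoken with the documented cl100k_base encoding (the actual splitting strategy in the source may differ):

    import tiktoken

    def batch_reviews(reviews, max_tokens=4096, encoding_name="cl100k_base"):
        # Greedy batching: append reviews until the token budget would be
        # exceeded, then start a new batch.
        enc = tiktoken.get_encoding(encoding_name)
        batches, current, used = [], [], 0
        for review in reviews:
            tokens = len(enc.encode(review))
            if current and used + tokens > max_tokens:
                batches.append(current)
                current, used = [], 0
            current.append(review)
            used += tokens
        if current:
            batches.append(current)
        return batches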

extract_text_from_reviews(reviews_list)[source]

Extracts text from reviews and removes line characters.

Parameters:

reviews_list – The list of reviews.

Returns:

The list of formatted review texts.

Return type:

list

extracted_col_name = 'reviews_sentiment_score'
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

gpt = None
gpt_calculate_avg_sentiment_score(reviews)[source]

Calculates the average sentiment score for a list of reviews using GPT.

Parameters:

reviews (list) – A list of review texts.

Returns:

The average sentiment score.

Return type:

float

gpt_required_fields = {'place_id': 'google_places_place_id'}
gpt_sentiment_analyze_review(review_list)[source]

GPT calculates the sentiment score considering the reviews.

Parameters:

review_list – The list of reviews.

Returns:

The sentiment score calculated by GPT.

Return type:

float
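
A minimal sketch of what the GPT call might look like, using the documented model and no_answer values with the openai v1 client (the method signature and prompt handling are assumptions):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def gpt_sentiment_analyze_review(review_list, system_message, user_message):
        # Ask GPT-4 for a score in [-1, 1]; "None" signals no answer.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message.format(review_list)},
            ],
        )
        answer = response.choices[0].message.content
        return None if answer == "None" else float(answer)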

load_data() None[source]

Loads the GPT model.

model = 'gpt-4'
model_encoding_name = 'cl100k_base'
name: str = 'GPT-Review-Sentiment-Analyzer'
no_answer = 'None'
num_tokens_from_string(text: str)[source]

Returns the number of tokens in a text string.

Parameters:

text (str) – The input text.

Returns:

The number of tokens in the text.

Return type:

int

required_cols: list[str] = dict_values(['google_places_place_id'])
run() DataFrame[source]

Runs the sentiment analysis on the reviews.

Returns:

The DataFrame with the sentiment scores added.

Return type:

DataFrame

run_sentiment_analysis(place_id)[source]

Runs sentiment analysis on the reviews of a lead, identified by its Google Places ID.

Parameters:

place_id – The ID of the place.

Returns:

The average sentiment score of the reviews.

Return type:

float

system_message_for_sentiment_analysis = "You are review sentiment analyzer, you being provided reviews of the companies. You analyze the review and come up with the score between range [-1, 1], if no reviews then just answer with 'None'"
text_analyzer = <bdc.steps.helpers.text_analyzer.TextAnalyzer object>
textblob_calculate_avg_sentiment_score(reviews)[source]

Calculates the average sentiment score for a list of reviews using TextBlob sentiment analysis.

Parameters:

reviews (list) – A list of dictionaries containing review text and language information.

Returns:

The average sentiment score for the reviews.

Return type:

float
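
A minimal sketch of the TextBlob-based average, assuming each review is a dict with a text field (the key name is an assumption):

    from textblob import TextBlob

    def textblob_calculate_avg_sentiment_score(reviews):
        # TextBlob polarity is already in [-1, 1]; average over all reviews
        # that actually carry text.
        scores = [
            TextBlob(review["text"]).sentiment.polarity
            for review in reviews
            if review.get("text")
        ]
        return sum(scores) / len(scores) if scores else None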

user_message_for_sentiment_analysis = 'Sentiment analyze the reviews  and provide me a score between range [-1, 1]  : {}'
verify() bool[source]

Verifies the validity of the API key and DataFrame.

Returns:

True if the API key and DataFrame are valid, False otherwise.

Return type:

bool

class bdc.steps.analyze_reviews.SmartReviewInsightsEnhancer(force_refresh: bool = False)[source]

Bases: Step

A step class that enhances review insights for smart review analysis.

name

The name of the step.

Type:

str

required_fields

A dictionary of required fields for the step.

Type:

dict

language_tools

A dictionary of language tools for different languages.

Type:

dict

MIN_RATINGS_COUNT

The minimum number of ratings required to identify polarization.

Type:

int

RATING_DOMINANCE_THRESHOLD

The threshold for high or low rating dominance in decimal.

Type:

float

added_cols

A list of added columns for the enhanced review insights.

Type:

list

load_data()[source]

Loads the data for the step.

verify()[source]

Verifies if the required fields are present in the data.

run()[source]

Runs the step and enhances the review insights.

finish()[source]

Finishes the step.

_get_language_tool(lang)

Get the language tool for the specified language.

_enhance_review_insights(lead)[source]

Enhances the review insights for a given lead.

_analyze_rating_trend(rating_time)[source]

Analyzes the general trend of ratings over time.

_quantify_polarization(ratings)[source]

Analyzes and quantifies the polarization in a list of ratings.

_determine_polarization_type(polarization_score, highest_rating_ratio, lowest_rating_ratio, threshold)[source]

Determines the type of polarization based on rating ratios and a threshold.

_calculate_average_grammatical_score(reviews)[source]

Calculates the average grammatical score for a list of reviews.

_calculate_score(review)[source]

Calculates the score for a review.

_grammatical_errors(text, lang)

Calculates the number of grammatical errors in a text.

Added Columns:

review_avg_grammatical_score (float): The average grammatical score of the reviews.
review_polarization_type (str): The type of polarization in the reviews.
review_polarization_score (float): The score of polarization in the reviews.
review_highest_rating_ratio (float): The ratio of highest ratings in the reviews.
review_lowest_rating_ratio (float): The ratio of lowest ratings in the reviews.
review_rating_trend (float): The trend of ratings over time.

MIN_RATINGS_COUNT = 1
RATING_DOMINANCE_THRESHOLD = 0.4
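
A minimal sketch of how the trend and polarization-type helpers might work; the function names mirror the private methods above, and the returned labels are illustrative, not the module's actual values:

    import numpy as np

    def analyze_rating_trend(rating_time):
        # Slope of a least-squares line through (timestamp, rating) pairs:
        # positive means ratings improve over time, negative means decline.
        times, ratings = zip(*rating_time)
        return float(np.polyfit(times, ratings, 1)[0])

    def determine_polarization_type(highest_ratio, lowest_ratio, threshold=0.4):
        # Dominance of extreme ratings relative to the documented threshold.
        if highest_ratio >= threshold and lowest_ratio >= threshold:
            return "polarized"
        if highest_ratio >= threshold:
            return "high-rating-dominance"
        if lowest_ratio >= threshold:
            return "low-rating-dominance"
        return "balanced"
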
added_cols: list[str] = ['review_avg_grammatical_score', 'review_polarization_type', 'review_polarization_score', 'review_highest_rating_ratio', 'review_lowest_rating_ratio', 'review_rating_trend']
finish() None[source]

Finishes the step.

load_data() None[source]

Loads the data for the step.

name: str = 'Smart-Review-Insights-Enhancer'
required_fields = {'place_id': 'google_places_place_id'}
run() DataFrame[source]

Runs the step and enhances the review insights.

Returns:

The enhanced DataFrame with the added review insights.

Return type:

DataFrame

text_analyzer = <bdc.steps.helpers.text_analyzer.TextAnalyzer object>
verify() bool[source]

Verifies if the required fields are present in the data.

Returns:

True if the required fields are present, False otherwise.

Return type:

bool

bdc.steps.analyze_reviews.check_api_key(api_key, api_name)[source]

Checks if an API key is provided for a specific API.

Parameters:
  • api_key (str) – The API key to be checked.

  • api_name (str) – The name of the API.

Raises:

StepError – If the API key is not provided.

Returns:

True if the API key is provided, False otherwise.

Return type:

bool
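
A minimal sketch following the Raises clause above (raising StepError on a missing key rather than returning False):

    from bdc.steps.step import StepError

    def check_api_key(api_key, api_name):
        # An empty or missing key aborts the step.
        if not api_key:
            raise StepError(f"An API key for {api_name} is required but was not provided.")
        return True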

bdc.steps.analyze_reviews.is_review_valid(review)[source]

Checks if the review is valid (has text and original language).

Parameters:

review (dict) – A dictionary representing a review.

Returns:

True if the review is valid, False otherwise.

Return type:

bool

bdc.steps.analyze_reviews.log = <CustomLogger AMOS-APP (DEBUG)>

bdc.steps.google_places module

class bdc.steps.google_places.GooglePlaces(force_refresh: bool = False)[source]

Bases: Step

The GooglePlaces step will try to find the correct business entry in the Google Maps database. It will save basic information along with the place ID, which can be used to retrieve further detailed information, and a confidence score indicating how confident we are that the correct result was found. Confidence varies with the data source used to identify the business; when multiple sources are used, confidence is higher if their results match.

name

Name of this step, used for logging and as a column prefix

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

google_places_place_id (str): The place id of the business
google_places_business_status (str): The business status of the business
google_places_formatted_address (str): The formatted address of the business
google_places_name (str): The name of the business
google_places_user_ratings_total (int): The number of user ratings of the business
google_places_rating (float): The rating of the business
google_places_price_level (int): The price level of the business
google_places_candidate_count_mail (int): The number of candidates found by mail search
google_places_candidate_count_phone (int): The number of candidates found by phone search
google_places_place_id_matches_phone_search (bool): Whether the place id found by mail search matches the one found by phone search
google_places_confidence (float): A confidence score for the results

added_cols: list[str] = ['google_places_place_id', 'google_places_business_status', 'google_places_formatted_address', 'google_places_name', 'google_places_user_ratings_total', 'google_places_rating', 'google_places_price_level', 'google_places_candidate_count_mail', 'google_places_candidate_count_phone', 'google_places_place_id_matches_phone_search', 'google_places_confidence']
api_fields = ['place_id', 'business_status', 'formatted_address', 'name', 'user_ratings_total', 'rating', 'price_level']
calculate_confidence(results_list, lead) float | None[source]

Calculate a confidence score representing how sure we are to have found the correct Google Place (using a super secret, patented AI algorithm :P).

Parameters:
  • results_list – The list of search results

  • lead – The lead being processed

Returns:

confidence (float | None)
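
The concrete scoring heuristic is not documented here; the following is a purely illustrative sketch in the spirit described above (higher confidence when independent searches agree):

    def calculate_confidence(results_list, lead):
        # Illustrative heuristic, not the module's actual scoring.
        candidates = [r for r in results_list if r is not None]
        if not candidates:
            return None
        place_ids = {c["place_id"] for c in candidates}
        if len(candidates) > 1 and len(place_ids) == 1:
            return 0.9  # mail and phone search agree
        if len(candidates) == 1:
            return 0.6  # only one source produced a hit
        return 0.3      # sources disagree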

df_fields = ['place_id', 'business_status', 'formatted_address', 'name', 'user_ratings_total', 'rating', 'price_level', 'candidate_count_mail', 'candidate_count_phone', 'place_id_matches_phone_search', 'confidence']
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

get_data_from_google_api(lead_row)[source]

Request Google Places Text Search API

get_first_place_candidate(query, input_type) (dict, int)[source]
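
A minimal sketch of the underlying Find Place request using the googlemaps client and the api_fields listed above (error handling is omitted and the API key is a placeholder; in practice it is loaded from the environment):

    import googlemaps

    gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

    def get_first_place_candidate(query, input_type):
        # input_type is "textquery" or "phonenumber"; return the first
        # candidate plus the total candidate count.
        response = gmaps.find_place(
            input=query,
            input_type=input_type,
            fields=["place_id", "business_status", "formatted_address", "name",
                    "user_ratings_total", "rating", "price_level"],
        )
        candidates = response.get("candidates", [])
        return (candidates[0] if candidates else None, len(candidates))
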
gmaps = None
load_data() None[source]

Make sure that the API key for Google places is present and construct the API client

name: str = 'Google_Places'
required_cols: list[str] = ['Email', 'domain', 'first_name_in_account', 'last_name_in_account', 'number_formatted']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.google_places_detailed module

class bdc.steps.google_places_detailed.GooglePlacesDetailed(force_refresh: bool = False)[source]

Bases: Step

The GooglePlacesDetailed step will try to gather detailed information for a given Google business entry, identified by the place ID. This information could be the website link, the review text and the business type. Reviews will be saved to a separate location based on the persistence settings; this could be local storage or AWS S3.

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

google_places_detailed_website (str): The website of the company from google places
google_places_detailed_type (str): The type of the company from google places

added_cols: list[str] = ['google_places_detailed_website', 'google_places_detailed_type']
api_fields = ['website', 'type', 'reviews']
api_fields_output = ['website', 'types']
df_fields = ['website', 'type']
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

get_data_from_detailed_google_api(lead_row)[source]
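
A minimal sketch of the Place Details request using the googlemaps client and the api_fields above (the helper name and return shape are assumptions):

    import googlemaps

    gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

    def get_details(place_id: str):
        # Request only the fields this step uses (see api_fields above).
        response = gmaps.place(place_id=place_id, fields=["website", "type", "reviews"])
        result = response.get("result", {})
        return result.get("website"), result.get("types"), result.get("reviews", [])
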
gmaps = None
load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Google_Places_Detailed'
required_cols: list[str] = ['google_places_place_id']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.gpt_summarizer module

class bdc.steps.gpt_summarizer.GPTSummarizer(force_refresh: bool = False)[source]

Bases: Step

The GPTSummarizer step will attempt to download a business's website in raw HTML format and pass this information to OpenAI's GPT, which will then attempt to summarize the raw contents and extract valuable information for a salesperson.

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

sales_person_summary (str): The summary of the company website for the salesperson using GPT

added_cols: list[str] = ['sales_person_summary']
client = None
extract_the_raw_html_and_parse(url)[source]
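
A minimal sketch of the HTML download and parsing (the library choices and text-extraction strategy are assumptions):

    import requests
    from bs4 import BeautifulSoup

    def extract_the_raw_html_and_parse(url):
        # Fetch the page and reduce it to visible text for the GPT prompt.
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            return None
        soup = BeautifulSoup(response.text, "html.parser")
        return " ".join(soup.stripped_strings)
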
extracted_col_name_website_summary = 'sales_person_summary'
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

gpt_required_fields = {'place_id': 'google_places_place_id', 'website': 'google_places_detailed_website'}
load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

model = 'gpt-4'
name: str = 'GPT-Summarizer'
no_answer = 'None'
required_cols: list[str] = dict_values(['google_places_detailed_website', 'google_places_place_id'])
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

summarize_the_company_website(website, place_id)[source]

Summarize the client website using GPT. Handles exceptions that might arise from the API call.

system_message_for_website_summary = "You are html summarizer, you being provided the companies' htmls and you answer with the summary of three to five sentences including all the necessary information which might be useful for salesperson. If no html then just answer with 'None'"
user_message_for_website_summary = 'Give salesperson a summary using following html: {}'
verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.hash_generator module

class bdc.steps.hash_generator.HashGenerator(force_refresh: bool = False)[source]

Bases: Step

A pipeline step computing the hashed value of a lead using the basic data that should be present for every lead (a minimal hashing sketch follows the list below). This data includes:

  • First Name

  • Last Name

  • Company / Account

  • Phone

  • Email
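
A minimal sketch of how such a hash could be computed (the digest algorithm, separator, and helper name are assumptions, not the step's actual implementation):

    import hashlib

    def compute_lead_hash(lead):
        # Join the basic fields with a separator and hash the result.
        fields = ["First Name", "Last Name", "Company / Account", "Phone", "Email"]
        data = "|".join(str(lead.get(field, "")) for field in fields)
        return hashlib.sha256(data.encode("utf-8")).hexdigest()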

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

added_cols: list[str] = ['lead_hash']
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Hash-Generator'
required_cols: list[str] = ['First Name', 'Last Name', 'Company / Account', 'Phone', 'Email']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.preprocess_phonenumbers module

class bdc.steps.preprocess_phonenumbers.PreprocessPhonenumbers(force_refresh: bool = False)[source]

Bases: Step

The PreprocessPhonenumbers step will check if the provided phone numbers are valid and extract geo information if possible.

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

number_formatted (str): The formatted phone number, e.g. +49 123 456789
number_country (str): The country of the phone number, e.g. Germany
number_area (str): The area of the phone number, e.g. Berlin
number_valid (bool): Whether the phone number is valid
number_possible (bool): Whether the phone number is possible

added_cols: list[str] = ['number_formatted', 'number_country', 'number_area', 'number_valid', 'number_possible']
check_number(phone_number: str) str | None[source]
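
A minimal sketch of the number check using the phonenumbers library (the default region "DE" for numbers without a country prefix is an assumption):

    import phonenumbers
    from phonenumbers import geocoder

    def check_number(phone_number: str) -> str | None:
        try:
            parsed = phonenumbers.parse(phone_number, "DE")  # region assumption
        except phonenumbers.NumberParseException:
            return None
        if not phonenumbers.is_possible_number(parsed):
            return None
        return phonenumbers.format_number(
            parsed, phonenumbers.PhoneNumberFormat.INTERNATIONAL
        )

    # Geo information as in number_country / number_area:
    # geocoder.country_name_for_number(parsed, "en")  -> e.g. "Germany"
    # geocoder.description_for_number(parsed, "en")   -> e.g. "Berlin"
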
finish()[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data()[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Preprocess-Phonenumbers'
process_row(row)[source]
required_cols: list[str] = ['Phone']
run()[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify()[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.regionalatlas module

class bdc.steps.regionalatlas.RegionalAtlas(force_refresh: bool = False)[source]

Bases: Step

The RegionalAtlas step will query the RegionalAtlas database for location-based geographic and demographic information, based on the address that was found for a business (currently through the Google API) or the area derived from the phone number (preprocess_phonenumbers.py).

name

Name of this step, used for logging

Type:

str

reagionalatlas_feature_keys

Dictionary translating between the keys in merged.geojson and the column names used in the df

Type:

dict

df_fields

The keys of the merged.geojson

Type:

list[str]

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required in the input dataframe before performing this step

Type:

list[str]

regions_gdfs

Dataframe that includes all keys/values from the merged.geojson

empty_result

Empty result that will be used in case there are problems with the data

Type:

dict

epsg_code_etrs

EPSG code 25832 is the standard used by RegionalAtlas

Added Columns:

pop_density (float): Population density of the searched city
pop_development (float): Population development of the searched city
age_0 (float): Population age group 0-18 of the searched city
age_1 (float): Population age group 18-30 of the searched city
age_2 (float): Population age group 30-45 of the searched city
age_3 (float): Population age group 45-60 of the searched city
age_4 (float): Population age group 60+ of the searched city
pop_avg_age (float): Average age of the searched city
per_service_sector (float): Percentage of the service sector of the searched city
per_trade (float): Percentage of the trade sector of the searched city
employment_rate (float): Employment rate of the searched city
unemployment_rate (float): Unemployment rate of the searched city
per_long_term_unemployment (float): Percentage of long term unemployment of the searched city
investments_p_employee (float): Investments per employee of the searched city
gross_salary_p_employee (float): Gross salary per employee of the searched city
disp_income_p_inhabitant (float): Disposable income per inhabitant of the searched city
tot_income_p_taxpayer (float): Total income per taxpayer of the searched city
gdp_p_employee (float): GDP per employee of the searched city
gdp_development (float): GDP development of the searched city
gdp_p_inhabitant (float): GDP per inhabitant of the searched city
gdp_p_workhours (float): GDP per workhour of the searched city
pop_avg_age_zensus (float): Average age of the searched city (zensus)
unemployment_rate (float): Unemployment rate of the searched city (zensus)
regional_score (float): Regional score of the searched city

added_cols: list[str] = ['regional_atlas_pop_density', 'regional_atlas_pop_development', 'regional_atlas_age_0', 'regional_atlas_age_1', 'regional_atlas_age_2', 'regional_atlas_age_3', 'regional_atlas_age_4', 'regional_atlas_pop_avg_age', 'regional_atlas_per_service_sector', 'regional_atlas_per_trade', 'regional_atlas_employment_rate', 'regional_atlas_unemployment_rate', 'regional_atlas_per_long_term_unemployment', 'regional_atlas_investments_p_employee', 'regional_atlas_gross_salary_p_employee', 'regional_atlas_disp_income_p_inhabitant', 'regional_atlas_tot_income_p_taxpayer', 'regional_atlas_gdp_p_employee', 'regional_atlas_gdp_development', 'regional_atlas_gdp_p_inhabitant', 'regional_atlas_gdp_p_workhours', 'regional_atlas_pop_avg_age_zensus', 'regional_atlas_regional_score']
calculate_regional_score(lead) float | None[source]

Calculate a regional score for a lead based on information from the RegionalAtlas API.

This function uses population density, employment rate, and average income to compute the buying power of potential customers in the area in millions of euro.

The score is computed as:

(population density * employment rate * average income) / 1,000,000

Possible extensions could include: - Population age groups

Parameters:

lead – Lead for which to compute the score

Returns:

float | None - The computed score if the necessary fields are present for the lead. None otherwise.
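
A minimal sketch of the score computation using the added_cols names (mapping "average income" to disp_income_p_inhabitant is an assumption):

    def calculate_regional_score(lead):
        # regional_score = pop density * employment rate * avg income / 1e6
        density = lead.get("regional_atlas_pop_density")
        employment_rate = lead.get("regional_atlas_employment_rate")
        income = lead.get("regional_atlas_disp_income_p_inhabitant")
        if None in (density, employment_rate, income):
            return None
        return density * employment_rate * income / 1_000_000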

df_fields: list[str] = dict_values(['ai0201', 'ai0202', 'ai0203', 'ai0204', 'ai0205', 'ai0206', 'ai0207', 'ai0218', 'ai0706', 'ai0707', 'ai0710', 'ai_z08', 'ai0808', 'ai1001', 'ai1002', 'ai1601', 'ai1602', 'ai1701', 'ai1702', 'ai1703', 'ai1704', 'ai_z01'])
empty_result: dict = {'ai0201': None, 'ai0202': None, 'ai0203': None, 'ai0204': None, 'ai0205': None, 'ai0206': None, 'ai0207': None, 'ai0218': None, 'ai0706': None, 'ai0707': None, 'ai0710': None, 'ai0808': None, 'ai1001': None, 'ai1002': None, 'ai1601': None, 'ai1602': None, 'ai1701': None, 'ai1702': None, 'ai1703': None, 'ai1704': None, 'ai_z01': None, 'ai_z08': None}
epsg_code_etrs = 25832
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

get_data_from_address(row)[source]

Retrieve the regional features for every lead. Every column of reagionalatlas_feature_keys is added.

The search is based on the Google Places address or the phone number area and checks whether the centroid of the searched city lies in a RegionalAtlas region.

Possible extensions could include: - More RegionalAtlas features

Parameters:

row – Lead for which to retrieve the features

Returns:

dict - The retrieved features if the necessary fields are present for the lead. Empty dictionary otherwise.
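
A minimal sketch of the geographic lookup with geopandas, reprojecting to EPSG:25832 as documented above (the function name and the WGS84 CRS of the input point are assumptions):

    import geopandas as gpd
    from shapely.geometry import Point

    def features_for_coordinates(lon, lat, regions_gdf):
        # Reproject the city centroid to EPSG:25832 and return the
        # attributes of the region polygon containing it.
        point = gpd.GeoSeries([Point(lon, lat)], crs="EPSG:4326").to_crs(epsg=25832)
        match = regions_gdf[regions_gdf.contains(point.iloc[0])]
        return match.iloc[0].to_dict() if not match.empty else {}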

load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Regional_Atlas'
reagionalatlas_feature_keys: dict = {'age_0': 'ai0203', 'age_1': 'ai0204', 'age_2': 'ai0205', 'age_3': 'ai0206', 'age_4': 'ai0207', 'disp_income_p_inhabitant': 'ai1601', 'employment_rate': 'ai0710', 'gdp_development': 'ai1702', 'gdp_p_employee': 'ai1701', 'gdp_p_inhabitant': 'ai1703', 'gdp_p_workhours': 'ai1704', 'gross_salary_p_employee': 'ai1002', 'investments_p_employee': 'ai1001', 'per_long_term_unemployment': 'ai0808', 'per_service_sector': 'ai0706', 'per_trade': 'ai0707', 'pop_avg_age': 'ai0218', 'pop_avg_age_zensus': 'ai_z01', 'pop_density': 'ai0201', 'pop_development': 'ai0202', 'tot_income_p_taxpayer': 'ai1602', 'unemployment_rate': 'ai_z08'}
regions_gdfs = Empty GeoDataFrame (Columns: [], Index: [])
required_cols: list[str] = ['google_places_formatted_address']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.search_offeneregister module

class bdc.steps.search_offeneregister.SearchOffeneRegister(force_refresh: bool = False)[source]

Bases: Step

This class represents a step in the sales lead qualification process that searches for company-related data using the OffeneRegisterAPI.

name

The name of the step.

Type:

str

required_cols

The list of required columns in the input DataFrame.

Type:

list

added_cols

The list of columns to be added to the input DataFrame.

Type:

list

offeneregisterAPI

An instance of the OffeneRegisterAPI class.

Type:

OffeneRegisterAPI

verify()[source]

Verifies if the step is ready to run.

finish()[source]

Performs any necessary cleanup or finalization steps.

load_data()[source]

Loads any required data for the step.

run()[source]

Executes the step, extracting company-related data for each lead, and returns the modified DataFrame.

Added Columns:

company_name (str): The name of the company from offeneregister.de
company_objective (str): The objective of the company from offeneregister.de
company_capital (float): The capital of the company from offeneregister.de
company_capital_currency (str): The currency of the company capital from offeneregister.de
company_address (str): The address of the company from offeneregister.de

added_cols: list[str] = ['company_name', 'company_objective', 'company_capital', 'company_capital_currency', 'compan_address']
finish()[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data()[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'OffeneRegister'
offeneregisterAPI = <bdc.steps.helpers.offeneregister_api.OffeneRegisterAPI object>
required_cols: list[str] = ['Last Name', 'First Name']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.step module

class bdc.steps.step.Step(force_refresh: bool = False)[source]

Bases: object

Step is an abstract parent class for all steps of the data enrichment pipeline. Steps can be added to a list and then be passed to the pipeline for sequential execution.
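
A minimal sketch of how a pipeline might drive these steps sequentially (the actual pipeline class lives outside this module; the method order follows the docstrings below, and assigning to step.df assumes a setter on the df property):

    def run_pipeline(steps, df):
        # Sequential execution over a shared dataframe.
        for step in steps:
            step.df = df
            step.load_data()
            if not step.verify():
                continue  # run() and finish() are skipped when verify() fails
            df = step.run()
            step.finish()
        return df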

name

Name of this step, used for logging and as column prefix

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing a step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing a step

Type:

list[str]

added_cols: list[str] = []
check_data_presence() bool[source]

Check whether the data this step collects is already present in the df. Can be forced to return False if self._force_execution is set to True.

property df: DataFrame
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = None
required_cols: list[str] = []
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

exception bdc.steps.step.StepError[source]

Bases: Exception