bdc.steps package

Submodules

bdc.steps.analyze_emails module

class bdc.steps.analyze_emails.AnalyzeEmails(force_refresh: bool = False)[source]

Bases: Step

A pipeline step performing various preprocessing operations on the given email address. The following columns will be added on successful processing:

  • domain: The custom domain name/website if any

  • email_valid: Boolean result of email check

  • first_name_in_account: Boolean, True if the given first name is part of the email account name

  • last_name_in_account: Boolean, True if the given last name is part of the email account name

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

domain (str): The custom domain name/website if any
email_valid (bool): Boolean result of email check
first_name_in_account (bool): Boolean, True if the given first name is part of the email account name
last_name_in_account (bool): Boolean, True if the given last name is part of the email account name

added_cols: list[str] = ['domain', 'email_valid', 'first_name_in_account', 'last_name_in_account']
finish()[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data()[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Analyze-Emails'
required_cols: list[str] = ['Email', 'First Name', 'Last Name']
run()[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify()[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.analyze_emails.analyze_email_account(lead) Series[source]
bdc.steps.analyze_emails.extract_custom_domain(email: str) Series[source]
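
A minimal sketch of what these helpers might look like (the free-mail provider list, the validity check, and the column order are illustrative assumptions, not the module's actual implementation):

    import pandas as pd

    # Illustrative set of common free-mail providers; the real list may differ.
    COMMON_PROVIDERS = {"gmail.com", "yahoo.com", "outlook.com", "gmx.de", "web.de"}

    def extract_custom_domain(email: str) -> pd.Series:
        # An address on a domain outside the free-mail providers is treated
        # as having a custom company domain.
        parts = email.split("@")
        if len(parts) != 2 or not parts[1]:
            return pd.Series([None, False])  # domain, email_valid
        domain = parts[1].lower()
        custom = domain if domain not in COMMON_PROVIDERS else None
        return pd.Series([custom, True])

    def analyze_email_account(lead) -> pd.Series:
        # Check whether first/last name occur in the local part of the address.
        account = str(lead["Email"]).split("@")[0].lower()
        first = str(lead["First Name"]).lower() in account
        last = str(lead["Last Name"]).lower() in account
        return pd.Series([first, last])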

bdc.steps.analyze_reviews module

class bdc.steps.analyze_reviews.GPTReviewSentimentAnalyzer(force_refresh: bool = False)[source]

Bases: Step

A class that performs sentiment analysis on reviews using the GPT-4 model.

name

The name of the step.

Type:

str

model

The GPT model to be used for sentiment analysis.

Type:

str

model_encoding_name

The encoding name of the GPT model.

Type:

str

MAX_PROMPT_TOKENS

The maximum number of tokens allowed for a prompt.

Type:

int

no_answer

The default value for no answer.

Type:

str

gpt_required_fields

The required fields for GPT analysis.

Type:

dict

system_message_for_sentiment_analysis

The system message for sentiment analysis.

Type:

str

user_message_for_sentiment_analysis

The user message for sentiment analysis.

Type:

str

extracted_col_name

The name of the column to store the sentiment scores.

Type:

str

added_cols

The list of additional columns to be added to the DataFrame.

Type:

list

gpt

The GPT instance for sentiment analysis.

Type:

openai.OpenAI

load_data()[source]

Loads the GPT model.

verify()[source]

Verifies the validity of the API key and DataFrame.

run()[source]

Runs the sentiment analysis on the reviews.

finish()[source]

Finishes the sentiment analysis step.

run_sentiment_analysis(place_id)[source]

Runs sentiment analysis on the reviews of a lead.

gpt_sentiment_analyze_review(review_list)[source]

Calculates the sentiment score using GPT.

extract_text_from_reviews(reviews_list)[source]

Extracts text from reviews and removes line characters.

num_tokens_from_string(text)[source]

Returns the number of tokens in a text string.

batch_reviews(reviews, max_tokens)[source]

Batches reviews into smaller batches based on token limit.

Added Columns:

reviews_sentiment_score (float): The sentiment score of the reviews.

MAX_PROMPT_TOKENS = 4096
added_cols: list[str] = ['reviews_sentiment_score']
batch_reviews(reviews, max_tokens=4096)[source]

Batches reviews into smaller batches based on token limit.

Parameters:
  • reviews – The list of reviews.

  • max_tokens (int) – The maximum number of tokens allowed for a batch.

Returns:

The list of batches.

Return type:

list
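
A minimal sketch of greedy token-budget batching using tiktoken with the documented cl100k_base encoding (the actual splitting strategy in the source may differ):

    import tiktoken

    def batch_reviews(reviews, max_tokens=4096, encoding_name="cl100k_base"):
        # Greedy batching: append reviews until the token budget would be
        # exceeded, then start a new batch.
        enc = tiktoken.get_encoding(encoding_name)
        batches, current, used = [], [], 0
        for review in reviews:
            tokens = len(enc.encode(review))
            if current and used + tokens > max_tokens:
                batches.append(current)
                current, used = [], 0
            current.append(review)
            used += tokens
        if current:
            batches.append(current)
        return batches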

extract_text_from_reviews(reviews_list)[source]

Extracts text from reviews and removes line characters.

Parameters:

reviews_list – The list of reviews.

Returns:

The list of formatted review texts.

Return type:

list

extracted_col_name = 'reviews_sentiment_score'
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

gpt = None
gpt_calculate_avg_sentiment_score(reviews)[source]

Calculates the average sentiment score for a list of reviews using GPT.

Parameters:

reviews (list) – A list of review texts.

Returns:

The average sentiment score.

Return type:

float

gpt_required_fields = {'place_id': 'google_places_place_id'}
gpt_sentiment_analyze_review(review_list)[source]

GPT calculates the sentiment score considering the reviews.

Parameters:

review_list – The list of reviews.

Returns:

The sentiment score calculated by GPT.

Return type:

float
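
A minimal sketch of what the GPT call might look like, using the documented model and no_answer values with the openai v1 client (the method signature and prompt handling are assumptions):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def gpt_sentiment_analyze_review(review_list, system_message, user_message):
        # Ask GPT-4 for a score in [-1, 1]; "None" signals no answer.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message.format(review_list)},
            ],
        )
        answer = response.choices[0].message.content
        return None if answer == "None" else float(answer)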

load_data() None[source]

Loads the GPT model.

model = 'gpt-4'
model_encoding_name = 'cl100k_base'
name: str = 'GPT-Review-Sentiment-Analyzer'
no_answer = 'None'
num_tokens_from_string(text: str)[source]

Returns the number of tokens in a text string.

Parameters:

text (str) – The input text.

Returns:

The number of tokens in the text.

Return type:

int

required_cols: list[str] = dict_values(['google_places_place_id'])
run() DataFrame[source]

Runs the sentiment analysis on the reviews.

Returns:

The DataFrame with the sentiment scores added.

Return type:

DataFrame

run_sentiment_analysis(place_id)[source]

Runs sentiment analysis on the reviews of a lead, identified by its Google Places ID.

Parameters:

place_id – The ID of the place.

Returns:

The average sentiment score of the reviews.

Return type:

float

system_message_for_sentiment_analysis = "You are review sentiment analyzer, you being provided reviews of the companies. You analyze the review and come up with the score between range [-1, 1], if no reviews then just answer with 'None'"
text_analyzer = <bdc.steps.helpers.text_analyzer.TextAnalyzer object>
textblob_calculate_avg_sentiment_score(reviews)[source]

Calculates the average sentiment score for a list of reviews using TextBlob sentiment analysis.

Parameters:

reviews (list) – A list of dictionaries containing review text and language information.

Returns:

The average sentiment score for the reviews.

Return type:

float
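
A minimal sketch of the TextBlob-based average, assuming each review is a dict with a text field (the key name is an assumption):

    from textblob import TextBlob

    def textblob_calculate_avg_sentiment_score(reviews):
        # TextBlob polarity is already in [-1, 1]; average over all reviews
        # that actually carry text.
        scores = [
            TextBlob(review["text"]).sentiment.polarity
            for review in reviews
            if review.get("text")
        ]
        return sum(scores) / len(scores) if scores else None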

user_message_for_sentiment_analysis = 'Sentiment analyze the reviews  and provide me a score between range [-1, 1]  : {}'
verify() bool[source]

Verifies the validity of the API key and DataFrame.

Returns:

True if the API key and DataFrame are valid, False otherwise.

Return type:

bool

class bdc.steps.analyze_reviews.SmartReviewInsightsEnhancer(force_refresh: bool = False)[source]

Bases: Step

A step class that enhances review insights for smart review analysis.

name

The name of the step.

Type:

str

required_fields

A dictionary of required fields for the step.

Type:

dict

language_tools

A dictionary of language tools for different languages.

Type:

dict

MIN_RATINGS_COUNT

The minimum number of ratings required to identify polarization.

Type:

int

RATING_DOMINANCE_THRESHOLD

The threshold for high or low rating dominance in decimal.

Type:

float

added_cols

A list of added columns for the enhanced review insights.

Type:

list

load_data()[source]

Loads the data for the step.

verify()[source]

Verifies if the required fields are present in the data.

run()[source]

Runs the step and enhances the review insights.

finish()[source]

Finishes the step.

_get_language_tool(lang)

Get the language tool for the specified language.

_enhance_review_insights(lead)[source]

Enhances the review insights for a given lead.

_analyze_rating_trend(rating_time)[source]

Analyzes the general trend of ratings over time.

_quantify_polarization(ratings)[source]

Analyzes and quantifies the polarization in a list of ratings.

_determine_polarization_type(polarization_score, highest_rating_ratio, lowest_rating_ratio, threshold)[source]

Determines the type of polarization based on rating ratios and a threshold.

_calculate_average_grammatical_score(reviews)[source]

Calculates the average grammatical score for a list of reviews.

_calculate_score(review)[source]

Calculates the score for a review.

_grammatical_errors(text, lang)

Calculates the number of grammatical errors in a text.

Added Columns:

review_avg_grammatical_score (float): The average grammatical score of the reviews.
review_polarization_type (str): The type of polarization in the reviews.
review_polarization_score (float): The score of polarization in the reviews.
review_highest_rating_ratio (float): The ratio of highest ratings in the reviews.
review_lowest_rating_ratio (float): The ratio of lowest ratings in the reviews.
review_rating_trend (float): The trend of ratings over time.

MIN_RATINGS_COUNT = 1
RATING_DOMINANCE_THRESHOLD = 0.4
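
A minimal sketch of how the trend and polarization-type helpers might work; the function names mirror the private methods above, and the returned labels are illustrative, not the module's actual values:

    import numpy as np

    def analyze_rating_trend(rating_time):
        # Slope of a least-squares line through (timestamp, rating) pairs:
        # positive means ratings improve over time, negative means decline.
        times, ratings = zip(*rating_time)
        return float(np.polyfit(times, ratings, 1)[0])

    def determine_polarization_type(highest_ratio, lowest_ratio, threshold=0.4):
        # Dominance of extreme ratings relative to the documented threshold.
        if highest_ratio >= threshold and lowest_ratio >= threshold:
            return "polarized"
        if highest_ratio >= threshold:
            return "high-rating-dominance"
        if lowest_ratio >= threshold:
            return "low-rating-dominance"
        return "balanced"
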
added_cols: list[str] = ['review_avg_grammatical_score', 'review_polarization_type', 'review_polarization_score', 'review_highest_rating_ratio', 'review_lowest_rating_ratio', 'review_rating_trend']
finish() None[source]

Finishes the step.

load_data() None[source]

Loads the data for the step.

name: str = 'Smart-Review-Insights-Enhancer'
required_fields = {'place_id': 'google_places_place_id'}
run() DataFrame[source]

Runs the step and enhances the review insights.

Returns:

The enhanced DataFrame with the added review insights.

Return type:

DataFrame

text_analyzer = <bdc.steps.helpers.text_analyzer.TextAnalyzer object>
verify() bool[source]

Verifies if the required fields are present in the data.

Returns:

True if the required fields are present, False otherwise.

Return type:

bool

bdc.steps.analyze_reviews.check_api_key(api_key, api_name)[source]

Checks if an API key is provided for a specific API.

Parameters:
  • api_key (str) – The API key to be checked.

  • api_name (str) – The name of the API.

Raises:

StepError – If the API key is not provided.

Returns:

True if the API key is provided, False otherwise.

Return type:

bool
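
A minimal sketch following the Raises clause above (raising StepError on a missing key rather than returning False):

    from bdc.steps.step import StepError

    def check_api_key(api_key, api_name):
        # An empty or missing key aborts the step.
        if not api_key:
            raise StepError(f"An API key for {api_name} is required but was not provided.")
        return True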

bdc.steps.analyze_reviews.is_review_valid(review)[source]

Checks if the review is valid (has text and original language).

Parameters:

review (dict) – A dictionary representing a review.

Returns:

True if the review is valid, False otherwise.

Return type:

bool

bdc.steps.analyze_reviews.log = <CustomLogger AMOS-APP (DEBUG)>

bdc.steps.google_places module

class bdc.steps.google_places.GooglePlaces(force_refresh: bool = False)[source]

Bases: Step

The GooglePlaces step will try to find the correct business entry in the Google Maps database. It will save basic information along with the place ID, which can be used to retrieve further detailed information, and a confidence score indicating how confident we are that the correct result was found. Confidence varies with the data source used to identify the business; when multiple sources are used, confidence is higher if their results match.

name

Name of this step, used for logging and as a column prefix

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

google_places_place_id (str): The place id of the business
google_places_business_status (str): The business status of the business
google_places_formatted_address (str): The formatted address of the business
google_places_name (str): The name of the business
google_places_user_ratings_total (int): The number of user ratings of the business
google_places_rating (float): The rating of the business
google_places_price_level (int): The price level of the business
google_places_candidate_count_mail (int): The number of candidates found by mail search
google_places_candidate_count_phone (int): The number of candidates found by phone search
google_places_place_id_matches_phone_search (bool): Whether the place id found by mail search matches the one found by phone search
google_places_confidence (float): A confidence score for the results

added_cols: list[str] = ['google_places_place_id', 'google_places_business_status', 'google_places_formatted_address', 'google_places_name', 'google_places_user_ratings_total', 'google_places_rating', 'google_places_price_level', 'google_places_candidate_count_mail', 'google_places_candidate_count_phone', 'google_places_place_id_matches_phone_search', 'google_places_confidence']
api_fields = ['place_id', 'business_status', 'formatted_address', 'name', 'user_ratings_total', 'rating', 'price_level']
calculate_confidence(results_list, lead) float | None[source]

Calculate a confidence score representing how sure we are to have found the correct Google Place (using a super secret, patented AI algorithm :P).

Parameters:
  • results_list – The list of search results

  • lead – The lead being processed

Returns:

confidence (float | None)
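
The concrete scoring heuristic is not documented here; the following is a purely illustrative sketch in the spirit described above (higher confidence when independent searches agree):

    def calculate_confidence(results_list, lead):
        # Illustrative heuristic, not the module's actual scoring.
        candidates = [r for r in results_list if r is not None]
        if not candidates:
            return None
        place_ids = {c["place_id"] for c in candidates}
        if len(candidates) > 1 and len(place_ids) == 1:
            return 0.9  # mail and phone search agree
        if len(candidates) == 1:
            return 0.6  # only one source produced a hit
        return 0.3      # sources disagree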

df_fields = ['place_id', 'business_status', 'formatted_address', 'name', 'user_ratings_total', 'rating', 'price_level', 'candidate_count_mail', 'candidate_count_phone', 'place_id_matches_phone_search', 'confidence']
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

get_data_from_google_api(lead_row)[source]

Request Google Places Text Search API

get_first_place_candidate(query, input_type) (dict, int)[source]
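
A minimal sketch of the underlying Find Place request using the googlemaps client and the api_fields listed above (error handling is omitted and the API key is a placeholder; in practice it is loaded from the environment):

    import googlemaps

    gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

    def get_first_place_candidate(query, input_type):
        # input_type is "textquery" or "phonenumber"; return the first
        # candidate plus the total candidate count.
        response = gmaps.find_place(
            input=query,
            input_type=input_type,
            fields=["place_id", "business_status", "formatted_address", "name",
                    "user_ratings_total", "rating", "price_level"],
        )
        candidates = response.get("candidates", [])
        return (candidates[0] if candidates else None, len(candidates))
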
gmaps = None
load_data() None[source]

Make sure that the API key for Google places is present and construct the API client

name: str = 'Google_Places'
required_cols: list[str] = ['Email', 'domain', 'first_name_in_account', 'last_name_in_account', 'number_formatted']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.google_places_detailed module

class bdc.steps.google_places_detailed.GooglePlacesDetailed(force_refresh: bool = False)[source]

Bases: Step

The GooglePlacesDetailed step will try to gather detailed information for a given Google business entry, identified by the place ID. This information could be the website link, the review text and the business type. Reviews will be saved to a separate location based on the persistence settings; this could be local storage or AWS S3.

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

google_places_detailed_website (str): The website of the company from google places
google_places_detailed_type (str): The type of the company from google places

added_cols: list[str] = ['google_places_detailed_website', 'google_places_detailed_type']
api_fields = ['website', 'type', 'reviews']
api_fields_output = ['website', 'types']
df_fields = ['website', 'type']
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

get_data_from_detailed_google_api(lead_row)[source]
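
A minimal sketch of the Place Details request using the googlemaps client and the api_fields above (the helper name and return shape are assumptions):

    import googlemaps

    gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

    def get_details(place_id: str):
        # Request only the fields this step uses (see api_fields above).
        response = gmaps.place(place_id=place_id, fields=["website", "type", "reviews"])
        result = response.get("result", {})
        return result.get("website"), result.get("types"), result.get("reviews", [])
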
gmaps = None
load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Google_Places_Detailed'
required_cols: list[str] = ['google_places_place_id']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.gpt_summarizer module

class bdc.steps.gpt_summarizer.GPTSummarizer(force_refresh: bool = False)[source]

Bases: Step

The GPTSummarizer step will attempt to download a business's website in raw HTML format and pass this information to OpenAI's GPT, which will then attempt to summarize the raw contents and extract valuable information for a salesperson.

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

sales_person_summary (str): The summary of the company website for the salesperson using GPT

added_cols: list[str] = ['sales_person_summary']
client = None
extract_the_raw_html_and_parse(url)[source]
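
A minimal sketch of the HTML download and parsing (the library choices and text-extraction strategy are assumptions):

    import requests
    from bs4 import BeautifulSoup

    def extract_the_raw_html_and_parse(url):
        # Fetch the page and reduce it to visible text for the GPT prompt.
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            return None
        soup = BeautifulSoup(response.text, "html.parser")
        return " ".join(soup.stripped_strings)
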
extracted_col_name_website_summary = 'sales_person_summary'
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

gpt_required_fields = {'place_id': 'google_places_place_id', 'website': 'google_places_detailed_website'}
load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

model = 'gpt-4'
name: str = 'GPT-Summarizer'
no_answer = 'None'
required_cols: list[str] = dict_values(['google_places_detailed_website', 'google_places_place_id'])
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

summarize_the_company_website(website, place_id)[source]

Summarize the client website using GPT. Handles exceptions that might arise from the API call.

system_message_for_website_summary = "You are html summarizer, you being provided the companies' htmls and you answer with the summary of three to five sentences including all the necessary information which might be useful for salesperson. If no html then just answer with 'None'"
user_message_for_website_summary = 'Give salesperson a summary using following html: {}'
verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.hash_generator module

class bdc.steps.hash_generator.HashGenerator(force_refresh: bool = False)[source]

Bases: Step

A pipeline step computing the hashed value of a lead using the basic data that should be present for every lead (a minimal hashing sketch follows the list below). This data includes:

  • First Name

  • Last Name

  • Company / Account

  • Phone

  • Email
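
A minimal sketch of how such a hash could be computed (the digest algorithm, separator, and helper name are assumptions, not the step's actual implementation):

    import hashlib

    def compute_lead_hash(lead):
        # Join the basic fields with a separator and hash the result.
        fields = ["First Name", "Last Name", "Company / Account", "Phone", "Email"]
        data = "|".join(str(lead.get(field, "")) for field in fields)
        return hashlib.sha256(data.encode("utf-8")).hexdigest()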

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

added_cols: list[str] = ['lead_hash']
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Hash-Generator'
required_cols: list[str] = ['First Name', 'Last Name', 'Company / Account', 'Phone', 'Email']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.preprocess_phonenumbers module

class bdc.steps.preprocess_phonenumbers.PreprocessPhonenumbers(force_refresh: bool = False)[source]

Bases: Step

The PreprocessPhonenumbers step will check if the provided phone numbers are valid and extract geo information if possible.

name

Name of this step, used for logging

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing this step

Type:

list[str]

Added Columns:

number_formatted (str): The formatted phone number, e.g. +49 123 456789
number_country (str): The country of the phone number, e.g. Germany
number_area (str): The area of the phone number, e.g. Berlin
number_valid (bool): Whether the phone number is valid
number_possible (bool): Whether the phone number is possible

added_cols: list[str] = ['number_formatted', 'number_country', 'number_area', 'number_valid', 'number_possible']
check_number(phone_number: str) str | None[source]
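
A minimal sketch of the number check using the phonenumbers library (the default region "DE" for numbers without a country prefix is an assumption):

    import phonenumbers
    from phonenumbers import geocoder

    def check_number(phone_number: str) -> str | None:
        try:
            parsed = phonenumbers.parse(phone_number, "DE")  # region assumption
        except phonenumbers.NumberParseException:
            return None
        if not phonenumbers.is_possible_number(parsed):
            return None
        return phonenumbers.format_number(
            parsed, phonenumbers.PhoneNumberFormat.INTERNATIONAL
        )

    # Geo information as in number_country / number_area:
    # geocoder.country_name_for_number(parsed, "en")  -> e.g. "Germany"
    # geocoder.description_for_number(parsed, "en")   -> e.g. "Berlin"
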
finish()[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data()[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Preprocess-Phonenumbers'
process_row(row)[source]
required_cols: list[str] = ['Phone']
run()[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify()[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.regionalatlas module

class bdc.steps.regionalatlas.RegionalAtlas(force_refresh: bool = False)[source]

Bases: Step

The RegionalAtlas step will query the RegionalAtlas database for location-based geographic and demographic information, based on the address that was found for a business (currently through the Google API) or the area derived from the phone number (preprocess_phonenumbers.py).

name

Name of this step, used for logging

Type:

str

reagionalatlas_feature_keys

Dictionary translating between the keys in merged.geojson and the column names used in the df

Type:

dict

df_fields

The keys of the merged.geojson

Type:

list[str]

added_cols

List of fields that will be added to the main dataframe by executing this step

Type:

list[str]

required_cols

List of fields that are required in the input dataframe before performing this step

Type:

list[str]

regions_gdfs

Dataframe that includes all keys/values from the merged.geojson

empty_result

Empty result that will be used in case there are problems with the data

Type:

dict

epsg_code_etrs

EPSG code 25832 is the standard used by RegionalAtlas

Added Columns:

pop_density (float): Population density of the searched city
pop_development (float): Population development of the searched city
age_0 (float): Population age group 0-18 of the searched city
age_1 (float): Population age group 18-30 of the searched city
age_2 (float): Population age group 30-45 of the searched city
age_3 (float): Population age group 45-60 of the searched city
age_4 (float): Population age group 60+ of the searched city
pop_avg_age (float): Average age of the searched city
per_service_sector (float): Percentage of the service sector of the searched city
per_trade (float): Percentage of the trade sector of the searched city
employment_rate (float): Employment rate of the searched city
unemployment_rate (float): Unemployment rate of the searched city
per_long_term_unemployment (float): Percentage of long term unemployment of the searched city
investments_p_employee (float): Investments per employee of the searched city
gross_salary_p_employee (float): Gross salary per employee of the searched city
disp_income_p_inhabitant (float): Disposable income per inhabitant of the searched city
tot_income_p_taxpayer (float): Total income per taxpayer of the searched city
gdp_p_employee (float): GDP per employee of the searched city
gdp_development (float): GDP development of the searched city
gdp_p_inhabitant (float): GDP per inhabitant of the searched city
gdp_p_workhours (float): GDP per workhour of the searched city
pop_avg_age_zensus (float): Average age of the searched city (zensus)
unemployment_rate (float): Unemployment rate of the searched city (zensus)
regional_score (float): Regional score of the searched city

added_cols: list[str] = ['regional_atlas_pop_density', 'regional_atlas_pop_development', 'regional_atlas_age_0', 'regional_atlas_age_1', 'regional_atlas_age_2', 'regional_atlas_age_3', 'regional_atlas_age_4', 'regional_atlas_pop_avg_age', 'regional_atlas_per_service_sector', 'regional_atlas_per_trade', 'regional_atlas_employment_rate', 'regional_atlas_unemployment_rate', 'regional_atlas_per_long_term_unemployment', 'regional_atlas_investments_p_employee', 'regional_atlas_gross_salary_p_employee', 'regional_atlas_disp_income_p_inhabitant', 'regional_atlas_tot_income_p_taxpayer', 'regional_atlas_gdp_p_employee', 'regional_atlas_gdp_development', 'regional_atlas_gdp_p_inhabitant', 'regional_atlas_gdp_p_workhours', 'regional_atlas_pop_avg_age_zensus', 'regional_atlas_regional_score']
calculate_regional_score(lead) float | None[source]

Calculate a regional score for a lead based on information from the RegionalAtlas API.

This function uses population density, employment rate, and average income to compute the buying power of potential customers in the area in millions of euro.

The score is computed as:

(population density * employment rate * average income) / 1,000,000

Possible extensions could include: - Population age groups

Parameters:

lead – Lead for which to compute the score

Returns:

float | None - The computed score if the necessary fields are present for the lead. None otherwise.
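
A minimal sketch of the score computation using the added_cols names (mapping "average income" to disp_income_p_inhabitant is an assumption):

    def calculate_regional_score(lead):
        # regional_score = pop density * employment rate * avg income / 1e6
        density = lead.get("regional_atlas_pop_density")
        employment_rate = lead.get("regional_atlas_employment_rate")
        income = lead.get("regional_atlas_disp_income_p_inhabitant")
        if None in (density, employment_rate, income):
            return None
        return density * employment_rate * income / 1_000_000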

df_fields: list[str] = dict_values(['ai0201', 'ai0202', 'ai0203', 'ai0204', 'ai0205', 'ai0206', 'ai0207', 'ai0218', 'ai0706', 'ai0707', 'ai0710', 'ai_z08', 'ai0808', 'ai1001', 'ai1002', 'ai1601', 'ai1602', 'ai1701', 'ai1702', 'ai1703', 'ai1704', 'ai_z01'])
empty_result: dict = {'ai0201': None, 'ai0202': None, 'ai0203': None, 'ai0204': None, 'ai0205': None, 'ai0206': None, 'ai0207': None, 'ai0218': None, 'ai0706': None, 'ai0707': None, 'ai0710': None, 'ai0808': None, 'ai1001': None, 'ai1002': None, 'ai1601': None, 'ai1602': None, 'ai1701': None, 'ai1702': None, 'ai1703': None, 'ai1704': None, 'ai_z01': None, 'ai_z08': None}
epsg_code_etrs = 25832
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

get_data_from_address(row)[source]

Retrieve the regional features for every lead. Every column of reagionalatlas_feature_keys is added.

The search is based on the Google Places address or the phone number area and checks whether the centroid of the searched city lies in a RegionalAtlas region.

Possible extensions could include: - More RegionalAtlas features

Parameters:

row – Lead for which to retrieve the features

Returns:

dict - The retrieved features if the necessary fields are present for the lead. Empty dictionary otherwise.
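
A minimal sketch of the geographic lookup with geopandas, reprojecting to EPSG:25832 as documented above (the function name and the WGS84 CRS of the input point are assumptions):

    import geopandas as gpd
    from shapely.geometry import Point

    def features_for_coordinates(lon, lat, regions_gdf):
        # Reproject the city centroid to EPSG:25832 and return the
        # attributes of the region polygon containing it.
        point = gpd.GeoSeries([Point(lon, lat)], crs="EPSG:4326").to_crs(epsg=25832)
        match = regions_gdf[regions_gdf.contains(point.iloc[0])]
        return match.iloc[0].to_dict() if not match.empty else {}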

load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'Regional_Atlas'
reagionalatlas_feature_keys: dict = {'age_0': 'ai0203', 'age_1': 'ai0204', 'age_2': 'ai0205', 'age_3': 'ai0206', 'age_4': 'ai0207', 'disp_income_p_inhabitant': 'ai1601', 'employment_rate': 'ai0710', 'gdp_development': 'ai1702', 'gdp_p_employee': 'ai1701', 'gdp_p_inhabitant': 'ai1703', 'gdp_p_workhours': 'ai1704', 'gross_salary_p_employee': 'ai1002', 'investments_p_employee': 'ai1001', 'per_long_term_unemployment': 'ai0808', 'per_service_sector': 'ai0706', 'per_trade': 'ai0707', 'pop_avg_age': 'ai0218', 'pop_avg_age_zensus': 'ai_z01', 'pop_density': 'ai0201', 'pop_development': 'ai0202', 'tot_income_p_taxpayer': 'ai1602', 'unemployment_rate': 'ai_z08'}
regions_gdfs = Empty GeoDataFrame (Columns: [], Index: [])
required_cols: list[str] = ['google_places_formatted_address']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.search_offeneregister module

class bdc.steps.search_offeneregister.SearchOffeneRegister(force_refresh: bool = False)[source]

Bases: Step

This class represents a step in the sales lead qualification process that searches for company-related data using the OffeneRegisterAPI.

name

The name of the step.

Type:

str

required_cols

The list of required columns in the input DataFrame.

Type:

list

added_cols

The list of columns to be added to the input DataFrame.

Type:

list

offeneregisterAPI

An instance of the OffeneRegisterAPI class.

Type:

OffeneRegisterAPI

verify()[source]

Verifies if the step is ready to run.

finish()[source]

Performs any necessary cleanup or finalization steps.

load_data()[source]

Loads any required data for the step.

run()[source]

Executes the step, extracting company-related data for each lead, and returns the modified DataFrame.

Added Columns:

company_name (str): The name of the company from offeneregister.de
company_objective (str): The objective of the company from offeneregister.de
company_capital (float): The capital of the company from offeneregister.de
company_capital_currency (str): The currency of the company capital from offeneregister.de
company_address (str): The address of the company from offeneregister.de

added_cols: list[str] = ['company_name', 'company_objective', 'company_capital', 'company_capital_currency', 'compan_address']
finish()[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data()[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = 'OffeneRegister'
offeneregisterAPI = <bdc.steps.helpers.offeneregister_api.OffeneRegisterAPI object>
required_cols: list[str] = ['Last Name', 'First Name']
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

bdc.steps.step module

class bdc.steps.step.Step(force_refresh: bool = False)[source]

Bases: object

Step is an abstract parent class for all steps of the data enrichment pipeline. Steps can be added to a list and then be passed to the pipeline for sequential execution.
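
A minimal sketch of how a pipeline might drive these steps sequentially (the actual pipeline class lives outside this module; the method order follows the docstrings below, and assigning to step.df assumes a setter on the df property):

    def run_pipeline(steps, df):
        # Sequential execution over a shared dataframe.
        for step in steps:
            step.df = df
            step.load_data()
            if not step.verify():
                continue  # run() and finish() are skipped when verify() fails
            df = step.run()
            step.finish()
        return df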

name

Name of this step, used for logging and as column prefix

Type:

str

added_cols

List of fields that will be added to the main dataframe by executing a step

Type:

list[str]

required_cols

List of fields that are required to be present in the input dataframe before performing a step

Type:

list[str]

added_cols: list[str] = []
check_data_presence() bool[source]

Check whether the data this step collects is already present in the df. Can be forced to return False if self._force_execution is set to True.

property df: DataFrame
finish() None[source]

Finish the execution. Print a report or clean up temporary files. Will not be executed if verify() fails.

load_data() None[source]

Load data for this processing step. This could be an API call or reading from a CSV file. Can also be empty if self.df is used.

name: str = None
required_cols: list[str] = []
run() DataFrame[source]

Perform the actual processing step. Will not be executed if verify() fails.

Raises:

StepError

verify() bool[source]

Verify that the data has been loaded correctly and is present in a format that can be processed by this step. If this fails, run() and finish() will not be executed.

exception bdc.steps.step.StepError[source]

Bases: Exception