Build Documentation

Creating the Environment

The repository contains the file .env.template. This file is a template for the environment variables that need to be set for the application to run. Copy this file into a file called .env at the root level of this repository and fill in all values with the corresponding secrets.

To create the virtual environment in this project you must have pipenv installed on your machine. Then run the following commands:

# for development environment
pipenv install --dev
# for production environment
pipenv install

To work within the environment you can now run:

# to activate the virtual environment
pipenv shell
# to run a single command
pipenv run <COMMAND>

Build Process

This application is built and tested on every push and on every pull request via GitHub Actions. For this, the pipenv environment is installed and the code style is checked using flake8. Finally, the tests in the tests/ directory are executed using pytest and a test coverage report is created using coverage. The test coverage report can be found in the GitHub Actions output.

In another task, all used packages are checked for their licenses to ensure that the software does not use any copyleft licenses and remains open source and free to use.

If any of these steps fail for a pull request, the pull request is blocked from being merged until the corresponding step passes.

Furthermore, it is required to install the pre-commit hooks as described in the Pre-Commit Hooks section below. This ensures a uniform coding style throughout the project and that the software is compliant with the REUSE licensing specifications.

Running the app

To run the application the pipenv environment must be installed and all needed environment variables must be set in the .env file. Then the application can be started via

pipenv run python src/main.py

Pre-Commit Hooks

This repository uses pre-commit hooks to ensure a consistent and clean file organization. Each registered hook will be executed when committing to the repository. To ensure that the hooks will be executed they need to be installed using the following command:

pre-commit install

The following things are done by hooks automatically:

  • formatting of Python files using black and isort

  • formatting of other files using prettier

  • syntax checks of JSON and YAML files

  • adding a newline at the end of files

  • removing trailing whitespace

  • preventing commits to the dev and main branches

  • checking adherence to the REUSE licensing specification

User Documentation

Project vision

This product will put a tool in the hands of our industry partner that can effectively increase the conversion of their leads into customers, primarily by providing the sales team with valuable information. The modular architecture makes our product future-proof by making it easy to add further data sources, employ improved prediction models, or adjust the output format if desired.

Project mission

The mission of this project is to enrich historical data about customers and recent data about leads (with information from external sources) and to leverage the enriched data in machine learning, so that the estimated Merchant Size of leads can be predicted.

Usage

To execute the final program, ensure the environment is installed (refer to build-documents.md) and run python .\src\main.py either locally or via the build process. The user will be presented with the following options:

Choose demo:
(0) : Base Data Collector
(1) : Data preprocessing
(2) : ML model training
(3) : Merchant Size Predictor
(4) : Exit

(0) : Base Data Collector

This is the data enrichment pipeline, utilizing multiple data enrichment steps. Configuration options are presented:

Do you want to list all available pipeline configs? (y/N) If y:

Please enter the index of requested pipeline config:
(0) : config_sprint09_release.json
(1) : just_run_search_offeneregister.json
(2) : run_all_steps.json
(3) : Exit
  • (0) Configuration used in sprint 9.

  • (1) Configuration for OffeneRegister.

  • (2) Runs all steps of the pipeline without manual step selection.

  • (3) Exits to the pipeline step selection.

If n, the program proceeds to the pipeline step selection for data enrichment. The following questions are asked:

Run Scrape Address (will take a long time)(y/N)?
Run Search OffeneRegister (will take a long time)(y/N)?
Run Phone Number Validation (y/N)?
Run Google API (will use token and generate cost!)(y/N)?
Run Google API Detailed (will use token and generate cost!)(y/N)?
Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?
Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?
Run Smart Review Insights (will take looong time!)(y/N)?
Run Regionalatlas (y/N)?
  • Run Scrape Address (will take a long time)(y/N)?: This enrichment step scrapes the lead's website for an address using regex.

  • Run Search OffeneRegister (will take a long time)(y/N)?: This enrichment step searches for company-related data using the OffeneRegisterAPI.

  • Run Phone Number Validation (y/N)?: This enrichment step checks whether the provided phone numbers are valid and extracts geographical information using geocoder. A minimal sketch of this kind of check is shown after this list.

  • Run Google API (will use token and generate cost!)(y/N)?: This enrichment step tries to find the correct business entry in the Google Maps database. It saves basic information along with the place ID, which can be used to retrieve further details, and a confidence score that indicates how likely it is that the correct result was found.

  • Run Google API Detailed (will use token and generate cost!)(y/N)?: This enrichment step tries to gather detailed information for a given Google business entry, identified by the place ID.

  • Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?: This enrichment step performs sentiment analysis on reviews using the GPT-4 model.

  • Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?: This enrichment step attempts to download a business's website in raw HTML format and pass it to OpenAI's GPT, which then attempts to summarize the raw contents and extract information valuable to a salesperson.

  • Run Smart Review Insights (will take looong time!)(y/N)?: This enrichment step computes additional review insights (e.g. grammatical score, polarization, and rating trend) for smart review analysis.

  • Run Regionalatlas (y/N)?: This enrichment step queries the RegionalAtlas database for location-based geographic and demographic information, based on the address that was found for a business through the Google API.
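As an illustration of the Phone Number Validation step, here is a minimal sketch using the phonenumbers package (which the data field table lists as the source of the number_* fields); the function name and the exact fields returned are assumptions, not the project's actual code:

# Hypothetical sketch of a phone validation step using the phonenumbers package;
# the function name and returned fields are assumptions.
import phonenumbers
from phonenumbers import geocoder

def validate_phone(raw_number: str, region: str = "DE") -> dict:
    """Parse a raw phone number and derive validity and location information."""
    parsed = phonenumbers.parse(raw_number, region)
    return {
        "number_formatted": phonenumbers.format_number(
            parsed, phonenumbers.PhoneNumberFormat.E164
        ),
        "number_valid": phonenumbers.is_valid_number(parsed),
        "number_possible": phonenumbers.is_possible_number(parsed),
        # Human-readable region, e.g. "Erlangen" or "Germany"
        "number_area": geocoder.description_for_number(parsed, "en"),
    }

print(validate_phone("+49 9131 1234567"))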

It is emphasized that some steps are dependent on others, and excluding one might result in dependency issues for subsequent steps.

After selecting the desired enrichment steps, a prompt asks the user to Set limit for data points to be processed (0=No limit), so that the user can choose whether to apply the data enrichment steps to all leads (no limit) or only to a certain number of leads.

Note: In case DATABASE_TYPE="S3" in your .env file, the limit is removed in order to enrich all the data and store it in the s3://amos--data--events S3 bucket.

(1) : Data preprocessing

Post data enrichment, preprocessing is crucial for machine learning models, involving scaling, numerical outlier removal, and categorical one-hot encoding. The user is prompted with questions:

  • Filter out the API-irrelevant data? (y/n): This filters out all leads that could not be enriched during the data enrichment steps. Removing them is useful for the machine learning algorithms to avoid introducing bias, even if the missing features are padded with zeros.

  • Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!: The user must have DATABASE_TYPE="S3" in the .env file in order to run on historical data; otherwise, preprocessing runs locally. After preprocessing, the log shows where the preprocessed data is stored.

(2) : ML model training

Six machine learning models are available:

(0) : Random Forest
(1) : XGBoost
(2) : Naive Bayes
(3) : KNN Classifier
(4) : AdaBoost
(5) : LightGBM

After selecting the desired machine learning model, the user is prompted with a series of questions:

  • Load model from file? (y/N) : In case of y, the program will ask for a file location of a previously saved model to use for predictions and testing.

  • Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N): In case of y, the S, M, and L labels of the data are grouped together into one class so that training is done on 3 classes ({XS}, {S, M, L}, {XL}) instead of 5. It is worth noting that grouping the S, M, and L classes together as one class boosted the classification performance; a minimal sketch of this grouping is shown below.

  • Do you want to train on a subset of features?
    (0) : ['Include all features']
    (1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score']
    

Option 0 includes all numerical and categorical one-hot-encoded features, while option 1 uses only a small subset of the data as features for the machine learning models.
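A minimal sketch of the 3-class grouping described above, assuming the labels are stored as strings in a pandas column (the column name merchant_size is hypothetical):

# Hypothetical sketch of the 3-class grouping; column and label storage are assumptions.
import pandas as pd

SIZE_TO_GROUP = {"XS": "XS", "S": "S,M,L", "M": "S,M,L", "L": "S,M,L", "XL": "XL"}

df = pd.DataFrame({"merchant_size": ["XS", "S", "M", "L", "XL", "XS"]})
# Map the five original labels onto the three grouped classes
df["merchant_size_3class"] = df["merchant_size"].map(SIZE_TO_GROUP)
print(df["merchant_size_3class"].value_counts())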

Then, the user would be given multiple options:

(1) Train
(2) Test
(3) Predict on single lead
(4) Save model
(5) Exit
  • (1): Train the current model on the current training dataset.

  • (2): Test the current model on the test dataset, displaying the mean squared error.

  • (3): Choose a single lead from the test dataset and display the prediction and true label.

  • (4): Save the current model to amos--models/models on S3 in case of DATABASE_TYPE=S3; otherwise, save it locally.

  • (5): Exit the EVP submenu.

(3) : Merchant Size Predictor

After training, testing, and saving a model, this option applies it to previously unseen leads in order to generate forecasted Merchant Size predictions for them.

(4) : Exit

Gracefully exit the program.

Design Documentation

Introduction

This application serves as a pivotal tool employed by our industry partner, SumUp, for the enrichment of information pertaining to potential leads garnered through their sign-up website. The enriched data is then used to predict the potential value a lead could contribute to SumUp, facilitated by a machine learning model. The application is split into two integral components: the Base Data Collector (BDC) and the Merchant Size Predictor (MSP).

Component Diagram

Component Diagram

External Software

Lead Form (LF)

The Lead Form is submitted by every new lead and provides a small set of data about the lead.

Customer Relationship Management (CRM)

The project output is made available to the sales team. This can be done in different ways, e.g. writing to a Google Sheet or pushing directly to SalesForce.

Components

Base Data Collector (BDC)

General description

The Base Data Collector (BDC) plays a crucial role in enriching the dataset related to potential client leads. The initial dataset solely comprises fundamental lead information, encompassing the lead’s first and last name, phone number, email address, and company name. Recognizing the insufficiency of this baseline data for value prediction, the BDC is designed to query diverse data sources, incorporating various Application Programming Interfaces (APIs), to enrich the provided lead data.

Design

The different data sources are organised as steps in the program. Each step extends from a common parent class and implements methods to validate that it can run, to perform the data collection from its source, and to perform cleanup and statistics reporting for itself. These steps are collected in a pipeline object that performs them sequentially to enrich the given data with all chosen data sources. A minimal sketch of this structure follows the list below. The data sources include:

  • inspecting the possible custom domain of the email address.

  • retrieving various data from the Google Places API.

  • analysing the sentiment of Google reviews using GPT.

  • inspecting the surrounding areas of the business using the Regional Atlas API.

  • searching for company-related data using the OffeneRegisterAPI.

  • performing sentiment analysis on reviews using GPT-4 model.
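The following is a minimal, hypothetical sketch of this step/pipeline structure; class and method names are illustrative and not taken from the actual code base:

# Hypothetical sketch of the step/pipeline design; all names are illustrative.
from abc import ABC, abstractmethod

import pandas as pd

class Step(ABC):
    """Common parent class of all data enrichment steps."""

    @abstractmethod
    def verify(self) -> bool:
        """Check whether the step can run (e.g. required API keys are set)."""

    @abstractmethod
    def run(self, leads: pd.DataFrame) -> pd.DataFrame:
        """Collect data from the source and add it to the lead data."""

    def finish(self) -> None:
        """Clean up and report statistics for this step."""

class Pipeline:
    """Runs the chosen steps sequentially to enrich the given lead data."""

    def __init__(self, steps: list):
        self.steps = steps

    def run(self, leads: pd.DataFrame) -> pd.DataFrame:
        for step in self.steps:
            if step.verify():
                leads = step.run(leads)
                step.finish()
        return leads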

Data storage

All data for this project is stored in CSV files in the client's AWS S3 storage. The files are split across three buckets: the input data and the enriched data are stored in the events bucket, the preprocessed data ready for use by ML models is stored in the features bucket, and the used model and its inference results are stored in the model bucket.

Data preprocessing

Following data enrichment, a pivotal phase in the machine learning pipeline is data preprocessing, an essential process encompassing scaling operations, numerical outlier elimination, and categorical one-hot encoding. This preprocessing stage transforms the output originating from the BDC into feature vectors, thereby rendering them amenable to predictive analysis by the machine learning model.
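For illustration, a minimal preprocessing sketch using pandas and scikit-learn; the z-score outlier rule and the column handling are assumptions, not the project's actual implementation:

# Hypothetical preprocessing sketch: numerical outlier removal, scaling,
# and categorical one-hot encoding; the concrete rules are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, numerical: list, categorical: list) -> pd.DataFrame:
    # Remove numerical outliers, here naively via a z-score threshold of 3
    scaler = StandardScaler()
    z = pd.DataFrame(scaler.fit_transform(df[numerical]), columns=numerical, index=df.index)
    df = df[(z.abs() <= 3).all(axis=1)].copy()

    # Scale the remaining numerical columns
    df[numerical] = scaler.fit_transform(df[numerical])

    # One-hot encode the categorical columns into feature vectors
    return pd.get_dummies(df, columns=categorical)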

Merchant Size Predictor (MSP) / Estimated Value Predictor (EVP)

Historical Note

The primary objective of the Estimated Value Predictor was initially oriented towards forecasting the estimated life-time value of leads. However, this objective evolved during the project's progression, primarily influenced by labelling considerations. The main objective has therefore changed to predicting only the size of a given lead, which can then be used as an indication of their potential life-time value. As a consequence, the component in question is now (somewhat inconsistently) referred to either as the Estimated Value Predictor (EVP) or as the Merchant Size Predictor (MSP).

Design

In the context of Merchant Size Prediction, our aim is to leverage pre-trained ML models on new lead data. By applying these models, we intend to predict the potential Merchant Size, thereby assisting SumUp in prioritizing leads and making informed decisions on which leads to contact first. This predictive approach enhances the efficiency of lead management and optimizes resource allocation for maximum impact.

The machine learning model, integral to the MSP, undergoes training on proprietary historical data sourced from SumUp. The training process aims to discern discriminative features that effectively stratify each class within the Merchant Size taxonomy. It is imperative to note that the confidentiality of the underlying data prohibits its public disclosure.

Data Fields

Data Field Definitions

This document outlines the data fields obtained for each lead. The data can be sourced from the online Lead Form or be retrieved from the internet using APIs.

Data Field Table

The most recent Data Fields table can now be found in a separate CSV File.

Data Fields CSV

The following table highlights the data fields obtained for each lead. The acquisition of such data may derive from the Lead Form or may be extracted from external sources utilizing APIs.

Data Field Definition

| Field Name | Type | Description | Data source | Dependencies | Example |
|---|---|---|---|---|---|
| Last Name | string | Last name of the lead | Lead data | | Mustermann |
| First Name | string | First name of the lead | Lead data | | Mustername |
| Company / Account | string | Company name of the lead | Lead data | | Mustercompany |
| Phone | string | Phone number of the lead | Lead data | | 49 1234 56789 |
| Email | string | Email of the lead | Lead data | | musteremail@example.com |
| domain | string | The domain of the email is the part that follows the "@" symbol, indicating the organization or service hosting the email address. | processing | Email | example.com |
| email_valid | boolean | Checks if the email is valid. | email_validator package | Email | True/False |
| first_name_in_account | boolean | Checks if first name is written in "Account" input | processing | First Name | True/False |
| last_name_in_account | boolean | Checks if last name is written in "Account" input | processing | Last Name | True/False |
| number_formatted | string | Phone number (formatted) | phonenumbers package | Phone | 49123456789 |
| number_country | string | Country derived from phone number | phonenumbers package | Phone | Germany |
| number_area | string | Area derived from phone number | phonenumbers package | Phone | Erlangen |
| number_valid | boolean | Indicator whether a phone number is valid | phonenumbers package | Phone | True/False |
| number_possible | boolean | Indicator whether a phone number is possible | phonenumbers package | Phone | True/False |
| google_places_place_id | string | Place ID used by Google | Google Places API | Company / Account | |
| google_places_business_status | string | Business Status | Google Places API | Company / Account | Operational |
| google_places_formatted_address | string | Formatted address | Google Places API | Company / Account | Musterstr.1 |
| google_places_name | string | Business Name | Google Places API | Company / Account | Mustername |
| google_places_user_ratings_total | integer | Total number of ratings | Google Places API | Company / Account | 100 |
| google_places_rating | float | Average star rating | Google Places API | Company / Account | 4.5 |
| google_places_price_level | float | Price level (1-3) | Google Places API | Company / Account | |
| google_places_candidate_count_mail | integer | Number of results from E-Mail based search | Google Places API | Company / Account | 1 |
| google_places_candidate_count_phone | integer | Number of results from Phone based search | Google Places API | Company / Account | 1 |
| google_places_place_id_matches_phone_search | boolean | Indicator whether phone based and E-Mail based search gave the same result | Google Places API | Company / Account | True/False |
| google_places_confidence | float | Indicator of confidence in the Google result | processing | | 0.9 |
| google_places_detailed_website | string | Link to business website | Google Places API | Company / Account | www.musterwebsite.de |
| google_places_detailed_type | list | Type of business | Google Places API | Company / Account | ["florist", "store"] |
| reviews_sentiment_score | float | Sentiment score between -1 and 1 for the reviews | GPT | Google reviews | 0.9 |
| regional_atlas_pop_density | float | Population density | Regional Atlas | google_places_formatted_address | 2649.6 |
| regional_atlas_pop_development | float | Population development | Regional Atlas | google_places_formatted_address | -96.5 |
| regional_atlas_age_0 | float | Age group | Regional Atlas | google_places_formatted_address | 16.3 |
| regional_atlas_age_1 | float | Age group | Regional Atlas | google_places_formatted_address | 8.2 |
| regional_atlas_age_2 | float | Age group | Regional Atlas | google_places_formatted_address | 31.1 |
| regional_atlas_age_3 | float | Age group | Regional Atlas | google_places_formatted_address | 26.8 |
| regional_atlas_age_4 | float | Age group | Regional Atlas | google_places_formatted_address | 17.7 |
| regional_atlas_pop_avg_age | float | Average population age | Regional Atlas | google_places_formatted_address | 42.1 |
| regional_atlas_per_service_sector | float | | Regional Atlas | google_places_formatted_address | 88.4 |
| regional_atlas_per_trade | float | | Regional Atlas | google_places_formatted_address | 28.9 |
| regional_atlas_employment_rate | float | Employment rate | Regional Atlas | google_places_formatted_address | 59.9 |
| regional_atlas_unemployment_rate | float | Unemployment rate | Regional Atlas | google_places_formatted_address | 6.4 |
| regional_atlas_per_long_term_unemployment | float | Long term unemployment | Regional Atlas | google_places_formatted_address | 49.9 |
| regional_atlas_investments_p_employee | float | Investments per employee | Regional Atlas | google_places_formatted_address | 6.8 |
| regional_atlas_gross_salary_p_employee | float | Gross salary per employee | Regional Atlas | google_places_formatted_address | 63.9 |
| regional_atlas_disp_income_p_inhabitant | float | Income per inhabitant | Regional Atlas | google_places_formatted_address | 23703 |
| regional_atlas_tot_income_p_taxpayer | float | Income per taxpayer | Regional Atlas | google_places_formatted_address | 45.2 |
| regional_atlas_gdp_p_employee | float | GDP per employee | Regional Atlas | google_places_formatted_address | 84983 |
| regional_atlas_gdp_development | float | GDP development | Regional Atlas | google_places_formatted_address | 5.2 |
| regional_atlas_gdp_p_inhabitant | float | GDP per inhabitant | Regional Atlas | google_places_formatted_address | 61845 |
| regional_atlas_gdp_p_workhours | float | GDP per workhours | Regional Atlas | google_places_formatted_address | 60.7 |
| regional_atlas_pop_avg_age_zensus | float | Average population age (from zensus) | Regional Atlas | google_places_formatted_address | 41.3 |
| regional_atlas_regional_score | float | Regional score | Regional Atlas | google_places_formatted_address | 3761.93 |
| review_avg_grammatical_score | float | Average grammatical score of reviews | processing | google_places_place_id | 0.56 |
| review_polarization_type | string | Polarization type of review ratings | processing | google_places_place_id | High-Rating Dominance |
| review_polarization_score | float | Polarization score of review ratings | processing | google_places_place_id | 1 |
| review_highest_rating_ratio | float | Ratio of the highest review ratings | processing | google_places_place_id | 1 |
| review_lowest_rating_ratio | float | Ratio of the lowest review ratings | processing | google_places_place_id | 0 |
| review_rating_trend | float | Value indicating the trend of ratings | processing | google_places_place_id | 0 |

Google Search Strategy


Introduction

In order to gather more information about a lead, we query the Google Places API. The API has multiple endpoints, enabling different search methods. To have the best chance of correctly identifying a lead, we try to combine the search methods and derive the most probable result.

Available Lead Information

| First Name | Last Name | Phone Number | Email |
|---|---|---|---|
| Max | Muster | +491721234567 | max-muster@mybusiness.com |
| Melanie | Muster | +491322133321 | melanies-flowershop@gmail.nl |

Available Search Methods
  1. Fulltext Search (used with components of the E-Mail address)

  2. Phone Number Search

Search Strategy

  1. Phone Number Search: If there's a valid phone number, look it up.

  2. Email Based Search: If there's a custom domain, look it up. Otherwise, unless it contains the full name, look up the E-Mail account (everything before the @ sign).

  3. If the Email-based search returned any results, use those.

  4. Else: return the Phone-based search results.

  5. Else: return nothing.

Search Strategy
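A minimal sketch of the decision logic described above; the two search functions are hypothetical placeholders for the actual Google Places API calls, and the full-name check is simplified:

# Hypothetical sketch of the search strategy; search_by_phone/search_by_text
# stand in for the Google Places API lookups and return dummy results here.
COMMON_PROVIDERS = {"gmail.com", "gmail.nl", "yahoo.com", "outlook.com"}

def search_by_phone(number: str) -> list:
    return []  # placeholder for a phone-number based lookup

def search_by_text(query: str) -> list:
    return []  # placeholder for a full-text lookup

def find_place(lead: dict) -> list:
    phone_results = search_by_phone(lead["Phone"]) if lead.get("number_valid") else []

    local_part, _, domain = lead["Email"].partition("@")
    contains_full_name = (
        lead["First Name"].lower() in local_part.lower()
        and lead["Last Name"].lower() in local_part.lower()
    )
    if domain not in COMMON_PROVIDERS:
        # Custom domain: look it up directly
        email_results = search_by_text(domain)
    elif not contains_full_name:
        # No custom domain: look up the e-mail account (everything before the @ sign)
        email_results = search_by_text(local_part)
    else:
        email_results = []

    # Prefer e-mail based results, fall back to phone based results, else nothing
    return email_results or phone_results

lead = {"First Name": "Max", "Last Name": "Muster", "Phone": "+491721234567",
        "Email": "max-muster@mybusiness.com", "number_valid": True}
print(find_place(lead))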

OpenLLM Business Type Analysis

Business Type Analysis: Research and Proposed Solution

Research

1. Open-source LLM Model : I explored an open-source LLM model named CrystalChat available on Hugging Face (https://huggingface.co/LLM360/CrystalChat). Despite its capabilities, it has some limitations:

  • Computational Intensity: CrystalChat is computationally heavy and cannot be run efficiently on local machines.

  • Infrastructure Constraints: Running the model on Colab, although feasible, faces GPU limitations.

2. OpenAI as an Alternative : Given the challenges with the open LLM model, OpenAI’s GPT models provide a viable solution. While GPT is known for its computational costs, it offers unparalleled language understanding and generation capabilities.

Proposed Solution

Considering the limitations of CrystalChat and the potential infrastructure costs associated with running an open LLM model on local machines, I propose the following solution:

  1. Utilize OpenAI Models: Leverage OpenAI models, which are known for their robust language capabilities.

  2. Manage Costs: Acknowledge the computational costs associated with GPT models and explore efficient usage options, such as optimizing queries or using cost-effective computing environments.

  3. Experiment with CrystalChat on AWS SageMaker: As part of due diligence, consider executing CrystalChat on AWS SageMaker to evaluate its performance and potential integration.

  4. Decision Making: After the experimentation phase, evaluate the performance, costs, and feasibility of both OpenAI and CrystalChat. Make an informed decision based on the achieved results.

Conclusion

Leveraging OpenAI’s GPT models offers advanced language understanding. To explore the potential of open-source LLM models, an experiment with CrystalChat on AWS SageMaker is suggested before making a final decision.

Classifier Comparison


Abstract

This report presents a comprehensive evaluation of various classifiers trained on the historical dataset, which has been enriched and preprocessed through our pipeline. Each model type was tested on two splits of the data set. The data set has five classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL. The first split of the data set used exactly these classes for prediction, corresponding to the exact classes given by SumUp. The other data set split grouped the classes S, M, and L into one new class, resulting in three classes of the form {XS}, {S, M, L}, and {XL}. While this does not exactly correspond to the given classes from SumUp, this simplification of the prediction task generally resulted in a better F1-score across models.

Experimental Attempts

In accordance with the no free lunch theorem, which indicates that no model is universally superior, multiple attempts were made to find the optimal solution. Unfortunately, certain models did not perform satisfactorily. The following models and methodologies were experimented with:

  • Quadratic Discriminant Analysis (QDA)

  • Ridge Classifier

  • Random Forest

  • Support Vector Machine (SVM)

  • Fully Connected Neural Networks Classifier Model (FCNNC)

  • Fully Connected Neural Networks Regression Model (FCNNR)

  • XGBoost Classifier Model

  • K Nearest Neighbor Classifier (KNN)

  • Bernoulli Naive Bayes Classifier

  • LightGBM

Models not performing well

Support Vector Machine Classifier Model

Training the Support Vector Machine (SVM) took so long that it never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them, given the data.

Fully Connected Neural Networks Classifier Model

The Fully Connected Neural Network (FCNN) achieved overall lower performance than the Random Forest classifier: it reached an F1-score of 0.84 on the XS class while scoring 0.00 on all other classes, i.e. it effectively learned only the XS class. The FCNN consisted of 4 layers overall, with a ReLU activation in each layer except the output layer, which used Softmax. The loss functions investigated were Cross-Entropy and L2 loss; the optimizers were Adam and Stochastic Gradient Descent. Moreover, skip connections, L1 and L2 regularization, and class weights were investigated as well. Unfortunately, we have not found any FCNN that outperforms the simpler ML models.
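For illustration, a hypothetical PyTorch sketch of such a 4-layer network; the framework choice, layer widths, and feature count are assumptions, and the Softmax mentioned above is applied implicitly by CrossEntropyLoss here:

# Hypothetical PyTorch sketch of the 4-layer FCNN described above;
# layer widths, feature count, and the use of PyTorch are assumptions.
import torch
from torch import nn

n_features, n_classes = 132, 5

model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, n_classes),  # logits; CrossEntropyLoss applies Softmax internally
)

criterion = nn.CrossEntropyLoss()  # optionally with class weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data
x, y = torch.randn(16, n_features), torch.randint(0, n_classes, (16,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()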

Fully Connected Neural Networks Regression Model

There is an idea described in the scientific paper "Inter-species cell detection - datasets on pulmonary hemosiderophages in equine, human and feline specimens" by Marzahl et al. (https://www.nature.com/articles/s41597-022-01389-0), where the authors propose using a regression model on a classification task. The idea is to train the regression model on the class values, so that the model predicts a continuous value and learns the relation between the classes. The output is then subjected to thresholds (0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49, 3.5-4.5) for the classes XS, S, M, L, and XL respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
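A small sketch of the threshold mapping described above (the rounding-based implementation is an assumption; any equivalent binning works):

# Maps a continuous regression output back to the five size classes using the
# thresholds 0-0.49 -> XS, 0.5-1.49 -> S, 1.5-2.49 -> M, 2.5-3.49 -> L, 3.5-4.5 -> XL.
def to_class(prediction: float) -> str:
    classes = ["XS", "S", "M", "L", "XL"]
    index = min(max(int(prediction + 0.5), 0), len(classes) - 1)
    return classes[index]

print([to_class(p) for p in [0.3, 1.2, 2.7, 4.1]])  # ['XS', 'S', 'L', 'XL']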

QDA & Ridge Classifier

Both of these classifiers could not produce a satisfactory performance on either data set split. While the prediction on the XS class was satisfactory (F1-score of ~0.84), all other classes had F1-scores of ~0.00-0.15, resulting in an overall F1-score of ~0.11, which is significantly outperformed by the other tested models. For this reason we are not considering these predictors in future experiments.

TabNet Architecture

TabNet, short for "Tabular Neural Network," is a novel neural network architecture specifically designed for tabular data, as commonly encountered in structured sources such as databases and CSV files. It was introduced in the paper titled "TabNet: Attentive Interpretable Tabular Learning" by Arik et al. (https://arxiv.org/abs/1908.07442). TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning, as the learning capacity is used for the most salient features. Unfortunately, similarly to our proposed 4-layer network, TabNet only learned the features of the XS class, with an F1-score of 0.84 on XS while the F1-scores of the other classes were zero. The underlying data does not seem to respond positively to neural network-based approaches.

Well performing models

In this sub-section we discuss the results of the well performing models, which are XGBoost, LightGBM, K-Nearest Neighbor (KNN), Random Forest, AdaBoost, and Naive Bayes.

Feature subsets

We have collected a lot of features (~54 data points) for the leads; additionally, one-hot encoding the categorical variables results in a high-dimensional feature space (132 features). Not all features might be equally relevant for our classification task, so we want to try different subsets. A sketch of such a subset selection is shown after the list below.

The following subsets are available:

  1. google_places_rating, google_places_user_ratings_total, google_places_confidence, regional_atlas_regional_score
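A minimal sketch of how such a subset could be selected from the one-hot encoded feature matrix, assuming it is held in a pandas DataFrame (not the project's actual code):

# Hypothetical sketch: selecting feature subset 1 from the preprocessed data.
import pandas as pd

SUBSET_1 = [
    "google_places_rating",
    "google_places_user_ratings_total",
    "google_places_confidence",
    "regional_atlas_regional_score",
]

def select_features(df: pd.DataFrame, subset=None) -> pd.DataFrame:
    # subset=None keeps all (one-hot encoded) features, otherwise only the listed columns
    return df if subset is None else df[subset]

# features_subset_1 = select_features(features, SUBSET_1)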

Overall Results

Notes:

  • The Random Forest Classifier used 100 estimators.

  • The AdaBoost Classifier used 100 DecisionTree classifiers.

  • The KNN classifier used a distance based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors for the 3-class split.

  • The XGBoost was trained for 10000 rounds.

  • The LightGBM was trained with 2000 leaves (see the configuration sketch below).
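For reference, a hedged sketch of how these configurations could look with scikit-learn, XGBoost, and LightGBM; every parameter not mentioned in the notes above is an assumption or a library default:

# Hypothetical sketch of the classifier configurations listed above;
# unspecified parameters are assumptions/defaults.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models_5_class = {
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),  # decision trees are the default base estimator
    "KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),  # 19 neighbors for the 3-class split
    "Naive Bayes": BernoulliNB(),
    "XGBoost": XGBClassifier(n_estimators=10000),  # 10000 boosting rounds
    "LightGBM": LGBMClassifier(num_leaves=2000),
}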

In the following tables we can see each model's overall weighted F1-score on the 3-class and 5-class data set splits. The best performing classifier per row is marked in bold.

| | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
|---|---|---|---|---|---|---|
| 5-Class | 0.6314 | 0.6073 | 0.6150 | **0.6442** | 0.6098 | 0.6405 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | **0.6967** | 0.6523 | 0.6956 |

| | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
|---|---|---|---|---|---|---|
| 5-Class | **0.6288** | 0.6075 | 0.5995 | 0.6198 | 0.6090 | 0.6252 |
| 3-Class | **0.6680** | 0.6075 | 0.6506 | 0.6664 | 0.6591 | 0.6644 |

We can see that all classifiers perform better on the 3-class data set split and that the XGBoost classifier is the best performing for both data set splits. These results are consistent for both the full dataset as well as subset 1. We observe a slight performance drop for almost all classifiers when using subset 1 compared to the full dataset (except AdaBoost/3-class and Naive Bayes/5-class). This indicates that the few features retained in subset 1 are not the sole discriminant features of the dataset. However, the performance is still high enough to suggest that the features in subset 1 are highly relevant for making classifications on the data.

Results for each class

5-class split

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
|---|---|---|---|---|---|---|
| XS | 0.82 | 0.83 | 0.81 | 0.84 | 0.77 | 0.83 |
| S | 0.15 | 0.02 | 0.13 | 0.13 | 0.22 | 0.14 |
| M | 0.08 | 0.02 | 0.09 | 0.08 | 0.14 | 0.09 |
| L | 0.06 | 0.00 | 0.08 | 0.06 | 0.07 | 0.05 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 | 0.14 | 0.21 |

| Class | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
|---|---|---|---|---|---|---|
| XS | 0.82 | 0.84 | 0.78 | 0.84 | 0.78 | 0.82 |
| S | 0.16 | 0.00 | 0.16 | 0.04 | 0.19 | 0.13 |
| M | 0.07 | 0.00 | 0.07 | 0.02 | 0.09 | 0.08 |
| L | 0.07 | 0.00 | 0.06 | 0.05 | 0.07 | 0.06 |
| XL | 0.19 | 0.00 | 0.11 | 0.13 | 0.14 | 0.18 |

For every model we can see that the predictions on the XS class are significantly better than for every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, having S and XL as their second best classes and M and L as their worst classes. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second best class. Using subset 1 again mostly decreased performance on all classes, with the exception of the KNN classifier and the classes L and XL, where we can observe a slight increase in F1-score.

3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
|---|---|---|---|---|---|---|
| XS | 0.83 | 0.82 | 0.81 | 0.84 | 0.78 | 0.83 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 | 0.34 | 0.34 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 | 0.12 | 0.19 |

| Class | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
|---|---|---|---|---|---|---|
| XS | 0.82 | 0.84 | 0.79 | 0.84 | 0.79 | 0.81 |
| S,M,L | 0.29 | 0.00 | 0.30 | 0.22 | 0.32 | 0.28 |
| XL | 0.18 | 0.00 | 0.11 | 0.11 | 0.20 | 0.17 |

For the 3-class split we observe similar performance across the models for the XS and {S, M, L} classes, with the LightGBM model slightly outperforming the others. The LightGBM classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split. The AdaBoost classifier, trained on subset 1, performs best for the XL class. The KNN classifier got a slight boost in performance for the {S, M, L} and XL classes when using subset 1. All other models perform worse on subset 1.

Conclusion

In summary, XGBoost consistently demonstrated superior performance, showcasing robust results across various splits and subsets. However, it is crucial to note that its elevated score is attributed to potential overfitting on the XS class. Given SumUp’s emphasis on accurate predictions for higher classes, we recommend considering LightGBM. This model outperformed XGBoost in predicting the XL class and the other classes, offering better results in both the five-class and three-class splits.

Concepts, Unrealized Ideas & Miscellaneous

Unused Ideas

This document lists ideas and implementations which have either not been tried yet or have been deprecated as they are not used in the current product version but still carry some conceptual value.

Deprecated

The original implementation of the deprecated modules can be found in the deprecated/ directory.

Controller

Note: This package has the additional dependency pydantic==2.4.2

The controller module was originally planned to be used as a communication device between EVP and BDC. Whenever the salesperson interface would register a new lead the controller is supposed to trigger the BDC pipeline to enrich the data of that lead and preprocess it to create a feature vector. The successful completion of the BDC pipeline is then registered at the controller which will then trigger an inference of the EVP to compute the predicted merchant size and write this back to the lead data. The computed merchant size can then be used to rank the leads and allow the salesperson to decide the value of the leads and which one to call.

The current implementation of the module supports queueing messages from the BDC and EVP as indicated by their type. Depending on the message type, the message is then routed to the corresponding module (EVP or BDC). The actual processing of the messages by the modules is not implemented. All of this is done asynchronously using the Python threading library. A minimal sketch of this queueing mechanism is shown below.
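A minimal, hypothetical sketch of this queueing mechanism using Python's queue and threading modules; the message format is an assumption:

# Hypothetical sketch of the controller's message queueing and routing;
# the message format ({"type": ..., "payload": ...}) is an assumption.
import queue
import threading

message_queue: queue.Queue = queue.Queue()

def route(message: dict) -> None:
    # Placeholder: forward the message to the BDC or EVP module
    print(f"routing message to {message['type'].upper()}")

def worker() -> None:
    while True:
        message = message_queue.get()  # blocks (idles) while the queue is empty
        route(message)
        message_queue.task_done()

# The listener thread enqueues incoming messages; processing happens asynchronously.
threading.Thread(target=worker, daemon=True).start()
message_queue.put({"type": "bdc", "payload": {"lead_id": 42}})
message_queue.join()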

FacebookGraphAPI

Note: This package has the additional dependency facebook-sdk==3.1.0. Also the environment variables FACEBOOK_APP_ID FACEBOOK_APP_SECRET need to be set with a valid token.

This step was supposed to be used for querying lead data from Facebook using either the business owner's name or the company name. The attempt was deprecated as the cost for the needed API token was evaluated as too high and because the usage permissions of the Facebook API were changed. Furthermore, it is paramount to check the legal ramifications of querying Facebook for this kind of data, as there might be legal consequences of searching for individuals on Facebook instead of their businesses due to data privacy regulations in the EU.

ScrapeAddresses

This step was an early experiment, using only the custom domain from an email address. We check if there's a live website running for the domain and then try to parse the main site for a business address using a RegEx pattern. The pattern is not very precise, and calling the website, as well as parsing it, takes quite some time, which accumulates for a lot of entries. The Google Places step yields better results for the business address and is faster, which is why scrape_addresses.py was deprecated.

Possible ML improvements

Creating data subsets

The data collected by the BDC pipeline has not been refined to only include semantically valuable data fields. It is possible that some data fields contain no predictive power, which would mean they are practically polluting the dataset with unnecessary information. A proper analysis of the predictive power of all data fields would allow cutting down the amount of data for each lead, reducing processing time and possibly making predictions more precise. This approach has been explored very briefly with subset 1 as described in Classifier-Comparison.md. However, the choice of included features has not been justified by experiments, making it somewhat arbitrary. Additionally, an analysis of this type could give insights into which data fields to expand on and what new data one might want to collect to increase the EVP's performance in predicting merchant sizes.

Possibly filtering data based on some quality metric could also improve general performance. The regional_atlas_score and google_confidence_score have been tried for this but did not improve performance. However, these values are computed somewhat arbitrarily and implementing a more refined quality metric might result in more promising results.

Controller

Automation

The Controller is a planned component that has not been implemented beyond a conceptual prototype. In the planned scenario, the controller would coordinate the BDC, the MSP, and the external components as a centralized instance of control. In contrast to our current design, this scenario would enable the automation of our current workflow, which currently requires several steps of human interaction to arrive at a prediction result for initially unprocessed lead data.

Diagrams

The following diagrams were created during the prototyping phase for the Controller component. As they are from an early stage of our project, the Merchant Size Predictor is labelled as the (Estimated) Value Predictor here.

Component Diagram

Component Diagram

Sequence Diagram

Sequence Diagram

Controller Workflow Diagram

Controller Workflow Diagram

Twitter API Limitation

Limitations of Twitter API for user information retrieval and biased sentiment analysis

This documentation highlights the research into, and the limitations of, customer information retrieval and unbiased sentiment analysis when using the Twitter API (tweepy). Two primary constraints are the absence of usernames in the provided customer data and inherent biases in tweet content, which significantly impact the API's utility for these purposes.

Limitation 1: Absence of usernames in provided customer data:

A fundamental shortfall of the Twitter API (tweepy) lies in the unavailability of usernames in the customer information obtained through its endpoints. Twitter (X) primarily uses usernames as identifiers to retrieve user information, whereas we only have the full names of the customers as identifiers.

Limitation 2: Inherent Biases in Tweet Content for Sentiment Analysis:

Conducting sentiment analysis on tweets extracted via the Twitter API poses challenges due to inherent biases embedded in tweets written by the customers themselves. Sentiment analysis on something like reviews would definitely be helpful; however, sentiment analysis on tweets written by the customers themselves introduces a strong bias.

Contribution

Contribution Workflow

Branching Strategy

main: contains fully stable production code

  • dev: contains stable under-development code

    • epic: contains a module branch, i.e. a high-level feature. For example, for an authentication module we can create a branch like "epic/authentication"

      • feature: contains a specific feature under the module. For example, under authentication we have a feature called registration. Sample branch name: "feature/registration"

      • bugfix: contains bug fixes made during the testing phase; the branch name starts with the issue number, for example "bugfix/3-validate-for-wrong-user-name"

Commits and Pull Requests

The stable branches main and dev are protected against direct pushes. To commit code to these branches, create a pull request (PR) describing the feature/bugfix that you are committing to the dev branch. This PR will then be reviewed by another SD from the project. Only after being approved by another SD may a PR be merged into the dev branch. Periodically, the stable code on the dev branch is merged into the main branch by creating a PR from dev. Hence, every feature that should be committed to the main branch must first run without issues on the dev branch for some time.

Before contributing to this repository make sure that you are identifiable in your git user settings. This way commits and PRs created by you can be identified and easily traced back.

git config --local user.name "Manu Musterperson"
git config --local user.email "manu@musterperson.org"

Any commit should always contain a commit message that references an issue created in the project. Also, always sign off on your commits for identification reasons.

git commit -m "Fixed issue #123" --signoff

When doing pair programming be sure to always have all SDs mentioned in the commit message. Each SD should be listed on a new line for clarity reasons.

git commit -a -m "Fixed problem #123
> Co-authored-by: Manu Musterperson <manu.musterperson@fau.de>" --signoff

Pull Request Workflow

The main and dev branches are protected against direct pushes, which means that we need a Pull Request (PR) in order to merge a developed branch into them. Suppose we have developed a branch (let's call it feature-1) and want to merge feature-1 into the main branch.

Here is a standard way to merge pull requests:

  1. Have all your local changes added, committed, and pushed on the remote feature-1 branch

    git checkout feature-1
    git add .
    git commit -m "added a feature" --signoff  # don't forget the signoff ;)
    git push
    
  2. Make sure your local main branch is up to date

    git checkout main
    git pull origin main
    
  3. Go to Pull Requests > click on “New pull request” > make sure the base is main branch (or dev branch, depends on which branch you want to update) and the compare to be your feature-1 branch, as highlighted in the photo below and click “create pull requests”: image

    Make sure to link the issue your PR relates to.

  4. Inform the other SDs on Slack that you have created the PR and that it is awaiting a review, then wait for others to review your code. The reviewers will potentially leave comments and change requests in their PR review. If this is the case, either explain why the change request is not warranted or check out your branch again, apply the requested changes, push your branch once more, and request another review from the reviewer. Once there are no more change requests and the PR has been approved by another SD, you can merge the PR into the target branch.

  5. Delete the feature branch feature-1 once it has been merged into the target branch.

In case of merge conflict:

Should we experience a merge conflict after step 3, we should resolve the merge conflicts manually: below the title "This branch has conflicts that must be resolved", click on the web editor (you can use VS Code or any editor you want). The conflict should look like this:

<<<<<<< HEAD
// Your changes at **feature-1** branch
=======
// Data already on the main branch
>>>>>>> main

  • Choose which of these changes you want to adopt for the merge into the main branch. We are better off solving merge conflicts together rather than alone, so feel free to announce it in the Slack group chat.

  • Mark the conflict as resolved and re-merge the PR; there shouldn't be any problem with it.

Feel free to add more about that matter here.

SBOM Generator

Automatic SBOM generation

pipenv install
pipenv shell

pip install pipreqs
pip install cyclonedx-bom
pip install pip-licenses

# Create the SBOM (cyclonedx-bom) based on (pipreqs) requirements that are actually imported in the .py files

$sbom = pipreqs --print | cyclonedx-py -r -pb -o - -i -

# Create an XmlDocument object
$xml = New-Object System.Xml.XmlDocument

# Load XML content into the XmlDocument
$xml.LoadXml($sbom)


# Create an empty CSV file
$csvPath = "SBOM.csv"

# Initialize an empty array to store rows
$result = @()

# Iterate through the XML nodes and create rows for each node
$xml.SelectNodes("//*[local-name()='component']") | ForEach-Object {

    $row = @{
        "Version" = $_.Version
        "Context" = $_.Purl
        "Name" = if ($_.Name -eq 'scikit_learn') { 'scikit-learn' } else { $_.Name }
    }

    # Get license information
    $match = pip-licenses --from=mixed --format=csv --with-system --packages $row.Name | ConvertFrom-Csv

    # Add license information to the row
    $result += [PSCustomObject]@{
        "Context" = $row.Context
        "Name" = $row.Name
        "Version" = $row.Version
        "License" = $match.License
    }
}

# Export the data to the CSV file
$result | Export-Csv -Path $csvPath -NoTypeInformation

# Create the license file
$licensePath = $csvPath + '.license'
@"
SPDX-License-Identifier: CC-BY-4.0
SPDX-FileCopyrightText: 2023 Fabian-Paul Utech <f.utech@gmx.net>
"@ | Out-File -FilePath $licensePath

exit

Miscellaneous

Miscellaneous Content

This file contains content that was moved over from our Wiki, which we gave up in favor of having the documentation available more centrally. The contents of this file might to some extent overlap with the contents found in other documentation files.

Knowledge Base

AWS

  1. New password has to be >= 16 char and contain special chars

  2. After changing the password you have to re-login

  3. Add MFA (IAM -> Users -> Your Name -> Access Info)

  4. MFA device = FirstName.LastName like the credential

  5. Re-login

  6. Get access keys:

    • IAM -> Users -> Your Name -> Access Info -> Scroll to Access Keys

    • Create new access key (for local development)

    • Accept the warning

    • Copy the secret key to your .env file

    • Don’t add description tags to your key

PR Management:

  1. Create PR

  2. Link issue

  3. Other SD reviews the PR

    • Modification needed?

      • Fix/Discuss issue in the GitHub comments

      • Make new commit

      • Return to step 3

    • No modification needed

      • Reviewer approves PR

  4. PR creator merges PR

  5. Delete the used branch

Branch-Management:

  • Remove branches after merging

  • Add reviews / pull requests so others check the code

  • Feature branches with dev instead of main as base

Pre-commit:

# If not installed yet
pip install pre-commit

# install the hooks; they are then executed automatically before every commit
pre-commit install

# execute pre-commit manually on all files
pre-commit run --all-files

Features

  • Existing Website (Pingable, SEO-Score, DNS Lookup)

  • Existing Google Business Entry (using the Google Places API)

    • Opening Times

    • Number, Quality of Ratings

    • Overall “completeness” of the entry/# of available datapoints

    • Price category

    • Phone Number (compare with lead form input)

    • Website (compare with lead form input)

    • Number of visitors (estimate revenue from that?)

    • Product recognition from images

    • Merchant Category (e.g. cafe, restaurant, retailer, etc.)

  • Performance Indicators (NorthData, some other API)

    • Revenue (as I understood, this should be > 5000$/month)

    • Number of Employees

    • Bundesanzeiger / Handelsregister (Deutschland API)

  • Popularity: Insta / facebook followers or website ranking on google

  • Business type: google or website extraction (maybe with ChatGPT)

  • Size of business: To categorize leads to decide whether they need to deal with a salesperson or self-direct their solution

  • Business profile

  • Sentiment Analysis: https://arxiv.org/pdf/2307.10234.pdf

Storage

  • Unique ID for Lead (Felix)?

  • How to handle frequent data layout changes at S3 (Simon)?

  • 3 stage file systems (Felix) vs. DB (Ruchita)?

  • 3 stage file system (Felix):

    • BDC trigger on single new lead entries or batches

    • After BDC enriched the data => store in a parquet file in the events folder with some tag

    • BDC triggers the creation of the feature vectors

    • Transform the data from the parquet file after it was stored in the events folder and store it in the features folder with the same tag

    • Use the data as an input for the model, which is triggered after the creation of the input, and store the results in the model folder

  • Maybe the 3 stage file system as a part of the DB and hide the final decision behind the database abstraction layer (Simon)?

Control flow (Berkay)

  • Listener

  • MessageQueue

  • RoutingQueue

The Listener, as the name suggests, listens for incoming messages from other components, such as the BDC and EVP, and enqueues these messages in the messageQueue to be "read" and processed. If there are no incoming messages, it is idle. The messageQueue is where the received messages are processed. After each message is processed by the messageQueue, it is enqueued in the routingQueue to be routed to the corresponding component. Both messageQueue and routingQueue are idle if there are no elements in the queues. The whole concept of the Controller is multi-threaded and asynchronous: while it accepts new incoming messages, it processes messages and at the same time routes other messages.

AI

expected value = life-time value of lead x probability of the lead becoming a customer

AI models needed that solve a regression or probability problem

AI Models
  • Classification:

    • Decision Trees

    • Random Forest

    • Neural Networks

    • Naïve Bayes

What data do we need?
  • Classification: Labeled data

  • Probability: Data with leads and customers

ML Pipeline
  1. Preprocessing

  2. Feature selection

  3. Dataset split / cross validation

  4. Dimensional reduction

  5. Training

  6. Testing / Evaluation

  7. Improve performance

    • Batch Normalization

    • Optimizer

    • L1 / L2 regularization: reduces overfitting by regularizing the model

    • Dropout (NN)

    • Depth and width (NN)

    • Initialization techniques (NN: Xavier and He)

      • He: Layers with ReLu activation

      • Xavier: Layers with sigmoid activation

Troubleshooting

Build

pipenv

pipenv install --dev gets stuck

Solution: Remove the .lock file and restart the PC

Docker
VSCode

Terminal can't run a Docker image (on Windows)

  • Solution: workaround with Git Bash or with Ubuntu

Testing
Reuse

To exclude a certain part of the code from the REUSE analysis, wrap it as follows:

# REUSE-IgnoreStart
  ...
# REUSE-IgnoreEnd
Failed checks
  1. Go to the specific pull request or to the Actions tab

  2. Click “show all checks”

  3. Click “details”

  4. Click on the elements with the “red marks”

BDC

Google Places API

Language is adjusted to the location from which the API is run

Google search results are based on the location from which the API is run

  • Solution: Pass a fixed point in the center of the country / city / area of the company (OSMNX) as a location bias; see the Google Places API documentation

Branch-Management

Divergent branch

Commits on local and remote are not the same

  • Solution:

    1. Pull remote changes

    2. Rebase the changes

    3. Resolve any conflicts in the commits you get from remote