Build Documentation

Creating the Environment

The repository contains the file .env.template. This file is a template for the environment variables that need to be set for the application to run. Copy this file into a file called .env at the root level of this repository and fill in all values with the corresponding secrets.

To create the virtual environment in this project you must have pipenv installed on your machine. Then run the following commands:

# for development environment
pipenv install --dev
# for production environment
pipenv install

To work within the environment you can now run:

# to activate the virtual environment
pipenv shell
# to run a single command
pipenv run <COMMAND>

Build Process

This application is built and tested on every push and on every pull request via GitHub Actions. For this, the pipenv environment is installed and the code style is checked using flake8. Finally, the tests in the tests/ directory are executed using pytest and a test coverage report is created using coverage. The test coverage report can be found in the GitHub Actions output.

In another task, all used packages are checked for their licenses to ensure that the software does not use any copyleft licenses and remains open source and free to use.

If any of these steps fail for a pull request, the pull request is blocked from being merged until the corresponding step passes.

Furthermore, it is required to install the pre-commit hooks as described in the Pre-Commit Hooks section below. This ensures a uniform coding style throughout the project and that the software is compliant with the REUSE licensing specifications.

Running the app

To run the application the pipenv environment must be installed and all needed environment variables must be set in the .env file. Then the application can be started via

pipenv run python src/main.py

Pre-Commit Hooks

This repository uses pre-commit hooks to ensure a consistent and clean file organization. Each registered hook will be executed when committing to the repository. To ensure that the hooks will be executed they need to be installed using the following command:

pre-commit install

The following things are done by hooks automatically:

  • formatting of Python files using black and isort

  • formatting of other files using prettier

  • syntax checks of JSON and YAML files

  • adding a newline at the end of files

  • removing trailing whitespace

  • preventing commits to the dev and main branches

  • checking adherence to the REUSE licensing specification

User Documentation

Project vision

This product will put a tool in the hands of our industry partner that can effectively increase the conversion of their leads into customers, primarily by providing the sales team with valuable information. The modular architecture makes our product future-proof by making it easy to add further data sources, employ improved prediction models, or adjust the output format if desired.

Project mission

The mission of this project is to enrich historical data about customers and recent data about leads (with information from external sources) and to leverage the enriched data in machine learning, so that the estimated Merchant Size of leads can be predicted.

Usage

To execute the final program, ensure the environment is installed (refer to build-documents.md) and run python .\src\main.py either locally or via the build process. The user will be presented with the following options:

Choose demo:
(0) : Base Data Collector
(1) : Data preprocessing
(2) : ML model training
(3) : Merchant Size Predictor
(4) : Exit

(0) : Base Data Collector

This is the data enrichment pipeline, utilizing multiple data enrichment steps. Configuration options are presented:

Do you want to list all available pipeline configs? (y/N) If y:

Please enter the index of requested pipeline config:
(0) : config_sprint09_release.json
(1) : just_run_search_offeneregister.json
(2) : run_all_steps.json
(3) : Exit
  • (0) Configuration used in sprint 9.

  • (1) Configuration for OffeneRegister.

  • (2) Runs all steps of the pipeline without manual step selection.

  • (3) Exits to the pipeline step selection.

If n, the program proceeds to the pipeline step selection for data enrichment. The following questions are asked:

Run Scrape Address (will take a long time)(y/N)?
Run Search OffeneRegister (will take a long time)(y/N)?
Run Phone Number Validation (y/N)?
Run Google API (will use token and generate cost!)(y/N)?
Run Google API Detailed (will use token and generate cost!)(y/N)?
Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?
Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?
Run Smart Review Insights (will take looong time!)(y/N)?
Run Regionalatlas (y/N)?
  • Run Scrape Address (will take a long time)(y/N)?: This enrichment step scrapes the lead's website for an address using regex.

  • Run Search OffeneRegister (will take a long time)(y/N)?: This enrichment step searches for company-related data using the OffeneRegisterAPI.

  • Run Phone Number Validation (y/N)?: This enrichment step checks whether the provided phone numbers are valid and extracts geographical information using geocoder. A minimal sketch of this kind of check is shown after this list.

  • Run Google API (will use token and generate cost!)(y/N)?: This enrichment step tries to find the correct business entry in the Google Maps database. It saves basic information along with the place ID, which can be used to retrieve further details, and a confidence score that indicates how likely it is that the correct result was found.

  • Run Google API Detailed (will use token and generate cost!)(y/N)?: This enrichment step tries to gather detailed information for a given Google business entry, identified by the place ID.

  • Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?: This enrichment step performs sentiment analysis on reviews using the GPT-4 model.

  • Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?: This enrichment step attempts to download a business's website in raw HTML format and pass it to OpenAI's GPT, which then attempts to summarize the raw contents and extract information valuable to a salesperson.

  • Run Smart Review Insights (will take looong time!)(y/N)?: This enrichment step computes additional review insights (e.g. grammatical score, polarization, and rating trend) for smart review analysis.

  • Run Regionalatlas (y/N)?: This enrichment step queries the RegionalAtlas database for location-based geographic and demographic information, based on the address that was found for a business through the Google API.
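As an illustration of the Phone Number Validation step, here is a minimal sketch using the phonenumbers package (which the data field table lists as the source of the number_* fields); the function name and the exact fields returned are assumptions, not the project's actual code:

# Hypothetical sketch of a phone validation step using the phonenumbers package;
# the function name and returned fields are assumptions.
import phonenumbers
from phonenumbers import geocoder

def validate_phone(raw_number: str, region: str = "DE") -> dict:
    """Parse a raw phone number and derive validity and location information."""
    parsed = phonenumbers.parse(raw_number, region)
    return {
        "number_formatted": phonenumbers.format_number(
            parsed, phonenumbers.PhoneNumberFormat.E164
        ),
        "number_valid": phonenumbers.is_valid_number(parsed),
        "number_possible": phonenumbers.is_possible_number(parsed),
        # Human-readable region, e.g. "Erlangen" or "Germany"
        "number_area": geocoder.description_for_number(parsed, "en"),
    }

print(validate_phone("+49 9131 1234567"))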

It is emphasized that some steps are dependent on others, and excluding one might result in dependency issues for subsequent steps.

After selecting the desired enrichment steps, a prompt asks the user to Set limit for data points to be processed (0=No limit), so that the user can choose whether to apply the data enrichment steps to all leads (no limit) or only to a certain number of leads.

Note: In case DATABASE_TYPE="S3" in your .env file, the limit is removed in order to enrich all the data and store it in the s3://amos--data--events S3 bucket.

(1) : Data preprocessing

Post data enrichment, preprocessing is crucial for machine learning models, involving scaling, numerical outlier removal, and categorical one-hot encoding. The user is prompted with questions:

  • Filter out the API-irrelevant data? (y/n): This filters out all leads that could not be enriched during the data enrichment steps. Removing them is useful for the machine learning algorithms to avoid introducing bias, even if the missing features are padded with zeros.

  • Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!: The user must have DATABASE_TYPE="S3" in the .env file in order to run on historical data; otherwise, preprocessing runs locally. After preprocessing, the log shows where the preprocessed data is stored.

(2) : ML model training

Six machine learning models are available:

(0) : Random Forest
(1) : XGBoost
(2) : Naive Bayes
(3) : KNN Classifier
(4) : AdaBoost
(5) : LightGBM

After selecting the desired machine learning model, the user is prompted with a series of questions:

  • Load model from file? (y/N) : In case of y, the program will ask for a file location of a previously saved model to use for predictions and testing.

  • Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N): In case of y, the S, M, and L labels of the data are grouped together into one class so that training is done on 3 classes ({XS}, {S, M, L}, {XL}) instead of 5. It is worth noting that grouping the S, M, and L classes together as one class boosted the classification performance; a minimal sketch of this grouping is shown below.

  • Do you want to train on a subset of features?
    (0) : ['Include all features']
    (1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score']
    

Option 0 includes all numerical and categorical one-hot-encoded features, while option 1 uses only a small subset of the data as features for the machine learning models.
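A minimal sketch of the 3-class grouping described above, assuming the labels are stored as strings in a pandas column (the column name merchant_size is hypothetical):

# Hypothetical sketch of the 3-class grouping; column and label storage are assumptions.
import pandas as pd

SIZE_TO_GROUP = {"XS": "XS", "S": "S,M,L", "M": "S,M,L", "L": "S,M,L", "XL": "XL"}

df = pd.DataFrame({"merchant_size": ["XS", "S", "M", "L", "XL", "XS"]})
# Map the five original labels onto the three grouped classes
df["merchant_size_3class"] = df["merchant_size"].map(SIZE_TO_GROUP)
print(df["merchant_size_3class"].value_counts())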

Then, the user would be given multiple options:

(1) Train
(2) Test
(3) Predict on single lead
(4) Save model
(5) Exit
  • (1): Train the current model on the current training dataset.

  • (2): Test the current model on the test dataset, displaying the mean squared error.

  • (3): Choose a single lead from the test dataset and display the prediction and true label.

  • (4): Save the current model to amos--models/models on S3 in case of DATABASE_TYPE=S3; otherwise, save it locally.

  • (5): Exit the EVP submenu.

(3) : Merchant Size Predictor

After training, testing, and saving a model, this option applies it to previously unseen leads in order to generate forecasted Merchant Size predictions for them.

(4) : Exit

Gracefully exit the program.

Design Documentation

Introduction

This application serves as a pivotal tool employed by our industry partner, SumUp, for the enrichment of information pertaining to potential leads garnered through their sign-up website. The enriched data is then used to predict the potential value a lead could contribute to SumUp, facilitated by a machine learning model. The application is split into two integral components: the Base Data Collector (BDC) and the Merchant Size Predictor (MSP).

Component Diagram

Component Diagram

External Software

Lead Form (LF)

The Lead Form is submitted by every new lead and provides a small set of data about the lead.

Customer Relationship Management (CRM)

The project output is made available to the sales team. This can be done in different ways, e.g. writing to a Google Sheet or pushing directly to SalesForce.

Components

Base Data Collector (BDC)

General description

The Base Data Collector (BDC) plays a crucial role in enriching the dataset related to potential client leads. The initial dataset solely comprises fundamental lead information, encompassing the lead’s first and last name, phone number, email address, and company name. Recognizing the insufficiency of this baseline data for value prediction, the BDC is designed to query diverse data sources, incorporating various Application Programming Interfaces (APIs), to enrich the provided lead data.

Design

The different data sources are organised as steps in the program. Each step extends from a common parent class and implements methods to validate that it can run, to perform the data collection from its source, and to perform cleanup and statistics reporting for itself. These steps are collected in a pipeline object that performs them sequentially to enrich the given data with all chosen data sources. A minimal sketch of this structure follows the list below. The data sources include:

  • inspecting the possible custom domain of the email address.

  • retrieving various data from the Google Places API.

  • analysing the sentiment of Google reviews using GPT.

  • inspecting the surrounding areas of the business using the Regional Atlas API.

  • searching for company-related data using the OffeneRegisterAPI.

  • performing sentiment analysis on reviews using GPT-4 model.
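The following is a minimal, hypothetical sketch of this step/pipeline structure; class and method names are illustrative and not taken from the actual code base:

# Hypothetical sketch of the step/pipeline design; all names are illustrative.
from abc import ABC, abstractmethod

import pandas as pd

class Step(ABC):
    """Common parent class of all data enrichment steps."""

    @abstractmethod
    def verify(self) -> bool:
        """Check whether the step can run (e.g. required API keys are set)."""

    @abstractmethod
    def run(self, leads: pd.DataFrame) -> pd.DataFrame:
        """Collect data from the source and add it to the lead data."""

    def finish(self) -> None:
        """Clean up and report statistics for this step."""

class Pipeline:
    """Runs the chosen steps sequentially to enrich the given lead data."""

    def __init__(self, steps: list):
        self.steps = steps

    def run(self, leads: pd.DataFrame) -> pd.DataFrame:
        for step in self.steps:
            if step.verify():
                leads = step.run(leads)
                step.finish()
        return leads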

Data storage

All data for this project is stored in CSV files in the client's AWS S3 storage. The files are split across three buckets: the input data and the enriched data are stored in the events bucket, the preprocessed data ready for use by ML models is stored in the features bucket, and the used model and its inference results are stored in the model bucket.

Data preprocessing

Following data enrichment, a pivotal phase in the machine learning pipeline is data preprocessing, an essential process encompassing scaling operations, numerical outlier elimination, and categorical one-hot encoding. This preprocessing stage transforms the output originating from the BDC into feature vectors, thereby rendering them amenable to predictive analysis by the machine learning model.
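For illustration, a minimal preprocessing sketch using pandas and scikit-learn; the z-score outlier rule and the column handling are assumptions, not the project's actual implementation:

# Hypothetical preprocessing sketch: numerical outlier removal, scaling,
# and categorical one-hot encoding; the concrete rules are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, numerical: list, categorical: list) -> pd.DataFrame:
    # Remove numerical outliers, here naively via a z-score threshold of 3
    scaler = StandardScaler()
    z = pd.DataFrame(scaler.fit_transform(df[numerical]), columns=numerical, index=df.index)
    df = df[(z.abs() <= 3).all(axis=1)].copy()

    # Scale the remaining numerical columns
    df[numerical] = scaler.fit_transform(df[numerical])

    # One-hot encode the categorical columns into feature vectors
    return pd.get_dummies(df, columns=categorical)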

Merchant Size Predictor (MSP) / Estimated Value Predictor (EVP)

Historical Note

The primary objective of the Estimated Value Predictor was initially oriented towards forecasting the estimated life-time value of leads. However, this objective evolved during the project's progression, primarily influenced by labelling considerations. The main objective has therefore changed to predicting only the size of a given lead, which can then be used as an indication of their potential life-time value. As a consequence, the component in question is now (somewhat inconsistently) referred to either as the Estimated Value Predictor (EVP) or as the Merchant Size Predictor (MSP).

Design

In the context of Merchant Size Prediction, our aim is to leverage pre-trained ML models on new lead data. By applying these models, we intend to predict the potential Merchant Size, thereby assisting SumUp in prioritizing leads and making informed decisions on which leads to contact first. This predictive approach enhances the efficiency of lead management and optimizes resource allocation for maximum impact.

The machine learning model, integral to the MSP, undergoes training on proprietary historical data sourced from SumUp. The training process aims to discern discriminative features that effectively stratify each class within the Merchant Size taxonomy. It is imperative to note that the confidentiality of the underlying data prohibits its public disclosure.

Data Fields

Data Field Definitions

This document outlines the data fields obtained for each lead. The data can be sourced from the online Lead Form or be retrieved from the internet using APIs.

Data Field Table

The most recent Data Fields table can now be found in a separate CSV File.

Data Fields CSV

The following table highlights the data fields obtained for each lead. The acquisition of such data may derive from the Lead Form or may be extracted from external sources utilizing APIs.

Data Field Definition

| Field Name | Type | Description | Data source | Dependencies | Example |
|---|---|---|---|---|---|
| Last Name | string | Last name of the lead | Lead data | | Mustermann |
| First Name | string | First name of the lead | Lead data | | Mustername |
| Company / Account | string | Company name of the lead | Lead data | | Mustercompany |
| Phone | string | Phone number of the lead | Lead data | | 49 1234 56789 |
| Email | string | Email of the lead | Lead data | | musteremail@example.com |
| domain | string | The domain of the email is the part that follows the "@" symbol, indicating the organization or service hosting the email address. | processing | Email | example.com |
| email_valid | boolean | Checks if the email is valid. | email_validator package | Email | True/False |
| first_name_in_account | boolean | Checks if first name is written in "Account" input | processing | First Name | True/False |
| last_name_in_account | boolean | Checks if last name is written in "Account" input | processing | Last Name | True/False |
| number_formatted | string | Phone number (formatted) | phonenumbers package | Phone | 49123456789 |
| number_country | string | Country derived from phone number | phonenumbers package | Phone | Germany |
| number_area | string | Area derived from phone number | phonenumbers package | Phone | Erlangen |
| number_valid | boolean | Indicator whether a phone number is valid | phonenumbers package | Phone | True/False |
| number_possible | boolean | Indicator whether a phone number is possible | phonenumbers package | Phone | True/False |
| google_places_place_id | string | Place ID used by Google | Google Places API | Company / Account | |
| google_places_business_status | string | Business Status | Google Places API | Company / Account | Operational |
| google_places_formatted_address | string | Formatted address | Google Places API | Company / Account | Musterstr.1 |
| google_places_name | string | Business Name | Google Places API | Company / Account | Mustername |
| google_places_user_ratings_total | integer | Total number of ratings | Google Places API | Company / Account | 100 |
| google_places_rating | float | Average star rating | Google Places API | Company / Account | 4.5 |
| google_places_price_level | float | Price level (1-3) | Google Places API | Company / Account | |
| google_places_candidate_count_mail | integer | Number of results from E-Mail based search | Google Places API | Company / Account | 1 |
| google_places_candidate_count_phone | integer | Number of results from Phone based search | Google Places API | Company / Account | 1 |
| google_places_place_id_matches_phone_search | boolean | Indicator whether phone based and E-Mail based search gave the same result | Google Places API | Company / Account | True/False |
| google_places_confidence | float | Indicator of confidence in the Google result | processing | | 0.9 |
| google_places_detailed_website | string | Link to business website | Google Places API | Company / Account | www.musterwebsite.de |
| google_places_detailed_type | list | Type of business | Google Places API | Company / Account | ["florist", "store"] |
| reviews_sentiment_score | float | Sentiment score between -1 and 1 for the reviews | GPT | Google reviews | 0.9 |
| regional_atlas_pop_density | float | Population density | Regional Atlas | google_places_formatted_address | 2649.6 |
| regional_atlas_pop_development | float | Population development | Regional Atlas | google_places_formatted_address | -96.5 |
| regional_atlas_age_0 | float | Age group | Regional Atlas | google_places_formatted_address | 16.3 |
| regional_atlas_age_1 | float | Age group | Regional Atlas | google_places_formatted_address | 8.2 |
| regional_atlas_age_2 | float | Age group | Regional Atlas | google_places_formatted_address | 31.1 |
| regional_atlas_age_3 | float | Age group | Regional Atlas | google_places_formatted_address | 26.8 |
| regional_atlas_age_4 | float | Age group | Regional Atlas | google_places_formatted_address | 17.7 |
| regional_atlas_pop_avg_age | float | Average population age | Regional Atlas | google_places_formatted_address | 42.1 |
| regional_atlas_per_service_sector | float | | Regional Atlas | google_places_formatted_address | 88.4 |
| regional_atlas_per_trade | float | | Regional Atlas | google_places_formatted_address | 28.9 |
| regional_atlas_employment_rate | float | Employment rate | Regional Atlas | google_places_formatted_address | 59.9 |
| regional_atlas_unemployment_rate | float | Unemployment rate | Regional Atlas | google_places_formatted_address | 6.4 |
| regional_atlas_per_long_term_unemployment | float | Long term unemployment | Regional Atlas | google_places_formatted_address | 49.9 |
| regional_atlas_investments_p_employee | float | Investments per employee | Regional Atlas | google_places_formatted_address | 6.8 |
| regional_atlas_gross_salary_p_employee | float | Gross salary per employee | Regional Atlas | google_places_formatted_address | 63.9 |
| regional_atlas_disp_income_p_inhabitant | float | Income per inhabitant | Regional Atlas | google_places_formatted_address | 23703 |
| regional_atlas_tot_income_p_taxpayer | float | Income per taxpayer | Regional Atlas | google_places_formatted_address | 45.2 |
| regional_atlas_gdp_p_employee | float | GDP per employee | Regional Atlas | google_places_formatted_address | 84983 |
| regional_atlas_gdp_development | float | GDP development | Regional Atlas | google_places_formatted_address | 5.2 |
| regional_atlas_gdp_p_inhabitant | float | GDP per inhabitant | Regional Atlas | google_places_formatted_address | 61845 |
| regional_atlas_gdp_p_workhours | float | GDP per workhours | Regional Atlas | google_places_formatted_address | 60.7 |
| regional_atlas_pop_avg_age_zensus | float | Average population age (from zensus) | Regional Atlas | google_places_formatted_address | 41.3 |
| regional_atlas_regional_score | float | Regional score | Regional Atlas | google_places_formatted_address | 3761.93 |
| review_avg_grammatical_score | float | Average grammatical score of reviews | processing | google_places_place_id | 0.56 |
| review_polarization_type | string | Polarization type of review ratings | processing | google_places_place_id | High-Rating Dominance |
| review_polarization_score | float | Polarization score of review ratings | processing | google_places_place_id | 1 |
| review_highest_rating_ratio | float | Ratio of the highest review ratings | processing | google_places_place_id | 1 |
| review_lowest_rating_ratio | float | Ratio of the lowest review ratings | processing | google_places_place_id | 0 |
| review_rating_trend | float | Value indicating the trend of ratings | processing | google_places_place_id | 0 |

Google Search Strategy


Introduction

In order to gather more information about a lead, we query the Google Places API. The API has multiple endpoints, enabling different search methods. To have the best chance of correctly identifying a lead, we try to combine the search methods and derive the most probable result.

Available Lead Information

| First Name | Last Name | Phone Number | Email |
|---|---|---|---|
| Max | Muster | +491721234567 | max-muster@mybusiness.com |
| Melanie | Muster | +491322133321 | melanies-flowershop@gmail.nl |

Available Search Methods
  1. Fulltext Search (used with components of the E-Mail address)

  2. Phone Number Search

Search Strategy

  1. Phone Number Search: If there's a valid phone number, look it up.

  2. Email Based Search: If there's a custom domain, look it up. Otherwise, unless it contains the full name, look up the E-Mail account (everything before the @ sign).

  3. If the Email-based search returned any results, use those.

  4. Else: return the Phone-based search results.

  5. Else: return nothing.

Search Strategy
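A minimal sketch of the decision logic described above; the two search functions are hypothetical placeholders for the actual Google Places API calls, and the full-name check is simplified:

# Hypothetical sketch of the search strategy; search_by_phone/search_by_text
# stand in for the Google Places API lookups and return dummy results here.
COMMON_PROVIDERS = {"gmail.com", "gmail.nl", "yahoo.com", "outlook.com"}

def search_by_phone(number: str) -> list:
    return []  # placeholder for a phone-number based lookup

def search_by_text(query: str) -> list:
    return []  # placeholder for a full-text lookup

def find_place(lead: dict) -> list:
    phone_results = search_by_phone(lead["Phone"]) if lead.get("number_valid") else []

    local_part, _, domain = lead["Email"].partition("@")
    contains_full_name = (
        lead["First Name"].lower() in local_part.lower()
        and lead["Last Name"].lower() in local_part.lower()
    )
    if domain not in COMMON_PROVIDERS:
        # Custom domain: look it up directly
        email_results = search_by_text(domain)
    elif not contains_full_name:
        # No custom domain: look up the e-mail account (everything before the @ sign)
        email_results = search_by_text(local_part)
    else:
        email_results = []

    # Prefer e-mail based results, fall back to phone based results, else nothing
    return email_results or phone_results

lead = {"First Name": "Max", "Last Name": "Muster", "Phone": "+491721234567",
        "Email": "max-muster@mybusiness.com", "number_valid": True}
print(find_place(lead))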

OpenLLM Business Type Analysis

Business Type Analysis: Research and Proposed Solution

Research

1. Open-source LLM Model : I explored an open-source LLM model named CrystalChat available on Hugging Face (https://huggingface.co/LLM360/CrystalChat). Despite its capabilities, it has some limitations:

  • Computational Intensity: CrystalChat is computationally heavy and cannot be run efficiently on local machines.

  • Infrastructure Constraints: Running the model on Colab, although feasible, faces GPU limitations.

2. OpenAI as an Alternative : Given the challenges with the open LLM model, OpenAI’s GPT models provide a viable solution. While GPT is known for its computational costs, it offers unparalleled language understanding and generation capabilities.

Proposed Solution

Considering the limitations of CrystalChat and the potential infrastructure costs associated with running an open LLM model on local machines, I propose the following solution:

  1. Utilize OpenAI Models: Leverage OpenAI models, which are known for their robust language capabilities.

  2. Manage Costs: Acknowledge the computational costs associated with GPT models and explore efficient usage options, such as optimizing queries or using cost-effective computing environments.

  3. Experiment with CrystalChat on AWS SageMaker: As part of due diligence, consider executing CrystalChat on AWS SageMaker to evaluate its performance and potential integration.

  4. Decision Making: After the experimentation phase, evaluate the performance, costs, and feasibility of both OpenAI and CrystalChat. Make an informed decision based on the achieved results.

Conclusion

Leveraging OpenAI’s GPT models offers advanced language understanding. To explore the potential of open-source LLM models, an experiment with CrystalChat on AWS SageMaker is suggested before making a final decision.

Classifier Comparison


Abstract

This report presents a comprehensive evaluation of various classifiers trained on the historical dataset, which has been enriched and preprocessed through our pipeline. Each model type was tested on two splits of the data set. The data set has five classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL. The first split of the data set used exactly these classes for prediction, corresponding to the exact classes given by SumUp. The other data set split grouped the classes S, M, and L into one new class, resulting in three classes of the form {XS}, {S, M, L}, and {XL}. While this does not exactly correspond to the given classes from SumUp, this simplification of the prediction task generally resulted in a better F1-score across models.

Experimental Attempts

In accordance with the no free lunch theorem, which indicates that no model is universally superior, multiple attempts were made to find the optimal solution. Unfortunately, certain models did not perform satisfactorily. The following models and methodologies were experimented with:

  • Quadratic Discriminant Analysis (QDA)

  • Ridge Classifier

  • Random Forest

  • Support Vector Machine (SVM)

  • Fully Connected Neural Networks Classifier Model (FCNNC)

  • Fully Connected Neural Networks Regression Model (FCNNR)

  • XGBoost Classifier Model

  • K Nearest Neighbor Classifier (KNN)

  • Bernoulli Naive Bayes Classifier

  • LightGBM

Models not performing well

Support Vector Machine Classifier Model

Training the Support Vector Machine (SVM) took so long that it never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them, given the data.

Fully Connected Neural Networks Classifier Model

The Fully Connected Neural Network (FCNN) achieved overall lower performance than the Random Forest classifier: it reached an F1-score of 0.84 on the XS class while scoring 0.00 on all other classes, i.e. it effectively learned only the XS class. The FCNN consisted of 4 layers overall, with a ReLU activation in each layer except the output layer, which used Softmax. The loss functions investigated were Cross-Entropy and L2 loss; the optimizers were Adam and Stochastic Gradient Descent. Moreover, skip connections, L1 and L2 regularization, and class weights were investigated as well. Unfortunately, we have not found any FCNN that outperforms the simpler ML models.
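For illustration, a hypothetical PyTorch sketch of such a 4-layer network; the framework choice, layer widths, and feature count are assumptions, and the Softmax mentioned above is applied implicitly by CrossEntropyLoss here:

# Hypothetical PyTorch sketch of the 4-layer FCNN described above;
# layer widths, feature count, and the use of PyTorch are assumptions.
import torch
from torch import nn

n_features, n_classes = 132, 5

model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, n_classes),  # logits; CrossEntropyLoss applies Softmax internally
)

criterion = nn.CrossEntropyLoss()  # optionally with class weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data
x, y = torch.randn(16, n_features), torch.randint(0, n_classes, (16,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()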

Fully Connected Neural Networks Regression Model

There is an idea described in the scientific paper "Inter-species cell detection - datasets on pulmonary hemosiderophages in equine, human and feline specimens" by Marzahl et al. (https://www.nature.com/articles/s41597-022-01389-0), where the authors propose using a regression model on a classification task. The idea is to train the regression model on the class values, so that the model predicts a continuous value and learns the relation between the classes. The output is then subjected to thresholds (0-0.49, 0.5-1.49, 1.5-2.49, 2.5-3.49, 3.5-4.5) for the classes XS, S, M, L, and XL respectively. This yielded better performance than the FCNN classifier but was still worse than that of the Random Forest.
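A small sketch of the threshold mapping described above (the rounding-based implementation is an assumption; any equivalent binning works):

# Maps a continuous regression output back to the five size classes using the
# thresholds 0-0.49 -> XS, 0.5-1.49 -> S, 1.5-2.49 -> M, 2.5-3.49 -> L, 3.5-4.5 -> XL.
def to_class(prediction: float) -> str:
    classes = ["XS", "S", "M", "L", "XL"]
    index = min(max(int(prediction + 0.5), 0), len(classes) - 1)
    return classes[index]

print([to_class(p) for p in [0.3, 1.2, 2.7, 4.1]])  # ['XS', 'S', 'L', 'XL']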

QDA & Ridge Classifier

Both of these classifiers could not produce a satisfactory performance on either data set split. While the prediction on the XS class was satisfactory (F1-score of ~0.84), all other classes had F1-scores of ~0.00-0.15, resulting in an overall F1-score of ~0.11, which is significantly outperformed by the other tested models. For this reason we are not considering these predictors in future experiments.

TabNet Architecture

TabNet, short for "Tabular Neural Network," is a novel neural network architecture specifically designed for tabular data, as commonly encountered in structured sources such as databases and CSV files. It was introduced in the paper titled "TabNet: Attentive Interpretable Tabular Learning" by Arik et al. (https://arxiv.org/abs/1908.07442). TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning, as the learning capacity is used for the most salient features. Unfortunately, similarly to our proposed 4-layer network, TabNet only learned the features of the XS class, with an F1-score of 0.84 on XS while the F1-scores of the other classes were zero. The underlying data does not seem to respond positively to neural network-based approaches.

Well performing models

In this sub-section we discuss the results of the well performing models, which are XGBoost, LightGBM, K-Nearest Neighbor (KNN), Random Forest, AdaBoost, and Naive Bayes.

Feature subsets

We have collected a lot of features (~54 data points) for the leads; additionally, one-hot encoding the categorical variables results in a high-dimensional feature space (132 features). Not all features might be equally relevant for our classification task, so we want to try different subsets. A sketch of such a subset selection is shown after the list below.

The following subsets are available:

  1. google_places_rating, google_places_user_ratings_total, google_places_confidence, regional_atlas_regional_score
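A minimal sketch of how such a subset could be selected from the one-hot encoded feature matrix, assuming it is held in a pandas DataFrame (not the project's actual code):

# Hypothetical sketch: selecting feature subset 1 from the preprocessed data.
import pandas as pd

SUBSET_1 = [
    "google_places_rating",
    "google_places_user_ratings_total",
    "google_places_confidence",
    "regional_atlas_regional_score",
]

def select_features(df: pd.DataFrame, subset=None) -> pd.DataFrame:
    # subset=None keeps all (one-hot encoded) features, otherwise only the listed columns
    return df if subset is None else df[subset]

# features_subset_1 = select_features(features, SUBSET_1)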

Overall Results

Notes:

  • The Random Forest Classifier used 100 estimators.

  • The AdaBoost Classifier used 100 DecisionTree classifiers.

  • The KNN classifier used a distance based weighting for the evaluated neighbors and considered 10 neighbors in the 5-class split and 19 neighbors for the 3-class split.

  • The XGBoost was trained for 10000 rounds.

  • The LightGBM was trained with 2000 leaves (see the configuration sketch below).
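For reference, a hedged sketch of how these configurations could look with scikit-learn, XGBoost, and LightGBM; every parameter not mentioned in the notes above is an assumption or a library default:

# Hypothetical sketch of the classifier configurations listed above;
# unspecified parameters are assumptions/defaults.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

models_5_class = {
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),  # decision trees are the default base estimator
    "KNN": KNeighborsClassifier(n_neighbors=10, weights="distance"),  # 19 neighbors for the 3-class split
    "Naive Bayes": BernoulliNB(),
    "XGBoost": XGBClassifier(n_estimators=10000),  # 10000 boosting rounds
    "LightGBM": LGBMClassifier(num_leaves=2000),
}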

In the following tables we can see each model's overall weighted F1-score on the 3-class and 5-class data set splits. The best performing classifier per row is marked in bold.

| | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
|---|---|---|---|---|---|---|
| 5-Class | 0.6314 | 0.6073 | 0.6150 | **0.6442** | 0.6098 | 0.6405 |
| 3-Class | 0.6725 | 0.6655 | 0.6642 | **0.6967** | 0.6523 | 0.6956 |

| | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
|---|---|---|---|---|---|---|
| 5-Class | **0.6288** | 0.6075 | 0.5995 | 0.6198 | 0.6090 | 0.6252 |
| 3-Class | **0.6680** | 0.6075 | 0.6506 | 0.6664 | 0.6591 | 0.6644 |

We can see that all classifiers perform better on the 3-class data set split and that the XGBoost classifier is the best performing for both data set splits. These results are consistent for both the full dataset as well as subset 1. We observe a slight performance drop for almost all classifiers when using subset 1 compared to the full dataset (except AdaBoost/3-class and Naive Bayes/5-class). This indicates that the few features retained in subset 1 are not the sole discriminant features of the dataset. However, the performance is still high enough to suggest that the features in subset 1 are highly relevant for making classifications on the data.

Results for each class

5-class split

In the following table we can see the F1-score of each model for each class in the 5-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
|---|---|---|---|---|---|---|
| XS | 0.82 | 0.83 | 0.81 | 0.84 | 0.77 | 0.83 |
| S | 0.15 | 0.02 | 0.13 | 0.13 | 0.22 | 0.14 |
| M | 0.08 | 0.02 | 0.09 | 0.08 | 0.14 | 0.09 |
| L | 0.06 | 0.00 | 0.08 | 0.06 | 0.07 | 0.05 |
| XL | 0.18 | 0.10 | 0.15 | 0.16 | 0.14 | 0.21 |

| Class | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
|---|---|---|---|---|---|---|
| XS | 0.82 | 0.84 | 0.78 | 0.84 | 0.78 | 0.82 |
| S | 0.16 | 0.00 | 0.16 | 0.04 | 0.19 | 0.13 |
| M | 0.07 | 0.00 | 0.07 | 0.02 | 0.09 | 0.08 |
| L | 0.07 | 0.00 | 0.06 | 0.05 | 0.07 | 0.06 |
| XL | 0.19 | 0.00 | 0.11 | 0.13 | 0.14 | 0.18 |

For every model we can see that the predictions on the XS class are significantly better than for every other class. The KNN, Random Forest, and XGBoost classifiers all perform similarly, having S and XL as their second best classes and M and L as their worst classes. The Naive Bayes classifier performs significantly worse on the S, M, and L classes and has XL as its second best class. Using subset 1 again mostly decreased performance on all classes, with the exception of the KNN classifier and the classes L and XL, where we can observe a slight increase in F1-score.

3-class split

In the following table we can see the F1-score of each model for each class in the 3-class split:

| Class | KNN | Naive Bayes | Random Forest | XGBoost | AdaBoost | LightGBM |
|---|---|---|---|---|---|---|
| XS | 0.83 | 0.82 | 0.81 | 0.84 | 0.78 | 0.83 |
| S,M,L | 0.27 | 0.28 | 0.30 | 0.33 | 0.34 | 0.34 |
| XL | 0.16 | 0.07 | 0.13 | 0.14 | 0.12 | 0.19 |

| Class | KNN (subset=1) | Naive Bayes (subset=1) | RandomForest (subset=1) | XGBoost (subset=1) | AdaBoost (subset=1) | LightGBM (subset=1) |
|---|---|---|---|---|---|---|
| XS | 0.82 | 0.84 | 0.79 | 0.84 | 0.79 | 0.81 |
| S,M,L | 0.29 | 0.00 | 0.30 | 0.22 | 0.32 | 0.28 |
| XL | 0.18 | 0.00 | 0.11 | 0.11 | 0.20 | 0.17 |

For the 3-class split we observe similar performance across the models for the XS and {S, M, L} classes, with the LightGBM model slightly outperforming the others. The LightGBM classifier performs best on the XL class, while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes, while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split. The AdaBoost classifier, trained on subset 1, performs best for the XL class. The KNN classifier got a slight boost in performance for the {S, M, L} and XL classes when using subset 1. All other models perform worse on subset 1.

Conclusion

In summary, XGBoost consistently demonstrated superior performance, showcasing robust results across various splits and subsets. However, it is crucial to note that its elevated score is attributed to potential overfitting on the XS class. Given SumUp’s emphasis on accurate predictions for higher classes, we recommend considering LightGBM. This model outperformed XGBoost in predicting the XL class and the other classes, offering better results in both the five-class and three-class splits.

Concepts, Unrealized Ideas & Miscellaneous

Unused Ideas

This document lists ideas and implementations which have either not been tried yet or have been deprecated as they are not used in the current product version but still carry some conceptual value.

Deprecated

The original implementation of the deprecated modules can be found in the deprecated/ directory.

Controller

Note: This package has the additional dependency pydantic==2.4.2

The controller module was originally planned to be used as a communication device between EVP and BDC. Whenever the salesperson interface would register a new lead the controller is supposed to trigger the BDC pipeline to enrich the data of that lead and preprocess it to create a feature vector. The successful completion of the BDC pipeline is then registered at the controller which will then trigger an inference of the EVP to compute the predicted merchant size and write this back to the lead data. The computed merchant size can then be used to rank the leads and allow the salesperson to decide the value of the leads and which one to call.

The current implementation of the module supports queueing messages from the BDC and EVP as indicated by their type. Depending on the message type, the message is then routed to the corresponding module (EVP or BDC). The actual processing of the messages by the modules is not implemented. All of this is done asynchronously using the Python threading library. A minimal sketch of this queueing mechanism is shown below.
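A minimal, hypothetical sketch of this queueing mechanism using Python's queue and threading modules; the message format is an assumption:

# Hypothetical sketch of the controller's message queueing and routing;
# the message format ({"type": ..., "payload": ...}) is an assumption.
import queue
import threading

message_queue: queue.Queue = queue.Queue()

def route(message: dict) -> None:
    # Placeholder: forward the message to the BDC or EVP module
    print(f"routing message to {message['type'].upper()}")

def worker() -> None:
    while True:
        message = message_queue.get()  # blocks (idles) while the queue is empty
        route(message)
        message_queue.task_done()

# The listener thread enqueues incoming messages; processing happens asynchronously.
threading.Thread(target=worker, daemon=True).start()
message_queue.put({"type": "bdc", "payload": {"lead_id": 42}})
message_queue.join()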

FacebookGraphAPI

Note: This package has the additional dependency facebook-sdk==3.1.0. Also the environment variables FACEBOOK_APP_ID FACEBOOK_APP_SECRET need to be set with a valid token.

This step was supposed to be used for querying lead data from Facebook using either the business owner's name or the company name. The attempt was deprecated as the cost for the needed API token was evaluated as too high and because the usage permissions of the Facebook API were changed. Furthermore, it is paramount to check the legal ramifications of querying Facebook for this kind of data, as there might be legal consequences of searching for individuals on Facebook instead of their businesses due to data privacy regulations in the EU.

ScrapeAddresses

This step was an early experiment, using only the custom domain from an email address. We check if there's a live website running for the domain and then try to parse the main site for a business address using a RegEx pattern. The pattern is not very precise, and calling the website, as well as parsing it, takes quite some time, which accumulates for a lot of entries. The Google Places step yields better results for the business address and is faster, which is why scrape_addresses.py was deprecated.

Possible ML improvements

Creating data subsets

The data collected by the BDC pipeline has not been refined to only include semantically valuable data fields. It is possible that some data fields contain no predictive power, which would mean they are practically polluting the dataset with unnecessary information. A proper analysis of the predictive power of all data fields would allow cutting down the amount of data for each lead, reducing processing time and possibly making predictions more precise. This approach has been explored very briefly with subset 1 as described in Classifier-Comparison.md. However, the choice of included features has not been justified by experiments, making it somewhat arbitrary. Additionally, an analysis of this type could give insights into which data fields to expand on and what new data one might want to collect to increase the EVP's performance in predicting merchant sizes.

Possibly filtering data based on some quality metric could also improve general performance. The regional_atlas_score and google_confidence_score have been tried for this but did not improve performance. However, these values are computed somewhat arbitrarily and implementing a more refined quality metric might result in more promising results.

Controller

Automation

The Controller is a planned component that has not been implemented beyond a conceptual prototype. In the planned scenario, the controller would coordinate the BDC, the MSP, and the external components as a centralized instance of control. In contrast to our current design, this scenario would enable the automation of our current workflow, which currently requires several steps of human interaction to arrive at a prediction result for initially unprocessed lead data.

Diagrams

The following diagrams were created during the prototyping phase for the Controller component. As they are from an early stage of our project, the Merchant Size Predictor is labelled as the (Estimated) Value Predictor here.

Component Diagram

Component Diagram

Sequence Diagram

Sequence Diagram

Controller Workflow Diagram

Controller Workflow Diagram

Twitter API Limitation

Limitations of Twitter API for user information retrieval and biased sentiment analysis

This documentation highlights the research into, and the limitations of, customer information retrieval and unbiased sentiment analysis when using the Twitter API (tweepy). Two primary constraints are the absence of usernames in the provided customer data and inherent biases in tweet content, which significantly impact the API's utility for these purposes.

Limitation 1: Absence of usernames in provided customer data:

A fundamental shortfall of the Twitter API (tweepy) lies in the unavailability of usernames in the customer information obtained through its endpoints. Twitter (X) primarily uses usernames as identifiers to retrieve user information, whereas we only have the full names of the customers as identifiers.

Limitation 2: Inherent Biases in Tweet Content for Sentiment Analysis:

Conducting sentiment analysis on tweets extracted via the Twitter API poses challenges due to inherent biases embedded in tweets written by the customers themselves. Sentiment analysis on something like reviews would definitely be helpful; however, sentiment analysis on tweets written by the customers themselves introduces a strong bias.

Contribution

Contribution Workflow

Branching Strategy

main: contains fully stable production code

  • dev: contains stable under-development code

    • epic: contains a module branch, i.e. a high-level feature. For example, for an authentication module we can create a branch like "epic/authentication"

      • feature: contains a specific feature under the module. For example, under authentication we have a feature called registration. Sample branch name: "feature/registration"

      • bugfix: contains bug fixes made during the testing phase; the branch name starts with the issue number, for example "bugfix/3-validate-for-wrong-user-name"

Commits and Pull Requests

The stable branches main and dev are protected against direct pushes. To commit code to these branches, create a pull request (PR) describing the feature/bugfix that you are committing to the dev branch. This PR will then be reviewed by another SD from the project. Only after being approved by another SD may a PR be merged into the dev branch. Periodically, the stable code on the dev branch is merged into the main branch by creating a PR from dev. Hence, every feature that should be committed to the main branch must first run without issues on the dev branch for some time.

Before contributing to this repository make sure that you are identifiable in your git user settings. This way commits and PRs created by you can be identified and easily traced back.

git config --local user.name "Manu Musterperson"
git config --local user.email "manu@musterperson.org"

Any commit should always contain a commit message that references an issue created in the project. Also, always sign off on your commits for identification reasons.

git commit -m "Fixed issue #123" --signoff

When doing pair programming be sure to always have all SDs mentioned in the commit message. Each SD should be listed on a new line for clarity reasons.

git commit -a -m "Fixed problem #123
> Co-authored-by: Manu Musterperson <manu.musterperson@fau.de>" --signoff

Pull Request Workflow

The main and dev branches are protected against direct pushes, which means that we need a Pull Request (PR) in order to merge a developed branch into them. Suppose we have developed a branch (let's call it feature-1) and want to merge feature-1 into the main branch.

Here is a standard way to merge pull requests:

  1. Have all your local changes added, committed, and pushed on the remote feature-1 branch

    git checkout feature-1
    git add .
    git commit -m "added a feature" --signoff  # don't forget the signoff ;)
    git push
    
  2. Make sure your local main branch is up to date

    git checkout main
    git pull origin main
    
  3. Go to Pull Requests > click on “New pull request” > make sure the base is main branch (or dev branch, depends on which branch you want to update) and the compare to be your feature-1 branch, as highlighted in the photo below and click “create pull requests”: image

    Make sure to link the issue your PR relates to.

  4. Inform the other SDs on Slack that you have created the PR and that it is awaiting a review, then wait for others to review your code. The reviewers will potentially leave comments and change requests in their PR review. If this is the case, either explain why the change request is not warranted or check out your branch again, apply the requested changes, push your branch once more, and request another review from the reviewer. Once there are no more change requests and the PR has been approved by another SD, you can merge the PR into the target branch.

  5. Delete the feature branch feature-1 once it has been merged into the target branch.

In case of merge conflict:

Should we experience a merge conflict after step 3, we should resolve the merge conflicts manually: below the title "This branch has conflicts that must be resolved", click on the web editor (you can use VS Code or any editor you want). The conflict should look like this:

<<<<<<< HEAD
// Your changes at **feature-1** branch
=======
// Data already on the main branch
>>>>>>> main

  • Choose which of these changes you want to adopt for the merge into the main branch. We are better off solving merge conflicts together rather than alone, so feel free to announce it in the Slack group chat.

  • Mark the conflict as resolved and re-merge the PR; there shouldn't be any problem with it.

Feel free to add more about that matter here.

SBOM Generator

Automatic SBOM generation

pipenv install
pipenv shell

pip install pipreqs
pip install cyclonedx-bom
pip install pip-licenses

# Create the SBOM (cyclonedx-bom) based on (pipreqs) requirements that are actually imported in the .py files

$sbom = pipreqs --print | cyclonedx-py -r -pb -o - -i -

# Create an XmlDocument object
$xml = New-Object System.Xml.XmlDocument

# Load XML content into the XmlDocument
$xml.LoadXml($sbom)


# Create an empty CSV file
$csvPath = "SBOM.csv"

# Initialize an empty array to store rows
$result = @()

# Iterate through the XML nodes and create rows for each node
$xml.SelectNodes("//*[local-name()='component']") | ForEach-Object {

    $row = @{
        "Version" = $_.Version
        "Context" = $_.Purl
        "Name" = if ($_.Name -eq 'scikit_learn') { 'scikit-learn' } else { $_.Name }
    }

    # Get license information
    $match = pip-licenses --from=mixed --format=csv --with-system --packages $row.Name | ConvertFrom-Csv

    # Add license information to the row
    $result += [PSCustomObject]@{
        "Context" = $row.Context
        "Name" = $row.Name
        "Version" = $row.Version
        "License" = $match.License
    }
}

# Export the data to the CSV file
$result | Export-Csv -Path $csvPath -NoTypeInformation

# Create the license file
$licensePath = $csvPath + '.license'
@"
SPDX-License-Identifier: CC-BY-4.0
SPDX-FileCopyrightText: 2023 Fabian-Paul Utech <f.utech@gmx.net>
"@ | Out-File -FilePath $licensePath

exit

Miscellaneous

Miscellaneous Content

This file contains content that was moved over from our Wiki, which we gave up in favor of having the documentation available more centrally. The contents of this file might to some extent overlap with the contents found in other documentation files.

Knowledge Base

AWS

  1. New password has to be >= 16 char and contain special chars

  2. After changing the password you have to re-login

  3. Add MFA (IAM -> Users -> Your Name -> Access Info)

  4. MFA device = FirstName.LastName like the credential

  5. Re-login

  6. Get access keys:

    • IAM -> Users -> Your Name -> Access Info -> Scroll to Access Keys

    • Create new access key (for local development)

    • Accept the warning

    • Copy the secret key to your .env file

    • Don’t add description tags to your key

PR Management:

  1. Create PR

  2. Link issue

  3. Other SD reviews the PR

    • Modification needed?

      • Fix/Discuss issue in the GitHub comments

      • Make new commit

      • Return to step 3

    • No modification needed

      • Reviewer approves PR

  4. PR creator merges PR

  5. Delete the used branch

Branch-Management:

  • Remove branches after merging

  • Add reviews / pull requests so others check the code

  • Feature branches with dev instead of main as base

Pre-commit:

# If not installed yet
pip install pre-commit

# install the hooks; they are then executed automatically before every commit
pre-commit install

# execute pre-commit manually on all files
pre-commit run --all-files

Features

  • Existing Website (Pingable, SEO-Score, DNS Lookup)

  • Existing Google Business Entry (using the Google Places API)

    • Opening Times

    • Number, Quality of Ratings

    • Overall “completeness” of the entry/# of available datapoints

    • Price category

    • Phone Number (compare with lead form input)

    • Website (compare with lead form input)

    • Number of visitors (estimate revenue from that?)

    • Product recognition from images

    • Merchant Category (e.g. cafe, restaurant, retailer, etc.)

  • Performance Indicators (NorthData, some other API)

    • Revenue (as I understood, this should be > 5000$/month)

    • Number of Employees

    • Bundesanzeiger / Handelsregister (Deutschland API)

  • Popularity: Insta / facebook followers or website ranking on google

  • Business type: google or website extraction (maybe with ChatGPT)

  • Size of business: To categorize leads to decide whether they need to deal with a salesperson or self-direct their solution

  • Business profile

  • Sentiment Analysis: https://arxiv.org/pdf/2307.10234.pdf

Storage

  • Unique ID for Lead (Felix)?

  • How to handle frequent data layout changes at S3 (Simon)?

  • 3 stage file systems (Felix) vs. DB (Ruchita)?

  • 3 stage file system (Felix):

    • BDC trigger on single new lead entries or batches

    • After BDC enriched the data => store in a parquet file in the events folder with some tag

    • BDC triggers the creation of the feature vectors

    • Transform the data from the parquet file after it was stored in the events folder and store it in the features folder with the same tag

    • Use the data as an input for the model, which is triggered after the creation of the input, and store the results in the model folder

  • Maybe the 3 stage file system as a part of the DB and hide the final decision behind the database abstraction layer (Simon)?

Control flow (Berkay)

  • Listener

  • MessageQueue

  • RoutingQueue

The Listener, as the name suggests, listens for incoming messages from other components, such as the BDC and EVP, and enqueues these messages in the messageQueue to be "read" and processed. If there are no incoming messages, it is idle. The messageQueue is where the received messages are processed. After each message is processed by the messageQueue, it is enqueued in the routingQueue to be routed to the corresponding component. Both messageQueue and routingQueue are idle if there are no elements in the queues. The whole concept of the Controller is multi-threaded and asynchronous: while it accepts new incoming messages, it processes messages and at the same time routes other messages.

AI

expected value = life-time value of lead x probability of the lead becoming a customer

AI models needed that solve a regression or probability problem

AI Models
  • Classification:

    • Decision Trees

    • Random Forest

    • Neural Networks

    • Naïve Bayes

What data do we need?
  • Classification: Labeled data

  • Probability: Data with leads and customers

ML Pipeline
  1. Preprocessing

  2. Feature selection

  3. Dataset split / cross validation

  4. Dimensional reduction

  5. Training

  6. Testing / Evaluation

  7. Improve performance

    • Batch Normalization

    • Optimizer

    • L1 / L2 regularization: reduces overfitting by regularizing the model

    • Dropout (NN)

    • Depth and width (NN)

    • Initialization techniques (NN: Xavier and He)

      • He: Layers with ReLu activation

      • Xavier: Layers with sigmoid activation

Troubleshooting

Build

pipenv

pipenv install --dev gets stuck

Solution: Remove the .lock file and restart the PC

Docker
VSCode

Terminal can't run a Docker image (on Windows)

  • Solution: workaround with Git Bash or with Ubuntu

Testing
Reuse

To exclude a certain part of the code from the REUSE analysis, wrap it as follows:

# REUSE-IgnoreStart
  ...
# REUSE-IgnoreEnd
Failed checks
  1. Go to the specific pull request or to the Actions tab

  2. Click “show all checks”

  3. Click “details”

  4. Click on the elements with the “red marks”

BDC

Google Places API

Language is adjusted to the location from which the API is run

Google search results are based on the location from which the API is run

  • Solution: Pass a fixed point in the center of the country / city / area of the company (OSMNX) as a location bias; see the Google Places API documentation

Branch-Management

Divergent branch

Commits on local and remote are not the same

  • Solution:

    1. Pull remote changes

    2. Rebase the changes

    3. Resolve any conflicts in the commits you get from remote