Dimensionality Reduction
DimensionalityReduction
Bases: DataManipulationBaseInterface
Detects and combines columns based on correlation or exact duplicates.
Example

```python
from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.dimensionality_reduction import DimensionalityReduction

dimensionality_reduction = DimensionalityReduction(
    df,
    columns=["column1", "column2"],
    threshold=0.95,
    combination_method="mean",
)
result = dimensionality_reduction.filter()
```
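Columns are combined only when the absolute Pearson correlation between them reaches the threshold. A minimal pure-Python sketch of that comparison (the helper name `pearson` is illustrative; the real class computes the correlation on the Spark DataFrame):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two nearly linear columns: |r| is close to 1, so a 0.95 threshold is met
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5])
meets_threshold = abs(r) >= 0.95
```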
Parameters:

Name | Type | Description | Default
---|---|---|---
`df` | `DataFrame` | PySpark DataFrame to be analyzed and transformed. | required
`columns` | `list` | List of column names to check for correlation. Only two columns are supported. | required
`threshold` | `float` | Correlation threshold for combining columns, in the range [0, 1]. If the absolute value of the correlation is greater than or equal to the threshold, the columns are combined. Defaults to 0.9. | `0.9`
`combination_method` | `str` | Method used to combine correlated columns. Supported methods: `'mean'`: average the values of both columns and write the result to the first column (new value = (column1 + column2) / 2); `'sum'`: sum the values of both columns and write the result to the first column (new value = column1 + column2); `'first'`: keep the first column and drop the second; `'second'`: keep the second column and drop the first; `'delete'`: remove both columns from the DataFrame. Defaults to `'mean'`. | `'mean'`
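The five combination methods reduce to simple column arithmetic. A hedged sketch on plain Python lists, assuming the correlation check has already passed (the helper name `combine_columns` is hypothetical; the real class operates on PySpark DataFrame columns):

```python
def combine_columns(col1, col2, method="mean"):
    """Apply one of the documented combination methods to two equal-length columns."""
    if method == "mean":
        # Average written to the first column; second column is dropped
        return [(a + b) / 2 for a, b in zip(col1, col2)], None
    if method == "sum":
        # Sum written to the first column; second column is dropped
        return [a + b for a, b in zip(col1, col2)], None
    if method == "first":
        return list(col1), None
    if method == "second":
        return list(col2), None
    if method == "delete":
        # Both columns removed entirely
        return None, None
    raise ValueError(f"Unsupported combination_method: {method}")

combined, dropped = combine_columns([1, 3], [3, 5], method="mean")
```

In every method except `'delete'`, the result occupies a single column, which is what reduces the dimensionality.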
Source code in src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.py
system_type()
staticmethod
Attributes:

Name | Type | Description
---|---|---
SystemType | `Environment` | Requires PYSPARK
Source code in src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.py
filter()
Process DataFrame by detecting and combining correlated columns.
Returns:

Name | Type | Description
---|---|---
DataFrame | `DataFrame` | Transformed PySpark DataFrame.
Source code in src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.py