# Gaussian Smoothing

## GaussianSmoothing

Bases: `DataManipulationBaseInterface`
Applies Gaussian smoothing to a PySpark DataFrame. This method smooths the values in a specified column using a Gaussian filter, which helps reduce noise and fluctuations in time-series or spatial data.
The smoothing can be performed in two modes:

- **Temporal mode**: Applies smoothing along the time axis within each unique ID.
- **Spatial mode**: Applies smoothing across different IDs for the same timestamp.
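Independent of the RTDIP implementation, the two modes can be sketched in plain NumPy. The sketch below uses a hand-rolled Gaussian kernel and made-up sensor readings (the helper `gaussian_smooth` and the data are illustrative, not part of the SDK):

```python
import numpy as np

def gaussian_smooth(x: np.ndarray, sigma: float) -> np.ndarray:
    """Smooth a 1-D series with a normalized Gaussian kernel."""
    radius = max(1, int(3 * sigma))  # truncate the kernel at ~3 sigma
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    # edge-pad so the output has the same length as the input
    padded = np.pad(x, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# Hypothetical readings: two sensors, four shared timestamps.
values = {
    "sensor_a": np.array([1.0, 5.0, 1.0, 5.0]),
    "sensor_b": np.array([2.0, 2.0, 8.0, 2.0]),
}

# Temporal mode: smooth each sensor's series along its own time axis.
temporal = {k: gaussian_smooth(v, sigma=1.0) for k, v in values.items()}

# Spatial mode: at each timestamp, smooth across sensors instead.
stacked = np.stack(list(values.values()))  # shape (n_sensors, n_times)
spatial = np.apply_along_axis(gaussian_smooth, 0, stacked, 1.0)
```

In both modes the output keeps the input's shape; only the axis along which neighboring values are averaged changes.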
Example:

```python
from pyspark.sql import SparkSession
from some_module import GaussianSmoothing

spark = SparkSession.builder.getOrCreate()
df = ...  # Load your PySpark DataFrame

smoothed_df = GaussianSmoothing(
    df=df,
    sigma=2.0,
    mode="temporal",
    id_col="sensor_id",
    timestamp_col="timestamp",
    value_col="measurement",
).filter()

smoothed_df.show()
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The input PySpark DataFrame. | required |
| `sigma` | `float` | The standard deviation for the Gaussian kernel, controlling the amount of smoothing. | required |
| `mode` | `str` | The smoothing mode, either `'temporal'` or `'spatial'`. | `'temporal'` |
| `id_col` | `str` | The name of the column representing unique entity IDs. | `'id'` |
| `timestamp_col` | `str` | The name of the column representing timestamps. | `'timestamp'` |
| `value_col` | `str` | The name of the column containing the values to be smoothed. | `'value'` |
Raises:

| Type | Description |
|---|---|
| `TypeError` | If `df` is not a PySpark DataFrame. |
| `ValueError` | If `sigma` is not a positive number. |
| `ValueError` | If `mode` is not `'temporal'` or `'spatial'`. |
| `ValueError` | If `id_col`, `timestamp_col`, or `value_col` is not a valid column name. |
Source code in `src/sdk/python/rtdip_sdk/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.py`