Smoothing
The "smooth" statistic is calculated on a sample of two continuous variables (i.e., a sample of points or lines). It interpolates the data points to create a smoother curve.
This notebook uses definitions from DataFrame.
Usage
The "smooth" statistic is useful when the data is over-plotted or noisy, because it makes inherent trends and patterns easier to identify. It can also be used to draw a smoother-looking line from a small number of points.
Arguments
Input (mandatory):
- x: numeric sample of the input points' x coordinates
- y: numeric sample of the input points' y coordinates
Parameters (optional):
- method: SmoothMethod, the smoothing model:
  - SmoothMethod.Linear(confidenceLevel: Double): linear model
  - SmoothMethod.Polynomial(degree: Int, confidenceLevel: Double): polynomial model
  - SmoothMethod.LOESS(span: Double, loessCriticalSize: Int, samplingSeed: Long, confidenceLevel: Double): Local Polynomial Regression model
- smootherPointCount: Int, the number of sampled points
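To make the roles of the smoothing model and smootherPointCount more concrete, here is a minimal plain-Kotlin sketch of what a linear smoother conceptually does: fit a straight line by ordinary least squares, then evaluate it at a fixed number of x values. The function linearSmooth and its parameter pointCount are hypothetical names used only for illustration. This is not the library's implementation: it assumes evenly spaced sample points between the minimum and maximum x, and it ignores the confidence interval.

```kotlin
// Hypothetical helper for illustration only (not part of the plotting library).
// Fits y = intercept + slope * x by ordinary least squares and evaluates the
// fitted line at `pointCount` evenly spaced x values (an assumption of this sketch).
fun linearSmooth(xs: List<Double>, ys: List<Double>, pointCount: Int): List<Pair<Double, Double>> {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two (x, y) pairs" }
    require(pointCount >= 2) { "need at least two output points" }

    val meanX = xs.average()
    val meanY = ys.average()

    // Sum of squared deviations of x, and sum of cross-deviations of x and y.
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }

    val slope = sxy / sxx
    val intercept = meanY - slope * meanX

    // Sample the fitted line from min(x) to max(x).
    val minX = xs.minOf { it }
    val maxX = xs.maxOf { it }
    val step = (maxX - minX) / (pointCount - 1)
    return List(pointCount) { i ->
        val x = minX + i * step
        x to (intercept + slope * x)
    }
}
```

Calling linearSmooth(xs, ys, pointCount = 80) would return 80 (x, y) pairs, which play the same role as the rows of the output statistics described below.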
Generalized signature
The specific signature depends on the function, but all functions related to the "smooth" statistic (the variations of statSmooth() and smoothLine() discussed below) have approximately the same signature, built from the arguments above (x, y, method, smootherPointCount).
The possible types of x and y depend on where the function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.
Output statistics
name | type | description |
---|---|---|
Stat.x | Double | x coordinate of the line point |
Stat.y | Double | y coordinate of the line point |
Stat.yMin | Double | Lower bound of the point-wise confidence interval around the mean |
Stat.yMax | Double | Upper bound of the point-wise confidence interval around the mean |
Stat.se | Double | Standard error |
StatSmooth plots
speed | efficiency |
---|---|
-2.00 | 0.500380 |
-1.92 | 0.459302 |
-1.84 | 0.636746 |
-1.78 | 0.623408 |
-1.68 | 0.839757 |
df has the signature:
speed | efficiency |
---|---|
Let's take a look at the StatSmooth output DataFrame.
It has the following signature:
Stat | ||||
---|---|---|---|---|
x | y | yMin | yMax | se |
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statSmooth, each row corresponds to one of the line points. Stat.x is the x coordinate of the point and Stat.y is its y coordinate. Stat.yMin and Stat.yMax are the lower and upper bounds of the confidence interval, and Stat.se is the standard error.
A DataFrame with "smooth" statistics is called a StatSmoothFrame.
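As a rough illustration of how the five Stat columns relate to each other, the following plain-Kotlin sketch computes one such row for the linear case. The names SmoothRow and linearSmoothRow are hypothetical and exist only for this example. The standard error uses the textbook formula for the mean response of a simple linear regression. The band uses a normal quantile (z = 1.96, roughly 95%) as a large-sample approximation, whereas an exact interval would use a Student t quantile with n - 2 degrees of freedom. The library's own computation may differ.

```kotlin
import kotlin.math.sqrt

// Hypothetical row type mirroring the Stat.x, Stat.y, Stat.yMin, Stat.yMax, Stat.se columns.
data class SmoothRow(val x: Double, val y: Double, val yMin: Double, val yMax: Double, val se: Double)

// Hypothetical helper for illustration only (not the library's implementation).
// Fits a least-squares line and returns the prediction at x0 together with its
// standard error and an approximate confidence band y0 -/+ z * se.
fun linearSmoothRow(xs: List<Double>, ys: List<Double>, x0: Double, z: Double = 1.96): SmoothRow {
    require(xs.size == ys.size && xs.size >= 3) { "need at least three (x, y) pairs" }
    val n = xs.size
    val meanX = xs.average()
    val meanY = ys.average()

    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX

    // Residual standard deviation with n - 2 degrees of freedom.
    val sse = xs.indices.sumOf { i ->
        val residual = ys[i] - (intercept + slope * xs[i])
        residual * residual
    }
    val residualStd = sqrt(sse / (n - 2))

    // Standard error of the fitted mean at x0: grows as x0 moves away from mean(x).
    val y0 = intercept + slope * x0
    val se = residualStd * sqrt(1.0 / n + (x0 - meanX) * (x0 - meanX) / sxx)

    return SmoothRow(x = x0, y = y0, yMin = y0 - z * se, yMax = y0 + z * se, se = se)
}
```

In this sketch the band is narrowest near the mean of x and widens toward the ends of the data range, which is the usual shape of a confidence ribbon drawn around a fitted line.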
statSmooth transform
statSmooth(statSmoothArgs) { /*new plotting context*/ } modifies the plotting context. Inside the new context, a new StatSmooth dataset, calculated from the given arguments, is used instead of the original data (whether or not it was empty). Inputs can be provided as an Iterable or as a dataset column reference: by name as a String, as a ColumnReference, or as a DataColumn. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statSmooth context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be used inside the new context.
smoothLine layer
The smoothLine layer is a shortcut for quickly plotting a smoothed line.
smoothLine uses statSmooth and line and performs the coordinate mappings under the hood. We can also customize the smoothLine layer: smoothLine() optionally opens a new context where we can configure the line (as in the usual context opened by line { ... }) and even change the coordinate mappings from the default ones. The StatSmooth dataset of smoothLine can also be accessed there.
Grouped statSmooth
statSmooth can be applied to grouped data: statistics are calculated on each group independently but with equal categories. This returns a new GroupBy dataset with the same keys as the old one but with StatSmooth groups instead of the old ones.
time | value | category |
---|---|---|
-4.96 | -5.735 | A |
-4.89 | -5.57 | A |
-4.87 | -5.384 | A |
-4.84 | -5.261 | A |
-4.83 | -5.333 | A |
It has the following signature:
time | value | category |
---|---|---|
Now we have a GroupBy with the signature:
key: [category] | group: DataFrame[time|value|category] |
---|---|
A | A-Group |
B | B-Group |
After applying statSmooth, it is still a GroupBy but with a different group signature: all groups have the same signature as a usual DataFrame after applying statSmooth (i.e., StatSmoothFrame):
key: [category] | group: StatSmoothFrame |
---|---|
A | A-Group |
B | B-Group |
As you can see, the statSmooth transformation was applied within each group, and the grouping keys did not change.
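The "same keys, new groups" behaviour can be sketched in plain Kotlin without any plotting library. The example below groups records by a key and smooths each group independently with a least-squares line. Observation, fitLine, and groupedSmooth are hypothetical names introduced only for this illustration. The linear model and the evenly spaced sample points are assumptions of the sketch, not a description of the library's internals.

```kotlin
// Hypothetical record type standing in for one row of the dataset (time | value | category).
data class Observation(val time: Double, val value: Double, val category: String)

// Hypothetical helper: least-squares line through (xs, ys),
// sampled at `pointCount` evenly spaced x values.
fun fitLine(xs: List<Double>, ys: List<Double>, pointCount: Int): List<Pair<Double, Double>> {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two (x, y) pairs" }
    require(pointCount >= 2) { "need at least two output points" }
    val meanX = xs.average()
    val meanY = ys.average()
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val slope = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) } / sxx
    val intercept = meanY - slope * meanX
    val minX = xs.minOf { it }
    val step = (xs.maxOf { it } - minX) / (pointCount - 1)
    return List(pointCount) { i ->
        val x = minX + i * step
        x to (intercept + slope * x)
    }
}

// Smooths every category on its own. The resulting map has exactly the same keys
// as the grouping, and each value holds the smoothed points of that group.
fun groupedSmooth(rows: List<Observation>, pointCount: Int): Map<String, List<Pair<Double, Double>>> =
    rows.groupBy { it.category }
        .mapValues { (_, group) ->
            fitLine(group.map { it.time }, group.map { it.value }, pointCount)
        }
```

Because each group is fitted separately, two categories with different trends produce two different lines, which is why the grouped plot shows one line per key.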
The plotting process doesn't change much: we do everything the same way.
As you can see, there are two lines because we have two groups of data. To distinguish them, we need to add a mapping from the key to the color. This is convenient because the key is available in the context.
The smoothLine() layer also works. Moreover, if we have exactly one grouping key, a mapping from it to color will be created by default.
We can customize it as before. The one difference is that the key columns are also accessible.
Inside groupBy{} plot context
We can apply the groupBy modification to the initial dataset and build a smoothed line for the grouped data in the same way.