Smoothing
Last modified: 05 December 2023
The "smooth" statistic is calculated on a sample of two continuous variables (i.e., a sample of points or lines). It fits a smoothing model to the data points and samples it to create a smoother curve.
This notebook uses definitions from DataFrame.
Usage
The "smooth" statistic is useful when a plot suffers from over-plotting or noise: it makes inherent trends and patterns easier to identify. It can also be used to draw a smoother-looking line through a small number of points.
Arguments
Input (mandatory):
x - numeric sample of the input points' x coordinates
y - numeric sample of the input points' y coordinates
Parameters (optional):
method: SmoothMethod - smoothing model:
    SmoothMethod.Linear(confidenceLevel: Double) - linear model
    SmoothMethod.Polynomial(degree: Int, confidenceLevel: Double) - polynomial model
    SmoothMethod.LOESS(span: Double, loessCriticalSize: Int, samplingSeed: Long, confidenceLevel: Double) - Local Polynomial Regression model
smootherPointCount: Int - number of sampled points
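To make the roles of method and smootherPointCount more concrete, here is a minimal sketch in plain Kotlin of what a linear smoothing model could do: fit a straight line by ordinary least squares, then sample it at evenly spaced x values. This is an illustration under stated assumptions, not Kandy's implementation; the function name linearSmooth and its pointCount parameter are hypothetical.

```kotlin
// Illustrative sketch only: not Kandy's implementation.
// Fits y = intercept + slope * x by ordinary least squares, then samples the
// fitted line at `pointCount` evenly spaced x values between min(x) and max(x)
// (the role played by `smootherPointCount`).
fun linearSmooth(xs: List<Double>, ys: List<Double>, pointCount: Int): List<Pair<Double, Double>> {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two (x, y) pairs" }
    require(pointCount >= 2) { "need at least two sampled points" }
    val meanX = xs.average()
    val meanY = ys.average()
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX
    val minX = xs.minOf { it }
    val maxX = xs.maxOf { it }
    val step = (maxX - minX) / (pointCount - 1)
    // Each output pair is one point of the smoothed line
    return List(pointCount) { i ->
        val x = minX + i * step
        x to (intercept + slope * x)
    }
}

fun main() {
    val xs = listOf(0.0, 1.0, 2.0, 3.0, 4.0)
    val ys = listOf(1.1, 2.9, 5.2, 6.8, 9.0)
    linearSmooth(xs, ys, pointCount = 5).forEach { (x, y) -> println("x=$x y=$y") }
}
```

A larger pointCount only makes the sampled line denser; it does not change the fitted model.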
Generalized signature
The specific signature depends on the function, but all functions related to the "smooth" statistic (the variations of statSmooth() and smoothLine() discussed below) have approximately the same signature with the arguments above:
statSmoothArgs :=
x,
y,
method: SmoothMethod = SmoothMethod.LOESS(),
smootherPointCount: Int = 100
The possible types of x and y depend on where a certain function is used. They can be simply an Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.
Output statistics
name | type | description |
---|---|---|
Stat.x | Double | x coordinate of the line point |
Stat.y | Double | y coordinate of the line point |
Stat.yMin | Double | Lower bound of the point-wise confidence interval around the mean |
Stat.yMax | Double | Upper bound of the point-wise confidence interval around the mean |
Stat.se | Double | Standard error |
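As a rough illustration of how the yMin, yMax, and se columns relate to the fitted value, the sketch below computes a point-wise band for a simple linear model. It is not Kandy's implementation: the SmoothPoint class and linearBand function are hypothetical, and the band uses a normal quantile (z = 1.96, roughly a 95% level) as an assumption where an exact band would use a Student's t quantile.

```kotlin
import kotlin.math.sqrt

// Hypothetical container mirroring the Stat.* columns; not a Kandy type.
data class SmoothPoint(val x: Double, val y: Double, val yMin: Double, val yMax: Double, val se: Double)

// Illustrative sketch only: simple linear regression with a normal-approximation band.
fun linearBand(xs: List<Double>, ys: List<Double>, at: List<Double>, z: Double = 1.96): List<SmoothPoint> {
    val n = xs.size
    require(n == ys.size && n >= 3) { "need at least three (x, y) pairs" }
    val meanX = xs.average()
    val meanY = ys.average()
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val slope = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) } / sxx
    val intercept = meanY - slope * meanX
    // Residual standard deviation with n - 2 degrees of freedom
    val residuals = xs.indices.map { ys[it] - (intercept + slope * xs[it]) }
    val sigma = sqrt(residuals.sumOf { it * it } / (n - 2))
    return at.map { x0 ->
        val fit = intercept + slope * x0
        // Standard error of the fitted mean at x0: smallest at mean(x), growing away from it
        val se = sigma * sqrt(1.0 / n + (x0 - meanX) * (x0 - meanX) / sxx)
        SmoothPoint(x0, fit, fit - z * se, fit + z * se, se)
    }
}

fun main() {
    val xs = listOf(0.0, 1.0, 2.0, 3.0)
    val ys = listOf(0.0, 1.2, 1.8, 3.0)
    linearBand(xs, ys, at = listOf(0.0, 1.5, 3.0)).forEach(::println)
}
```

In this sketch the band is narrowest near the mean of x and widens toward the ends of the range, which is the usual shape of a ribbon drawn from yMin and yMax.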
StatSmooth plots
// To generate the data, we use the Apache Commons Math library
// https://commons.apache.org/proper/commons-math/
// (the import path below assumes Commons Math 3)
import org.apache.commons.math3.distribution.NormalDistribution
import kotlin.random.Random

// Generate a curve from a formula
val xs = (-100..100).map { it / 50.0 }
val lineFormula = { x: Double -> 2.0 / (x * x + 0.5) }
// Generate noise from a normal distribution
val noises = NormalDistribution(0.0, 0.1).sample(xs.size).toList()
val ys = xs.zip(noises).map { lineFormula(it.first) + it.second }
// Keep a random third of the points (drop the other 2/3)
val random = Random(42)
val (newXs, newYs) = xs.zip(ys).shuffled(random).take(xs.size * 1 / 3).sortedBy { it.first }.unzip()
// Gather them into a DataFrame
val df = dataFrameOf(
    "speed" to newXs,
    "efficiency" to newYs
)
df.head(5)
speed | efficiency |
---|---|
-2.00 | 0.500380 |
-1.92 | 0.459302 |
-1.84 | 0.636746 |
-1.78 | 0.623408 |
-1.68 | 0.839757 |
df has the following signature:
speed | efficiency |
---|---|
Let's take a look at the StatSmooth output DataFrame:
df.statSmooth("speed", "efficiency").head(5)
It has the following signature:
Stat | ||||
---|---|---|---|---|
x | y | yMin | yMax | se |
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statSmooth, each row corresponds to one of the line points. Stat.x is the point's x coordinate and Stat.y is its y coordinate; Stat.yMin is the lower bound of the confidence interval, Stat.yMax is the upper bound, and Stat.se is the standard error.
A DataFrame with "smooth" statistics is called a StatSmoothFrame.
statSmooth transform
statSmooth(statSmoothArgs) { /*new plotting context*/ } modifies the plotting context: inside the new context, a new StatSmooth dataset, calculated from the given arguments, is used instead of the original data (whether or not it was empty). Inputs can be provided as an Iterable or as a dataset column reference: by name as a String, as a ColumnReference, or as a DataColumn. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statSmooth context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be called inside the new context:
plot {
    statSmooth(newXs, newYs) {
        // new `StatSmooth` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.y)
        }
    }
    points {
        x(newXs)
        y(newYs)
    }
}
df.plot {
    statSmooth(speed, efficiency, method = SmoothMethod.Polynomial(2), smootherPointCount = 250) {
        ribbon {
            x(Stat.x)
            yMin(Stat.yMin)
            yMax(Stat.yMax)
        }
    }
    // Dataset is not changed here
    points {
        x(speed)
        y(efficiency)
    }
}
smoothLine layer
The smoothLine layer is a shortcut for quickly plotting a smoothed line:
val smoothLineLayerPlot = plot {
    smoothLine(newXs, newYs)
    layout.title = "`smoothLine()` layer"
}
smoothLineLayerPlot
// Compare it with `statSmooth` + usual `line`
val statSmoothAndLinePlot = plot {
    statSmooth(newXs, newYs) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
    layout.title = "`statSmooth()` + non-statistical `line` layer"
}
plotGrid(listOf(smoothLineLayerPlot, statSmoothAndLinePlot))
smoothLine uses statSmooth and line and performs the coordinate mappings under the hood. We can also customize the smoothLine layer: smoothLine() optionally opens a new context where we can configure the line (as in the usual context opened by line { ... }) and even change the coordinate mappings from the default ones. The StatSmooth dataset of smoothLine can also be accessed here.
df.plot {
    smoothLine(speed, efficiency, SmoothMethod.LOESS(span = 0.1), 120) {
        // change the column mapped to `y` to `Stat.yMax`
        y(Stat.yMax)
        color = Color.RED
        width = 4.0
    }
    points {
        x(speed)
        y(efficiency)
    }
}
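To give some intuition for the span argument passed to SmoothMethod.LOESS above, here is a minimal sketch of a LOESS-style smoother in plain Kotlin. It is not Kandy's implementation: the function name loessAt, the tricube weighting, and the local straight-line fit are assumptions chosen for illustration. The idea is that span is the fraction of points taking part in each local fit, so a small span follows the data closely and a large span smooths more.

```kotlin
import kotlin.math.abs
import kotlin.math.ceil
import kotlin.math.pow

// Illustrative sketch only: a basic LOESS-style smoother, not Kandy's implementation.
// For a query point x0, the nearest ceil(span * n) points get tricube weights and a
// weighted straight line is fitted; its value at x0 is the smoothed y.
fun loessAt(xs: List<Double>, ys: List<Double>, x0: Double, span: Double): Double {
    val n = xs.size
    require(n == ys.size && n >= 2) { "need at least two (x, y) pairs" }
    require(span > 0.0 && span <= 1.0) { "span must be in (0, 1]" }
    val k = ceil(span * n).toInt().coerceIn(2, n)
    // Indices of the k nearest neighbours of x0
    val nearest = xs.indices.sortedBy { abs(xs[it] - x0) }.take(k)
    val maxDist = abs(xs[nearest.last()] - x0)
    // Tricube weight (1 - (d / maxDist)^3)^3, falling to 0 at the window edge
    val raw = nearest.map { i ->
        if (maxDist == 0.0) 1.0 else (1.0 - (abs(xs[i] - x0) / maxDist).pow(3)).pow(3)
    }
    // If every neighbour sits exactly on the window edge, fall back to equal weights
    val w = if (raw.sum() > 0.0) raw else List(k) { 1.0 }
    val sw = w.sum()
    val mx = nearest.indices.sumOf { w[it] * xs[nearest[it]] } / sw
    val my = nearest.indices.sumOf { w[it] * ys[nearest[it]] } / sw
    val sxx = nearest.indices.sumOf { w[it] * (xs[nearest[it]] - mx).pow(2) }
    val sxy = nearest.indices.sumOf { w[it] * (xs[nearest[it]] - mx) * (ys[nearest[it]] - my) }
    // Use the weighted mean when the weighted x values coincide
    val slope = if (sxx == 0.0) 0.0 else sxy / sxx
    return my + slope * (x0 - mx)
}

fun main() {
    val xs = (0..9).map { it.toDouble() }
    val ys = xs.map { 2.0 * it + 1.0 }
    println(loessAt(xs, ys, x0 = 4.5, span = 0.5))
}
```

Evaluating loessAt on an evenly spaced grid of x0 values would give the points of a smoothed line.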
Grouped statSmooth
statSmooth can be applied to grouped data: statistics will be calculated on each group independently but with equal categories. This returns a new GroupBy dataset with the same keys as the old one but with StatSmooth groups instead of the old ones.
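The idea of calculating statistics on each group independently can be sketched in plain Kotlin as follows. This is not how Kandy or DataFrame implement grouping: the Obs row type and the fitPerGroup function are hypothetical, and a straight-line least-squares fit stands in for whichever smoothing model is selected.

```kotlin
// Hypothetical row type for the illustration; not a Kandy or DataFrame type.
data class Obs(val key: String, val x: Double, val y: Double)

// Illustrative sketch only: fits an ordinary least-squares line per group and
// returns (slope, intercept) for each key. Each fit sees only its own group's rows,
// and the result keeps the same keys as the input grouping.
fun fitPerGroup(rows: List<Obs>): Map<String, Pair<Double, Double>> =
    rows.groupBy { it.key }.mapValues { (_, group) ->
        val meanX = group.map { it.x }.average()
        val meanY = group.map { it.y }.average()
        val sxx = group.sumOf { (it.x - meanX) * (it.x - meanX) }
        require(sxx > 0.0) { "each group needs at least two distinct x values" }
        val slope = group.sumOf { (it.x - meanX) * (it.y - meanY) } / sxx
        slope to (meanY - slope * meanX)
    }

fun main() {
    val rows = listOf(
        Obs("A", 0.0, 1.0), Obs("A", 1.0, 3.0),
        Obs("B", 0.0, 0.0), Obs("B", 2.0, -2.0)
    )
    fitPerGroup(rows).forEach { (key, fit) -> println("$key: slope=${fit.first}, intercept=${fit.second}") }
}
```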
// Generate two lines
val fA = { x: Double -> 0.02 * x * x * x - 0.2 * x * x + 0.1 * x + 2.1 }
val fB = { x: Double -> -0.1 * x * x * x + 0.5 * x * x - 0.8 }
val xRange = (-500..500).map { it / 100.0 }
val noisesA = NormalDistribution(0.0, 0.05).sample(xRange.size).toList()
val noisesB = NormalDistribution(0.0, 0.2).sample(xRange.size).toList()
val valuesA = xRange.zip(noisesA).map { fA(it.first) + it.second }
val valuesB = xRange.zip(noisesB).map { fB(it.first) + it.second }
val (xsA, ysA) = xRange.zip(valuesA).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }
    .unzip()
val (xsB, ysB) = xRange.zip(valuesB).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }
    .unzip()
// Gather them into `DataFrame` with "A" and "B" keys in "category" column
val valuesDF = dataFrameOf(
    "time" to xsA + xsB,
    "value" to ysA + ysB,
    "category" to List(xsA.size) { "A" } + List(xsB.size) { "B" }
)
valuesDF.head(5)
time | value | category |
---|---|---|
-4.96 | -5.735 | A |
-4.89 | -5.57 | A |
-4.87 | -5.384 | A |
-4.84 | -5.261 | A |
-4.83 | -5.333 | A |
It has the following signature:
time | value | category |
---|---|---|
// Group it by "category"
val groupedDF = valuesDF.groupBy { category }
groupedDF
Now we have a GroupBy with the following signature:
key: [category] | group: DataFrame[time|value|category] |
---|---|
A | A-Group |
B | B-Group |
groupedDF.statSmooth { x(time); y(value) }
After applying statSmooth, it is still a GroupBy, but the group signature is different: every group has the same signature as a usual DataFrame after applying statSmooth (i.e., StatSmoothFrame):
key: [category] | group: StatSmoothFrame |
---|---|
A | A-Group |
B | B-Group |
As you can see, the statSmooth transformation was applied within each group, and the grouping keys did not change.
The plotting process doesn't change much; we do everything the same way.
groupedDF.plot {
    statSmooth(time, value) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
}
As you can see, there are two lines because we have two groups of data. To distinguish them, we need to map the key to color. Conveniently, the key is available in the context:
groupedDF.plot {
    statSmooth(time, value, method = SmoothMethod.Polynomial(3)) {
        line {
            x(Stat.x)
            y(Stat.y)
            color(category)
        }
    }
}
The smoothLine() layer also works. Moreover, if we have exactly one grouping key, a mapping from it to color will be created by default.
groupedDF.plot {
    smoothLine(time, value)
}
We can customize it as before. The only difference is that the key columns are also accessible:
groupedDF.plot {
    smoothLine(time, value) {
        color = Color.GREEN
        type(category)
    }
}
Inside groupBy{} plot context
We can apply the groupBy modification to the initial dataset and build a smoothed line with grouped data the same way:
valuesDF.plot {
    groupBy(category) {
        smoothLine(time, value)
    }
}