Smoothing

Usage

The "Smooth" statistic is useful when a plot suffers from over-plotting or noise: it makes underlying trends and patterns easier to identify. It can also be used to draw a smoother-looking line through a small number of points.

Arguments

Generalized signature

The specific signature depends on the function, but all functions related to the "smooth" statistic (discussed further below: the variations of statSmooth() and smoothLine()) have approximately the same signature, with the arguments below:

statSmoothArgs :=
   x,
   y,
   method: SmoothMethod = SmoothMethod.LOESS(),
   smootherPointCount: Int = 100

The possible types of x and y depend on where a certain function is used. They can be simply Iterable (List, Set, etc.) or a reference to a column in a DataFrame (String, ColumnAccessor) or the DataColumn itself.
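
To make the role of smootherPointCount concrete, here is a minimal sketch in plain Kotlin. It is not the library's implementation: the helper name smoothLinear is hypothetical, and a straight-line least-squares fit stands in for the real smoothing methods (LOESS, polynomial). The idea it illustrates is that the fitted curve is evaluated on a fixed number of evenly spaced x values, independent of how many input points there are.

```kotlin
// Hypothetical helper, not part of the library: fits y = a + b*x by ordinary
// least squares and evaluates the fit on `pointCount` evenly spaced x values.
fun smoothLinear(
    xs: List<Double>,
    ys: List<Double>,
    pointCount: Int = 100
): List<Pair<Double, Double>> {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two (x, y) pairs" }
    require(pointCount >= 2) { "need at least two output points" }

    val meanX = xs.average()
    val meanY = ys.average()
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }

    val slope = sxy / sxx
    val intercept = meanY - slope * meanX

    // Evenly spaced grid from the smallest to the largest input x.
    val minX = xs.minOf { it }
    val maxX = xs.maxOf { it }
    val step = (maxX - minX) / (pointCount - 1)
    return List(pointCount) { i ->
        val x = minX + i * step
        x to (intercept + slope * x)
    }
}

fun main() {
    val xs = listOf(0.0, 1.0, 2.0, 3.0)
    val ys = listOf(1.1, 2.9, 5.2, 6.8)
    // Four noisy input points become five points on the fitted line.
    smoothLinear(xs, ys, pointCount = 5).forEach { (x, y) -> println("x=$x y=$y") }
}
```

A larger point count only makes the drawn curve denser; it does not change the fit itself.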

Output statistics

name	type	description
Stat.x	Double	`x` coordinate
Stat.y	Double	`y` coordinate
Stat.yMin	Double	Lower point-wise confidence interval around the mean
Stat.yMax	Double	Upper point-wise confidence interval around the mean
Stat.se	Double	Standard error
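
As a rough illustration of how the columns in the table above relate to each other, the sketch below builds a confidence band from a fitted value and its standard error. This rests on an assumption: it uses a normal approximation (y ± 1.96 · se for roughly 95% coverage), while the library may use a different critical value, for example a Student's t quantile. The SmoothRow type and the withBand function are hypothetical names used only here.

```kotlin
// Hypothetical row type mirroring the Stat.* columns; not a library class.
data class SmoothRow(
    val x: Double,
    val y: Double,
    val yMin: Double,
    val yMax: Double,
    val se: Double
)

// Assumption: normal-approximation band, y ± z * se, with z = 1.96 (about 95%).
// The actual critical value used by the library may differ.
fun withBand(x: Double, y: Double, se: Double, z: Double = 1.96): SmoothRow =
    SmoothRow(x = x, y = y, yMin = y - z * se, yMax = y + z * se, se = se)

fun main() {
    val row = withBand(x = 0.5, y = 2.0, se = 0.1)
    println(row)
}
```

Under this assumption the band is symmetric around Stat.y and widens wherever Stat.se is larger.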

StatSmooth plots

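// To generate the data, we use the Apache Commons Math library
// https://commons.apache.org/proper/commons-math/
// (imports assume Commons Math 3.x)
import org.apache.commons.math3.distribution.NormalDistribution
import kotlin.random.Random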

// Generate line with formula
val xs = (-100..100).map { it / 50.0 }
val lineFormula = { x: Double -> 2.0 / (x * x + 0.5) }
// Generate noises from normal distribution
val noises = NormalDistribution(0.0, 0.1).sample(xs.size).toList()
val ys = xs.zip(noises).map { lineFormula(it.first) + it.second }
// Keep a random third of the points (drop the other 2/3)
val random = Random(42)
val (newXs, newYs) = xs.zip(ys).shuffled(random).take(xs.size * 1 / 3).sortedBy { it.first }.unzip()

// Gather them into the DataFrame
val df = dataFrameOf(
    "speed" to newXs,
    "efficiency" to newYs
)
df.head(5)

speed	efficiency
-2.00	0.500380
-1.92	0.459302
-1.84	0.636746
-1.78	0.623408
-1.68	0.839757

df has two columns: speed and efficiency.

Let's take a look at StatSmooth output DataFrame:

df.statSmooth("speed", "efficiency").head(5)

Stat
x	y	yMin	yMax	se

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statSmooth, each row corresponds to one point of the smoothed line. Stat.x is the x coordinate of this point and Stat.y is its y coordinate. Stat.yMin and Stat.yMax are the lower and upper bounds of the confidence interval, and Stat.se is the standard error.

A DataFrame with "smooth" statistics is called a StatSmoothFrame.

`statSmooth` transform

statSmooth(statSmoothArgs) { /*new plotting context*/ } modifies the plotting context: inside the new context, a StatSmooth dataset computed from the given arguments replaces the original data (whether or not it was empty). Inputs can be provided as an Iterable or as a reference to a dataset column: by name as a String, as a ColumnReference, or as a DataColumn. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statSmooth context. Since the old dataset is not available inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be used inside the new context:

plot {
    statSmooth(newXs, newYs) {
        // new `StatSmooth` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.y)
        }
    }
    points {
        x(newXs)
        y(newYs)
    }
}

df.plot {
    statSmooth(speed, efficiency, method = SmoothMethod.Polynomial(2), smootherPointCount = 250) {
        ribbon {
            x(Stat.x)
            yMin(Stat.yMin)
            yMax(Stat.yMax)
        }
    }
    // Dataset is not changed here
    points {
        x(speed)
        y(efficiency)
    }
}

`smoothLine` layer

The smoothLine layer is a shortcut for quickly plotting a smoothed line:

val smoothLineLayerPlot = plot {
    smoothLine(newXs, newYs)
    layout.title = "`smoothLine()` layer"
}
smoothLineLayerPlot

// Compare it with `statSmooth` + usual `line`
val statSmoothAndLinePlot = plot {
    statSmooth(newXs, newYs) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
    layout.title = "`statSmooth()` + non-statistical `line` layer"
}
plotGrid(listOf(smoothLineLayerPlot, statSmoothAndLinePlot))

smoothLine uses statSmooth and line and performs the coordinate mappings under the hood. We can also customize the smoothLine layer: smoothLine() optionally opens a new context where we can configure the line (as in the usual context opened by line { ... }) and even change the default coordinate mappings. The StatSmooth dataset of smoothLine can also be accessed there.

df.plot {
    smoothLine(speed, efficiency, SmoothMethod.LOESS(span = 0.1), 120) {
        // change the column mapped to `y` to `Stat.yMax`
        y(Stat.yMax)
        color = Color.RED
        width = 4.0
    }
    points {
        x(speed)
        y(efficiency)
    }
}

Grouped `statSmooth`

statSmooth can be applied to grouped data: statistics are calculated for each group independently. This returns a new GroupBy dataset with the same keys as the old one but with StatSmooth groups instead of the old ones.
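
The per-group behaviour described above can be sketched in plain Kotlin, without the library. In this illustration the Observation type and both functions are hypothetical, and a straight-line least-squares fit again stands in for the real smoothing method. What it shows is the shape of the operation: rows are split by key, each group is fitted using only its own rows, and the result keeps the original keys.

```kotlin
// Hypothetical row type standing in for one row of a grouped dataset.
data class Observation(val time: Double, val value: Double, val category: String)

// Ordinary least-squares line for one group; returns (intercept, slope).
fun fitLine(points: List<Observation>): Pair<Double, Double> {
    val meanX = points.map { it.time }.average()
    val meanY = points.map { it.value }.average()
    val sxx = points.sumOf { (it.time - meanX) * (it.time - meanX) }
    require(sxx > 0.0) { "time values within a group must not all be equal" }
    val sxy = points.sumOf { (it.time - meanX) * (it.value - meanY) }
    val slope = sxy / sxx
    return (meanY - slope * meanX) to slope
}

// Each group is fitted from its own rows only; the grouping keys are preserved.
fun fitPerGroup(data: List<Observation>): Map<String, Pair<Double, Double>> =
    data.groupBy { it.category }.mapValues { (_, rows) -> fitLine(rows) }

fun main() {
    val data = listOf(
        Observation(0.0, 1.0, "A"), Observation(1.0, 3.0, "A"), Observation(2.0, 5.0, "A"),
        Observation(0.0, 4.0, "B"), Observation(1.0, 3.0, "B"), Observation(2.0, 2.0, "B")
    )
    fitPerGroup(data).forEach { (key, fit) ->
        println("$key: intercept=${fit.first}, slope=${fit.second}")
    }
}
```

Because the groups never share rows, noise or outliers in one category have no effect on the curve fitted for another.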

// Generate two lines
val fA = { x: Double -> 0.02 * x * x * x - 0.2 * x * x + 0.1 * x + 2.1 }
val fB = { x: Double -> -0.1 * x * x * x + 0.5 * x * x - 0.8 }
val xRange = (-500..500).map { it / 100.0 }
val noisesA = NormalDistribution(0.0, 0.05).sample(xRange.size).toList()
val noisesB = NormalDistribution(0.0, 0.2).sample(xRange.size).toList()
val valuesA = xRange.zip(noisesA).map { fA(it.first) + it.second }
val valuesB = xRange.zip(noisesB).map { fB(it.first) + it.second }

val (xsA, ysA) = xRange.zip(valuesA).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }
    .unzip()
val (xsB, ysB) = xRange.zip(valuesB).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }
    .unzip()

// Gather them into `DataFrame` with "A" and "B" keys in "category" column
val valuesDF = dataFrameOf(
    "time" to xsA + xsB,
    "value" to ysA + ysB,
    "category" to List(xsA.size) { "A" } + List(xsB.size) { "B" }
)
valuesDF.head(5)

time	value	category
-4.96	-5.735	A
-4.89	-5.57	A
-4.87	-5.384	A
-4.84	-5.261	A
-4.83	-5.333	A

// Group it by "category"
val groupedDF = valuesDF.groupBy { category }
groupedDF

Now we have a GroupBy with the following signature:

key: [category]	group: DataFrame[time\|value\|category]
A	A-Group
B	B-Group

groupedDF.statSmooth { x(time); y(value) }

After applying statSmooth it is still a GroupBy, but the groups have a different signature: every group has the same signature as a regular DataFrame after applying statSmooth (i.e. StatSmoothFrame):

key: [category]	group: StatSmoothFrame
"A"	"A"-Group
"B"	"B"-Group

As you can see, the statSmooth transformation was indeed applied within groups, and the grouping keys did not change.

groupedDF.plot {
    statSmooth(time, value) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
}

groupedDF.plot {
    statSmooth(time, value, method = SmoothMethod.Polynomial(3)) {
        line {
            x(Stat.x)
            y(Stat.y)
            color(category)
        }
    }
}

The smoothLine() layer also works. Moreover, if we have exactly one grouping key, a mapping from it to color will be created by default.

groupedDF.plot {
    smoothLine(time, value)
}

We can customize it as before. The one difference is that the key columns are also accessible:

groupedDF.plot {
    smoothLine(time, value) {
        color = Color.GREEN
        type(category)
    }
}

Inside `groupBy{}` plot context

We can apply the groupBy modification to the initial dataset and build a smoothed line for grouped data in the same way:

valuesDF.plot {
    groupBy(category) {
        smoothLine(time, value)
    }
}


See also