kandy 0.8.0-RC1 Help

Density Plot

The "density" statistic is calculated on a sample of a single continuous variable. It approximates the probability density function (PDF) of this sample and evaluates that function at a set of points. The statistic is weighted, which means the computed density depends on observation weights.
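To make the idea concrete, here is a minimal plain-Kotlin sketch of a weighted kernel density estimate with a Gaussian kernel. It is not Kandy's implementation: the function names `gaussianKernel`, `kde` and `densityCurve` are hypothetical, and normalizing by the total weight is an assumption about how weights enter the estimate.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt

// Standard normal density, used here as the smoothing kernel.
fun gaussianKernel(u: Double): Double = exp(-0.5 * u * u) / sqrt(2 * PI)

// Weighted kernel density estimate at a single point `at`.
// With the default weights (all 1.0) this is the ordinary, unweighted estimate.
// Assumption: weights are normalized by their sum.
fun kde(
    sample: List<Double>,
    at: Double,
    bandwidth: Double,
    weights: List<Double> = List(sample.size) { 1.0 },
): Double {
    require(sample.size == weights.size) { "sample and weights must have the same size" }
    require(bandwidth > 0.0) { "bandwidth must be positive" }
    var acc = 0.0
    for (i in sample.indices) {
        acc += weights[i] * gaussianKernel((at - sample[i]) / bandwidth)
    }
    return acc / (weights.sum() * bandwidth)
}

// Evaluate the estimate on `n` evenly spaced points between `from` and `to`.
fun densityCurve(
    sample: List<Double>,
    from: Double,
    to: Double,
    n: Int,
    bandwidth: Double,
): List<Pair<Double, Double>> {
    require(n >= 2) { "need at least two points" }
    val step = (to - from) / (n - 1)
    return List(n) { i ->
        val x = from + i * step
        x to kde(sample, x, bandwidth)
    }
}
```

Each returned pair corresponds to one sampled point of the estimated PDF: an x coordinate and the density at that coordinate.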

This notebook uses definitions from DataFrame.

Usage

The "density" statistic is useful when you have a large dataset and want to understand the underlying probability distribution. A density plot visualizes the PDF and also lets you compare the distributions of different samples. It is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.

Arguments

  • Input (mandatory):

    • x — numeric sample on which the statistics are calculated

  • Weights (optional):

    • weights — set of weights of the same size as the input sample. null (the default) means all weights are equal to 1.0, so the weighted density equals the unweighted one.

  • Parameters (optional):

    • n: Int — number of sampled points.

    • trim: Boolean — if false, each density is computed on the full range of the data; if true, each density is computed over the range of that group (only for grouped inputs).

    • adjust: Double — multiplier applied to the bandwidth; changes how smooth the density curve is.

    • kernel: Kernel — the kernel used to calculate the density function:

      • Kernel.GAUSSIAN

      • Kernel.RECTANGULAR

      • Kernel.TRIANGULAR

      • Kernel.BIWEIGHT

      • Kernel.EPANECHNIKOV

      • Kernel.OPTCOSINE

    • fullScanMax: Int — maximum size of data for which the density is computed with a "full scan". For larger data, a less accurate but more efficient density computation is applied.

    • bandWidth: BandWidth — the method (or exact value) of bandwidth:

      • BandWidth.Method.NRD

      • BandWidth.Method.NRD0

      • BandWidth.value(value: Double)
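The two bandwidth methods share their names with R's `bw.nrd0` and `bw.nrd` rules of thumb. Assuming Kandy follows those definitions (this page does not confirm it), the bandwidth could be sketched in plain Kotlin as below. All function names are hypothetical, and the sketch omits the fallback R applies when the spread is zero.

```kotlin
import kotlin.math.floor
import kotlin.math.min
import kotlin.math.pow
import kotlin.math.sqrt

// Sample standard deviation (n - 1 in the denominator).
fun sampleStd(xs: List<Double>): Double {
    val mean = xs.average()
    return sqrt(xs.sumOf { (it - mean) * (it - mean) } / (xs.size - 1))
}

// Quantile with linear interpolation between order statistics.
fun quantile(sorted: List<Double>, p: Double): Double {
    val h = (sorted.size - 1) * p
    val lo = floor(h).toInt()
    val hi = min(lo + 1, sorted.size - 1)
    return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
}

// Common form of both rules: factor * min(sd, IQR / 1.34) * n^(-1/5).
fun ruleOfThumb(xs: List<Double>, factor: Double): Double {
    require(xs.size >= 2) { "need at least two observations" }
    val sorted = xs.sorted()
    val iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25)
    val spread = min(sampleStd(xs), iqr / 1.34)
    return factor * spread * xs.size.toDouble().pow(-0.2)
}

// Silverman's rule (factor 0.9) and Scott's variant (factor 1.06).
fun nrd0(xs: List<Double>): Double = ruleOfThumb(xs, 0.9)
fun nrd(xs: List<Double>): Double = ruleOfThumb(xs, 1.06)

// `adjust` multiplies whatever bandwidth was selected.
fun effectiveBandwidth(xs: List<Double>, adjust: Double = 1.0): Double = adjust * nrd0(xs)
```

Under these assumptions, `adjust = 0.5` halves the bandwidth and gives a less smooth curve, while `BandWidth.value(...)` would bypass the rule entirely and use the given number.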

Generalized signature

The specific signature depends on the function, but all functions related to the "density" statistic (the variations of statDensity() and densityPlot() discussed below) have approximately the same signature with the arguments above:

statDensityArgs :=
    x,
    weights = null,
    n: Int = 512,
    trim: Boolean = false,
    adjust: Double = 1.0,
    kernel: Kernel = Kernel.GAUSSIAN,
    fullScanMax: Int = 5000,
    bandWidth: BandWidth = BandWidth.Method.NRD0

The possible types of x and weights depend on where a certain function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.

Output statistics

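| name                 | type   | description                                 |
|----------------------|--------|---------------------------------------------|
| Stat.x               | Double | x coordinate                                |
| Stat.density         | Double | Density estimate                            |
| Stat.densityWeighted | Double | Weighted density                            |
| Stat.scaled          | Double | Density estimate, scaled to maximum of 1.0  |
| Stat.scaledWeighted  | Double | Weighted scaled density estimate            |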

StatDensity plots

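// To generate the data, we use a standard Java math library (Apache Commons Math)
// https://commons.apache.org/proper/commons-math/
// Assumption: Commons Math 3 is on the classpath, hence the `math3` package below
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution

// Generate sample from normal distribution
val depthList = NormalDistribution(500.0, 100.0).sample(1000).toList()
// Generate sample from uniform distribution
val coeffList = UniformRealDistribution(0.0, 1.0).sample(1000).toList()

// Gather them into the DataFrame
val df = dataFrameOf(
    "depth" to depthList,
    "coeff" to coeffList
)
df.head()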

| depth   | coeff |
|---------|-------|
| 495.7   | 0.818 |
| 666.918 | 0.863 |
| 466.139 | 1     |
| 488.06  | 0.489 |
| 338.757 | 0.917 |

df has the following signature:

depth | coeff

Let's take a look at StatDensity output DataFrame:

df.statDensity("depth", "coeff").head()

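| Stat.x  | Stat.density | Stat.densityWeighted | Stat.scaled | Stat.scaledWeighted |
|---------|--------------|----------------------|-------------|---------------------|
| 181.351 | 0            | 0                    | 0.011       | 0.015               |
| 182.64  | 0            | 0                    | 0.012       | 0.016               |
| 183.929 | 0            | 0                    | 0.012       | 0.016               |
| 185.218 | 0            | 0                    | 0.012       | 0.017               |
| 186.506 | 0            | 0                    | 0.013       | 0.017               |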

It has the following signature:

Stat: x | density | densityWeighted | scaled | scaledWeighted

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statDensity, each row corresponds to one PDF point. Stat.x is the column with the x coordinate of this point. Stat.density contains the estimated density, and Stat.densityWeighted is its weighted version. Stat.scaled is the density scaled to a maximum of 1.0, and Stat.scaledWeighted is the weighted version of scaled. A DataFrame with "density" statistics is called StatDensityFrame.
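The relation between the density and scaled columns can be sketched in a few lines of plain Kotlin. `DensityPoint` and `withScaled` are hypothetical names used only for illustration, and the sketch assumes "scaled to a maximum of 1.0" means dividing by the peak density.

```kotlin
// One row of a density table: a grid point, its density, and the scaled density.
data class DensityPoint(val x: Double, val density: Double, val scaled: Double)

// Divide every density by the peak so that the largest scaled value is 1.0.
fun withScaled(xs: List<Double>, density: List<Double>): List<DensityPoint> {
    require(xs.size == density.size) { "xs and density must have the same size" }
    val peak = density.maxOrNull() ?: return emptyList()
    return xs.indices.map { i ->
        val scaled = if (peak > 0.0) density[i] / peak else 0.0
        DensityPoint(xs[i], density[i], scaled)
    }
}
```

A normal distribution with standard deviation 100 peaks at roughly 0.004, so tail densities are tiny in absolute terms. That would be consistent with the output above, where the rounded density column shows 0 while the scaled column shows values around 0.01.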

statDensity transform

statDensity(statDensityArgs) { /*new plotting context*/ } modifies a plotting context: instead of the original data (whether it was empty or not), a new StatDensity dataset is used inside a new context. This dataset is calculated from the given arguments; inputs and weights can be provided as an Iterable or as a dataset column reference (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and primary context are not affected, so you can still add layers that use the initial dataset outside the statDensity context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be called inside the new context:

plot {
    statDensity(depthList, adjust = 0.2) {
        // New `StatDensity` dataset here
        line {
            // Use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.density)
            color(Stat.density)
        }
    }
}

StatDensity Plot

DensityPlot layer

A density plot is a statistical plot used for visualizing the distribution of continuous variables. It's an area graph of the kernel-estimated PDF. So basically, we can build a density plot with statDensity as follows:

val statDensityAndAreaPlot = df.plot {
    statDensity("depth") {
        area {
            x(Stat.x)
            y(Stat.density)
        }
    }
    layout.title = "`statDensity()` + `area()` layer"
}
statDensityAndAreaPlot

StatDensity Area Plot

But we can do it even faster with densityPlot(statDensityArgs) method:

val densityLayerPlot = plot {
    densityPlot(depthList)
    layout.title = "`densityPlot()` layer"
}
densityLayerPlot

Density Plot

Let's compare them:

plotGrid(listOf(statDensityAndAreaPlot, densityLayerPlot))

Compare StatDensity Area vs Density Plot

These two plots are identical. Indeed, densityPlot just uses statDensity and area and performs coordinate mappings under the hood. We can also customize the densityPlot layer: densityPlot() optionally opens a new context where we can configure the area (as in the usual context opened by area { ... }) and even change the default coordinate mappings. The StatDensity dataset of densityPlot can also be accessed here.

df.plot {
    densityPlot(depth) {
        // Change a column mapped on `y` to `Stat.scaled`
        y(Stat.scaled)
        alpha = 0.7
        fillColor = Color.RED
        borderLine.color = Color.BLACK
    }
}

Scaled y-axis in Density Plot

If we specify weights, Stat.densityWeighted is mapped to y by default:

df.plot {
    densityPlot(depth, coeff, n = 700, adjust = 0.8, bandWidth = BandWidth.value(17.0))
    // We can add other layers as well.
    // Let's add a vertical mark line with constant x intercept:
    vLine {
        // Compute sample mean
        val mean = depth.mean()
        xIntercept.constant(mean)
        tooltips {
            line("Depth mean: ${String.format("%.2f", mean)}m")
        }
        color = Color.RED
        width = 2.0
    }
    x.axis.name = "depth, m"
}

Density Plot with Mark Line

densityPlot plot

densityPlot(statDensityArgs) and DataFrame.densityPlot(statDensityArgs) are a family of functions for quickly building a density plot.

densityPlot(depthList, kernel = Kernel.COSINE)

Density Plot

df.densityPlot("depth")

Simple Density Plot

If you want to provide the input and weights using the column selection DSL, it's a bit different from the usual one: you should assign the x input and (optionally) the weight by invoking the eponymous functions:

df.densityPlot(adjust = 0.5) {
    x(depth)
    weight(coeff)
}

Density Plot with Weight

A densityPlot plot can be configured with the .configure {} extension: it opens a context that combines the area, StatDensity, and plot contexts. That means you can configure area settings, mappings using the StatDensity dataset, and any plot adjustments:

df.densityPlot {
    x(depth)
}.configure {
    // Area + StatDensity + PlotBuilder
    // Can't add new layer
    // Can add area mapping, including for `Stat.*` columns
    fillColor(Stat.scaled) // doesn't work properly for now
    alpha = 0.6
    // Can configure general plot adjustments
    layout {
        title = "Configured `densityPlot` plot"
        size = 600 to 350
    }
}

Configured Density Plot

Grouped statDensity

statDensity can be applied to grouped data: statistics will be calculated on each group independently but with equal categories. This application returns a new GroupBy dataset with the same keys as the old one but with StatDensity groups instead of the old ones.

// Create two samples from normal distribution with different mean/std
val rangesA = NormalDistribution(500.0, 100.0).sample(5000).toList()
val rangesB = NormalDistribution(400.0, 80.0).sample(5000).toList()

// Gather them into `DataFrame` with "A" and "B" keys in the "category" column
val rangesDF = dataFrameOf(
    "range" to rangesA + rangesB,
    "category" to List(5000) { "A" } + List(5000) { "B" }
)
rangesDF.head()

| range   | category |
|---------|----------|
| 503.671 | A        |
| 560.585 | A        |
| 525.12  | A        |
| 488.74  | A        |
| 357.084 | A        |

It has the following signature:

range | category

// Group it by "category"
val groupedRangesDF = rangesDF.groupBy { category }
groupedRangesDF

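Group with category = A:

| range   | category |
|---------|----------|
| 503.671 | A        |
| 560.585 | A        |
| 525.12  | A        |
| 488.74  | A        |
| 357.084 | A        |

Group with category = B:

| range   | category |
|---------|----------|
| 391.811 | B        |
| 291.449 | B        |
| 378.368 | B        |
| 408.26  | B        |
| 388.129 | B        |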

Now we have a GroupBy with the following signature:

key: [category]
group: DataFrame[range|category]

  • A: A-Group

  • B: B-Group

groupedRangesDF.statDensity { x(range) }

| category | group                                                     |
|----------|-----------------------------------------------------------|
| A        | Stat: { x: 107.258, density: 0, densityWeighted: 0, ... } |
| B        | Stat: { x: 117.39, density: 0, densityWeighted: 0, ... }  |

After applying statDensity, it's still a GroupBy but with a different group signature: all groups have the same signature as a usual DataFrame after applying statDensity (i.e. StatDensityFrame):

key: [category]
group: StatDensityFrame

  • A: A-Group

  • B: B-Group

As you can see, we did indeed perform a statDensity transformation within groups, and the grouping keys did not change. The plotting process doesn't change much; we do everything the same way.
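Conceptually, the grouped transform is a per-key map: split the rows by key, then estimate a density for each group's values. The plain-Kotlin sketch below illustrates that idea only; it is not how Kandy or DataFrame implement it, and `Obs`, `gaussianKde` and `groupedDensity` are hypothetical names.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt

// One observation: a numeric value plus its grouping key.
data class Obs(val range: Double, val category: String)

// Unweighted Gaussian kernel density estimate at point `at`.
fun gaussianKde(sample: List<Double>, at: Double, bandwidth: Double): Double {
    val kernelSum = sample.sumOf { v ->
        val u = (at - v) / bandwidth
        exp(-0.5 * u * u)
    }
    return kernelSum / (sample.size * bandwidth * sqrt(2 * PI))
}

// For every key, estimate the density of that group's values on the given grid.
// The keys of the result are exactly the keys of the input grouping.
fun groupedDensity(
    rows: List<Obs>,
    grid: List<Double>,
    bandwidth: Double,
): Map<String, List<Double>> =
    rows.groupBy({ it.category }, { it.range })
        .mapValues { (_, values) -> grid.map { x -> gaussianKde(values, x, bandwidth) } }
```

Because every group here is evaluated on the same grid, the resulting curves can be compared point by point.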

groupedRangesDF.plot {
    statDensity(range) {
        line {
            x(Stat.x)
            y(Stat.density)
        }
    }
}

StatDensity Grouped Line

As you can see, there are two lines because we have two groups of data. To distinguish them, we need to add a mapping from the key to color. This is convenient because the key is available in the context:

groupedRangesDF.plot {
    statDensity(range) {
        line {
            x(Stat.x)
            y(Stat.density)
            color(category)
        }
    }
}

StatDensity Grouped Line with Color

The densityPlot() layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor and borderLine.color will be created by default.

groupedRangesDF.plot {
    densityPlot(range)
}

Grouped Density Plot

We can customize it as before. The difference is that key columns are also accessible:

groupedRangesDF.plot {
    densityPlot(range) {
        // Customize scale of default mapping
        fillColor(category) {
            scale = categorical("A" to Color.GREEN, "B" to Color.ORANGE)
        }
        borderLine.color = Color.BLACK
        alpha = 0.5
    }
}

Customized Density Plot

Also, we can stack areas (for that we need x coordinates to match — use trim = true):

groupedRangesDF.plot {
    // Use trim
    densityPlot(range, trim = true) {
        // Adjust position of areas from different groups
        position = Position.stack()
        alpha = 0.8
    }
}

Stacked Density Plot
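Stacking places each group's density on top of the previous groups' at every x, which is why the x coordinates of all groups have to coincide. A minimal plain-Kotlin sketch of that accumulation follows; `stack` is a hypothetical helper, not part of Kandy.

```kotlin
// Stack curves that were all evaluated on the same x grid.
// For each group, in order, returns the upper boundary of its band:
// the running sum of the densities of that group and all groups before it.
fun stack(curves: List<List<Double>>): List<List<Double>> {
    if (curves.isEmpty()) return emptyList()
    val n = curves.first().size
    require(curves.all { it.size == n }) { "all curves must share one x grid" }
    val running = DoubleArray(n)
    return curves.map { curve ->
        List(n) { i ->
            running[i] += curve[i]
            running[i]
        }
    }
}
```

If the grids differed between groups, the i-th values of two curves would belong to different x positions and the sum would be meaningless, so the sketch rejects curves of different lengths.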

densityPlot plot for GroupBy (i.e. GroupBy.densityPlot(statDensityArgs) extensions) works as well:

groupedRangesDF.densityPlot("range", bandWidth = BandWidth.value(10.0))

Density Plot by Range

... and can be configured the same way:

groupedRangesDF.densityPlot(n = 750, trim = true, adjust = 0.75) {
    x(range)
}.configure {
    alpha = 0.6
    position = Position.stack()
    // Can access key column by name as `String`
    fillColor("category") {
        scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2)
    }
}

Configured Density Plot

Inside groupBy{} plot context

We can apply groupBy modification to the initial dataset and build a density plot with grouped data the same way:

rangesDF.plot {
    groupBy(category) {
        densityPlot(range)
    }
}

GroupBy in Plot
Last modified: 15 July 2024