Histogram

Usage

Binning is commonly used in statistics and data analysis to simplify complex data sets and make them easier to interpret. A histogram (or any other plot based on "bin" statistics) gives an overview of the sample distribution.

Arguments

Generalized signature

The exact signature depends on the function, but all functions related to the "bin" statistic (the variations of statBin() and histogram() discussed below) have approximately the same signature, with the following arguments:

statBinArgs :=
   x,
   weights = null,
   binsOption: BinsOption = BinsOption.byNumber(20),
   binsAlign: BinsAlign = BinsAlign.center(0.0)

The possible types of x and weights depend on where a given function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.
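To make the role of binsAlign more concrete, here is a minimal plain-Kotlin sketch of how an alignment value could fix the position of the bin grid. It assumes that BinsAlign.center(v) places a bin center at v and BinsAlign.boundary(v) places a bin boundary at v. The helpers firstEdgeForBoundary and firstEdgeForCenter are hypothetical names used only for illustration and are not part of the library; the library's actual computation may differ.

```kotlin
import kotlin.math.floor

// Hypothetical helper (not a library function): left edge of the first bin,
// chosen so that some bin boundary lies exactly at `boundary`
// and the first bin contains `min`.
fun firstEdgeForBoundary(min: Double, width: Double, boundary: Double): Double {
    require(width > 0.0) { "Bin width must be positive" }
    return boundary + floor((min - boundary) / width) * width
}

// Aligning a bin center at `center` is equivalent to aligning
// a bin boundary half a bin width to the left of it.
fun firstEdgeForCenter(min: Double, width: Double, center: Double): Double =
    firstEdgeForBoundary(min, width, center - width / 2)

fun main() {
    val min = 130.0
    val width = 100.0
    println("boundary at 500: first bin starts at ${firstEdgeForBoundary(min, width, 500.0)}")
    println("center at 500:   first bin starts at ${firstEdgeForCenter(min, width, 500.0)}")
}
```

With a smallest value of 130 and a bin width of 100, aligning a boundary at 500 starts the grid at 100, while aligning a center at 500 starts it at 50, so that one bin spans 450 to 550.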

Output statistics

name	type	description
Stat.x	Double	Center of bin
Stat.count	Int	Number of observations in this bin
Stat.countWeighted	Double	Weighted count (sum of observations weights in this bin)
Stat.density	Double	Empirically estimated density in this bin
Stat.densityWeighted	Double	Weighted density
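The statistics in the table can be illustrated with a small plain-Kotlin sketch. It is not the library's implementation: BinRow and binStats are hypothetical names, the bins are assumed to be equal-width over the range of the sample, and density is assumed to be normalized as count / (total count × bin width), so that the bar areas sum to 1.

```kotlin
import kotlin.math.floor

// One row of the hypothetical result, mirroring the Stat.* columns.
data class BinRow(
    val x: Double,               // center of the bin
    val count: Int,              // number of observations in the bin
    val countWeighted: Double,   // sum of the weights of observations in the bin
    val density: Double,         // count / (total count * bin width)
    val densityWeighted: Double  // countWeighted / (total weight * bin width)
)

// Hypothetical helper: equal-width bins over [min, max] of the sample.
fun binStats(xs: List<Double>, weights: List<Double>? = null, bins: Int = 20): List<BinRow> {
    require(xs.isNotEmpty() && bins > 0)
    val w = weights ?: List(xs.size) { 1.0 }
    require(w.size == xs.size) { "weights must match x in size" }
    val min = xs.minOf { it }
    val max = xs.maxOf { it }
    // Guard against a zero-width range when all values are equal.
    val width = if (max > min) (max - min) / bins else 1.0
    val counts = IntArray(bins)
    val sums = DoubleArray(bins)
    for (i in xs.indices) {
        // Clamp the index so that the maximum value falls into the last bin.
        val idx = floor((xs[i] - min) / width).toInt().coerceIn(0, bins - 1)
        counts[idx]++
        sums[idx] += w[i]
    }
    val totalWeight = w.sum()
    return List(bins) { b ->
        BinRow(
            x = min + (b + 0.5) * width,
            count = counts[b],
            countWeighted = sums[b],
            density = counts[b] / (xs.size * width),
            densityWeighted = sums[b] / (totalWeight * width)
        )
    }
}

fun main() {
    val rows = binStats(listOf(0.0, 1.0, 1.5, 4.0), listOf(1.0, 2.0, 1.0, 4.0), bins = 2)
    rows.forEach(::println)
}
```

When no weights are given, every observation gets weight 1 in this sketch, so the weighted columns coincide with the unweighted ones.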

StatBin plots

// Generate sample from normal distribution
val depthList = NormalDistribution(500.0, 100.0).sample(1000).toList()
// Generate sample from uniform distribution
val coeffList = UniformRealDistribution(0.0, 1.0).sample(1000).toList()
// gather them into the DataFrame
val df = dataFrameOf(
    "depth" to depthList,
    "coeff" to coeffList
)
df.head()

depth	coeff
458.195	0.343
336.811	0.807
762.538	0.101
692.733	0.51
424.594	0.873

df has two columns: depth and coeff.

Let's take a look at the DataFrame that statBin produces:

df.statBin("depth", "coeff", binsOption = BinsOption.byNumber(10))

Stat
x	count	countWeighted	density	densityWeighted

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBin, each row corresponds to one bin. Stat.x is the column with the centers of the bins. Stat.count contains the number of observations in the bin, and Stat.countWeighted is the weighted version of count. There are also Stat.density and Stat.densityWeighted, which contain the empirically estimated density of the sample (unweighted and weighted, respectively) at the points corresponding to the bin centers.

A DataFrame with "bin" statistics is called a StatBinFrame.

`statBin` context transform

statBin(statBinArgs) { /*new plotting context*/ } modifies the plotting context: inside the new context, a new StatBin dataset, calculated from the given arguments, is used instead of the original data (whether or not it was empty). Inputs and weights can be provided as an Iterable or as a dataset column reference: by name as a String, as a ColumnReference, or as a DataColumn. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statBin context. Inside the new context the old dataset is not available, so its columns cannot be referenced. The new columns can be: they are all contained in the Stat group and can be used inside the new context:

plot {
    statBin(depthList, binsAlign = BinsAlign.center(500.0)) {
        // new `StatBin` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.count)
            fillColor = Color.RED
            alpha = 0.5
        }
    }
}

Histogram layer

A histogram is a statistical chart that visually approximates the distribution of a numerical variable. It is a bar plot where each bar represents a bin: its x coordinate corresponds to the bin range and its y coordinate to the count. So we can build a histogram with statBin as follows:

val statBinBarsPlot = df.plot {
    statBin("depth") {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
    layout.title = "`statBin` + `bars`"
}
statBinBarsPlot

But we can do it even faster with the histogram(statBinArgs) method:

val histogramPlot = plot {
    histogram(depthList)
    layout.title = "`histogram`"
}
histogramPlot

plotGrid(listOf(statBinBarsPlot, histogramPlot))

These two plots are identical. Indeed, histogram just uses statBin and bars and performs the coordinate mappings under the hood. The histogram layer can also be customized: histogram() optionally opens a new context where we can configure the bars (as in the usual context opened by bars { ... }) and even change the default coordinate mappings. The StatBin dataset of the histogram can also be accessed there.

df.plot {
    histogram(depth, binsAlign = BinsAlign.center(500.0)) {
        // Change a column mapped on `y` to `Stat.density`
        y(Stat.density)
        // Filling color depends on `density` statistic
        fillColor(Stat.density) {
            scale = continuous(Color.YELLOW..Color.RED)
        }
        borderLine.color = Color.BLACK
    }
}

If we specify weights, Stat.countWeighted is mapped to y by default:

df.plot {
    // Compute the sample mean
    val mean = depth.mean()
    // Add weighted histogram
    histogram(depth, coeff, binsOption = BinsOption.byNumber(10), binsAlign = BinsAlign.boundary(mean))
    // We can add other layers as well.
    // Let's add a vertical mark line at the sample mean
    vLine {
        xIntercept.constant(mean)
        tooltips { line("Depth mean: ${String.format("%.2f", mean)}m") }
        color = Color.RED; width = 3.0
    }
    x.axis.name = "depth, m"
}

`histogram` plot

histogram(statBinArgs) and DataFrame.histogram(statBinArgs) are a family of functions for quickly plotting a histogram.

histogram(depthList, binsAlign = BinsAlign.center(500.0))

df.histogram("depth")

If you want to provide inputs and weights using the column selection DSL, the syntax differs slightly from the usual one: assign the x input and (optionally) the weight by invoking the eponymous functions:

df.histogram(binsOption = BinsOption.byNumber(10)) {
    x(depth)
    weight(coeff)
}

A histogram plot can be configured with the .configure {} extension, which opens a context that combines the bars, StatBin, and plot contexts. That means you can configure bar settings, mappings that use the StatBin dataset, and any plot adjustments:

df.histogram(binsOption = BinsOption.byNumber(15)) {
    x(depth)
}.configure {
    // Bars + StatBin + PlotBuilder
    // Can't add a new layer
    x.axis.limits = 100..900
    // Can add bar mappings, including on `Stat.*` columns
    fillColor(Stat.count) { scale = continuous(Color.GREEN..Color.RED) }
    // Can configure general plot adjustments
    layout {
        title = "Configured histogram plot"
        size = 600 to 350
    }
}

Grouped `statBin`

statBin can be applied to grouped data: statistics are computed for each group independently but with identical bins. This returns a new GroupBy dataset with the same keys as the old one but with StatBin groups instead of the old ones.

// Create two samples from normal distribution with different mean/std
val rangesA = NormalDistribution(500.0, 100.0).sample(5000).toList()
val rangesB = NormalDistribution(400.0, 80.0).sample(5000).toList()

// Gather them into `DataFrame` with "A" and "B" keys in the "category" column
val rangesDF = dataFrameOf(
    "range" to rangesA + rangesB,
    "category" to List(5000) { "A" } + List(5000) { "B" }
)
rangesDF.head(5)

range	category
347.452	A
467.839	A
527.679	A
538.295	A
654.991	A

// Group it by "category"
val groupedRangesDF = rangesDF.groupBy { category }
groupedRangesDF

Now we have a GroupBy with the following signature:

key: [category]	group: DataFrame[range|category]
A	A-Group
B	B-Group

groupedRangesDF.statBin { x(range) }

After applying statBin, it is still a GroupBy, but the groups have a different signature: each group has the same signature as a regular DataFrame after applying statBin (i.e. a StatBinFrame):

key: [category]	group: StatBinFrame
A	A-Group
B	B-Group

As you can see, the statBin transformation was indeed applied within groups, and the grouping keys did not change. Also, all bin centers match, which helps to build a grouped histogram.
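One way to picture why the bin centers match is to derive a single bin grid from the pooled data and then count each group against that grid. The plain-Kotlin sketch below does exactly that. It is only an illustration under that assumption: groupedBinCounts is a hypothetical helper, and the library may choose its common bins differently.

```kotlin
import kotlin.math.floor

// Hypothetical helper: returns the shared bin centers and, for every group,
// the number of observations per bin. The bin grid is derived from all
// groups together, so every group is counted against the same bins.
fun groupedBinCounts(
    groups: Map<String, List<Double>>,
    bins: Int
): Pair<List<Double>, Map<String, List<Int>>> {
    val all = groups.values.flatten()
    require(all.isNotEmpty() && bins > 0)
    val min = all.minOf { it }
    val max = all.maxOf { it }
    // Guard against a zero-width range when all values are equal.
    val width = if (max > min) (max - min) / bins else 1.0
    val centers = List(bins) { min + (it + 0.5) * width }
    val counts = groups.mapValues { (_, xs) ->
        val c = IntArray(bins)
        for (x in xs) {
            // Clamp the index so that the pooled maximum falls into the last bin.
            c[floor((x - min) / width).toInt().coerceIn(0, bins - 1)]++
        }
        c.toList()
    }
    return centers to counts
}

fun main() {
    val (centers, counts) = groupedBinCounts(
        mapOf("A" to listOf(0.0, 1.0, 2.0), "B" to listOf(3.0, 4.0)),
        bins = 2
    )
    println("centers: $centers")
    counts.forEach { (key, c) -> println("$key: $c") }
}
```

Because both groups share the centers list, bars or areas built from the per-group counts line up on the x axis.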

groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
        }
    }
}

groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
            // can access "key." columns and create mapping from them
            fillColor(category)
            alpha = 0.6
        }
    }
}

The histogram layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.

groupedRangesDF.plot {
    histogram(range)
}

We can customize it as before. The differences are that we have access to the key columns and that we can customize the position of bars (within a single x-coordinate), for example by stacking them:

groupedRangesDF.plot {
    histogram(range) {
        fillColor(category) {
            scale = categorical(listOf(Color.GREEN, Color.ORANGE))
        }
        borderLine.width = 0.0
        width = 1.0
        // Adjust position of bars from different groups
        position = Position.stack()
    }
}
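Stacking, as requested by Position.stack() above, can be thought of as a running sum per bin: each group's bar starts where the previous group's bar ended. The plain-Kotlin sketch below illustrates the idea. stackBars is a hypothetical helper, and the stacking order used by the library is an assumption here (groups are stacked in map iteration order).

```kotlin
// Hypothetical helper: for each group, returns the (bottom, top) of its bar
// in every bin when the bars of all groups are stacked in iteration order.
fun stackBars(counts: Map<String, List<Int>>): Map<String, List<Pair<Int, Int>>> {
    val bins = counts.values.firstOrNull()?.size ?: 0
    require(counts.values.all { it.size == bins }) { "All groups must share the same bins" }
    // Current height of the stack in each bin.
    val running = IntArray(bins)
    return counts.mapValues { (_, c) ->
        List(bins) { b ->
            val bottom = running[b]
            running[b] += c[b]
            bottom to running[b]
        }
    }
}

fun main() {
    val stacked = stackBars(mapOf("A" to listOf(2, 1), "B" to listOf(0, 2)))
    stacked.forEach { (key, bars) -> println("$key: $bars") }
}
```

This only works because the groups share the same bins; with mismatched bin centers there would be no common x coordinate to stack on.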

The histogram plot for GroupBy (i.e. the GroupBy.histogram(statBinArgs) extensions) works as well:

groupedRangesDF.histogram("range")

groupedRangesDF.histogram(binsAlign = BinsAlign.center(500.0)) { x(range) }.configure {
    alpha = 0.6
    // make the bars from different groups overlap with each other
    position = Position.identity()
    // can access key column by name as `String`
    fillColor("category") { scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2) }
}

Inside `groupBy{}` plot context

We can apply the groupBy modification to the initial dataset and build a histogram from grouped data in the same way:

rangesDF.plot {
    groupBy(category) {
        histogram(range)
    }
}
