kandy 0.7.0 Help

Histogram

The "bin" statistic is computed on a sample of a single continuous variable. It first divides the range of values into bins (sequential, non-overlapping intervals) and then counts the number of observations in each bin. Weights are supported: for each bin a weighted count is also calculated, in which every element in the bin contributes its weight. The bin-construction method (for example, a fixed number of bins or a fixed bin width) should be chosen carefully, because it strongly affects how the data is displayed and interpreted. A good choice makes the plot easy to read and gives a true picture of the distribution.
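
The counting step described above can be sketched in plain Kotlin. This is only an illustration under simple assumptions (equal-width bins spanning the sample range, the maximum value placed in the last bin); `Bin` and `binSample` are hypothetical names and this is not Kandy's implementation.

```kotlin
import kotlin.math.floor

// One bin of the sketch: its center, plain count and weighted count.
data class Bin(val center: Double, val count: Int, val countWeighted: Double)

// Hypothetical helper (not part of Kandy): split [min, max] into `n` equal-width
// bins and accumulate the count and the sum of weights for each bin.
fun binSample(xs: List<Double>, weights: List<Double>? = null, n: Int): List<Bin> {
    require(xs.isNotEmpty() && n > 0)
    require(weights == null || weights.size == xs.size)
    val min = xs.minOf { it }
    val max = xs.maxOf { it }
    // Guard against a zero-width range when all values are equal.
    val width = if (max > min) (max - min) / n else 1.0
    val counts = IntArray(n)
    val weighted = DoubleArray(n)
    for ((i, x) in xs.withIndex()) {
        // The maximum value goes into the last bin instead of opening a new one.
        val idx = floor((x - min) / width).toInt().coerceIn(0, n - 1)
        counts[idx] += 1
        // A missing weights list means every weight is 1.0.
        weighted[idx] += weights?.get(i) ?: 1.0
    }
    return List(n) { k -> Bin(min + (k + 0.5) * width, counts[k], weighted[k]) }
}
```

For the sample 1.0, 2.0, 2.5, 4.0, 9.0 with weights 0.5, 1.0, 1.0, 2.0, 0.5 and n = 4, the bins are 2 units wide: the counts are 3, 1, 0, 1 and the weighted counts are 2.5, 2.0, 0.0, 0.5.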

This notebook uses definitions from DataFrame.

Usage

Binning is commonly used in statistics and data analysis to simplify complex data sets and make them easier to interpret. A histogram (or any other plot based on the "bin" statistic) gives an overview of the sample distribution.

Arguments

  • Input (mandatory):

    • x — numeric sample on which the statistics are calculated

  • Weights (optional):

    • weights — set of weights of the same size as the input sample. null (by default) means all weights are equal to 1.0 and the weighted count is equal to the normal one

  • Parameters (optional):

    • binsOption: BinsOption — specifies either the number of bins or their width:

      • BinsOption.byNumber(n: Int) — values are divided into n bins (the bin width is derived)

      • BinsOption.byWidth(width: Double) — values are divided into bins of width width (the number of bins is derived)

    • binsAlign: BinsAlign — specifies how bins are aligned:

      • BinsAlign.center(pos: Double) — bins are aligned so that a bin is centered at pos

      • BinsAlign.boundary(pos: Double) — bins are aligned so that a boundary between two bins falls at pos

      • BinsAlign.none() — no alignment
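
To make the relation between these parameters concrete, here is a minimal sketch in plain Kotlin. It assumes the simplest reading of the options above: the derived quantity comes from the sample range, and alignment shifts the bin grid so that pos is a boundary or a center. `BinsSpec`, `ByNumber`, `ByWidth`, `resolveBins`, `startForBoundary` and `startForCenter` are hypothetical names, not Kandy's API, and the library may handle edge cases differently (for example, by adding an extra bin after alignment).

```kotlin
import kotlin.math.ceil
import kotlin.math.floor

// Hypothetical stand-ins for the two bin options; these are NOT Kandy's classes.
sealed interface BinsSpec
data class ByNumber(val n: Int) : BinsSpec
data class ByWidth(val width: Double) : BinsSpec

// Returns (number of bins, bin width) for a sample spanning [min, max].
fun resolveBins(min: Double, max: Double, spec: BinsSpec): Pair<Int, Double> {
    require(max > min) { "the sample range must be non-empty" }
    return when (spec) {
        // Fixed number of bins: the width is derived from the range.
        is ByNumber -> Pair(spec.n, (max - min) / spec.n)
        // Fixed width: the number of bins is derived, rounded up to cover the range.
        is ByWidth -> Pair(ceil((max - min) / spec.width).toInt(), spec.width)
    }
}

// Left edge of the first bin such that `pos` is a bin boundary
// and the first bin contains `min`.
fun startForBoundary(min: Double, width: Double, pos: Double): Double =
    pos + floor((min - pos) / width) * width

// Same, but `pos` is a bin center: boundaries sit half a width away from it.
fun startForCenter(min: Double, width: Double, pos: Double): Double =
    startForBoundary(min, width, pos + width / 2)
```

With these definitions, resolveBins(0.0, 100.0, ByWidth(30.0)) gives 4 bins of width 30.0, and startForCenter(3.25, 1.0, 0.0) gives 2.5, so the first bin [2.5, 3.5) is centered at 3.0, a whole number of bin widths away from pos = 0.0.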

Generalized signature

The exact signature depends on the function, but all functions related to the "bin" statistic (the variations of statBin() and histogram() discussed below) have approximately the same signature, built from the arguments above:

statBinArgs := x, weights = null, binsOption: BinsOption = BinsOption.byNumber(20), binsAlign: BinsAlign = BinsAlign.center(0.0)

The possible types of x and weights depend on where a certain function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.

Output statistics

name                  type    description
Stat.x                Double  Center of bin
Stat.count            Int     Number of observations in this bin
Stat.countWeighted    Double  Weighted count (sum of the observation weights in this bin)
Stat.density          Double  Empirically estimated density in this bin
Stat.densityWeighted  Double  Weighted density

StatBin plots

// Generate sample from normal distribution
val depthList = NormalDistribution(500.0, 100.0).sample(1000).toList()
// Generate sample from uniform distribution
val coeffList = UniformRealDistribution(0.0, 1.0).sample(1000).toList()
// gather them into the DataFrame
val df = dataFrameOf(
    "depth" to depthList,
    "coeff" to coeffList
)
df.head()

depth    coeff
458.195  0.343
336.811  0.807
762.538  0.101
692.733  0.51
424.594  0.873

df has the following signature:

depth  coeff

Let's take a look at the DataFrame that statBin outputs:

df.statBin("depth", "coeff", binsOption = BinsOption.byNumber(10))

Stat
x        count  countWeighted  density  densityWeighted
167.131  1      0.325          0        0
233.984  8      3.68           0        0
300.836  33     18.901         0        0.001
367.689  110    57.011         0.002    0.002
434.541  216    112.568        0.003    0.003

It has the following signature:

Stat
x  count  countWeighted  density  densityWeighted

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBin, each row corresponds to one bin. Stat.x is the column with the centers of the bins. Stat.count contains the number of observations in the bin, and Stat.countWeighted is the weighted version of count. Stat.density and Stat.densityWeighted contain the empirically estimated density of the sample (plain and weighted, respectively) at the points corresponding to the bin centers.
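
The density columns can be related to the counts with a short sketch. It assumes equal-width bins and the usual normalisation of an empirical density (the densities multiplied by the bin width sum to 1). `densities` is a hypothetical helper, not a Kandy function, and the library's exact formula is not confirmed here.

```kotlin
// Hypothetical helper (not part of Kandy): empirical density of each bin.
// Assumes equal-width bins; `counts` may hold plain or weighted counts.
fun densities(counts: List<Double>, binWidth: Double): List<Double> {
    require(binWidth > 0.0)
    val total = counts.sum()
    // An empty sample has zero density everywhere.
    if (total == 0.0) return counts.map { 0.0 }
    // Dividing by total * width makes the sum of density * width equal to 1.
    return counts.map { it / (total * binWidth) }
}
```

This reading is consistent with the sample output above: adjacent bin centers are 66.852 apart (434.541 - 367.689), so the bin holding 216 of the 1000 observations gets 216 / (1000 × 66.852) ≈ 0.0032, in line with the 0.003 shown. Passing weighted counts instead of plain counts gives the weighted variant under the same assumption.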

A DataFrame with "bin" statistics is called a StatBinFrame.

statBin context transform

statBin(statBinArgs) { /*new plotting context*/ } modifies the plotting context: inside the new context, a new StatBin dataset, calculated from the given arguments, is used instead of the original data (whether or not it was empty). Inputs and weights can be provided as an Iterable or as a dataset column reference (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statBin context. Since the old dataset is irrelevant inside the new context, its columns cannot be referenced there. The new columns can: they are all contained in the Stat group and can be used inside the new context:

plot {
    statBin(depthList, binsAlign = BinsAlign.center(500.0)) {
        // new `StatBin` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.count)
            fillColor = Color.RED
            alpha = 0.5
        }
    }
}
StatBin with Area Plot

Histogram layer

A histogram is a statistical chart that visually approximates the distribution of a numerical variable. It is a bar plot in which each bar represents a bin: the bar's x coordinate corresponds to the bin range and its y coordinate to the count. So basically, we can build a histogram with statBin as follows:

val statBinBarsPlot = df.plot {
    statBin("depth") {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
    layout.title = "`statBin` + `bars`"
}
statBinBarsPlot
StatBin Bars Plot

But we can do it even faster with the histogram(statBinArgs) method:

val histogramPlot = plot {
    histogram(depthList)
    layout.title = "`histogram`"
}
histogramPlot
Histogram Plot

Let's compare them:

plotGrid(listOf(statBinBarsPlot, histogramPlot))
Compare Histogram vs StatBin Bar Plots

These two plots are identical. Indeed, histogram just uses statBin and bars and performs the coordinate mappings under the hood. The histogram layer can also be customized: histogram() optionally opens a new context in which we can configure the bars (as in the usual context opened by bars { ... }) and even change the default coordinate mappings. The StatBin dataset of the histogram can also be accessed there.

df.plot {
    histogram(depth, binsAlign = BinsAlign.center(500.0)) {
        // Change a column mapped on `y` to `Stat.density`
        y(Stat.density)
        // Filling color depends on `density` statistic
        fillColor(Stat.density) {
            scale = continuous(Color.YELLOW..Color.RED)
        }
        borderLine.color = Color.BLACK
    }
}
Customized Histogram Plot

If we specify weights, Stat.countWeighted is mapped to y by default:

df.plot {
    // Count sample mean
    val mean = depth.mean()
    // Add weighted histogram
    histogram(depth, coeff, binsOption = BinsOption.byNumber(10), binsAlign = BinsAlign.boundary(mean))
    // We can add other layers as well.
    // Let's add a vertical mark line in the mean of sample
    vLine {
        xIntercept.constant(mean)
        tooltips {
            line("Depth mean: ${String.format("%.2f", mean)}m")
        }
        color = Color.RED
        width = 3.0
    }
    x.axis.name = "depth, m"
}
Histogram with mapping on countWeighted

histogram plot

histogram(statBinArgs) and DataFrame.histogram(statBinArgs) are a family of functions for quickly plotting a histogram.

histogram(depthList, binsAlign = BinsAlign.center(500.0))
Simple Histogram Plot
df.histogram("depth")
Simple Histogram on DataFrame

If you want to provide inputs and weights using the column selection DSL, the syntax differs slightly from the usual one: assign the x input and (optionally) the weights by invoking the eponymous functions:

df.histogram(binsOption = BinsOption.byNumber(10)) {
    x(depth)
    weight(coeff)
}
Histogram with mapping on DataColumns

A histogram plot can be configured with the .configure {} extension, which opens a context that combines the bars, StatBin and plot contexts. That means you can configure bar settings, create mappings using the StatBin dataset, and make any plot adjustments:

df.histogram(binsOption = BinsOption.byNumber(15)) {
    x(depth)
}.configure {
    // Bars + StatBin + PlotBuilder
    // Can't add a new layer
    x.limits = 100..900
    // Can add bar mappings, including on `Stat.*` columns
    fillColor(Stat.count) {
        scale = continuous(Color.GREEN..Color.RED)
    }
    // Can configure general plot adjustments
    layout {
        title = "Configured histogram plot"
        size = 600 to 350
    }
}
Configured Histogram Plot

Grouped statBin

statBin can be applied to grouped data: the statistics are computed for each group independently, but with the same bins for every group. The result is a new GroupBy dataset with the same keys as the old one, but with StatBin groups in place of the original ones.

// Create two samples from normal distribution with different mean/std
val rangesA = NormalDistribution(500.0, 100.0).sample(5000).toList()
val rangesB = NormalDistribution(400.0, 80.0).sample(5000).toList()
// Gather them into `DataFrame` with "A" and "B" keys in the "category" column
val rangesDF = dataFrameOf(
    "range" to rangesA + rangesB,
    "category" to List(5000) { "A" } + List(5000) { "B" }
)
rangesDF.head(5)

range    category
347.452  A
467.839  A
527.679  A
538.295  A
654.991  A

It has the following signature:

range  category

// Group it by "category"
val groupedRangesDF = rangesDF.groupBy { category }
groupedRangesDF

category  group
A         range    category
          527.679  A
          654.991  A
          538.295  A
          467.839  A
          347.452  A
B         range    category
          377.8    B
          266.069  B
          306.389  B
          543.127  B
          482.897  B

Now we have a GroupBy with the following signature:

key: [category]  group: DataFrame[range|category]
A                A-Group
B                B-Group

groupedRangesDF.statBin { x(range) }

category  group
A         Stat
          { x: 116.91, count: 1, countWeighted: 1, ... }
B         Stat
          { x: 116.91, count: 5, countWeighted: 5, ... }

After applying statBin, the result is still a GroupBy, but its groups have a different signature: every group has the same signature as an ordinary DataFrame after statBin is applied (i.e. a StatBinFrame):

key: [category]  group: StatBinFrame
A                A-Group
B                B-Group

As you can see, the statBin transformation was applied within each group and the grouping keys did not change. In addition, the bin centers match across groups, which makes it possible to build a grouped histogram.
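
Why the bin centers match can be illustrated with a small sketch: the bin grid is derived once from the whole dataset, and each group is then counted over that shared grid. This is a simplified illustration in plain Kotlin; `groupedBinCounts` is a hypothetical helper, not Kandy's implementation.

```kotlin
import kotlin.math.floor

// Hypothetical helper (not part of Kandy): count every group over bins that are
// shared by all groups. Returns the shared bin centers and, for each group key,
// the counts in those bins.
fun groupedBinCounts(
    groups: Map<String, List<Double>>,
    n: Int
): Pair<List<Double>, Map<String, List<Int>>> {
    val all = groups.values.flatten()
    require(all.isNotEmpty() && n > 0)
    // The range is taken over the whole dataset, not per group.
    val min = all.minOf { it }
    val max = all.maxOf { it }
    val width = if (max > min) (max - min) / n else 1.0
    val centers = List(n) { k -> min + (k + 0.5) * width }
    val counts = groups.mapValues { (_, xs) ->
        val c = IntArray(n)
        // The maximum value goes into the last bin.
        for (x in xs) c[floor((x - min) / width).toInt().coerceIn(0, n - 1)] += 1
        c.toList()
    }
    return Pair(centers, counts)
}
```

For groups A = 0.0, 1.0, 5.0 and B = 9.0, 10.0 with n = 2, both groups share the centers 2.5 and 7.5; A has counts 2, 1 and B has counts 0, 2.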

The plotting process stays essentially the same:

groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
        }
    }
}
Grouped StatBin Plot

As you can see, there are two areas because we have two groups of data. To distinguish them, we need to map the key to the fill color. Conveniently, the key is available in the context:

groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
            // can access "key." columns and create mapping from them
            fillColor(category)
            alpha = 0.6
        }
    }
}
StatBin Area with Mapped fillColor

The histogram layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.

groupedRangesDF.plot { histogram(range) }
Simple Grouped Histogram Plot

We can customize it as before. The differences are that the key columns are accessible and that we can adjust the position of bars (within a single x-coordinate), for example by stacking them:

groupedRangesDF.plot {
    histogram(range) {
        fillColor(category) {
            scale = categorical(listOf(Color.GREEN, Color.ORANGE))
        }
        borderLine.width = 0.0
        width = 1.0
        // Adjust position of bars from different groups
        position = Position.stack()
    }
}
Stack Position on Histogram Plot
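
What stacking means for bars that share an x-coordinate can be sketched as follows: each group's bar starts where the previous group's bar ended, instead of at zero. The sketch assumes the common meaning of a stack position; `stackBars` is a hypothetical helper and does not describe how Kandy computes bar geometry internally.

```kotlin
// Hypothetical helper (not part of Kandy): vertical extents (bottom, top) of
// stacked bars. `countsByGroup` holds, for each group in drawing order, its
// counts over the shared bins.
fun stackBars(countsByGroup: List<List<Int>>): List<List<Pair<Int, Int>>> {
    if (countsByGroup.isEmpty()) return emptyList()
    val nBins = countsByGroup.first().size
    require(countsByGroup.all { it.size == nBins }) { "all groups must share the same bins" }
    // Running top of the stack in each bin.
    val base = IntArray(nBins)
    return countsByGroup.map { counts ->
        counts.mapIndexed { k, c ->
            val bottom = base[k]
            base[k] += c
            Pair(bottom, base[k])
        }
    }
}
```

For two groups with counts 2, 1 and 0, 2, the first group's bars span 0 to 2 and 0 to 1, while the second group's bars span 2 to 2 and 1 to 3. With an identity position, every bar would start at 0 and the groups would overlap.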

The histogram plot for GroupBy (i.e. the GroupBy.histogram(statBinArgs) extensions) works as well:

groupedRangesDF.histogram("range")
Simple Histogram on Grouped Data

... and can be configured the same way:

groupedRangesDF.histogram(binsAlign = BinsAlign.center(500.0)) {
    x(range)
}.configure {
    alpha = 0.6
    // make the bars from different groups overlap with each other
    position = Position.identity()
    // can access key column by name as `String`
    fillColor("category") {
        scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2)
    }
}
Configured Grouped Histogram

Inside groupBy{} plot context

We can apply the groupBy modification to the initial dataset and build a histogram with grouped data the same way:

rangesDF.plot {
    groupBy(category) {
        histogram(range)
    }
}
GroupBy in Plot
Last modified: 15 July 2024