Histogram
The "bin" statistic is computed on a sample of a single continuous variable. First, the range of values is divided into bins (sequential, non-overlapping intervals), and then the number of observations in each bin is counted. The statistic can also be weighted: in that case a weighted count is computed for each bin, with every element contributing its weight. The bin-construction method (for example, a fixed number of bins or a fixed bin width) should be chosen carefully, because it strongly affects how the data is displayed and interpreted. A good choice makes the plot easy to read and gives a faithful picture of the distribution.
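The two steps described above (split the range into equal bins, then count) can be sketched in plain Kotlin. This is only an illustration of the idea, not the library's implementation: the names BinCount and binCounts are hypothetical, and placing the maximum value into the last bin is an assumption made for this sketch.

```kotlin
// Illustrative sketch only: BinCount and binCounts are hypothetical names, not library API.
data class BinCount(val center: Double, val count: Int, val countWeighted: Double)

fun binCounts(xs: List<Double>, bins: Int, weights: List<Double>? = null): List<BinCount> {
    require(xs.isNotEmpty() && bins > 0) { "need a non-empty sample and at least one bin" }
    require(weights == null || weights.size == xs.size) { "weights must match the sample size" }
    val min = xs.minOf { it }
    val max = xs.maxOf { it }
    // Guard against a zero-width range when all values are equal.
    val width = if (max > min) (max - min) / bins else 1.0
    val counts = IntArray(bins)
    val weighted = DoubleArray(bins)
    for ((i, x) in xs.withIndex()) {
        // Assumption: the maximum value goes into the last bin instead of opening a new one.
        val index = ((x - min) / width).toInt().coerceIn(0, bins - 1)
        counts[index] += 1
        // Missing weights mean every observation weighs 1.0.
        weighted[index] += weights?.get(i) ?: 1.0
    }
    return List(bins) { b -> BinCount(min + (b + 0.5) * width, counts[b], weighted[b]) }
}
```

When weights are omitted, the weighted count of each bin equals its plain count, which matches the behaviour described for the default weights.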
This notebook uses definitions from DataFrame.
Usage
Binning is commonly used in statistics and data analysis to simplify complex data sets and make them easier to interpret. A histogram (or any other plot based on the "bin" statistic) gives an overview of the sample distribution.
Arguments
Input (mandatory):
x - numeric sample on which the statistics are calculated
Weights (optional):
weights - set of weights of the same size as the input sample. null (by default) means all weights are equal to 1.0 and the weighted count is equal to the unweighted one
Parameters (optional):
binsOption: BinsOption - specifies either the number of bins or their width:
  BinsOption.byNumber(n: Int) - values are divided into n bins (the bin width is derived)
  BinsOption.byWidth(width: Double) - values are divided into bins of width width (the number of bins is derived)
binsAlign: BinsAlign - specifies bin alignment:
  BinsAlign.center(pos: Double) - bins are aligned so that a bin is centered at pos
  BinsAlign.boundary(pos: Double) - bins are aligned so that a boundary between two bins lies at pos
  BinsAlign.none() - no alignment
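To make the relationship between the two bin options and the two alignment modes concrete, here is a minimal sketch in plain Kotlin. All function names are hypothetical, and the arithmetic shows one common convention (the one used by similar "bin" statistics in other plotting libraries); it is not taken from the library's source, so the actual edge handling may differ.

```kotlin
import kotlin.math.ceil
import kotlin.math.floor

// Hypothetical helpers for illustration; not the BinsOption / BinsAlign implementation.

/** Bin width derived from a requested number of bins (the "by number" case). */
fun widthForNumber(min: Double, max: Double, n: Int): Double {
    require(n > 0 && max > min) { "need n > 0 and a non-empty range" }
    return (max - min) / n
}

/** Number of bins derived from a requested width (the "by width" case). */
fun numberForWidth(min: Double, max: Double, width: Double): Int {
    require(width > 0.0 && max >= min) { "need a positive width and max >= min" }
    return maxOf(1, ceil((max - min) / width).toInt())
}

/** Lower edge of the first bin when some bin boundary must lie at pos. */
fun startForBoundary(min: Double, width: Double, pos: Double): Double =
    pos + floor((min - pos) / width) * width

/** Centering a bin at pos is the same as placing a boundary at pos - width / 2. */
fun startForCenter(min: Double, width: Double, pos: Double): Double =
    startForBoundary(min, width, pos - width / 2)
```

For example, with width 1.0 and a sample minimum of 1.2, boundary alignment at 0.0 starts the first bin at 1.0, while center alignment at 0.0 starts it at 0.5, so the bin centers fall on whole numbers.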
Generalized signature
The specific signature depends on the function, but all functions related to the "bin" statistic (discussed further below: the different variations of statBin() and histogram()) have approximately the same signature with the arguments above:
The possible types of x and weights depend on where a certain function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.
Output statistics
name | type | description |
---|---|---|
Stat.x | Double | Center of bin |
Stat.count | Int | Number of observations in this bin |
Stat.countWeighted | Double | Weighted count (sum of observations weights in this bin) |
Stat.density | Double | Empirically estimated density in this bin |
Stat.densityWeighted | Double | Weighted density |
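The table does not spell out how the density columns are derived from the counts. A minimal sketch, assuming the usual normalisation in which the bars integrate to 1 (density = count / (total count * bin width)); the helper name densities is hypothetical and the formula is an assumption, not something stated in this document:

```kotlin
// Assumption: density is normalised so that the sum of density * binWidth equals 1.
// Hypothetical helper for illustration; not library API.
fun densities(counts: List<Double>, binWidth: Double): List<Double> {
    require(binWidth > 0.0) { "bin width must be positive" }
    val total = counts.sum()
    require(total > 0.0) { "at least one observation (or positive weight) is required" }
    return counts.map { it / (total * binWidth) }
}
```

Passing plain counts gives the unweighted density, and passing weighted counts gives the weighted one under the same assumption.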
StatBin plots
depth | coeff |
---|---|
458.195 | 0.343 |
336.811 | 0.807 |
762.538 | 0.101 |
692.733 | 0.51 |
424.594 | 0.873 |
df has the following signature:
depth | coeff |
---|---|
Let's take a look at the StatBin output DataFrame. It has the following signature:
Stat | ||||
---|---|---|---|---|
x | count | countWeighted | density | densityWeighted |
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBin, each row corresponds to one bin. Stat.x is the column with the centers of the bins. Stat.count contains the number of observations in the bin, and Stat.countWeighted is the weighted version of count. There are also Stat.density and Stat.densityWeighted, which contain the empirically estimated density (unweighted and weighted) of the sample at the points corresponding to the bin centers.
A DataFrame with "bin" statistics is called a StatBinFrame.
statBin context transform
statBin(statBinArgs) { /*new plotting context*/ } modifies the plotting context: instead of the original data (whether or not it was empty), a new StatBin dataset, calculated from the given arguments, is used inside the new context. Inputs and weights can be provided as an Iterable or as a dataset column reference (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statBin context. Since the old dataset is not available inside the new context, we cannot refer to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be accessed inside the new context:
Histogram layer
A histogram is a statistical chart that visually approximates the distribution of a numerical variable. It is a bar plot where each bar represents a bin: its x coordinate corresponds to the bin range and its y coordinate to the count. So we can build a histogram with statBin as follows:
But we can do it even faster with the histogram(statBinArgs) method:
Let's compare them:
These two plots are identical. Indeed, histogram just uses statBin and bars and performs the coordinate mappings under the hood. We can also customize the histogram layer: histogram() optionally opens a new context where we can configure bars (as in the usual context opened by bars { ... }) and even change the default coordinate mappings. The StatBin dataset of the histogram can also be accessed here.
If we specify weights, Stat.countWeighted is mapped to y by default:
histogram plot
histogram(statBinArgs) and DataFrame.histogram(statBinArgs) are a family of functions for quickly plotting a histogram.
If you want to provide inputs and weights using the column selection DSL, it works a bit differently from the usual one: you should assign the x input and (optionally) weight by invoking the eponymous functions:
A histogram plot can be configured with the .configure {} extension: it opens a context that combines the bars, StatBin, and plot contexts. That means you can configure bar settings, mappings using the StatBin dataset, and any plot adjustments:
Grouped statBin
statBin can be applied to grouped data: the statistics are computed for each group independently but with the same bins. This returns a new GroupBy dataset with the same keys as the old one but with StatBin groups instead of the old ones.
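The phrase "independently but with equal bins" can be illustrated in plain Kotlin: the bin edges are derived once from all values, and each group is then counted against those shared edges. This is a sketch of the idea only; groupedBinCounts is a hypothetical name, and deriving the range from the pooled sample is an assumption rather than a documented detail.

```kotlin
// Hypothetical helper for illustration; not the library's GroupBy implementation.
fun groupedBinCounts(groups: Map<String, List<Double>>, bins: Int): Map<String, List<Int>> {
    require(bins > 0) { "at least one bin is required" }
    val all = groups.values.flatten()
    require(all.isNotEmpty()) { "at least one observation is required" }
    // Assumption: the shared range is taken over the values of all groups together.
    val min = all.minOf { it }
    val max = all.maxOf { it }
    val width = if (max > min) (max - min) / bins else 1.0
    return groups.mapValues { (_, xs) ->
        val counts = IntArray(bins)
        // Every group is counted against the same bin edges.
        for (x in xs) counts[((x - min) / width).toInt().coerceIn(0, bins - 1)] += 1
        counts.toList()
    }
}
```

Because every group uses the same edges, the resulting count lists have the same length and the same bin centers, which is what makes side-by-side or stacked bars line up.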
range | category |
---|---|
347.452 | A |
467.839 | A |
527.679 | A |
538.295 | A |
654.991 | A |
It has the following signature:
range | category |
---|---|
Now we have a GroupBy with the following signature:
key: [category] | group: DataFrame[range|category] |
---|---|
A | A-Group |
B | B-Group |
After applying statBin, it is still a GroupBy but with a different signature of group: all groups have the same signature as a usual DataFrame after applying statBin (i.e. StatBinFrame):
key: [category] | group: StatBinFrame |
---|---|
A | A-Group |
B | B-Group |
As you can see, the statBin transformation was indeed performed within groups, and the grouping keys did not change. Also, all bin centers match across groups, which helps to build a grouped histogram.
The plotting process doesn't change much: we do everything the same way.
As you can see, there are two areas because we have two groups of data. To distinguish them, we need to add a mapping from the key to the fill color. This is convenient: the key is available in the context.
The histogram layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.
We can customize it as before. The differences are that we have access to the key columns and can customize the position of bars (within a single x-coordinate), for example, stack them:
The histogram plot for GroupBy (i.e. the GroupBy.histogram(statBinArgs) extensions) works as well:
... and can be configured the same way:
Inside groupBy{} plot context
We can apply the groupBy modification to the initial dataset and build a histogram with grouped data the same way: