Heatmap
The "count2d" statistic is calculated on a sample of two categorical variables (usually provided as two single-variable samples, `x` and `y`). It counts the number of observations in each pair of x-category and y-category. It is weighted: a weighted count is also calculated for each pair, in which every observation contributes its weight instead of 1.
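The definition above can be illustrated with a minimal, library-free sketch. The names `PairCount` and `count2D` are hypothetical helpers introduced only for this illustration; they are not part of the plotting API, and the sample values are made up.

```kotlin
// Hypothetical illustration of the "count2d" statistic; not the library implementation.
data class PairCount<X, Y>(val x: X, val y: Y, val count: Int, val countWeighted: Double)

fun <X, Y> count2D(xs: List<X>, ys: List<Y>, weights: List<Double>? = null): List<PairCount<X, Y>> {
    require(xs.size == ys.size) { "x and y must have the same size" }
    require(weights == null || weights.size == xs.size) { "weights must match the sample size" }
    // LinkedHashMap keeps the pairs in order of first appearance.
    val counts = LinkedHashMap<Pair<X, Y>, Int>()
    val sums = LinkedHashMap<Pair<X, Y>, Double>()
    for (i in xs.indices) {
        val key = xs[i] to ys[i]
        counts[key] = (counts[key] ?: 0) + 1
        // A missing weights list is treated as all weights equal to 1.0.
        sums[key] = (sums[key] ?: 0.0) + (weights?.get(i) ?: 1.0)
    }
    return counts.map { (key, n) -> PairCount(key.first, key.second, n, sums.getValue(key)) }
}

fun main() {
    // Made-up sample data for the illustration.
    val drv = listOf("f", "f", "4", "f", "4")
    val cls = listOf("compact", "compact", "suv", "midsize", "suv")
    val w = listOf(1.0, 2.0, 0.5, 1.0, 0.5)
    count2D(drv, cls, w).forEach(::println)
}
```

Each resulting row describes one observed pair of categories, which is the same shape as the output statistics described later on this page.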
This notebook uses definitions from DataFrame.
Usage
"Count2D" plots give a visual representation of the two-variable discrete sample distribution.
Arguments
Input (mandatory):
`x`: x-part of the input sample
`y`: y-part of the input sample
Weights (optional):
`weights`: set of weights of the same size as the input samples. `null` (the default) means all weights are equal to `1.0`, so the weighted count is equal to the plain count.
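As a small sketch of the default described above, the following library-free snippet shows why a `null` weights argument makes the weighted count coincide with the plain one. `resolveWeights` is a hypothetical helper written for this illustration, not a library function.

```kotlin
// Hypothetical helper: resolves the optional weights argument to a concrete list.
fun resolveWeights(weights: List<Double>?, sampleSize: Int): List<Double> {
    if (weights == null) return List(sampleSize) { 1.0 } // default: every weight is 1.0
    require(weights.size == sampleSize) { "weights must have the same size as the input samples" }
    return weights
}

fun main() {
    val x = listOf("f", "f", "4")
    val w = resolveWeights(null, x.size)
    // With unit weights, the weighted count of a category equals its plain count.
    val plain = x.count { it == "f" }
    val weighted = x.indices.filter { x[it] == "f" }.sumOf { w[it] }
    println("count = $plain, countWeighted = $weighted")
}
```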
Generalized signature
The specific signature depends on the function, but all functions related to the "count2d" statistic (discussed further below: the different variations of `statCount2D()` and `heatmap()`) have approximately the same signature with the arguments above:
The possible types of `x`, `y` and `weights` depend on where a certain function is used. They can be simply an `Iterable` (`List`, `Set`, etc.), a reference to a column in a `DataFrame` (`String`, `ColumnAccessor`), or the `DataColumn` itself. Elements of `x` are of type `X` and elements of `y` are of type `Y`; both are generic type parameters.
Output statistics
name | type | description |
---|---|---|
Stat.x | X | x-category of the pair |
Stat.y | Y | y-category of the pair |
Stat.count | Int | Number of observations in this category |
Stat.countWeighted | Double | Weighted count (sum of observations weights in this category) |
StatCount plots
untitled | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
class | drv | hwy |
---|---|---|
compact | f | 29 |
compact | f | 29 |
compact | f | 31 |
compact | f | 30 |
compact | f | 26 |
It has the following signature:
class | drv | hwy |
---|---|---|
Let's take a look at the `StatCount2D` output `DataFrame`:
[Output `DataFrame` with a single `Stat` column group]
It has the following signature:
Stat | |||
---|---|---|---|
x | y | count | countWeighted |
As you can see, we got a `DataFrame` with one `ColumnGroup` called `Stat`, which contains several columns with statistics. For `statCount2D`, each row corresponds to one pair of categories. `Stat.x` is the column with its x-category. `Stat.y` is the column with its y-category. `Stat.count` contains the number of observations in the pair. `Stat.countWeighted` is the weighted version of `count`. A `DataFrame` with "count2D" statistics is called a `StatCount2DFrame`.
statCount2D plot transform
`statCount2D(statCount2DArgs) { /*new plotting context*/ }` modifies the plotting context: instead of the original data (whether or not it was empty), a new `statCount2D` dataset is used inside a new context. This dataset is calculated on the given arguments; inputs and weights can be provided as an `Iterable` or as a dataset column reference (by name as a `String`, as a `ColumnReference`, or as a `DataColumn`). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the `statCount2D` context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there. But we can refer to the new ones. They are all contained in the `Stat` group and can be called inside the new context:
Heatmap layer
A heatmap is a statistical plot used for visualizing the distribution of a sample of two categorical variables. It's a tile plot where each tile represents one pair of categories: its `x` coordinate corresponds to the `x` category, its `y` coordinate to the `y` category, and its color to the count of this pair. So basically, we can build a heatmap with `statCount2D` as follows:
But we can do it even faster with the `heatmap(statCount2DArgs)` method:
Let's compare them:
These two plots are identical. Indeed, `heatmap` just uses `statCount2D` and `tile` and performs the coordinate and `fillColor` mappings under the hood. We can also customize the heatmap layer: `heatmap()` optionally opens a new context, where we can configure tiles (as in the usual context opened by `tile { ... }`) and even change the default mappings. The `StatCount2D` dataset of the heatmap can also be accessed here.
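To make the tile mapping concrete without relying on the plotting API, here is a library-free sketch that lays the pair counts out as a grid, one cell per (x, y) tile, with the count playing the role of the fill value. `countGrid` is a hypothetical helper and the sample values are made up; a real heatmap maps the count to a color rather than printing it.

```kotlin
// Hypothetical helper: one grid cell per (x-category, y-category) tile.
// Rows follow yCats, columns follow xCats; pairs that never occur get 0.
fun countGrid(
    xs: List<String>,
    ys: List<String>,
    xCats: List<String>,
    yCats: List<String>
): List<List<Int>> {
    require(xs.size == ys.size) { "x and y must have the same size" }
    val counts = xs.zip(ys).groupingBy { it }.eachCount()
    return yCats.map { y -> xCats.map { x -> counts[x to y] ?: 0 } }
}

fun main() {
    // Made-up sample data for the illustration.
    val drv = listOf("f", "f", "4", "f", "4")
    val cls = listOf("compact", "compact", "suv", "midsize", "suv")
    val xCats = drv.distinct()
    val yCats = cls.distinct()
    val grid = countGrid(drv, cls, xCats, yCats)
    // Print a header row of x-categories, then one row per y-category.
    println("".padEnd(8) + xCats.joinToString("") { it.padStart(3) })
    yCats.forEachIndexed { i, y ->
        println(y.padEnd(8) + grid[i].joinToString("") { it.toString().padStart(3) })
    }
}
```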
If we specify weights, `Stat.countWeighted` is mapped to `fillColor` by default:
heatmap plot
`heatmap(statCount2DArgs)` and `DataFrame.heatmap(statCount2DArgs)` are a family of functions for quickly plotting a heatmap.
If you want to provide inputs and weights using the column selection DSL, it's a bit different from the usual one: you should assign the `x` and `y` inputs and (optionally) `weight` by invoking the eponymous functions:
The heatmap plot can be configured with the `.configure {}` extension. It opens a context that combines the tile, `StatCount2D` and plot contexts. That means you can configure tile settings, mappings using the `StatCount2D` dataset, and any plot adjustments: