Count Plot
The "count" statistic is calculated on a sample of a single categorical variable: it counts the number of observations in each category. The statistic is also weighted, which means that a weighted count is calculated for each category (each element within a category is counted along with its weight).
This notebook uses definitions from DataFrame.
Usage
"Count" is one of the most important statistics and has many uses. The count plot provides a graphical depiction of how categories are distributed.
Arguments
Input (mandatory):
x - discrete sample on which the statistics are calculated
Weights (optional):
weights - set of weights of the same size as the input sample. null (by default) means all weights are equal to 1.0 and the weighted count is equal to the normal one
Generalized signature
The specific signature depends on the function, but all functions related to the "count" statistic (the different variations of statCount() and countPlot(), which are discussed further below) have approximately the same signature, taking the arguments above. This argument list is referred to below as statCountArgs.
The possible types of x and weights depend on where a certain function is used. They can be simply Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself. The elements of x are of type X, a generic type parameter.
Output statistics
name | type | description |
---|---|---|
Stat.x | X | Category |
Stat.count | Int | Number of observations in this category |
Stat.countWeighted | Double | Weighted count (sum of observations weights in this category) |
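To make the relation between the three output columns concrete, here is a minimal plain-Kotlin sketch of the computation. It is not the library's implementation: CountRow and countStat are hypothetical names introduced only for this illustration, and the ordering of categories (first appearance) is an assumption of the sketch.

```kotlin
// Illustration only: CountRow and countStat are hypothetical, not part of the library.
data class CountRow<X>(val x: X, val count: Int, val countWeighted: Double)

fun <X> countStat(x: List<X>, weights: List<Double>? = null): List<CountRow<X>> {
    require(weights == null || weights.size == x.size) {
        "weights must have the same size as the input sample"
    }
    // LinkedHashMap keeps categories in order of first appearance (an assumption of this sketch).
    val counts = LinkedHashMap<X, Int>()
    val weighted = LinkedHashMap<X, Double>()
    for ((i, category) in x.withIndex()) {
        // null weights: every observation weighs 1.0
        val w = weights?.get(i) ?: 1.0
        counts[category] = (counts[category] ?: 0) + 1
        weighted[category] = (weighted[category] ?: 0.0) + w
    }
    // One row per category, mirroring Stat.x, Stat.count and Stat.countWeighted.
    return counts.map { (category, n) -> CountRow(category, n, weighted.getValue(category)) }
}
```

For example, countStat(listOf("compact", "suv", "compact"), listOf(0.5, 2.0, 1.5)) yields one row per category; without weights, countWeighted equals count for every row.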
StatCount plots
untitled | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 18,0 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 18,0 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2,0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2,0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 28,0 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
class | drv | hwy |
---|---|---|
compact | f | 29 |
compact | f | 29 |
compact | f | 31 |
compact | f | 30 |
compact | f | 26 |
It has the following signature:
class | drv | hwy |
---|---|---|
Let's take a look at the StatCount output DataFrame. It has the following signature:
Stat | ||
---|---|---|
x | count | countWeighted |
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statCount, each row corresponds to one category. Stat.x is the column with this category. Stat.count contains the number of observations in the category. Stat.countWeighted is the weighted version of count. A DataFrame with "count" statistics is called a StatCountFrame.
statCount transform
statCount(statCountArgs) { /*new plotting context*/ } modifies the plotting context: inside the new context, a new StatCount dataset is used instead of the original data (whether or not it was empty). The dataset is calculated from the given arguments; inputs and weights can be provided as an Iterable or as a reference to a dataset column (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statCount context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be called inside the new context:
CountPlot layer
CountPlot is a statistical plot used for visualizing the distribution of categorical variables. It is a bar plot in which each bar represents one category: its x coordinate corresponds to the category and its y coordinate to the count. So basically, we can build a count plot with statCount as follows:
But we can do it even faster with the countPlot(statCountArgs) method:
Let's compare them:
These two plots are identical. Indeed, countPlot just uses statCount and bars and performs the coordinate mappings under the hood. We can also customize the count plot layer: countPlot() optionally opens a new context, where we can configure bars (as in the usual context opened by bars { ... }) and even change the default coordinate mappings. The StatCount dataset of the count plot can also be accessed there.
If we specify weights, Stat.countWeighted is mapped to y by default:
countPlot plot
countPlot(statCountArgs) and DataFrame.countPlot(statCountArgs) are a family of functions for quickly building a count plot.
If you want to provide inputs and weights using the column selection DSL, it works a bit differently from the usual one: you should assign the x input and (optionally) weight by invoking the eponymous functions:
The countPlot plot can be configured with the .configure {} extension, which opens a context that combines the bars, StatCount and plot contexts. That means you can configure bar settings, mappings that use the StatCount dataset, and any plot adjustments:
Grouped statCount
statCount can be applied to grouped data: statistics are calculated on each group independently but with equal categories. This application returns a new GroupBy dataset with the same keys as the old one but with StatCount groups instead of the old ones.
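The idea of counting each group independently while keeping the categories equal can be sketched in plain Kotlin. This is only an illustration under one assumption: "equal categories" is read here as every group reporting the full set of categories, with a zero count where the group has no observations. groupedCount is a hypothetical helper, not a library function.

```kotlin
// Illustration only: groupedCount is hypothetical, not part of the library.
// Assumption: every group reports all categories, with 0 where it has no observations.
fun <K, X> groupedCount(keys: List<K>, x: List<X>): Map<K, Map<X, Int>> {
    require(keys.size == x.size) { "keys and x must have the same size" }
    // Categories are collected across the whole sample, so every group shares them.
    val categories = x.distinct()
    val result = LinkedHashMap<K, MutableMap<X, Int>>()
    for ((key, category) in keys.zip(x)) {
        // Each new group starts with a zero count for every category.
        val counts = result.getOrPut(key) { categories.associateWith { 0 }.toMutableMap() }
        counts[category] = counts.getValue(category) + 1
    }
    return result
}
```

The grouping keys of the result are the same as the keys of the input; only the content of each group changes from raw observations to per-category counts.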
Now we have a GroupBy with the following signature:
key: [drv] | group: DataFrame[class|drv|hwy] |
---|---|
"f" | "f"-Group |
"4" | "4"-Group |
"r" | "r"-Group |
After applying statCount, it's still a GroupBy but with a different signature of group: all groups have the same signature as a usual DataFrame after applying statCount (i.e. StatCountFrame):
key: [drv] | group: StatCountFrame |
---|---|
"f" | "f"-Group |
"4" | "4"-Group |
"r" | "r"-Group |
As you can see, we did indeed perform a statCount transformation within groups; the grouping keys did not change.
The plotting process doesn't change much — we do everything the same.
As you can see, there are several bars in some categories because we have three groups of data. To distinguish them, we need to map the key to the fill color. This is convenient because the key is available in the context.
The countPlot layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.
We can customize it as before. The differences are that we have access to the key columns and that we can customize the position of bars (within a single x-coordinate), for example, stack them:
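What stacking means for bars that share one x-coordinate can be sketched in plain Kotlin: each group's bar starts where the previous group's bar for the same category ended. This shows the geometry only and is not how the library computes positions; StackedBar and stackBars are hypothetical names.

```kotlin
// Illustration only: StackedBar and stackBars are hypothetical, not part of the library.
data class StackedBar<K, X>(val key: K, val x: X, val yMin: Int, val yMax: Int)

fun <K, X> stackBars(counts: Map<K, Map<X, Int>>): List<StackedBar<K, X>> {
    // Current height of the stack for each category.
    val top = HashMap<X, Int>()
    val bars = ArrayList<StackedBar<K, X>>()
    for ((key, groupCounts) in counts) {
        for ((x, n) in groupCounts) {
            val base = top[x] ?: 0
            // The bar occupies [base, base + n] at this x-coordinate.
            bars += StackedBar(key, x, base, base + n)
            top[x] = base + n
        }
    }
    return bars
}
```

With counts of 2 for group "f" and 3 for group "4" in the same category, the first bar spans 0 to 2 and the second 2 to 5, so the total stack height equals the ungrouped count.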
The countPlot plot for GroupBy (i.e. the GroupBy.countPlot(statCountArgs) extensions) works as well:
... and can be configured the same way:
Inside groupBy{} plot context
We can apply the groupBy modification to the initial dataset and build a count plot with grouped data the same way: