Boxplot
A boxplot, also known as a box-and-whisker plot, is a statistical visualization that shows the distribution and summary statistics of a dataset in graphical form. It consists of several components:
Median (Q2): the line inside the box represents the median of the dataset, which is the middle value when the data is sorted in ascending order. It divides the data into two equal halves, with 50% of the data falling below and 50% above the median.
Interquartile Range (IQR): the box itself spans the interquartile range, which is the range between the first quartile (Q1) and the third quartile (Q3). The first quartile (Q1) is the 25th percentile, meaning that 25% of the data falls below it, while the third quartile (Q3) is the 75th percentile, indicating that 75% of the data falls below it. The IQR captures the middle 50% of the data.
Whiskers: the whiskers extend from the bottom and top edges of the box to the minimum and maximum non-outlier data points. A point counts as a non-outlier if it lies within a multiple of the IQR (often 1.5 times the IQR) of the box edges; that multiple sets the outer limits beyond which a point is considered a potential outlier.
Outliers (optional): individual data points that fall outside the whiskers are considered potential outliers. These are data points that are significantly different from the rest of the data and may warrant special attention in further analysis. The auxiliary statistic "boxplotOutliers" is used to count outliers. This statistic is not weighted.
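The components above can be computed by hand. The following is a minimal sketch in plain Kotlin that does not use the plotting library. The names BoxplotSummary, quantile and boxplotSummary are hypothetical, and the quantile rule (linear interpolation between order statistics) is an assumption, since the text does not say which rule the "boxplot" statistic uses.

```kotlin
// Hypothetical helper type for illustration; not part of any library API.
data class BoxplotSummary(
    val min: Double,            // lower whisker end (smallest non-outlier)
    val lower: Double,          // first quartile (Q1)
    val middle: Double,         // median (Q2)
    val upper: Double,          // third quartile (Q3)
    val max: Double,            // upper whisker end (largest non-outlier)
    val outliers: List<Double>  // points beyond the whisker limits
)

// Quantile by linear interpolation between order statistics (an assumed rule).
fun quantile(sorted: List<Double>, p: Double): Double {
    require(sorted.isNotEmpty()) { "sample must not be empty" }
    val pos = p * (sorted.size - 1)
    val lo = pos.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo])
}

fun boxplotSummary(sample: List<Double>, whiskerIQRRatio: Double = 1.5): BoxplotSummary {
    require(whiskerIQRRatio >= 0.0) { "whiskerIQRRatio must be non-negative" }
    val sorted = sample.sorted()
    val q1 = quantile(sorted, 0.25)
    val q2 = quantile(sorted, 0.50)
    val q3 = quantile(sorted, 0.75)
    val iqr = q3 - q1
    // Limits beyond which a point is treated as a potential outlier.
    val lowerFence = q1 - whiskerIQRRatio * iqr
    val upperFence = q3 + whiskerIQRRatio * iqr
    val inside = sorted.filter { it in lowerFence..upperFence }
    val outliers = sorted.filter { it < lowerFence || it > upperFence }
    return BoxplotSummary(
        min = inside.firstOrNull() ?: q1,  // fall back to the box edge if no point is inside
        lower = q1,
        middle = q2,
        upper = q3,
        max = inside.lastOrNull() ?: q3,
        outliers = outliers
    )
}
```

For the sample 1, 2, 3, 4, 5, 100 this sketch gives Q1 = 2.25, median = 3.5 and Q3 = 4.75, whiskers ending at 1 and 5, and a single outlier, 100.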
This notebook uses definitions from DataFrame.
Usage
A boxplot is a visual representation of a dataset's distribution, showing the median, quartiles, and potential outliers. It's a useful tool for understanding the spread and central tendency of data, as well as identifying outliers. The compactness of this chart also makes it convenient to visually compare the characteristics of different samples with each other.
Arguments
Both statBoxplot() and statBoxplotOutliers() (as well as the statistical boxplot() layer and plot functions) have the same arguments and signature.
Input (mandatory):
x - a categorical variable dividing the data into different groups (it is absent in some versions of the functions, in which case all calculations are performed on a single sample without division);
y - the numeric sample on which the statistics are calculated.
Parameters (optional):
whiskerIQRRatio: Double - the interquartile range multiplier that determines the whisker lengths.
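To illustrate what the multiplier controls, here is a small sketch in plain Kotlin. The function whiskerFences is hypothetical, and the default of 1.5 is assumed from the "often 1.5 times the IQR" convention mentioned above rather than taken from the library.

```kotlin
// Hypothetical helper: the limits beyond which points count as potential outliers.
fun whiskerFences(q1: Double, q3: Double, whiskerIQRRatio: Double = 1.5): Pair<Double, Double> {
    require(q3 >= q1) { "q3 must not be smaller than q1" }
    require(whiskerIQRRatio >= 0.0) { "whiskerIQRRatio must be non-negative" }
    // Maximum distance a whisker may extend from the box edge.
    val reach = whiskerIQRRatio * (q3 - q1)
    return (q1 - reach) to (q3 + reach)
}
```

With Q1 = 10 and Q3 = 20, a ratio of 1.5 gives limits of -5 and 35, while a ratio of 3.0 widens them to -20 and 50, so fewer points are reported as outliers.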
Generalized signature
The specific signature depends on the function, but all functions related to the "boxplot" statistic (discussed further below: the different variations of statBoxplot(), statBoxplotOutliers(), and boxplot()) have approximately the same signature with the arguments above:
The possible types of y depend on where a given function is used. It can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself. It is used only with DataFrame, as a reference to a column of the same type as y. The elements of x are of type X, a generic type parameter.
Output statistics
"boxplot"
name | type | description |
---|---|---|
Stat.x | X | Boxplot category |
Stat.min | Double | Lower whisker end - the minimum non-outlier data point |
Stat.lower | Double | Lower box edge - the first quartile (Q1) |
Stat.middle | Double | Median / the second quartile (Q2) |
Stat.upper | Double | Upper box edge - the third quartile (Q3) |
Stat.max | Double | Upper whisker end - the maximum non-outlier data point |
"boxplotOutliers"
name | type | description |
---|---|---|
Stat.x | X | Boxplot category |
Stat.y | Double | Outlier value |
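The two tables can be read as row types: one row per x category for "boxplot", and one row per outlier for "boxplotOutliers". The sketch below reproduces that shape in plain Kotlin. BoxplotRow, OutlierRow, interpolatedQuantile and boxplotRows are hypothetical names, and the quantile rule (linear interpolation) is an assumption; the library's own calculation may differ.

```kotlin
// Hypothetical row types mirroring the two output tables; not library classes.
data class BoxplotRow<X>(
    val x: X, val min: Double, val lower: Double,
    val middle: Double, val upper: Double, val max: Double
)
data class OutlierRow<X>(val x: X, val y: Double)

// Quantile by linear interpolation between order statistics (an assumed rule).
fun interpolatedQuantile(sorted: List<Double>, p: Double): Double {
    val pos = p * (sorted.size - 1)
    val lo = pos.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo])
}

fun <X> boxplotRows(
    xs: List<X>,
    ys: List<Double>,
    whiskerIQRRatio: Double = 1.5
): Pair<List<BoxplotRow<X>>, List<OutlierRow<X>>> {
    require(xs.size == ys.size) { "x and y must have the same length" }
    val boxes = mutableListOf<BoxplotRow<X>>()
    val outliers = mutableListOf<OutlierRow<X>>()
    // groupBy keeps the categories in order of first appearance.
    for ((category, values) in xs.zip(ys).groupBy({ it.first }, { it.second })) {
        val sorted = values.sorted()
        val q1 = interpolatedQuantile(sorted, 0.25)
        val q2 = interpolatedQuantile(sorted, 0.50)
        val q3 = interpolatedQuantile(sorted, 0.75)
        val reach = whiskerIQRRatio * (q3 - q1)
        val inside = sorted.filter { it >= q1 - reach && it <= q3 + reach }
        boxes += BoxplotRow(
            category,
            inside.firstOrNull() ?: q1, q1, q2, q3,
            inside.lastOrNull() ?: q3
        )
        // One outlier row per point outside the whisker limits.
        sorted.filter { it < q1 - reach || it > q3 + reach }
            .forEach { outliers += OutlierRow(category, it) }
    }
    return boxes to outliers
}
```

Calling boxplotRows with two categories yields two boxplot rows, while the number of outlier rows depends on the data and may be zero.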
StatBoxplot plots
rate | cond |
---|---|
38.387 | A |
33.406 | A |
33.51 | A |
36.099 | A |
38.703 | A |
df has the following signature:
rate | cond |
---|---|
Let's take a look at the StatBoxplot output DataFrame:
It has the following signature:
Stat | |||||
---|---|---|---|---|---|
x | min | lower | middle | upper | max |
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBoxplot, each row corresponds to one boxplot. Stat.x is the column with the x-coordinate category. Stat.min, Stat.lower, Stat.upper and Stat.max correspond to the boxplot statistics: the y-coordinates of the box and whiskers. Stat.middle is the median value, the y-coordinate of the middle line.
A DataFrame with "boxplot" statistics is called StatBoxplotFrame.
We can also calculate the outliers of these boxplots:
It has the following signature:
Stat | |
---|---|
x | y |
There are only two columns in the Stat group: Stat.x with the boxplot x category and Stat.y with the outlier y-coordinate.
A DataFrame with "boxplotOutliers" statistics is called StatBoxplotOutliersFrame.
statBoxplot and statBoxplotOutliers transforms
statBoxplot(statBoxplotArgs) { /*new plotting context*/ } modifies the plotting context: instead of the original data (whether or not it was empty), a new StatBoxplot dataset is used inside a new context. This dataset is calculated from the given arguments; inputs and weights can be provided as an Iterable or as a dataset column reference (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statBoxplot context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be accessed inside the new context.
statBoxplotOutliers(statBoxplotArgs) { /*new plotting context*/ } works the same way with a new StatBoxplotOutliers dataset.
boxplot layer
To make a boxplot (statistical chart), we need the boxplot statistics and the boxes geom. The attributes of boxes match the boxplot statistics. We can also add outliers using the boxplotOutliers statistic and a points layer.
But we can do it much faster with the boxplot(statBoxplotArgs) method:
Let's compare them:
These two plots are identical. Indeed, the statistical boxplot layer just uses the combination of statistics and layers above (statBoxplot + boxes and statBoxplotOutliers + points) and performs the coordinate mappings under the hood. We can also customize the statistical boxplot layer: boxplot() optionally opens a new context where we can configure both boxes and outliers (as in the usual contexts opened by boxes { ... } / points { ... }). Moreover, the Stat.* columns of the StatBoxplot dataset are available in the context of boxes, just as the Stat.* columns of StatBoxplotOutliers are available in the context of outliers. We can also hide outliers.
A boxplot layer for a single sample (without x categories) receives only one sample (an Iterable or a column reference):
boxplot plot
boxplot(statBoxplotArgs) and DataFrame.boxplot(statBoxplotArgs) are a family of functions for quickly plotting a boxplot.
If you want to provide inputs and weights using the column selection DSL, it is a bit different from the usual one: you should assign the x and y inputs through invocation of the eponymous functions:
A boxplot plot can be configured with the .configure {} extension. It opens a context similar to the one created by the statistical boxplot layer, where you can configure boxes and outliers in the same way and can also configure any plot adjustments:
Grouped statBoxplot
Sometimes you need to group data within x categories. statBoxplot can be applied to grouped data: the statistics are calculated for each group independently (and, within each group, for each x category). The result is a new GroupBy dataset with the same keys as the old one, but with StatBoxplot groups in place of the original ones.
untitled | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
class | hwy | drv |
---|---|---|
compact | 29 | f |
compact | 29 | f |
compact | 31 | f |
compact | 30 | f |
compact | 26 | f |
drv | group |
---|---|
f | {drv: "f", hwy: 29, class:...} |
4 | {drv: "4", hwy: 26, class:...} |
r | {drv: "r", hwy: 20, class:...} |
Now we have a GroupBy with the following signature:
key: [drv] | group: DataFrame[class|hwy|drv] |
---|---|
"f" | "f"-Group |
"4" | "4"-Group |
"r" | "r"-Group |
drv | group |
---|---|
f | {Stat: {min: 23, middle: 29,...} |
4 | {Stat: {min: 25, middle: 25,...} |
r | {Stat: {min: 16, middle: 17,...} |
After applying statBoxplot, it is still a GroupBy, but with a different signature for group: all groups have the same signature as a regular DataFrame after applying statBoxplot (i.e. StatBoxplotFrame):
key: [drv] | group: StatBoxplotFrame |
---|---|
"f" | "f"-Group |
"4" | "4"-Group |
"r" | "r"-Group |
As you can see, the statBoxplot transformation was indeed applied within the groups, and the grouping keys did not change.
The plotting process doesn't change much — we do everything the same.
As you can see, there are two or three boxes in some x categories because we have three groups of data. To distinguish them, we need to adjust the position and add a mapping from the key to the color. This is convenient because the key is available in the context.
The statistical boxplot layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.
We can customize it as before. The differences are that the key columns are accessible and that we can customize the position of boxes (within a single x-coordinate), for example, by overlapping them:
The boxplot plot for GroupBy (i.e. the GroupBy.boxplot(statBoxplotArgs) extensions) works as well:
... and can be configured the same way:
Inside groupBy{} plot context
We can apply groupBy
modification to the initial dataset and build a boxplot with grouped data the same way: