Boxplot

Usage

A boxplot is a visual representation of a dataset's distribution, showing the median, quartiles, and potential outliers. It's a useful tool for understanding the spread and central tendency of data, as well as identifying outliers. The compactness of this chart also makes it convenient to visually compare the characteristics of different samples with each other.

Arguments

Both statBoxplot() and statBoxplotOutliers() (as well as statistical boxplot() layer and plot functions) have the same arguments and signature.

Generalized signature

The specific signature depends on the function, but all functions related to the "boxplot" statistic (the variations of statBoxplot(), statBoxplotOutliers(), and boxplot() discussed below) have approximately the same signature, with the arguments listed below:

statBoxplotArgs :=
   x, // optional
   y,
   whiskerIQRRatio: Double = 1.5

The possible types of y depend on where the function is used. It can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself. x is used only with a DataFrame: it is a column reference of the same kind as y. The elements of x have type X, a generic type parameter.
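The whiskerIQRRatio argument is not explained further here, so the following is only a minimal sketch of the usual convention: the whisker fences sit at a multiple of the interquartile range (IQR) beyond the box edges. The helper name whiskerFences is hypothetical and not part of Kandy, and we assume the library follows this standard rule.

```kotlin
// Hypothetical helper (not part of Kandy): computes the fences beyond which
// points are usually treated as outliers, assuming the common rule
//   lower fence = Q1 - ratio * IQR, upper fence = Q3 + ratio * IQR.
fun whiskerFences(
    q1: Double,
    q3: Double,
    whiskerIQRRatio: Double = 1.5
): Pair<Double, Double> {
    val iqr = q3 - q1
    return (q1 - whiskerIQRRatio * iqr) to (q3 + whiskerIQRRatio * iqr)
}
```

Under this assumption, whiskerFences(35.0, 41.0) gives (26.0, 50.0), and raising the ratio to 2.0 widens the fences to (23.0, 53.0), so fewer points count as outliers.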

Output statistics

"boxplot"

name	type	description
Stat.x	X	Boxplot `x` category
Stat.min	Double	Lower whisker end - the minimum non-outlier data point
Stat.lower	Double	Lower box edge - the first quartile (Q1)
Stat.middle	Double	Median / the second quartile (Q2)
Stat.upper	Double	Upper box edge - the third quartile (Q3)
Stat.max	Double	Upper whisker end - the maximum non-outlier data point
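To make the meaning of these columns concrete, here is a rough sketch that computes the same five values for a single sample in plain Kotlin. The names BoxplotStats, quantile and boxplotStats are hypothetical, and the quantile rule (linear interpolation between closest ranks) is an assumption: the library may use a different method, so its numbers can differ slightly.

```kotlin
// Hypothetical container mirroring Stat.min .. Stat.max for one x category.
data class BoxplotStats(
    val min: Double,
    val lower: Double,
    val middle: Double,
    val upper: Double,
    val max: Double
)

// Quantile of an already sorted, non-empty list.
// Assumption: linear interpolation between the two closest ranks.
fun quantile(sorted: List<Double>, p: Double): Double {
    require(sorted.isNotEmpty()) { "sample must not be empty" }
    val pos = p * (sorted.size - 1)
    val lo = pos.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo])
}

fun boxplotStats(sample: Iterable<Double>, whiskerIQRRatio: Double = 1.5): BoxplotStats {
    val sorted = sample.sorted()
    val q1 = quantile(sorted, 0.25)
    val q2 = quantile(sorted, 0.5)
    val q3 = quantile(sorted, 0.75)
    val iqr = q3 - q1
    val lowFence = q1 - whiskerIQRRatio * iqr
    val highFence = q3 + whiskerIQRRatio * iqr
    // Whisker ends are the extreme data points that still lie within the fences.
    val inside = sorted.filter { it in lowFence..highFence }
    return BoxplotStats(
        min = inside.firstOrNull() ?: q1,
        lower = q1,
        middle = q2,
        upper = q3,
        max = inside.lastOrNull() ?: q3
    )
}
```

For the sample 1.0, 2.0, ..., 8.0, 100.0 this sketch yields min = 1.0, lower = 3.0, middle = 5.0, upper = 7.0 and max = 8.0: the value 100.0 lies beyond the upper fence (13.0), so it is excluded from the whisker.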

StatBoxplot plots

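// To generate the data, we use the Apache Commons Math library
// https://commons.apache.org/proper/commons-math/
// (the imports below assume the commons-math3 artifact)
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution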
// Generate sample from normal distribution
val rateA = NormalDistribution(37.8, 4.3).sample(5000).toList()
// Generate sample from uniform distribution
val rateB = UniformRealDistribution(20.0, 50.0).sample(1000).toList()
// Combine two previous samples and filter them by lower bound for third sample
val rateC = (rateA + rateB).filter { it >= 36.0 }

// gather them into the DataFrame in a single column and with corresponding keys in column `cond`
val df = dataFrameOf(
    "rate" to rateA + rateB + rateC,
    "cond" to List(rateA.size) { "A" } + List(rateB.size) { "B" } + List(rateC.size) { "C" }
)
df.head(5)

rate	cond
38.387	A
33.406	A
33.51	A
36.099	A
38.703	A

df has a signature

Let's take a look at StatBoxplot output DataFrame:

df.statBoxplot("cond", "rate")

Stat
x	min	lower	middle	upper	max

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBoxplot, each row corresponds to one boxplot. Stat.x is the column with the x-coordinate category. Stat.min, Stat.lower, Stat.upper and Stat.max correspond to the boxplot statistics: the y-coordinates of the box and whiskers. Stat.middle is the median value, the y-coordinate of the middle line.

DataFrame with "boxplot" statistics is called StatBoxplotFrame

df.statBoxplotOutliers("cond", "rate").head(5)

Stat
x	y

There are only two columns in Stat group: Stat.x with x boxplot category and Stat.y with y outlier coordinate.

DataFrame with "boxplotOutliers" statistics is called StatBoxplotOutliersFrame
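As an illustration of what such a frame contains, the sketch below collects (x, y) outlier pairs from several samples in plain Kotlin. All names are hypothetical; the fences use the common Q1 - ratio * IQR / Q3 + ratio * IQR rule and linearly interpolated quartiles, both of which are assumptions about the library's behaviour rather than a description of it.

```kotlin
// Quantile of an already sorted, non-empty list.
// Assumption: linear interpolation between the two closest ranks.
fun quantile(sorted: List<Double>, p: Double): Double {
    require(sorted.isNotEmpty()) { "sample must not be empty" }
    val pos = p * (sorted.size - 1)
    val lo = pos.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo])
}

// Points of one sample that fall outside the whisker fences.
fun boxplotOutliers(sample: Iterable<Double>, whiskerIQRRatio: Double = 1.5): List<Double> {
    val values = sample.toList()
    if (values.isEmpty()) return emptyList()
    val sorted = values.sorted()
    val q1 = quantile(sorted, 0.25)
    val q3 = quantile(sorted, 0.75)
    val iqr = q3 - q1
    val lowFence = q1 - whiskerIQRRatio * iqr
    val highFence = q3 + whiskerIQRRatio * iqr
    return values.filter { it < lowFence || it > highFence }
}

// One row per outlier: the x category and the outlier's y value,
// analogous to the Stat.x and Stat.y columns.
fun outlierRows(samples: Map<String, List<Double>>): List<Pair<String, Double>> =
    samples.flatMap { (x, ys) -> boxplotOutliers(ys).map { y -> x to y } }
```

With the made-up samples "A" = 1.0, 2.0, ..., 8.0, 100.0 and "B" = 10.0, 11.0, 12.0, 13.0, -50.0, outlierRows returns two rows: ("A", 100.0) and ("B", -50.0).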

`statBoxplot` and `statBoxplotOutliers` transforms

statBoxplot(statBoxplotArgs) { /*new plotting context*/ } modifies the plotting context. Inside the new context, a new StatBoxplot dataset is used instead of the original data (whether or not it was empty). The new dataset is calculated from the given arguments; inputs can be provided as an Iterable or as a dataset column reference (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statBoxplot context. Since the old dataset is not available inside the new context, we cannot refer to its columns there. But we can refer to the new ones: they are all contained in the Stat group and can be used inside the new context.

df.plot {
    statBoxplot(cond, rate) {
        // New "StatBoxplot" dataset here
        errorBars {
            // Use `Stat.*` columns for mappings
            x(Stat.x)
            yMin(Stat.min)
            yMax(Stat.max)
            borderLine.color(Stat.x)
        }
    }
    // Initial dataset here
    points {
        x("cond")
        y("rate")
        size = 0.5
        alpha = 0.2
        color("cond")
        position = Position.jitter()
    }
}

statBoxplotOutliers(statBoxplotArgs) { /*new plotting context*/ } works the same way with a new StatBoxplotOutliers dataset.

df.plot {
    statBoxplotOutliers(cond, rate) {
        // New "StatBoxplotOutliers" dataset here
        points {
            x(Stat.x)
            y(Stat.y)
            color(Stat.x)
        }
    }
}

`boxplot` layer

To make a boxplot (statistical chart), we need the boxplot statistics and the boxes geom. The attributes of boxes match the boxplot statistics. We can also add outliers using the boxplotOutliers statistic and a points layer.

val manualBoxplot = df.plot {
    statBoxplot(cond, rate) {
        boxes {
            // All positional aesthetics match boxplot statistics
            x(Stat.x)
            yMin(Stat.min)
            lower(Stat.lower)
            middle(Stat.middle)
            upper(Stat.upper)
            yMax(Stat.max)
        }
    }
    statBoxplotOutliers(cond, rate) {
        points {
            x(Stat.x)
            y(Stat.y)
        }
    }
    layout {
        title = "`statBoxplot` + `boxes` \n" +
                "and `statBoxplotOutliers` + `points`"
    }
}
manualBoxplot

But we can do it much faster with boxplot(statBoxplotArgs) method:

val boxplotPlot = df.plot {
    // Statistical boxplot layer - receives "statBoxplotArgs" and has default mappings
    boxplot(cond, rate)
    layout.title = "`boxplot()`"
}
boxplotPlot

plotGrid(listOf(manualBoxplot, boxplotPlot))

These two plots are identical. Indeed, the statistical boxplot layer just uses the combination of statistics and layers above (statBoxplot + boxes and statBoxplotOutliers + points) and performs the coordinate mappings under the hood. We can also customize the statistical boxplot layer: boxplot() optionally opens a new context, where we can configure both boxes and outliers (as in the usual contexts opened by boxes { ... } / points { ... }). Moreover, the Stat.* columns of the StatBoxplot dataset are available in the context of boxes, just as the Stat.* columns of StatBoxplotOutliers are available in the context of outliers. We can also hide outliers.

df.plot {
    boxplot(cond, rate) {
        boxes {
            // Boxes context + StatBoxplot context
            // filling color depends on `x` category
            fillColor(Stat.x)
        }
        // hide outliers
        outliers.show = false
    }
}

df.plot {
    boxplot(cond, rate) {
        boxes {
            fatten = 0.5
            alpha = 0.6
            // Border line color depends on `x` category
            borderLine.color(Stat.x)
        }
        outliers {
            // points context + StatBoxplotOutliers context
            // color depends on `x` category
            color(Stat.x)
            symbol = Symbol.ASTERIX
        }
    }
}

A boxplot layer for a single sample (without x categories) receives only one sample (an Iterable or a column reference):

plot {
    boxplot(rateC)
}

`boxplot` plot

boxplot(statBoxplotArgs) and DataFrame.boxplot(statBoxplotArgs) are a family of functions for quickly plotting a boxplot.

// There's an additional argument "showOutliers"
df.boxplot("cond", "rate", showOutliers = false)

If you want to provide inputs using the column selection DSL, it differs slightly from the usual one: you should assign the x and y inputs by invoking the eponymous functions:

df.boxplot(whiskerIQRRatio = 2.0) {
    x(cond)
    y(rate)
}

A boxplot plot can be configured with the .configure {} extension. It opens a context similar to the one created by the statistical boxplot layer, where you can configure boxes and outliers in the same way, as well as any general plot adjustments:

df.boxplot {
    x(cond)
    y(rate)
}.configure {
    // BoxplotLayer + PlotBuilder
    // can't add new layer but can configure `boxes` and `outliers`
    boxes {
        alpha = 0.7
        fillColor(Stat.middle) { scale = continuous(Color.GREEN..Color.RED) }
    }
    outliers {
        color(Stat.x)
        // jittered outliers
        position = Position.jitter(0.1, 0.0)
    }
    // can configure general plot adjustments
    layout {
        title = "Configured boxplot"
        size = 600 to 350
    }
}

Grouped `statBoxplot`

Sometimes you need to group data within x categories. statBoxplot can be applied to grouped data: the statistics are computed for each group independently (and, within a group, for each x category). This returns a new GroupBy dataset with the same keys as the old one but with StatBoxplot groups instead of the original ones.

// Use "mpg" dataset
val mpgDF =
    DataFrame.readCSV("https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv")
mpgDF.head()

untitled	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

// We need only three columns
val mpgShortDF = mpgDF["class", "hwy", "drv"]
mpgShortDF.head(5)

class	hwy	drv
compact	29	f
compact	29	f
compact	31	f
compact	30	f
compact	26	f

// group it by "drv"
val groupedDF = mpgShortDF.groupBy { drv }
groupedDF

drv	group
f	{drv: "f", hwy: 29, class:...}
4	{drv: "4", hwy: 26, class:...}
r	{drv: "r", hwy: 20, class:...}

Now we have a GroupBy with a signature

key: [drv]	group: DataFrame[class|hwy|drv]
"f"	"f"-Group
"4"	"4"-Group
"r"	"r"-Group

groupedDF.statBoxplot { x(`class`); y(hwy) }

drv	group
f	{Stat: {min: 23, middle: 29,...}
4	{Stat: {min: 25, middle: 25,...}
r	{Stat: {min: 16, middle: 17,...}

After applying statBoxplot, the result is still a GroupBy, but the groups have a different signature: each group has the same signature as a regular DataFrame after applying statBoxplot (i.e. StatBoxplotFrame):

key: [drv]	group: StatBoxplotFrame
"f"	"f"-Group
"4"	"4"-Group
"r"	"r"-Group

As you can see, the statBoxplot transformation was indeed applied within groups, and the grouping keys did not change.
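The shape of this grouped computation can be sketched in plain Kotlin: split the rows by the grouping key first, then compute a statistic for each x category inside every group. To keep the sketch short, only the median (the analogue of Stat.middle) is computed. The Row class, the function names and the sample values are all made up for illustration and are not taken from the mpg dataset or from Kandy.

```kotlin
// Hypothetical row type with a grouping key (drv), an x category (cls) and a y value (hwy).
data class Row(val drv: String, val cls: String, val hwy: Double)

// Median of a non-empty list.
fun median(values: List<Double>): Double {
    require(values.isNotEmpty()) { "values must not be empty" }
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2
}

// Outer map: one entry per grouping key (keys are preserved as-is).
// Inner map: one entry per x category found in that group.
fun groupedMedians(rows: List<Row>): Map<String, Map<String, Double>> =
    rows.groupBy { it.drv }
        .mapValues { (_, group) ->
            group.groupBy { it.cls }
                .mapValues { (_, inCategory) -> median(inCategory.map { it.hwy }) }
        }

// Made-up sample rows, for illustration only.
val sampleRows = listOf(
    Row("f", "compact", 29.0),
    Row("f", "compact", 31.0),
    Row("f", "compact", 30.0),
    Row("4", "compact", 26.0),
    Row("4", "suv", 17.0),
    Row("4", "suv", 19.0),
    Row("r", "suv", 20.0)
)
```

Calling groupedMedians(sampleRows) keeps the three keys "f", "4" and "r" in their original order; the "4" group, for instance, gets one median per category (compact = 26.0, suv = 18.0). This is why a single x category can later hold several boxes or error bars, one per group.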

groupedDF.plot {
    statBoxplot(`class`, hwy) {
        errorBars {
            x(Stat.x)
            yMin(Stat.min)
            yMax(Stat.max)
        }
    }
}

As you can see, there are two or three error bars in some x categories because we have three groups of data. To distinguish them, we need to adjust the position and add a mapping from the key to the color. Conveniently, the key is available in the context:

groupedDF.plot {
    statBoxplot(`class`, hwy) {
        errorBars {
            x(Stat.x)
            yMin(Stat.min)
            yMax(Stat.max)
            borderLine.color(drv)
            position = Position.dodge()
        }
    }
}

The statistical boxplot layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.

groupedDF.plot {
    boxplot(`class`, hwy)
}

We can customize it as before. The differences are that we have access to the key columns and that we can customize the position of boxes within a single x-coordinate, for example, to make them overlap:

groupedDF.plot {
    boxplot(`class`, hwy) {
        boxes {
            borderLine.color(drv)
            // `identity` position, i.e. boxes are overlapping
            position = Position.identity()
            alpha = 0.5
        }
        outliers.show = false
    }
}

boxplot plot for GroupBy (i.e. GroupBy.boxplot(statBoxplotArgs) extensions) works as well:

groupedDF.boxplot("class", "hwy")

groupedDF.boxplot {
    x(`class`)
    y(hwy)
}.configure {
    boxes.borderLine.color = Color.hex("#000080")
    outliers {
        color(drv)
    }
    layout {
        size = 750 to 450
        title = "Configured grouped boxplot"
    }
}

Inside `groupBy{}` plot context

We can apply groupBy modification to the initial dataset and build a boxplot with grouped data the same way:

mpgShortDF.plot {
    groupBy(drv) {
        boxplot(`class`, hwy)
    }
}
