kandy 0.7.0 Help

Heatmap

The "count2d" statistic is calculated on a sample of two categorical variables (usually provided as two single-variable samples, x and y). It counts the number of observations in each pair of x-category and y-category. The statistic is weighted: for each pair, a weighted count is also calculated, in which every observation contributes its weight rather than 1.

This notebook uses definitions from DataFrame.

Usage

"Count2D" plots give a visual representation of the two-variable discrete sample distribution.

Arguments

  • Input (mandatory):

    • x: x-part of the input sample

    • y: y-part of the input sample

  • Weights (optional):

    • weights: set of weights of the same size as the input samples. null (the default) means that all weights are equal to 1.0, so the weighted count equals the ordinary count

Generalized signature

The specific signature depends on the function, but all functions related to "count2d" statistic (which will be discussed further below — different variations of statCount2D(), heatmap()) have approximately the same signature with the arguments above:

statCount2DArgs := x, y, weights = null

The possible types of x, y and weights depend on where a given function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself. Elements of x have type X and elements of y have type Y; both are generic type parameters.

Output statistics

| name | type | description |
| --- | --- | --- |
| Stat.x | X | x-category |
| Stat.y | Y | y-category |
| Stat.count | Int | Number of observations in this category |
| Stat.countWeighted | Double | Weighted count (sum of observations weights in this category) |
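To make the four output columns concrete, here is a minimal plain-Kotlin sketch of how such a statistic could be computed. It is only an illustration and not Kandy's implementation: the names Count2DRow and count2d, as well as the sample values in main, are hypothetical.

```kotlin
// Illustrative model of the "count2d" statistic; NOT Kandy's actual implementation.
// Count2DRow and count2d are hypothetical names introduced for this sketch.
data class Count2DRow<X, Y>(val x: X, val y: Y, val count: Int, val countWeighted: Double)

fun <X, Y> count2d(x: List<X>, y: List<Y>, weights: List<Double>? = null): List<Count2DRow<X, Y>> {
    require(x.size == y.size) { "x and y must have the same size" }
    require(weights == null || weights.size == x.size) { "weights must match the sample size" }

    // LinkedHashMap keeps the pairs in order of first appearance.
    val counts = LinkedHashMap<Pair<X, Y>, Int>()
    val sums = LinkedHashMap<Pair<X, Y>, Double>()
    for (i in x.indices) {
        val key = x[i] to y[i]
        val w = weights?.get(i) ?: 1.0 // null weights: every observation weighs 1.0
        counts[key] = (counts[key] ?: 0) + 1
        sums[key] = (sums[key] ?: 0.0) + w
    }
    return counts.map { (key, n) -> Count2DRow(key.first, key.second, n, sums.getValue(key)) }
}

fun main() {
    // Hypothetical sample: two categorical inputs plus one weight per observation.
    val rows = count2d(
        x = listOf("compact", "compact", "suv"),
        y = listOf("f", "f", "r"),
        weights = listOf(29.0, 31.0, 17.0),
    )
    rows.forEach(::println)
}
```

When weights is omitted, countWeighted in this sketch is simply count converted to a Double, which mirrors the default described in the Arguments section.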

StatCount plots

// Use "mpg" dataset
val mpgDF = DataFrame.readCSV("https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv")
mpgDF.head(5)

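| untitled | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | audi | a4 | 18,0 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 2 | audi | a4 | 18,0 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 3 | audi | a4 | 2,0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 4 | audi | a4 | 2,0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 5 | audi | a4 | 28,0 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |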

// We need only three columns
val df = mpgDF["class", "drv", "hwy"]
df.head(5)

| class | drv | hwy |
| --- | --- | --- |
| compact | f | 29 |
| compact | f | 29 |
| compact | f | 31 |
| compact | f | 30 |
| compact | f | 26 |

It has the following signature: class, drv, hwy.

Let's take a look at StatCount2D output DataFrame:

df.statCount2D("class", "drv", "hwy")

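| Stat.x | Stat.y | Stat.count | Stat.countWeighted |
| --- | --- | --- | --- |
| compact | f | 35 | 1020 |
| compact | 4 | 12 | 310 |
| midsize | 4 | 3 | 72 |
| suv | r | 11 | 192 |
| 2seater | r | 5 | 124 |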

It has the following signature: a Stat column group containing x, y, count and countWeighted.

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statCount2D, each row corresponds to one pair of categories. Stat.x is the column with the pair's x-category, and Stat.y is the column with its y-category. Stat.count contains the number of observations in the pair, and Stat.countWeighted contains the weighted version of that count. A DataFrame with "count2D" statistics is called a StatCount2DFrame.

statCount2D plot transform

statCount2D(statCount2DArgs) { /*new plotting context*/ } modifies the plotting context. Inside the new context, the original data (whether or not it was empty) is replaced by a new statCount2D dataset calculated on the given arguments. Inputs and weights can be provided as an Iterable or as a dataset column reference: by name as a String, as a ColumnReference, or as a DataColumn. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statCount2D context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be called inside the new context:

df.plot {
    statCount2D(`class`, drv) {
        // New `StatCount2D` dataset here
        points {
            // Use `Stat.*` columns for mappings
            x(Stat.x) { axis.expand(0.0, 0.5) }
            y(Stat.y)
            size(Stat.count) {
                scale = continuous(10.0..30.0)
            }
            color = Color.RED
        }
    }
}
StatCount2D with Points

Heatmap layer

A heatmap is a statistical plot used for visualizing the distribution of a sample of two categorical variables. It is a tile plot where each tile represents one pair of categories: its x coordinate corresponds to the x-category, its y coordinate to the y-category, and its color to the count of this pair. So basically, we can build a heatmap with statCount2D as follows:

val statCount2DAndTilePlot = df.plot {
    statCount2D("class", "drv") {
        tiles {
            x(Stat.x)
            y(Stat.y)
            fillColor(Stat.count)
        }
    }
    layout.title = "`statCount2D()` + `tile()` layer"
}
statCount2DAndTilePlot
StatCount Tile Plot
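The tile layout described above can also be pictured without any plotting at all. The following plain-Kotlin sketch (no Kandy involved) arranges pair counts into the grid that the tiles encode. The function name countGrid and the sample values are hypothetical, and filling absent pairs with 0 is a choice made for this sketch only.

```kotlin
// Illustrative only: arranges pair counts into the grid that a heatmap's tiles encode.
// countGrid is a hypothetical helper, not part of Kandy.
fun <X, Y> countGrid(x: List<X>, y: List<Y>): Triple<List<X>, List<Y>, List<List<Int>>> {
    require(x.size == y.size) { "x and y must have the same size" }
    val xs = x.distinct() // column categories, in order of first appearance
    val ys = y.distinct() // row categories, in order of first appearance
    val counts = x.zip(y).groupingBy { it }.eachCount()
    // One row per y-category, one cell per x-category; pairs that never occur get 0 here.
    val grid = ys.map { yc -> xs.map { xc -> counts[xc to yc] ?: 0 } }
    return Triple(xs, ys, grid)
}

fun main() {
    // Hypothetical sample values, loosely modelled on the `class` and `drv` columns.
    val (xs, ys, grid) = countGrid(
        x = listOf("compact", "compact", "suv", "suv", "suv"),
        y = listOf("f", "4", "4", "4", "r"),
    )
    println("    " + xs.joinToString(" ") { it.padStart(8) })
    ys.forEachIndexed { i, yc ->
        println(yc.padEnd(4) + grid[i].joinToString(" ") { it.toString().padStart(8) })
    }
}
```

Each printed cell plays the role of one tile: its column is the x-category, its row is the y-category, and its number is what the fill color would represent.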

But we can do it even faster with heatmap(statCount2DArgs) method:

val heatmapLayerPlot = df.plot {
    heatmap(`class`, drv)
    layout.title = "`heatmap()` layer"
}
heatmapLayerPlot
Heatmap Plot

Let's compare them:

plotGrid(listOf(statCount2DAndTilePlot, heatmapLayerPlot))
Compare StatCount Tile and Heatmap

These two plots are identical. Indeed, heatmap just uses statCount2D and tiles and performs the coordinate and fillColor mappings under the hood. We can also customize the heatmap layer: heatmap() optionally opens a new context where we can configure tiles (as in the usual context opened by tiles { ... }) and even change the default mappings. The StatCount2D dataset of the heatmap can also be accessed here.

df.plot {
    heatmap(`class`, drv) {
        // Swap coordinate mappings:
        x(Stat.y)
        y(Stat.x)
        // Default mapping but with custom scale
        fillColor(Stat.count) {
            scale = continuousColorBrewer(BrewerPalette.Sequential.Reds)
        }
    }
}
Heatmap with Continuous Color Brewer

If we specify weights, Stat.countWeighted is mapped to fillColor by default:

df.plot { heatmap(`class`, drv, hwy) }
Default color Mapping for Heatmap

heatmap plot

heatmap(statCount2DArgs) and DataFrame.heatmap(statCount2DArgs) are a family of functions for quickly plotting a heatmap.

heatmap(
    listOf("A", "A", "A", "B", "B", "C", "B", "B"),
    listOf(1, 1, 1, 2, 1, 2, 1, 2),
)
Heatmap on Iterable
df.heatmap("class", "drv")
Simple Heatmap
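As a cross-check of what the Iterable example above asks the statistic to count, the pair counts for those two lists can be worked out with the Kotlin standard library alone. This is only a sketch of the counting step, not of how Kandy computes it; pairCounts is a hypothetical helper name.

```kotlin
// Illustrative only: counts how often each (x, y) pair occurs, using the standard library.
// pairCounts is a hypothetical helper, not part of Kandy.
fun <X, Y> pairCounts(x: List<X>, y: List<Y>): Map<Pair<X, Y>, Int> =
    x.zip(y).groupingBy { it }.eachCount()

fun main() {
    // Same inputs as the Iterable heatmap example.
    val x = listOf("A", "A", "A", "B", "B", "C", "B", "B")
    val y = listOf(1, 1, 1, 2, 1, 2, 1, 2)

    pairCounts(x, y).forEach { (pair, n) ->
        println("x=${pair.first}, y=${pair.second}: count=$n")
    }
    // prints:
    // x=A, y=1: count=3
    // x=B, y=2: count=2
    // x=B, y=1: count=2
    // x=C, y=2: count=1
}
```

So the sample contains four distinct pairs, which is the number of tiles one would expect for these inputs.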

If you want to provide inputs and weights using the column selection DSL, the syntax differs a bit from the usual one: you assign the x and y inputs and (optionally) the weight by invoking the eponymous functions:

df.heatmap {
    x(`class`)
    y(drv)
    weight(hwy)
}
Heatmap with Weight

A heatmap plot can be configured with the .configure {} extension. It opens a context that combines the tile, StatCount2D and plot contexts, so you can configure tile settings, mappings that use the StatCount2D dataset, and any plot adjustments:

df.heatmap {
    x(`class`)
    y(drv)
    weight(hwy)
}.configure {
    // Tile + StatCount2D + PlotBuilder
    // Can't add new layer
    // Can add tile mapping, including for `Stat.*` columns
    fillColor(Stat.count) {
        scale = continuous(Color.GREEN..Color.RED)
    }
    alpha = 0.6
    // Can configure general plot adjustments
    layout {
        title = "Configured `heatmap` plot"
        size = 600 to 350
    }
}
Configured Heatmap
Last modified: 15 July 2024