kandy 0.8.0-RC1 Help

Count Plot

The "count" statistic is calculated on a sample of a single categorical variable: it counts the number of observations in each category. A weighted variant is calculated as well: each observation within a category is counted along with its weight, so the weighted count of a category is the sum of the weights of its observations.

This notebook uses definitions from DataFrame.

Usage

"Count" is one of the most important statistics and has many uses. The count plot provides a graphical depiction of how observations are distributed across categories.

Arguments

  • Input (mandatory):

    • x — discrete sample on which the statistics are calculated

  • Weights (optional):

    • weights — set of weights of the same size as the input sample. null (by default) means all weights are equal to 1.0 and the weighted count is equal to the normal one

Generalized signature

The specific signature depends on the function, but all functions related to the "count" statistic (the variations of statCount() and countPlot() discussed below) have approximately the same signature with the arguments above:

statCountArgs := x, weights = null

The possible types of x and weights depend on where a certain function is used. They can be a plain Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself. The elements of x have type X, a generic type parameter.

Output statistics

name                type    description
Stat.x              X       Category
Stat.count          Int     Number of observations in this category
Stat.countWeighted  Double  Weighted count (sum of observation weights in this category)
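
To make the three output columns concrete, here is a minimal plain-Kotlin sketch of what the "count" statistic computes. It is not kandy's implementation: `CountRow` and `countSketch` are hypothetical names introduced only for illustration, and the weights in `main` are made up.

```kotlin
// One output row of the sketch, mirroring Stat.x, Stat.count and Stat.countWeighted.
data class CountRow<X>(val x: X, val count: Int, val countWeighted: Double)

// Hypothetical helper, not part of kandy: counts observations per category.
// `weights == null` means every observation has weight 1.0.
fun <X> countSketch(x: List<X>, weights: List<Double>? = null): List<CountRow<X>> {
    require(weights == null || weights.size == x.size) {
        "weights must have the same size as x"
    }
    val counts = LinkedHashMap<X, Int>()       // keeps first-seen category order
    val weighted = LinkedHashMap<X, Double>()
    x.forEachIndexed { i, category ->
        counts[category] = (counts[category] ?: 0) + 1
        weighted[category] = (weighted[category] ?: 0.0) + (weights?.get(i) ?: 1.0)
    }
    return counts.map { (category, n) -> CountRow(category, n, weighted.getValue(category)) }
}

fun main() {
    val sample = listOf("A", "A", "A", "B", "B", "C", "B", "B")
    // Without weights: A -> 3 / 3.0, B -> 4 / 4.0, C -> 1 / 1.0
    println(countSketch(sample))
    // Made-up weights, one per observation: A -> 2.0, B -> 6.0, C -> 3.0
    val w = listOf(0.5, 0.5, 1.0, 2.0, 2.0, 3.0, 1.0, 1.0)
    println(countSketch(sample, w))
}
```

Without weights, the weighted count equals the plain count, which matches the default described in the arguments above.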

StatCount plots

// Use "mpg" dataset
val mpgDF = DataFrame.readCSV("https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv")
mpgDF.head(5)

untitled  manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
1         audi          a4     18,0   1999  4    auto(l5)    f    18   29   p   compact
2         audi          a4     18,0   1999  4    manual(m5)  f    21   29   p   compact
3         audi          a4     2,0    2008  4    manual(m6)  f    20   31   p   compact
4         audi          a4     2,0    2008  4    auto(av)    f    21   30   p   compact
5         audi          a4     28,0   1999  6    auto(l5)    f    16   26   p   compact

// We need only three columns
val df = mpgDF["class", "drv", "hwy"]
df.head(5)

class    drv  hwy
compact  f    29
compact  f    29
compact  f    31
compact  f    30
compact  f    26

It has the signature:

class  drv  hwy

Let's take a look at StatCount output DataFrame:

df.statCount("class", "hwy")

Stat
x        count  countWeighted
compact  47     1330
midsize  41     1119
suv      62     1124
2seater  5      124
minivan  11     246

It has the following signature:

Stat
x  count  countWeighted

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statCount, each row corresponds to one category. Stat.x is the column with this category, Stat.count contains the number of observations in the category, and Stat.countWeighted is the weighted version of count. A DataFrame with "count" statistics is called a StatCountFrame.
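
One way to read the weighted column: because hwy was passed as the weight, Stat.countWeighted is the sum of hwy within each class, so dividing it by Stat.count gives the mean hwy of that class. The sketch below works this out in plain Kotlin using only the five rows shown in the output above; `StatRow` and `meanWeight` are hypothetical names, not kandy API.

```kotlin
// Rows copied from the statCount("class", "hwy") output shown above.
data class StatRow(val x: String, val count: Int, val countWeighted: Double)

val statRows = listOf(
    StatRow("compact", 47, 1330.0),
    StatRow("midsize", 41, 1119.0),
    StatRow("suv", 62, 1124.0),
    StatRow("2seater", 5, 124.0),
    StatRow("minivan", 11, 246.0),
)

// With `hwy` as the weight, countWeighted is the sum of `hwy` in the class,
// so countWeighted / count is the mean `hwy` of that class.
fun meanWeight(row: StatRow): Double = row.countWeighted / row.count

fun main() {
    for (row in statRows) {
        // For example, 2seater: 124 / 5 = 24.8
        println("${row.x}: mean hwy = ${meanWeight(row)}")
    }
}
```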

statCount transform

statCount(statCountArgs) { /*new plotting context*/ } modifies the plotting context: inside the new context, a new StatCount dataset is used instead of the original data (whether or not it was empty). The StatCount dataset is calculated on the given arguments; inputs and weights can be provided as an Iterable or as a reference to a dataset column (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the statCount context. Since the old dataset is irrelevant inside the new context, we cannot refer to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be called inside the new context:

plot {
    statCount(df["class"]) {
        // New `StatCount` dataset here
        points {
            // Use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.count)
            size(Stat.count)
            color(Stat.x)
        }
    }
}
StatCount Points Plot

CountPlot layer

CountPlot is a statistical plot used for visualizing the distribution of categorical variables. It's a bar plot where each bar represents one category: its x coordinate corresponds to the category and its y coordinate to the count. So basically, we can build a count plot with statCount and bars as follows:

val statCountAndBarsPlt = df.plot {
    statCount("class") {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
    layout.title = "`statCount()` + `bars()` layer"
}
statCountAndBarsPlt
StatCount Bar Plot

But we can do it even faster with the countPlot(statCountArgs) method:

val countPlt = plot {
    countPlot(df["class"])
    layout.title = "`countPlot()` layer"
}
countPlt
Count Plot

Let's compare them:

plotGrid(listOf(statCountAndBarsPlt, countPlt))
Compare StatCount Bar vs Count Plot

These two plots are identical. Indeed, countPlot just uses statCount and bars and performs the coordinate mappings under the hood. We can also customize the count plot layer: countPlot() optionally opens a new context where we can configure bars (as in the usual context opened by bars { ... }) and even change the default coordinate mappings. The StatCount dataset of the count plot can also be accessed here.

df.plot {
    countPlot(`class`) {
        // filling color depends on `count` statistic
        fillColor(Stat.count) {
            scale = continuous(Color.GREEN..Color.RED)
        }
        borderLine.color = Color.BLACK
    }
}
Count Plot with Filled Color

If we specify weights, Stat.countWeighted is mapped to y by default:

df.plot {
    countPlot(`class`, hwy)
    // We can add other layers as well.
    // Let's add a horizontal mark line with constant y intercept:
    hLine {
        val criticalCount = 500
        yIntercept.constant(criticalCount)
        tooltips {
            line("Critical count: ${String.format("%d", criticalCount)}")
        }
        color = Color.RED
        width = 3.0
    }
    x.axis.name = "Car class"
}
Count Plot with Mark Line

countPlot plot

countPlot(statCountArgs) and DataFrame.countPlot(statCountArgs) are a family of functions for quickly building a count plot.

countPlot(listOf("A", "A", "A", "B", "B", "C", "B", "B"))
Count Plot on Iterable
df.countPlot("class")
Simple Count Plot

If you want to provide inputs and weights using the column selection DSL, the syntax is a bit different from the usual one: assign the x input and (optionally) the weight by invoking the eponymous functions:

df.countPlot {
    x(`class`)
    weight(hwy)
}
Count Plot with Weight

A countPlot plot can be configured with the .configure {} extension, which opens a context that combines the bars, StatCount and plot contexts. That means you can configure bar settings, mappings that use the StatCount dataset, and any plot adjustments:

df.countPlot {
    x(`class`)
}.configure {
    // Bars + StatCount + PlotBuilder
    // can't add new layer
    // can add bars mapping, including for `Stat.*` columns
    fillColor(Stat.x)
    alpha = 0.6
    // can configure general plot adjustments
    layout {
        title = "Configured `countPlot` plot"
        size = 600 to 350
    }
}
Configured Count Plot

Grouped statCount

statCount can be applied to grouped data: statistics are calculated on each group independently but with equal categories. The result is a new GroupBy dataset with the same keys as the old one but with StatCount groups instead of the original ones.

// group our dataframe by `drv` column
val groupedDF = df.groupBy { drv }
groupedDF

drv  group
f    class    drv  hwy
     compact  f    29
     compact  f    29
     compact  f    31
     compact  f    30
     compact  f    26
4    class    drv  hwy
     compact  4    26
     compact  4    25
     compact  4    28
     compact  4    27
     compact  4    25
r    class    drv  hwy
     compact  r    20
     compact  r    15
     compact  r    20
     compact  r    17
     compact  r    17

Now we have a GroupBy with the signature:

key: [drv]  group: DataFrame[class|drv|hwy]
"f"         "f"-Group
"4"         "4"-Group
"r"         "r"-Group

groupedDF.statCount { x(`class`) }

drv  group
f    Stat
     { x: compact, count: 35, countWeighted: 35}
4    Stat
     { x: compact, count: 12, countWeighted: 12}
r    Stat
     { x: suv, count: 11, countWeighted: 11}

After applying statCount, it is still a GroupBy, but the signature of its groups is different: every group has the same signature as a regular DataFrame after applying statCount (i.e. StatCountFrame):

key: [drv]  group: StatCountFrame
"f"         "f"-Group
"4"         "4"-Group
"r"         "r"-Group

As you can see, the statCount transformation was indeed applied within groups, and the grouping keys did not change.
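
The per-group behaviour can be sketched in plain Kotlin as "group by key, then count categories inside each group". This is only an illustration under stated assumptions: `Car` and `groupedCountSketch` are hypothetical names, the rows are made up, and the sketch does not try to reproduce how kandy aligns categories across groups.

```kotlin
// Hypothetical rows standing in for the `class` and `drv` columns of the DataFrame.
data class Car(val carClass: String, val drv: String)

// Hypothetical helper, not kandy's implementation: group by the key, then count
// categories inside each group independently. The keys are preserved as map keys.
fun groupedCountSketch(rows: List<Car>): Map<String, Map<String, Int>> =
    rows.groupBy { it.drv }
        .mapValues { (_, group) -> group.groupingBy { it.carClass }.eachCount() }

fun main() {
    // Made-up data, not taken from the mpg dataset.
    val rows = listOf(
        Car("compact", "f"), Car("compact", "f"), Car("midsize", "f"),
        Car("compact", "4"), Car("suv", "4"), Car("suv", "4"),
        Car("suv", "r"), Car("2seater", "r"),
    )
    val stats = groupedCountSketch(rows)
    println(stats.keys) // the grouping keys are unchanged: [f, 4, r]
    println(stats["f"]) // counts computed only from rows of the "f" group
}
```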

The plotting process doesn't change much — we do everything the same.

groupedDF.plot {
    statCount(`class`) {
        bars {
            x(Stat.x)
            y(Stat.countWeighted)
        }
    }
}
Grouped StatCount Plot

As you can see, some categories contain several bars because we have three groups of data. To distinguish them, we need to map the key to the fill color. This is convenient because the key is available in the context:

groupedDF.plot {
    statCount(`class`) {
        bars {
            x(Stat.x)
            y(Stat.countWeighted)
            fillColor(drv)
        }
    }
}
Grouped StatCount Plot with Filled Color

The countPlot layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor will be created by default.

groupedDF.plot { countPlot("class") }
Grouped Count Plot with Default Colors

We can customize it as before. The differences are that we have access to the key columns and that we can customize the position of bars within a single x-coordinate, for example by stacking them:

groupedDF.plot {
    countPlot("class") {
        fillColor(drv) {
            scale = categorical(listOf(Color.GREEN, Color.ORANGE, Color.LIGHT_PURPLE))
        }
        borderLine.width = 0.0
        width = 1.0
        // adjust position of bars
        position = Position.stack()
    }
}
Stacked Count Plot
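
Conceptually, stacking places the bars of different groups on top of each other within one category, so each bar starts where the previous one ended. The plain-Kotlin sketch below illustrates that idea only: `Segment` and `stackSketch` are hypothetical names, the counts are made up, and the stacking order kandy actually uses is not assumed here.

```kotlin
// One stacked bar segment: it spans from yMin to yMax at category x.
data class Segment(val x: String, val group: String, val yMin: Int, val yMax: Int)

// Hypothetical sketch of stacking, not kandy's implementation:
// within each category, every group's bar starts where the previous one ended.
fun stackSketch(counts: Map<String, Map<String, Int>>): List<Segment> {
    val top = HashMap<String, Int>() // current height of the stack per category
    val segments = mutableListOf<Segment>()
    for ((group, perCategory) in counts) {
        for ((category, n) in perCategory) {
            val base = top[category] ?: 0
            segments += Segment(category, group, base, base + n)
            top[category] = base + n
        }
    }
    return segments
}

fun main() {
    // Made-up per-group counts: group key -> (category -> count).
    val counts = mapOf(
        "f" to mapOf("compact" to 2, "midsize" to 1),
        "4" to mapOf("compact" to 1, "suv" to 2),
        "r" to mapOf("suv" to 1),
    )
    // "compact" ends up with two segments: 0..2 for "f" and 2..3 for "4".
    stackSketch(counts).forEach(::println)
}
```

With Position.identity(), by contrast, every bar would start at zero, which is why overlapping bars in that mode are usually drawn with reduced alpha.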

CountPlot plot for GroupBy (i.e. GroupBy.countPlot(statCountArgs) extensions) works as well:

groupedDF.countPlot("class")
Simple Grouped Count Plot

... and can be configured the same way:

groupedDF.countPlot {
    x(`class`)
}.configure {
    alpha = 0.6
    // make the bars from different groups overlap with each other
    position = Position.identity()
    // can access key column by name as `String`
    fillColor("drv") {
        scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2)
    }
}
Configured Grouped Count Plot

Inside groupBy{} plot context

We can apply the groupBy modification to the initial dataset and build a count plot of the grouped data the same way:

df.plot { groupBy(drv) { countPlot(`class`) } }
GroupBy in Plot
Last modified: 15 July 2024