Statistics Guide

kandy-statistics allows you to build statistical plots, i.e., plots based on statistical transformations of data. With them, you can explore your data more effectively and visualize important statistical observations.

How do statistics work?

The process of statistical transformation is straightforward. You start with a dataset, which can be a single List or a whole DataFrame. A statistic consumes one or more sets of values (List, DataColumn) from this dataset and produces a new dataset with the transformed data. That new dataset is then used for visualization. Kandy has an API for working with this dataset explicitly, as well as simplified APIs for quick plotting.

`statBin` anatomy example

Let's look at an example. The bin statistic is one of the most widely used: it splits observations into bins and counts the number of observations in each one. It is used to construct one of the most common statistical plots, the histogram. But before we build a histogram, let's examine the statistic itself.

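import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution

// Generate a sample from the standard normal distribution
val sample = NormalDistribution().sample(1000).toList()
// Generate weights from the uniform distribution on [0, 1]
val weights = UniformRealDistribution(0.0, 1.0).sample(1000).toList()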

Now let's call statBin and walk through its arguments:

val statBinData = statBin(
    sample, // Pass a sample as an input
    null, // Don't provide weights
    BinsOption.byNumber(20), // Set the number of bins
    BinsAlign.center(0.0) // Set the alignment of bins
)

statBinData

As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBin, each row corresponds to one bin. Stat.x is the column with the centers of the bins, and Stat.count contains the number of observations in each bin. Stat.countWeighted is the weighted version of count; since we did not pass weights, it differs from Stat.count only in type (Double instead of Int), and the values are the same. There are also Stat.density and Stat.densityWeighted, which contain the empirically estimated density (unweighted and weighted, respectively) of the sample at the bin centers.
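To make these columns concrete, here is a minimal plain-Kotlin sketch of what a bin statistic computes. It is not Kandy's implementation: the names `BinRow` and `binSketch` are hypothetical, bins are assumed to have a fixed width with one bin centered at `center`, and density is assumed to be normalized so that the bar areas sum to 1.

```kotlin
import kotlin.math.floor

// Hypothetical row type mirroring Stat.x, Stat.count, Stat.countWeighted and Stat.density.
data class BinRow(val x: Double, val count: Int, val countWeighted: Double, val density: Double)

// Sketch of a bin statistic with fixed-width bins, one of which is centered at `center`.
fun binSketch(
    sample: List<Double>,
    weights: List<Double>? = null,
    width: Double,
    center: Double = 0.0
): List<BinRow> {
    require(sample.isNotEmpty() && width > 0.0)
    require(weights == null || weights.size == sample.size)

    // Bin k covers [center + (k - 0.5) * width, center + (k + 0.5) * width).
    fun binIndex(v: Double) = floor((v - center) / width + 0.5).toInt()

    val first = sample.minOf { binIndex(it) }
    val last = sample.maxOf { binIndex(it) }
    val counts = IntArray(last - first + 1)
    val weighted = DoubleArray(last - first + 1)

    sample.forEachIndexed { i, v ->
        val k = binIndex(v) - first
        counts[k] += 1
        // Without weights every observation contributes 1.0, so countWeighted equals count.
        weighted[k] += weights?.get(i) ?: 1.0
    }

    val n = sample.size
    return counts.indices.map { k ->
        BinRow(
            x = center + (first + k) * width,       // bin center
            count = counts[k],
            countWeighted = weighted[k],
            density = counts[k] / (n * width)       // assumed normalization: areas sum to 1
        )
    }
}
```

Calling `binSketch(listOf(-0.4, 0.2, 0.3, 1.4), width = 1.0)` yields two rows, one per bin, with the same kind of columns as the Stat group shown above.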

Awesome! But what about plotting?

As mentioned earlier, statBin is used to plot a histogram. Now that we have the new dataset, building one is easy: a classic histogram needs bars with x mapped to the bin center (Stat.x) and y mapped to the bin count (Stat.count):

statBinData.plot {
    bars {
        x(Stat.x)
        y(Stat.count)
    }
    layout.title = "Our awesome histogram!"
}

Statistics APIs

Stat-transform API

The "Stat-transform" API allows you to transform a dataset right inside PlotBuilder, calculating stats on the fly. It is essentially a set of extensions for PlotBuilder that take the usual statistics arguments (input samples, weights and parameters) but also open a new context. New layers can be created in this context as usual, but inside it they work with a new dataset: the result of the statistical transformation.

val df = dataFrameOf("sample" to sample, "weights" to weights)

df.plot {
    statBin(sample, weights, binsOption = BinsOption.byWidth(0.25)) {
        // New `StatBin` dataset inside this context
        line {
            // The original dataset is not available here, so we use the `Stat.` columns of the new one
            x(Stat.x)
            y(Stat.density)
        }
    }
    // Dataset hasn't changed here, so we can use it in the usual way
    vLine {
        xIntercept.constant(sample.mean())
        width = 3.0
        color = Color.RED
    }
}
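The scoping behaviour in this example can be illustrated with a toy DSL built on Kotlin lambdas with receivers. This is only a sketch of the idea, not how Kandy is implemented: `LayerScope`, `PlotScope`, `plotSketch`, `stat` and `countDistinct` are all hypothetical names, and columns are modelled as a plain `Map` from column name to values.

```kotlin
// Hypothetical layer scope: mappings are resolved against the dataset visible to the layer.
class LayerScope(private val columns: Map<String, List<Double>>) {
    val mappings = mutableMapOf<String, List<Double>>()
    fun x(column: String) { mappings["x"] = columns.getValue(column) }
    fun y(column: String) { mappings["y"] = columns.getValue(column) }
}

// Hypothetical plot scope holding one dataset and the layers created against it.
class PlotScope(private val columns: Map<String, List<Double>>) {
    val layers = mutableListOf<LayerScope>()

    fun line(block: LayerScope.() -> Unit) {
        layers.add(LayerScope(columns).apply(block))
    }

    // Opens a nested scope whose layers see only the transformed dataset.
    fun stat(
        transform: (Map<String, List<Double>>) -> Map<String, List<Double>>,
        block: PlotScope.() -> Unit
    ) {
        layers.addAll(PlotScope(transform(columns)).apply(block).layers)
    }
}

fun plotSketch(columns: Map<String, List<Double>>, block: PlotScope.() -> Unit): PlotScope =
    PlotScope(columns).apply(block)

// Toy statistic: counts how often each distinct value occurs.
fun countDistinct(values: List<Double>): Map<String, List<Double>> {
    val xs = values.distinct().sorted()
    return mapOf(
        "Stat.x" to xs,
        "Stat.count" to xs.map { x -> values.count { it == x }.toDouble() }
    )
}

fun statContextDemo(): PlotScope =
    plotSketch(mapOf("sample" to listOf(1.0, 2.0, 2.0))) {
        stat({ cols -> countDistinct(cols.getValue("sample")) }) {
            line { x("Stat.x"); y("Stat.count") } // sees only the transformed columns
        }
        line { x("sample"); y("sample") }         // sees the original columns
    }
```

In this sketch, mapping `x("sample")` inside the `stat` block fails with an exception, because the nested scope holds only the transformed columns.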

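Stat-layers API

The "Stat-layers" API provides layers that compute a statistic and draw the result in one step. For statBin, this layer is histogram: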

plot {
    // Equal to `statBin` + `bars` + x/y mappings on Stat.x/Stat.count
    histogram(sample)
}

The result is the same as combining statBin with bars, but with about a third of the code. That doesn't mean we lose flexibility. First, .histogram() takes all the same arguments as .statBin(), so we keep full control over how the statistic is computed. Second, it optionally opens a new context that is a union of the bars and statBin contexts, which lets you customize the bars (including overriding the default mappings).

plot {
    histogram(sample, weights, binsAlign = BinsAlign.center(0.0)) {
        // This context combines `bars` and `statBin` context; that means we can
        // make `bars` mappings and use `Stat.` columns.
        // By default, `Stat.count` is mapped on `y` if weights are not provided;
        // however, we can easily override mapping to `y`, for example, from `Stat.density`
        y(Stat.density)
        fillColor(Stat.density) {
            scale = continuous(Color.GREEN..Color.RED)
        }
    }
    x.axis.limits = -3.5..3.5
}

Stat-plots API

The "Stat-plots" API allows you to build a histogram even faster, with just one function. Usually it is a function or a set of extensions for a DataFrame that takes the standard statistic arguments (inputs, weights, parameters).

histogram(sample)

df.histogram("sample", binsOption = BinsOption.byNumber(10))

df.histogram {
    x(sample)
    weight(weights)
}

Stat plots can also be configured. We can set layer mappings and settings exactly as in a stat layer, and also change the general settings of the plot. The .configure() extension is used for this purpose. It opens a context that combines several contexts you are already familiar with: the stat context, the layer context, and the plot context:

df.histogram(BinsOption.byNumber(14), BinsAlign.boundary(0.0)) {
    x(sample)
}.configure {
    // StatBin + Bars + Plot contexts
    x.axis.limits = -3.5..3.5
    y(Stat.density)
    borderLine.color = Color.BLACK
    layout.title = "Configured histogram"
}
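The example above passes BinsAlign.boundary(0.0), while earlier examples used BinsAlign.center(0.0). The sketch below shows one way such alignment could determine bin edges. It is a plain-Kotlin illustration under stated assumptions, not Kandy's code: `AlignSketch` and `binEdges` are hypothetical names, `Center(pos)` is assumed to mean that some bin is centered at `pos`, and `Boundary(pos)` is assumed to mean that some bin edge lies at `pos`.

```kotlin
import kotlin.math.floor

// Hypothetical stand-ins for the two alignment modes.
sealed class AlignSketch {
    data class Center(val pos: Double) : AlignSketch()
    data class Boundary(val pos: Double) : AlignSketch()
}

// Returns edges of fixed-width bins that cover [min, max] and respect the alignment.
fun binEdges(min: Double, max: Double, width: Double, align: AlignSketch): List<Double> {
    require(width > 0.0 && max >= min)

    // A position that must coincide with some bin edge.
    val anchor = when (align) {
        is AlignSketch.Boundary -> align.pos
        is AlignSketch.Center -> align.pos - width / 2
    }

    // Shift the anchor by a whole number of widths so the first edge is at or below `min`.
    var k = floor((min - anchor) / width).toInt()
    val edges = mutableListOf(anchor + k * width)

    // Keep adding edges until the last one lies strictly above `max`.
    while (edges.last() <= max) {
        k += 1
        edges.add(anchor + k * width)
    }
    return edges
}
```

With a range of -1.0 to 1.0 and a width of 1.0, boundary alignment at 0.0 puts an edge exactly at 0.0, whereas center alignment at 0.0 produces a bin spanning -0.5 to 0.5.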

Statistics and grouped data

Everything described above works with grouped data as well. Statistics are calculated independently inside each group, with some exceptions: for example, when plotting a histogram, we want the bin centers to be equal across groups so that the bars line up. Thus, a statistical transformation of a GroupBy returns a GroupBy with the same keys, but with Stat dataframes in place of the original group datasets.

// Generate two samples from a normal distribution with different mean/std
val sampleA = NormalDistribution(1.5, 1.0).sample(1000).toList()
val sampleB = NormalDistribution(4.0, 2.0).sample(1000).toList()

// Gather them into a `DataFrame` with "A" and "B" keys in the "type" column
val dfAB = dataFrameOf(
    "sample" to sampleA + sampleB,
    "type" to List(1000) { "A" } + List(1000) { "B" }
)

val gbAB = dfAB.groupBy { type }
gbAB

gbAB.statBin("sample")

Output (first rows of the Stat dataframe in each group):

type: A
group (Stat):
{ x: -2.8, count: 0, countWeigh..., ... }
{ x: -2.1, count: 0, countWeigh..., ... }
{ x: -1.4, count: 4, countWeigh..., ... }
{ x: -0.7, count: 22, countWeigh..., ... }
{ x: 0.0, count: 99, countWeigh..., ... }

type: B
group (Stat):
{ x: -2.8, count: 2, countWeigh..., ... }
{ x: -2.1, count: 2, countWeigh..., ... }
{ x: -1.4, count: 2, countWeigh..., ... }
{ x: -0.7, count: 8, countWeigh..., ... }
{ x: 0.0, count: 20, countWeigh..., ... }

As you can see, the statBin transformation was applied within each group, and the grouping keys did not change.
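The idea of counting per group on a shared set of bins can be sketched in plain Kotlin. This is an illustration only, not Kandy's implementation: `groupedBinCounts` is a hypothetical function, and it assumes fixed-width bins whose grid is derived from the whole dataset so that every group gets the same bin centers.

```kotlin
import kotlin.math.floor

// Hypothetical sketch: per-group bin counts on a bin grid shared by all groups.
// Returns, for each key, a list of (bin center, count) pairs.
fun groupedBinCounts(
    values: List<Double>,
    keys: List<String>,
    width: Double,
    center: Double = 0.0
): Map<String, List<Pair<Double, Int>>> {
    require(values.size == keys.size && values.isNotEmpty() && width > 0.0)

    // Bin k covers [center + (k - 0.5) * width, center + (k + 0.5) * width).
    fun binIndex(v: Double) = floor((v - center) / width + 0.5).toInt()

    // The grid comes from the whole dataset, so all groups share the same bin centers.
    val first = values.minOf { binIndex(it) }
    val last = values.maxOf { binIndex(it) }

    return values.indices
        .groupBy({ keys[it] }, { binIndex(values[it]) })
        .mapValues { (_, binIndices) ->
            // Counting happens independently inside each group.
            val counts = binIndices.groupingBy { it }.eachCount()
            (first..last).map { k -> (center + k * width) to (counts[k] ?: 0) }
        }
}
```

Bins that are empty for a particular group still appear with a count of 0, which matches the zero-count rows in the output above.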

gbAB.plot {
    statBin(sample) {
        bars {
            x(Stat.x)
            y(Stat.count)
            fillColor(type)
            borderLine.width = 0.0
            position = Position.dodge()
        }
        line {
            x(Stat.x)
            y(Stat.count)
            color(type)
        }
    }
}

This also works for the histogram layer. Moreover, if there is exactly one grouping key, it is mapped to fillColor by default:

gbAB.plot {
    histogram(sample)
}

gbAB.plot {
    histogram(sample, binsOption = BinsOption.byNumber(12)) {
        fillColor(type)
        borderLine.color = Color.BLACK
        position = Position.stack()
    }
}

GroupBy also has a .histogram() extension that works exactly like the one for DataFrame and can be configured in the same way:

gbAB.histogram("sample")

gbAB.histogram(BinsOption.byNumber(20), binsAlign = BinsAlign.boundary(0.0)) {
    x(sample)
}.configure {
    fillColor(type) {
        scale = categorical(listOf(Color.GREEN, Color.ORANGE))
    }
    layout {
        size = 650 to 350
        title = "Configured grouped histogram!"
    }
}
