Statistics Guide
kandy-statistics allows you to build statistical plots, i.e., plots with statistical transformations of data. With them, you can explore your data more effectively and visualize important statistical observations.
How do statistics work?
The process of statistical transformation is straightforward. You have some dataset: it can be a single List or a whole DataFrame. A statistic consumes one or more sets of values (List, DataColumn) from this dataset and produces a new dataset with the transformed data. This new dataset is then used for visualization. Kandy has an API for working with this dataset explicitly, as well as simpler APIs for quick plotting.
statBin anatomy example
Let's look at an example. The bin statistic is one of the most widely used: it splits observations into bins and counts the number of observations in each one. It is used to construct one of the most common statistical plots, the histogram. But before we build a histogram, let's examine the statistic itself.
Each statistic has several types of arguments:
- Main inputs: one or more sets of values (usually named x, y, z) on which the statistic is computed. These are the only mandatory arguments. All inputs must be of the same size.
- Weights: some statistics are weighted, i.e., the weight of each element is taken into account. To pass it, the optional argument weights is used. This set must have the same size as the main inputs.
- Statistics parameters: each statistic has its own parameters, on which its calculation directly depends. All of them have a default value.
Let's go through this checklist of arguments for statBin:
- statBin consumes exactly one set of values: the sample of values to bin (x).
- It's weighted. In addition to count (i.e., the number of observations within a bin), statBin computes the countWeighted statistic, i.e., the total sum of the observation weights within a specific bin. To calculate it, pass a weights set of the same size as the sample.
- statBin has two parameters, both of which configure bins:
  - binOptions: allows you to specify either the number of bins or their width.
  - binAlign: sets the alignment of the bins.
Let's use it on our sample...
...and take a look at the output dataset:
[Output table: a DataFrame with a single column group Stat; the cell values are not preserved.]
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statBin, each row corresponds to one bin. Stat.x is the column with the centers of the bins; Stat.count contains the number of observations in each bin. Stat.countWeighted is the weighted version of count (since we did not pass weights, this column differs from the previous one only in type, Double instead of Int; the values are the same). There are also Stat.density and Stat.densityWeighted. They contain the empirically estimated density of the sample (unweighted and weighted, respectively) at the points corresponding to the centers of the bins.
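To make the columns described above concrete, here is a minimal plain-Kotlin sketch of an equal-width bin statistic. It does not use Kandy: BinRow, binStat, and binsNumber are hypothetical names introduced for illustration, and the density formula (count divided by the sample size times the bin width) is an assumption about how an empirical density is commonly estimated, not a statement of the library's exact algorithm.

```kotlin
// Hypothetical row type mirroring the columns described above; not Kandy's actual class.
data class BinRow(
    val x: Double,               // bin center
    val count: Int,              // number of observations in the bin
    val countWeighted: Double,   // sum of observation weights in the bin
    val density: Double,         // assumed: count / (n * binWidth)
    val densityWeighted: Double  // assumed: countWeighted / (totalWeight * binWidth)
)

// Illustrative equal-width binning over [min, max] of the sample.
// Assumes a non-empty sample with max > min and a positive total weight.
fun binStat(
    sample: List<Double>,
    weights: List<Double> = List(sample.size) { 1.0 },
    binsNumber: Int = 4
): List<BinRow> {
    require(sample.isNotEmpty() && sample.size == weights.size)
    val min = sample.minOf { it }
    val max = sample.maxOf { it }
    require(max > min && binsNumber > 0)
    val width = (max - min) / binsNumber
    val counts = IntArray(binsNumber)
    val weighted = DoubleArray(binsNumber)
    for ((value, w) in sample.zip(weights)) {
        // The maximum value is placed in the last bin instead of opening a new one.
        val i = ((value - min) / width).toInt().coerceAtMost(binsNumber - 1)
        counts[i] += 1
        weighted[i] += w
    }
    val totalWeight = weights.sum()
    return List(binsNumber) { i ->
        BinRow(
            x = min + (i + 0.5) * width,
            count = counts[i],
            countWeighted = weighted[i],
            density = counts[i] / (sample.size * width),
            densityWeighted = weighted[i] / (totalWeight * width)
        )
    }
}

fun main() {
    val sample = listOf(0.0, 1.0, 1.5, 2.0, 3.5, 4.0)
    binStat(sample, binsNumber = 4).forEach(::println)
}
```

When no weights are passed, every weight defaults to 1.0 in this sketch, so countWeighted holds the same values as count but as Double, which matches the behavior described above.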
Awesome! But what about plotting?
As mentioned earlier, statBin is used to plot a histogram. Now that we have the new dataset, building one is easy: a classic histogram needs bars with x = bin center (i.e., Stat.x) and y = bin count (i.e., Stat.count):
Of course, we won't need to explicitly calculate a new dataset every time. Moreover, we will not need to define the histogram manually again each time. There are different types of APIs for this purpose, which are described in the next chapter.
Statistics APIs
Stat-transform API
The "stat-transform" API allows you to transform a dataset right inside PlotBuilder, calculating stats on the fly. It is essentially a set of extensions for PlotBuilder that take the usual statistic arguments (input samples, weights, and parameters) and also open a new context. As usual, new layers can be created in this context, but within it they use a new dataset: the one produced by the statistical transformation.
Stat-layers API
The "stat-layers" API is a set of shortcuts for the most popular statistical graphs (such as a histogram). It integrates the "stat-transform" API with regular layers: a single function plots a statistical layer, combining three things (stat calculation, layer creation, and default mappings).
The result is the same, but with three times less code! That doesn't mean we lose flexibility. First, .histogram() has all the same arguments as .statBin(), which means we can fully control how the statistic is calculated. Second, it optionally opens a new context, a union of the bars and statBin contexts. This allows you to customize bars (including overriding the default mappings!).
Stat-plots API
The "stat-plots" API allows you to build a histogram even faster, with just one function! Usually it is a function or a set of extensions for a DataFrame that take the standard statistic arguments (inputs, weights, parameters).
The column selection DSL for stat plots is slightly different from the standard one. You can still open a new scope in which you can access the columns of the dataframe. However, unlike the standard DSL, you do not return the columns as the result of the expression; instead, you assign them to the inputs of the statistic through functions of the same name. Weights are provided in the same way.
Stat plots can also be configured. We can configure layer mappings and settings exactly as in a stat layer, and also change the general settings of the plot. The .configure() extension is used for this purpose. It opens a context that combines several contexts you are already familiar with (the stat context, the layer context, and the plot context):
Statistics and grouped data
Everything described above works with grouped data as well. Statistics are calculated independently inside each group (although not always entirely independently; for example, to plot a histogram, we want the bin centers in different groups to be equal so that the groups line up). Thus, a statistical transformation of a GroupBy returns a GroupBy with the same keys, but with Stat dataframes in place of the original datasets.
Let's make sure of that:
[Two GroupBy outputs, each with a key column type (values A and B) and a nested group column; the nested contents are not preserved.]
As you can see, we did indeed perform a statBin transformation within the groups, and the grouping keys did not change.
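The idea of per-group statistics with shared bins can be sketched in plain Kotlin as follows. This is an illustration under stated assumptions, not Kandy's implementation: groupedBinCounts is a hypothetical helper, the data is a list of (grouping key, value) pairs, and the bin edges are assumed to be derived from the whole dataset so that every group gets the same bin centers.

```kotlin
// Illustrative grouped binning (hypothetical helper, not part of Kandy):
// bin edges come from the whole dataset, counts are computed per group.
// Returns, for each grouping key, a list of (bin center, count) pairs.
fun groupedBinCounts(
    data: List<Pair<String, Double>>,  // (grouping key, value)
    binsNumber: Int = 3
): Map<String, List<Pair<Double, Int>>> {
    require(data.isNotEmpty() && binsNumber > 0)
    val min = data.minOf { it.second }
    val max = data.maxOf { it.second }
    require(max > min)
    val width = (max - min) / binsNumber
    return data.groupBy({ it.first }, { it.second }).mapValues { (_, values) ->
        val counts = IntArray(binsNumber)
        for (v in values) {
            // The overall maximum is placed in the last bin.
            val i = ((v - min) / width).toInt().coerceAtMost(binsNumber - 1)
            counts[i] += 1
        }
        List(binsNumber) { i -> (min + (i + 0.5) * width) to counts[i] }
    }
}

fun main() {
    val data = listOf("A" to 0.0, "A" to 1.0, "A" to 5.0, "B" to 2.5, "B" to 6.0)
    groupedBinCounts(data).forEach { (key, rows) -> println("$key: $rows") }
}
```

The result keeps the original grouping keys, and every group has one row per bin with identical centers, which is what makes the groups directly comparable on a single plot.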
The plotting process isn't much different either. As usual, different sets of points are plotted for different groups. Within the new "stat" context, we can also access the columns corresponding to the grouping keys, and we can configure the position inside the layer.
This also works for the histogram layer. Moreover, if there is exactly one grouping key, it is mapped to fillColor by default:
And we can customize it:
GroupBy also has a .histogram() extension that works exactly like the one for DataFrame and can be configured the same way: