Density Plot
The "density" statistic is calculated on a sample of a single continuous variable. It approximates the probability density function (PDF) of this sample with a kernel estimate and evaluates that function at a set of sampled points. The statistic is weighted: the computed density depends on the observation weights.
This notebook uses definitions from DataFrame.
Usage
The "density" statistic is useful when you have a large dataset and want to understand the underlying probability distribution. A density plot visualizes the PDF and also lets you compare the distributions of different samples. It is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.
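To make the idea of a weighted kernel estimate concrete, here is a minimal sketch in plain Kotlin (standard library only). It is not the library's implementation: the function names `gaussianKernel`, `kde` and `densityCurve` are hypothetical, the bandwidth is passed in by hand, and normalising by the sum of weights is an assumption about how weights enter the estimate.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt

// Standard normal density, used here as the smoothing kernel.
fun gaussianKernel(u: Double): Double = exp(-0.5 * u * u) / sqrt(2.0 * PI)

// Weighted kernel density estimate at a single point `at`.
// Assumption: weights are normalised by their sum, so equal weights
// reproduce the ordinary (unweighted) estimate.
fun kde(sample: List<Double>, weights: List<Double>, bandwidth: Double, at: Double): Double {
    require(sample.size == weights.size) { "weights must match the sample size" }
    require(bandwidth > 0.0) { "bandwidth must be positive" }
    var acc = 0.0
    for (i in sample.indices) {
        acc += weights[i] * gaussianKernel((at - sample[i]) / bandwidth)
    }
    return acc / (weights.sum() * bandwidth)
}

// Evaluates the estimate at n evenly spaced points covering the sample range.
fun densityCurve(
    sample: List<Double>,
    weights: List<Double>,
    bandwidth: Double,
    n: Int
): List<Pair<Double, Double>> {
    require(n >= 2) { "need at least two sampled points" }
    val lo = sample.minOf { it }
    val hi = sample.maxOf { it }
    val step = (hi - lo) / (n - 1)
    return List(n) { i ->
        val x = lo + i * step
        x to kde(sample, weights, bandwidth, x)
    }
}

fun main() {
    val sample = listOf(495.7, 666.918, 466.139, 488.06, 338.757)
    val equalWeights = List(sample.size) { 1.0 }
    val weights = listOf(0.818, 0.863, 1.0, 0.489, 0.917)
    val plain = densityCurve(sample, equalWeights, bandwidth = 50.0, n = 5)
    val weighted = densityCurve(sample, weights, bandwidth = 50.0, n = 5)
    plain.zip(weighted).forEach { (p, w) ->
        println("x=${p.first} density=${p.second} weighted=${w.second}")
    }
}
```

Each returned pair plays the role of one PDF point: an x coordinate and the density estimated there.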
Arguments
Input (mandatory):
x - numeric sample on which the statistics are calculated.
Weights (optional):
weights - set of weights of the same size as the input sample. null (by default) means all weights are equal to 1.0 and the weighted density is equal to the normal one.
Parameters (optional):
n: Int - number of sampled points.
trim: Boolean - if false, each density is computed on the full range of the data; if true, each density is computed over the range of that group (only for grouped inputs).
adjust: Double - adjusts the value of bandwidth by multiplying it; changes how smooth the frequency curve is.
kernel: Kernel - the kernel used to calculate the density function: Kernel.GAUSSIAN, Kernel.RECTANGULAR, Kernel.TRIANGULAR, Kernel.BIWEIGHT, Kernel.EPANECHNIKOV, Kernel.OPTCOSINE.
fullScanMax: Int - maximum size of data to use density computation with 'full scan'. For bigger data, a less accurate but more efficient density computation is applied.
bandWidth: BandWidth - the method (or exact value) of bandwidth: BandWidth.Method.NRD, BandWidth.Method.NRD0, BandWidth.value(value: Double).
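The names NRD and NRD0 match the classic rule-of-thumb bandwidth selectors known from R (`bw.nrd` and `bw.nrd0`). The sketch below shows those textbook formulas in plain Kotlin and how an `adjust` multiplier would scale the result. It assumes the library follows the same conventions, which the text above does not state; all function names are hypothetical, and the zero-spread fallback of R's `bw.nrd0` is left out.

```kotlin
import kotlin.math.min
import kotlin.math.pow
import kotlin.math.sqrt

// Sample standard deviation (n - 1 in the denominator).
fun standardDeviation(xs: List<Double>): Double {
    val mean = xs.average()
    return sqrt(xs.sumOf { (it - mean) * (it - mean) } / (xs.size - 1))
}

// Quantile by linear interpolation between order statistics.
fun quantile(xs: List<Double>, p: Double): Double {
    val sorted = xs.sorted()
    val pos = p * (sorted.size - 1)
    val lower = pos.toInt()
    val upper = min(lower + 1, sorted.size - 1)
    return sorted[lower] + (pos - lower) * (sorted[upper] - sorted[lower])
}

// Part shared by both rules: min(sd, IQR / 1.34) * n^(-1/5).
fun spread(xs: List<Double>): Double {
    val iqr = quantile(xs, 0.75) - quantile(xs, 0.25)
    return min(standardDeviation(xs), iqr / 1.34) * xs.size.toDouble().pow(-0.2)
}

// Silverman's rule of thumb (factor 0.9) and Scott's variant (factor 1.06).
fun bandwidthNrd0(xs: List<Double>): Double = 0.9 * spread(xs)
fun bandwidthNrd(xs: List<Double>): Double = 1.06 * spread(xs)

// `adjust` multiplies the selected bandwidth: values above 1 give a smoother curve.
fun adjustedBandwidth(base: Double, adjust: Double): Double = base * adjust

fun main() {
    val sample = listOf(495.7, 666.918, 466.139, 488.06, 338.757)
    val h = bandwidthNrd0(sample)
    println("NRD0 = $h, NRD = ${bandwidthNrd(sample)}, adjusted x2 = ${adjustedBandwidth(h, 2.0)}")
}
```

A larger bandwidth averages over more neighbouring observations, which is why both the selection method and `adjust` change how smooth the resulting curve looks.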
Generalized signature
The specific signature depends on the function, but all functions related to the "density" statistic (different variations of statDensity() and densityPlot(), discussed further below) have approximately the same signature with the arguments above:
The possible types of x and weights depend on where a certain function is used. They can be simply an Iterable (List, Set, etc.), a reference to a column in a DataFrame (String, ColumnAccessor), or the DataColumn itself.
Output statistics
name | type | description |
---|---|---|
Stat.x | Double | x coordinate of the PDF point |
Stat.density | Double | Density estimate |
Stat.densityWeighted | Double | Weighted density |
Stat.scaled | Double | Density estimate, scaled to a maximum of 1.0 |
Stat.scaledWeighted | Double | Weighted scaled |
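The relation between a density column and its scaled counterpart can be sketched in a few lines of plain Kotlin. The helper name `scaleToMax` and the sample values are hypothetical; the sketch only illustrates "scaled to a maximum of 1.0" and is not the library's code.

```kotlin
// Rescales density estimates so that the largest value becomes 1.0.
fun scaleToMax(density: List<Double>): List<Double> {
    val peak = density.maxOf { it }
    require(peak > 0.0) { "at least one density value must be positive" }
    return density.map { it / peak }
}

fun main() {
    // Illustrative values only, not output of the library.
    val density = listOf(0.001, 0.004, 0.008, 0.002)
    println(scaleToMax(density))
}
```

Scaling keeps the shape of the curve and changes only its height, which makes curves from samples of different sizes easier to compare on one plot.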
StatDensity plots
depth | coeff |
---|---|
495.7 | 0.818 |
666.918 | 0.863 |
466.139 | 1 |
488.06 | 0.489 |
338.757 | 0.917 |
df has the following signature:
depth | coeff |
---|---|
Let's take a look at the StatDensity output DataFrame. It has the following signature:
Stat | | | | |
---|---|---|---|---|
x | density | densityWeighted | scaled | scaledWeighted |
As you can see, we got a DataFrame with one ColumnGroup called Stat, which contains several columns with statistics. For statDensity, each row corresponds to one PDF point. Stat.x is the column with the x coordinate of this point. Stat.density contains the estimated density. Stat.densityWeighted is the weighted version of density. Stat.scaled is the density scaled to a maximum of 1.0. Stat.scaledWeighted is the weighted version of scaled. A DataFrame with "density" statistics is called StatDensityFrame.
statDensity transform
statDensity(statDensityArgs) { /*new plotting context*/ } modifies the plotting context: instead of the original data (whether or not it was empty), a new StatDensity dataset is used inside a new context. This dataset is calculated on the given arguments; inputs and weights can be provided as an Iterable or as a dataset column reference (by name as a String, as a ColumnReference, or as a DataColumn). The original dataset and the primary context are not affected, so you can still add layers using the initial dataset outside the statDensity context. Since the old dataset is irrelevant inside the new context, we cannot use references to its columns there, but we can refer to the new ones. They are all contained in the Stat group and can be called inside the new context:
DensityPlot layer
A density plot is a statistical plot used for visualizing the distribution of continuous variables. It is an area graph of the kernel-estimated PDF. So basically, we can build a density plot with statDensity as follows:
But we can do it even faster with the densityPlot(statDensityArgs) method:
Let's compare them:
These two plots are identical. Indeed, densityPlot just uses statDensity and area and performs coordinate mappings under the hood. We can also customize the densityPlot layer: densityPlot() optionally opens a new context, where we can configure the area (as in the usual context opened by area { ... }) and even change the coordinate mappings from the default ones. The StatDensity dataset of densityPlot can also be accessed here.
If we specify weights, Stat.densityWeighted is mapped to y by default:
densityPlot plot
densityPlot(statDensityArgs) and DataFrame.densityPlot(statDensityArgs) are a family of functions for quickly building a density plot.
If you want to provide the input and weights using the column selection DSL, it works a bit differently from the usual one: you should assign the x input and (optionally) weight through invocation of the eponymous functions:
The densityPlot plot can be configured with the .configure {} extension. It opens a context that combines the area, StatDensity and plot contexts. That means you can configure area settings, mappings using the StatDensity dataset, and any plot adjustments:
Grouped statDensity
statDensity can be applied to grouped data: statistics will be calculated on each group independently but with equal categories. This application returns a new GroupBy dataset with the same keys as the old one but with StatDensity groups instead of the old ones.
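The per-group computation, and the effect of the trim parameter on the sampled x coordinates, can be sketched in plain Kotlin as follows. This is an illustration under assumptions, not the library's code: `Observation`, `gaussianKde`, `grid` and `groupedDensity` are hypothetical names, the bandwidth is fixed by hand, and the values for category B are made up for the example.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt

data class Observation(val range: Double, val category: String)

// Unweighted Gaussian kernel density estimate at point `at`.
fun gaussianKde(sample: List<Double>, bandwidth: Double, at: Double): Double {
    val sum = sample.sumOf { xi ->
        val u = (at - xi) / bandwidth
        exp(-0.5 * u * u)
    }
    return sum / (sample.size * bandwidth * sqrt(2.0 * PI))
}

// n evenly spaced points starting at lo and ending at (approximately) hi.
fun grid(lo: Double, hi: Double, n: Int): List<Double> =
    List(n) { i -> lo + i * (hi - lo) / (n - 1) }

// One density curve per key. With trim = false every group is sampled on the
// range of the whole dataset; with trim = true on the range of that group only.
fun groupedDensity(
    data: List<Observation>,
    bandwidth: Double,
    n: Int,
    trim: Boolean
): Map<String, List<Pair<Double, Double>>> {
    val all = data.map { it.range }
    return data.groupBy { it.category }.mapValues { (_, rows) ->
        val sample = rows.map { it.range }
        val xs = if (trim) {
            grid(sample.minOf { it }, sample.maxOf { it }, n)
        } else {
            grid(all.minOf { it }, all.maxOf { it }, n)
        }
        xs.map { x -> x to gaussianKde(sample, bandwidth, x) }
    }
}

fun main() {
    val data = listOf(
        Observation(503.671, "A"), Observation(560.585, "A"), Observation(525.12, "A"),
        // Hypothetical values for category B, for illustration only.
        Observation(410.0, "B"), Observation(455.5, "B"), Observation(432.25, "B")
    )
    val curves = groupedDensity(data, bandwidth = 20.0, n = 4, trim = false)
    curves.forEach { (key, points) -> println("$key -> $points") }
}
```

The result keeps the original keys and replaces each group with its own list of density points, mirroring the shape of the GroupBy described above.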
range | category |
---|---|
503.671 | A |
560.585 | A |
525.12 | A |
488.74 | A |
357.084 | A |
It has the following signature:
range | category |
---|---|
Now we have a GroupBy with the following signature:
key: [category] | group: DataFrame[range|category] |
---|---|
A | A-Group |
B | B-Group |
After applying statDensity, it is still a GroupBy but with a different signature of group: all groups have the same signature as a usual DataFrame after applying statDensity (i.e. StatDensityFrame):
key: [category] | group: StatDensityFrame |
---|---|
"A" | "A"-Group |
"B" | "B"-Group |
As you can see, the statDensity transformation was indeed applied within groups, and the grouping keys did not change. The plotting process doesn't change much: we do everything the same way.
As you can see, there are two lines because we have two groups of data. To distinguish them, we need to add a mapping from the key to the color. This is convenient because the key is available in the context.
The densityPlot() layer also works. Moreover, if we have exactly one grouping key, a mapping from it to fillColor and borderLine.color will be created by default.
We can customize it as before. The only difference is that the key columns are accessible:
Also, we can stack areas (for that we need the x coordinates to match; use trim = true):
The densityPlot plot for GroupBy (i.e. the GroupBy.densityPlot(statDensityArgs) extensions) works as well:
... and can be configured the same way:
Inside groupBy{} plot context
We can apply the groupBy modification to the initial dataset and build a density plot with grouped data the same way: