Summary statistics
Basic summary statistics:
Aggregating summary statistics:
Every summary statistic can be used in aggregations of DataFrame, DataColumn, GroupBy, Pivot, and PivotGroupBy:
df.mean()
df.age.sum()
df.groupBy { city }.mean()
df.pivot { city }.median()
df.pivot { city }.groupBy { name.lastName }.std()
sum, mean, std are available for (primitive) number columns of types Int, Double, Float, Long, Byte, Short, and any mix of those.
min/max, median, and percentile are available for self-comparable columns (so columns of type T : Comparable<T>, like DateTime, String, Int, etc.), which includes all primitive number columns, but no mix of different number types.
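The difference between the two families can be sketched in plain Kotlin, without the library. The helper names `sumMixed` and `minOfColumn` are hypothetical, and widening to Double is only one possible way to combine mixed number types; it is not necessarily how the library does it.

```kotlin
// Illustrative only: sums a mix of primitive number types by widening each value to Double.
// Nulls are dropped before summing.
fun sumMixed(values: List<Number?>): Double =
    values.filterNotNull().sumOf { it.toDouble() }

// min needs a single self-comparable type T : Comparable<T>,
// so a list mixing Int and Double would not satisfy this signature.
fun <T : Comparable<T>> minOfColumn(values: List<T?>): T? =
    values.filterNotNull().minOrNull()
```

With these helpers, `sumMixed(listOf(1, 2.5, 3L, null))` accepts Int, Double, and Long together, while `minOfColumn` only compiles when all values share one comparable type, such as `listOf("b", "a", null)`.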
In all cases, null values are ignored.
NaN values can optionally be ignored by setting the skipNaN flag to true. When it's set to false, a NaN in the input will be propagated to the result.
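The null and NaN rules above can be modeled with a small standalone function. This is a sketch of the described semantics only, not the library's implementation; `meanSketch` is a hypothetical name.

```kotlin
// Illustrative model of the rules above, not the library's code.
// Nulls are always dropped; NaN is dropped only when skipNaN is true.
fun meanSketch(values: List<Double?>, skipNaN: Boolean = false): Double {
    val present = values.filterNotNull()
    val used = if (skipNaN) present.filterNot { it.isNaN() } else present
    // Any NaN left in `used` propagates through the arithmetic to the result.
    // average() of an empty list is also NaN.
    return used.average()
}
```

Calling `meanSketch(listOf(1.0, Double.NaN, 3.0))` yields NaN, while passing `skipNaN = true` averages only the remaining values.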
Big numbers (BigInteger, BigDecimal) are generally not supported for statistics. Please convert them to primitive types before using statistics.
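As a minimal sketch of such a conversion on plain Kotlin collections (the helper `meanOfBigDecimals` is hypothetical and does not use the library), the values can be mapped to Double before aggregating:

```kotlin
import java.math.BigDecimal

// Hypothetical helper: converts big numbers to Double before aggregating.
// Any precision beyond what Double can represent is lost in the conversion.
fun meanOfBigDecimals(values: List<BigDecimal?>): Double =
    values.filterNotNull().map { it.toDouble() }.average()
```

Whether the precision loss is acceptable depends on the data, so it is worth checking the value range before converting.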
When a statistic x is applied to several columns, it can be computed in several modes:
x(): DataRow computes a separate value for every suitable column
x { columns }: Value computes a single value across all given columns
xFor { columns }: DataRow computes a separate value for every given column
xOf { rowExpression }: Value computes a single value across the results of a row expression evaluated for every row
min/max, median, and percentile have an additional mode, by:
minBy { rowExpression }: DataRow finds a row with the minimal result of the rowExpression
medianBy { rowExpression }: DataRow finds a row where the median lies based on the results of the rowExpression
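The by mode returns a whole row rather than a value. A rough plain-Kotlin analogue is shown here; the `Person` type and both helpers are hypothetical, and the choice of the lower middle row for an even row count is an assumption that the library may not share.

```kotlin
// Hypothetical row type for illustration.
data class Person(val name: String, val age: Int)

// minBy analogue: the row whose expression result is smallest.
fun rowWithMinAge(rows: List<Person>): Person? = rows.minByOrNull { it.age }

// medianBy analogue: sort by the expression and take the middle row.
// Assumption: for an even row count this picks the lower middle row.
fun rowAtMedianAge(rows: List<Person>): Person? =
    rows.sortedBy { it.age }.getOrNull((rows.size - 1) / 2)
```

Both helpers return the full `Person`, so other fields such as `name` remain available after the selection.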
To perform statistics for a single row, see row statistics.
df.sum() // sum of values per every numeric column
df.sum { age and weight } // sum of all values in `age` and `weight`
df.sumFor(skipNaN = true) { age and weight } // sum of values per `age` and `weight` separately
df.sumOf { (weight ?: 0) / age } // sum of expression evaluated for every row
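To make the difference between the modes concrete, here is a library-free model of what three of them compute for `sum`. The `Record` type and the function names are hypothetical, and nullable weights are treated as absent, in line with the null handling described earlier.

```kotlin
// Hypothetical row type standing in for a dataframe with `age` and `weight` columns.
data class Record(val age: Int, val weight: Int?)

// Like sum { age and weight }: one value across all given columns.
fun sumAcross(rows: List<Record>): Int =
    rows.sumOf { it.age } + rows.sumOf { it.weight ?: 0 }

// Like sumFor { age and weight }: one value per given column, keyed by column name.
fun sumPerColumn(rows: List<Record>): Map<String, Int> = mapOf(
    "age" to rows.sumOf { it.age },
    "weight" to rows.sumOf { it.weight ?: 0 },
)

// Like sumOf { (weight ?: 0) / age }: one value across the expression evaluated for every row.
fun sumOfExpression(rows: List<Record>): Int =
    rows.sumOf { (it.weight ?: 0) / it.age }
```

The first and third functions return a single value, while the second returns one entry per column, which mirrors the Value and DataRow return types listed above.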
When a statistic is applied to a GroupBy DataFrame, it is computed for every data group.
If a statistic is applied in a mode that returns a single value for every data group, it will be stored in a single column named according to the statistic name.
df.groupBy { city }.mean { age } // [`city`, `mean`]
df.groupBy { city }.meanOf { age / 2 } // [`city`, `mean`]
You can also pass a custom name for the aggregated column:
df.groupBy { city }.mean("mean age") { age } // [`city`, `mean age`]
df.groupBy { city }.meanOf("custom") { age / 2 } // [`city`, `custom`]
If a statistic is applied in a mode that returns a separate value for every column in a data group, aggregated values will be stored in columns with original column names.
df.groupBy { city }.meanFor { age and weight } // [`city`, `age`, `weight`]
df.groupBy { city }.mean() // [`city`, `age`, `weight`, ...]
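The column naming rules for grouped results can be imitated with standard collections. This sketch does not use the library; `Resident` and both functions are hypothetical, and nested maps stand in for the resulting columns.

```kotlin
// Hypothetical row type standing in for a dataframe with `city`, `age`, and `weight` columns.
data class Resident(val city: String, val age: Int, val weight: Int)

// Single-value mode (like mean { age }): one result column per group,
// named after the statistic unless a custom name is passed.
fun meanAgeByCity(rows: List<Resident>, columnName: String = "mean"): Map<String, Map<String, Double>> =
    rows.groupBy { it.city }
        .mapValues { (_, group) -> mapOf(columnName to group.map { it.age }.average()) }

// Per-column mode (like meanFor { age and weight }): results keep the original column names.
fun meanPerColumnByCity(rows: List<Resident>): Map<String, Map<String, Double>> =
    rows.groupBy { it.city }
        .mapValues { (_, group) ->
            mapOf(
                "age" to group.map { it.age }.average(),
                "weight" to group.map { it.weight }.average(),
            )
        }
```

The outer map key plays the role of the `city` column, and the inner map keys correspond to the aggregated column names shown in the comments of the examples above.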
When a statistic is applied to Pivot or PivotGroupBy, it is computed for every data group.
If a statistic is applied in a mode that returns a single value for every data group, it will be stored in a DataFrame cell without any name.
df.groupBy { city }.pivot { name.lastName }.mean { age }
df.groupBy { city }.pivot { name.lastName }.meanOf { age / 2.0 }
df.groupBy("city").pivot { "name"["lastName"] }.mean("age")
df.groupBy("city").pivot { "name"["lastName"] }.meanOf { "age"<Int>() / 2.0 }
If a statistic is applied in a mode that returns a separate value for every column in a data group, every cell in the nested dataframe will contain a DataRow with values for every aggregated column.
df.groupBy { city }.pivot { name.lastName }.meanFor { age and weight }
df.groupBy { city }.pivot { name.lastName }.mean()
To group columns in aggregation results not by pivoted values, but by aggregated columns, apply the separate flag:
df.groupBy { city }.pivot { name.lastName }.meanFor(separate = true) { age and weight }
df.groupBy { city }.pivot { name.lastName }.mean(separate = true)