Summary statistics
Basic summary statistics:
Aggregating summary statistics:
Every summary statistic can be used in aggregations of:
sum, mean, std are available for (primitive) number columns of types Int, Double, Float, Long, Byte, Short, and any mix of those.
min/max, median, and percentile are available for self-comparable columns, that is, columns of type T : Comparable<T>, such as DateTime, String, or Int. This includes all primitive number columns, but not a mix of different number types.
In all cases, null values are ignored.
NaN values can optionally be ignored by setting the skipNaN flag to true. When it's set to false, a NaN in the input will be propagated to the result.
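The null and NaN rules above can be modeled in a few lines of plain Kotlin. This is only a sketch of the described behavior, not the library's implementation; `sketchMean` is a hypothetical helper that uses nothing beyond the standard library.

```kotlin
// Hypothetical helper, not part of the DataFrame API:
// models the null and NaN rules described above for a mean.
fun sketchMean(values: List<Double?>, skipNaN: Boolean = false): Double {
    // Nulls are always dropped before the statistic is computed.
    val present = values.filterNotNull()
    // With skipNaN = true, NaN values are dropped as well.
    val used = if (skipNaN) present.filterNot { it.isNaN() } else present
    // Any NaN left in the input propagates through the average.
    return used.average()
}

fun main() {
    val data = listOf(1.0, null, Double.NaN, 3.0)
    println(sketchMean(data, skipNaN = true))  // prints 2.0
    println(sketchMean(data, skipNaN = false)) // prints NaN
}
```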
Big numbers (BigInteger, BigDecimal) are generally not supported for statistics. Please convert them to primitive types before using statistics.
When a statistic x is applied to several columns, it can be computed in several modes:
x(): DataRow. Computes a separate value for every suitable column.
x { columns }: Value. Computes a single value across all given columns.
xFor { columns }: DataRow. Computes a separate value for every given column.
xOf { rowExpression }: Value. Computes a single value across the results of the row expression evaluated for every row.
(See column selectors for how to select the columns for these operations)
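To make the difference between the four modes concrete, here is a minimal sketch that models them for sum on a tiny in-memory table. The `Table` alias and the four function names are hypothetical and exist only for illustration; they are not DataFrame API calls.

```kotlin
// Hypothetical miniature table: column name -> column values. Not a DataFrame type.
typealias Table = Map<String, List<Int>>

// Models x(): one value per suitable column.
fun sumPerColumn(t: Table): Map<String, Int> = t.mapValues { (_, v) -> v.sum() }

// Models x { columns }: a single value across all given columns.
fun sumAcross(t: Table, vararg columns: String): Int =
    columns.sumOf { t.getValue(it).sum() }

// Models xFor { columns }: one value per given column.
fun sumForColumns(t: Table, vararg columns: String): Map<String, Int> =
    columns.associateWith { t.getValue(it).sum() }

// Models xOf { rowExpression }: a single value across the expression results of every row.
fun sumOfRows(t: Table, rowExpression: (Map<String, Int>) -> Int): Int {
    val rowCount = t.values.firstOrNull()?.size ?: 0
    return (0 until rowCount).sumOf { i -> rowExpression(t.mapValues { (_, v) -> v[i] }) }
}

fun main() {
    val table: Table = mapOf("age" to listOf(10, 20, 30), "weight" to listOf(1, 2, 3))
    println(sumPerColumn(table))               // prints {age=60, weight=6}
    println(sumAcross(table, "age", "weight")) // prints 66
    println(sumForColumns(table, "age"))       // prints {age=60}
    println(sumOfRows(table) { it.getValue("age") * it.getValue("weight") }) // prints 140
}
```

The first and third modes return one value per column (a row-like result), while the second and fourth collapse everything into a single value.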
min/max, median, and percentile have an additional mode, by:
minBy { rowExpression }: DataRow. Finds a row with the minimal result of the rowExpression.
medianBy { rowExpression }: DataRow. Finds a row where the median lies, based on the results of the rowExpression.
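The key point of the by mode is that it returns the whole row, not the computed value. A rough plain-Kotlin model is shown below; `Person`, `minRowBy`, and `medianRowBy` are hypothetical names, and the handling of an even row count in `medianRowBy` is an assumption rather than documented library behavior.

```kotlin
// Hypothetical row type used only for this sketch.
data class Person(val name: String, val age: Int)

// Models minBy: returns the whole row whose expression result is smallest.
fun minRowBy(rows: List<Person>, expr: (Person) -> Int): Person? =
    rows.minByOrNull(expr)

// Models medianBy: returns the row at the median position after sorting by the expression.
// Assumption: for an even row count this picks the lower middle row; the library may differ.
fun medianRowBy(rows: List<Person>, expr: (Person) -> Int): Person? =
    rows.sortedBy(expr).getOrNull((rows.size - 1) / 2)

fun main() {
    val rows = listOf(Person("Alice", 15), Person("Bob", 45), Person("Charlie", 20))
    println(minRowBy(rows) { it.age })    // prints Person(name=Alice, age=15)
    println(medianRowBy(rows) { it.age }) // prints Person(name=Charlie, age=20)
}
```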
To perform statistics for a single row, see row statistics.
groupBy statistics
When a statistic is applied to a GroupBy DataFrame, it is computed for every data group.
If a statistic is applied in a mode that returns a single value for every data group, it will be stored in a single column named according to the statistic name.
You can also pass a custom name for the aggregated column:
If a statistic is applied in a mode that returns a separate value for every column in a data group, aggregated values will be stored in columns with original column names.
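The column naming rules for grouped statistics can be sketched with the standard library's `groupBy`. The nested maps below stand in for the aggregated result (group key, then column name, then value); `Rec`, `meanAgeByCity`, and `meanForByCity` are hypothetical names, not library functions.

```kotlin
// Hypothetical row type used only for this sketch.
data class Rec(val city: String, val age: Int, val weight: Int)

// Single-value mode: one result per group, stored under the statistic name
// ("mean") unless a custom column name is passed.
fun meanAgeByCity(rows: List<Rec>, columnName: String = "mean"): Map<String, Map<String, Double>> =
    rows.groupBy { it.city }
        .mapValues { (_, group) -> mapOf(columnName to group.map { it.age }.average()) }

// Per-column mode: one result per aggregated column, stored under the original column names.
fun meanForByCity(rows: List<Rec>): Map<String, Map<String, Double>> =
    rows.groupBy { it.city }.mapValues { (_, group) ->
        mapOf(
            "age" to group.map { it.age }.average(),
            "weight" to group.map { it.weight }.average(),
        )
    }

fun main() {
    val rows = listOf(Rec("London", 20, 60), Rec("London", 30, 70), Rec("Tokyo", 40, 80))
    println(meanAgeByCity(rows))             // prints {London={mean=25.0}, Tokyo={mean=40.0}}
    println(meanAgeByCity(rows, "mean age")) // prints {London={mean age=25.0}, Tokyo={mean age=40.0}}
    println(meanForByCity(rows))
}
```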
pivot statistics
When a statistic is applied to a Pivot or PivotGroupBy, it is computed for every data group.
If a statistic is applied in a mode that returns a single value for every data group, it will be stored in a DataFrame cell without any name.
If a statistic is applied in a mode that returns a separate value for every column in a data group, every cell in the nested DataFrame will contain a DataRow with values for every aggregated column.
To group columns in aggregation results not by pivoted values, but by aggregated columns, apply the separate flag:
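The effect of the separate flag is a change in nesting order. The sketch below models both layouts with nested maps in plain Kotlin, assuming a single pivot over city and a mean over two columns; `Obs`, `meanByPivot`, and `meanSeparate` are hypothetical names and do not reproduce the library's actual result types.

```kotlin
// Hypothetical row type used only for this sketch.
data class Obs(val city: String, val age: Int, val weight: Int)

// Default layout: pivoted value -> aggregated column -> mean.
fun meanByPivot(rows: List<Obs>): Map<String, Map<String, Double>> =
    rows.groupBy { it.city }.mapValues { (_, group) ->
        mapOf(
            "age" to group.map { it.age }.average(),
            "weight" to group.map { it.weight }.average(),
        )
    }

// Layout modeled after the separate flag: aggregated column -> pivoted value -> mean.
fun meanSeparate(rows: List<Obs>): Map<String, Map<String, Double>> {
    val byPivot = meanByPivot(rows)
    return listOf("age", "weight").associateWith { column ->
        byPivot.mapValues { (_, stats) -> stats.getValue(column) }
    }
}

fun main() {
    val rows = listOf(Obs("London", 20, 60), Obs("London", 30, 70), Obs("Tokyo", 40, 80))
    println(meanByPivot(rows))
    println(meanSeparate(rows))
}
```

Both functions hold the same numbers; only the outer grouping key changes from the pivoted value to the aggregated column.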