groupBy
Splits the rows of a DataFrame into groups using one or several columns as grouping keys.
The groupBy function returns a GroupBy object. A GroupBy is a dataframe-like structure that contains one or more key columns and a group FrameColumn. Key columns contain all unique combinations of key values, and the group FrameColumn contains the corresponding groups of rows (each represented as a DataFrame). Each row in a GroupBy corresponds to a keys/group combination.
See column selectors for how to select the columns for this operation. See also groupBy transformations, groupBy reducing, groupBy aggregations, and pivot + groupBy.
Grouping columns can be created in place:
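As a minimal sketch of a key column computed on the fly, the snippet below groups people by age decade. It assumes the Kotlin DataFrame library is on the classpath and uses its String API; the sample data and the names `people`, `countByAgeDecade` and `ageDecade` are hypothetical and not taken from the original page.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// The grouping key is not an existing column: it is computed per row
// and stored in a new key column named "ageDecade".
fun countByAgeDecade(): DataFrame<*> =
    people().groupBy { expr { "age"<Int>() / 10 } named "ageDecade" }.count()

fun main() {
    println(countByAgeDecade())
}
```

The result should contain one row per distinct decade, with the computed key in `ageDecade` and the group size in `count`.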
With the optional moveToTop parameter, you can choose whether to make a selected nested column a top-level column:
or to keep it inside a ColumnGroup:
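The two settings of `moveToTop` can be compared with a small sketch. It assumes the Kotlin DataFrame library is available; the sample data, the nested `name` group and the function names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: firstName and lastName are nested under a "name" ColumnGroup.
fun people(): DataFrame<*> = dataFrameOf("firstName", "lastName", "city")(
    "Alice", "Cooper", "London",
    "Bob", "Dylan", "Dubai",
    "Charlie", "Cooper", "Moscow"
).group("firstName", "lastName").into("name")

// The nested key column becomes a top-level column of the result.
fun keysMovedToTop(): DataFrame<*> =
    people().groupBy(moveToTop = true) { "name"["lastName"] }.count()

// The nested key column stays inside its "name" ColumnGroup.
fun keysKeptNested(): DataFrame<*> =
    people().groupBy(moveToTop = false) { "name"["lastName"] }.count()

fun main() {
    println(keysMovedToTop().columnNames())
    println(keysKeptNested().columnNames())
}
```

Printing the top-level column names of both results is a quick way to see where the key column ended up.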
Returns a GroupBy object.
Transformation
A GroupBy can be transformed into a new GroupBy using one of the following methods:
sortByGroup / sortByGroupDesc: sorts the order of groups (and their corresponding keys) by values computed with a DataFrameExpression applied to each group;
sortByCount / sortByCountAsc: sorts the order of groups (and their corresponding keys) by the number of rows they contain;
sortByKey / sortByKeyDesc: sorts the order of groups (and their corresponding keys) by the grouping key values;
sortBy / sortByDesc: sorts the order of rows within each group by one or more column values;
updateGroups: transforms each group into a new one using the provided transforming function;
filter: filters group rows by the given predicate;
add: adds a new column to each group.
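Three of the sorting methods above can be sketched as follows. The snippet assumes the Kotlin DataFrame library and its String API; the sample data and function names are hypothetical. It also assumes that `sortByCount` orders groups from largest to smallest, which the existence of `sortByCountAsc` suggests.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// Reorders groups by their size, then counts rows per group.
fun citiesByCount(): DataFrame<*> =
    people().groupBy("city").sortByCount().count()

// Reorders groups by the value of the grouping key.
fun citiesByKey(): DataFrame<*> =
    people().groupBy("city").sortByKey().count()

// Sorts rows inside every group by age, then concatenates the groups back.
fun rowsSortedInGroups(): DataFrame<*> =
    people().groupBy("city").sortBy("age").concat()

fun main() {
    println(citiesByCount())
    println(citiesByKey())
    println(rowsSortedInGroups())
}
```

The first two functions change the order of groups, while the third changes only the order of rows inside each group.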
Any DataFrame with a FrameColumn can be reinterpreted as a GroupBy:
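A minimal sketch of this reinterpretation, assuming the Kotlin DataFrame library. It builds a DataFrame with a FrameColumn by calling `toDataFrame()` on a GroupBy and then uses the no-argument `asGroupBy()` overload, which is assumed to pick the single FrameColumn as the groups. The sample data and function names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// A plain DataFrame: one key column plus one FrameColumn holding the groups.
fun framed(): DataFrame<*> = people().groupBy("city").toDataFrame()

// The FrameColumn is interpreted as groups again, so GroupBy operations apply.
fun regroupedCounts(): DataFrame<*> = framed().asGroupBy().count()

fun main() {
    println(regroupedCounts())
}
```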
Examples of transformation
sortByGroup
sortByCount
sortByKey
sortBy
updateGroups
filter
add
Reducing
A GroupBy can be reduced into a DataFrame. It means that each group in this GroupBy is collapsed into a single representative row, and these rows are concatenated into a new DataFrame.
Reducing is a specific case of aggregation.
This mechanism includes two steps.
Step 1: use a reducing function to make a single row from each group
To perform a reducing operation, use the following functions:
first / last: to get the first / last row (optionally, the first or last one that satisfies a predicate) of each group.
minBy / maxBy: to get from each group the row with the smallest / largest result of the row expression supplied to the function.
medianBy / percentileBy: to get the row with the value closest to the estimated median/percentile index of the row expression's results calculated on rows within each group.
These functions return an instance of ReducedGroupBy, which is a class serving as a transitional step between performing a reduction on groups and specifying how the resulting reduced rows (either original or transformed) should be represented in a new DataFrame.
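The reducing step can be sketched as below, assuming the Kotlin DataFrame library. Because a ReducedGroupBy is only a transitional object, the sketch finishes each reduction with `concat()` to obtain a printable DataFrame. The sample data and function names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// Keeps the first row of every city group.
fun firstPerCity(): DataFrame<*> =
    people().groupBy("city").first().concat()

// Keeps, for every city, the row whose row expression (the age) is largest.
fun oldestPerCity(): DataFrame<*> =
    people().groupBy("city").maxBy { "age"<Int>() }.concat()

fun main() {
    println(firstPerCity())
    println(oldestPerCity())
}
```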
Examples of reducing
df.groupBy
first
last
minBy
maxBy
medianBy
percentileBy
Step 2: transform the result to a DataFrame
A ReducedGroupBy can be transformed into a DataFrame using the following functions:
concat: to concatenate all reduced rows into a single DataFrame.
values: to create a DataFrame that contains the values from the reduced rows in the selected columns.
into: to add a new column to the resulting DataFrame with values computed with a row expression on each row, or a new column group containing each group reduced to a single row.
Each method returns a new DataFrame that includes the grouping key columns, containing all unique grouping key values (or value combinations for multiple keys) along with their corresponding reduced rows.
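The three ways of finishing a reduction can be compared on the same ReducedGroupBy. This is a sketch that assumes the Kotlin DataFrame library and its String API; the sample data, the function names and the column name `youngest` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// concat: the reduced rows themselves, one per city.
fun youngestRows(): DataFrame<*> =
    people().groupBy("city").minBy { "age"<Int>() }.concat()

// values: grouping keys plus only the selected columns of the reduced rows.
fun youngestNameAndAge(): DataFrame<*> =
    people().groupBy("city").minBy { "age"<Int>() }.values("name", "age")

// into: grouping keys plus one column group holding each reduced row.
fun youngestAsGroup(): DataFrame<*> =
    people().groupBy("city").minBy { "age"<Int>() }.into("youngest")

fun main() {
    println(youngestRows())
    println(youngestNameAndAge())
    println(youngestAsGroup())
}
```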
Examples of transforming
concat
values
into
Aggregation
A GroupBy can be directly transformed into a new DataFrame by applying one or more aggregation operations to its groups.
Aggregation is a generalization of reducing.
The following aggregation methods are available:
concat: concatenates all rows from all groups into a single DataFrame, without preserving grouping keys.
toDataFrame: returns this GroupBy as a DataFrame with the grouping keys and corresponding groups in a FrameColumn.
concatWithKeys: a variant of concat that also includes grouping keys that were not present in the original DataFrame.
into: creates a new column containing a list of values computed with a RowExpression for each group, or a new FrameColumn containing the groups themselves.
values: collects all column values for every group without aggregation. For a ValueColumn of type T it will gather group values into lists of type List<T>. For a ColumnGroup it will gather group values into a DataFrame and convert that ColumnGroup into a FrameColumn.
count: creates a DataFrame containing the grouping key columns and an additional column with the number of rows in each corresponding group.
aggregate: performs a set of custom aggregations using AggregateDsl, allowing you to compute one or more statistics for every group of a GroupBy. The body of this function is executed for every data group and has a receiver of type DataFrame that represents the current data group being aggregated. To add a new column to the resulting DataFrame, pass the name of the new column to the infix function into.
Each of these methods returns a new DataFrame that includes the grouping key columns (except for concat) along with the columns of values aggregated from the corresponding groups.
Examples of aggregation
toDataFrame on GroupBy
Any GroupBy can be reinterpreted as a DataFrame with a FrameColumn:
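A short sketch contrasting `toDataFrame` with `concat`, assuming the Kotlin DataFrame library. The sample data and function names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// One row per city: the key column plus a FrameColumn with that city's rows.
fun groupsAsFrame(): DataFrame<*> = people().groupBy("city").toDataFrame()

// All rows of all groups glued back together into a flat DataFrame.
fun allRows(): DataFrame<*> = people().groupBy("city").concat()

fun main() {
    println(groupsAsFrame())
    println(allRows())
}
```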
concatWithKeys on GroupBy
into on GroupBy
values on GroupBy
all columns
selected columns
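Collecting the values of selected columns could look like the sketch below. It assumes the Kotlin DataFrame library and its String API; the sample data and function name are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// For every city, gathers the names and ages of its rows into lists
// without computing any statistic on them.
fun nameAndAgePerCity(): DataFrame<*> =
    people().groupBy("city").values("name", "age")

fun main() {
    println(nameAndAgePerCity())
}
```

Each cell of the `name` and `age` columns in the result is expected to hold a list with one element per row of the corresponding group.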
rename columns
count on GroupBy
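A minimal sketch of counting rows per group, assuming the Kotlin DataFrame library. The sample data, the function names and the result column name `people` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// Number of rows in each city group, stored in a column with a custom name.
fun peoplePerCity(): DataFrame<*> =
    people().groupBy("city").count(resultName = "people")

// Number of rows in each city group that satisfy the predicate.
fun adultsPerCity(): DataFrame<*> =
    people().groupBy("city").count { "age"<Int>() > 18 }

fun main() {
    println(peoplePerCity())
    println(adultsPerCity())
}
```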
aggregate on GroupBy
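Several custom aggregations can be combined in one `aggregate` body, as in the sketch below. It assumes the Kotlin DataFrame library and its String API; the sample data, the function name and the result column names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age")(
    "Alice", "London", 15,
    "Bob", "Dubai", 45,
    "Charlie", "Moscow", 20,
    "Dana", "London", 40,
    "Eve", "Dubai", 30
)

// The body runs once per group; its receiver is the DataFrame of that group.
// Each `into` call adds one column to the resulting DataFrame.
fun cityStats(): DataFrame<*> =
    people().groupBy("city").aggregate {
        count() into "total"
        count { "age"<Int>() > 18 } into "adults"
        maxOf { "age"<Int>() } into "oldest age"
    }

fun main() {
    println(cityStats())
}
```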
If only one aggregation function is used, the column name can be omitted:
Aggregation statistics
Aggregation statistics are predefined shortcuts for common statistical aggregations such as sum, mean, median, and others.
Each function computes a statistic across the rows of a group and returns the result as a new column (or several columns) in the resulting DataFrame.
The following aggregation statistics are available:
To compute one or several statistics for every group of a GroupBy, use the aggregate function.
The functions max, maxOf, and maxFor differ as follows. They all calculate the maximum of values, but:
max computes it on the selected columns. If more than one column is selected, for each group it computes one maximum value among all selected columns.
maxOf computes it by a row expression: the expression is calculated for each row of the group and the maximum value is returned.
maxFor computes it for each of the selected columns within each group. If more than one column is selected, for each group it computes the maximum value for each selected column separately.
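The difference between the three variants can be sketched on two numeric columns. The snippet assumes the Kotlin DataFrame library and its String API; the sample data, the function names and the result column names `maxAgeOrWeight` and `maxSum` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age", "weight")(
    "Alice", "London", 15, 54,
    "Bob", "Dubai", 45, 87,
    "Charlie", "Moscow", 20, 70,
    "Dana", "London", 40, 60,
    "Eve", "Dubai", 30, 65
)

// max: one value per group, the largest among ALL values of both columns.
fun maxAcrossColumns(): DataFrame<*> =
    people().groupBy("city").max("age", "weight", name = "maxAgeOrWeight")

// maxFor: one value per group and per column, computed separately.
fun maxPerColumn(): DataFrame<*> =
    people().groupBy("city").maxFor("age", "weight")

// maxOf: one value per group, the largest result of the row expression.
fun maxOfExpression(): DataFrame<*> =
    people().groupBy("city").maxOf("maxSum") { "age"<Int>() + "weight"<Int>() }

fun main() {
    println(maxAcrossColumns())
    println(maxPerColumn())
    println(maxOfExpression())
}
```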
Similar logic applies to other statistics.
Direct aggregations
The most common aggregation functions can be computed directly on a GroupBy.
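Two direct aggregations are sketched below, assuming the Kotlin DataFrame library and its String API. The sample data, the function names and the result column name `total weight` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "age", "weight")(
    "Alice", "London", 15, 54,
    "Bob", "Dubai", 45, 87,
    "Charlie", "Moscow", 20, 70,
    "Dana", "London", 40, 60,
    "Eve", "Dubai", 30, 65
)

// Mean age per city, called directly on the GroupBy without an aggregate body.
fun meanAgePerCity(): DataFrame<*> =
    people().groupBy("city").mean("age")

// Sum of weights per city, stored under a custom column name.
fun totalWeightPerCity(): DataFrame<*> =
    people().groupBy("city").sum("weight", name = "total weight")

fun main() {
    println(meanAgePerCity())
    println(totalWeightPerCity())
}
```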
Examples of direct aggregations
max
min
sum
mean
std
median
percentile
Pivot + GroupBy
A GroupBy can be pivoted with the pivot method. It produces a PivotGroupBy that combines vertical and horizontal grouping, enabling computation of cross-group, matrix-like statistics.
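As a rough sketch of this combination, the snippet below groups by city (vertical grouping), pivots by team (horizontal grouping), and counts the rows in every city/team cell. It assumes the Kotlin DataFrame library and its String API; the sample data, the `team` column and the function name are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data used only for illustration.
fun people(): DataFrame<*> = dataFrameOf("name", "city", "team")(
    "Alice", "London", "A",
    "Bob", "Dubai", "B",
    "Charlie", "Moscow", "A",
    "Dana", "London", "B",
    "Eve", "Dubai", "B"
)

// One row per city; the distinct team values become columns holding row counts.
fun teamSizesPerCity(): DataFrame<*> =
    people().groupBy("city").pivot("team").count()

fun main() {
    println(teamSizesPerCity())
}
```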
For more information, see pivot + groupBy.