Column selectors
DataFrame provides a DSL for selecting an arbitrary set of columns: the Columns Selection DSL.
Column selectors are used in many operations:
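As a minimal sketch of how a selector plugs into different operations, the snippet below passes selectors to select, remove, and group. It assumes the Kotlin DataFrame library is on the classpath in a recent release; the sample data and the helper names (people, selectedNames, remainingNames, groupedNames) are hypothetical and only serve the illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, not taken from the documentation.
fun people() = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
    "Bob", 20, 175,
)

// select keeps only the columns picked by the selector.
fun selectedNames() = people().select { cols("name", "age") }.columnNames()

// remove drops the columns picked by the selector.
fun remainingNames() = people().remove { cols("height") }.columnNames()

// group moves the picked columns into a new column group named "body".
fun groupedNames() = people().group { cols("age", "height") }.into("body").columnNames()

fun main() {
    println(selectedNames())  // [name, age]
    println(remainingNames()) // [name, age]
    println(groupedNames())
}
```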
Full DSL Grammar
Definitions
Functions Overview
First (Col), Last (Col), Single (Col)
first {}, firstCol(), last {}, lastCol(), single {}, singleCol()
Returns the first, last, or single column from the top-level, specified column group, or ColumnSet that adheres to the optional given condition. If no column adheres to the given condition, NoSuchElementException is thrown.
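A small sketch of these three functions at the top level of the DSL, assuming a recent Kotlin DataFrame release on the classpath. The data frame and the helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun people() = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
    "Bob", 20, 175,
)

// No condition: simply the first top-level column.
fun firstColumn() = people().select { first() }.columnNames()

// With a condition: the last column whose name starts with "h".
fun lastWithH() = people().select { last { it.name().startsWith("h") } }.columnNames()

// single expects exactly one column to match the condition.
fun singleWithA() = people().select { single { it.name().startsWith("a") } }.columnNames()

fun main() {
    println(firstColumn()) // [name]
    println(lastWithH())   // [height]
    println(singleWithA()) // [age]
}
```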
Col
col(name), col(5), this[5]
Creates a ColumnAccessor (or SingleColumn) for a column with the given argument from the top-level or specified column group. The argument can be either an index (Int) or a reference to a column (String, ColumnPath, KProperty, or ColumnAccessor; any AccessApi).
Value Col, Frame Col, Col Group
valueCol(name), valueCol(5), frameCol(name), frameCol(5), colGroup(name), colGroup(5)
Creates a ColumnAccessor (or SingleColumn) for a value column/frame column/column group with the given argument from the top-level or specified column group. The argument can be either an index (Int) or a reference to a column (String, ColumnPath, KProperty, or ColumnAccessor; any AccessApi). The functions can be both typed and untyped (in case you're supplying a column name, column path, or index). These functions throw an IllegalArgumentException if the column found is not of the right kind.
Cols
cols {}, cols(), cols(colA, colB), cols(1, 5), cols(1..5), [{}], colSet[1, 3]
Creates a subset of columns (ColumnSet) from the top-level, specified column group, or ColumnSet. You can use either a ColumnFilter, or any of the vararg overloads for any AccessApi. The function can be both typed and untyped (in case you're supplying a column name, column path, or index (range)).
Note that you can also use the [] operator for most overloads of cols to achieve the same result.
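The overloads can be sketched as follows, assuming a recent Kotlin DataFrame release on the classpath. The sample data and helper names are hypothetical; column indices are zero-based.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun people() = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
    "Bob", 20, 175,
)

// By column names (String API).
fun byNames() = people().select { cols("name", "height") }.columnNames()

// By individual indices.
fun byIndices() = people().select { cols(0, 2) }.columnNames()

// By an index range.
fun byRange() = people().select { cols(0..1) }.columnNames()

// By a ColumnFilter.
fun byFilter() = people().select { cols { it.name().startsWith("a") } }.columnNames()

fun main() {
    println(byNames())   // [name, height]
    println(byIndices()) // [name, height]
    println(byRange())   // [name, age]
    println(byFilter())  // [age]
}
```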
Range of Columns
colA.."colB"
Creates a ColumnSet containing all columns from colA to colB (inclusive) from the top-level. Columns inside column groups are also supported (as long as they share the same direct parent), as well as any combination of AccessApi.
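A brief sketch of the range operator with two column names, assuming a recent Kotlin DataFrame release on the classpath. The four-column frame and the helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
fun people() = dataFrameOf("name", "age", "height", "weight")(
    "Alice", 15, 160, 54,
    "Bob", 20, 175, 70,
)

// Both ends of the range are included in the result.
fun ageToHeight() = people().select { "age".."height" }.columnNames()

fun main() {
    println(ageToHeight()) // [age, height]
}
```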
Value Columns, Frame Columns, Column Groups
valueCols {}, valueCols(), frameCols {}, frameCols(), colGroups {}, colGroups()
Creates a subset of columns (ColumnSet) from the top-level, specified column group, or ColumnSet containing only value columns/frame columns/column groups that adhere to the optional condition.
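To see the difference between the kinds, the sketch below first builds a frame with one value column and one column group, then selects by kind. It assumes a recent Kotlin DataFrame release on the classpath; the data, the group name "body", and the helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical frame: "name" stays a value column, "age" and "height" move into the group "body".
fun grouped() = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
    "Bob", 20, 175,
).group { cols("age", "height") }.into("body")

// Only top-level value columns.
fun valueNames() = grouped().select { valueCols() }.columnNames()

// Only top-level column groups.
fun groupNames() = grouped().select { colGroups() }.columnNames()

fun main() {
    println(valueNames()) // [name]
    println(groupNames()) // [body]
}
```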
Cols of Kind
colsOfKind(Value, Frame) {}, colsOfKind(Group, Frame)
Creates a subset of columns (ColumnSet) from the top-level, specified column group, or ColumnSet containing only columns of the specified kind(s) that adhere to the optional condition.
All (Cols)
all(), allCols()
Creates a ColumnSet containing all columns from the top-level, specified column group, or ColumnSet. This is the opposite of none() and equivalent to cols() without a filter. Note that on column groups, all is named allCols instead to avoid confusion.
All (Cols) After, -Before, -From, -Up To
allAfter(colA), allBefore(colA), allColsFrom(colA), allColsUpTo(colA)
Creates a ColumnSet containing a subset of columns from the top-level, specified column group, or ColumnSet. The subset includes:
all(Cols)Before(colA): All columns before the specified column, excluding that column.
all(Cols)After(colA): All columns after the specified column, excluding that column.
all(Cols)From(colA): All columns from the specified column, including that column.
all(Cols)UpTo(colA): All columns up to the specified column, including that column.
NOTE: The {} overloads of these functions in the Plain DSL and on column groups take a ColumnSelector (relative to the receiver). On ColumnSets they take a ColumnFilter instead.
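The inclusive and exclusive variants can be compared side by side. This is a minimal sketch using the String overloads at the top level, assuming a recent Kotlin DataFrame release on the classpath; the data and helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
fun people() = dataFrameOf("name", "age", "height", "weight")(
    "Alice", 15, 160, 54,
    "Bob", 20, 175, 70,
)

// Excludes the named column.
fun before() = people().select { allBefore("height") }.columnNames()
fun after() = people().select { allAfter("age") }.columnNames()

// Includes the named column.
fun from() = people().select { allFrom("height") }.columnNames()
fun upTo() = people().select { allUpTo("age") }.columnNames()

fun main() {
    println(before()) // [name, age]
    println(after())  // [height, weight]
    println(from())   // [height, weight]
    println(upTo())   // [name, age]
}
```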
Cols at any Depth
colsAtAnyDepth {}, colsAtAnyDepth()
Creates a ColumnSet containing all columns from the top-level, specified column group, or ColumnSet at any depth if they satisfy the optional given predicate. This means that columns (of all three kinds!) nested inside column groups are also included. This function can also be followed by another ColumnSet filter-function like colsOf<>(), single(), or valueCols().
For example:
Depth-first search to a column containing the value "Alice":
df.select { colsAtAnyDepth().first { "Alice" in it.values() } }
The columns at any depth excluding the top-level:
df.select { colGroups().colsAtAnyDepth() }
All value- and frame columns at any depth:
df.select { colsAtAnyDepth { !it.isColumnGroup() } }
All value columns at any depth nested under a column group named "myColGroup":
df.select { myColGroup.colsAtAnyDepth().valueCols() }
Converting from deprecated syntax:
dfs { condition } -> colsAtAnyDepth { condition }
allDfs(includeGroups = false) -> colsAtAnyDepth { includeGroups || !it.isColumnGroup() }
dfsOf<Type> { condition } -> colsAtAnyDepth().colsOf<Type> { condition }
cols { condition }.recursively() -> colsAtAnyDepth { condition }
first { condition }.rec() -> colsAtAnyDepth { condition }.first()
all().recursively() -> colsAtAnyDepth()
Cols in Groups
colsInGroups {}, colsInGroups()
Creates a ColumnSet containing all columns that are nested in the column groups at the top-level, specified column group, or ColumnSet adhering to an optional predicate. This is useful if you want to select all columns that are "one level down".
This function was previously called children().
For example:
To get the columns inside all column groups in a dataframe, instead of having to write:
df.select { colGroupA.cols() and colGroupB.cols() ... }
you can use:
df.select { colsInGroups() }
or with filter:
df.select { colsInGroups { "user" in it.name } }
Similarly, you can take the columns inside all column groups in a ColumnSet:
df.select { colGroups { "my" in it.name }.colsInGroups() }
Take (Last) (Cols) (While)
take(5), takeLastCols(2), takeLastWhile {}, takeColsWhile {}
Creates a ColumnSet containing the first / last n columns from the top-level, specified column group, or ColumnSet or those that adhere to the given condition. Note, to avoid ambiguity, take is called takeCols when called on a column group.
Drop (Last) (Cols) (While)
drop(5), dropLastCols(2), dropLastWhile {}, dropColsWhile {}
Creates a ColumnSet without the first / last n columns from the top-level, specified column group, or ColumnSet or those that adhere to the given condition. Note, to avoid ambiguity, drop is called dropCols when called on a column group.
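The take and drop families mirror each other, which a short sketch makes visible. It assumes a recent Kotlin DataFrame release on the classpath and uses the top-level names (take, takeLast, drop, dropLast); the data and helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
fun people() = dataFrameOf("name", "age", "height", "weight")(
    "Alice", 15, 160, 54,
    "Bob", 20, 175, 70,
)

// Keep the first two / the last column.
fun firstTwo() = people().select { take(2) }.columnNames()
fun lastOne() = people().select { takeLast(1) }.columnNames()

// Skip the first column / the last two columns.
fun withoutFirst() = people().select { drop(1) }.columnNames()
fun withoutLastTwo() = people().select { dropLast(2) }.columnNames()

fun main() {
    println(firstTwo())       // [name, age]
    println(lastOne())        // [weight]
    println(withoutFirst())   // [age, height, weight]
    println(withoutLastTwo()) // [name, age]
}
```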
Select from Column Group
colGroupA.select {}, "colGroupA" {}
Creates a ColumnSet containing the columns selected by a ColumnsSelector relative to the specified column group. In practice, this means you're opening a new selection DSL scope inside a column group and selecting columns from there. The selected columns are referenced individually and "unpacked" from their parent column group.
For example:
Select myColGroup.someCol and all String columns from myColGroup:
df.select { myColGroup.select { someCol and colsOf<String>() } }
df.select { "myGroupCol" { "colA" and expr("newCol") { colB + 1 } } }
df.select { "pathTo"["myGroupCol"].select { "colA" and "colB" } }
df.select { it["myGroupCol"].asColumnGroup()() { "colA" and "colB" } }
(All) (Cols) Except
colSet.except(), allExcept {}, colGroupA.allColsExcept {}
Excludes columns from the current selection; the columns to exclude are themselves chosen with a relative ColumnsSelector.
This function is best explained in parts:
On Column Sets: except {}
This function is easiest to explain with a ColumnSet. Let's say we want all Int columns apart from age and height.
We can do:
df.select { colsOf<Int>() except (age and height) }
which will 'subtract' the ColumnSet created by age and height from the ColumnSet created by colsOf<Int>().
This operation can also be used to exclude columns that are originally in column groups.
For instance, excluding userData.age:
df.select { colsAtAnyDepth { "a" in it.name() } except userData.age }
Note that the selection of columns to exclude from column sets is always done relative to the outer scope. Use the Extension Properties API to prevent scoping issues if possible.
Directly in the DSL: allExcept {}
Instead of having to write all() except { ... } in the DSL, you can use the shorthand allExcept { ... } to achieve the same result.
For example:
df.select { allExcept { userData.age and height } }
On Column Groups: allColsExcept {}
The variant of this function on column groups is a bit different, because it changes the scope to be relative to the column group. This is similar to the select function.
In other words:
df.select { myColGroup.allColsExcept { colA and colB } }
is shorthand for
df.select { myColGroup.select { allExcept { colA and colB } } }
or
df.select { myColGroup.allCols() except { myColGroup.colA and myColGroup.colB } }
Note the name change: as with allCols, this makes it clearer that you're selecting columns inside the group, 'lifting' them out.
Experimental: Except on Column Group
Selects the current column group itself, except for the specified columns. This is different from allColsExcept in that it does not 'lift' the columns out of the group, but instead selects the group itself.
These all produce the same result:
df.select { colGroup exceptNew { col } }
df.select { colGroup }.remove { colGroup.col }
df.select { cols(colGroup) except colGroup.col }
Column Name Filters
nameContains(), colsNameContains(), nameStartsWith(), colsNameEndsWith()
Creates a ColumnSet containing columns from the top-level, specified column group, or ColumnSet whose names satisfy the given function. These functions accept a String as argument, as well as an optional ignoreCase parameter. For the nameContains variant, you can also pass a Regex as an argument. Note that on column groups, the functions have names starting with cols to avoid ambiguity.
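A minimal sketch of the name filters at the top level, including the Regex overload of nameContains. It assumes a recent Kotlin DataFrame release on the classpath; the data and helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
fun people() = dataFrameOf("name", "age", "height", "weight")(
    "Alice", 15, 160, 54,
    "Bob", 20, 175, 70,
)

// Substring match on the column name.
fun containing() = people().select { nameContains("ei") }.columnNames()

// Prefix match, ignoring case.
fun startingWith() = people().select { nameStartsWith("A", ignoreCase = true) }.columnNames()

// Suffix match.
fun endingWith() = people().select { nameEndsWith("ght") }.columnNames()

// Regex match: names that begin with "a" or "n".
fun matchingRegex() = people().select { nameContains(Regex("^[an]")) }.columnNames()

fun main() {
    println(containing())    // [height, weight]
    println(startingWith())  // [age]
    println(endingWith())    // [height, weight]
    println(matchingRegex()) // [name, age]
}
```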
(Cols) Without Nulls
withoutNulls(), colsWithoutNulls()
Creates a ColumnSet containing columns from the top-level, specified column group, or ColumnSet that have no null values. This is a shorthand for cols { !it.hasNulls() }. Note, to avoid ambiguity, withoutNulls is called colsWithoutNulls when called on a column group.
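The equivalence with cols { !it.hasNulls() } can be checked with a small sketch. It assumes a recent Kotlin DataFrame release on the classpath; the frame with a nullable "nickname" column and the helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: "nickname" contains a null, "name" does not.
fun contacts() = dataFrameOf("name", "nickname")(
    "Alice", null,
    "Bob", "Bobby",
)

// Shorthand form.
fun nonNullColumns() = contacts().select { withoutNulls() }.columnNames()

// Long form described in the text.
fun nonNullColumnsLongForm() = contacts().select { cols { !it.hasNulls() } }.columnNames()

fun main() {
    println(nonNullColumns())         // [name]
    println(nonNullColumnsLongForm()) // [name]
}
```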
Distinct
colSet.distinct()
Returns a new ColumnSet from the specified ColumnSet containing only distinct columns (by path). This is useful when you've selected the same column multiple times but only want it once.
This does not cover the case where a column is selected individually and through its enclosing column group. See simplify for that.
NOTE: This doesn't solve the DuplicateColumnNamesException if you've selected two columns with the same name. For this, you'll need to rename one of the columns.
None
none()
Creates an empty ColumnSet, essentially selecting no columns at all. This is the opposite of all().
This function mostly exists for completeness, but can be useful in some very specific cases.
Cols Of
colsOf<T>(), colsOf<T> {}
Creates a ColumnSet containing columns from the top-level, specified column group, or ColumnSet whose type is a subtype of the specified type T and that adhere to the optional condition.
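The subtype rule means that a supertype such as Number also matches Int columns. A minimal sketch, assuming a recent Kotlin DataFrame release on the classpath; the data and helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: one String column and two Int columns.
fun people() = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
    "Bob", 20, 175,
)

// Exact type.
fun intColumns() = people().select { colsOf<Int>() }.columnNames()

// Supertype: Int columns are also Number columns.
fun numberColumns() = people().select { colsOf<Number>() }.columnNames()

// Type plus an extra condition on the column.
fun intColumnsWithA() = people().select { colsOf<Int> { it.name().startsWith("a") } }.columnNames()

fun main() {
    println(intColumns())      // [age, height]
    println(numberColumns())   // [age, height]
    println(intColumnsWithA()) // [age]
}
```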
Simplify
colSet.simplify()
Returns a new ColumnSet from the specified ColumnSet in 'simplified' form. This function simplifies the structure of the ColumnSet by removing columns that are already present in column groups, returning only these groups, plus columns not belonging in any of the groups.
In other words, if a column in the ColumnSet is inside a column group that is also in the ColumnSet, it will not be included in the result.
It's useful in combination with colsAtAnyDepth {}, as that function can create a ColumnSet containing both a column and the column group it's in.
This function was previously named top() and roots(); both names are deprecated.
For example:
cols(a, a.b, d.c).simplify() == cols(a, d.c)
Filter
colSet.filter {}
Returns a new ColumnSet from the specified ColumnSet containing only columns that satisfy the given condition. This function behaves the same as cols {} and [{}], but only exists on column sets.
And
colSet and colB
Creates a ColumnSet containing the columns from both the left and right side of the function. This allows you to combine selections or simply select multiple columns at once.
Any combination of AccessApi can be used on either side of the and operator.
Note, while you can write col1 and col2 and col3..., it may be more concise to use cols(col1, col2, col3...) instead. The only downside is that you can't mix Access APIs with that notation.
Rename
colA named "colB", colA into namedColAccessor
Renaming a column in the Columns Selection DSL is done by calling the infix functions named or into. They behave exactly the same, so which one to use is a matter of preference. Any combination of Access API can be used to specify the column to rename and which name should be used instead.
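Both infix functions can be sketched with the String API as follows, assuming a recent Kotlin DataFrame release on the classpath. The data, the new name "years", and the helper names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun people() = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
    "Bob", 20, 175,
)

// Rename "age" to "years" while selecting it together with "name".
fun renamedWithNamed() =
    people().select { col("name") and ("age" named "years") }.columnNames()

// The same rename expressed with into.
fun renamedWithInto() =
    people().select { col("name") and (col("age") into "years") }.columnNames()

fun main() {
    println(renamedWithNamed()) // [name, years]
    println(renamedWithInto())  // [name, years]
}
```

The rename only affects the result of the operation; the original data frame keeps its column names.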
Expr (Column Expression)
expr {}, expr("newCol") {}
Creates a temporary new column by defining an expression that computes a value for each row. You may have come across this name before in the Add DSL or toDataFrame {} DSL.
It's especially useful when you want to create a new column based on existing columns for operations like sortBy, groupBy, etc.
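As a minimal sketch, the snippet below uses expr once to add a named computed column to a selection and once as a temporary sort key. It assumes a recent Kotlin DataFrame release on the classpath; the data, the column name "ageInMonths", and the helper names are hypothetical, and row values are read with the untyped row[...] accessor plus a cast to keep the example independent of generated extension properties.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun people() = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 20,
)

// A named expression column selected next to an existing column.
fun withComputedColumn() = people()
    .select { col("name") and expr("ageInMonths") { (it["age"] as Int) * 12 } }
    .columnNames()

// An unnamed expression used only as a sort key: rows ordered by name length.
fun namesSortedByLength() = people()
    .sortBy { expr { (it["name"] as String).length } }["name"]
    .values()
    .toList()

fun main() {
    println(withComputedColumn())  // [name, ageInMonths]
    println(namesSortedByLength()) // [Bob, Alice]
}
```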
Examples
Select columns by name:
Select columns by column index:
Other column selectors:
Modify the set of selected columns: