Dataframe 0.13 Help

Column selectors

DataFrame provides a DSL for selecting an arbitrary set of columns: the Columns Selection DSL.

Column selectors are used in many operations:

df.select { age and name }
df.fillNaNs { colsAtAnyDepth().colsOf<Double>() }.withZero()
df.remove { cols { it.hasNulls() } }
df.group { cols { it.data != name } }.into { "nameless" }
df.update { city }.notNull { it.lowercase() }
df.gather { colsOf<Number>() }.into("key", "value")
df.move { name.firstName and name.lastName }.after { city }

Full DSL Grammar

Definitions

Functions Overview

First (Col), Last (Col), Single (Col)

first {}, firstCol(), last {}, lastCol(), single {}, singleCol()

Returns the first, last, or single column from the top-level, specified column group, or ColumnSet that adheres to the optional given condition. If no column adheres to the given condition, NoSuchElementException is thrown.
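As a minimal sketch of the three conditional variants at the top level, assuming kotlinx-dataframe 0.13 is on the classpath. The sample data, column names, and helper function names are made up for illustration:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data; the column names are invented for this sketch.
private val df = dataFrameOf("name", "age", "yearJoined", "yearLeft")(
    "Alice", 15, 2019, 2021,
    "Bob", 45, 2015, 2020,
)

// First top-level column whose name starts with "year".
fun firstYearColumn(): List<String> =
    df.select { first { it.name().startsWith("year") } }.columnNames()

// Last top-level column whose name starts with "year".
fun lastYearColumn(): List<String> =
    df.select { last { it.name().startsWith("year") } }.columnNames()

// The one and only top-level column named "age".
fun singleAgeColumn(): List<String> =
    df.select { single { it.name() == "age" } }.columnNames()

fun main() {
    println(firstYearColumn())
    println(lastYearColumn())
    println(singleAgeColumn())
}
```

Each helper returns the names of the selected columns, so the result of a selector is easy to inspect.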

Col

col(name), col(5), this[5]

Creates a ColumnAccessor (or SingleColumn) for a column with the given argument from the top-level or specified column group. The argument can be either an index (Int) or a reference to a column (String, ColumnPath, KProperty, or ColumnAccessor; any AccessApi).

Value Col, Frame Col, Col Group

valueCol(name), valueCol(5), frameCol(name), frameCol(5), colGroup(name), colGroup(5)

Creates a ColumnAccessor (or SingleColumn) for a value column/frame column/column group with the given argument from the top-level or specified column group. The argument can be either an index (Int) or a reference to a column (String, ColumnPath, KProperty, or ColumnAccessor; any AccessApi). The functions can be both typed and untyped (in case you're supplying a column name, -path, or index). These functions throw an IllegalArgumentException if the column found is not the right kind.

Cols

cols {}, cols(), cols(colA, colB), cols(1, 5), cols(1..5), [{}], colSet[1, 3]

Creates a subset of columns (ColumnSet) from the top-level, specified column group, or ColumnSet. You can use either a ColumnFilter, or any of the vararg overloads for any AccessApi. The function can be both typed and untyped (in case you're supplying a column name, -path, or index (range)).

Note that you can also use the [] operator for most overloads of cols to achieve the same result.
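The cols overloads described above can be sketched as follows, assuming kotlinx-dataframe 0.13 is on the classpath. The data frame and the helper function names are hypothetical:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
private val df = dataFrameOf("name", "age", "city", "weight")(
    "Alice", 15, "London", 54.5,
    "Bob", 45, "Dubai", 87.0,
)

// Select by column names.
fun colsByName(): List<String> =
    df.select { cols("name", "city") }.columnNames()

// Select by individual indices.
fun colsByIndices(): List<String> =
    df.select { cols(0, 2) }.columnNames()

// Select by a range of indices (inclusive).
fun colsByIndexRange(): List<String> =
    df.select { cols(1..2) }.columnNames()

// Select by a ColumnFilter: keep columns whose name has four characters.
fun colsByFilter(): List<String> =
    df.select { cols { it.name().length == 4 } }.columnNames()

fun main() {
    println(colsByName())
    println(colsByIndices())
    println(colsByIndexRange())
    println(colsByFilter())
}
```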

Range of Columns

colA.."colB"

Creates a ColumnSet containing all columns from colA to colB (inclusive) from the top-level. Columns inside column groups are also supported (as long as they share the same direct parent), as well as any combination of AccessApi.
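A small sketch of a column range using the String API, assuming kotlinx-dataframe 0.13 is on the classpath. The sample data and the helper name are invented for illustration:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
private val df = dataFrameOf("name", "age", "city", "weight")(
    "Alice", 15, "London", 54.5,
    "Bob", 45, "Dubai", 87.0,
)

// All columns from "age" to "weight", both ends included.
fun ageToWeight(): List<String> =
    df.select { "age".."weight" }.columnNames()

fun main() {
    println(ageToWeight())
}
```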

Value Columns, Frame Columns, Column Groups

valueCols {}, valueCols(), frameCols {}, frameCols(), colGroups {}, colGroups()

Creates a subset of columns (ColumnSet) from the top-level, specified column group, or ColumnSet containing only value columns/frame columns/column groups that adhere to the optional condition.
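To illustrate the difference between the kinds, here is a sketch that first builds a column group and then selects by kind. It assumes kotlinx-dataframe 0.13 is on the classpath; the data and helper names are hypothetical:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data; "firstName" and "lastName" are grouped under "name".
private val df = dataFrameOf("firstName", "lastName", "age", "city")(
    "Alice", "Cooper", 15, "London",
    "Bob", "Dylan", 45, "Dubai",
).group("firstName", "lastName").into("name")

// Only the top-level value columns (the group "name" is skipped).
fun topLevelValueColumns(): List<String> =
    df.select { valueCols() }.columnNames()

// Only the top-level column groups.
fun topLevelColumnGroups(): List<String> =
    df.select { colGroups() }.columnNames()

// Value columns that also satisfy a condition.
fun valueColumnsStartingWithA(): List<String> =
    df.select { valueCols { it.name().startsWith("a") } }.columnNames()

fun main() {
    println(topLevelValueColumns())
    println(topLevelColumnGroups())
    println(valueColumnsStartingWithA())
}
```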

Cols of Kind

colsOfKind(Value, Frame) {}, colsOfKind(Group, Frame)

Creates a subset of columns (ColumnSet) from the top-level, specified column group, or ColumnSet containing only columns of the specified kind(s) that adhere to the optional condition.

All (Cols)

all(), allCols()

Creates a ColumnSet containing all columns from the top-level, specified column group, or ColumnSet. This is the opposite of none() and equivalent to cols() without filter. Note, on column groups, all is named allCols instead to avoid confusion.

All (Cols) After, -Before, -From, -Up To

allAfter(colA), allBefore(colA), allColsFrom(colA), allColsUpTo(colA)

Creates a ColumnSet containing a subset of columns from the top-level, specified column group, or ColumnSet. The subset includes:

  • all(Cols)Before(colA): All columns before the specified column, excluding that column.

  • all(Cols)After(colA): All columns after the specified column, excluding that column.

  • all(Cols)From(colA): All columns from the specified column, including that column.

  • all(Cols)UpTo(colA): All columns up to the specified column, including that column.

NOTE: The {} overloads of these functions in the Plain DSL and on column groups are a ColumnSelector (relative to the receiver). On ColumnSets they are a ColumnFilter instead.
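The four variants listed above can be compared side by side in a short sketch. It assumes kotlinx-dataframe 0.13 is on the classpath and that the top-level String overloads are available under the names allAfter, allBefore, allFrom, and allUpTo; the data and helper names are made up:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
private val df = dataFrameOf("name", "age", "city", "weight")(
    "Alice", 15, "London", 54.5,
    "Bob", 45, "Dubai", 87.0,
)

// Everything after "age", excluding "age" itself.
fun afterAge(): List<String> = df.select { allAfter("age") }.columnNames()

// Everything before "city", excluding "city" itself.
fun beforeCity(): List<String> = df.select { allBefore("city") }.columnNames()

// Everything from "age" onwards, including "age".
fun fromAge(): List<String> = df.select { allFrom("age") }.columnNames()

// Everything up to "city", including "city".
fun upToCity(): List<String> = df.select { allUpTo("city") }.columnNames()

fun main() {
    println(afterAge())
    println(beforeCity())
    println(fromAge())
    println(upToCity())
}
```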

Cols at any Depth

colsAtAnyDepth {}, colsAtAnyDepth()

Creates a ColumnSet containing all columns from the top-level, specified column group, or ColumnSet at any depth if they satisfy the optional given predicate. This means that columns (of all three kinds!) nested inside column groups are also included. This function can also be followed by another ColumnSet filter-function like colsOf<>(), single(), or valueCols().

For example:

Depth-first search to a column containing the value "Alice":

df.select { colsAtAnyDepth().first { "Alice" in it.values() } }

The columns at any depth excluding the top-level:

df.select { colGroups().colsAtAnyDepth() }

All value- and frame columns at any depth:

df.select { colsAtAnyDepth { !it.isColumnGroup() } }

All value columns at any depth nested under a column group named "myColGroup":

df.select { myColGroup.colsAtAnyDepth().valueCols() }

Converting from deprecated syntax:

dfs { condition } -> colsAtAnyDepth { condition }

allDfs(includeGroups = false) -> colsAtAnyDepth { includeGroups || !it.isColumnGroup() }

dfsOf<Type> { condition } -> colsAtAnyDepth().colsOf<Type> { condition }

cols { condition }.recursively() -> colsAtAnyDepth { condition }

first { condition }.rec() -> colsAtAnyDepth { condition }.first()

all().recursively() -> colsAtAnyDepth()

Cols in Groups

colsInGroups {}, colsInGroups()

Creates a ColumnSet containing all columns that are nested in the column groups at the top-level, specified column group, or ColumnSet adhering to an optional predicate. This is useful if you want to select all columns that are "one level down".

This function was previously called children().

For example:

To get the columns inside all column groups in a dataframe, instead of having to write:

df.select { colGroupA.cols() and colGroupB.cols() ... }

you can use:

df.select { colsInGroups() }

or with filter:

df.select { colsInGroups { "user" in it.name } }

Similarly, you can take the columns inside all column groups in a ColumnSet:

df.select { colGroups { "my" in it.name }.colsInGroups() }

Take (Last) (Cols) (While)

take(5), takeLastCols(2), takeLastWhile {}, takeColsWhile {}

Creates a ColumnSet containing the first / last n columns from the top-level, specified column group, or ColumnSet or those that adhere to the given condition. Note, to avoid ambiguity, take is called takeCols when called on a column group.

Drop (Last) (Cols) (While)

drop(5), dropLastCols(2), dropLastWhile {}, dropColsWhile {}

Creates a ColumnSet without the first / last n columns from the top-level, specified column group, or ColumnSet or those that adhere to the given condition. Note, to avoid ambiguity, drop is called dropCols when called on a column group.
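The take and drop families mirror each other, which a short sketch at the top level makes visible. It assumes kotlinx-dataframe 0.13 is on the classpath; the data and helper names are hypothetical:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with four top-level columns.
private val df = dataFrameOf("name", "age", "city", "weight")(
    "Alice", 15, "London", 54.5,
    "Bob", 45, "Dubai", 87.0,
)

// Keep the first two / last two columns.
fun firstTwo(): List<String> = df.select { take(2) }.columnNames()
fun lastTwo(): List<String> = df.select { takeLast(2) }.columnNames()

// Remove the first / last column and keep the rest.
fun withoutFirst(): List<String> = df.select { drop(1) }.columnNames()
fun withoutLast(): List<String> = df.select { dropLast(1) }.columnNames()

// Keep leading columns while their names have at most four characters.
fun leadingShortNames(): List<String> =
    df.select { takeWhile { it.name().length <= 4 } }.columnNames()

fun main() {
    println(firstTwo())
    println(lastTwo())
    println(withoutFirst())
    println(withoutLast())
    println(leadingShortNames())
}
```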

Select from Column Group

colGroupA.select {}, "colGroupA" {}

Creates a ColumnSet containing the columns selected by a ColumnsSelector relative to the specified column group. In practice, this means you're opening a new selection DSL scope inside a column group and selecting columns from there. The selected columns are referenced individually and "unpacked" from their parent column group.

For example:

Select myColGroup.someCol and all String columns from myColGroup:

df.select { myColGroup.select { someCol and colsOf<String>() } }

df.select { "myGroupCol" { "colA" and expr("newCol") { colB + 1 } } }

df.select { "pathTo"["myGroupCol"].select { "colA" and "colB" } }

df.select { it["myGroupCol"].asColumnGroup()() { "colA" and "colB" } }

(All) (Cols) Except

colSet.except(), allExcept {}, colGroupA.allColsExcept {}

Perform a selection of columns using a relative ColumnsSelector to exclude from the current selection.

This function is best explained in parts:

On Column Sets: except {}

This function is easiest to explain with a ColumnSet. Let's say we want all Int columns apart from age and height.

We can do:

df.select { colsOf<Int>() except (age and height) }

which will 'subtract' the ColumnSet created by age and height from the ColumnSet created by colsOf<Int>().

This operation can also be used to exclude columns that are originally in column groups.

For instance, excluding userData.age:

df.select { colsAtAnyDepth { "a" in it.name() } except userData.age }

Note that the selection of columns to exclude from column sets is always done relative to the outer scope. Use the Extension Properties API to prevent scoping issues if possible.

Directly in the DSL: allExcept {}

Instead of writing all() except { ... } in the DSL, you can use the shorthand allExcept { ... } to achieve the same result.

For example:

df.select { allExcept { userData.age and height } }

On Column Groups: allColsExcept {}

The variant of this function on column groups is slightly different: it changes the scope so that the selection is relative to the column group. This is similar to the select function.

In other words:

df.select { myColGroup.allColsExcept { colA and colB } }

is shorthand for

df.select { myColGroup.select { allExcept { colA and colB } } }

or

df.select { myColGroup.allCols() except { myColGroup.colA and myColGroup.colB } }

Note the name change: as with allCols, it makes clear that you're selecting the columns inside the group, 'lifting' them out.

Experimental: Except on Column Group

Selects the current column group itself, except for the specified columns. This is different from allColsExcept in that it does not 'lift' the columns out of the group, but instead selects the group itself.

These all produce the same result:

df.select { colGroup exceptNew { col } }

df.select { colGroup }.remove { colGroup.col }

df.select { cols(colGroup) except colGroup.col }

Column Name Filters

nameContains(), colsNameContains(), nameStartsWith(), colsNameEndsWith()

Creates a ColumnSet containing columns from the top-level, specified column group, or ColumnSet that have names that satisfy the given function. These functions accept a String as argument, as well as an optional ignoreCase parameter. For the nameContains variant, you can also pass a Regex as an argument. Note, on column groups, the functions have names starting with cols to avoid ambiguity.

(Cols) Without Nulls

withoutNulls(), colsWithoutNulls()

Creates a ColumnSet containing columns from the top-level, specified column group, or ColumnSet that have no null values. This is a shorthand for cols { !it.hasNulls() }. Note, to avoid ambiguity, withoutNulls is called colsWithoutNulls when called on a column group.
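A brief sketch comparing the shorthand with the explicit filter it stands for, assuming kotlinx-dataframe 0.13 is on the classpath. The data and helper names are invented; the "city" column deliberately contains a null:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data; "city" has a missing value for Bob.
private val df = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, null,
)

// Shorthand: top-level columns that contain no nulls.
fun nullFreeColumns(): List<String> =
    df.select { withoutNulls() }.columnNames()

// Equivalent explicit filter.
fun nullFreeColumnsExplicit(): List<String> =
    df.select { cols { !it.hasNulls() } }.columnNames()

fun main() {
    println(nullFreeColumns())
    println(nullFreeColumnsExplicit())
}
```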

Distinct

colSet.distinct()

Returns a new ColumnSet from the specified ColumnSet containing only distinct columns (by path). This is useful when you've selected the same column multiple times but only want it once.

This does not cover the case where a column is selected individually and through its enclosing column group. See simplify for that.

NOTE: This doesn't solve the DuplicateColumnNamesException if you've selected two columns with the same name. For this, you'll need to rename one of the columns.
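As a sketch of the de-duplication described above, the selection below names "age" twice and relies on distinct() to keep it once. It assumes kotlinx-dataframe 0.13 is on the classpath; the data and helper name are hypothetical:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with three top-level columns.
private val df = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Dubai",
)

// "age" is selected twice; distinct() keeps one entry per column path.
fun distinctSelection(): List<String> =
    df.select { ("age" and "name" and "age").distinct() }.columnNames()

fun main() {
    println(distinctSelection())
}
```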

None

none()

Creates an empty ColumnSet, essentially selecting no columns at all. This is the opposite of all().

This function mostly exists for completeness, but can be useful in some very specific cases.

Cols Of

colsOf<T>(), colsOf<T> {}

Creates a ColumnSet containing columns from the top-level, specified column group, or ColumnSet that are a subtype of the specified type T and adhere to the optional condition.
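The subtype behaviour can be sketched as follows: selecting by Number picks up both Int and Double columns. This assumes kotlinx-dataframe 0.13 is on the classpath and that column types are inferred from the sample values; the data and helper names are made up:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: a String, an Int, and a Double column.
private val df = dataFrameOf("name", "age", "weight")(
    "Alice", 15, 54.5,
    "Bob", 45, 87.0,
)

// Int and Double are both subtypes of Number, so both columns match.
fun numberColumns(): List<String> =
    df.select { colsOf<Number>() }.columnNames()

// Only the String column matches.
fun stringColumns(): List<String> =
    df.select { colsOf<String>() }.columnNames()

// Type plus an extra condition on the column.
fun numberColumnsStartingWithW(): List<String> =
    df.select { colsOf<Number> { it.name().startsWith("w") } }.columnNames()

fun main() {
    println(numberColumns())
    println(stringColumns())
    println(numberColumnsStartingWithW())
}
```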

Simplify

colSet.simplify()

Returns a new ColumnSet from the specified ColumnSet in 'simplified' form. This function simplifies the structure of the ColumnSet by removing columns that are already present in column groups, returning only these groups, plus columns not belonging to any of the groups.

In other words, if a column in the ColumnSet is inside a column group that is also in the ColumnSet, it will not be included in the result.

It's useful in combination with colsAtAnyDepth {}, as that function can create a ColumnSet containing both a column and the column group it's in.

This function was previously named top() and roots(); those names have been deprecated.

For example:

cols(a, a.b, d.c).simplify() == cols(a, d.c)

Filter

colSet.filter {}

Returns a new ColumnSet from the specified ColumnSet containing only columns that satisfy the given condition. This function behaves the same as cols {} and [{}], but only exists on column sets.

And

colSet and colB

Creates a ColumnSet containing the columns from both the left and right side of the function. This allows you to combine selections or simply select multiple columns at once.

Any combination of AccessApi can be used on either side of the and operator.

Note, while you can write col1 and col2 and col3..., it may be more concise to use cols(col1, col2, col3...) instead. The only downside is that you can't mix Access APIs with that notation.

Rename

colA named "colB", colA into namedColAccessor

Renaming a column in the Columns Selection DSL is done by calling the infix functions named or into. They behave exactly the same, so it's up to contextual preference which one to use. Any combination of Access API can be used to specify the column to rename and which name should be used instead.
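A short sketch of both infix functions using the String API, assuming kotlinx-dataframe 0.13 is on the classpath. The data, the new column name "years", and the helper names are hypothetical:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with two top-level columns.
private val df = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 45,
)

// Select "age" and rename it with `named`.
fun renamedWithNamed(): List<String> =
    df.select { "age" named "years" }.columnNames()

// The same selection using `into`.
fun renamedWithInto(): List<String> =
    df.select { "age" into "years" }.columnNames()

fun main() {
    println(renamedWithNamed())
    println(renamedWithInto())
}
```

The rename only affects the selection result; the original data frame keeps its column names.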

Expr (Column Expression)

expr {}, expr("newCol") {}

Creates a temporary new column by defining an expression to fill up each row. You may have come across this name before in the Add DSL or toDataFrame {} DSL.

It's extremely useful when you want to create a new column based on existing columns for operations like sortBy, groupBy, etc.
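As a minimal sketch, the selection below combines an existing column with a named expression column. It assumes kotlinx-dataframe 0.13 is on the classpath; the data, the column name "birthYear", and the helper names are invented, and the row value is read with an explicit cast to keep the example independent of typed accessors:

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with two top-level columns.
private val df = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 45,
)

// Select "name" plus a temporary column computed per row from "age".
private val selected = df.select {
    "name" and expr("birthYear") { row -> 2021 - (row["age"] as Int) }
}

fun selectedColumnNames(): List<String> = selected.columnNames()

fun birthYears(): List<Any?> = selected["birthYear"].toList()

fun main() {
    println(selectedColumnNames())
    println(birthYears())
}
```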

Examples

Select columns by name:

// by column name
df.select { it.name }
df.select { name }
// by column path
df.select { name.firstName }
// with a new name
df.select { name named "Full Name" }
// converted
df.select { name.firstName.map { it.lowercase() } }
// column arithmetics
df.select { 2021 - age }
// two columns
df.select { name and age }
// range of columns
df.select { name..age }
// all columns of ColumnGroup
df.select { name.allCols() }
// traversal of columns at any depth from here excluding ColumnGroups
df.select { name.colsAtAnyDepth { !it.isColumnGroup() } }

// by column name
val name by columnGroup()
df.select { it[name] }
df.select { name }
// by column path
val firstName by name.column<String>()
df.select { firstName }
// with a new name
df.select { name named "Full Name" }
// converted
df.select { firstName.map { it.lowercase() } }
// column arithmetics
val age by column<Int>()
df.select { 2021 - age }
// two columns
df.select { name and age }
// range of columns
df.select { name..age }
// all columns of ColumnGroup
df.select { name.allCols() }
// traversal of columns at any depth from here excluding ColumnGroups
df.select { name.colsAtAnyDepth { !it.isColumnGroup() } }

// by column name
df.select { it["name"] }
// by column path
df.select { it["name"]["firstName"] }
df.select { "name"["firstName"] }
// with a new name
df.select { "name" named "Full Name" }
// converted
df.select { "name"["firstName"]<String>().map { it.uppercase() } }
// column arithmetics
df.select { 2021 - "age"<Int>() }
// two columns
df.select { "name" and "age" }
// by range of names
df.select { "name".."age" }
// all columns of ColumnGroup
df.select { "name".allCols() }
// traversal of columns at any depth from here excluding ColumnGroups
df.select { "name".colsAtAnyDepth { !it.isColumnGroup() } }

Select columns by column index:

// by index
df.select { col(2) }
// by several indices
df.select { cols(0, 1, 3) }
// by range of indices
df.select { cols(1..4) }

Other column selectors:

// by condition
df.select { cols { it.name().startsWith("year") } }
df.select { nameStartsWith("year") }
// by type
df.select { colsOf<String>() }
// by type with condition
df.select { colsOf<String?> { it.countDistinct() > 5 } }
// all top-level columns
df.select { all() }
// first/last n columns
df.select { take(2) }
df.select { takeLast(2) }
// all except first/last n columns
df.select { drop(2) }
df.select { dropLast(2) }
// find the first column satisfying the condition
df.select { first { it.name.startsWith("year") } }
// find the last column inside a column group satisfying the condition
df.select { colGroup("name").lastCol { it.name().endsWith("Name") } }
// find the single column inside a column group satisfying the condition
df.select { Person::name.singleCol { it.name().startsWith("first") } }
// traversal of columns at any depth from here excluding ColumnGroups
df.select { colsAtAnyDepth { !it.isColumnGroup() } }
// traversal of columns at any depth from here including ColumnGroups
df.select { colsAtAnyDepth() }
// traversal of columns at any depth with condition
df.select { colsAtAnyDepth { it.name().contains(":") } }
// traversal of columns at any depth to find columns of given type
df.select { colsAtAnyDepth().colsOf<String>() }
// all columns except given column set
df.select { allExcept { colsOf<String>() } }
// union of column sets
df.select { take(2) and col(3) }

Modify the set of selected columns:

// first/last n value- and frame columns in column set
df.select { colsAtAnyDepth { !it.isColumnGroup() }.take(3) }
df.select { colsAtAnyDepth { !it.isColumnGroup() }.takeLast(3) }
// all except first/last n value- and frame columns in column set
df.select { colsAtAnyDepth { !it.isColumnGroup() }.drop(3) }
df.select { colsAtAnyDepth { !it.isColumnGroup() }.dropLast(3) }
// filter column set by condition
df.select { colsAtAnyDepth { !it.isColumnGroup() }.filter { it.name().startsWith("year") } }
// exclude columns from column set
df.select { colsAtAnyDepth { !it.isColumnGroup() }.except { age } }
// keep only unique columns
df.select { (colsOf<Int>() and age).distinct() }
Last modified: 18 July 2024