Operations

Last modified: 24 July 2025

A data transformation pipeline usually consists of several modification operations, such as filtering, sorting, grouping, pivoting, and adding or removing columns. The Kotlin DataFrame API is designed in a functional style, so the whole processing pipeline can be expressed as a single statement with a sequential chain of operations. A DataFrame object is immutable, and every operation returns a new DataFrame instance that reuses the underlying data structures as much as possible.

df.update { age }.where { city == "Paris" }.with { it - 5 }
    .filter { isHappy && age > 100 }
    .move { name.firstName and name.lastName }.after { isHappy }
    .merge { age and weight }.by { "Age: ${it[0]}, weight: ${it[1]}" }.into("info")
    .rename { isHappy }.into("isOK")
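
The chain above relies on extension properties generated for the "people" dataset. As a smaller, self-contained sketch of the same idea, the following uses a hypothetical three-row frame and String-based column access to show that each step returns a new DataFrame while the original stays unchanged. It assumes a recent Kotlin DataFrame release is on the classpath; the sample data and the function names sampleFrame and adultsByAge are illustrative only.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, not the "people" dataset from the guide.
fun sampleFrame() = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

// Each call returns a new DataFrame; the receiver is never modified.
fun adultsByAge(df: DataFrame<*>) = df
    .filter { "age"<Int>() >= 18 }   // keep rows where age is at least 18
    .sortByDesc("age")               // oldest first
    .remove("city")                  // drop a column

fun main() {
    val df = sampleFrame()
    val result = adultsByAge(df)

    println(df.rowsCount())        // 3: the source frame is untouched
    println(result.rowsCount())    // 2
    println(result.columnNames())  // [name, age]
}
```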

tip
You can experiment with the "people" dataset used in this guide here

Multiplex operations

Simple operations (such as filter or select) return a new DataFrame immediately, while more complex operations return an intermediate object that is used to configure the operation further. We call such operations multiplex.

Every multiplex operation configuration consists of:

a column selector that selects the target columns for the operation
additional configuration functions
a terminal function that returns the modified DataFrame

Most multiplex operations end with an into or with function. The following naming convention is used:

into defines column names for storing operation results. Used in move, group, split, merge, gather, groupBy, rename.
with defines a row-wise data transformation using a row expression. Used in update, convert, replace, pivot.
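
The three parts of a multiplex operation can be sketched as follows. This is a minimal illustration, assuming String-based column access and hypothetical sample data; the function names discountParisAges and renameCity are not part of the library.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun people() = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
)

// update: column selector, then `where` as additional configuration,
// then the terminal `with` that returns the modified DataFrame.
fun discountParisAges(df: DataFrame<*>) = df
    .update { "age"<Int>() }
    .where { "city"<String>() == "Paris" }
    .with { it - 5 }

// rename: column selector, then the terminal `into` that names the result.
fun renameCity(df: DataFrame<*>) = df
    .rename { "city"<String>() }
    .into("town")

fun main() {
    val updated = discountParisAges(people())
    println(updated["age"].toList())                // [15, 40]
    println(renameCity(updated).columnNames())      // [name, age, town]
}
```

Until the terminal function is called, the intermediate object is not a DataFrame, so it cannot be passed to operations such as filter.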

List of DataRow operations

index(): Int — sequential row number in DataFrame, starts from 0
prev(): DataRow? — previous row (null for the first row)
next(): DataRow? — next row (null for the last row)
diff(T) { rowExpression }: T / diffOrNull { rowExpression }: T? — difference between the results of a row expression calculated for current and previous rows
explode(columns): DataFrame<T> — spread lists and DataFrame objects vertically into new rows
values(): List<Any?> — list of all cell values from the current row
valuesOf<T>(): List<T> — list of values of the given type
columnsCount(): Int — number of columns
columnNames(): List<String> — list of all column names
columnTypes(): List<KType> — list of all column types
namedValues(): List<NameValuePair<Any?>> — list of name-value pairs where name is a column name and value is cell value
namedValuesOf<T>(): List<NameValuePair<T>> — list of name-value pairs where value has given type
transpose(): DataFrame<NameValuePair<*>> — DataFrame of two columns: name: String is column names and value: Any? is cell values
transposeTo<T>(): DataFrame<NameValuePair<T>> — DataFrame of two columns: name: String is column names and value: T is cell values
getRow(Int): DataRow — row from DataFrame by row index
getRows(Iterable<Int>): DataFrame — DataFrame with subset of rows selected by absolute row index.
relative(Iterable<Int>): DataFrame — DataFrame with subset of rows selected by relative row index: relative(-1..1) will return previous, current and next row. Requested indices will be coerced to the valid range and invalid indices will be skipped
getValue<T>(columnName) — cell value of type T by this row and given columnName
getValueOrNull<T>(columnName) — cell value of type T? by this row and given columnName or null if there's no such column
get(column): T — cell value by this row and given column
String.invoke<T>(): T — cell value of type T in this row, taken from the column whose name is the receiver String
ColumnPath.invoke<T>(): T — cell value of type T in this row, taken from the column at the receiver ColumnPath
ColumnReference.invoke(): T — cell value of type T in this row, taken from the column that the receiver references
df() — DataFrame that current row belongs to
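
A few of these row operations are typically used inside row expressions. The sketch below, on hypothetical temperature data, combines index(), prev(), getValue and String.invoke inside add to compute a day-over-day change. The function name withDeltas and the column names are illustrative assumptions.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun temperatures() = dataFrameOf("day", "temperature")(
    "Mon", 18,
    "Tue", 21,
    "Wed", 19,
)

fun withDeltas(df: DataFrame<*>) = df
    // index() is the sequential row number, starting from 0.
    .add("rowNumber") { index() }
    .add("delta") {
        // prev() is null for the first row, so the delta falls back to 0 there.
        val previous = prev()?.getValue<Int>("temperature")
        if (previous == null) 0 else "temperature"<Int>() - previous
    }

fun main() {
    val result = withDeltas(temperatures())
    println(result["rowNumber"].toList()) // [0, 1, 2]
    println(result["delta"].toList())     // [0, 3, -2]
}
```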

List of DataRow statistics

The following statistics are available for DataRow:

rowSum
rowMean
rowStd

These statistics will be applied only to values of appropriate types, and incompatible values will be ignored. For example, if a dataframe has columns of types String and Int, rowSum() will compute the sum of the Int values in the row and ignore String values.

To apply statistics only to values of a particular type, use the -Of versions:

rowSumOf<T>
rowMeanOf<T>
rowStdOf<T>
rowMinOf<T>
rowMaxOf<T>
rowMedianOf<T>
rowPercentileOf<T>
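
As a rough sketch of the typed variants, the following sums only the Int cells of each row in a hypothetical grades frame. It assumes rowSumOf<T>() can be called on a row without arguments, as the list above suggests; exact signatures and return types may differ between library versions.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: one String column and two Int columns.
fun grades() = dataFrameOf("name", "math", "physics")(
    "Alice", 4, 5,
    "Bob", 3, 2,
)

// Sum of the Int cells in every row; the String "name" cell is not an Int
// and therefore does not take part in the sum.
fun totals(df: DataFrame<*>) = df.rows().map { it.rowSumOf<Int>() }

fun main() {
    println(totals(grades()))
}
```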

List of DataFrame operations

add — add columns
addId — add id column
append — add rows
columns/columnNames/columnTypes — get list of top-level columns, column names or column types
columnsCount — number of top-level columns
concat — union rows from several DataFrame objects
convert — change column values and/or column types
corr — pairwise correlation of columns
count — number of rows that match condition
countDistinct — number of unique rows
cumSum — cumulative sum of column values
describe — basic column statistics
distinct/distinctBy — remove duplicated rows
drop/dropLast/dropWhile/dropNulls/dropNA/dropNaNs — remove rows by condition
duplicate — duplicate rows
explode — spread lists and DataFrame objects vertically into new rows
fillNulls/fillNaNs/fillNA — replace missing values
filter/filterBy — filter rows by condition
first/firstOrNull — find first row by condition
flatten — remove column groupings recursively
forEachRow/forEachColumn — iterate over rows or columns
format — conditional formatting for cell rendering
gather — convert pairs of column names and values into new columns
getColumn/getColumnOrNull/getColumnGroup/getColumns — get one or several columns
group — group columns into ColumnGroup
groupBy — group rows by key columns
head — get first 5 rows of DataFrame
implode — collapse column values into lists grouping by other columns
inferType — infer column type from column values
insert — insert column
join — join two DataFrame objects by key columns
joinWith — join two DataFrame objects by an expression that evaluates joined DataRows to Boolean
last/lastOrNull — find last row by condition
map — map columns into new DataFrame or DataColumn
max/maxBy/maxOf/maxFor — max of values
mean/meanOf/meanFor — average of values
median/medianOf/medianFor — median of values
merge — merge several columns into one
min/minBy/minOf/minFor — min of values
move — move columns or change column groupings
parse — try to convert strings into other types
pivot/pivotCounts/pivotMatches — convert values into new columns
remove — remove columns
rename — rename columns
reorder/reorderColumnsBy/reorderColumnsByName — reorder columns
replace — replace columns
reverse — reverse rows
rows/rowsReversed — get rows in direct or reversed order
rowsCount — number of rows
schema — schema of columns: names, types and hierarchy
select — select subset of columns
shuffle — reorder rows randomly
single/singleOrNull — get single row by condition
sortBy/sortByDesc/sortWith — sort rows
split — split column values into new rows/columns or in place into lists
std/stdOf/stdFor — standard deviation of values
sum/sumOf/sumFor — sum of values
take/takeLast/takeWhile — get first/last rows
toList/toListOf — export DataFrame into a list of data classes
toMap — export DataFrame into a map from column names to column values
unfold — unfold objects (normal class instances) in columns according to their properties
ungroup — remove column groupings
update — update column values preserving column types
values — Sequence of values traversed by row or by column
valueCounts — counts for unique values
xs — slice DataFrame by given key values
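
To give a feel for how a few of these operations combine, here is a small sketch on hypothetical data using groupBy, count, sortBy, select and take with String column names. The function names are illustrative.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data.
fun cityFrame() = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

// groupBy + count: one row per distinct city plus a "count" column.
fun rowsPerCity(df: DataFrame<*>) = df.groupBy("city").count()

// sortBy + select + take: names of the two youngest people.
fun twoYoungest(df: DataFrame<*>) = df.sortBy("age").select("name").take(2)

fun main() {
    val df = cityFrame()
    println(rowsPerCity(df)["count"].toList()) // [1, 2]
    println(twoYoungest(df)["name"].toList())  // [Alice, Charlie]
}
```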

Shortcut operations

Some operations are shortcuts for more general operations:

rename, group, flatten are special cases of move
valueCounts is a special case of groupBy
pivotCounts, pivotMatches are special cases of pivot
fillNulls, fillNaNs, fillNA are special cases of update
convert is a special case of replace

You can use these shortcuts to apply the most common DataFrame transformations more easily, and you can always fall back to the general operations if you need more customization.
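
For instance, the relationship between fillNulls and update can be sketched as below: both functions replace missing ages with 0 in a hypothetical frame and are expected to give the same values. The function names viaShortcut and viaUpdate are illustrative, and the sketch assumes String-based column access.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data with one missing age.
fun nullableAges() = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", null,
)

// Shortcut form: fillNulls selects the null cells for us.
fun viaShortcut(df: DataFrame<*>) = df
    .fillNulls { "age"<Int?>() }
    .with { 0 }

// General form: update with an explicit null check in `where`.
fun viaUpdate(df: DataFrame<*>) = df
    .update { "age"<Int?>() }
    .where { it == null }
    .with { 0 }

fun main() {
    println(viaShortcut(nullableAges())["age"].toList()) // [15, 0]
    println(viaUpdate(nullableAges())["age"].toList())   // [15, 0]
}
```

The general form leaves room for conditions that the shortcut does not cover, such as replacing nulls only in rows that match another predicate.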
