Dataframe 0.13 Help

Operations Overview

A data transformation pipeline usually consists of several modification operations, such as filtering, sorting, grouping, pivoting, and adding or removing columns. The Kotlin DataFrame API is designed in a functional style so that the whole processing pipeline can be written as a single statement with a sequential chain of operations. A DataFrame object is immutable, and every operation returns a new DataFrame instance that reuses the underlying data structures as much as possible.

df.update { age }.where { city == "Paris" }.with { it - 5 }
    .filter { isHappy && age > 100 }
    .move { name.firstName and name.lastName }.after { isHappy }
    .merge { age and weight }.by { "Age: ${it[0]}, weight: ${it[1]}" }.into("info")
    .rename { isHappy }.into("isOK")
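The chain above relies on column properties that are not declared in the snippet. As a rough, self-contained sketch of the same idea, the version below declares column accessors by hand with column<T>() and runs a shortened chain on invented sample data. The sample values, the function names samplePeople and runPipeline, and the age threshold of 18 are assumptions made for illustration; the Kotlin DataFrame library is assumed to be on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, invented for illustration.
fun samplePeople(): DataFrame<*> = dataFrameOf("name", "age", "weight", "city", "isHappy")(
    "Alice", 15, 54, "London", true,
    "Bob", 45, 87, "Paris", true,
    "Charlie", 20, 70, "Paris", false,
)

fun runPipeline(df: DataFrame<*>): DataFrame<*> {
    // Hand-declared column accessors, standing in for the properties used in the text.
    val age by column<Int>()
    val weight by column<Int>()
    val city by column<String>()
    val isHappy by column<Boolean>()

    return df
        // Subtract 5 from age, but only in rows where city is "Paris".
        .update { age }.where { city() == "Paris" }.with { it - 5 }
        // Keep happy people older than 18 (threshold chosen for the sample data).
        .filter { isHappy() && age() > 18 }
        // Combine age and weight into a single String column named "info".
        .merge { age and weight }.by { "Age: ${it[0]}, weight: ${it[1]}" }.into("info")
        // Rename isHappy to isOK.
        .rename { isHappy }.into("isOK")
}

fun main() {
    println(runPipeline(samplePeople()))
}
```

Each step receives the DataFrame produced by the previous one, so the original samplePeople() value is left unchanged.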

Multiplex operations

Simple operations (such as filter or select) return a new DataFrame immediately, while more complex operations return an intermediate object that is used to configure the operation further. We call such operations multiplex.

Every multiplex operation configuration consists of:

  • a column selector that selects the target columns for the operation

  • additional configuration functions

  • a terminal function that returns the modified DataFrame
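The three parts listed above can be sketched with update and rename. This is a minimal illustration, not a reference implementation: the sample data and the names sampleCities, lowerParisAges and renameAge are hypothetical, and the accessors are declared by hand with column<T>(). The intermediate results are stored in separate values only to make each stage visible.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, invented for illustration.
fun sampleCities(): DataFrame<*> = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

fun lowerParisAges(df: DataFrame<*>): DataFrame<*> {
    val age by column<Int>()
    val city by column<String>()

    // 1. Column selector: yields an intermediate object, not a DataFrame.
    val selected = df.update { age }
    // 2. Additional configuration: restrict the update to matching rows.
    val configured = selected.where { city() == "Paris" }
    // 3. Terminal function: computes the new values and returns a DataFrame.
    return configured.with { it - 5 }
}

fun renameAge(df: DataFrame<*>): DataFrame<*> {
    val age by column<Int>()
    // Column selector followed directly by the terminal function `into`.
    return df.rename { age }.into("years")
}

fun main() {
    println(lowerParisAges(sampleCities()))
    println(renameAge(sampleCities()))
}
```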

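Most multiplex operations end with an into or with function. The following naming convention is used:

  • into: names the column or columns where the result of the operation is stored, as in merge { ... }.into("info") and rename { ... }.into("isOK")

  • with: takes an expression that computes the new values, as in update { ... }.with { it - 5 }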

List of DataRow operations

  • index(): Int — sequential row number in DataFrame, starts from 0

  • prev(): DataRow? — previous row (null for the first row)

  • next(): DataRow? — next row (null for the last row)

  • diff(T) { rowExpression }: T / diffOrNull { rowExpression }: T? — difference between the results of a row expression calculated for current and previous rows

  • explode(columns): DataFrame<T> — spread lists and DataFrames vertically into new rows

  • values(): List<Any?> — list of all cell values from the current row

  • valuesOf<T>(): List<T> — list of values of the given type

  • columnsCount(): Int — number of columns

  • columnNames(): List<String> — list of all column names

  • columnTypes(): List<KType> — list of all column types

  • namedValues(): List<NameValuePair<Any?>> — list of name-value pairs where name is a column name and value is cell value

  • namedValuesOf<T>(): List<NameValuePair<T>> — list of name-value pairs where value has given type

  • transpose(): DataFrame<NameValuePair<*>> — dataframe with two columns: name: String holds the column names and value: Any? holds the cell values

  • transposeTo<T>(): DataFrame<NameValuePair<T>> — dataframe with two columns: name: String holds the column names and value: T holds the cell values

  • getRow(Int): DataRow — row from DataFrame by row index

  • getRows(Iterable<Int>): DataFrame — dataframe with a subset of rows selected by absolute row index

  • relative(Iterable<Int>): DataFrame — dataframe with subset of rows selected by relative row index: relative(-1..1) will return previous, current and next row. Requested indices will be coerced to the valid range and invalid indices will be skipped

  • getValue<T>(columnName) — cell value of type T in this row for the given columnName

  • getValueOrNull<T>(columnName) — cell value of type T? in this row for the given columnName, or null if there is no such column

  • get(column): T — cell value in this row for the given column

  • String.invoke<T>(): T — cell value of type T in this row, where the receiver String is the column name

  • ColumnPath.invoke<T>(): T — cell value of type T in this row, where the receiver is the column path

  • ColumnReference.invoke(): T — cell value of type T in this row, where the receiver is the column reference

  • df() — DataFrame that the current row belongs to
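A few of the row operations listed above can be tried as follows. This is a small sketch on invented data; the names sampleRows, neighbourNames and withAgeChange are hypothetical. The second function computes the change from the previous row by calling prev() explicitly, which mirrors what the list describes for diffOrNull rather than calling it.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data, invented for illustration.
fun sampleRows(): DataFrame<*> = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Paris",
    "Charlie", 20, "Paris",
)

// Names found in the rows before and after the given row (null at the edges).
fun neighbourNames(df: DataFrame<*>, rowIndex: Int): Pair<String?, String?> {
    val row = df[rowIndex]
    val before = row.prev()?.getValue<String>("name")
    val after = row.next()?.getValue<String>("name")
    return before to after
}

// Adds a column with the difference between the current and previous age;
// the first row has no previous row, so its value is null.
fun withAgeChange(df: DataFrame<*>): DataFrame<*> =
    df.add("ageChange") {
        prev()?.let { p -> getValue<Int>("age") - p.getValue<Int>("age") }
    }

fun main() {
    val row = sampleRows()[1]
    println(row.index())        // sequential row number, starting from 0
    println(row.columnNames())  // names of all columns
    println(row.values())       // all cell values of this row
    println(neighbourNames(sampleRows(), 1))
    println(withAgeChange(sampleRows()))
}
```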

List of DataRow statistics

The following statistics are available for DataRow:

  • rowMax

  • rowMin

  • rowSum

  • rowMean

  • rowStd

  • rowMedian

These statistics are applied only to values of appropriate types; incompatible values are ignored. For example, if a DataFrame has columns of type String and Int, rowSum() computes the sum of the Int values in a row and ignores the String values.

To apply statistics only to values of a particular type, use the -Of versions:

  • rowMaxOf<T>

  • rowMinOf<T>

  • rowSumOf<T>

  • rowMeanOf<T>

  • rowMedianOf<T>
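As a brief sketch of the typed variants, the example below sums and maximizes only the Int cells of a row that also contains a String. The one-row sample and the helper names sampleMeasurements, intSum and intMax are assumptions made for illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical one-row sample mixing a String column with two Int columns.
fun sampleMeasurements(): DataFrame<*> = dataFrameOf("name", "age", "height")(
    "Alice", 15, 160,
)

// Sum of the Int cells only; the String cell is not considered.
fun intSum(row: DataRow<*>): Int = row.rowSumOf<Int>()

// Largest Int cell in the row.
fun intMax(row: DataRow<*>): Int = row.rowMaxOf<Int>()

fun main() {
    val row = sampleMeasurements()[0]
    println(intSum(row))
    println(intMax(row))
    // Untyped variant: applied to every value of an appropriate type.
    println(row.rowSum())
}
```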

List of DataFrame operations

Shortcut operations

Some operations are shortcuts for more general operations:

These shortcuts make the most common DataFrame transformations easier to apply, and you can always fall back to the general operations if you need more customization.

Last modified: 29 March 2024