Operations Overview
A data transformation pipeline usually consists of several modification operations, such as filtering, sorting, grouping, pivoting, and adding or removing columns. The Kotlin DataFrame API is designed in a functional style so that the whole processing pipeline can be represented as a single statement with a sequential chain of operations. A DataFrame object is immutable, and every operation returns a new DataFrame instance that reuses the underlying data structures as much as possible.
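As a rough illustration of such a chain, the sketch below filters, sorts, and selects in one statement. It assumes the Kotlin DataFrame library (org.jetbrains.kotlinx:dataframe) is on the classpath; the sample data and the names people and adults are made up for this example.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data; column names are chosen for illustration only.
val people = dataFrameOf("name", "age", "city")(
    "Alice", 15, "London",
    "Bob", 45, "Dubai",
    "Charlie", 20, "Moscow",
)

// One statement, three chained operations. Each call returns a new DataFrame.
val adults = people
    .filter { "age"<Int>() >= 18 } // keep rows where age is at least 18
    .sortBy("age")                 // order the remaining rows by age
    .select("name", "age")         // keep only two columns

fun main() {
    println(adults)
    // The source frame is not modified by the chain above.
    println(people.rowsCount())
}
```

Because people is never mutated, it can be reused as the starting point for other pipelines.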
Multiplex operations
Simple operations (such as filter or select) return a new DataFrame immediately, while more complex operations return an intermediate object that is used for further configuration of the operation. Let's call such operations multiplex.
Every multiplex operation configuration consists of:
column selector that is used to select target columns for the operation
additional configuration functions
terminal function that returns the modified DataFrame
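The three parts can be seen in a small sketch like the one below, which assumes the Kotlin DataFrame library is on the classpath. The sample data and the column name fullName are hypothetical; update and rename are used only as two examples of multiplex operations.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data for illustration.
val people = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 45,
)

val updated = people
    .update { colsOf<Int>() } // column selector: target all Int columns
    .where { it >= 18 }       // additional configuration: only matching cells
    .with { it + 1 }          // terminal function: returns the modified DataFrame

// Another multiplex operation: column selector first, then a terminal function.
val renamed = updated.rename("name").into("fullName")

fun main() {
    println(updated)
    println(renamed)
}
```

Until the terminal function is called, the intermediate object is not a DataFrame, so frame operations such as filter cannot be chained onto it yet.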
Most multiplex operations end with an into or with function. The following naming convention is used:
List of DataRow operations
index(): Int — sequential row number in DataFrame, starts from 0
prev(): DataRow? — previous row (null for the first row)
next(): DataRow? — next row (null for the last row)
diff(T) { rowExpression }: T / diffOrNull { rowExpression }: T? — difference between the results of a row expression calculated for current and previous rows
explode(columns): DataFrame<T> — spread lists and DataFrame objects vertically into new rows
values(): List<Any?> — list of all cell values from the current row
valuesOf<T>(): List<T> — list of values of the given type
columnsCount(): Int — number of columns
columnNames(): List<String> — list of all column names
columnTypes(): List<KType> — list of all column types
namedValues(): List<NameValuePair<Any?>> — list of name-value pairs where name is a column name and value is cell value
namedValuesOf<T>(): List<NameValuePair<T>> — list of name-value pairs where value has given type
transpose(): DataFrame<NameValuePair<*>> — DataFrame of two columns: name: String is column names and value: Any? is cell values
transposeTo<T>(): DataFrame<NameValuePair<T>> — DataFrame of two columns: name: String is column names and value: T is cell values
getRow(Int): DataRow — row from DataFrame by row index
getRows(Iterable<Int>): DataFrame — DataFrame with subset of rows selected by absolute row index
relative(Iterable<Int>): DataFrame — DataFrame with subset of rows selected by relative row index: relative(-1..1) will return previous, current and next row. Requested indices will be coerced to the valid range and invalid indices will be skipped
getValue<T>(columnName) — cell value of type T by this row and given columnName
getValueOrNull<T>(columnName) — cell value of type T? by this row and given columnName or null if there's no such column
get(column): T — cell value by this row and given column
String.invoke<T>(): T — cell value of type T by this row and given this column name
ColumnPath.invoke<T>(): T — cell value of type T by this row and given this column path
ColumnReference.invoke(): T — cell value of type T by this row and given this column
df() — DataFrame that current row belongs to
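A few of these row operations are typically used inside row expressions. The sketch below is one possible illustration, assuming the Kotlin DataFrame library is on the classpath; the readings data and the change column are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data for illustration.
val readings = dataFrameOf("day", "temp")(
    "Mon", 10,
    "Tue", 14,
    "Wed", 11,
)

// Inside the row expression, the receiver is the current DataRow.
val withChange = readings.add("change") {
    // prev() is null for the first row, so the safe call yields null there.
    val previous = prev()?.getValue<Int>("temp")
    if (previous == null) 0 else getValue<Int>("temp") - previous
}

// index() reports the sequential position of a row in its DataFrame.
val secondRowIndex = readings[1].index()

fun main() {
    println(withChange)
    println(secondRowIndex)
}
```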
List of DataRow statistics
The following statistics are available for DataRow:
rowMax
rowMin
rowSum
rowMean
rowStd
rowMedian
These statistics are applied only to values of appropriate types; incompatible values are ignored. For example, if a DataFrame has columns of type String and Int, rowSum() will successfully compute the sum of the Int values in a row and ignore the String values.
To apply statistics only to values of a particular type, use the -Of versions:
rowMaxOf<T>
rowMinOf<T>
rowSumOf<T>
rowMeanOf<T>
rowMedianOf<T>
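A minimal sketch of the typed versions is shown below, assuming the Kotlin DataFrame library is on the classpath. The sample data is hypothetical, and exact return types of the untyped statistics may differ between library versions, so only the -Of forms are used here.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: one String column and two Int columns.
val mixed = dataFrameOf("label", "a", "b")(
    "x", 1, 7,
    "y", 5, 2,
)

val firstRow = mixed[0]

// Only Int cells of the row take part; the String cell "x" is skipped.
val intSum = firstRow.rowSumOf<Int>()
val intMax = firstRow.rowMaxOf<Int>()

fun main() {
    println(intSum)
    println(intMax)
}
```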
List of DataFrame operations
add — add columns
addId — add id column
append — add rows
columns/columnNames/columnTypes — get list of top-level columns, column names or column types
columnsCount — number of top-level columns
convert — change column values and/or column types
corr — pairwise correlation of columns
count — number of rows that match condition
countDistinct — number of unique rows
cumSum — cumulative sum of column values
describe — basic column statistics
distinct/distinctBy — remove duplicated rows
drop/dropLast/dropWhile/dropNulls/dropNA/dropNaNs — remove rows by condition
duplicate — duplicate rows
explode — spread lists and DataFrame objects vertically into new rows
first/firstOrNull — find first row by condition
flatten — remove column groupings recursively
forEachRow/forEachColumn — iterate over rows or columns
format — conditional formatting for cell rendering
gather — convert pairs of column names and values into new columns
getColumn/getColumnOrNull/getColumnGroup/getColumns — get one or several columns
group — group columns into ColumnGroup
groupBy — group rows by key columns
implode — collapse column values into lists grouping by other columns
inferType — infer column type from column values
insert — insert column
joinWith — join two DataFrame objects by an expression that evaluates joined DataRows to Boolean
last/lastOrNull — find last row by condition
map — map columns into new DataFrame or DataColumn
merge — merge several columns into one
move — move columns or change column groupings
parse — try to convert strings into other types
pivot/pivotCounts/pivotMatches — convert values into new columns
remove — remove columns
rename — rename columns
reorder/reorderColumnsBy/reorderColumnsByName — reorder columns
replace — replace columns
reverse — reverse rows
rows/rowsReversed — get rows in direct or reversed order
rowsCount — number of rows
schema — schema of columns: names, types and hierarchy
select — select subset of columns
shuffle — reorder rows randomly
single/singleOrNull — get single row by condition
sortBy/sortByDesc/sortWith — sort rows
split — split column values into new rows/columns or in place into lists
toList/toListOf — export DataFrame into a list of data classes
toMap — export DataFrame into a map from column names to column values
unfold — unfold objects (normal class instances) in columns according to their properties
ungroup — remove column groupings
update — update column values preserving column types
valueCounts — counts for unique values
Shortcut operations
Some operations are shortcuts for more general operations:
valueCounts is a special case of groupBy
pivotCounts, pivotMatches are special cases of pivot
You can use these shortcuts to apply the most common DataFrame transformations more easily, but you can always fall back to the general operations if you need more customization.
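To make the relationship concrete, the sketch below counts values once with the shortcut and once with the general operation. It assumes the Kotlin DataFrame library is on the classpath, and the visits data is hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: a single column with repeated values.
val visits = dataFrameOf("city")("London", "Dubai", "London")

// Shortcut: counts per unique value of the city column.
val viaShortcut = visits.valueCounts("city")

// General form: group rows by city, then count rows in each group.
val viaGroupBy = visits.groupBy("city").count()

fun main() {
    println(viaShortcut)
    println(viaGroupBy)
}
```

The general form is the place to go when more than a count is needed, for example when several aggregations per group are required.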