Dataframe 0.14 Help

Access APIs

By nature, data frames are dynamic objects; column labels depend on the input source and new columns can be added or deleted while wrangling. Kotlin, in contrast, is a statically typed language where all types are defined and verified ahead of execution.

That's why creating a flexible, handy, and, at the same time, safe API to a data frame is tricky.

In the Kotlin DataFrame library, we provide four different ways to access columns, and, while they are essentially different, they look pretty similar in the data wrangling DSL.

List of Access APIs

Here's a list of all APIs in order of increasing safety.

  • String API
    Columns are accessed by a string representing their name. Both type-checking and name-checking are done at runtime.

  • Column Accessors API
    Every column has a descriptor: a variable that represents its name and type.

  • KProperties API
    Columns are accessed by the KProperty of some class. The name and type of the column should match the name and type of the property, respectively.

  • Extension Properties API
    Extension access properties are generated based on the dataframe schema. The name and type of properties are inferred from the name and type of the corresponding columns.

Example

Here's an example of how the same operations can be performed via different Access APIs:

String API:

DataFrame.read("titanic.csv")
    .add("lastName") { "name"<String>().split(",").last() }
    .dropNulls("age")
    .filter {
        "survived"<Boolean>() &&
            "home"<String>().endsWith("NY") &&
            "age"<Int>() in 10..20
    }

Column Accessors API:

val survived by column<Boolean>()
val home by column<String>()
val age by column<Int?>()
val name by column<String>()
val lastName by column<String>()

DataFrame.read("titanic.csv")
    .add(lastName) { name().split(",").last() }
    .dropNulls { age }
    .filter { survived() && home().endsWith("NY") && age()!! in 10..20 }

KProperties API:

data class Passenger(
    val survived: Boolean,
    val home: String,
    val age: Int,
    val lastName: String
)

val passengers = DataFrame.read("titanic.csv")
    .add(Passenger::lastName) { "name"<String>().split(",").last() }
    .dropNulls(Passenger::age)
    .filter {
        it[Passenger::survived] &&
            it[Passenger::home].endsWith("NY") &&
            it[Passenger::age] in 10..20
    }
    .toListOf<Passenger>()

Extension Properties API:

val df /* : AnyFrame */ = DataFrame.read("titanic.csv")

df.add("lastName") { name.split(",").last() }
    .dropNulls { age }
    .filter { survived && home.endsWith("NY") && age in 10..20 }

The titanic.csv file can be found here.

The String API is the simplest and least safe of them all. Its main advantage is that it can be used at any time, including when accessing new columns in chain calls. So we can write something like:

df.add("weight") { ... } // add a new column `weight`, calculated by some expression
    .sortBy("weight") // sort dataframe rows by its values

We don't need to interrupt a function call chain and declare a column accessor or generate new properties.
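As a self-contained illustration of such a chain, here is a minimal sketch. The sample data, the `height` column, and the `height - 100` formula are made up for this example; only `dataFrameOf`, `add`, `sortBy`, and string-based column access come from the library. The second function shows the price of this flexibility: a misspelled column name is only detected when the code runs.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical sample data: names and heights in centimeters.
fun namesByWeight(): List<Any?> {
    val df = dataFrameOf("name", "height")(
        "Alice", 170,
        "Bob", 160
    )
    return df
        .add("weight") { "height"<Int>() - 100 } // made-up formula, for illustration only
        .sortBy("weight")                        // the new column is usable right away
        .get("name")
        .toList()
}

// A typo in a column name compiles fine and only fails at runtime.
fun misspelledNameFails(): Boolean {
    val df = dataFrameOf("name", "height")("Alice", 170)
    return runCatching { df["hieght"] }.isFailure
}

fun main() {
    println(namesByWeight())
    println(misspelledNameFails())
}
```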

In contrast, generated extension properties form the most convenient and the safest API. Using them, you can always be sure that you work with correct data and types. However, there's a bottleneck at the moment of generation. To get new extension properties, you have to run a cell in a notebook, which could lead to unnecessary variable declarations. Currently, we are working on a compiler plugin that generates these properties on the fly while typing!

The Column Accessors API is a trade-off between safety and the need to declare column names and types ahead of execution. It was designed to make it easier to write code in an IDE, without the notebook experience. It provides type-safe access to columns but doesn't ensure that the columns really exist in a particular data frame.
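The following sketch illustrates both halves of that trade-off, assuming the `column<T>()` delegate available in this version of the library. The sample data and the `height` accessor are hypothetical. The accessor's type is checked by the compiler, but whether a matching column exists is only known when the expression is evaluated, so the exact failure is caught generically here.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Accessors carry a name (taken from the property) and a type.
val name by column<String>()
val age by column<Int>()
val height by column<Double>() // declared, but absent from the sample frame below

// Hypothetical sample data.
private fun sample() = dataFrameOf("name", "age")(
    "Alice", 15,
    "Bob", 42
)

// Type-safe: age() is an Int, so `age().length` would not compile.
fun adultNames(): List<String> =
    sample().filter { age() >= 18 }[name].toList()

// Compiles, but the column is missing, so evaluation fails at runtime.
fun missingColumnFails(): Boolean =
    runCatching { sample().filter { height() > 1.5 } }.isFailure

fun main() {
    println(adultNames())
    println(missingColumnFails())
}
```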

The KProperties API is useful when you have already declared classes in your business logic with fields that correspond to columns of a data frame.
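A minimal sketch of that scenario, assuming a hypothetical `Person` class and made-up sample data: the existing class supplies the column names and types, and the filtered rows are converted back into instances of it with `toListOf`.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical business class; its properties match the column names and types.
data class Person(val name: String, val age: Int)

fun adults(): List<Person> {
    val df = dataFrameOf("name", "age")(
        "Alice", 15,
        "Bob", 42
    )
    return df
        .filter { it[Person::age] >= 18 } // column looked up via the property
        .toListOf<Person>()               // rows converted back to Person objects
}

fun main() {
    println(adults())
}
```

As with column accessors, the compiler checks the property's type, but a data frame without an `age` column would still fail when the code runs.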

API                      | Type-checking   | Column names checking | Column existence checking
-------------------------|-----------------|-----------------------|--------------------------
String API               | Runtime         | Runtime               | Runtime
Column Accessors API     | Compile-time    | Compile-time          | Runtime
KProperties API          | Compile-time    | Compile-time          | Runtime
Extension Properties API | Generation-time | Generation-time       | Generation-time

Last modified: 27 September 2024