Custom Data Schemas
You can define your own DataSchema interfaces and use them in functions and classes to represent a DataFrame with a specific set of columns:
@DataSchema
interface Person {
    val name: String
    val age: Int
}
After this cell is executed in Jupyter, or after annotation processing runs in IDEA, extension properties for data access are generated. These properties can then be used to write functions for a typed DataFrame:
fun DataFrame<Person>.splitName() = split { name }.by(",").into("firstName", "lastName")
fun DataFrame<Person>.adults() = filter { age > 18 }
In Jupyter, these functions work automatically for any DataFrame that matches the Person schema:
val df = dataFrameOf("name", "age", "weight")(
    "Merton, Alice", 15, 60.0,
    "Marley, Bob", 20, 73.5,
)
The schema of df is compatible with Person, so the auto-generated schema interface inherits from it:
@DataSchema(isOpen = false)
interface DataFrameType : Person
val ColumnsContainer<DataFrameType>.weight: DataColumn<Double> get() = this["weight"] as DataColumn<Double>
val DataRow<DataFrameType>.weight: Double get() = this["weight"] as Double
Although df has an additional column, weight, the functions previously defined for DataFrame<Person> still work for it:
df.splitName()
firstName  lastName  age  weight
Merton     Alice      15  60.000
Marley     Bob        20  73.500
df.adults()
name         age  weight
Marley, Bob   20    73.5
In a JVM project, you have to cast the DataFrame explicitly to the target interface:
df.cast<Person>().splitName()
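The explicit cast can be sketched as a small standalone program. This is a minimal illustration rather than the library's prescribed setup. It assumes the Kotlin DataFrame dependency is on the classpath. It also assumes the generated extension properties may not be available, so adults() is written here with string-based column access ("age"<Int>()) instead of the age property. The helper sampleFrame() is a hypothetical name introduced only for this example.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema
interface Person {
    val name: String
    val age: Int
}

// Same filter as adults() above, but using string-based column access
// (an assumption, so the sketch compiles without generated properties).
fun DataFrame<Person>.adults() = filter { "age"<Int>() > 18 }

// Hypothetical helper: builds the untyped sample frame used in this section.
fun sampleFrame() = dataFrameOf("name", "age", "weight")(
    "Merton, Alice", 15, 60.0,
    "Marley, Bob", 20, 73.5,
)

fun main() {
    val df = sampleFrame()

    // dataFrameOf returns an untyped DataFrame<*>, so the cast is required
    // before any function declared on DataFrame<Person> can be called.
    val adults = df.cast<Person>().adults()

    println(adults)
}
```

Without the cast, df.adults() would not compile in this setting, because the static type of df carries no information about the Person schema.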
Last modified: 27 September 2024