Data Schemas
The Kotlin DataFrame library provides typed data access via generation of extension properties for the type DataFrame<T> (as well as for DataRow<T> and ColumnGroup<T>), where T is a marker class representing the DataSchema of the DataFrame.
A schema of a DataFrame is a mapping from column names to column types.
This data schema can be expressed as a Kotlin interface or data class by annotating it with @DataSchema.
If the dataframe is hierarchical — contains a column group or a column of dataframes — the data schema reflects this structure, with a separate class representing the schema of each column group or nested DataFrame.
For example, consider a simple hierarchical dataframe from example.csv.
This dataframe consists of two columns:
- name, which is a String column
- info, which is a column group containing two nested value columns:
  - age of type Int
  - height of type Double
| name | info | |
|---|---|---|
| | age | height |
| Alice | 23 | 175.5 |
| Bob | 27 | 160.2 |
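The data schema corresponding to this DataFrame can be represented as a pair of classes annotated with @DataSchema: one for the top-level columns and one for the info column group.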
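A minimal sketch of such a schema, written as data classes. The class name PersonInfo is an assumption chosen for illustration (only Person appears in the text), and the property types follow the column types listed above.

```kotlin
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema

// Schema of the nested `info` column group.
// The name PersonInfo is hypothetical; any class name would do.
@DataSchema
data class PersonInfo(
    val age: Int,
    val height: Double,
)

// Schema of the top-level dataframe.
// The `info` property points at the nested schema, mirroring the column group.
@DataSchema
data class Person(
    val name: String,
    val info: PersonInfo,
)
```

The same schema could be declared with interfaces instead of data classes; the property names and types would stay the same.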
Extension properties for DataFrame<Person> are generated based on this schema and allow accessing columns or using them in operations.
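To illustrate what typed access looks like, the sketch below filters rows by the nested age column. So that it compiles without the Compiler Plugin, it declares three hand-written stand-ins for the generated properties; these are an assumption about their general shape, not the plugin's exact output, and they would clash with the real generated properties if the plugin were enabled. The helper samplePeople and the class name PersonInfo are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema
data class PersonInfo(val age: Int, val height: Double)

@DataSchema
data class Person(val name: String, val info: PersonInfo)

// Hand-written stand-ins for generated extension properties (illustration only).
val DataRow<Person>.name: String
    get() = this["name"] as String

@Suppress("UNCHECKED_CAST")
val DataRow<Person>.info: DataRow<PersonInfo>
    get() = this["info"] as DataRow<PersonInfo>

val DataRow<PersonInfo>.age: Int
    get() = this["age"] as Int

// Builds the example dataframe in memory and marks it with the Person schema.
fun samplePeople(): DataFrame<Person> =
    dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    ).group("age", "height").into("info")
        .cast<Person>()

fun main() {
    // Inside the lambda the receiver is a DataRow<Person>,
    // so `info.age` resolves through the typed properties above.
    val olderThan25 = samplePeople().filter { info.age > 25 }
    println(olderThan25)
}
```

With the Compiler Plugin or Kotlin Notebook, the three property declarations would be unnecessary and only the filter call would remain.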
See Extension Properties API for more information.
@DataSchema annotation
@DataSchema is a Kotlin annotation that marks a data class or interface as a data schema. The compiler plugin generates extension properties for the DataFrame (or DataRow, ColumnGroup, etc.) with a type parameter annotated with @DataSchema.
Each property of an annotated class or interface corresponds to a column in the DataFrame (or DataRow, ColumnGroup, etc.). The property name is the column name, and the property type is the column type.
Data Schema Retrieving
Defining a data schema manually can be difficult, especially for dataframes with many columns or deeply nested structures, and may lead to mistakes in column names or types. Kotlin DataFrame provides several methods for generating data schemas.
- generate..() methods are extensions for DataFrame (or for its schema) that generate a code string representing its DataSchema.
- Kotlin DataFrame Compiler Plugin cannot automatically infer a data schema from external sources such as files or URLs. However, it can infer the schema if you construct the DataFrame manually, that is, by explicitly declaring the columns using the API. It will also automatically update the schema during operations that modify the structure of the DataFrame.
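As a sketch of the generate..() approach, the snippet below builds the example dataframe in memory and asks the library for a schema declaration. It assumes that generateInterfaces with a marker-name argument is available in the library version in use; the helper names sampleFrame and personSchemaCode are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Builds the example dataframe with a nested `info` column group.
fun sampleFrame(): DataFrame<*> =
    dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    ).group("age", "height").into("info")

// Returns Kotlin source code declaring a schema named Person for the frame.
// The nested column group is expected to get its own generated declaration.
fun personSchemaCode(): String =
    sampleFrame().generateInterfaces("Person").toString()

fun main() {
    // Copy the printed declarations into your project to use them as a schema.
    println(personSchemaCode())
}
```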
Plugins
The Gradle plugin allows generating a data schema automatically by specifying a source file path in the Gradle build script.
The KSP plugin allows generating a data schema automatically using Kotlin Symbol Processing by specifying a source file path in your code file.
Specifying Data Schema
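To bring the DataFrame into the desired schema, you can use one of two operations: cast, which changes the type parameter without modifying the data, and convertTo, which converts the data to match the schema.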
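The sketch below contrasts the library's cast and convertTo operations on the example data. It assumes a Person schema with a hypothetical nested PersonInfo class; matchingPeople and rawPeople are illustrative helpers. In the second frame the ages are deliberately stored as text, so a conversion is needed before the data fits the schema.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema
data class PersonInfo(val age: Int, val height: Double)

@DataSchema
data class Person(val name: String, val info: PersonInfo)

// Untyped frame whose columns already match Person.
fun matchingPeople(): DataFrame<*> =
    dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    ).group("age", "height").into("info")

// Untyped frame where age is stored as text, so it does not match Person yet.
fun rawPeople(): DataFrame<*> =
    dataFrameOf("name", "age", "height")(
        "Alice", "23", 175.5,
        "Bob", "27", 160.2,
    ).group("age", "height").into("info")

fun main() {
    // cast only changes the type parameter; the data is left untouched.
    val typed: DataFrame<Person> = matchingPeople().cast<Person>()

    // convertTo changes the data where needed, here parsing age into Int.
    val converted: DataFrame<Person> = rawPeople().convertTo<Person>()

    println(typed.schema())
    println(converted.schema())
}
```

As a rule of thumb, cast fits data that is already known to have the right shape, while convertTo fits data whose column types may still need adjusting.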
Extension Properties Generation
Once you have a data schema, you can generate extension properties.
The easiest and most convenient way is to use the Kotlin DataFrame Compiler Plugin, which generates extension properties on the fly for declared data schemas and automatically keeps them up to date after operations that modify the structure of the DataFrame.
When using Kotlin DataFrame inside Kotlin Notebook, the schema and extension properties are generated automatically after each cell execution for all DataFrame variables declared in that cell. See extension properties example in Kotlin Notebook.
If you're not using the Compiler Plugin, you can still generate extension properties for a DataFrame manually by calling one of the generate..() methods with the extensionProperties = true argument.
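A short sketch of the manual route, assuming that generateDataClasses accepts markerName and extensionProperties as named arguments in the library version in use. The helper names exampleFrame and personCodeWithProperties are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Builds the example dataframe with a nested `info` column group.
fun exampleFrame(): DataFrame<*> =
    dataFrameOf("name", "age", "height")(
        "Alice", 23, 175.5,
        "Bob", 27, 160.2,
    ).group("age", "height").into("info")

// Returns source code for the schema classes together with
// extension property declarations for them.
fun personCodeWithProperties(): String =
    exampleFrame()
        .generateDataClasses(markerName = "Person", extensionProperties = true)
        .toString()

fun main() {
    // Paste the printed code into your project to get typed column access.
    println(personCodeWithProperties())
}
```

Code generated this way is a snapshot: it has to be regenerated by hand whenever the structure of the dataframe changes.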
Custom extension properties
Sometimes it is also useful to define your own extension properties based on a data schema.
See Custom extension properties for more information.