The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.
.read() function automatically detects the input format based on file extension and content:
The input string can be a file path or URL.
Read from CSV
To read a CSV file, use the
To read a CSV file from a file:
To read a CSV file from a URL:
By default, CSV files are parsed using
, as the delimiter. To specify a custom delimiter, use the
Column type inference from CSV
Column types are inferred from the CSV data. Suppose that the CSV from the previous example had the following content:
DataFrame schema we get is:
DataFrame tries to parse columns as JSON, so when reading the following table with JSON object in column D:
We get this data schema where D is
ColumnGroup with 2 children columns:
For a column where values are lists of JSON values:
Work with locale-specific numbers
Sometimes columns in your CSV can be interpreted differently depending on your system locale.
Here a comma can be decimal or thousands separator, thus different values. You can deal with it in two ways:
Provide locale as a parser option
Disable type inference for specific column and convert it yourself
Read from JSON
To read a JSON file, use the
.readJSON() function. JSON files can be read from a file or a URL.
To read a JSON file from a file:
To read a JSON file from a URL:
Column type inference from JSON
Type inference for JSON is much simpler than for CSV. JSON string literals are always supposed to have String type. Number literals take different
Number kinds. Boolean literals are converted to
Let's take a look at the following JSON:
We can read it from file:
DataFrame schema is:
Column A has
String type because all values are string literals, no implicit conversion is performed. Column C has
Number type because it's the least common type for
JSON parsing options
Manage type clashes
By default, if a type clash occurs when reading JSON, a new column group is created consisting of: "value", "array", and any number of object properties:
"value" will be set to the value of the JSON element if it's a primitive, else it will be
"array" will be set to the array of values if the json element is an array, else it will be
If the json element is an object, then each property will spread out to its own column in the group, else these columns will be
In this case
typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS.
will be read like (including
This makes it more convenient to work with the data, but it can be confusing if you're not expecting it or if you just need the type to be an
For this case, you can set
typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS to get the following:
This option is also possible to set in the Gradle- and KSP plugin by providing
Specify Key/Value Paths
If you have a JSON like:
You will get a column for each dog, which becomes an issue when you have a lot of dogs. This issue is especially noticeable when generating data schemas from JSON, as you might run out of memory when doing that due to the sheer number of generated interfaces. Instead, you can use
keyValuePaths to specify paths to the objects that should be read as key value frame columns.
This can be the difference between:
with dogs looking like
(The results are wrapped in a
FrameColumn instead of a
ColumnGroup since lengths between "cats" and "dogs" can vary, among other reasons.)
To specify the paths, you can use the
Note: For the KSP plugin, the
JsonPath class is not available, so you will have to use the
String version of the paths instead. For example:
jsonOptions = JsonOptions(keyValuePaths = ["""$""", """$[*]["versions"]"""]). Only the bracket notation of json path is supported, as well as just double quotes, arrays, and wildcards.
For more examples, see the "examples/json" module.
Read from Excel
Before you can read data from Excel, add the following dependency:
To read an Excel spreadsheet, use the
.readExcel() function. Excel spreadsheets can be read from a file or a URL. Supported Excel spreadsheet formats are: xls, xlsx.
To read an Excel spreadsheet from a file:
To read an Excel spreadsheet from a URL:
Cell type inference from Excel
Cells representing dates will be read as
kotlinx.datetime.LocalDateTime. Cells with number values, including whole numbers such as "100", or calculated formulas will be read as
Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of
You will get column of
Serializable instead (common parent for
You can fix it using the
Read Apache Arrow formats
Before you can read data from Apache Arrow format, add the following dependency:
To read Apache Arrow formats, use the
DataFrame supports reading Arrow interprocess streaming format and Arrow random access format from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.