Read
The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.
Reading from SQL databases is also supported. Read here to learn more or explore the example project.
The .read()
function automatically detects the input format based on the file extension and content:
DataFrame.read("input.csv")
The input string can be a file path or URL.
Read from CSV
To read a CSV file, use the .readCSV()
function.
note
Since DataFrame v0.15, a new experimental CSV integration is available. It is based on Deephaven CSV and is faster and more flexible than the old one.
To try it in your Kotlin project, add the dependency:
org.jetbrains.kotlinx:dataframe-csv:$dataframe_version
To try it in your Kotlin Notebook, modify the %use magic directive:
%use dataFrame(enableExperimentalCsv=true)
Then use the new DataFrame.readCsv() / DataFrame.readTsv() / DataFrame.readDelim() functions instead of the old DataFrame.readCSV() ones.
The documentation of the old CSV integration still applies to the new one. We will expand it as the new CSV integration stabilizes.
In the meantime, check out this example notebook to see the new CSV integration in action.
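As a quick sketch (assuming the dataframe-csv dependency above is on the classpath), the new function is called the same way as the old one:
// Experimental: requires the dataframe-csv dependency
val df = DataFrame.readCsv("input.csv")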
To read a CSV file from a file:
import java.io.File
DataFrame.readCSV("input.csv")
// Alternatively
DataFrame.readCSV(File("input.csv"))
To read a CSV file from a URL:
import java.net.URL
DataFrame.readCSV(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
Specify delimiter
By default, CSV files are parsed using ,
as the delimiter. To specify a custom delimiter, use the delimiter
argument:
val df = DataFrame.readCSV(
file,
delimiter = '|',
header = listOf("A", "B", "C", "D"),
parserOptions = ParserOptions(nullStrings = setOf("not assigned")),
)
Column type inference from CSV
Column types are inferred from the CSV data. Suppose that the CSV from the previous example had the following content:
A | B | C | D |
---|---|---|---|
12 | tuv | 0.12 | true |
41 | xyz | 3.6 | not assigned |
89 | abc | 7.1 | false |
Then the DataFrame
schema we get is:
A: Int
B: String
C: Double
D: Boolean?
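To verify what was inferred, you can print the schema of the DataFrame read above (schema() is the standard accessor for a DataFrame's schema):
// Inspect the inferred column types; prints something like:
// A: Int
// B: String
// C: Double
// D: Boolean?
println(df.schema())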
DataFrame
tries to parse columns as JSON, so when reading the following table with a JSON object in column D:
A | D |
---|---|
12 | {"B":2,"C":3} |
41 | {"B":3,"C":2} |
We get this data schema, where D is a ColumnGroup with two child columns:
A: Int
D:
B: Int
C: Int
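Since D is a ColumnGroup, its nested columns can be accessed directly. A minimal sketch, using the column names from the table above:
// D is a ColumnGroup; its child columns can be addressed through it
val d = df.getColumnGroup("D")
val b = d["B"] // a DataColumn holding the Int values parsed from the JSON objects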
For a column where values are lists of JSON values:
A | G |
---|---|
12 | [{"B":1,"C":2,"D":3},{"B":1,"C":3,"D":2}] |
41 | [{"B":2,"C":1,"D":3}] |
A: Int
G: *
B: Int
C: Int
D: Int
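Here G is a FrameColumn: each cell contains a nested DataFrame with columns B, C, and D. As a hedged sketch, such a column can be unnested with the explode operation, giving one row per nested entry:
// Each cell of G holds a nested DataFrame; explode produces one row per nested entry,
// repeating the corresponding A value
val flattened = df.explode("G")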
Work with locale-specific numbers
Sometimes columns in your CSV can be interpreted differently depending on your system locale.
numbers |
---|
12,123 |
41,111 |
Here a comma can be either a decimal separator or a thousands separator, resulting in different values. You can deal with it in two ways:
Provide locale as a parser option
val df = DataFrame.readCSV(
file,
parserOptions = ParserOptions(locale = Locale.UK),
)
Disable type inference for a specific column and convert it yourself
val df = DataFrame.readCSV(
file,
colTypes = mapOf("colName" to ColType.String),
)
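Having read the column as String, you can then convert it yourself, for example (a sketch assuming the column is named "numbers" and the comma is a decimal separator):
val df = DataFrame.readCSV(
    file,
    colTypes = mapOf("numbers" to ColType.String),
)
// Interpret the comma as a decimal separator and convert to Double
val converted = df.convert("numbers").with { (it as String).replace(',', '.').toDouble() }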
Work with specific date-time formats
When parsing date or date-time columns, you might encounter formats different from the default ISO_LOCAL_DATE_TIME
. For example:
date |
---|
13/Jan/23 11:49 AM |
14/Mar/23 5:35 PM |
Because the format here "dd/MMM/yy h:mm a"
differs from the default (ISO_LOCAL_DATE_TIME
), columns like this may be recognized as simple String
values rather than actual date-time columns.
You can fix this whenever you parse a string-based column (e.g., using DataFrame.readCSV()
, DataFrame.readTSV()
, or DataColumn<String>.convertTo<>()
) by providing a custom date-time pattern. There are two ways to do this:
By providing the date-time pattern as a raw string to the
ParserOptions
argument:
val df = DataFrame.readCSV(
file,
parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
)
By providing a
DateTimeFormatter
to the ParserOptions
argument:
val df = DataFrame.readCSV(
file,
parserOptions = ParserOptions(dateTimeFormatter = DateTimeFormatter.ofPattern("dd/MMM/yy h:mm a"))
)
These two approaches are essentially the same, just specified in different ways. The result will be a dataframe with properly parsed `DateTime` columns.
tip
Note: Although these examples focus on reading CSV files, these ParserOptions can be supplied to any String-column-handling operation (like readCsv, readTsv, stringCol.convertTo<>(), etc.). This allows you to configure the locale, null-strings, date-time patterns, and more.
For more details, see the parse operation.
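For example, the same options can be passed to parse when String columns are already in memory (a minimal sketch; the ParserOptions values are illustrative):
import java.util.Locale

// Re-parse the String columns of an existing DataFrame with a custom locale and date-time pattern
val parsed = df.parse(
    ParserOptions(
        locale = Locale.UK,
        dateTimePattern = "dd/MMM/yy h:mm a",
    )
)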
Read from JSON
To read a JSON file, use the .readJSON()
function. JSON files can be read from a file or a URL.
Note that after reading a JSON with a complex structure, you can get a hierarchical DataFrame: a DataFrame with ColumnGroups and FrameColumns.
To read a JSON file from a file:
val df = DataFrame.readJson(file)
To read a JSON file from a URL:
DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")
Column type inference from JSON
Type inference for JSON is much simpler than for CSV. JSON string literals are always read as String. Number literals are read as the appropriate Number kinds, and boolean literals are converted to Boolean.
Let's take a look at the following JSON:
[
{
"A": "1",
"B": 1,
"C": 1.0,
"D": true
},
{
"A": "2",
"B": 2,
"C": 1.1,
"D": null
},
{
"A": "3",
"B": 3,
"C": 1,
"D": false
},
{
"A": "4",
"B": 4,
"C": 1.3,
"D": true
}
]
We can read it from file:
val df = DataFrame.readJson(file)
The corresponding DataFrame
schema is:
A: String
B: Int
C: Number
D: Boolean?
Column A has String type because all values are string literals; no implicit conversion is performed. Column C has Number type because that is the closest common type for Int and Double.
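The same inference can be observed by reading the JSON directly from a string and printing the schema (a sketch; jsonText is assumed to hold the JSON array shown above):
// jsonText holds the JSON array from the example above
val df = DataFrame.readJsonStr(jsonText)
// Prints something like:
// A: String
// B: Int
// C: Number
// D: Boolean?
println(df.schema())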
JSON parsing options
Manage type clashes
By default, if a type clash occurs when reading JSON, a new column group is created consisting of "value", "array", and any number of object properties:
"value" will be set to the value of the JSON element if it's a primitive, else it will be null.
"array" will be set to the array of values if the JSON element is an array, else it will be [].
If the JSON element is an object, then each property will spread out to its own column in the group, else these columns will be null.
This is the default tactic: typeClashTactic = JSON.TypeClashTactic.ARRAY_AND_VALUE_COLUMNS.
For example:
[
{ "a": "text" },
{ "a": { "b": 2 } },
{ "a": [ 6, 7, 8 ] }
]
will be read like this (including the null and [] values):
⌌----------------------------------------------⌍
| | a:{b:Int?, value:String?, array:List<Int>}|
|--|-------------------------------------------|
| 0| {b:null, value:"text", array:[] }|
| 1| {b:2, value:null, array:[] }|
| 2| {b:null, value:null, array:[6, 7, 8]}|
⌎----------------------------------------------⌏
This makes it more convenient to work with the data, but it can be confusing if you're not expecting it or if you just need the type to be an Any
.
For this case, you can set typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS
to get the following:
⌌-------------⌍
| | a:Any|
|--|----------|
| 0| "text"|
| 1| { b:2 }|
| 2| [6, 7, 8]|
⌎-------------⌏
This option can also be set in the Gradle and KSP plugins by providing jsonOptions.
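When reading JSON in code, the tactic is passed directly to the read functions, for example (a sketch using readJsonStr, with the clashing JSON above stored in json):
// Read the clashing JSON with plain Any columns instead of value/array/object groups
val df = DataFrame.readJsonStr(
    text = json,
    typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS,
)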
Specify Key/Value Paths
If you have a JSON like:
{
"dogs": {
"fido": {
"age": 3,
"breed": "poodle"
},
"spot": {
"age": 5,
"breed": "labrador"
},
"rex": {
"age": 2,
"breed": "golden retriever"
},
"lucky": { ... },
"rover": { ... },
"max": { ... },
"buster": { ... },
...
},
"cats": { ... }
}
You will get a column for each dog, which becomes an issue when you have a lot of dogs. This is especially noticeable when generating data schemas from JSON, as you might run out of memory due to the sheer number of generated interfaces. Instead, you can use keyValuePaths to specify paths to the objects that should be read as key/value frame columns.
This can be the difference between:
⌌---------------------------------------------------------------------------------------------------------------------------------------------...
| | dogs:{fido:{age:Int, breed:String}, spot:{age:Int, breed:String}, rex:{age:Int, breed:String}, lucky:{age:Int, breed...
|--|------------------------------------------------------------------------------------------------------------------------------------------...
| 0| { fido:{ age:3, breed:poodle }, spot:{ age:5, breed:labrador }, rex:{ age:2, breed:golden retriever }, lucky:{ age:1, breed:poodle }, rov...
⌎---------------------------------------------------------------------------------------------------------------------------------------------...
and
⌌------------------------------------------------------------------------------------------------------⌍
| | dogs:[key:String, value:{age:Int, breed:String}]| cats:[key:String, value:{age:Int, breed:String}]|
|--|-------------------------------------------------|-------------------------------------------------|
| 0| [7 x 2]| [6 x 2]|
⌎------------------------------------------------------------------------------------------------------⌏
with dogs looking like
⌌-------------------------------------------------⌍
| | key:String| value:{age:Int, breed:String}|
|--|-----------|----------------------------------|
| 0| fido| { age:3, breed:poodle }|
| 1| spot| { age:5, breed:labrador }|
| 2| rex| { age:2, breed:golden retriever }|
| 3| lucky| { age:1, breed:poodle }|
| 4| rover| { age:3, breed:labrador }|
| 5| max| { age:2, breed:golden retriever }|
| 6| buster| { age:1, breed:poodle }|
⌎-------------------------------------------------⌏
(The results are wrapped in a FrameColumn
instead of a ColumnGroup
since lengths between "cats" and "dogs" can vary, among other reasons.)
To specify the paths, you can use the JsonPath
class:
DataFrame.readJsonStr(
text = myJson,
keyValuePaths = listOf(
JsonPath().append("dogs"), // which will result in '$["dogs"]'
JsonPath().append("cats"), // which will result in '$["cats"]'
),
)
Note: For the KSP plugin, the JsonPath
class is not available, so you will have to use the String
version of the paths instead. For example: jsonOptions = JsonOptions(keyValuePaths = ["""$""", """$[*]["versions"]"""])
. Only the bracket notation of JSON Path is supported, using double quotes, array indices, and wildcards.
For more examples, see the "examples/json" module.
Read from Excel
Before you can read data from Excel, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
To read an Excel spreadsheet, use the .readExcel()
function. Excel spreadsheets can be read from a file or a URL. Supported Excel spreadsheet formats are: xls, xlsx.
To read an Excel spreadsheet from a file:
val df = DataFrame.readExcel(file)
To read an Excel spreadsheet from a URL:
DataFrame.readExcel("https://example.com/data.xlsx")
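readExcel also has optional parameters for narrowing what is read, such as a specific sheet or cell range; as a hedged sketch (sheetName and columns are optional parameters, and the values here are illustrative):
// Read only the "Sales" sheet and columns A through C
val df = DataFrame.readExcel(
    file,
    sheetName = "Sales",
    columns = "A:C",
)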
Cell type inference from Excel
Cells representing dates will be read as kotlinx.datetime.LocalDateTime
. Cells with number values, including whole numbers such as "100", or calculated formulas will be read as Double
.
Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of String
:
IDS
100 <-- Intended to be String, but has numeric cell format in original .xlsx file
A100
B100
C100
You will get a column of Serializable instead (the common parent of Double and String).
You can fix it by providing an additional parameter:
val df = DataFrame.readExcel("mixed_column.xlsx", stringColumns = StringColumns("A"))
Read Apache Arrow formats
Before you can read data from Apache Arrow format, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
To read Apache Arrow formats, use the .readArrowFeather()
function:
val df = DataFrame.readArrowFeather(file)
DataFrame
supports reading Arrow interprocess streaming format and Arrow random access format from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), ArrowReader, InputStream, File or ByteArray.
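For example, a Feather file that is already in memory can be read from a ByteArray (a minimal sketch; the bytes are assumed to contain valid Arrow data):
import java.io.File

// Read Arrow data from an in-memory byte array instead of directly from the file
val bytes: ByteArray = File("data.feather").readBytes()
val df = DataFrame.readArrowFeather(bytes)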
note
If you use Java 9+, follow the Apache Arrow Java compatibility guide.