Parquet
Kotlin DataFrame supports reading Apache Parquet files through the Apache Arrow integration. This requires the dataframe-arrow module, which is included by default in the general dataframe artifact and when using %use dataframe in Kotlin Notebook.
Reading Parquet Files
Kotlin DataFrame provides four readParquet() methods that can read from different source types. All overloads accept optional nullability inference settings and a batchSize for Arrow scanning.
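For example, a single file can be read from a local path. The snippet below is a minimal sketch: the import locations, the String-path overload, and the nullability and batchSize parameter names follow the description above but may differ between versions, so check the API reference for the exact signatures.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.NullabilityOptions
import org.jetbrains.kotlinx.dataframe.io.readParquet

// Read a local Parquet file; the parameter names follow the description
// above (nullability inference + Arrow batch size) and are assumptions.
val df = DataFrame.readParquet(
    "data/sales.parquet",
    nullability = NullabilityOptions.Infer,
    batchSize = 32_768,
)
println(df)
```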
These overloads are defined in the dataframe-arrow module and internally use FileFormat.PARQUET from Apache Arrow’s Dataset API to scan the data and materialize it as a Kotlin DataFrame.
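Conceptually, the scan corresponds to the following use of Arrow's Dataset API. This is a simplified sketch of the mechanism, not the module's actual implementation; the conversion of Arrow batches into a DataFrame is only hinted at in the comments.

```kotlin
import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

// Simplified sketch of a Parquet scan via Arrow's Dataset API.
fun scanParquet(uri: String, batchSize: Long = 32_768) {
    RootAllocator().use { allocator ->
        FileSystemDatasetFactory(
            allocator,
            NativeMemoryPool.getDefault(),
            FileFormat.PARQUET, // the format constant mentioned above
            uri                 // e.g. "file:///data/sales.parquet"
        ).use { factory ->
            factory.finish().use { dataset ->
                dataset.newScan(ScanOptions(batchSize)).use { scanner ->
                    scanner.scanBatches().use { reader ->
                        // Each batch arrives as a VectorSchemaRoot; Kotlin DataFrame
                        // materializes such batches into DataFrame columns.
                        while (reader.loadNextBatch()) {
                            println("read batch with ${reader.vectorSchemaRoot.rowCount} rows")
                        }
                    }
                }
            }
        }
    }
}
```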
Examples
If you want to see a complete, realistic data‑engineering example using Spark and Parquet with Kotlin DataFrame, check out the example project.
Multiple Files
It's possible to read multiple Parquet files:
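For example (a sketch assuming a multi-path overload of readParquet; the file paths are placeholders):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readParquet

// Rows from all files are concatenated into a single DataFrame
// (the file paths below are placeholders).
val combined = DataFrame.readParquet(
    "data/sales-2023.parquet",
    "data/sales-2024.parquet",
)
```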
Requirements:
All files must have compatible schemas
Files are vertically concatenated (union of rows)
Column types must match exactly
Missing columns in some files will result in null values
Performance tips
Column selection: Because the readParquet method reads all columns, use DataFrame operations like select() immediately after reading to reduce memory usage in later operations (see the sketch after these tips).
Predicate pushdown: Currently not supported; filtering happens after data is loaded into memory.
JVM compatibility: Use Arrow-compatible JVMs as documented in Apache Arrow Java compatibility.
Batch size: Adjust batchSize if you read huge files and need to tune throughput vs. memory.
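A minimal sketch combining the column-selection and batch-size tips (the column names, the String-path overload, and the batchSize parameter are assumptions):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.select
import org.jetbrains.kotlinx.dataframe.io.readParquet

// Read with a smaller Arrow batch size, then immediately keep only the
// columns that later steps actually need (names are hypothetical).
val slim = DataFrame.readParquet("data/sales.parquet", batchSize = 8_192)
    .select("orderId", "amount")
```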
See also
Apache Arrow — reading/writing Arrow IPC formats.
Example: Spark + Parquet + Kotlin DataFrame
Data Sources — Overview of all supported formats