Parquet
Kotlin DataFrame supports reading Apache Parquet files through the Apache Arrow integration. This requires the dataframe-arrow module, which is included by default in the general dataframe artifact and when using %use dataframe in Kotlin Notebook.
Reading Parquet Files
Kotlin DataFrame provides four readParquet() methods that can read from different source types. All overloads accept optional nullability inference settings and a batchSize for Arrow scanning.
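For example, a single file can be read from a local path. The snippet below is a minimal sketch: the import locations, the String-path overload, and the nullability and batchSize parameter names follow the description above but may differ between versions, so check the API reference for the exact signatures.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.NullabilityOptions
import org.jetbrains.kotlinx.dataframe.io.readParquet

// Read a local Parquet file; the parameter names follow the description
// above (nullability inference + Arrow batch size) and are assumptions.
val df = DataFrame.readParquet(
    "data/sales.parquet",
    nullability = NullabilityOptions.Infer,
    batchSize = 32_768,
)
println(df)
```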
These overloads are defined in the dataframe-arrow module and internally use FileFormat.PARQUET from Apache Arrow’s Dataset API to scan the data and materialize it as a Kotlin DataFrame.
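Conceptually, the scan corresponds to the following use of Arrow's Dataset API. This is a simplified sketch of the mechanism, not the module's actual implementation; the conversion of Arrow batches into a DataFrame is only hinted at in the comments.

```kotlin
import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

// Simplified sketch of a Parquet scan via Arrow's Dataset API.
fun scanParquet(uri: String, batchSize: Long = 32_768) {
    RootAllocator().use { allocator ->
        FileSystemDatasetFactory(
            allocator,
            NativeMemoryPool.getDefault(),
            FileFormat.PARQUET, // the format constant mentioned above
            uri                 // e.g. "file:///data/sales.parquet"
        ).use { factory ->
            factory.finish().use { dataset ->
                dataset.newScan(ScanOptions(batchSize)).use { scanner ->
                    scanner.scanBatches().use { reader ->
                        // Each batch arrives as a VectorSchemaRoot; Kotlin DataFrame
                        // materializes such batches into DataFrame columns.
                        while (reader.loadNextBatch()) {
                            println("read batch with ${reader.vectorSchemaRoot.rowCount} rows")
                        }
                    }
                }
            }
        }
    }
}
```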
Examples
If you want to see a complete, realistic data‑engineering example using Spark and Parquet with Kotlin DataFrame, check out the example project.
Multiple Files
It's possible to read multiple Parquet files:
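For example (a sketch assuming a multi-path overload of readParquet; the file paths are placeholders):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readParquet

// Rows from all files are concatenated into a single DataFrame
// (the file paths below are placeholders).
val combined = DataFrame.readParquet(
    "data/sales-2023.parquet",
    "data/sales-2024.parquet",
)
```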
Requirements:
All files must have compatible schemas
Files are vertically concatenated (union of rows)
Column types must match exactly
Missing columns in some files will result in null values
Performance tips
Column selection: Because the readParquet method reads all columns, use DataFrame operations like select() immediately after reading to reduce memory usage in later operations (see the sketch after these tips).
Predicate pushdown: Currently not supported; filtering happens after data is loaded into memory.
JVM compatibility: Use Arrow-compatible JVMs as documented in Apache Arrow Java compatibility.
Batch size: Adjust batchSize if you read huge files and need to tune throughput vs. memory.
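A minimal sketch combining the column-selection and batch-size tips (the column names, the String-path overload, and the batchSize parameter are assumptions):

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.select
import org.jetbrains.kotlinx.dataframe.io.readParquet

// Read with a smaller Arrow batch size, then immediately keep only the
// columns that later steps actually need (names are hypothetical).
val slim = DataFrame.readParquet("data/sales.parquet", batchSize = 8_192)
    .select("orderId", "amount")
```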
See also
Apache Arrow — reading/writing Arrow IPC formats.
Example: Spark + Parquet + Kotlin DataFrame
Data Sources — Overview of all supported formats