Dataframe 0.14 Help

Write

DataFrame instances can be saved in the following formats: CSV, TSV, JSON, XLS(X) and Apache Arrow.

Writing to CSV

You can write DataFrame in CSV format to file, to String or to Appendable (i.e. to Writer).

Values of ColumnGroup, FrameColumn, i.e. AnyRow, AnyFrame will be serialized as JSON objects.

df.writeCSV(file)
val format = CSVFormat.DEFAULT.builder().setDelimiter(';').setRecordSeparator(System.lineSeparator()).build() val csvStr = df.toCsv(format)

ColumnGroup and FrameColumn values will be serialized as JSON strings.

Writing to JSON

You can write your DataFrame in JSON format to file, to string or to Appendable (i.e. to Writer).

df.writeJson(file)
val jsonStr = df.toJson(prettyPrint = true)

Write to Excel spreadsheet

Add dependency:

implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")

You can write your DataFrame in XLS, XLSX format to a file, OutputStream or Workbook object.

df.writeExcel(file)

Values of ColumnGroup, FrameColumn, i.e. AnyRow, AnyFrame will be serialized as JSON objects.

If you work directly with Apache POI, you can use created Workbook and Sheets in your code:

/** * Do something with generated sheets. Here we set bold style for headers and italic style for first data column */ fun setStyles(sheet: Sheet) { val headerFont = sheet.workbook.createFont() headerFont.bold = true val headerStyle = sheet.workbook.createCellStyle() headerStyle.setFont(headerFont) val indexFont = sheet.workbook.createFont() indexFont.italic = true val indexStyle = sheet.workbook.createCellStyle() indexStyle.setFont(indexFont) sheet.forEachIndexed { index, row -> if (index == 0) { for (cell in row) { cell.cellStyle = headerStyle } } else { row.first().cellStyle = indexStyle } } } // Create a workbook (or use existing) val wb = WorkbookFactory.create(true) // Create different sheets from different data frames in the workbook val allPersonsSheet = df.writeExcel(wb, sheetName = "allPersons") val happyPersonsSheet = df.filter { person -> person.isHappy }.remove("isHappy").writeExcel(wb, sheetName = "happyPersons") val unhappyPersonsSheet = df.filter { person -> !person.isHappy }.remove("isHappy").writeExcel(wb, sheetName = "unhappyPersons") // Do anything you want by POI listOf(happyPersonsSheet, unhappyPersonsSheet).forEach { setStyles(it) } // Save the result file.outputStream().use { wb.write(it) } wb.close()

Add new sheets without using Apache POI directly by using a parameter to keep using the same file if it already exists

// Create a new Excel workbook with a single sheet called "allPersons", replacing the file if it already exists -> Current sheets: allPersons df.writeExcel(file, sheetName = "allPersons") // Add a new sheet to the previous file without replacing it, by setting keepFile = true -> Current sheets: allPersons, happyPersons df.filter { person -> person.isHappy }.remove("isHappy") .writeExcel(file, sheetName = "happyPersons", keepFile = true) // Add a new sheet to the previous file without replacing it, by setting keepFile = true -> Current sheets: allPersons, happyPersons, unhappyPersons df.filter { person -> !person.isHappy }.remove("isHappy") .writeExcel(file, sheetName = "unhappyPersons", keepFile = true)

Writing to Apache Arrow formats

Add dependency:

implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")

DataFrame supports writing Arrow interprocess streaming format and Arrow random access format to raw WritableByteChannel, OutputStream, File or ByteArray.

Data may be saved "as is" (like exporting to new Excel file) or converted to match some target Schema if you have it (like inserting into existing SQL table).

The first approach is quite easy:

df.writeArrowIPC(file) // or df.writeArrowFeather(file)

(writing to file, opened stream or channel),

val ipcByteArray: ByteArray = df.saveArrowIPCToByteArray() // or val featherByteArray: ByteArray = df.saveArrowFeatherToByteArray()

(creating byte array). Nested frames and columns with mixed or unsupported types will be saved as String.

The second approach is a bit more tricky. You have to specify schema itself and casting behavior mode as ArrowWriter parameters. Behavior Mode has four independent switchers: restrictWidening, restrictNarrowing, strictType, strictNullable. You can use Mode.STRICT (this is default), Mode.LOYAL or any combination you want. The ArrowWriter object should be closed after using because Arrow uses random access buffers not managed by Java GC. Finally, you can specify a callback to be invoked if some data is lost or can not be saved according to your schema.

Here is full example:

// Get schema from anywhere you want. It can be deserialized from JSON, generated from another dataset // (including the DataFrame.columns().toArrowSchema() method), created manually, and so on. val schema = Schema.fromJSON(schemaJson) df.arrowWriter( // Specify your schema targetSchema = schema, // Specify desired behavior mode mode = ArrowWriter.Mode( restrictWidening = true, restrictNarrowing = true, strictType = true, strictNullable = false, ), // Specify mismatch subscriber mismatchSubscriber = writeMismatchMessage, ).use { writer: ArrowWriter -> // Save to any format and sink, like in the previous example writer.writeArrowFeather(file) }

On executing you should get two warnings:

and

Last modified: 27 September 2024