Dataframe 0.13 Help

joinWith

Joins two DataFrames by a join expression.

joinWith(otherDf, type = JoinType.Inner) { joinExpression } joinExpression: JoinedDataRow.(JoinedDataRow) -> Boolean interface JoinedDataRow: LeftDataRow { val right: RightDataRow }

This function is a join variant that lets you match data using any expression that returns a Boolean, which also gives opportunity to perform operations that require values from both matching rows. Can be helpful if the data you want to join wasn't designed relational and requires heuristics to tell if rows are matching, or has relations other than equals.

For example, you can match rows based on:

  • Order relations such as >, <, in for numerical or DateTime values

  • Spatial relations, like distance within a certain range if your data includes spatial or geographical values

  • String equivalence using more complex comparison techniques, such as contains, regular expressions, Levenshtein Distance or language models.

Join types with examples

Supported join types:

  • Inner (default) — only matched rows from left and right DataFrames

  • Filter — only matched rows from left DataFrame

  • Left — all rows from left DataFrame, mismatches from right DataFrame filled with null

  • Right — all rows from right DataFrame, mismatches from left DataFrame filled with null

  • Full — all rows from left and right DataFrames, any mismatches filled with null

  • Exclude — only mismatched rows from left

For every join type there is a shortcut operation:

Inner join

campaigns.innerJoinWith(visits) { right.date in startDate..endDate }
val date by column<LocalDate>() val startDate by column<LocalDate>() val endDate by column<LocalDate>() campaigns.innerJoinWith(visits) { right[date] in startDate()..endDate() }
campaigns.innerJoinWith(visits) { right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>() }

Filter join

Special case of inner join when you only need the data from the left table.

campaigns.filterJoinWith(visits) { right.date in startDate..endDate }
val date by column<LocalDate>() val startDate by column<LocalDate>() val endDate by column<LocalDate>() campaigns.filterJoinWith(visits) { right[date] in startDate()..endDate() }
campaigns.filterJoinWith(visits) { right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>() }

Left join

campaigns.leftJoinWith(visits) { right.date in startDate..endDate }
val date by column<LocalDate>() val startDate by column<LocalDate>() val endDate by column<LocalDate>() campaigns.leftJoinWith(visits) { right[date] in startDate()..endDate() }
campaigns.leftJoinWith(visits) { right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>() }

Right join

campaigns.rightJoinWith(visits) { right.date in startDate..endDate }
val date by column<LocalDate>() val startDate by column<LocalDate>() val endDate by column<LocalDate>() campaigns.rightJoinWith(visits) { right[date] in startDate()..endDate() }
campaigns.rightJoinWith(visits) { right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>() }

Full join

campaigns.fullJoinWith(visits) { right.date in startDate..endDate }
val date by column<LocalDate>() val startDate by column<LocalDate>() val endDate by column<LocalDate>() campaigns.fullJoinWith(visits) { right[date] in startDate()..endDate() }
campaigns.fullJoinWith(visits) { right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>() }

Exclude join

Can be viewed as filterJoin with logically opposite predicate

campaigns.excludeJoinWith(visits) { right.date in startDate..endDate }
val date by column<LocalDate>() val startDate by column<LocalDate>() val endDate by column<LocalDate>() campaigns.excludeJoinWith(visits) { right[date] in startDate()..endDate() }
campaigns.excludeJoinWith(visits) { right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>() }

Cross join

Can also be called cross product of two dataframes

campaigns.joinWith(visits) { true }

Difference from join

join tries to take advantage of knowledge that data in matching columns is the same (because equals is used) to minimize number of columns in the resulting dataframe.

df1.innerJoin(df2, "index", "age")

Columns that were used in the condition: index, age - are present only once. Numerical suffix is used to disambiguate columns that are not used in the condition. Compare it to an equivalent joinWith:

df1.innerJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }

Here columns from both dataframes are presented as is. So join is better suited for equals relation, and joinWith is for everything else. Below are two more examples with join types that allow mismatches. Note the difference in null values

df1.leftJoin(df2, "index", "age") df1.leftJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
df1.rightJoin(df2, "index", "age") df1.rightJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
Last modified: 29 March 2024