joinWith
Joins two DataFrame
objects by a join expression.
joinWith(otherDf, type = JoinType.Inner) { joinExpression }
joinExpression: JoinedDataRow.(JoinedDataRow) -> Boolean
interface JoinedDataRow: LeftDataRow {
val right: RightDataRow
}
This function is a join variant that lets you match data using any expression that returns a Boolean, which also gives opportunity to perform operations that require values from both matching rows. Can be helpful if the data you want to join wasn't designed relational and requires heuristics to tell if rows are matching, or has relations other than equals
.
For example, you can match rows based on:
Order relations such as >
, <
, in
for numerical or DateTime values
Spatial relations, like distance within a certain range if your data includes spatial or geographical values
String equivalence using more complex comparison techniques, such as contains
, regular expressions, Levenshtein Distance or language models.
Join types with examples
Supported join types:
Inner
(default) — only matched rows from left and right DataFrame
objects
Filter
— only matched rows from left DataFrame
Left
— all rows from left DataFrame
, mismatches from right DataFrame
filled with null
Right
— all rows from right DataFrame
, mismatches from left DataFrame
filled with null
Full
— all rows from left and right DataFrame
objects, any mismatches filled with null
Exclude
— only mismatched rows from left
For every join type there is a shortcut operation:
Inner join
campaigns.innerJoinWith(visits) {
right.date in startDate..endDate
}
val date by column<LocalDate>()
val startDate by column<LocalDate>()
val endDate by column<LocalDate>()
campaigns.innerJoinWith(visits) {
right[date] in startDate()..endDate()
}
campaigns.innerJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
Filter join
Special case of inner join when you only need the data from the left table.
campaigns.filterJoinWith(visits) {
right.date in startDate..endDate
}
val date by column<LocalDate>()
val startDate by column<LocalDate>()
val endDate by column<LocalDate>()
campaigns.filterJoinWith(visits) {
right[date] in startDate()..endDate()
}
campaigns.filterJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
Left join
campaigns.leftJoinWith(visits) {
right.date in startDate..endDate
}
val date by column<LocalDate>()
val startDate by column<LocalDate>()
val endDate by column<LocalDate>()
campaigns.leftJoinWith(visits) {
right[date] in startDate()..endDate()
}
campaigns.leftJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
Right join
campaigns.rightJoinWith(visits) {
right.date in startDate..endDate
}
val date by column<LocalDate>()
val startDate by column<LocalDate>()
val endDate by column<LocalDate>()
campaigns.rightJoinWith(visits) {
right[date] in startDate()..endDate()
}
campaigns.rightJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
Full join
campaigns.fullJoinWith(visits) {
right.date in startDate..endDate
}
val date by column<LocalDate>()
val startDate by column<LocalDate>()
val endDate by column<LocalDate>()
campaigns.fullJoinWith(visits) {
right[date] in startDate()..endDate()
}
campaigns.fullJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
Exclude join
Can be viewed as filterJoin with logically opposite predicate
campaigns.excludeJoinWith(visits) {
right.date in startDate..endDate
}
val date by column<LocalDate>()
val startDate by column<LocalDate>()
val endDate by column<LocalDate>()
campaigns.excludeJoinWith(visits) {
right[date] in startDate()..endDate()
}
campaigns.excludeJoinWith(visits) {
right.getValue<LocalDate>("date") in "startDate"<LocalDate>().."endDate"<LocalDate>()
}
Cross join
It can also be called cross product of two DataFrame
objects.
campaigns.joinWith(visits) { true }
Difference from join
join tries to take advantage of knowledge that data in matching columns is the same (because equals
is used) to minimize number of columns in the resulting dataframe.
df1.innerJoin(df2, "index", "age")
Columns that were used in the condition: index
, age
- are present only once. Numerical suffix is used to disambiguate columns that are not used in the condition. Compare it to an equivalent joinWith
:
df1.innerJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
Here columns from both DataFrame
objects are presented as is. So join is better suited for equals
relation, and joinWith is for everything else. Below are two more examples with join types that allow mismatches. Note the difference in null
values
df1.leftJoin(df2, "index", "age")
df1.leftJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
df1.rightJoin(df2, "index", "age")
df1.rightJoinWith(df2) { it["index"] == right["index"] && it["age"] == right["age"] }
Last modified: 27 September 2024