joinWith
Joins two DataFrame objects by a join expression.
This function is a join variant that lets you match data using any expression that returns a Boolean, which also gives opportunity to perform operations that require values from both matching rows. Can be helpful if the data you want to join wasn't designed relational and requires heuristics to tell if rows are matching, or has relations other than equals.
For example, you can match rows based on:
Order relations such as
>,<,infor numerical or DateTime valuesSpatial relations, like distance within a certain range if your data includes spatial or geographical values
String equivalence using more complex comparison techniques, such as
contains, regular expressions, Levenshtein Distance or language models.
Join types with examples
Supported join types:
Inner(default) — only matched rows from left and rightDataFrameobjectsFilter— only matched rows from leftDataFrameLeft— all rows from leftDataFrame, mismatches from rightDataFramefilled withnullRight— all rows from rightDataFrame, mismatches from leftDataFramefilled withnullFull— all rows from left and rightDataFrameobjects, any mismatches filled withnullExclude— only mismatched rows from left
For every join type there is a shortcut operation:
Inner join
Filter join
Special case of inner join when you only need the data from the left table.
Left join
Right join
Full join
Exclude join
Can be viewed as filterJoin with logically opposite predicate
Cross join
It can also be called cross product of two DataFrame objects.
Difference from join
join tries to take advantage of knowledge that data in matching columns is the same (because equals is used) to minimize number of columns in the resulting dataframe.
Columns that were used in the condition: index, age - are present only once. Numerical suffix is used to disambiguate columns that are not used in the condition. Compare it to an equivalent joinWith:
Here columns from both DataFrame objects are presented as is. So join is better suited for equals relation, and joinWith is for everything else. Below are two more examples with join types that allow mismatches. Note the difference in null values