countDistinct
Counts distinct rows or distinct combinations of values in selected columns.
When countDistinct is used on a DataFrame, it returns the number of distinct rows in this DataFrame.
You can also specify which columns to use when counting distinct combinations of values.
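The difference between counting distinct rows and counting distinct combinations of selected columns can be illustrated with a minimal sketch in plain Kotlin. It models the semantics using only the standard library and is not the library's own API. The `Person` type and both helper functions are hypothetical names introduced for this illustration.

```kotlin
// Hypothetical row type used only for this illustration.
data class Person(val name: String, val city: String, val age: Int)

// Number of distinct rows: two rows are equal when all their fields are equal.
fun countDistinctRows(rows: List<Person>): Int = rows.distinct().size

// Number of distinct combinations of the selected columns (here: name and city).
fun countDistinctNameCity(rows: List<Person>): Int =
    rows.map { it.name to it.city }.distinct().size

fun main() {
    val rows = listOf(
        Person("Alice", "London", 30),
        Person("Alice", "London", 30), // exact duplicate of the first row
        Person("Alice", "London", 41), // same name and city, different age
        Person("Bob", "Paris", 30),
    )
    println(countDistinctRows(rows))     // 3
    println(countDistinctNameCity(rows)) // 2
}
```

Restricting the comparison to fewer columns can only keep the count the same or lower it, because rows that differ only in the ignored columns collapse into one combination.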
When countDistinct is used on a GroupBy, it counts distinct rows within each group and returns a DataFrame with one row per group of the original GroupBy. The result contains the original group key columns and a new column with the number of distinct rows (or distinct combinations of values in the selected columns) in each group.
Let's take this GroupBy as an example:
Applying countDistinct to this GroupBy yields the following result:
You can also specify which columns in the groups should be used to determine distinctness.
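As a rough model of the per-group behavior, the sketch below groups rows by a key and counts distinct rows in each group, first over all columns and then over a single selected column. It uses only the Kotlin standard library and is not the library's implementation. The `Order` type and the two functions are hypothetical.

```kotlin
// Hypothetical row type used only for this illustration.
data class Order(val customer: String, val product: String, val quantity: Int)

// Distinct full rows per group, keyed by the group key (customer).
fun distinctRowsPerCustomer(orders: List<Order>): Map<String, Int> =
    orders.groupBy { it.customer }
        .mapValues { (_, group) -> group.distinct().size }

// Distinct values of one selected column (product) per group.
fun distinctProductsPerCustomer(orders: List<Order>): Map<String, Int> =
    orders.groupBy { it.customer }
        .mapValues { (_, group) -> group.map { it.product }.distinct().size }

fun main() {
    val orders = listOf(
        Order("Alice", "pen", 1),
        Order("Alice", "pen", 1), // exact duplicate within Alice's group
        Order("Alice", "pen", 5), // same product, different quantity
        Order("Bob", "pen", 1),
        Order("Bob", "ink", 2),
    )
    println(distinctRowsPerCustomer(orders))     // {Alice=2, Bob=2}
    println(distinctProductsPerCustomer(orders)) // {Alice=1, Bob=2}
}
```

Distinctness is evaluated inside each group independently, so the same value appearing in two groups is counted once in each.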
The default name of the new column is countDistinct, but you can choose a different one.
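The shape of the result, with a key column plus a count column whose name can be overridden, might be modeled as follows. This is a standard-library sketch only. The `Visit` type, the `countDistinctByCity` function, and its `resultName` parameter are hypothetical and do not reflect the library's actual signature.

```kotlin
// Hypothetical row type used only for this illustration.
data class Visit(val city: String, val visitor: String)

// Builds one result row per group: the key column plus a count column.
// `resultName` is a hypothetical parameter of this sketch; its default
// mirrors the default column name described in the text.
fun countDistinctByCity(
    visits: List<Visit>,
    resultName: String = "countDistinct",
): List<Map<String, Any>> =
    visits.groupBy { it.city }.map { (city, group) ->
        mapOf<String, Any>("city" to city, resultName to group.distinct().size)
    }

fun main() {
    val visits = listOf(
        Visit("Paris", "Ann"),
        Visit("Paris", "Ann"), // duplicate row, counted once
        Visit("Paris", "Bo"),
        Visit("Rome", "Ann"),
    )
    println(countDistinctByCity(visits))
    // [{city=Paris, countDistinct=2}, {city=Rome, countDistinct=1}]
    println(countDistinctByCity(visits, resultName = "uniqueVisits"))
    // [{city=Paris, uniqueVisits=2}, {city=Rome, uniqueVisits=1}]
}
```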