countApproxDistinctByKey

fun <K, V> JavaRDD<Tuple2<K, V>>.countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): JavaRDD<Tuple2<K, Long>>

Returns the approximate number of distinct values for each key in this RDD.

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".

Parameters

relativeSD

Relative accuracy of the estimate. Smaller values give more accurate counts but create counters that require more space. Must be greater than 0.000017.

partitioner

Partitioner of the resulting RDD.
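
To make the returned quantity concrete, the following plain-Kotlin sketch computes the exact number of distinct values per key for an in-memory list. It is only a reference for what countApproxDistinctByKey estimates, not Spark code; the name exactDistinctByKey is hypothetical and not part of this API.

```kotlin
// Exact per-key distinct count on an in-memory list.
// Hypothetical reference helper: this is the quantity that
// countApproxDistinctByKey estimates, computed without any approximation.
fun <K, V> exactDistinctByKey(pairs: List<Pair<K, V>>): Map<K, Long> =
    pairs.groupBy({ it.first }, { it.second })
        .mapValues { (_, values) -> values.toSet().size.toLong() }

fun main() {
    val pairs = listOf("a" to 1, "a" to 1, "a" to 2, "b" to 7)
    println(exactDistinctByKey(pairs)) // {a=2, b=1}
}
```

The exact version keeps every distinct value per key in memory, which is the cost the HyperLogLog-based estimate is meant to avoid on large RDDs.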


fun <K, V> JavaRDD<Tuple2<K, V>>.countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): JavaRDD<Tuple2<K, Long>>

Returns the approximate number of distinct values for each key in this RDD.

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".

Parameters

relativeSD

Relative accuracy of the estimate. Smaller values give more accurate counts but create counters that require more space. Must be greater than 0.000017.

numPartitions

Number of partitions of the resulting RDD.
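
As a rough illustration of what numPartitions controls, the sketch below assigns keys to partitions by a non-negative modulus of the key's hash code. This assumes the overload hash-partitions the result, as the underlying Spark method does with a HashPartitioner as far as I recall; the helper partitionFor is hypothetical and not part of this API.

```kotlin
// Hypothetical helper: maps a key to a partition index in 0 until numPartitions
// using a non-negative modulus of its hash code (hash partitioning).
// Assumption: null keys are placed in partition 0.
fun partitionFor(key: Any?, numPartitions: Int): Int {
    require(numPartitions > 0) { "numPartitions must be positive" }
    return if (key == null) 0 else Math.floorMod(key.hashCode(), numPartitions)
}

fun main() {
    // "a", "b" and "c" have hash codes 97, 98 and 99.
    for (key in listOf("a", "b", "c")) {
        println("$key -> partition ${partitionFor(key, 4)}")
    }
    // a -> partition 1
    // b -> partition 2
    // c -> partition 3
}
```

Under this scheme all counters for a given key end up in the same partition, so more partitions spread the per-key counters across more tasks.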


fun <K, V> JavaRDD<Tuple2<K, V>>.countApproxDistinctByKey(relativeSD: Double): JavaRDD<Tuple2<K, Long>>

Returns the approximate number of distinct values for each key in this RDD.

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".

Parameters

relativeSD

Relative accuracy of the estimate. Smaller values give more accurate counts but create counters that require more space. Must be greater than 0.000017.
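
The space cost of a smaller relativeSD can be sketched as follows. The formula for the precision p is my recollection of how Spark's Scala implementation derives it from relativeSD, so treat it as an assumption that may differ between versions; a HyperLogLog counter with precision p uses 2^p registers. The helper names precisionFor and registersFor are hypothetical.

```kotlin
import kotlin.math.ceil
import kotlin.math.ln

// Hypothetical helper: HyperLogLog precision p (number of index bits) for a
// given relativeSD. Assumed formula: p = ceil(2 * log2(1.054 / relativeSD)).
fun precisionFor(relativeSD: Double): Int {
    require(relativeSD > 0.000017) { "relativeSD ($relativeSD) must be greater than 0.000017" }
    return ceil(2.0 * ln(1.054 / relativeSD) / ln(2.0)).toInt()
}

// A HyperLogLog counter with precision p keeps 2^p registers.
fun registersFor(relativeSD: Double): Long = 1L shl precisionFor(relativeSD)

fun main() {
    for (sd in listOf(0.05, 0.01, 0.001)) {
        println("relativeSD=$sd -> p=${precisionFor(sd)}, registers=${registersFor(sd)}")
    }
    // relativeSD=0.05 -> p=9, registers=512
    // relativeSD=0.01 -> p=14, registers=16384
    // relativeSD=0.001 -> p=21, registers=2097152
}
```

Under this assumed formula, the lower bound of 0.000017 corresponds to a precision of at most 32 bits, and each key carries its own counter, so the register count applies per key.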