groupByKey

inline fun <T, R> Dataset<T>.groupByKey(noinline func: (T) -> R): KeyValueGroupedDataset<R, T>

(Kotlin-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
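A minimal usage sketch, assuming the kotlin-spark-api entry points withSpark and dsOf are available on the classpath; the sample words and the first-letter key function are purely illustrative:

    import org.jetbrains.kotlinx.spark.api.*

    fun main() = withSpark {
        val words = dsOf("apple", "avocado", "banana")              // Dataset<String>
        val byFirstLetter = words.groupByKey { it.substring(0, 1) } // KeyValueGroupedDataset<String, String>
        byFirstLetter.count().show()                                // one row per first letter with its count
    }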


fun <K, V> JavaRDD<Tuple2<K, V>>.groupByKey(partitioner: Partitioner): JavaRDD<Tuple2<K, Iterable<V>>>

Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.

Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.

Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.
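A hedged sketch of the Partitioner overload; the local master, the sample pairs, and the wildcard import that brings the extension into scope are assumptions, not part of this API:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext
    import org.jetbrains.kotlinx.spark.api.*   // assumed location of the groupByKey extension
    import scala.Tuple2

    fun main() {
        val sc = JavaSparkContext(SparkConf().setAppName("groupByKeyDemo").setMaster("local[*]"))
        val pairs = sc.parallelize(listOf(Tuple2("a", 1), Tuple2("a", 2), Tuple2("b", 3)))
        // Group all values per key, placing each group according to the supplied HashPartitioner.
        val grouped = pairs.groupByKey(HashPartitioner(4))          // JavaRDD<Tuple2<String, Iterable<Int>>>
        grouped.collect().forEach(::println)
        sc.stop()
    }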


fun <K, V> JavaRDD<Tuple2<K, V>>.groupByKey(numPartitions: Int): JavaRDD<Tuple2<K, Iterable<V>>>

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD into numPartitions partitions. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.

Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.

Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.
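A sketch of the numPartitions overload, reusing the assumed pairs RDD from the previous sketch; the follow-up sum is only there to illustrate why reduceByKey or aggregateByKey is usually preferable for pure aggregations:

    // Hash-partition the grouped RDD into 8 partitions.
    val grouped = pairs.groupByKey(numPartitions = 8)               // JavaRDD<Tuple2<String, Iterable<Int>>>
    // Summing after grouping materialises every value for a key first; reduceByKey avoids that.
    val sums = grouped.map { kv -> Tuple2(kv._1(), kv._2().sum()) } // JavaRDD<Tuple2<String, Int>>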


fun <K, V> JavaRDD<Tuple2<K, V>>.groupByKey(): JavaRDD<Tuple2<K, Iterable<V>>>

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with the existing partitioner/parallelism level. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.

Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.
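A one-line sketch of the no-argument overload, again assuming the pairs RDD from above; the partition count falls back to the existing partitioner/parallelism level:

    val grouped = pairs.groupByKey()   // JavaRDD<Tuple2<String, Iterable<Int>>>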


fun <K, V> JavaDStream<Tuple2<K, V>>.groupByKey(
    numPartitions: Int = dstream().ssc().sc().defaultParallelism()
): JavaDStream<Tuple2<K, Iterable<V>>>

Return a new DStream by applying groupByKey to each RDD. Hash partitioning is used to generate the RDDs with numPartitions partitions.
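A hedged streaming sketch; the JavaStreamingContext wiring and the pairStream DStream of Tuple2<String, Int> are assumed to be set up elsewhere:

    // Group values per key within every batch, hash-partitioned into 4 partitions.
    val groupedPerBatch = pairStream.groupByKey(numPartitions = 4)  // JavaDStream<Tuple2<String, Iterable<Int>>>
    groupedPerBatch.print()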


fun <K, V> JavaDStream<Tuple2<K, V>>.groupByKey(partitioner: Partitioner): JavaDStream<Tuple2<K, Iterable<V>>>

Return a new DStream by applying groupByKey on each RDD. The supplied org.apache.spark.Partitioner is used to control the partitioning of each RDD.
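The same assumed pairStream, this time grouped with an explicit partitioner for each batch RDD; HashPartitioner is Spark's built-in, but any org.apache.spark.Partitioner works:

    val grouped = pairStream.groupByKey(HashPartitioner(4))         // JavaDStream<Tuple2<String, Iterable<Int>>>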