groupByKey
(Kotlin-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.
Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with into numPartitions partitions. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.
Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError.
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with the existing partitioner/parallelism level. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using aggregateByKey or reduceByKey will provide much better performance.
Return a new DStream by applying groupByKey
to each RDD. Hash partitioning is used to generate the RDDs with numPartitions
partitions.
Return a new DStream by applying groupByKey
on each RDD. The supplied org.apache.spark.Partitioner is used to control the partitioning of each RDD.