updateStateByKey
Return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values of each key. In every batch, the updateFunc will be called for each state even if there are no new values. Hash partitioning is used to generate the RDDs with Spark's default number of partitions. Note: requires the checkpoint directory to be set.
Parameters
State update function. If this function returns None, the corresponding state key-value pair will be eliminated.
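The update semantics can be illustrated in plain Scala, independent of Spark. The sketch below is an assumption for illustration only (the object, the running-count example, and the `step` helper are hypothetical, not part of the API): it shows an update function of the expected `(Seq[V], Option[S]) => Option[S]` shape, that the function is invoked for every key with state even when the key has no new values in the batch, and that returning None drops the pair.

```scala
object UpdateStateSketch {
  // Hypothetical update function with the same shape updateStateByKey expects:
  // it receives the new values for a key in this batch plus the previous
  // state, and returns the new state. Returning None eliminates the pair.
  def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = newValues.sum + state.getOrElse(0)
    if (sum == 0) None else Some(sum)
  }

  // Simulate one batch: the update function is applied to every key that
  // has either new values or existing state, mirroring the documented
  // behavior that it runs even when a key has no new values.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { k =>
      updateCount(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }
}
```

A state seeded at zero with no new values returns None from `updateCount`, so the key disappears from the state map, just as the documentation describes for the real operator.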
Return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values of each key. An org.apache.spark.Partitioner is used to control the partitioning of each RDD. Note: requires the checkpoint directory to be set.
Parameters
State update function. Note that this function may produce a tuple with a different key than the input key, so keys may be removed or added this way. It is up to the developer to decide whether to remember the partitioner despite the key being changed.
Partitioner for controlling the partitioning of each RDD in the new DStream
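For this key-changing overload, the update function operates on an iterator of (key, new values, previous state) triples and emits (key, state) pairs, so the output key need not match the input key. The following plain-Scala sketch is a hypothetical example (the object name and the lower-casing logic are assumptions, not from the source), assuming the `Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]` shape:

```scala
object KeyChangingSketch {
  // Hypothetical iterator-based update function of the shape this overload
  // expects. It normalizes keys to lower case, so the output can contain
  // keys that never appeared in the input and vice versa -- the situation
  // the documentation warns about.
  def updateFunc(it: Iterator[(String, Seq[Int], Option[Int])]): Iterator[(String, Int)] =
    it.map { case (key, newValues, state) =>
      (key.toLowerCase, newValues.sum + state.getOrElse(0))
    }
}
```

Because the mapping is per-triple, two input keys such as "A" and "a" can collapse onto one output key; resolving such collisions, and deciding whether the original partitioning is still valid after keys change, is left to the developer.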
Return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values of the key. An org.apache.spark.Partitioner is used to control the partitioning of each RDD. Note: requires the checkpoint directory to be set.
Parameters
State update function. If this function returns None, the corresponding state key-value pair will be eliminated.
Partitioner for controlling the partitioning of each RDD in the new DStream.
Initial state value of each key.
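The effect of supplying an initial state can be sketched in plain Scala as seeding a fold over batches, so the first batch updates on top of the pre-existing per-key values rather than an empty state. This is a hypothetical illustration of the semantics only (the object and helper names are assumptions, not part of the API):

```scala
object InitialStateSketch {
  // Hypothetical running-count update function in the usual
  // (Seq[V], Option[S]) => Option[S] shape.
  def update(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Apply one batch of (key, value) pairs to the current state map.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  // Seeding the fold with an initial state map plays the role of the
  // initial state values: batch one sees them as the "previous state".
  def run(initial: Map[String, Int], batches: Seq[Seq[(String, Int)]]): Map[String, Int] =
    batches.foldLeft(initial)(step)
}
```

Keys present only in the initial state still flow through the update function on every batch, exactly as if their state had been produced by an earlier batch.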