The sparsemax activation function is similar to softmax, but it can output sparse probabilities: some classes can receive a probability of exactly zero.
For batch i and class j:

sparsemax(logits)[i,j] = max(0, logits[i,j] - τ(logits[i,:]))

where τ(logits[i,:]) is the threshold for row i, chosen so that the outputs of that row sum to 1.
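The formula above can be sketched in plain Python to show how the threshold τ might be computed for each row. This is a minimal illustration, not the library's implementation: it assumes the common sort-based procedure for finding τ, takes the logits as a list of rows, and the helper name `sparsemax_threshold` is hypothetical.

```python
from typing import List


def sparsemax_threshold(row: List[float]) -> float:
    """Return the threshold tau for one row of logits (hypothetical helper)."""
    z = sorted(row, reverse=True)  # logits in descending order
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z, start=1):
        cumsum += z_k
        # The k-th largest logit stays in the support while
        # 1 + k * z_k exceeds the sum of the k largest logits.
        # This always holds for k = 1, so tau is always assigned.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return tau


def sparsemax(logits: List[List[float]]) -> List[List[float]]:
    """Apply max(0, logits[i][j] - tau(logits[i])) to every row i."""
    out = []
    for row in logits:
        tau = sparsemax_threshold(row)
        out.append([max(0.0, v - tau) for v in row])
    return out


if __name__ == "__main__":
    probs = sparsemax([[2.0, 1.0, 0.1], [0.5, 0.4, -1.0]])
    print(probs[0])  # [1.0, 0.0, 0.0]
    print(probs[1])  # approximately [0.55, 0.45, 0.0]
```

In the first row the largest logit leads by a wide enough margin that it takes all of the probability mass, whereas softmax would give every class a nonzero share. Each output row sums to 1, up to floating-point rounding.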
See: From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification (Martins and Astudillo, 2016).