From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

S(y_i) = e^(y_i) / sum_j e^(y_j)

where S(y_i) is the softmax function of y_i, e is the exponential, and j is the no. of columns in the input vector Y.
I’ve tried the following:
import numpy as np
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
scores = [3.0, 1.0, 0.2]
print(softmax(scores))
which returns:
[ 0.8360188 0.11314284 0.05083836]
But the suggested solution was:
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
return np.exp(x) / np.sum(np.exp(x), axis=0)
which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.
Can someone show mathematically why? Is one correct and the other one wrong?
Are the implementations similar in terms of code and time complexity? Which is more efficient?
Answer 0
They are both correct, but yours is preferred from the point of view of numerical stability.

You start with

e^(x - max(x)) / sum(e^(x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c), we get

= e^x / (e^max(x) * sum(e^x / e^max(x))) = e^x / sum(e^x)
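A quick numerical check of this identity, using the scores from the question:

import numpy as np

x = np.array([3.0, 1.0, 0.2])
lhs = np.exp(x - np.max(x)) / np.exp(x - np.max(x)).sum()  # stabilized version
rhs = np.exp(x) / np.exp(x).sum()                          # direct version
print(np.allclose(lhs, rhs))  # True -- the e^max(x) factor cancels out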
(Well… much confusion here, both in the question and in the answers…)
To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered it if you had also tried the 2-D score array provided as a test example in the Udacity quiz.
Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let’s try your solution (your_softmax) and one where the only difference is the axis argument:
import numpy as np
# your solution:
def your_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
# correct solution:
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0) # only difference
As I said, for a 1-D score array, the results are indeed identical:
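The demonstration that originally followed is not included in this excerpt; here is a minimal reconstruction (the 2-D scores are assumed to be the quiz's test example):

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))  # [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))       # [ 0.8360188   0.11314284  0.05083836]

# For the 2-D score array used as a test example in the quiz, however:
scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])
print(your_softmax(scores2D))  # normalizes over all entries -- the whole array sums to 1
print(softmax(scores2D))       # normalizes down each column -- every column sums to 1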
The results are different – the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.
So, all the fuss was actually for an implementation detail – the axis argument. According to the numpy.sum documentation:
The default, axis=None, will sum all of the elements of the input array
while here we want to sum row-wise, hence axis=0. For a 1-D array, the sum of the (only) row and the sum of all the elements happen to be identical, hence your identical results in that case…
The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function – see here for the justification (numeric stability, also pointed out by some other answers here).
So, this is really a comment to desertnaut's answer but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he first takes a 1-dimensional input and then a 2-dimensional input. Let me show this to you.
import numpy as np
# your solution:
def your_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
# desertnaut solution (copied from his answer):
def desertnaut_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0) # only difference
# my (correct) solution:
def softmax(z):
assert len(z.shape) == 2
s = np.max(z, axis=1)
s = s[:, np.newaxis] # necessary step to do broadcasting
e_x = np.exp(z - s)
div = np.sum(e_x, axis=1)
div = div[:, np.newaxis] # dito
return e_x / div
Let's take desertnaut's example:
x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)
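# The 2-D batch referred to in the next paragraph is not shown in this excerpt;
# a plausible reconstruction (the exact values are an assumption):
x2 = np.array([[1, 2, 3, 6],   # sample 1 (same scores as x1)
               [2, 4, 5, 6],   # sample 2
               [1, 2, 3, 6]])  # sample 3, identical to sample 1

print(your_softmax(x2))        # normalizes over the entire batch -- rows do not sum to 1
print(desertnaut_softmax(x2))  # normalizes each column (axis=0) -- still not per-sample
print(softmax(x2))             # normalizes each row (axis=1) -- rows 1 and 3 match softmax(x1)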
This input consists of a batch with 3 samples. But sample one and three are essentially the same. We now expect 3 rows of softmax activations where the first should be the same as the third and also the same as our activation of x1!
I would say that while both are correct mathematically, implementation-wise, the first one is better. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick, which is essentially what you are doing.
Answer 4
sklearn also provides an implementation of softmax:
from sklearn.utils.extmath import softmax
import numpy as np
x = np.array([[0.50839931,0.49767588,0.51260159]])
softmax(x)

# output
array([[0.3340521, 0.33048906, 0.33545884]])
From a mathematical point of view both sides are equal.

And you can easily prove this. Let m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to

e^(x_i - m) / sum_j e^(x_j - m) = (e^(x_i) / e^m) / (sum_j e^(x_j) / e^m) = e^(x_i) / sum_j e^(x_j)

notice that this works for any m, because for all (even complex) numbers e^m != 0
From a computational complexity point of view they are also equivalent and both run in O(n) time, where n is the size of the vector.

From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and even for pretty small values of x it will overflow. Subtracting the maximum value lets you avoid this overflow. To experience this in practice, try to feed x = np.array([1000, 5]) into both of your functions. One will return the correct probability, the other will overflow with nan.
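For instance (re-defining both versions locally; the helper names are mine):

import numpy as np

def softmax_stable(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmax_naive(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

x = np.array([1000, 5])
print(softmax_stable(x))  # [1. 0.]
print(softmax_naive(x))   # [nan  0.] -- np.exp(1000) overflows to inf, and inf/inf is nan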
Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). In order to fix it you need to use sum(axis=0).
I wrote a function applying the softmax over any axis:
def softmax(X, theta = 1.0, axis = None):
"""
Compute the softmax of each element along an axis of X.
Parameters
----------
X: ND-Array. Probably should be floats.
theta (optional): float parameter, used as a multiplier
prior to exponentiation. Default = 1.0
axis (optional): axis to compute values along. Default is the
first non-singleton axis.
Returns an array the same size as X. The result will sum to 1
along the specified axis.
"""
# make X at least 2d
y = np.atleast_2d(X)
# find axis
if axis is None:
axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
# multiply y against the theta parameter,
y = y * float(theta)
# subtract the max for numerical stability
y = y - np.expand_dims(np.max(y, axis = axis), axis)
# exponentiate y
y = np.exp(y)
# take the sum along the specified axis
ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
# finally: divide elementwise
p = y / ax_sum
# flatten if X was 1D
if len(X.shape) == 1: p = p.flatten()
return p
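For example (a quick illustration of the axis and theta parameters; the inputs are my own):

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(softmax(X, axis=0))             # each column sums to 1
print(softmax(X, axis=1))             # each row sums to 1
print(softmax(X, theta=0.5, axis=1))  # a smaller theta gives a flatter (less peaked) distribution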
Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.
“When you’re writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick.”
To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end where you can trust the result will be well-behaved.
import scipy.special as sc
import numpy as np
def softmax(x: np.ndarray) -> np.ndarray:
return np.exp(x - sc.logsumexp(x))
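With extreme inputs (my own example values), the log-space version stays well-behaved:

print(softmax(np.array([1000.0, 5.0])))      # [1. 0.] -- no overflow
print(softmax(np.array([-1000.0, -1001.0]))) # ~[0.73 0.27] -- a naive exp would give [0. 0.] and then 0/0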
I needed something compatible with the output of a dense layer from TensorFlow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:
def softmax(x):
    if len(x.shape) > 1:
        # 2-D input: one sample per row, so normalize along axis 1
        tmp = np.max(x, axis=1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis=1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # 1-D input: a single sample
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp
    return x
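A quick check on both input shapes (the example values are my own):

x_1d = np.array([3.0, 1.0, 0.2])
x_2d = np.array([[3.0, 1.0, 0.2],
                 [1.0, 2.0, 3.0]])
print(softmax(x_1d))  # sums to 1
print(softmax(x_2d))  # each row sums to 1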
I would like to add a little bit more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code in the other post, you will find it does not give you the right answer when the array is 2-D or higher dimensional.
Here I give you some suggestions:

1. To get the max, do it along the x-axis; you will get a 1-D array.
2. Reshape your max array to the original shape.
3. Apply np.exp to get the exponential values.
4. Apply np.sum along the axis.
5. Get the final result.
Follow these steps and you will get the correct answer by doing vectorization. Since it is related to college homework, I cannot post the exact code here, but I would like to give more suggestions if you don't understand.
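A minimal sketch of the steps above (my own code, not the answerer's, since the exact solution was withheld):

import numpy as np

def softmax_2d(z):
    m = np.max(z, axis=1)                 # step 1: max along the last axis -> 1-D array
    m = m.reshape(-1, 1)                  # step 2: reshape so it broadcasts against z
    e = np.exp(z - m)                     # step 3: exponentiate the shifted values
    s = np.sum(e, axis=1).reshape(-1, 1)  # step 4: sum along the same axis
    return e / s                          # step 5: each row now sums to 1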
The purpose of the softmax function is to preserve the ratio of the vectors, as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/-1 (tanh) or from 0 to 1 (logistic)). This is because it preserves more information about the rate of change at the end-points and thus is more applicable to neural nets with 1-of-N output encoding (i.e. if we squashed the end-points it would be harder to differentiate the 1-of-N output class, because we can't tell which one is the "biggest" or "smallest" once they get squished). It also makes the total output sum to 1, and the clear winner will be closer to 1 while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you compute the e^y exponents you may get a very high value that clips the float at the max value, leading to a tie, which is not the case in this example. This becomes a BIG problem if you subtract the max value to make a negative number; then you have a negative exponent that rapidly shrinks the values, altering the ratio, which is what occurred in the poster's question and yielded the incorrect answer.

The answer supplied by Udacity is HORRIBLY inefficient. The first thing we need to do is calculate e^y_j for all vector components, KEEP THOSE VALUES, then sum them up, and divide. Where Udacity messed up is that they calculate e^y_j TWICE!!! Here is the correct answer:
def softmax(X, theta = 1.0, axis = None):
"""
Compute the softmax of each element along an axis of X.
Parameters
----------
X: ND-Array. Probably should be floats.
theta (optional): float parameter, used as a multiplier
prior to exponentiation. Default = 1.0
axis (optional): axis to compute values along. Default is the
first non-singleton axis.
Returns an array the same size as X. The result will sum to 1
along the specified axis.
"""
# make X at least 2d
y = np.atleast_2d(X)
# find axis
if axis is None:
axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
# multiply y against the theta parameter,
y = y * float(theta)
# subtract the max for numerical stability
y = y - np.expand_dims(np.max(y, axis = axis), axis)
# exponentiate y
y = np.exp(y)
# take the sum along the specified axis
ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
# finally: divide elementwise
p = y / ax_sum
# flatten if X was 1D
if len(X.shape) == 1: p = p.flatten()
return p
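Note that logits_np is not defined in this excerpt; to make the snippet below self-contained one might use, for example (values are purely illustrative):

logits_np = np.array([[3.0, 1.0, 0.2],
                      [1.0, 2.0, 3.0]])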
scores_np = softmax(logits_np, axis=-1)
print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)
print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))
The softmax function is an activation function that turns numbers into probabilities which sum to one. The softmax function outputs a vector that represents the probability distribution over a list of outcomes. It is also a core element used in deep learning classification tasks.

The softmax function is used when we have multiple classes.
It is useful for finding out the class which has the maximum probability.
The softmax function is ideally used in the output layer, where we are actually trying to attain the probabilities that define the class of each input.
Its outputs range from 0 to 1.
The softmax function turns logits [2.0, 1.0, 0.1] into probabilities [0.7, 0.2, 0.1], and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.
The softmax function is, in fact, a soft version of the arg max function. The arg max does not return the largest value from the input but the position of the largest value; softmax instead assigns the largest probability to that position while keeping nonzero probabilities for the others.
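A quick numeric check of the example above (my own snippet):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.exp(logits).sum()
print(np.round(probs, 1))  # [0.7 0.2 0.1] -- matches the probabilities quoted above
print(np.argmax(logits))   # 0 -- the "hard" arg max only reports the winning position
print(np.argmax(probs))    # 0 -- softmax keeps the same ordering, with soft probabilities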
import numpy as np
# your solution:
def your_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
# correct solution:
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0)
# only difference