Question: Where do I call the BatchNormalization function in Keras?
If I want to use the BatchNormalization function in Keras, then do I need to call it once only at the beginning?
I read this documentation for it: http://keras.io/layers/normalization/
I don’t see where I’m supposed to call it. Below is my code attempting to use it:
model = Sequential()
keras.layers.normalization.BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None)
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)
I ask because if I run the code with the second line including the batch normalization and if I run the code without the second line I get similar outputs. So either I’m not calling the function in the right place, or I guess it doesn’t make that much of a difference.
Answer 0
Just to answer this question in a little more detail, and as Pavel said, Batch Normalization is just another layer, so you can use it as such to create your desired network architecture.
The general use case is to use BN between the linear and non-linear layers in your network, because it normalizes the input to your activation function, so that you’re centered in the linear section of the activation function (such as Sigmoid). There’s a small discussion of it here
In your case above, this might look like:
# import BatchNormalization
from keras.layers.normalization import BatchNormalization
# instantiate model
model = Sequential()
# we can think of this chunk as the input layer
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# we can think of this chunk as the hidden layer
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
# we can think of this chunk as the output layer
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('softmax'))
# setting up the optimization of our weights
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
# running the fitting
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)
Hope this clarifies things a bit more.
Answer 1
This thread is misleading. Tried commenting on Lucas Ramadan’s answer, but I don’t have the right privileges yet, so I’ll just put this here.
Batch normalization works best after the activation function, and here or here is why: it was developed to prevent internal covariate shift. Internal covariate shift occurs when the distribution of the activations of a layer shifts significantly throughout training. Batch normalization is used so that the distribution of the inputs (and these inputs are literally the result of an activation function) to a specific layer doesn’t change over time due to parameter updates from each batch (or at least, allows it to change in an advantageous way). It uses batch statistics to do the normalizing, and then uses the batch normalization parameters (gamma and beta in the original paper) “to make sure that the transformation inserted in the network can represent the identity transform” (quote from original paper). But the point is that we’re trying to normalize the inputs to a layer, so it should always go immediately before the next layer in the network. Whether or not that’s after an activation function is dependent on the architecture in question.
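To make that placement concrete, here is a minimal sketch (not part of the original answer) that rearranges the question’s model so that BatchNormalization comes after each activation instead of before it, using the same Keras 1.x API as the rest of this thread:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.normalization import BatchNormalization
model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(Activation('tanh'))
# normalize the activations themselves, i.e. the inputs seen by the next Dense layer
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))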
Answer 2
This thread has some considerable debate about whether BN should be applied before the non-linearity of the current layer or to the activations of the previous layer.
Although there is no single correct answer, the authors of Batch Normalization say that it should be applied immediately before the non-linearity of the current layer. The reason, quoted from the original paper:
“We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is ‘more Gaussian’ (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.”
Answer 3
Keras now supports the use_bias=False option, so we can save some computation by writing something like the following (the layer’s bias is redundant here because the BatchNormalization that follows has its own learnable offset, beta):
model.add(Dense(64, use_bias=False))
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('tanh'))
or
model.add(Convolution2D(64, 3, 3, use_bias=False))
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('relu'))
Answer 4
It’s almost become a trend now to have a Conv2D followed by a ReLU followed by a BatchNormalization layer. So I made up a small function to call all of them at once. It makes the model definition look a whole lot cleaner and easier to read.
def Conv2DReluBatchNorm(n_filter, w_filter, h_filter, inputs):
    # convolution -> ReLU -> batch normalization, in a single call (Keras 1.x functional API)
    x = Convolution2D(n_filter, w_filter, h_filter, border_mode='same')(inputs)
    x = Activation('relu')(x)
    return BatchNormalization()(x)
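For context, a hypothetical usage sketch with the Keras 1.x functional API (the input shape, filter counts, and dim ordering below are made-up assumptions, not from the original answer):
from keras.layers import Input
from keras.models import Model
# assumes the TensorFlow image_dim_ordering, i.e. (rows, cols, channels)
inputs = Input(shape=(32, 32, 3))
x = Conv2DReluBatchNorm(32, 3, 3, inputs)
x = Conv2DReluBatchNorm(64, 3, 3, x)
model = Model(input=inputs, output=x)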
Answer 6
Batch Normalization is used to normalize the input layer as well as the hidden layers by adjusting the mean and scale of the activations. Because of this normalizing effect of the additional layers, a deep neural network can use a higher learning rate without vanishing or exploding gradients. Furthermore, batch normalization regularizes the network so that it generalizes more easily, which can reduce the need for dropout to mitigate overfitting.
Right after computing the linear function with, say, Dense() or Conv2D() in Keras, we add BatchNormalization(), which normalizes that linear output, and then we add the non-linearity to the layer with Activation().
from keras.layers.normalization import BatchNormalization
model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('softmax'))
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True,
validation_split=0.2, verbose = 2)
How is Batch Normalization applied?
Suppose we have the input a[l-1] to a layer l. We also have weights W[l] and a bias unit b[l] for layer l. Let a[l] be the activation vector computed for layer l (i.e. after applying the non-linearity) and z[l] the vector before the non-linearity is applied.
- Using a[l-1] and W[l] we can calculate z[l] for layer l.
- Usually in feed-forward propagation we would add the bias unit to z[l] at this stage, as in z[l] + b[l], but with Batch Normalization this addition of b[l] is not required and no b[l] parameter is used (the mean subtraction below would cancel any constant offset, and β takes over that role).
- Calculate the mean of z[l] and subtract it from each element.
- Divide (z[l] - mean) by the standard deviation; call the result Z_temp[l].
Now define new parameters γ and β that rescale and shift the normalized hidden-layer values as follows:
z_norm[l] = γ * Z_temp[l] + β
In the code above, Dense() takes a[l-1], uses W[l], and computes z[l]. The BatchNormalization() that immediately follows performs the steps above to give z_norm[l], and the Activation() right after that computes tanh(z_norm[l]) to give a[l], i.e.
a[l] = tanh(z_norm[l])
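As a minimal sketch of those steps in plain NumPy (not from the original answer; the function and variable names here are just illustrative):
import numpy as np
def batch_norm_forward(z, gamma, beta, eps=1e-6):
    # z: (batch_size, units) pre-activations W[l].a[l-1], with no bias term
    mean = z.mean(axis=0)                      # per-feature batch mean
    var = z.var(axis=0)                        # per-feature batch variance
    z_temp = (z - mean) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * z_temp + beta               # rescale and shift: z_norm = gamma * Z_temp + beta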