标签归档:deep-learning

哪些参数应用于提前停止?

问题:哪些参数应用于提前停止?

我正在使用Keras为我的项目训练神经网络。Keras提供了提前停止的功能。我是否知道应该观察哪些参数,以避免由于使用早期停止而使神经网络过度拟合?

I’m training a neural network for my project using Keras. Keras has provided a function for early stopping. May I know what parameters should be observed to avoid my neural network from overfitting by using early stopping?


回答 0

提前停止

一旦损失开始增加(或换句话说,验证准确性开始降低),早期停止基本上就是停止训练。根据文件,其用法如下;

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=0,
                              verbose=0, mode='auto')

值取决于您的实现(问题,批处理大小等),但通常是为了防止我使用过度拟合;

  1. 通过将monitor 参数设置为,监控验证损失(需要使用交叉验证或至少训练/测试集)'val_loss'
  2. min_delta是在某个时期是否将损失量化为改善的阈值。如果损失差异小于min_delta,则将其量化为无改善。最好将其保留为0,因为我们对损失越来越严重感兴趣。
  3. patience参数代表损失开始增加(停止改善)后停止之前的时期数。这取决于您的实现,如果您使用的批次非常小学习率较高,则损失呈锯齿状(准确性会更加嘈杂),因此最好设置一个较大的patience参数。如果您使用大批量学习率较低,则损失会更平稳,因此可以使用较小的patience参数。无论哪种方式,我都将其保留为2,以便为模型提供更多机会。
  4. verbose 确定要打印的内容,将其保留为默认值(0)。
  5. mode参数取决于您监视的数量的方向(应该是减少还是增加),因为我们监视损失,所以可以使用min。但是让我们留给喀拉拉邦为我们处理,并将其设置为auto

因此,我将使用类似的方法并通过绘制有无早期停止的错误损失来进行实验。

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              verbose=0, mode='auto')

为了避免对回调的工作方式产生歧义,我将尝试解释更多信息。调用fit(... callbacks=[es])模型后,Keras会调用给定的回调对象预定的函数。这些功能可以称为on_train_beginon_train_endon_epoch_beginon_epoch_endon_batch_beginon_batch_end。在每个时期结束时调用提前停止回调,将最佳监视值与当前监视值进行比较,并在条件满足时停止(自观察最佳监视值以来已经过去了多少个时期,这不仅仅是耐心参数,两者之间的差最后一个值大于min_delta等。)。

正如@BrentFaust在评论中指出的那样,模型的训练将继续进行,直到满足Early Stopping条件或满足epochsin(默认值= 10)为止fit()。设置“提早停止”回调不会使模型超出其epochs参数进行训练。因此,调用fit()较大epochs值的函数将受益于Early Stopping回调。

early stopping

Early stopping is basically stopping the training once your loss starts to increase (or in other words validation accuracy starts to decrease). According to documents it is used as follows;

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=0,
                              verbose=0, mode='auto')

Values depends on your implementation (problem, batch size etc…) but generally to prevent overfitting I would use;

  1. Monitor the validation loss (need to use cross validation or at least train/test sets) by setting the monitor argument to 'val_loss'.
  2. min_delta is a threshold to whether quantify a loss at some epoch as improvement or not. If the difference of loss is below min_delta, it is quantified as no improvement. Better to leave it as 0 since we’re interested in when loss becomes worse.
  3. patience argument represents the number of epochs before stopping once your loss starts to increase (stops improving). This depends on your implementation, if you use very small batches or a large learning rate your loss zig-zag (accuracy will be more noisy) so better set a large patience argument. If you use large batches and a small learning rate your loss will be smoother so you can use a smaller patience argument. Either way I’ll leave it as 2 so I would give the model more chance.
  4. verbose decides what to print, leave it at default (0).
  5. mode argument depends on what direction your monitored quantity has (is it supposed to be decreasing or increasing), since we monitor the loss, we can use min. But let’s leave keras handle that for us and set that to auto

So I would use something like this and experiment by plotting the error loss with and without early stopping.

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              verbose=0, mode='auto')

For possible ambiguity on how callbacks work, I’ll try to explain more. Once you call fit(... callbacks=[es]) on your model, Keras calls given callback objects predetermined functions. These functions can be called on_train_begin, on_train_end, on_epoch_begin, on_epoch_end and on_batch_begin, on_batch_end. Early stopping callback is called on every epoch end, compares the best monitored value with the current one and stops if conditions are met (how many epochs have past since the observation of the best monitored value and is it more than patience argument, the difference between last value is bigger than min_delta etc..).

As pointed by @BrentFaust in comments, model’s training will continue until either Early Stopping conditions are met or epochs parameter (default=10) in fit() is satisfied. Setting an Early Stopping callback will not make the model to train beyond its epochs parameter. So calling fit() function with a larger epochs value would benefit more from Early Stopping callback.


如何在TensorFlow中应用梯度裁剪?

问题:如何在TensorFlow中应用梯度裁剪?

考虑示例代码

我想知道如何在RNN上的该网络上应用梯度剪切,而梯度可能会爆炸。

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

这是一个可以使用的示例,但是在哪里介绍呢?在RNN中

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)

但这没有意义,因为张量_X是输入,而不是grad,要裁剪的内容是什么?

我是否需要为此定义自己的优化器,还是有一个更简单的选择?

Considering the example code.

I would like to know How to apply gradient clipping on this network on the RNN where there is a possibility of exploding gradients.

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

This is an example that could be used but where do I introduce this ? In the def of RNN

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)

But this doesn’t make sense as the tensor _X is the input and not the grad what is to be clipped?

Do I have to define my own Optimizer for this or is there a simpler option?


回答 0

在计算梯度之后,但在应用梯度更新模型参数之前,需要进行梯度修剪。在您的示例中,这两种AdamOptimizer.minimize()方法均由该方法处理。

为了裁剪您的渐变,您需要按照TensorFlow API文档本节中的说明显式计算,裁剪和应用它们。具体来说,您需要minimize()用以下类似的方法代替对方法的调用:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

Gradient clipping needs to happen after computing the gradients, but before applying them to update the model’s parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.

In order to clip your gradients you’ll need to explicitly compute, clip, and apply them as described in this section in TensorFlow’s API documentation. Specifically you’ll need to substitute the call to the minimize() method with something like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

回答 1

尽管看起来很流行,但您可能希望通过其全局范数来裁剪整个渐变:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

分别裁剪每个渐变矩阵会更改其相对比例,但是也可以:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))

在TensorFlow 2中,磁带计算梯度,优化器来自Keras,我们不需要存储更新操作,因为它会自动运行而不将其传递给会话:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

Clipping each gradient matrix individually changes their relative scale but is also possible:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))

In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don’t need to store the update op because it runs automatically without passing it to a session:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))

回答 2

实际上在文档中对此做了正确解释。

调用minimum()既要计算梯度,又要将其应用于变量。如果要在应用渐变之前对其进行处理,则可以分三步使用优化器:

  • 使用compute_gradients()计算梯度。
  • 根据需要处理渐变。
  • 使用apply_gradients()应用处理后的渐变。

在他们提供的示例中,他们使用以下3个步骤:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

MyCapper是限制渐变的任何函数。有用的功能列表(除外tf.clip_by_value())在此处

This is actually properly explained in the documentation.:

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:

  • Compute the gradients with compute_gradients().
  • Process the gradients as you wish.
  • Apply the processed gradients with apply_gradients().

And in the example they provide they use these 3 steps:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.


回答 3

对于那些想了解梯度裁剪的想法(按规范)的人:

每当梯度范数大于特定阈值时,我们都会修剪梯度范数,以使其保持在阈值之内。此阈值有时设置为5

令梯度为g,max_norm_threshold为j

现在,如果|| g || > j,我们这样做:

g =( j * g)/ || G ||

这是在 tf.clip_by_norm

For those who would like to understand the idea of gradient clipping (by norm):

Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.

Let the gradient be g and the max_norm_threshold be j.

Now, if ||g|| > j , we do:

g = ( j * g ) / ||g||

This is the implementation done in tf.clip_by_norm


回答 4

IMO最好的解决方案是用TF的估算器装饰器包装优化器tf.contrib.estimator.clip_gradients_by_norm

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

这样,您只需要定义一次,而不必在每次梯度计算后运行它。

文档:https : //www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm

IMO the best solution is wrapping your optimizer with TF’s estimator decorator tf.contrib.estimator.clip_gradients_by_norm:

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

This way you only have to define this once, and not run it after every gradients calculation.

Documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm


回答 5

梯度修剪基本上可以在梯度爆炸或消失的情况下起到帮助作用。说您的损失太高,将会导致指数梯度流经网络,可能导致Nan值。为了克服这个问题,我们将梯度裁剪在特定范围内(-1到1或根据条件的任何范围)。

clipped_value=tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars

其中grads _and_vars是渐变对(您可以通过tf.compute_gradients计算)及其变量。

裁剪后,我们只需使用优化器即可应用其值。 optimizer.apply_gradients(clipped_value)

Gradient Clipping basically helps in case of exploding or vanishing gradients.Say your loss is too high which will result in exponential gradients to flow through the network which may result in Nan values . To overcome this we clip gradients within a specific range (-1 to 1 or any range as per condition) .

clipped_value=tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars

where grads _and_vars are the pairs of gradients (which you calculate via tf.compute_gradients) and their variables they will be applied to.

After clipping we simply apply its value using an optimizer. optimizer.apply_gradients(clipped_value)


在TensorFlow中使用预训练的单词嵌入(word2vec或Glove)

问题:在TensorFlow中使用预训练的单词嵌入(word2vec或Glove)

我最近审查了卷积文本分类的一个有趣的实现。但是我检查过的所有TensorFlow代码都使用如下的随机(未经预训练)嵌入向量:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

有谁知道如何使用Word2vec或GloVe预训练词嵌入的结果,而不是随机的结果?

I’ve recently reviewed an interesting implementation for convolutional text classification. However all TensorFlow code I’ve reviewed uses a random (not pre-trained) embedding vectors like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?


回答 0

您可以通过多种方式在TensorFlow中使用预训练的嵌入。假设你有一个与NumPy阵列称为嵌入embedding,用vocab_size行和embedding_dim列,要创建一个张量W,可以在一个呼叫中使用tf.nn.embedding_lookup()

  1. 只需创建W一个tf.constant()是需要embedding为它的价值:

    W = tf.constant(embedding, name="W")

    这是最简单的方法,但是由于a的值tf.constant()多次存储在内存中,因此内存使用效率不高。由于embedding可能很大,因此您仅应将此方法用于玩具示例。

  2. 创建W为a tf.Variable并通过NumPy数组对其进行初始化tf.placeholder()

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    
    # ...
    sess = tf.Session()
    
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
    

    这样可以避免embedding在图表中存储的副本,但确实需要足够的内存才能一次在内存中保留矩阵的两个副本(一个用于NumPy数组,一个用于tf.Variable)。请注意,我假设您想在训练期间保持嵌入矩阵不变,因此W是使用创建的trainable=False

  3. 如果将嵌入训练为另一个TensorFlow模型的一部分,则可以使用tf.train.Saver从其他模型的检查点文件加载值。这意味着嵌入矩阵可以完全绕过Python。W按照选项2 创建,然后执行以下操作:

    W = tf.Variable(...)
    
    embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
    
    # ...
    sess = tf.Session()
    embedding_saver.restore(sess, "checkpoint_filename.ckpt")
    

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let’s say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

  1. Simply create W as a tf.constant() that takes embedding as its value:

    W = tf.constant(embedding, name="W")
    

    This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

  2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    
    # ...
    sess = tf.Session()
    
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
    

    This avoid storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I’ve assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

  3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model’s checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

    W = tf.Variable(...)
    
    embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
    
    # ...
    sess = tf.Session()
    embedding_saver.restore(sess, "checkpoint_filename.ckpt")
    

回答 1

我使用这种方法来加载和共享嵌入。

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

I use this method to load and share embedding.

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

回答 2

@mrry的答案不正确,因为它会导致覆盖每个运行网络的嵌入权重,因此,如果您采用小批量方法来训练网络,则将覆盖嵌入权重。因此,以我的观点,预训练嵌入的正确方法是:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix))

The answer of @mrry is not right because it provoques the overwriting of the embeddings weights each the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, on my point of view the right way to pre-trained embeddings is:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix))

回答 3

2.0兼容答案:有很多预训练的嵌入,这些嵌入是由Google开发的,并且是开源的。

其中一些是Universal Sentence Encoder (USE), ELMO, BERT,等等。在代码中重用它们非常容易。

重用代码Pre-Trained EmbeddingUniversal Sentence Encoder如下所示:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  #(3,128)

有关更多信息,请参见TF Hub Link,它是Google开发和开放源代码的预培训嵌入。

2.0 Compatible Answer: There are many Pre-Trained Embeddings, which are developed by Google and which have been Open Sourced.

Some of them are Universal Sentence Encoder (USE), ELMO, BERT, etc.. and it is very easy to reuse them in your code.

Code to reuse the Pre-Trained Embedding, Universal Sentence Encoder is shown below:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  #(3,128)

For more information the Pre-Trained Embeddings developed and open-sourced by Google, refer TF Hub Link.


回答 4

在Tensorflow版本2中,如果您使用Embedding层,则非常简单

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
                            embeddings_initializer=matrix_of_pretrained_weights
                            )(ur_inp)

With tensorflow version 2 its quite easy if you use the Embedding layer

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
                            embeddings_initializer=matrix_of_pretrained_weights
                            )(ur_inp)


回答 5

我也遇到嵌入问题,所以我写了有关数据集的详细教程。在这里我想补充一下我尝试过的方法也可以尝试这种方法,

import tensorflow as tf

tf.reset_default_graph()

input_x=tf.placeholder(tf.int32,shape=[None,None])

#you have to edit shape according to your embedding size


Word_embedding = tf.get_variable(name="W", shape=[400000,100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_loopup= tf.nn.embedding_lookup(Word_embedding,input_x)

with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for ii in final_:
            print(sess.run(embedding_loopup,feed_dict={input_x:[ii]}))

如果您想从头开始理解,请看这里工作详细的Tutorial Ipython示例

I was also facing embedding issue, So i wrote detailed tutorial with dataset. Here I would like to add what I tried You can also try this method,

import tensorflow as tf

tf.reset_default_graph()

input_x=tf.placeholder(tf.int32,shape=[None,None])

#you have to edit shape according to your embedding size


Word_embedding = tf.get_variable(name="W", shape=[400000,100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_loopup= tf.nn.embedding_lookup(Word_embedding,input_x)

with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for ii in final_:
            print(sess.run(embedding_loopup,feed_dict={input_x:[ii]}))

Here is working detailed Tutorial Ipython example if you want to understand from scratch , take a look .


如何在TensorFlow中添加正则化?

问题:如何在TensorFlow中添加正则化?

我在使用TensorFlow实现的许多可用神经网络代码中发现,正则化项通常是通过在损耗值上手动添加一个附加项来实现的。

我的问题是:

  1. 是否有比手动进行更优雅或推荐的正规化方法?

  2. 我也发现get_variable有一个论点regularizer。应该如何使用?根据我的观察,如果我们向其传递正则化器(例如tf.contrib.layers.l2_regularizer,将计算表示正则化项的张量并将其添加到名为的图集合中tf.GraphKeys.REGULARIZATOIN_LOSSES,该集合是否会被TensorFlow自动使用(例如,训练时由优化器使用)?期望我自己使用该收藏集吗?

I found in many available neural network code implemented using TensorFlow that regularization terms are often implemented by manually adding an additional term to loss value.

My questions are:

  1. Is there a more elegant or recommended way of regularization than doing it manually?

  2. I also find that get_variable has an argument regularizer. How should it be used? According to my observation, if we pass a regularizer to it (such as tf.contrib.layers.l2_regularizer, a tensor representing regularized term will be computed and added to a graph collection named tf.GraphKeys.REGULARIZATOIN_LOSSES. Will that collection be automatically used by TensorFlow (e.g. used by optimizers when training)? Or is it expected that I should use that collection by myself?


回答 0

如第二点regularizer所述,建议使用参数。您可以在中使用它get_variable,也可以在其中设置一次,variable_scope并对所有变量进行正则化。

损失收集在图中,您需要像这样将它们手动添加到成本函数中。

  reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
  reg_constant = 0.01  # Choose an appropriate one.
  loss = my_normal_loss + reg_constant * sum(reg_losses)

希望有帮助!

As you say in the second point, using the regularizer argument is the recommended way. You can use it in get_variable, or set it once in your variable_scope and have all your variables regularized.

The losses are collected in the graph, and you need to manually add them to your cost function like this.

  reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
  reg_constant = 0.01  # Choose an appropriate one.
  loss = my_normal_loss + reg_constant * sum(reg_losses)

Hope that helps!


回答 1

现有答案的几个方面对我来说还不是很清楚,所以这里是一个循序渐进的指南:

  1. 定义一个正则化器。在这里可以设置正则化常量,例如:

    regularizer = tf.contrib.layers.l2_regularizer(scale=0.1)
  2. 通过以下方式创建变量:

        weights = tf.get_variable(
            name="weights",
            regularizer=regularizer,
            ...
        )

    等效地,可以通过常规weights = tf.Variable(...)构造函数创建变量,然后通过tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, weights)

  3. 定义一些loss术语并添加正则化术语:

    reg_variables = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    reg_term = tf.contrib.layers.apply_regularization(regularizer, reg_variables)
    loss += reg_term

    注意:看起来tf.contrib.layers.apply_regularization是作为实现的AddN,所以大致等同于sum(reg_variables)

A few aspects of the existing answer were not immediately clear to me, so here is a step-by-step guide:

  1. Define a regularizer. This is where the regularization constant can be set, e.g.:

    regularizer = tf.contrib.layers.l2_regularizer(scale=0.1)
    
  2. Create variables via:

        weights = tf.get_variable(
            name="weights",
            regularizer=regularizer,
            ...
        )
    

    Equivalently, variables can be created via the regular weights = tf.Variable(...) constructor, followed by tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, weights).

  3. Define some loss term and add the regularization term:

    reg_variables = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    reg_term = tf.contrib.layers.apply_regularization(regularizer, reg_variables)
    loss += reg_term
    

    Note: It looks like tf.contrib.layers.apply_regularization is implemented as an AddN, so more or less equivalent to sum(reg_variables).


回答 2

由于找不到答案,我将提供一个简单正确的答案。您需要两个简单的步骤,其余步骤由tensorflow magic完成:

  1. 在创建变量或图层时添加正则化器:

    tf.layers.dense(x, kernel_regularizer=tf.contrib.layers.l2_regularizer(0.001))
    # or
    tf.get_variable('a', regularizer=tf.contrib.layers.l2_regularizer(0.001))
  2. 在定义损失时添加正则项:

    loss = ordinary_loss + tf.losses.get_regularization_loss()

I’ll provide a simple correct answer since I didn’t find one. You need two simple steps, the rest is done by tensorflow magic:

  1. Add regularizers when creating variables or layers:

    tf.layers.dense(x, kernel_regularizer=tf.contrib.layers.l2_regularizer(0.001))
    # or
    tf.get_variable('a', regularizer=tf.contrib.layers.l2_regularizer(0.001))
    
  2. Add the regularization term when defining loss:

    loss = ordinary_loss + tf.losses.get_regularization_loss()
    

回答 3

contrib.learn基于Tensorflow网站上的Deep MNIST教程,对该库执行此操作的另一种方法如下。首先,假设您已经导入了相关的库(例如import tensorflow.contrib.layers as layers),则可以使用单独的方法定义网络:

def easier_network(x, reg):
    """ A network based on tf.contrib.learn, with input `x`. """
    with tf.variable_scope('EasyNet'):
        out = layers.flatten(x)
        out = layers.fully_connected(out, 
                num_outputs=200,
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = tf.nn.tanh)
        out = layers.fully_connected(out, 
                num_outputs=200,
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = tf.nn.tanh)
        out = layers.fully_connected(out, 
                num_outputs=10, # Because there are ten digits!
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = None)
        return out 

然后,在主要方法中,您可以使用以下代码段:

def main(_):
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])

    # Make a network with regularization
    y_conv = easier_network(x, FLAGS.regu)
    weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'EasyNet') 
    print("")
    for w in weights:
        shp = w.get_shape().as_list()
        print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
    print("")
    reg_ws = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, 'EasyNet')
    for w in reg_ws:
        shp = w.get_shape().as_list()
        print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
    print("")

    # Make the loss function `loss_fn` with regularization.
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
    loss_fn = cross_entropy + tf.reduce_sum(reg_ws)
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss_fn)

为了使它起作用,您需要遵循我之前链接的MNIST教程并导入相关的库,但是学习TensorFlow是一个不错的练习,并且很容易看到正则化如何影响输出。如果将正则化用作参数,则可以看到以下内容:

- EasyNet/fully_connected/weights:0 shape:[784, 200] size:156800
- EasyNet/fully_connected/biases:0 shape:[200] size:200
- EasyNet/fully_connected_1/weights:0 shape:[200, 200] size:40000
- EasyNet/fully_connected_1/biases:0 shape:[200] size:200
- EasyNet/fully_connected_2/weights:0 shape:[200, 10] size:2000
- EasyNet/fully_connected_2/biases:0 shape:[10] size:10

- EasyNet/fully_connected/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_1/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_2/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0

请注意,基于可用项目,正则化部分为您提供了三项。

使用0、0.0001、0.01和1.0的正则化,我得到的测试精度值分别为0.9468、0.9476、0.9183和0.1135,显示了高正则项的危险。

Another option to do this with the contrib.learn library is as follows, based on the Deep MNIST tutorial on the Tensorflow website. First, assuming you’ve imported the relevant libraries (such as import tensorflow.contrib.layers as layers), you can define a network in a separate method:

def easier_network(x, reg):
    """ A network based on tf.contrib.learn, with input `x`. """
    with tf.variable_scope('EasyNet'):
        out = layers.flatten(x)
        out = layers.fully_connected(out, 
                num_outputs=200,
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = tf.nn.tanh)
        out = layers.fully_connected(out, 
                num_outputs=200,
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = tf.nn.tanh)
        out = layers.fully_connected(out, 
                num_outputs=10, # Because there are ten digits!
                weights_initializer = layers.xavier_initializer(uniform=True),
                weights_regularizer = layers.l2_regularizer(scale=reg),
                activation_fn = None)
        return out 

Then, in a main method, you can use the following code snippet:

def main(_):
    mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
    x = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])

    # Make a network with regularization
    y_conv = easier_network(x, FLAGS.regu)
    weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'EasyNet') 
    print("")
    for w in weights:
        shp = w.get_shape().as_list()
        print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
    print("")
    reg_ws = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, 'EasyNet')
    for w in reg_ws:
        shp = w.get_shape().as_list()
        print("- {} shape:{} size:{}".format(w.name, shp, np.prod(shp)))
    print("")

    # Make the loss function `loss_fn` with regularization.
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
    loss_fn = cross_entropy + tf.reduce_sum(reg_ws)
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss_fn)

To get this to work you need to follow the MNIST tutorial I linked to earlier and import the relevant libraries, but it’s a nice exercise to learn TensorFlow and it’s easy to see how the regularization affects the output. If you apply a regularization as an argument, you can see the following:

- EasyNet/fully_connected/weights:0 shape:[784, 200] size:156800
- EasyNet/fully_connected/biases:0 shape:[200] size:200
- EasyNet/fully_connected_1/weights:0 shape:[200, 200] size:40000
- EasyNet/fully_connected_1/biases:0 shape:[200] size:200
- EasyNet/fully_connected_2/weights:0 shape:[200, 10] size:2000
- EasyNet/fully_connected_2/biases:0 shape:[10] size:10

- EasyNet/fully_connected/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_1/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0
- EasyNet/fully_connected_2/kernel/Regularizer/l2_regularizer:0 shape:[] size:1.0

Notice that the regularization portion gives you three items, based on the items available.

With regularizations of 0, 0.0001, 0.01, and 1.0, I get test accuracy values of 0.9468, 0.9476, 0.9183, and 0.1135, respectively, showing the dangers of high regularization terms.


回答 4

如果有人还在寻找,我想在tf.keras中添加它,您可以通过将其作为参数传递到图层中来添加权重正则化。从Tensorflow Keras Tutorials站点批发获得的添加L2正则化的示例:

model = keras.models.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

据我所知,无需使用此方法手动添加正则化损失。

参考: https //www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_weight_regularization

If anyone’s still looking, I’d just like to add on that in tf.keras you may add weight regularization by passing them as arguments in your layers. An example of adding L2 regularization taken wholesale from the Tensorflow Keras Tutorials site:

model = keras.models.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

There’s no need to manually add in the regularization losses with this method as far as I know.

Reference: https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_weight_regularization


回答 5

我进行了测试tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)tf.losses.get_regularization_loss()l2_regularizer在图中使用了一个,发现它们返回相同的值。通过观察值的数量,我猜想reg_constant通过设置的参数已经对值有意义tf.contrib.layers.l2_regularizer

I tested tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) and tf.losses.get_regularization_loss() with one l2_regularizer in the graph, and found that they return the same value. By observing the value’s quantity, I guess reg_constant has already make sense on the value by setting the parameter of tf.contrib.layers.l2_regularizer.


回答 6

如果您有CNN,则可以执行以下操作:

在您的模型函数中:

conv = tf.layers.conv2d(inputs=input_layer,
                        filters=32,
                        kernel_size=[3, 3],
                        kernel_initializer='xavier',
                        kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-5),
                        padding="same",
                        activation=None) 
...

在损失函数中:

onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=num_classes)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
regularization_losses = tf.losses.get_regularization_losses()
loss = tf.add_n([loss] + regularization_losses)

If you have CNN you may do the following:

In your model function:

conv = tf.layers.conv2d(inputs=input_layer,
                        filters=32,
                        kernel_size=[3, 3],
                        kernel_initializer='xavier',
                        kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-5),
                        padding="same",
                        activation=None) 
...

In your loss function:

onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=num_classes)
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
regularization_losses = tf.losses.get_regularization_losses()
loss = tf.add_n([loss] + regularization_losses)

回答 7

一些答案让我更加困惑。在这里,我给出两种方法来使它变得清晰。

#1.adding all regs by hand
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.Variable(name='v2',initial_value=1.0,dtype=tf.float32)
regularizer = tf.contrib.layers.l1_regularizer(0.1)
reg_term = tf.contrib.layers.apply_regularization(regularizer,[var1,var2])
#here reg_term is a scalar

#2.auto added and read,but using get_variable
with tf.variable_scope('x',
        regularizer=tf.contrib.layers.l2_regularizer(0.1)):
    var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
    var2 = tf.get_variable(name='v2',shape=[1],dtype=tf.float32)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
#here reg_losses is a list,should be summed 

然后,可以将其添加到总损失中

Some answers make me more confused.Here I give two methods to make it clearly.

#1.adding all regs by hand
var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
var2 = tf.Variable(name='v2',initial_value=1.0,dtype=tf.float32)
regularizer = tf.contrib.layers.l1_regularizer(0.1)
reg_term = tf.contrib.layers.apply_regularization(regularizer,[var1,var2])
#here reg_term is a scalar

#2.auto added and read,but using get_variable
with tf.variable_scope('x',
        regularizer=tf.contrib.layers.l2_regularizer(0.1)):
    var1 = tf.get_variable(name='v1',shape=[1],dtype=tf.float32)
    var2 = tf.get_variable(name='v2',shape=[1],dtype=tf.float32)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
#here reg_losses is a list,should be summed 

Then,it can be added into the total loss


回答 8

cross_entropy = tf.losses.softmax_cross_entropy(
  logits=logits, onehot_labels=labels)

l2_loss = weight_decay * tf.add_n(
     [tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()])

loss = cross_entropy + l2_loss
cross_entropy = tf.losses.softmax_cross_entropy(
  logits=logits, onehot_labels=labels)

l2_loss = weight_decay * tf.add_n(
     [tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables()])

loss = cross_entropy + l2_loss

回答 9

tf.GraphKeys.REGULARIZATION_LOSSES 不会自动添加,但是有一种简单的添加方法:

reg_loss = tf.losses.get_regularization_loss()
total_loss = loss + reg_loss

tf.losses.get_regularization_loss()用于tf.add_ntf.GraphKeys.REGULARIZATION_LOSSES逐个元素的项求和。tf.GraphKeys.REGULARIZATION_LOSSES通常是使用正则化函数计算的标量列表。它从tf.get_variable具有regularizer指定参数的调用中获取条目。您也可以手动添加到该集合。这在使用时tf.Variable以及在指定活动正则器或其他自定义正则器时将很有用。例如:

#This will add an activity regularizer on y to the regloss collection
regularizer = tf.contrib.layers.l2_regularizer(0.1)
y = tf.nn.sigmoid(x)
act_reg = regularizer(y)
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, act_reg)

(在此示例中,对x进行正则可能会更有效,因为y对于大x而言确实变平了。)

tf.GraphKeys.REGULARIZATION_LOSSES will not be added automatically, but there is a simple way to add them:

reg_loss = tf.losses.get_regularization_loss()
total_loss = loss + reg_loss

tf.losses.get_regularization_loss() uses tf.add_n to sum the entries of tf.GraphKeys.REGULARIZATION_LOSSES element-wise. tf.GraphKeys.REGULARIZATION_LOSSES will typically be a list of scalars, calculated using regularizer functions. It gets entries from calls to tf.get_variable that have the regularizer parameter specified. You can also add to that collection manually. That would be useful when using tf.Variable and also when specifying activity regularizers or other custom regularizers. For instance:

#This will add an activity regularizer on y to the regloss collection
regularizer = tf.contrib.layers.l2_regularizer(0.1)
y = tf.nn.sigmoid(x)
act_reg = regularizer(y)
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, act_reg)

(In this example it would presumably be more effective to regularize x, as y really flattens out for large x.)


为什么我们需要在PyTorch中调用zero_grad()?

问题:为什么我们需要在PyTorch中调用zero_grad()?

zero_grad()训练期间需要调用该方法。但是文档不是很有帮助

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

为什么我们需要调用此方法?

The method zero_grad() needs to be called during training. But the documentation is not very helpful

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

Why do we need to call this method?


回答 0

在中PyTorch,我们需要在开始进行反向传播之前将梯度设置为零,因为PyTorch在随后的反向传递中累积梯度。在训练RNN时这很方便。因此,默认操作是在每次调用时累积(即求和)梯度loss.backward()

因此,理想情况下,当您开始训练循环时,应该zero out the gradients正确进行参数更新。否则,梯度将指向预期方向以外的其他方向,即朝向最小值(或最大化,如果达到最大化目标)。

这是一个简单的示例:

import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables 
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()

或者,如果您要进行香草梯度下降,则:

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables 
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data

注意:当张量上调用时,会发生梯度的累积(即求和)。.backward()loss

In PyTorch, we need to set the gradients to zero before starting to do backpropragation because PyTorch accumulates the gradients on subsequent backward passes. This is convenient while training RNNs. So, the default action is to accumulate (i.e. sum) the gradients on every loss.backward() call.

Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Else the gradient would point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).

Here is a simple example:

import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables 
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()

Alternatively, if you’re doing a vanilla gradient descent, then:

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables 
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data

Note: The accumulation (i.e. sum) of gradients happen when .backward() is called on the loss tensor.


回答 1

如果您使用渐变方法来减少错误(或损失),zero_grad()将重新启动循环而不会损失上一步

如果您不使用zero_grad(),那么损失将减少而不是按需增加

例如,如果您使用zero_grad(),则会发现以下输出:

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

如果不使用zero_grad(),则会发现以下输出:

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5

zero_grad() is restart looping without losses from last step if you use the gradient method for decreasing the error (or losses)

if you don’t use zero_grad() the loss will be decrease not increase as require

for example if you use zero_grad() you will find following output :

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

if you don’t use zero_grad() you will find following output :

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5

如何在PyTorch中初始化权重?

问题:如何在PyTorch中初始化权重?

如何在PyTorch中的网络中初始化权重和偏差(例如,使用He或Xavier初始化)?

How to initialize the weights and biases (for example, with He or Xavier initialization) in a network in PyTorch?


回答 0

单层

要初始化单层的权重,请使用中的函数torch.nn.init。例如:

conv1 = torch.nn.Conv2d(...)
torch.nn.init.xavier_uniform(conv1.weight)

或者,您可以通过写入conv1.weight.data(是torch.Tensor)来修改参数。例:

conv1.weight.data.fill_(0.01)

偏见也是如此:

conv1.bias.data.fill_(0.01)

nn.Sequential 或自定义 nn.Module

将初始化函数传递给torch.nn.Module.apply。它将以nn.Module递归方式初始化整个权重。

申请(FN):适用fn递归到每个子模块(通过返回的.children()),以及自我。典型的用法包括初始化模型的参数(另请参见torch-nn-init)。

例:

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

Single layer

To initialize the weights of a single layer, use a function from torch.nn.init. For instance:

conv1 = torch.nn.Conv2d(...)
torch.nn.init.xavier_uniform(conv1.weight)

Alternatively, you can modify the parameters by writing to conv1.weight.data (which is a torch.Tensor). Example:

conv1.weight.data.fill_(0.01)

The same applies for biases:

conv1.bias.data.fill_(0.01)

nn.Sequential or custom nn.Module

Pass an initialization function to torch.nn.Module.apply. It will initialize the weights in the entire nn.Module recursively.

apply(fn): Applies fn recursively to every submodule (as returned by .children()) as well as self. Typical use includes initializing the parameters of a model (see also torch-nn-init).

Example:

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

回答 1

我们使用相同的神经网络(NN)结构比较权重初始化的不同模式。

全零或全零

如果您遵循以下原则 Occam剃刀,您可能会认为将所有权重设置为0或1将是最佳解决方案。不是这种情况。

每个权重相同时,每一层的所有神经元都产生相同的输出。这使得很难决定要调整的权重。

    # initialize two NN's with 0 and 1 constant weights
    model_0 = Net(constant_weight=0)
    model_1 = Net(constant_weight=1)
  • 2个纪元后:

重量初始化为常数时的训练损失图

Validation Accuracy
9.625% -- All Zeros
10.050% -- All Ones
Training Loss
2.304  -- All Zeros
1552.281  -- All Ones

统一初始化

一个均匀分布具有从一组数字拾取任何数量的相等概率。

让我们看看神经网络使用统一权重初始化的训练效果如何,low=0.0以及high=1.0

下面,我们将看到另一种方法(在Net类代码中)以初始化网络的权重。要在模型定义之外定义权重,我们可以:

  1. 定义一个根据网络层类型分配权重的函数,然后
  2. 使用model.apply(fn),将这些权重应用于初始化的模型,后者将函数应用于每个模型层。
    # takes in a module and applies the specified weight initialization
    def weights_init_uniform(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # apply a uniform distribution to the weights and a bias=0
            m.weight.data.uniform_(0.0, 1.0)
            m.bias.data.fill_(0)

    model_uniform = Net()
    model_uniform.apply(weights_init_uniform)
  • 2个纪元后:

在此处输入图片说明

Validation Accuracy
36.667% -- Uniform Weights
Training Loss
3.208  -- Uniform Weights

设置权重的一般规则

在神经网络中设置权重的一般规则是将权重设置为接近零而又不会太小。

优良作法是在[-y,y]的范围内开始权重,其中y=1/sqrt(n)
(n是给定神经元的输入数量)。

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform_rule(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # get the number of the inputs
            n = m.in_features
            y = 1.0/np.sqrt(n)
            m.weight.data.uniform_(-y, y)
            m.bias.data.fill_(0)

    # create a new model with these weights
    model_rule = Net()
    model_rule.apply(weights_init_uniform_rule)

下面我们比较一下NN的性能,权重使用均匀分布[-0.5,0.5)初始化,权重使用通用规则初始化

  • 2个纪元后:

该图显示了权重的均匀初始化与初始化的一般规则的性能

Validation Accuracy
75.817% -- Centered Weights [-0.5, 0.5)
85.208% -- General Rule [-y, y)
Training Loss
0.705  -- Centered Weights [-0.5, 0.5)
0.469  -- General Rule [-y, y)

正态分布以初始化权重

正态分布的平均值应为0,标准差应为y=1/sqrt(n),其中n是NN的输入数量

    ## takes in a module and applies the specified weight initialization
    def weights_init_normal(m):
        '''Takes in a module and initializes all linear layers with weight
           values taken from a normal distribution.'''

        classname = m.__class__.__name__
        # for every Linear layer in a model
        if classname.find('Linear') != -1:
            y = m.in_features
        # m.weight.data shoud be taken from a normal distribution
            m.weight.data.normal_(0.0,1/np.sqrt(y))
        # m.bias.data should be 0
            m.bias.data.fill_(0)

下面我们展示了两种神经网络的性能,一种使用均匀分布初始化,另一种使用正态分布

  • 2个纪元后:

使用均态分布与正态分布的权重初始化的性能

Validation Accuracy
85.775% -- Uniform Rule [-y, y)
84.717% -- Normal Distribution
Training Loss
0.329  -- Uniform Rule [-y, y)
0.443  -- Normal Distribution

We compare different mode of weight-initialization using the same neural-network(NN) architecture.

All Zeros or Ones

If you follow the principle of Occam’s razor, you might think setting all the weights to 0 or 1 would be the best solution. This is not the case.

With every weight the same, all the neurons at each layer are producing the same output. This makes it hard to decide which weights to adjust.

    # initialize two NN's with 0 and 1 constant weights
    model_0 = Net(constant_weight=0)
    model_1 = Net(constant_weight=1)
  • After 2 epochs:

plot of training loss with weight initialization to constant

Validation Accuracy
9.625% -- All Zeros
10.050% -- All Ones
Training Loss
2.304  -- All Zeros
1552.281  -- All Ones

Uniform Initialization

A uniform distribution has the equal probability of picking any number from a set of numbers.

Let’s see how well the neural network trains using a uniform weight initialization, where low=0.0 and high=1.0.

Below, we’ll see another way (besides in the Net class code) to initialize the weights of a network. To define weights outside of the model definition, we can:

  1. Define a function that assigns weights by the type of network layer, then
  2. Apply those weights to an initialized model using model.apply(fn), which applies a function to each model layer.
    # takes in a module and applies the specified weight initialization
    def weights_init_uniform(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # apply a uniform distribution to the weights and a bias=0
            m.weight.data.uniform_(0.0, 1.0)
            m.bias.data.fill_(0)

    model_uniform = Net()
    model_uniform.apply(weights_init_uniform)
  • After 2 epochs:

enter image description here

Validation Accuracy
36.667% -- Uniform Weights
Training Loss
3.208  -- Uniform Weights

General rule for setting weights

The general rule for setting the weights in a neural network is to set them to be close to zero without being too small.

Good practice is to start your weights in the range of [-y, y] where y=1/sqrt(n)
(n is the number of inputs to a given neuron).

    # takes in a module and applies the specified weight initialization
    def weights_init_uniform_rule(m):
        classname = m.__class__.__name__
        # for every Linear layer in a model..
        if classname.find('Linear') != -1:
            # get the number of the inputs
            n = m.in_features
            y = 1.0/np.sqrt(n)
            m.weight.data.uniform_(-y, y)
            m.bias.data.fill_(0)

    # create a new model with these weights
    model_rule = Net()
    model_rule.apply(weights_init_uniform_rule)

below we compare performance of NN, weights initialized with uniform distribution [-0.5,0.5) versus the one whose weight is initialized using general rule

  • After 2 epochs:

plot showing performance of uniform initialization of weight versus general rule of initialization

Validation Accuracy
75.817% -- Centered Weights [-0.5, 0.5)
85.208% -- General Rule [-y, y)
Training Loss
0.705  -- Centered Weights [-0.5, 0.5)
0.469  -- General Rule [-y, y)

normal distribution to initialize the weights

The normal distribution should have a mean of 0 and a standard deviation of y=1/sqrt(n), where n is the number of inputs to NN

    ## takes in a module and applies the specified weight initialization
    def weights_init_normal(m):
        '''Takes in a module and initializes all linear layers with weight
           values taken from a normal distribution.'''

        classname = m.__class__.__name__
        # for every Linear layer in a model
        if classname.find('Linear') != -1:
            y = m.in_features
        # m.weight.data shoud be taken from a normal distribution
            m.weight.data.normal_(0.0,1/np.sqrt(y))
        # m.bias.data should be 0
            m.bias.data.fill_(0)

below we show the performance of two NN one initialized using uniform-distribution and the other using normal-distribution

  • After 2 epochs:

performance of weight initialization using uniform-distribution versus the normal distribution

Validation Accuracy
85.775% -- Uniform Rule [-y, y)
84.717% -- Normal Distribution
Training Loss
0.329  -- Uniform Rule [-y, y)
0.443  -- Normal Distribution

回答 2

要初始化图层,通常不需要执行任何操作。

PyTorch将为您做到。如果您考虑一下,这很有道理。当PyTorch可以按照最新趋势进行操作时,为什么还要初始化图层。

例如检查线性层

在该__init__方法中,它将调用Kaiming He的 init函数。

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(3))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

其他图层类型也是如此。例如在这里conv2d检查

注意:正确初始化的好处是更快的训练速度。如果您的问题需要特殊的初始化,则可以进行后续处理。

To initialize layers you typically don’t need to do anything.

PyTorch will do it for you. If you think about, this has lot of sense. Why should we initialize layers, when PyTorch can do that following the latest trends.

Check for instance the Linear layer.

In the __init__ method it will call Kaiming He init function.

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(3))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

The similar is for other layers types. For conv2d for instance check here.

To note : The gain of proper initialization is the faster training speed. If your problem deserves special initialization you can do it afterwords.


回答 3

    import torch.nn as nn        

    # a simple network
    rand_net = nn.Sequential(nn.Linear(in_features, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, 1),
                             nn.ReLU())

    # initialization function, first checks the module type,
    # then applies the desired changes to the weights
    def init_normal(m):
        if type(m) == nn.Linear:
            nn.init.uniform_(m.weight)

    # use the modules apply function to recursively apply the initialization
    rand_net.apply(init_normal)
    import torch.nn as nn        

    # a simple network
    rand_net = nn.Sequential(nn.Linear(in_features, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, h_size),
                             nn.BatchNorm1d(h_size),
                             nn.ReLU(),
                             nn.Linear(h_size, 1),
                             nn.ReLU())

    # initialization function, first checks the module type,
    # then applies the desired changes to the weights
    def init_normal(m):
        if type(m) == nn.Linear:
            nn.init.uniform_(m.weight)

    # use the modules apply function to recursively apply the initialization
    rand_net.apply(init_normal)

回答 4

抱歉这么晚,希望我的回答会有所帮助。

用a初始化权重 normal distribution使用:

torch.nn.init.normal_(tensor, mean=0, std=1)

或使用 constant distribution写:

torch.nn.init.constant_(tensor, value)

或使用 uniform distribution

torch.nn.init.uniform_(tensor, a=0, b=1) # a: lower_bound, b: upper_bound

您可以在此处检查其他方法来初始化张量

Sorry for being so late, I hope my answer will help.

To initialise weights with a normal distribution use:

torch.nn.init.normal_(tensor, mean=0, std=1)

Or to use a constant distribution write:

torch.nn.init.constant_(tensor, value)

Or to use an uniform distribution:

torch.nn.init.uniform_(tensor, a=0, b=1) # a: lower_bound, b: upper_bound

You can check other methods to initialise tensors here


回答 5

如果需要更多灵活性,也可以手动设置权重

假设您输入了所有内容:

import torch
import torch.nn as nn

input = torch.ones((8, 8))
print(input)
tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

而且您想制作一个没有偏差的致密层(因此我们可以可视化):

d = nn.Linear(8, 8, bias=False)

将所有权重设置为0.5(或其他任何值):

d.weight.data = torch.full((8, 8), 0.5)
print(d.weight.data)

重量:

Out[14]: 
tensor([[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

现在,您所有的权重均为0.5。通过以下方式传递数据:

d(input)
Out[13]: 
tensor([[4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.]], grad_fn=<MmBackward>)

请记住,每个神经元接收8个输入,所有这些输入的权重为0.5,值为1(且无偏差),因此每个总和为4。

If you want some extra flexibility, you can also set the weights manually.

Say you have input of all ones:

import torch
import torch.nn as nn

input = torch.ones((8, 8))
print(input)
tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

And you want to make a dense layer with no bias (so we can visualize):

d = nn.Linear(8, 8, bias=False)

Set all the weights to 0.5 (or anything else):

d.weight.data = torch.full((8, 8), 0.5)
print(d.weight.data)

The weights:

Out[14]: 
tensor([[0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

All your weights are now 0.5. Pass the data through:

d(input)
Out[13]: 
tensor([[4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.],
        [4., 4., 4., 4., 4., 4., 4., 4.]], grad_fn=<MmBackward>)

Remember that each neuron receives 8 inputs, all of which have weight 0.5 and value of 1 (and no bias), so it sums up to 4 for each.


回答 6

遍历参数

如果您不能使用apply例如模型没有Sequential直接实现:

所有人都一样

# see UNet at https://github.com/milesial/Pytorch-UNet/tree/master/unet


def init_all(model, init_func, *params, **kwargs):
    for p in model.parameters():
        init_func(p, *params, **kwargs)

model = UNet(3, 10)
init_all(model, torch.nn.init.normal_, mean=0., std=1) 
# or
init_all(model, torch.nn.init.constant_, 1.) 

根据形状

def init_all(model, init_funcs):
    for p in model.parameters():
        init_func = init_funcs.get(len(p.shape), init_funcs["default"])
        init_func(p)

model = UNet(3, 10)
init_funcs = {
    1: lambda x: torch.nn.init.normal_(x, mean=0., std=1.), # can be bias
    2: lambda x: torch.nn.init.xavier_normal_(x, gain=1.), # can be weight
    3: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv1D filter
    4: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv2D filter
    "default": lambda x: torch.nn.init.constant(x, 1.), # everything else
}

init_all(model, init_funcs)

您可以尝试torch.nn.init.constant_(x, len(x.shape))检查它们是否已正确初始化:

init_funcs = {
    "default": lambda x: torch.nn.init.constant_(x, len(x.shape))
}

Iterate over parameters

If you cannot use apply for instance if the model does not implement Sequential directly:

Same for all

# see UNet at https://github.com/milesial/Pytorch-UNet/tree/master/unet


def init_all(model, init_func, *params, **kwargs):
    for p in model.parameters():
        init_func(p, *params, **kwargs)

model = UNet(3, 10)
init_all(model, torch.nn.init.normal_, mean=0., std=1) 
# or
init_all(model, torch.nn.init.constant_, 1.) 

Depending on shape

def init_all(model, init_funcs):
    for p in model.parameters():
        init_func = init_funcs.get(len(p.shape), init_funcs["default"])
        init_func(p)

model = UNet(3, 10)
init_funcs = {
    1: lambda x: torch.nn.init.normal_(x, mean=0., std=1.), # can be bias
    2: lambda x: torch.nn.init.xavier_normal_(x, gain=1.), # can be weight
    3: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv1D filter
    4: lambda x: torch.nn.init.xavier_uniform_(x, gain=1.), # can be conv2D filter
    "default": lambda x: torch.nn.init.constant(x, 1.), # everything else
}

init_all(model, init_funcs)

You can try with torch.nn.init.constant_(x, len(x.shape)) to check that they are appropriately initialized:

init_funcs = {
    "default": lambda x: torch.nn.init.constant_(x, len(x.shape))
}

回答 7

如果看到弃用警告(@FábioPerez)…

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

If you see a deprecation warning (@Fábio Perez)…

def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

回答 8

因为到目前为止我还没有足够的名声,我不能在下面添加评论

prosti196月26日13:16发表的答案。

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(3))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

但我想指出的是,其实我们知道的纸一些假设开明他深入研究了整流器:上ImageNet分类超越人类水平的性能,是不恰当的,虽然它看起来像故意设计初始化方法,使一击实践。

例如,在“ 向后传播案例 ”小节中,他们假设$ w_l $和$ \ delta y_l $是彼此独立的。但是众所周知,以得分图$ \ delta y ^ L_i $为例,如果我们使用典型的交叉熵损失函数目标。

因此,我认为他的初始化工作正常的根本原因尚待阐明。因为每个人都见证了其加强深度学习培训的力量。

Cuz I haven’t had the enough reputation so far, I can’t add a comment under

the answer posted by prosti in Jun 26 ’19 at 13:16.

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight, a=math.sqrt(3))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

But I wanna point out that actually we know some assumptions in the paper of Kaiming He, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, are not appropriate, though it looks like the deliberately designed initialization method makes a hit in practice.

E.g., within the subsection of Backward Propagation Case, they assume that $w_l$ and $\delta y_l$ are independent of each other. But as we all known, take the score map $\delta y^L_i$ as an instance, it often is $y_i-softmax(y^L_i)=y_i-softmax(w^L_ix^L_i)$ if we use a typical cross entropy loss function objective.

So I think the true underlying reason why He’s Initialization works well remains to unravel. Cuz everyone has witnessed its power on boosting deep learning training.


Recommenders-推荐系统的最佳实战教程

Recommenders现在在PyPI上,可以使用pip安装!此外,还有大量的错误修复和实用程序改进

您可以在此处找到PyPI页面:https://pypi.org/project/ms-recommenders/

您可以在此处找到软件包文档:https://microsoft-recommenders.readthedocs.io/en/latest/

引言

此存储库包含构建推荐系统的示例和最佳实践,以Jupyter笔记本形式提供。这些示例详细介绍了我们在五项关键任务上的学习:

中提供了几个实用程序reco_utils以支持常见任务,例如以不同算法预期的格式加载数据集、评估模型输出以及拆分训练/测试数据。其中包括几种最先进算法的实现,以便在您自己的应用程序中进行自学和自定义。请参阅reco_utils documentation

有关存储库的更详细概述,请参阅wiki page

快速入门

请参阅setup guide有关在本地设置计算机的更多详细信息,请访问data science virtual machine (DSVM)或打开Azure Databricks

推荐器软件包的安装已经过Python版本3.6和3.7的测试。建议在干净的环境中安装软件包及其依赖项(例如condavenv)

要在本地计算机上进行设置,请执行以下操作:

要安装核心实用程序、基于CPU的算法和相关性,请执行以下操作:

  1. 确保安装了编译所需的软件。在Linux上,可以通过添加构建基本依赖项来支持这一点:

sudo apt-get install -y build-essential

在Windows上,您需要Microsoft C++ Build Tools

  1. 从以下位置安装软件包PyPI

pip install --upgrade pip
pip install ms-recommenders[examples]
  1. 向Jupyter注册您的(CONDA或虚拟)环境:

python -m ipykernel install --user --name my_environment_name --display-name "Python (reco)"
  1. 启动Jupyter笔记本服务器

jupyter notebook
  1. 运行SAR Python CPU MovieLens笔记本在下面00_quick_start文件夹。确保将内核更改为“Python(Reco)”

有关安装软件包的其他选项(支持图形处理器、电光等)看见this guide

注意事项-TheAlternating Least Squares (ALS)笔记本电脑需要PySpark环境才能运行。请按照中的步骤操作setup guide在PySpark环境中运行这些笔记本。对于深度学习算法,建议使用GPU机器,并遵循setup guide要设置NVIDIA库,请执行以下操作

算法

下表列出了存储库中当前可用的推荐算法。当有不同的实施可用时,笔记本将链接在环境列下

算法 环境 类型 描述
交替最小二乘(ALS) PySpark 协同过滤 大数据集中显式或隐式反馈的矩阵分解算法,电光MLLib针对可扩展性和分布式计算能力进行了优化
关注异步奇异值分解(A2SVD)* Python CPU / Python GPU 协同过滤 一种基于顺序的算法,旨在利用注意力机制捕获长期和短期用户偏好
CONAC/贝叶斯个性化排名(BPR) Python CPU 协同过滤 隐式反馈预测项目排序的矩阵分解算法
角/双边变分自动编码器(BiVAE) Python CPU / Python GPU 协同过滤 二元数据的生成模型(例如,用户-项目交互)
卷积序列嵌入推荐(CASER) Python CPU / Python GPU 协同过滤 基于卷积的算法,旨在捕获用户的一般偏好和序列模式
深度知识感知网络(DKN)* Python CPU / Python GPU 基于内容的过滤 包含知识图和文章嵌入的深度学习算法,可提供强大的新闻或文章推荐
极深因式分解机(XDeepFM)* Python CPU / Python GPU 混血儿 基于深度学习的具有用户/项目特征的隐式和显式反馈算法
FastAI嵌入点偏置(FAST) Python CPU / Python GPU 协同过滤 面向用户和项目的带嵌入和偏向的通用算法
LightFM/混合矩阵分解 Python CPU 混血儿 隐式和显式反馈的混合矩阵分解算法
LightGBM/渐变增强树* Python CPU/PySpark 基于内容的过滤 基于内容问题的快速训练和低内存占用的梯度Boosting树算法
LightGCN Python CPU / Python GPU 协同过滤 一种深度学习算法,简化了隐式反馈预测GCN的设计
GeoIMC* Python CPU 混血儿 矩阵补全算法,考虑了用户和项目的特点,使用黎曼共轭梯度优化,遵循几何方法
GRU4Rec Python CPU / Python GPU 协同过滤 基于序列的算法,旨在使用递归神经网络捕获长期和短期用户偏好
多项式VAE Python CPU / Python GPU 协同过滤 预测用户/项目交互的产生式模型
具有长期和短期用户表示的神经推荐(LSTUR)* Python CPU / Python GPU 基于内容的过滤 具有长期和短期用户兴趣建模的神经推荐算法
基于注意力多视图学习(NAML)的神经推荐* Python CPU / Python GPU 基于内容的过滤 具有注意力多视角学习的神经推荐算法
神经协同过滤(NCF) Python CPU / Python GPU 协同过滤 一种性能增强的隐式反馈深度学习算法
具有个性化注意的神经推荐(NPA)* Python CPU / Python GPU 基于内容的过滤 基于个性化注意力网络的神经推荐算法
基于多头自我注意(NRMS)的神经推荐* Python CPU / Python GPU 基于内容的过滤 一种多头自关注的神经推荐算法
下一项推荐(NextItNet) Python CPU / Python GPU 协同过滤 基于扩张卷积和残差网络的序列模式捕获算法
受限Boltzmann机器(RBM) Python CPU / Python GPU 协同过滤 基于神经网络的显式或隐式反馈潜在概率分布学习算法
黎曼低秩矩阵完成(RLRMC)* Python CPU 协同过滤 基于黎曼共轭梯度优化的小内存矩阵分解算法
简单推荐算法(SAR)* Python CPU 协同过滤 基于相似度的隐式反馈数据集算法
短期和长期偏好综合推荐(SLI-REC)* Python CPU / Python GPU 协同过滤 基于顺序的算法,旨在使用注意机制、时间感知控制器和内容感知控制器捕获长期和短期用户偏好
多兴趣感知的序贯用户建模(SUM)* Python CPU / Python GPU 协同过滤 一种旨在捕捉用户多兴趣的增强型记忆网络顺序用户模型
标准VAE Python CPU / Python GPU 协同过滤 预测用户/项目交互的产生式模型
奇异值分解(SVD) Python CPU 协同过滤 预测不是很大数据集中显式评级反馈的矩阵分解算法
术语频率-文档频率反转(TF-IDF) Python CPU 基于内容的过滤 一种简单的基于相似度的文本数据内容推荐算法
Vowpal Wabbit(大众)* Python CPU (online training) 基于内容的过滤 快速在线学习算法,非常适合用户功能/上下文不断变化的场景
宽而深 Python CPU / Python GPU 混血儿 一种可记忆特征交互和泛化用户特征的深度学习算法
xLearch/factorization Machine(FM)&场感知FM(FFM) Python CPU 混血儿 利用用户/项目特征预测标签的快速高效内存算法

注意事项*指示由Microsoft发明/贡献的算法

独立的或正在孵化的算法和实用程序是contrib文件夹。这将包含可能不容易放入核心存储库的贡献,或者需要时间来重构或成熟代码并添加必要的测试

算法 环境 类型 描述
SARplus* PySpark 协同过滤 电光合成孔径雷达的优化实现

初步比较

我们提供一个benchmark notebook以说明如何评估和比较不同的算法。在本笔记本中,使用分层拆分将MovieLens数据集按75/25的比率拆分成训练/测试集。使用下面的每个协同过滤算法来训练推荐模型。我们利用文献中报道的经验参数值。here在对指标进行排名时,我们使用k=10(前十大推荐项目)。我们在标准NC6S_v2上运行比较Azure DSVM(6个vCPU、112 GB内存和1个P100 GPU)。电光肌萎缩侧索硬化症在本地独立模式下运行。在此表中,我们显示了在Movielens 100k上运行15个时期的算法的结果

算法 地图 nDCG@k 精度@k Recall@k RMSE Mae R2个 解释的差异
ALS 0.004732 0.044239 0.048462 0.017796 0.965038 0.753001 0.255647 0.251648
BiVAE 0.146126 0.475077 0.411771 0.219145 不适用 不适用 不适用 不适用
BPR 0.132478 0.441997 0.388229 0.212522 不适用 不适用 不适用 不适用
FastAI 0.025503 0.147866 0.130329 0.053824 0.943084 0.744337 0.285308 0.287671
LightGCN 0.088526 0.419846 0.379626 0.144336 不适用 不适用 不适用 不适用
NCF 0.107720 0.396118 0.347296 0.180775 不适用 不适用 不适用 不适用
SAR 0.110591 0.382461 0.330753 0.176385 1.253805 1.048484 -0.569363 0.030474
SVD 0.012873 0.095930 0.091198 0.032783 0.938681 0.742690 0.291967 0.291971

行为规范

本项目坚持Microsoft’s Open Source Code of Conduct为了给所有人营造一个欢迎和鼓舞人心的社区

贡献

这个项目欢迎大家提供意见和建议。在投稿之前,请查看我们的contribution guidelines

生成状态

这些测试是夜间构建,用于计算冒烟和集成测试。main是我们的主要分支机构staging是我们的发展分部。我们使用pytest中测试python实用程序reco_utilspapermill对于notebooks有关测试管道的更多信息,请参阅test documentation

DSVM构建状态

以下测试每天在Linux DSVM上运行。这些机器全天候运转。

构建类型 分支机构 状态 分支机构 状态
Linux CPU 主干道 Build Status 试运行 Build Status
Linux GPU 主干道 Build Status 试运行 Build Status
LINUX电光 主干道 Build Status 试运行 Build Status

相关项目

参考文献

  • A.Argyriou、M.González-Fierro和L.Zhang,“Microsoft推荐器:可投入生产的推荐系统的最佳实践(Best Practices for Production-Ready Recommendation Systems)”,万维网2020:台北国际万维网大会,2020年。在线提供:https://dl.acm.org/doi/abs/10.1145/3366424.3382692
  • 张立,吴涛,谢,A.Argyriou,M.González-Fierro,J.Lian,“建立规模化的可生产推荐系统”,ACM SIGKDD 2019年知识发现和数据挖掘大会(KDD 2019),2019年
  • S.Graham,J.K.Min,T.Wu,“微软推荐器:加速开发推荐器系统的工具”,RecSys‘19:第13届ACM推荐系统会议论文集,2019年。在线提供:https://dl.acm.org/doi/10.1145/3298689.3346967

Stanford-tensorflow-tutorials-此存储库包含斯坦福课程的代码示例

License
Join the https://gitter.im/stanford-tensorflow-tutorials

斯坦福-TensorFlow-教程

此存储库包含课程CS 20:深度学习研究的TensorFlow的代码示例。
它将随着课程的进展而更新。
详细的教学大纲和课堂讲稿可以在这里找到。here
对于本课程,我使用python3.6和TensorFlow 1.4.1

前一年课程的代码和备注请参见文件夹2017和网站https://web.stanford.edu/class/cs20si/2017

有关安装说明和依赖项列表,请参阅此存储库的安装文件夹

tf.nn.embedding_lookup函数有什么作用?

问题:tf.nn.embedding_lookup函数有什么作用?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)

我不了解此功能的职责。像查找表吗?用哪种方法返回每个ID对应的参数(以ID为单位)?

例如,在skip-gram模型中,如果使用tf.nn.embedding_lookup(embeddings, train_inputs),则为每个train_input找到对应的嵌入?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)

I cannot understand the duty of this function. Is it like a lookup table? Which means to return the parameters corresponding to each id (in ids)?

For instance, in the skip-gram model if we use tf.nn.embedding_lookup(embeddings, train_inputs), then for each train_input it finds the correspond embedding?


回答 0

embedding_lookup函数检索params张量的行。该行为类似于对numpy中的数组使用索引。例如

matrix = np.random.random([1024, 64])  # 64-dimensional embeddings
ids = np.array([0, 5, 17, 33])
print matrix[ids]  # prints a matrix of shape [4, 64] 

params参数也可以是张量的列表,在这种情况下,ids将在张量之间分配。例如,给定的3张量列表[2, 64],默认行为是,他们将代表ids[0, 3][1, 4][2, 5]

partition_strategy控制ids列表中的分布方式。当矩阵可能太大而无法合为一体时,分区对于较大规模的问题很有用。

embedding_lookup function retrieves rows of the params tensor. The behavior is similar to using indexing with arrays in numpy. E.g.

matrix = np.random.random([1024, 64])  # 64-dimensional embeddings
ids = np.array([0, 5, 17, 33])
print matrix[ids]  # prints a matrix of shape [4, 64] 

params argument can be also a list of tensors in which case the ids will be distributed among the tensors. For example, given a list of 3 tensors [2, 64], the default behavior is that they will represent ids: [0, 3], [1, 4], [2, 5].

partition_strategy controls the way how the ids are distributed among the list. The partitioning is useful for larger scale problems when the matrix might be too large to keep in one piece.


回答 1

是的,在您明白这一点之前,很难理解此功能。

最简单的形式类似于tf.gather。它params根据所指定的索引返回的元素ids

例如(假设您在里面tf.InteractiveSession()

params = tf.constant([10,20,30,40])
ids = tf.constant([0,1,2,3])
print tf.nn.embedding_lookup(params,ids).eval()

将返回[10 20 30 40],因为params的第一个元素(索引0)为,params 10的第二个元素(索引1)为20,依此类推。

同样,

params = tf.constant([10,20,30,40])
ids = tf.constant([1,1,3])
print tf.nn.embedding_lookup(params,ids).eval()

会回来的[20 20 40]

embedding_lookup比这更。该params参数可以是张量列表,而不是单个张量。

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

在这种情况下,ids根据分区策略,在中指定的索引对应于张量的元素,其中默认分区策略为’mod’。

在’mod’策略中,索引0对应于列表中第一个张量的第一个元素。索引1对应于第二张量的第一元素。索引2对应于第三张量的第一个元素,依此类推。假设params是张量的列表,对于所有索引,简单地index 对应第(i + 1)张量的第一个元素。i0..(n-1)n

现在,索引n不能对应于张量n + 1,因为列表params仅包含n张量。因此index n对应于第一个张量的第二个元素。类似地,index n+1对应于第二张量的第二个元素,依此类推。

因此,在代码中

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

下标0对应于第一个张量的第一个元素:1

索引1对应于第二张量的第一个元素:10

索引2对应于第一个张量的第二个元素:2

索引3对应于第二张量的第二个元素:20

因此,结果将是:

[ 2  1  2 10  2 20]

Yes, this function is hard to understand, until you get the point.

In its simplest form, it is similar to tf.gather. It returns the elements of params according to the indexes specified by ids.

For example (assuming you are inside tf.InteractiveSession())

params = tf.constant([10,20,30,40])
ids = tf.constant([0,1,2,3])
print tf.nn.embedding_lookup(params,ids).eval()

would return [10 20 30 40], because the first element (index 0) of params is 10, the second element of params (index 1) is 20, etc.

Similarly,

params = tf.constant([10,20,30,40])
ids = tf.constant([1,1,3])
print tf.nn.embedding_lookup(params,ids).eval()

would return [20 20 40].

But embedding_lookup is more than that. The params argument can be a list of tensors, rather than a single tensor.

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

In such a case, the indexes, specified in ids, correspond to elements of tensors according to a partition strategy, where the default partition strategy is ‘mod’.

In the ‘mod’ strategy, index 0 corresponds to the first element of the first tensor in the list. Index 1 corresponds to the first element of the second tensor. Index 2 corresponds to the first element of the third tensor, and so on. Simply index i corresponds to the first element of the (i+1)th tensor , for all the indexes 0..(n-1), assuming params is a list of n tensors.

Now, index n cannot correspond to tensor n+1, because the list params contains only n tensors. So index n corresponds to the second element of the first tensor. Similarly, index n+1 corresponds to the second element of the second tensor, etc.

So, in the code

params1 = tf.constant([1,2])
params2 = tf.constant([10,20])
ids = tf.constant([2,0,2,1,2,3])
result = tf.nn.embedding_lookup([params1, params2], ids)

index 0 corresponds to the first element of the first tensor: 1

index 1 corresponds to the first element of the second tensor: 10

index 2 corresponds to the second element of the first tensor: 2

index 3 corresponds to the second element of the second tensor: 20

Thus, the result would be:

[ 2  1  2 10  2 20]

回答 2

是的,该tf.nn.embedding_lookup()函数的目的是在嵌入矩阵中执行查找并返回单词的嵌入(或简单地说是矢量表示)。

一个简单的嵌入矩阵(形状vocabulary_size x embedding_dimension:)如下所示。(即每个单词将由一个数字向量表示;因此,名称为word2vec


嵌入矩阵

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
like 0.36808 0.20834 -0.22319 0.046283 0.20098 0.27515 -0.77127 -0.76804
between 0.7503 0.71623 -0.27033 0.20059 -0.17008 0.68568 -0.061672 -0.054638
did 0.042523 -0.21172 0.044739 -0.19248 0.26224 0.0043991 -0.88195 0.55184
just 0.17698 0.065221 0.28548 -0.4243 0.7499 -0.14892 -0.66786 0.11788
national -1.1105 0.94945 -0.17078 0.93037 -0.2477 -0.70633 -0.8649 -0.56118
day 0.11626 0.53897 -0.39514 -0.26027 0.57706 -0.79198 -0.88374 0.30119
country -0.13531 0.15485 -0.07309 0.034013 -0.054457 -0.20541 -0.60086 -0.22407
under 0.13721 -0.295 -0.05916 -0.59235 0.02301 0.21884 -0.34254 -0.70213
such 0.61012 0.33512 -0.53499 0.36139 -0.39866 0.70627 -0.18699 -0.77246
second -0.29809 0.28069 0.087102 0.54455 0.70003 0.44778 -0.72565 0.62309 

我分裂上述嵌入基质并装载仅vocab,这将是我们的词汇并在相应的向量emb阵列。

vocab = ['the','like','between','did','just','national','day','country','under','such','second']

emb = np.array([[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862],
   [0.36808, 0.20834, -0.22319, 0.046283, 0.20098, 0.27515, -0.77127, -0.76804],
   [0.7503, 0.71623, -0.27033, 0.20059, -0.17008, 0.68568, -0.061672, -0.054638],
   [0.042523, -0.21172, 0.044739, -0.19248, 0.26224, 0.0043991, -0.88195, 0.55184],
   [0.17698, 0.065221, 0.28548, -0.4243, 0.7499, -0.14892, -0.66786, 0.11788],
   [-1.1105, 0.94945, -0.17078, 0.93037, -0.2477, -0.70633, -0.8649, -0.56118],
   [0.11626, 0.53897, -0.39514, -0.26027, 0.57706, -0.79198, -0.88374, 0.30119],
   [-0.13531, 0.15485, -0.07309, 0.034013, -0.054457, -0.20541, -0.60086, -0.22407],
   [ 0.13721, -0.295, -0.05916, -0.59235, 0.02301, 0.21884, -0.34254, -0.70213],
   [ 0.61012, 0.33512, -0.53499, 0.36139, -0.39866, 0.70627, -0.18699, -0.77246 ],
   [ -0.29809, 0.28069, 0.087102, 0.54455, 0.70003, 0.44778, -0.72565, 0.62309 ]])


emb.shape
# (11, 8)

在TensorFlow中嵌入查找

现在,我们将看到如何对某些任意输入语句执行嵌入查找

In [54]: from collections import OrderedDict

# embedding as TF tensor (for now constant; could be tf.Variable() during training)
In [55]: tf_embedding = tf.constant(emb, dtype=tf.float32)

# input for which we need the embedding
In [56]: input_str = "like the country"

# build index based on our `vocabulary`
In [57]: word_to_idx = OrderedDict({w:vocab.index(w) for w in input_str.split() if w in vocab})

# lookup in embedding matrix & return the vectors for the input words
In [58]: tf.nn.embedding_lookup(tf_embedding, list(word_to_idx.values())).eval()
Out[58]: 
array([[ 0.36807999,  0.20834   , -0.22318999,  0.046283  ,  0.20097999,
         0.27515   , -0.77126998, -0.76804   ],
       [ 0.41800001,  0.24968   , -0.41242   ,  0.1217    ,  0.34527001,
        -0.044457  , -0.49687999, -0.17862   ],
       [-0.13530999,  0.15485001, -0.07309   ,  0.034013  , -0.054457  ,
        -0.20541   , -0.60086   , -0.22407   ]], dtype=float32)

注意我们是怎么得到的嵌入使用从我们原来的嵌入矩阵(文字)的话指数在我们的词汇。

通常,此类嵌入查找是由第一层(称为“ 嵌入层”)执行的,然后将这些嵌入传递到RNN / LSTM / GRU层以进行进一步处理。


旁注:通常,词汇表还将具有特殊unk标记。因此,如果词汇表中不存在来自我们输入句子的标记,则将unk在嵌入矩阵中查找与之相对应的索引。


PS注意,embedding_dimension是一个超参数是一个具有调整他们的应用程序,但受欢迎的车型,如Word2Vec手套使用300维向量表示每个字。

奖励阅读 word2vec跳过语法模型

Yes, the purpose of tf.nn.embedding_lookup() function is to perform a lookup in the embedding matrix and return the embeddings (or in simple terms the vector representation) of words.

A simple embedding matrix (of shape: vocabulary_size x embedding_dimension) would look like below. (i.e. each word will be represented by a vector of numbers; hence the name word2vec)


Embedding Matrix

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862
like 0.36808 0.20834 -0.22319 0.046283 0.20098 0.27515 -0.77127 -0.76804
between 0.7503 0.71623 -0.27033 0.20059 -0.17008 0.68568 -0.061672 -0.054638
did 0.042523 -0.21172 0.044739 -0.19248 0.26224 0.0043991 -0.88195 0.55184
just 0.17698 0.065221 0.28548 -0.4243 0.7499 -0.14892 -0.66786 0.11788
national -1.1105 0.94945 -0.17078 0.93037 -0.2477 -0.70633 -0.8649 -0.56118
day 0.11626 0.53897 -0.39514 -0.26027 0.57706 -0.79198 -0.88374 0.30119
country -0.13531 0.15485 -0.07309 0.034013 -0.054457 -0.20541 -0.60086 -0.22407
under 0.13721 -0.295 -0.05916 -0.59235 0.02301 0.21884 -0.34254 -0.70213
such 0.61012 0.33512 -0.53499 0.36139 -0.39866 0.70627 -0.18699 -0.77246
second -0.29809 0.28069 0.087102 0.54455 0.70003 0.44778 -0.72565 0.62309 

I split the above embedding matrix and loaded only the words in vocab which will be our vocabulary and the corresponding vectors in emb array.

vocab = ['the','like','between','did','just','national','day','country','under','such','second']

emb = np.array([[0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862],
   [0.36808, 0.20834, -0.22319, 0.046283, 0.20098, 0.27515, -0.77127, -0.76804],
   [0.7503, 0.71623, -0.27033, 0.20059, -0.17008, 0.68568, -0.061672, -0.054638],
   [0.042523, -0.21172, 0.044739, -0.19248, 0.26224, 0.0043991, -0.88195, 0.55184],
   [0.17698, 0.065221, 0.28548, -0.4243, 0.7499, -0.14892, -0.66786, 0.11788],
   [-1.1105, 0.94945, -0.17078, 0.93037, -0.2477, -0.70633, -0.8649, -0.56118],
   [0.11626, 0.53897, -0.39514, -0.26027, 0.57706, -0.79198, -0.88374, 0.30119],
   [-0.13531, 0.15485, -0.07309, 0.034013, -0.054457, -0.20541, -0.60086, -0.22407],
   [ 0.13721, -0.295, -0.05916, -0.59235, 0.02301, 0.21884, -0.34254, -0.70213],
   [ 0.61012, 0.33512, -0.53499, 0.36139, -0.39866, 0.70627, -0.18699, -0.77246 ],
   [ -0.29809, 0.28069, 0.087102, 0.54455, 0.70003, 0.44778, -0.72565, 0.62309 ]])


emb.shape
# (11, 8)

Embedding Lookup in TensorFlow

Now we will see how can we perform embedding lookup for some arbitrary input sentence.

In [54]: from collections import OrderedDict

# embedding as TF tensor (for now constant; could be tf.Variable() during training)
In [55]: tf_embedding = tf.constant(emb, dtype=tf.float32)

# input for which we need the embedding
In [56]: input_str = "like the country"

# build index based on our `vocabulary`
In [57]: word_to_idx = OrderedDict({w:vocab.index(w) for w in input_str.split() if w in vocab})

# lookup in embedding matrix & return the vectors for the input words
In [58]: tf.nn.embedding_lookup(tf_embedding, list(word_to_idx.values())).eval()
Out[58]: 
array([[ 0.36807999,  0.20834   , -0.22318999,  0.046283  ,  0.20097999,
         0.27515   , -0.77126998, -0.76804   ],
       [ 0.41800001,  0.24968   , -0.41242   ,  0.1217    ,  0.34527001,
        -0.044457  , -0.49687999, -0.17862   ],
       [-0.13530999,  0.15485001, -0.07309   ,  0.034013  , -0.054457  ,
        -0.20541   , -0.60086   , -0.22407   ]], dtype=float32)

Observe how we got the embeddings from our original embedding matrix (with words) using the indices of words in our vocabulary.

Usually, such an embedding lookup is performed by the first layer (called Embedding layer) which then passes these embeddings to RNN/LSTM/GRU layers for further processing.


Side Note: Usually the vocabulary will also have a special unk token. So, if a token from our input sentence is not present in our vocabulary, then the index corresponding to unk will be looked up in the embedding matrix.


P.S. Note that embedding_dimension is a hyperparameter that one has to tune for their application but popular models like Word2Vec and GloVe uses 300 dimension vector for representing each word.

Bonus Reading word2vec skip-gram model


回答 3

这是描述嵌入查找过程的图像。

图片:嵌入查找过程

简而言之,它获取由ID列表指定的嵌入层的相应行,并将其作为张量提供。它是通过以下过程实现的。

  1. 定义一个占位符 lookup_ids = tf.placeholder([10])
  2. 定义嵌入层 embeddings = tf.Variable([100,10],...)
  3. 定义张量流操作 embed_lookup = tf.embedding_lookup(embeddings, lookup_ids)
  4. 通过运行获取结果 lookup = session.run(embed_lookup, feed_dict={lookup_ids:[95,4,14]})

Here’s an image depicting the process of embedding lookup.

Image: Embedding lookup process

Concisely, it gets the corresponding rows of a embedding layer, specified by a list of IDs and provide that as a tensor. It is achieved through the following process.

  1. Define a placeholder lookup_ids = tf.placeholder([10])
  2. Define a embedding layer embeddings = tf.Variable([100,10],...)
  3. Define the tensorflow operation embed_lookup = tf.embedding_lookup(embeddings, lookup_ids)
  4. Get the results by running lookup = session.run(embed_lookup, feed_dict={lookup_ids:[95,4,14]})

回答 4

当参数张量为高维时,id仅指最大维。也许对大多数人来说这很明显,但是我必须运行以下代码来理解这一点:

embeddings = tf.constant([[[1,1],[2,2],[3,3],[4,4]],[[11,11],[12,12],[13,13],[14,14]],
                          [[21,21],[22,22],[23,23],[24,24]]])
ids=tf.constant([0,2,1])
embed = tf.nn.embedding_lookup(embeddings, ids, partition_strategy='div')

with tf.Session() as session:
    result = session.run(embed)
    print (result)

只是尝试“ div”策略,对于一个张量,这没有什么区别。

这是输出:

[[[ 1  1]
  [ 2  2]
  [ 3  3]
  [ 4  4]]

 [[21 21]
  [22 22]
  [23 23]
  [24 24]]

 [[11 11]
  [12 12]
  [13 13]
  [14 14]]]

When the params tensor is in high dimensions, the ids only refers to top dimension. Maybe it’s obvious to most of people but I have to run the following code to understand that:

embeddings = tf.constant([[[1,1],[2,2],[3,3],[4,4]],[[11,11],[12,12],[13,13],[14,14]],
                          [[21,21],[22,22],[23,23],[24,24]]])
ids=tf.constant([0,2,1])
embed = tf.nn.embedding_lookup(embeddings, ids, partition_strategy='div')

with tf.Session() as session:
    result = session.run(embed)
    print (result)

Just trying the ‘div’ strategy and for one tensor, it makes no difference.

Here is the output:

[[[ 1  1]
  [ 2  2]
  [ 3  3]
  [ 4  4]]

 [[21 21]
  [22 22]
  [23 23]
  [24 24]]

 [[11 11]
  [12 12]
  [13 13]
  [14 14]]]

回答 5

另一种查看方式是,假设您将张量展平为一维数组,然后执行查找。

(例如)Tensor0 = [1,2,3],Tensor1 = [4,5,6],Tensor2 = [7,8,9]

展平的张量将如下[1,4,7,2,5,8,3,6,9]

现在,当您执行[0,3,4,1,7]的查找时,将会产生[1,2,5,4,6]

(i,e)例如,如果lookup值为7,而我们有3个张量(或具有3行的张量),

7/3 :(提醒为1,商为2)因此将显示Tensor1的第二个元素,即6

Another way to look at it is , assume that you flatten out the tensors to one dimensional array, and then you are performing a lookup

(eg) Tensor0=[1,2,3], Tensor1=[4,5,6], Tensor2=[7,8,9]

The flattened out tensor will be as follows [1,4,7,2,5,8,3,6,9]

Now when you do a lookup of [0,3,4,1,7] it will yeild [1,2,5,4,6]

(i,e) if lookup value is 7 for example , and we have 3 tensors (or a tensor with 3 rows) then,

7 / 3 : (Reminder is 1, Quotient is 2) So 2nd element of Tensor1 will be shown, which is 6


回答 6

由于我也对此功能感兴趣,因此我将给我两分钱。

我在2D情况下看到它的方式就像矩阵乘法(很容易推广到其他维度)。

考虑一个带有N个符号的词汇表。然后,您可以将符号x表示为尺寸为Nx1的矢量,并进行一次热编码。

但是,您不希望将此符号表示为Nx1的矢量,而是表示为尺寸为Mx1的y

因此,要将x转换为y,可以使用和嵌入尺寸为MxN的矩阵E

y = E x

本质上,这就是tf.nn.embedding_lookup(params,ids,…)所做的事情,细微的差别是ids只是一个数字,代表1在热编码矢量x中的位置1 。

Since I was also intrigued by this function, I’ll give my two cents.

The way I see it in the 2D case is just as a matrix multiplication (it’s easy to generalize to other dimensions).

Consider a vocabulary with N symbols. Then, you can represent a symbol x as a vector of dimensions Nx1, one-hot-encoded.

But you want a representation of this symbol not as a vector of Nx1, but as one with dimensions Mx1, called y.

So, to transform x into y, you can use and embedding matrix E, with dimensions MxN:

y = E x.

This is essentially what tf.nn.embedding_lookup(params, ids, …) is doing, with the nuance that ids are just one number that represents the position of the 1 in the one-hot-encoded vector x.


回答 7

添加到Asher Stern的答案中, params被解释为大嵌入张量的划分。它可以是表示完整嵌入张量的单个张量,也可以是X形张量的列表,除了第一维以外,它们均具有相同的形状,表示分片嵌入张量。

tf.nn.embedding_lookup考虑到嵌入(参数)会很大这一事实来编写函数。因此我们需要partition_strategy

Adding to Asher Stern’s answer, params is interpreted as a partitioning of a large embedding tensor. It can be a single tensor representing the complete embedding tensor, or a list of X tensors all of same shape except for the first dimension, representing sharded embedding tensors.

The function tf.nn.embedding_lookup is written considering the fact that embedding (params) will be large. Therefore we need partition_strategy.


Keras,如何获得每一层的输出?

问题:Keras,如何获得每一层的输出?

我已经使用CNN训练了二进制分类模型,这是我的代码

model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
# (16, 16, 32)
model.add(Convolution2D(nb_filters*2, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters*2, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
# (8, 8, 64) = (2048)
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(2))  # define a binary classification problem
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          verbose=1,
          validation_data=(x_test, y_test))

在这里,我想像TensorFlow一样获得每一层的输出,我该怎么做?

I have trained a binary classification model with CNN, and here is my code

model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
# (16, 16, 32)
model.add(Convolution2D(nb_filters*2, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters*2, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
# (8, 8, 64) = (2048)
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(2))  # define a binary classification problem
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          nb_epoch=nb_epoch,
          verbose=1,
          validation_data=(x_test, y_test))

And here, I wanna get the output of each layer just like TensorFlow, how can I do that?


回答 0

您可以使用以下命令轻松获取任何图层的输出: model.layers[index].output

对于所有图层,请使用以下命令:

from keras import backend as K

inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functors = [K.function([inp, K.learning_phase()], [out]) for out in outputs]    # evaluation functions

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = [func([test, 1.]) for func in functors]
print layer_outs

注:为了模拟差使用learning_phase1.layer_outs以其它方式使用0.

编辑:(基于评论)

K.function 创建theano / tensorflow张量函数,该函数随后用于从给定输入的符号图中获取输出。

现在K.learning_phase()需要作为输入,因为很多Keras层(如Dropout / Batchnomalization)都依赖于它,以在训练和测试期间更改行为。

因此,如果您删除代码中的辍学层,则可以简单地使用:

from keras import backend as K

inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functors = [K.function([inp], [out]) for out in outputs]    # evaluation functions

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = [func([test]) for func in functors]
print layer_outs

编辑2:更优化

我只是意识到,先前的答案并不是针对每个函数评估进行了优化,因为数据将被传输到CPU-> GPU内存中,并且还需要对低层进行n-n-over的张量计算。

相反,这是一种更好的方法,因为您不需要多个函数,而只需一个函数即可为您提供所有输出的列表:

from keras import backend as K

inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functor = K.function([inp, K.learning_phase()], outputs )   # evaluation function

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = functor([test, 1.])
print layer_outs

You can easily get the outputs of any layer by using: model.layers[index].output

For all layers use this:

from keras import backend as K

inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functors = [K.function([inp, K.learning_phase()], [out]) for out in outputs]    # evaluation functions

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = [func([test, 1.]) for func in functors]
print layer_outs

Note: To simulate Dropout use learning_phase as 1. in layer_outs otherwise use 0.

Edit: (based on comments)

K.function creates theano/tensorflow tensor functions which is later used to get the output from the symbolic graph given the input.

Now K.learning_phase() is required as an input as many Keras layers like Dropout/Batchnomalization depend on it to change behavior during training and test time.

So if you remove the dropout layer in your code you can simply use:

from keras import backend as K

inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functors = [K.function([inp], [out]) for out in outputs]    # evaluation functions

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = [func([test]) for func in functors]
print layer_outs

Edit 2: More optimized

I just realized that the previous answer is not that optimized as for each function evaluation the data will be transferred CPU->GPU memory and also the tensor calculations needs to be done for the lower layers over-n-over.

Instead this is a much better way as you don’t need multiple functions but a single function giving you the list of all outputs:

from keras import backend as K

inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functor = K.function([inp, K.learning_phase()], outputs )   # evaluation function

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = functor([test, 1.])
print layer_outs

回答 1

https://keras.io/getting-started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer

一种简单的方法是创建一个新模型,该模型将输出您感兴趣的图层:

from keras.models import Model

model = ...  # include here your original model

layer_name = 'my_layer'
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(data)

或者,您可以构建Keras函数,该函数将在给定特定输入的情况下返回特定图层的输出,例如:

from keras import backend as K

# with a Sequential model
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[3].output])
layer_output = get_3rd_layer_output([x])[0]

From https://keras.io/getting-started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer

One simple way is to create a new Model that will output the layers that you are interested in:

from keras.models import Model

model = ...  # include here your original model

layer_name = 'my_layer'
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(data)

Alternatively, you can build a Keras function that will return the output of a certain layer given a certain input, for example:

from keras import backend as K

# with a Sequential model
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[3].output])
layer_output = get_3rd_layer_output([x])[0]

回答 2

基于此线程的所有良好答案,我编写了一个库来获取每一层的输出。它抽象了所有复杂性,并被设计为尽可能易于使用:

https://github.com/philipperemy/keract

它处理几乎所有边缘情况

希望能帮助到你!

Based on all the good answers of this thread, I wrote a library to fetch the output of each layer. It abstracts all the complexity and has been designed to be as user-friendly as possible:

https://github.com/philipperemy/keract

It handles almost all the edge cases

Hope it helps!


回答 3

以下对我来说看起来很简单:

model.layers[idx].output

上面是张量对象,因此您可以使用可应用于张量对象的操作对其进行修改。

例如,获得形状 model.layers[idx].output.get_shape()

idx 是图层的索引,您可以从中找到它 model.summary()

Following looks very simple to me:

model.layers[idx].output

Above is a tensor object, so you can modify it using operations that can be applied to a tensor object.

For example, to get the shape model.layers[idx].output.get_shape()

idx is the index of the layer and you can find it from model.summary()


回答 4

我为自己编写了此函数(在Jupyter中),它的灵感来自indraforyou的回答。它将自动绘制所有图层输出。您的图像必须具有(x,y,1)形状,其中1代表1个通道。您只需调用plot_layer_outputs(…)即可进行绘制。

%matplotlib inline
import matplotlib.pyplot as plt
from keras import backend as K

def get_layer_outputs():
    test_image = YOUR IMAGE GOES HERE!!!
    outputs    = [layer.output for layer in model.layers]          # all layer outputs
    comp_graph = [K.function([model.input]+ [K.learning_phase()], [output]) for output in outputs]  # evaluation functions

    # Testing
    layer_outputs_list = [op([test_image, 1.]) for op in comp_graph]
    layer_outputs = []

    for layer_output in layer_outputs_list:
        print(layer_output[0][0].shape, end='\n-------------------\n')
        layer_outputs.append(layer_output[0][0])

    return layer_outputs

def plot_layer_outputs(layer_number):    
    layer_outputs = get_layer_outputs()

    x_max = layer_outputs[layer_number].shape[0]
    y_max = layer_outputs[layer_number].shape[1]
    n     = layer_outputs[layer_number].shape[2]

    L = []
    for i in range(n):
        L.append(np.zeros((x_max, y_max)))

    for i in range(n):
        for x in range(x_max):
            for y in range(y_max):
                L[i][x][y] = layer_outputs[layer_number][x][y][i]


    for img in L:
        plt.figure()
        plt.imshow(img, interpolation='nearest')

I wrote this function for myself (in Jupyter) and it was inspired by indraforyou‘s answer. It will plot all the layer outputs automatically. Your images must have a (x, y, 1) shape where 1 stands for 1 channel. You just call plot_layer_outputs(…) to plot.

%matplotlib inline
import matplotlib.pyplot as plt
from keras import backend as K

def get_layer_outputs():
    test_image = YOUR IMAGE GOES HERE!!!
    outputs    = [layer.output for layer in model.layers]          # all layer outputs
    comp_graph = [K.function([model.input]+ [K.learning_phase()], [output]) for output in outputs]  # evaluation functions

    # Testing
    layer_outputs_list = [op([test_image, 1.]) for op in comp_graph]
    layer_outputs = []

    for layer_output in layer_outputs_list:
        print(layer_output[0][0].shape, end='\n-------------------\n')
        layer_outputs.append(layer_output[0][0])

    return layer_outputs

def plot_layer_outputs(layer_number):    
    layer_outputs = get_layer_outputs()

    x_max = layer_outputs[layer_number].shape[0]
    y_max = layer_outputs[layer_number].shape[1]
    n     = layer_outputs[layer_number].shape[2]

    L = []
    for i in range(n):
        L.append(np.zeros((x_max, y_max)))

    for i in range(n):
        for x in range(x_max):
            for y in range(y_max):
                L[i][x][y] = layer_outputs[layer_number][x][y][i]


    for img in L:
        plt.figure()
        plt.imshow(img, interpolation='nearest')

回答 5

来自:https : //github.com/philipperemy/keras-visualize-activations/blob/master/read_activations.py

import keras.backend as K

def get_activations(model, model_inputs, print_shape_only=False, layer_name=None):
    print('----- activations -----')
    activations = []
    inp = model.input

    model_multi_inputs_cond = True
    if not isinstance(inp, list):
        # only one input! let's wrap it in a list.
        inp = [inp]
        model_multi_inputs_cond = False

    outputs = [layer.output for layer in model.layers if
               layer.name == layer_name or layer_name is None]  # all layer outputs

    funcs = [K.function(inp + [K.learning_phase()], [out]) for out in outputs]  # evaluation functions

    if model_multi_inputs_cond:
        list_inputs = []
        list_inputs.extend(model_inputs)
        list_inputs.append(0.)
    else:
        list_inputs = [model_inputs, 0.]

    # Learning phase. 0 = Test mode (no dropout or batch normalization)
    # layer_outputs = [func([model_inputs, 0.])[0] for func in funcs]
    layer_outputs = [func(list_inputs)[0] for func in funcs]
    for layer_activations in layer_outputs:
        activations.append(layer_activations)
        if print_shape_only:
            print(layer_activations.shape)
        else:
            print(layer_activations)
    return activations

From: https://github.com/philipperemy/keras-visualize-activations/blob/master/read_activations.py

import keras.backend as K

def get_activations(model, model_inputs, print_shape_only=False, layer_name=None):
    print('----- activations -----')
    activations = []
    inp = model.input

    model_multi_inputs_cond = True
    if not isinstance(inp, list):
        # only one input! let's wrap it in a list.
        inp = [inp]
        model_multi_inputs_cond = False

    outputs = [layer.output for layer in model.layers if
               layer.name == layer_name or layer_name is None]  # all layer outputs

    funcs = [K.function(inp + [K.learning_phase()], [out]) for out in outputs]  # evaluation functions

    if model_multi_inputs_cond:
        list_inputs = []
        list_inputs.extend(model_inputs)
        list_inputs.append(0.)
    else:
        list_inputs = [model_inputs, 0.]

    # Learning phase. 0 = Test mode (no dropout or batch normalization)
    # layer_outputs = [func([model_inputs, 0.])[0] for func in funcs]
    layer_outputs = [func(list_inputs)[0] for func in funcs]
    for layer_activations in layer_outputs:
        activations.append(layer_activations)
        if print_shape_only:
            print(layer_activations.shape)
        else:
            print(layer_activations)
    return activations

回答 6

想要将其添加为@indraforyou的答案作为注释(但没有足够高的声望)以纠正@mathtick的注释中提到的问题。为了避免InvalidArgumentError: input_X:Y is both fed and fetched.异常,只需更换行outputs = [layer.output for layer in model.layers]outputs = [layer.output for layer in model.layers][1:],即

调整indraforyou的最小工作示例:

from keras import backend as K 
inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers][1:]        # all layer outputs except first (input) layer
functor = K.function([inp, K.learning_phase()], outputs )   # evaluation function

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = functor([test, 1.])
print layer_outs

PS我尝试的东西,如尝试outputs = [layer.output for layer in model.layers[1:]]不起作用。

Wanted to add this as a comment (but don’t have high enough rep.) to @indraforyou’s answer to correct for the issue mentioned in @mathtick’s comment. To avoid the InvalidArgumentError: input_X:Y is both fed and fetched. exception, simply replace the line outputs = [layer.output for layer in model.layers] with outputs = [layer.output for layer in model.layers][1:], i.e.

adapting indraforyou’s minimal working example:

from keras import backend as K 
inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers][1:]        # all layer outputs except first (input) layer
functor = K.function([inp, K.learning_phase()], outputs )   # evaluation function

# Testing
test = np.random.random(input_shape)[np.newaxis,...]
layer_outs = functor([test, 1.])
print layer_outs

p.s. my attempts trying things such as outputs = [layer.output for layer in model.layers[1:]] did not work.


回答 7

假设您有:

1- Keras训练有素model

2-输入x为图像或图像集。图像的分辨率应与输入层的尺寸兼容。例如对于3通道(RGB)图像为80 * 80 * 3

3- layer要激活的输出的名称。例如,“ flatten_2”层。这应该包含在layer_names变量中,代表给定层的名称model

4- batch_size是可选参数。

然后,您可以轻松地使用get_activation函数来获得layer给定输入x和预训练的输出激活model

import six
import numpy as np
import keras.backend as k
from numpy import float32
def get_activations(x, model, layer, batch_size=128):
"""
Return the output of the specified layer for input `x`. `layer` is specified by layer index (between 0 and
`nb_layers - 1`) or by name. The number of layers can be determined by counting the results returned by
calling `layer_names`.
:param x: Input for computing the activations.
:type x: `np.ndarray`. Example: x.shape = (80, 80, 3)
:param model: pre-trained Keras model. Including weights.
:type model: keras.engine.sequential.Sequential. Example: model.input_shape = (None, 80, 80, 3)
:param layer: Layer for computing the activations
:type layer: `int` or `str`. Example: layer = 'flatten_2'
:param batch_size: Size of batches.
:type batch_size: `int`
:return: The output of `layer`, where the first dimension is the batch size corresponding to `x`.
:rtype: `np.ndarray`. Example: activations.shape = (1, 2000)
"""

    layer_names = [layer.name for layer in model.layers]
    if isinstance(layer, six.string_types):
        if layer not in layer_names:
            raise ValueError('Layer name %s is not part of the graph.' % layer)
        layer_name = layer
    elif isinstance(layer, int):
        if layer < 0 or layer >= len(layer_names):
            raise ValueError('Layer index %d is outside of range (0 to %d included).'
                             % (layer, len(layer_names) - 1))
        layer_name = layer_names[layer]
    else:
        raise TypeError('Layer must be of type `str` or `int`.')

    layer_output = model.get_layer(layer_name).output
    layer_input = model.input
    output_func = k.function([layer_input], [layer_output])

    # Apply preprocessing
    if x.shape == k.int_shape(model.input)[1:]:
        x_preproc = np.expand_dims(x, 0)
    else:
        x_preproc = x
    assert len(x_preproc.shape) == 4

    # Determine shape of expected output and prepare array
    output_shape = output_func([x_preproc[0][None, ...]])[0].shape
    activations = np.zeros((x_preproc.shape[0],) + output_shape[1:], dtype=float32)

    # Get activations with batching
    for batch_index in range(int(np.ceil(x_preproc.shape[0] / float(batch_size)))):
        begin, end = batch_index * batch_size, min((batch_index + 1) * batch_size, x_preproc.shape[0])
        activations[begin:end] = output_func([x_preproc[begin:end]])[0]

    return activations

Assuming you have:

1- Keras pre-trained model.

2- Input x as image or set of images. The resolution of image should be compatible with dimension of the input layer. For example 80*80*3 for 3-channels (RGB) image.

3- The name of the output layer to get the activation. For example, “flatten_2” layer. This should be include in the layer_names variable, represents name of layers of the given model.

4- batch_size is an optional argument.

Then you can easily use get_activation function to get the activation of the output layer for a given input x and pre-trained model:

import six
import numpy as np
import keras.backend as k
from numpy import float32
def get_activations(x, model, layer, batch_size=128):
"""
Return the output of the specified layer for input `x`. `layer` is specified by layer index (between 0 and
`nb_layers - 1`) or by name. The number of layers can be determined by counting the results returned by
calling `layer_names`.
:param x: Input for computing the activations.
:type x: `np.ndarray`. Example: x.shape = (80, 80, 3)
:param model: pre-trained Keras model. Including weights.
:type model: keras.engine.sequential.Sequential. Example: model.input_shape = (None, 80, 80, 3)
:param layer: Layer for computing the activations
:type layer: `int` or `str`. Example: layer = 'flatten_2'
:param batch_size: Size of batches.
:type batch_size: `int`
:return: The output of `layer`, where the first dimension is the batch size corresponding to `x`.
:rtype: `np.ndarray`. Example: activations.shape = (1, 2000)
"""

    layer_names = [layer.name for layer in model.layers]
    if isinstance(layer, six.string_types):
        if layer not in layer_names:
            raise ValueError('Layer name %s is not part of the graph.' % layer)
        layer_name = layer
    elif isinstance(layer, int):
        if layer < 0 or layer >= len(layer_names):
            raise ValueError('Layer index %d is outside of range (0 to %d included).'
                             % (layer, len(layer_names) - 1))
        layer_name = layer_names[layer]
    else:
        raise TypeError('Layer must be of type `str` or `int`.')

    layer_output = model.get_layer(layer_name).output
    layer_input = model.input
    output_func = k.function([layer_input], [layer_output])

    # Apply preprocessing
    if x.shape == k.int_shape(model.input)[1:]:
        x_preproc = np.expand_dims(x, 0)
    else:
        x_preproc = x
    assert len(x_preproc.shape) == 4

    # Determine shape of expected output and prepare array
    output_shape = output_func([x_preproc[0][None, ...]])[0].shape
    activations = np.zeros((x_preproc.shape[0],) + output_shape[1:], dtype=float32)

    # Get activations with batching
    for batch_index in range(int(np.ceil(x_preproc.shape[0] / float(batch_size)))):
        begin, end = batch_index * batch_size, min((batch_index + 1) * batch_size, x_preproc.shape[0])
        activations[begin:end] = output_func([x_preproc[begin:end]])[0]

    return activations

回答 8

如果您具有以下情况之一:

  • 错误: InvalidArgumentError: input_X:Y is both fed and fetched
  • 多输入的情况

您需要进行以下更改:

  • outputs变量中的输入层添加过滤器
  • 最小functors循环变化

最小示例:

from keras.engine.input_layer import InputLayer
inp = model.input
outputs = [layer.output for layer in model.layers if not isinstance(layer, InputLayer)]
functors = [K.function(inp + [K.learning_phase()], [x]) for x in outputs]
layer_outputs = [fun([x1, x2, xn, 1]) for fun in functors]

In case you have one of the following cases:

  • error: InvalidArgumentError: input_X:Y is both fed and fetched
  • case of multiple inputs

You need to do the following changes:

  • add filter out for input layers in outputs variable
  • minnor change on functors loop

Minimum example:

from keras.engine.input_layer import InputLayer
inp = model.input
outputs = [layer.output for layer in model.layers if not isinstance(layer, InputLayer)]
functors = [K.function(inp + [K.learning_phase()], [x]) for x in outputs]
layer_outputs = [fun([x1, x2, xn, 1]) for fun in functors]

回答 9

好吧,其他答案也很完整,但是有一种非常基本的方法可以“看到”形状,而不是“获得”形状。

只是做一个model.summary()。它将打印所有图层及其输出形状。“无”值将指示可变尺寸,而第一维将是批量大小。

Well, other answers are very complete, but there is a very basic way to “see”, not to “get” the shapes.

Just do a model.summary(). It will print all layers and their output shapes. “None” values will indicate variable dimensions, and the first dimension will be the batch size.