I am using TensorFlow to train a neural network. This is how I am initializing the GradientDescentOptimizer:

init = tf.initialize_all_variables()
sess = tf.Session()

mse        = tf.reduce_mean(tf.square(out - out_))
train_step = tf.train.GradientDescentOptimizer(0.3).minimize(mse)

The problem is that I don't know how to set an update rule or a decay value for the learning rate.

How can I use an adaptive learning rate here?

Answer 0



First of all, tf.train.GradientDescentOptimizer is designed to use a constant learning rate for all variables in all steps. TensorFlow also provides out-of-the-box adaptive optimizers including the tf.train.AdagradOptimizer and the tf.train.AdamOptimizer, and these can be used as drop-in replacements.

However, if you want to control the learning rate with otherwise-vanilla gradient descent, you can take advantage of the fact that the learning_rate argument to the tf.train.GradientDescentOptimizer constructor can be a Tensor object. This allows you to compute a different value for the learning rate in each step, for example:

learning_rate = tf.placeholder(tf.float32, shape=[])
# ...
train_step = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(mse)

sess = tf.Session()

# Feed different values for learning rate to each training step.
sess.run(train_step, feed_dict={learning_rate: 0.1})
sess.run(train_step, feed_dict={learning_rate: 0.1})
sess.run(train_step, feed_dict={learning_rate: 0.01})
sess.run(train_step, feed_dict={learning_rate: 0.01})

Alternatively, you could create a scalar tf.Variable that holds the learning rate, and assign it each time you want to change the learning rate.
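As a rough sketch of how the fed values might be computed, the helper below produces a simple step-decay schedule in plain Python. The function name step_decay and its parameters (drop, steps_per_drop) are hypothetical and not part of TensorFlow; the defaults are chosen only so that the result matches the 0.1, 0.1, 0.01, 0.01 values fed above.

```python
def step_decay(step, base_lr=0.1, drop=0.1, steps_per_drop=2):
    """Return the learning rate for a 0-based training step.

    The rate is multiplied by `drop` once every `steps_per_drop` steps.
    All names and defaults here are illustrative assumptions.
    """
    return base_lr * drop ** (step // steps_per_drop)


# Learning rates for the first four training steps.
schedule = [step_decay(step) for step in range(4)]

# Each value would then be fed to the placeholder, for example:
#   sess.run(train_step, feed_dict={learning_rate: schedule[step]})
```

Keeping the rule in Python means any update rule can be used, at the cost of feeding one extra scalar per step.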

Answer 1



TensorFlow provides an op to automatically apply an exponential decay to a learning rate tensor: tf.train.exponential_decay. For an example of it in use, see this line in the MNIST convolutional model example. Then use @mrry’s suggestion above to supply the resulting tensor as the learning_rate parameter to your optimizer of choice.

The key excerpt to look at is:

# Optimizer: set up a variable that's incremented once per batch and
# controls the learning rate decay.
batch = tf.Variable(0)

learning_rate = tf.train.exponential_decay(
  0.01,                # Base learning rate.
  batch * BATCH_SIZE,  # Current index into the dataset.
  train_size,          # Decay step.
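  0.95,                # Decay rate.
  staircase=True)
# Use simple momentum for the optimization.
# (staircase=True, the 0.9 momentum and the `loss` tensor follow the MNIST example.)
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(loss,
                                                     global_step=batch)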

Note the global_step=batch parameter to minimize. That tells the optimizer to helpfully increment the ‘batch’ parameter for you every time it trains.
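To make the decay concrete, here is a plain-Python sketch of the formula that tf.train.exponential_decay is documented to compute: learning_rate * decay_rate ^ (global_step / decay_steps), with integer division when staircase is enabled. The BATCH_SIZE and train_size values below are made-up numbers for illustration, not the ones from the MNIST example.

```python
def exponential_decay(base_lr, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Plain-Python version of the exponential decay formula."""
    if staircase:
        # Integer division: the rate drops in discrete jumps.
        exponent = global_step // decay_steps
    else:
        # True division: the rate decays smoothly every step.
        exponent = global_step / decay_steps
    return base_lr * decay_rate ** exponent


# Hypothetical sizes, chosen only to keep the arithmetic easy to follow.
BATCH_SIZE = 100
train_size = 1000

# Rate after 5, 10 and 25 batches with staircase decay (one drop per epoch).
lr_after_5 = exponential_decay(0.01, 5 * BATCH_SIZE, train_size, 0.95, staircase=True)
lr_after_10 = exponential_decay(0.01, 10 * BATCH_SIZE, train_size, 0.95, staircase=True)
lr_after_25 = exponential_decay(0.01, 25 * BATCH_SIZE, train_size, 0.95, staircase=True)
```

With these numbers the rate stays at 0.01 for the first epoch, then is multiplied by 0.95 each time another train_size examples have been processed.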

Answer 2




The gradient descent algorithm uses a constant learning rate, which you provide during initialization. You can pass various learning rates in the way shown by mrry.

Instead, you can also use more advanced optimizers, which converge faster and adapt to the situation.

Here is a brief explanation based on my understanding:

  • Momentum helps SGD navigate along the relevant directions and softens the oscillations in the irrelevant ones. It simply adds a fraction of the previous step's direction to the current step. This amplifies speed in the correct direction and dampens oscillation in the wrong directions. This fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum: at the beginning of learning a big momentum will only hinder your progress, so it makes sense to use something like 0.01, and once all the high gradients have disappeared you can use a bigger momentum. There is one problem with momentum: when we are very close to the goal, our momentum is in most cases very high and it does not know that it should slow down. This can cause it to miss the minimum or oscillate around it.
  • Nesterov accelerated gradient (NAG) overcomes this problem by starting to slow down early. In momentum we first compute the gradient and then make a jump in that direction, amplified by whatever momentum we had previously. NAG does the same thing but in another order: first we make a big jump based on our stored information, and then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
  • AdaGrad, or adaptive gradient, allows the learning rate to adapt per parameter. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point the learning rate is so small that the system stops learning.
  • AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate is calculated approximately as one divided by the square root of the sum of squared gradients. At each stage you add another squared gradient to the sum, which causes the denominator to constantly increase. In AdaDelta, instead of summing all past squared gradients, it uses a sliding window, which allows the sum to decrease. RMSprop is very similar to AdaDelta.
  • Adam, or adaptive moment estimation, is an algorithm similar to AdaDelta. But in addition to storing learning rates for each of the parameters, it also stores momentum changes for each of them separately.

    A few visualizations: [two images, not preserved]
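The difference between AdaGrad's ever-growing accumulator and a decaying ("sliding window") average can be sketched for a single scalar parameter in plain Python. This is a simplified illustration under my own assumptions, not the TensorFlow implementation: the function names are hypothetical, the decaying average is the RMSprop-style one, and AdaDelta's extra numerator term is left out.

```python
import math


def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Effective step size lr / sqrt(sum of squared gradients) per step."""
    total = 0.0
    sizes = []
    for g in grads:
        total += g * g  # The accumulator can only grow.
        sizes.append(lr / (math.sqrt(total) + eps))
    return sizes


def rmsprop_step_sizes(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Effective step size using a decaying average of squared gradients."""
    avg = 0.0
    sizes = []
    for g in grads:
        avg = decay * avg + (1 - decay) * g * g  # Old gradients fade out.
        sizes.append(lr / (math.sqrt(avg) + eps))
    return sizes


# Large gradients early in training, small gradients later.
grads = [1.0] * 50 + [0.1] * 50
adagrad = adagrad_step_sizes(grads)
rmsprop = rmsprop_step_sizes(grads)
```

In this toy run the AdaGrad step size never increases, while the decaying average lets the step size recover once the gradients become small.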

Answer 3


From the official TensorFlow docs:

global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                       100000, 0.96, staircase=True)

# Passing global_step to minimize() will increment it at each step.
learning_step = (
    tf.train.GradientDescentOptimizer(learning_rate)
    .minimize(...my loss..., global_step=global_step)
)

Answer 4

If you want to set specific learning rates for intervals of epochs like 0 < a < b < c < ..., then you can define your learning rate as a conditional tensor, conditional on the global step, and feed this as normal to the optimiser.

You could achieve this with a bunch of nested tf.cond statements, but it's easier to build the tensor recursively:

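def make_learning_rate_tensor(reduction_steps, learning_rates, global_step):
    assert len(reduction_steps) + 1 == len(learning_rates)
    if len(reduction_steps) == 1:
        # Base case: a single boundary separates two learning rates.
        return tf.cond(
            global_step < reduction_steps[0],
            lambda: learning_rates[0],
            lambda: learning_rates[1]
        )
    else:
        # Recursive case: handle the first boundary, recurse on the rest.
        return tf.cond(
            global_step < reduction_steps[0],
            lambda: learning_rates[0],
            lambda: make_learning_rate_tensor(
                reduction_steps[1:],
                learning_rates[1:],
                global_step
            )
        )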

Then to use it you need to know how many training steps there are in a single epoch, so that we can use the global step to switch at the right time, and finally define the epochs and learning rates you want. So if I want the learning rates [0.1, 0.01, 0.001, 0.0001] during the epoch intervals of [0, 19], [20, 59], [60, 99], [100, ∞) respectively, I would do:

global_step = tf.train.get_or_create_global_step()
learning_rates = [0.1, 0.01, 0.001, 0.0001]
steps_per_epoch = 225
epochs_to_switch_at = [20, 60, 100]
# Convert epoch boundaries to global-step boundaries.
steps_to_switch_at = [x * steps_per_epoch for x in epochs_to_switch_at]
learning_rate = make_learning_rate_tensor(steps_to_switch_at, learning_rates, global_step)
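
To sanity-check which rate applies at which step, the same piecewise-constant rule can be mirrored in plain Python. This is only a sketch: piecewise_learning_rate and boundaries are hypothetical names, and it uses the same "step < boundary" comparison as the tf.cond version. TensorFlow 1.x also ships tf.train.piecewise_constant for this kind of schedule, though its handling of a step that falls exactly on a boundary may differ, so check the docs before swapping it in.

```python
from bisect import bisect_right


def piecewise_learning_rate(step, boundaries, rates):
    """Return the rate for `step`, switching once step >= each boundary."""
    assert len(boundaries) + 1 == len(rates)
    # bisect_right counts how many boundaries are <= step.
    return rates[bisect_right(boundaries, step)]


learning_rates = [0.1, 0.01, 0.001, 0.0001]
steps_per_epoch = 225
# Epochs 20, 60 and 100 expressed as global steps: 4500, 13500, 22500.
boundaries = [epoch * steps_per_epoch for epoch in [20, 60, 100]]

rate_at_start = piecewise_learning_rate(0, boundaries, learning_rates)
rate_last_step_epoch_19 = piecewise_learning_rate(4499, boundaries, learning_rates)
rate_first_step_epoch_20 = piecewise_learning_rate(4500, boundaries, learning_rates)
```

Here the last step of epoch 19 still uses 0.1, and the first step of epoch 20 switches to 0.01.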
