标签归档:conv-neural-network

批量归一化和辍学订购?

问题:批量归一化和辍学订购?

最初的问题是关于TensorFlow实现的。但是,答案是针对一般的实现。这个通用答案也是TensorFlow的正确答案。

在TensorFlow中使用批量归一化和辍学时(特别是使用contrib.layers),我需要担心订购吗?

如果我在退出后立即使用批处理规范化,则可能会出现问题。例如,如果批量归一化训练中的偏移量训练输出的比例尺数字较大,但是将相同的偏移量应用到较小的比例尺数字(由于补偿了更多的输出),而在测试过程中没有丢失,则轮班可能关闭。TensorFlow批处理规范化层会自动对此进行补偿吗?还是由于某种原因我不会想念这件事吗?

另外,将两者一起使用时还有其他陷阱吗?例如,假设我使用他们以正确的顺序在问候上述(假设有一个正确的顺序),可以存在与使用分批正常化和漏失在多个连续层烦恼?我没有立即看到问题,但是我可能会丢失一些东西。

非常感谢!

更新:

实验测试似乎表明排序确实很重要。我在相同的网络上运行了两次,但批次标准和退出均相反。当辍学在批处理规范之前时,验证损失似乎会随着培训损失的减少而增加。在另一种情况下,它们都下降了。但就我而言,运动缓慢,因此在接受更多培训后情况可能会发生变化,这只是一次测试。一个更加明确和明智的答案仍然会受到赞赏。

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?

It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I’m missing?

Also, are there other pitfalls to look out for in when using these two together? For example, assuming I’m using them in the correct order in regards to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don’t immediately see a problem with that, but I might be missing something.

Thank you much!

UPDATE:

An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reverse. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down. They’re both going down in the other case. But in my case the movements are slow, so things may change after more training and it’s just a single test. A more definitive and informed answer would still be appreciated.


回答 0

在《Ioffe and Szegedy 2015》中,作者指出“我们希望确保对于任何参数值,网络始终以期望的分布产生激活”。因此,批处理规范化层实际上是在转换层/完全连接层之后,但在馈入ReLu(或任何其他种类的)激活之前插入的。有关详情,请在时间约53分钟处观看此视频

就辍学而言,我认为辍学是在激活层之后应用的。在丢弃纸图3b中,将隐藏层l的丢弃因子/概率矩阵r(l)应用于y(l),其中y(l)是应用激活函数f之后的结果。

因此,总而言之,使用批处理规范化和退出的顺序为:

-> CONV / FC-> BatchNorm-> ReLu(或其他激活)->退出-> CONV / FC->

In the Ioffe and Szegedy 2015, the authors state that “we would like to ensure that for any parameter values, the network always produces activations with the desired distribution”. So the Batch Normalization Layer is actually inserted right after a Conv Layer/Fully Connected Layer, but before feeding into ReLu (or any other kinds of) activation. See this video at around time 53 min for more details.

As far as dropout goes, I believe dropout is applied after activation layer. In the dropout paper figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to it on y(l), where y(l) is the result after applying activation function f.

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->


回答 1

正如评论中所指出的,这里是阅读层顺序的绝佳资源。我已经浏览了评论,这是我在互联网上找到的主题的最佳资源

我的2美分:

辍学是指完全阻止某些神经元发出的信息,以确保神经元不共适应。因此,批处理规范化必须在退出后进行,否则您将通过规范化统计信息传递信息。

如果考虑一下,在典型的机器学习问题中,这就是我们不计算整个数据的均值和标准差,然后将其分为训练,测试和验证集的原因。我们拆分然后计算训练集上的统计信息,并使用它们对验证和测试数据集进行归一化和居中

所以我建议方案1(这考虑了伪马文对已接受答案评论)

-> CONV / FC-> ReLu(或其他激活)->退出-> BatchNorm-> CONV / FC

与方案2相反

-> CONV / FC-> BatchNorm-> ReLu(或其他激活)->退出-> CONV / FC->接受的答案

请注意,这意味着与方案1下的网络相比,方案2下的网络应显示出过拟合的状态,但是OP进行了上述测试,并且它们支持方案2

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments and it is the best resource on topic i have found on internet

My 2 cents:

Dropout is meant to block information from certain neurons completely to make sure the neurons do not co-adapt. So, the batch normalization has to be after dropout otherwise you are passing information through normalization statistics.

If you think about it, in typical ML problems, this is the reason we don’t compute mean and standard deviation over entire data and then split it into train, test and validation sets. We split and then compute the statistics over the train set and use them to normalize and center the validation and test datasets

so i suggest Scheme 1 (This takes pseudomarvin’s comment on accepted answer into consideration)

-> CONV/FC -> ReLu(or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC -> in the accepted answer

Please note that this means that the network under Scheme 2 should show over-fitting as compared to network under Scheme 1 but OP ran some tests as mentioned in question and they support Scheme 2


回答 2

通常,只需删除Dropout(如果有BN):

  • “ BN消除了Dropout在某些情况下的需要,因为BN直观上提供了与Dropout类似的正则化好处”
  • “ ResNet,DenseNet等架构未使用 Dropout

有关更多详细信息,请参见本文[ 通过方差Shift理解辍学与批处理规范化之间的不和谐 ],如@Haramoz在评论中所提到的。

Usually, Just drop the Dropout(when you have BN):

  • “BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively”
  • “Architectures like ResNet, DenseNet, etc. not using Dropout

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift] as already mentioned by @Haramoz in the comments.


回答 3

我找到了一篇说明Dropout和Batch Norm(BN)之间不和谐的论文。关键思想是他们所谓的“方差转移”。这是因为,辍学在训练和测试阶段之间的行为有所不同,这改变了BN学习的输入统计数据。主要观点可以从本文摘录的该图中找到。

在此笔记本中可以找到有关此效果的小演示。

I found a paper that explains the disharmony between Dropout and Batch Norm(BN). The key idea is what they call the “variance shift”. This is due to the fact that dropout has a different behavior between training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure which is taken from this paper.

A small demo for this effect can be found in this notebook.


回答 4

为了获得更好的性能,基于研究论文,我们应该在应用Dropouts之前使用BN

Based on the research paper for better performance we should use BN before applying Dropouts


回答 5

正确的顺序为:转换>规范化>激活>退出>池化

The correct order is: Conv > Normalization > Activation > Dropout > Pooling


回答 6

转化-激活-退出-BatchNorm-池->测试损失:0.04261355847120285

转化-激活-退出-池-BatchNorm->测试损失:0.050065308809280396

转换-激活-BatchNorm-池-退出-> Test_loss:0.04911309853196144

转换-激活-BatchNorm-退出-池-> Test_loss:0.06809622049331665

转换-BatchNorm-激活-退出-池-> Test_loss:0.038886815309524536

转换-BatchNorm-激活-池-退出-> Test_loss:0.04126095026731491

转换-BatchNorm-退出-激活-池-> Test_loss:0.05142546817660332

转换-退出-激活-BatchNorm-池->测试损失:0.04827788099646568

转化-退出-激活-池-BatchNorm->测试损失:0.04722036048769951

转化-退出-BatchNorm-激活-池->测试损失:0.03238215297460556


在MNIST数据集(20个纪元)上使用2个卷积模块(见下文)进行训练,然后每次

model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

卷积层的内核大小为(3,3),默认填充为,激活值为elu。池化是池畔的MaxPooling (2,2)。损失为categorical_crossentropy,优化器为adam

相应的辍学概率分别为0.20.3。特征图的数量分别为3264

编辑: 当我按照某些答案中的建议删除Dropout时,它收敛得比我使用BatchNorm Dropout 时更快,但泛化能力却较差。

Conv – Activation – DropOut – BatchNorm – Pool –> Test_loss: 0.04261355847120285

Conv – Activation – DropOut – Pool – BatchNorm –> Test_loss: 0.050065308809280396

Conv – Activation – BatchNorm – Pool – DropOut –> Test_loss: 0.04911309853196144

Conv – Activation – BatchNorm – DropOut – Pool –> Test_loss: 0.06809622049331665

Conv – BatchNorm – Activation – DropOut – Pool –> Test_loss: 0.038886815309524536

Conv – BatchNorm – Activation – Pool – DropOut –> Test_loss: 0.04126095026731491

Conv – BatchNorm – DropOut – Activation – Pool –> Test_loss: 0.05142546817660332

Conv – DropOut – Activation – BatchNorm – Pool –> Test_loss: 0.04827788099646568

Conv – DropOut – Activation – Pool – BatchNorm –> Test_loss: 0.04722036048769951

Conv – DropOut – BatchNorm – Activation – Pool –> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with

model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

The Convolutional layers have a kernel size of (3,3), default padding, the activation is elu. The Pooling is a MaxPooling of the poolside (2,2). Loss is categorical_crossentropy and the optimizer is adam.

The corresponding Dropout probability is 0.2 or 0.3, respectively. The amount of feature maps is 32 or 64, respectively.

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had a worse generalization ability than when I use BatchNorm and Dropout.


回答 7

ConV / FC-BN-Sigmoid / tanh-辍学。如果激活函数是Relu或其他,则规范化和退出的顺序取决于您的任务

ConV/FC – BN – Sigmoid/tanh – dropout. If activiation func is Relu or otherwise, the order of normalization and dropout depends on your task


回答 8

我从https://stackoverflow.com/a/40295999/8625228的答案和评论中阅读了推荐的论文

从Ioffe和Szegedy(2015)的角度来看,仅在网络结构中使用BN。Li等。(2018)给出了统计和实验分析,当从业者在BN之前使用Dropout时存在方差变化。因此,李等人。(2018)建议在所有BN层之后应用Dropout。

从Ioffe和Szegedy(2015)的角度来看,BN位于 激活函数内部/之前。然而,Chen等。Chen等(2019)使用结合了Dropout和BN的IC层。(2019)建议在ReLU之后使用BN。

在安全背景上,我仅在网络中使用Dropout或BN。

陈光勇,陈鹏飞,石玉军,谢长裕,廖本本和张胜宇。2019年。“重新思考在深度神经网络训练中批量归一化和辍学的用法。” CoRR Abs / 1905.05928。http://arxiv.org/abs/1905.05928

艾菲,谢尔盖和克里斯蒂安·塞格迪。2015年。“批量标准化:通过减少内部协变量偏移来加速深度网络训练。” CoRR Abs / 1502.03167。http://arxiv.org/abs/1502.03167

李翔,陈硕,胡小林和杨健。2018年。“通过方差转移了解辍学和批处理规范化之间的不和谐。” CoRR Abs / 1801.05134。http://arxiv.org/abs/1801.05134

I read the recommended papers in the answer and comments from https://stackoverflow.com/a/40295999/8625228

From Ioffe and Szegedy (2015)’s point of view, only use BN in the network structure. Li et al. (2018) give the statistical and experimental analyses, that there is a variance shift when the practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From Ioffe and Szegedy (2015)’s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer which combines dropout and BN, and Chen et al. (2019) recommends use BN after ReLU.

On the safety background, I use Dropout or BN only in the network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.


Tensorflow大步向前

问题:Tensorflow大步向前

我想了解的进步在tf.nn.avg_pool,tf.nn.max_pool,tf.nn.conv2d说法。

文件反复说

步幅:长度大于等于4的整数的列表。输入张量每个维度的滑动窗口的步幅。

我的问题是:

  1. 4个以上的整数分别代表什么?
  2. 对于卷积网络,为什么必须要有stride [0] = strides [3] = 1?
  3. 此示例中,我们看到了tf.reshape(_X,shape=[-1, 28, 28, 1])。为什么是-1?

遗憾的是,文档中使用-1进行重塑的示例并不能很好地解释这种情况。

I am trying to understand the strides argument in tf.nn.avg_pool, tf.nn.max_pool, tf.nn.conv2d.

The documentation repeatedly says

strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor.

My questions are:

  1. What do each of the 4+ integers represent?
  2. Why must they have strides[0] = strides[3] = 1 for convnets?
  3. In this example we see tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1?

Sadly the examples in the docs for reshape using -1 don’t translate too well to this scenario.


回答 0

池化和卷积运算在输入张量上滑动一个“窗口”。使用tf.nn.conv2d作为一个例子:如果输入张量有4个方面: [batch, height, width, channels],则卷积在二维窗口上操作height, width的尺寸。

strides确定窗口在每个维度上的移动量。典型用法是将第一个(批处理)和最后一个(深度)跨度设置为1。

让我们使用一个非常具体的示例:在32×32灰度输入图像上运行2-d卷积。我说灰度是因为输入图像的深度为1,这有助于使其简单。让该图像看起来像这样:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

让我们在一个示例(批处理大小= 1)上运行2×2卷积窗口。我们给卷积的输出通道深度为8。

卷积的输入为shape=[1, 32, 32, 1]

如果指定strides=[1,1,1,1]padding=SAME,则滤波器的输出将是[1,32,32,8]。

过滤器将首先为以下内容创建输出:

F(00 01
  10 11)

然后针对:

F(01 02
  11 12)

等等。然后它将移至第二行,计算:

F(10, 11
  20, 21)

然后

F(11, 12
  21, 22)

如果将跨度指定为[1、2、2、1],则不会重叠窗口。它将计算:

F(00, 01
  10, 11)

然后

F(02, 03
  12, 13)

跨步操作对于池操作员类似。

问题2:为何大步走向[1,x,y,1]

第一个是批处理:您通常不想跳过批处理中的示例,否则您不应该首先将它们包括在内。:)

最后一个是卷积的深度:出于相同的原因,您通常不想跳过输入。

conv2d运算符比较笼统,因此您可以创建卷积以使窗口沿其他维度滑动,但这在卷积网络中并不常见。典型用途是在空间上使用它们。

为什么要重塑为-1 -1是一个占位符,它表示“根据需要进行调整以匹配整个张量所需的大小”。这是使代码独立于输入批处理大小的一种方法,因此您可以更改管道,而不必在代码中的任何地方调整批处理大小。

The pooling and convolutional ops slide a “window” across the input tensor. Using tf.nn.conv2d as an example: If the input tensor has 4 dimensions: [batch, height, width, channels], then the convolution operates on a 2D window on the height, width dimensions.

strides determines how much the window shifts by in each of the dimensions. The typical use sets the first (the batch) and last (the depth) stride to 1.

Let’s use a very concrete example: Running a 2-d convolution over a 32×32 greyscale input image. I say greyscale because then the input image has depth=1, which helps keep it simple. Let that image look like this:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

Let’s run a 2×2 convolution window over a single example (batch size = 1). We’ll give the convolution an output channel depth of 8.

The input to the convolution has shape=[1, 32, 32, 1].

If you specify strides=[1,1,1,1] with padding=SAME, then the output of the filter will be [1, 32, 32, 8].

The filter will first create an output for:

F(00 01
  10 11)

And then for:

F(01 02
  11 12)

and so on. Then it will move to the second row, calculating:

F(10, 11
  20, 21)

then

F(11, 12
  21, 22)

If you specify a stride of [1, 2, 2, 1] it won’t do overlapping windows. It will compute:

F(00, 01
  10, 11)

and then

F(02, 03
  12, 13)

The stride operates similarly for the pooling operators.

Question 2: Why strides [1, x, y, 1] for convnets

The first 1 is the batch: You don’t usually want to skip over examples in your batch, or you shouldn’t have included them in the first place. :)

The last 1 is the depth of the convolution: You don’t usually want to skip inputs, for the same reason.

The conv2d operator is more general, so you could create convolutions that slide the window along other dimensions, but that’s not a typical use in convnets. The typical use is to use them spatially.

Why reshape to -1 -1 is a placeholder that says “adjust as necessary to match the size needed for the full tensor.” It’s a way of making the code be independent of the input batch size, so that you can change your pipeline and not have to adjust the batch size everywhere in the code.


回答 1

输入是4维的,格式为: [batch_size, image_rows, image_cols, number_of_colors]

通常,跨度定义了应用操作之间的重叠。对于conv2d,它指定卷积滤波器的连续应用之间的距离是多少。特定维度中的值1表示我们在每行/列应用运算符,值2表示每秒钟/以此类推。

关于1)对于卷积重要的值是2nd和3rd,它们表示卷积滤波器在沿行和列的应用中的重叠。值[1,2,2,1]表示我们要在每隔第二行和第二列上应用过滤器。

关于2)我不知道技术限制(可能是CuDNN要求),但通常人们会沿行或列尺寸使用步幅。在批处理大小上执行此操作不一定有意义。不确定最后一个尺寸。

关于3)为其中一个维设置-1表示“为第一维设置值,以使张量中的元素总数不变”。在我们的例子中,-1将等于batch_size。

The inputs are 4 dimensional and are of form: [batch_size, image_rows, image_cols, number_of_colors]

Strides, in general, define an overlap between applying operations. In the case of conv2d, it specifies what is the distance between consecutive applications of convolutional filters. The value of 1 in a specific dimension means that we apply the operator at every row/col, the value of 2 means every second, and so on.

Re 1) The values that matter for convolutions are 2nd and 3rd and they represent the overlap in the application of the convolutional filters along rows and columns. The value of [1, 2, 2, 1] says that we want to apply the filters on every second row and column.

Re 2) I don’t know the technical limitations (might be CuDNN requirement) but typically people use strides along the rows or columns dimensions. It doesn’t necessarily make sense to do it over batch size. Not sure of the last dimension.

Re 3) Setting -1 for one of the dimension means, “set the value for the first dimension so that the total number of elements in the tensor is unchanged”. In our case, the -1 will be equal to the batch_size.


回答 2

让我们从1-dim情况下的步幅开始。

让我们假设你input = [1, 0, 2, 3, 0, 1, 1]kernel = [2, 1, 3]卷积的结果是[8, 11, 7, 9, 4],它通过滑动你的内核在输入,进行逐元素乘法和求和计算出的一切。像这样

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 +1 * 3
  • 4 = 0 * 2 +1 * 1 +1 * 3

在这里,我们滑动了一个元素,但没有其他任何数字可以阻止您。这个数字是您的进步。您可以将其视为仅取第s个结果就对1步卷积的结果进行下采样。

知道输入大小i,内核大小k,步幅s和填充p,您可以轻松计算出卷积的输出大小为:

在这里|| 操作员表示天花板操作。对于池化层,s = 1。


N昏暗的情况。

一旦了解到每一个暗角都是独立的,就知道了1暗角情形的数学原理,n暗角情形很容易。因此,您只需分别滑动每个尺寸。这是2-d示例。请注意,您不必在所有尺寸上都具有相同的步幅。因此,对于N维输入/内核,您应提供N个跨度。


因此,现在很容易回答您的所有问题:

  1. 4个以上的整数分别代表什么?conv2dpool告诉您此列表表示每个维度之间的跨度。注意,步幅列表的长度与内核张量的秩相同。
  2. 为什么对于卷积网络,他们必须具有大步[0] =大步3 = 1?。第一个维度是批量大小,最后一个是渠道。既不跳过批处理也不跳过通道。因此,您将它们设置为1。对于宽度/高度,您可以跳过某些内容,因此它们可能不是1。
  3. tf.reshape(_X,shape = [-1,28,28,1])。为什么是-1? tf.reshape为您提供了帮助:

    如果形状的一个分量是特殊值-1,则将计算该尺寸的大小,以便总大小保持恒定。特别地,[-1]的形状变平为1-D。形状的最多一个分量可以是-1。

Let’s start with what stride does in 1-dim case.

Let’s assume your input = [1, 0, 2, 3, 0, 1, 1] and kernel = [2, 1, 3] the result of the convolution is [8, 11, 7, 9, 4], which is calculated by sliding your kernel over the input, performing element-wise multiplication and summing everything. Like this:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 + 1 * 3
  • 4 = 0 * 2 + 1 * 1 + 1 * 3

Here we slide by one element, but nothing stops you by using any other number. This number is your stride. You can think about it as downsampling the result of the 1-strided convolution by just taking every s-th result.

Knowing the input size i, kernel size k, stride s and padding p you can easily calculate the output size of the convolution as:

Here || operator means ceiling operation. For a pooling layer s = 1.


N-dim case.

Knowing the math for a 1-dim case, n-dim case is easy once you see that each dim is independent. So you just slide each dimension separately. Here is an example for 2-d. Notice that you do not need to have the same stride at all the dimensions. So for an N-dim input/kernel you should provide N strides.


So now it is easy to answer all your questions:

  1. What do each of the 4+ integers represent?. conv2d, pool tells you that this list represents the strides among each dimension. Notice that the length of strides list is the same as the rank of kernel tensor.
  2. Why must they have strides[0] = strides3 = 1 for convnets?. The first dimension is batch size, the last is channels. There is no point of skipping neither batch nor channel. So you make them 1. For width/height you can skip something and that’s why they might be not 1.
  3. tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1? tf.reshape has it covered for you:

    If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.


回答 3

@dga在解释方面做得非常出色,我无法感激它的帮助。我将以类似的方式分享我stride在3D卷积中如何工作的发现。

根据conv3d 上的TensorFlow文档,输入的形状必须按以下顺序排列:

[batch, in_depth, in_height, in_width, in_channels]

让我们使用一个示例从最右到左解释变量。假设输入形状为 input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

以下是有关如何使用步幅的摘要文档。

步幅:长度> = 5的整数列表。长度为5的1-D张量。每个输入维度的滑动窗口的步幅。一定有strides[0] = strides[4] = 1

正如许多作品所指出的那样,步幅仅表示窗口或内核距离最近的元素(无论是数据帧还是像素)有几步的距离(顺便说一句)。

从以上文档中可以看出,3D中的跨度应为以下跨度=(1,XYZ,1)。

文档强调了这一点strides[0] = strides[4] = 1

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides [X]表示我们应在集总帧中进行多少次跳过。因此,例如,如果我们有16帧,则X = 1表示使用每帧。X = 2表示每隔一帧使用一次,然后继续

strides [y]和strides [z]按照@dga的说明进行操作,因此我不会重做该部分。

但是,在keras中,您只需要指定3个整数的元组/列表,即可指定沿每个空间维度的卷积步幅,其中空间维度为stride [x],strides [y]和strides [z]。strides [0]和strides [4]已默认为1。

我希望有人觉得这有帮助!

@dga has done a wonderful job explaining and I can’t be thankful enough how helpful it has been. In the like manner, I will like to share my findings on how stride works in 3D convolution.

According to the TensorFlow documentation on conv3d, the shape of the input must be in this order:

[batch, in_depth, in_height, in_width, in_channels]

Let’s explain the variables from the extreme right to the left using an example. Assuming the input shape is input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

Below is a summary documentation for how stride is used.

strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1

As indicated in many works, strides simply mean how many steps away a window or kernel jumps away from the closest element, be it a data frame or pixel (this is paraphrased by the way).

From the above documentation, a stride in 3D will look like this strides = (1,X,Y,Z,1).

The documentation emphasizes that strides[0] = strides[4] = 1.

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] means how many skips we should make in the lumped frames. So for example, if we have 16 frames, X=1 means use every frame. X=2 means use every second frame and it goes and on

strides[y] and strides[z] follow the explanation by @dga so I will not redo that part.

In keras however, you only need to specify a tuple/list of 3 integers, specifying the strides of the convolution along each spatial dimension, where spatial dimension is stride[x], strides[y] and strides[z]. strides[0] and strides[4] is already defaulted to 1.

I hope someone finds this helpful!


哪些参数应用于提前停止?

问题:哪些参数应用于提前停止?

我正在使用Keras为我的项目训练神经网络。Keras提供了提前停止的功能。我是否知道应该观察哪些参数,以避免由于使用早期停止而使神经网络过度拟合?

I’m training a neural network for my project using Keras. Keras has provided a function for early stopping. May I know what parameters should be observed to avoid my neural network from overfitting by using early stopping?


回答 0

一旦损失开始增加(或换句话说,验证准确性开始降低),早期停止基本上就是停止训练。根据文件,其用法如下;

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=0,
                              verbose=0, mode='auto')

值取决于您的实现(问题,批处理大小等),但通常是为了防止我使用过度拟合;

  1. 通过将monitor 参数设置为,监控验证损失(需要使用交叉验证或至少训练/测试集)'val_loss'
  2. min_delta是在某个时期是否将损失量化为改善的阈值。如果损失差异小于min_delta,则将其量化为无改善。最好将其保留为0,因为我们对损失越来越严重感兴趣。
  3. patience参数代表损失开始增加(停止改善)后停止之前的时期数。这取决于您的实现,如果您使用的批次非常小学习率较高,则损失呈锯齿状(准确性会更加嘈杂),因此最好设置一个较大的patience参数。如果您使用大批量学习率较低,则损失会更平稳,因此可以使用较小的patience参数。无论哪种方式,我都将其保留为2,以便为模型提供更多机会。
  4. verbose 确定要打印的内容,将其保留为默认值(0)。
  5. mode参数取决于您监视的数量的方向(应该是减少还是增加),因为我们监视损失,所以可以使用min。但是让我们留给喀拉拉邦为我们处理,并将其设置为auto

因此,我将使用类似的方法并通过绘制有无早期停止的错误损失来进行实验。

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              verbose=0, mode='auto')

为了避免对回调的工作方式产生歧义,我将尝试解释更多信息。调用fit(... callbacks=[es])模型后,Keras会调用给定的回调对象预定的函数。这些功能可以称为on_train_beginon_train_endon_epoch_beginon_epoch_endon_batch_beginon_batch_end。在每个时期结束时调用提前停止回调,将最佳监视值与当前监视值进行比较,并在条件满足时停止(自观察最佳监视值以来已经过去了多少个时期,这不仅仅是耐心参数,两者之间的差最后一个值大于min_delta等。)。

正如@BrentFaust在评论中指出的那样,模型的训练将继续进行,直到满足Early Stopping条件或满足epochsin(默认值= 10)为止fit()。设置“提早停止”回调不会使模型超出其epochs参数进行训练。因此,调用fit()较大epochs值的函数将受益于Early Stopping回调。

Early stopping is basically stopping the training once your loss starts to increase (or in other words validation accuracy starts to decrease). According to documents it is used as follows;

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=0,
                              verbose=0, mode='auto')

Values depends on your implementation (problem, batch size etc…) but generally to prevent overfitting I would use;

  1. Monitor the validation loss (need to use cross validation or at least train/test sets) by setting the monitor argument to 'val_loss'.
  2. min_delta is a threshold to whether quantify a loss at some epoch as improvement or not. If the difference of loss is below min_delta, it is quantified as no improvement. Better to leave it as 0 since we’re interested in when loss becomes worse.
  3. patience argument represents the number of epochs before stopping once your loss starts to increase (stops improving). This depends on your implementation, if you use very small batches or a large learning rate your loss zig-zag (accuracy will be more noisy) so better set a large patience argument. If you use large batches and a small learning rate your loss will be smoother so you can use a smaller patience argument. Either way I’ll leave it as 2 so I would give the model more chance.
  4. verbose decides what to print, leave it at default (0).
  5. mode argument depends on what direction your monitored quantity has (is it supposed to be decreasing or increasing), since we monitor the loss, we can use min. But let’s leave keras handle that for us and set that to auto

So I would use something like this and experiment by plotting the error loss with and without early stopping.

keras.callbacks.EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              verbose=0, mode='auto')

For possible ambiguity on how callbacks work, I’ll try to explain more. Once you call fit(... callbacks=[es]) on your model, Keras calls given callback objects predetermined functions. These functions can be called on_train_begin, on_train_end, on_epoch_begin, on_epoch_end and on_batch_begin, on_batch_end. Early stopping callback is called on every epoch end, compares the best monitored value with the current one and stops if conditions are met (how many epochs have past since the observation of the best monitored value and is it more than patience argument, the difference between last value is bigger than min_delta etc..).

As pointed by @BrentFaust in comments, model’s training will continue until either Early Stopping conditions are met or epochs parameter (default=10) in fit() is satisfied. Setting an Early Stopping callback will not make the model to train beyond its epochs parameter. So calling fit() function with a larger epochs value would benefit more from Early Stopping callback.