Tag archive: tensorflow

Why is TensorFlow 2 much slower than TensorFlow 1?

Question: Why is TensorFlow 2 much slower than TensorFlow 1?

It’s been cited by many users as the reason for switching to Pytorch, but I’ve yet to find a justification / explanation for sacrificing the most important practical quality, speed, for eager execution.

Below is code benchmarking performance, TF1 vs. TF2 – with TF1 running anywhere from 47% to 276% faster.

My question is: what is it, at the graph or hardware level, that yields such a significant slowdown?


Looking for a detailed answer – am already familiar with broad concepts. Relevant Git

Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070


Benchmark results:


UPDATE: Disabling Eager Execution per below code does not help. The behavior, however, is inconsistent: sometimes running in graph mode helps considerably, other times it runs slower relative to Eager.

As TF devs don’t appear around anywhere, I’ll be investigating this matter myself – can follow progress in the linked Github issue.

UPDATE 2: tons of experimental results to share, along with explanations; should be done today.


Benchmark code:

# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time

batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)

model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y)  # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)

K.clear_session()  # in my testing, kernel was restarted instead

model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y)  # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)

Functions used:

def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_small_model(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(128, 400, strides=4, padding='same')(ipt)
    x     = Flatten()(x)
    x     = Dropout(0.5)(x)
    x     = Dense(64, activation='relu')(x)
    out   = Dense(1,  activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_medium_model(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
    x     = LSTM(512, activation='relu', return_sequences=True)(x)
    x     = Conv1D(128, 400, strides=4, padding='same')(x)
    x     = Flatten()(x)
    x     = Dense(256, activation='relu')(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation='relu')(x)
    x     = Dense(64,  activation='relu')(x)
    out   = Dense(1,   activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))

Answer 0

UPDATE 2/18/2020: I’ve benched 2.1 and 2.1-nightly; the results are mixed. All but one config (model & data size) are as fast as or much faster than the best of TF2 & TF1. The one that’s slower, and dramatically so, is Large-Large, especially in Graph execution (1.6x to 2.5x slower).

Furthermore, there are extreme reproducibility differences between Graph and Eager for a large model I tested – one not explainable via randomness/compute-parallelism. I can’t currently present reproducible code for these claims due to time constraints, so instead I strongly recommend testing this for your own models.

Haven’t opened a Git issue on these yet, but I did comment on the original – no response yet. I’ll update the answer(s) once progress is made.


VERDICT: it isn’t, IF you know what you’re doing. But if you don’t, it could cost you, a lot – by a few GPU upgrades’ worth on average, and by multiple GPUs’ worth in the worst case.


THIS ANSWER: aims to provide a high-level description of the issue, as well as guidelines for how to decide on the training configuration specific to your needs. For a detailed, low-level description, which includes all benchmarking results + code used, see my other answer.

I’ll be updating my answer(s) w/ more info if I learn any – can bookmark / “star” this question for reference.


ISSUE SUMMARY: as confirmed by a TensorFlow developer, Q. Scott Zhu, TF2 focused development on Eager execution & tight integration w/ Keras, which involved sweeping changes in TF source – including at graph-level. Benefits: greatly expanded processing, distribution, debug, and deployment capabilities. The cost of some of these, however, is speed.

The matter, however, is rather more complex. It isn’t just TF1 vs. TF2 – factors yielding significant differences in train speed include:

  1. TF2 vs. TF1
  2. Eager vs. Graph mode
  3. keras vs. tf.keras
  4. numpy vs. tf.data.Dataset vs. …
  5. train_on_batch() vs. fit()
  6. GPU vs. CPU
  7. model(x) vs. model.predict(x) vs. …

Unfortunately, almost none of the above are independent of the other, and each can at least double execution time relative to another. Fortunately, you can determine what’ll work best systematically, and with a few shortcuts – as I’ll be showing.


WHAT SHOULD I DO? Currently, the only way is: experiment for your specific model, data, and hardware. No single configuration will always work best – but there are do’s and don’ts to simplify your search:

>> DO:

  • train_on_batch() + numpy + tf.keras + TF1 + Eager/Graph
  • train_on_batch() + numpy + tf.keras + TF2 + Graph
  • fit() + numpy + tf.keras + TF1/TF2 + Graph + large model & data

>> DON’T:

  • fit() + numpy + keras for small & medium models and data
  • fit() + numpy + tf.keras + TF1/TF2 + Eager
  • train_on_batch() + numpy + keras + TF1 + Eager

  • [Major] tf.python.keras; it can run 10-100x slower, and w/ plenty of bugs; more info

    • This includes layers, models, optimizers, & related “out-of-box” usage imports; ops, utils, & related ‘private’ imports are fine – but to be sure, check for alts, & whether they’re used in tf.keras

Refer to code at bottom of my other answer for an example benchmarking setup. The list above is based mainly on the “BENCHMARKS” tables in the other answer.


LIMITATIONS of the above DO’s & DON’T’s:

  • This question’s titled “Why is TF2 much slower than TF1?”, and while its body concerns training explicitly, the matter isn’t limited to it; inference, too, is subject to major speed differences, even within the same TF version, import, data format, etc. – see this answer.
  • RNNs are likely to notably change the data grid in the other answer, as they’ve been improved in TF2
  • Models primarily used Conv1D and Dense – no RNNs, sparse data/targets, 4/5D inputs, & other configs
  • Input data limited to numpy and tf.data.Dataset, while many other formats exist; see other answer
  • GPU was used; results will differ on a CPU. In fact, when I asked the question, my CUDA wasn’t properly configured, and some of the results were CPU-based.

Why did TF2 sacrifice the most practical quality, speed, for eager execution? It hasn’t, clearly – graph is still available. But if the question is “why eager at all”:

  • Superior debugging: you’ve likely come across multitudes of questions asking “how do I get intermediate layer outputs” or “how do I inspect weights”; with eager, it’s (almost) as simple as .__dict__. Graph, in contrast, requires familiarity with special backend functions – greatly complicating the entire process of debugging & introspection.
  • Faster prototyping: per ideas similar to above; faster understanding = more time left for actual DL.

HOW TO ENABLE/DISABLE EAGER?

tf.enable_eager_execution()  # TF1; must be done before any model/tensor creation
tf.compat.v1.disable_eager_execution() # TF2; above holds
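
To confirm which mode is active, tf.executing_eagerly() can be checked (a quick sanity check; the API exists in both TF 1.x from 1.7 onward and TF2):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # TF2; run before building any model/tensor

print(tf.executing_eagerly())  # False -> Graph mode; True -> Eager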

ADDITIONAL INFO:

  • Careful with _on_batch() methods in TF2; according to the TF dev, they still use a slower implementation, but not intentionally – i.e. it’s to be fixed. See other answer for details.

REQUESTS TO TENSORFLOW DEVS:

  1. Please fix train_on_batch(), and the performance aspect of calling fit() iteratively; custom train loops are important to many, especially to me.
  2. Add documentation / docstring mention of these performance differences for users’ knowledge.
  3. Improve general execution speed to keep peeps from hopping to Pytorch.

ACKNOWLEDGEMENTS: Thanks to


UPDATES:

  • 11/14/19 – found a model (in my real application) that runs slower on TF2 for all* configurations w/ Numpy input data. Differences ranged 13-19%, averaging 17%. Differences between keras and tf.keras, however, were more dramatic: 18-40%, avg. 32% (both TF1 & 2). (* – except Eager, for which TF2 OOM’d)

  • 11/17/19 – devs updated on_batch() methods in a recent commit, stating that speed has been improved – to be released in TF 2.1, or available now as tf-nightly. As I’m unable to get the latter running, benching is delayed until 2.1.

  • 2/20/20 – prediction performance is also worth benching; in TF2, for example, CPU prediction times can involve periodic spikes. A quick comparison of the two prediction paths (factor 7 above) is sketched after this list.
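
A minimal sketch for comparing the two prediction paths from factor 7 above (the toy model here is illustrative, not one of the benchmark models):

import numpy as np
import tensorflow as tf
from time import time

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
x = np.random.randn(32, 16).astype('float32')

for name, fn in [("model.predict(x)", lambda: model.predict(x)),
                 ("model(x)", lambda: model(x))]:
    fn()  # warm-up call: builds the predict function / traces graphs
    t0 = time()
    for _ in range(100):
        fn()
    print("%s: %.4f sec/iter" % (name, (time() - t0) / 100))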

Answer 1

THIS ANSWER: aims to provide a detailed, graph/hardware-level description of the issue – including TF2 vs. TF1 train loops, input data processors, and Eager vs. Graph mode executions. For an issue summary & resolution guidelines, see my other answer.


PERFORMANCE VERDICT: sometimes one is faster, sometimes the other, depending on configuration. As far as TF2 vs TF1 goes, they’re about on par on average, but significant config-based differences do exist, and TF1 trumps TF2 more often than vice versa. See “BENCHMARKING” below.


EAGER VS. GRAPH: the meat of this entire answer for some: TF2’s eager is slower than TF1’s, according to my testing. Details further down.

The fundamental difference between the two is: Graph sets up a computational network proactively, and executes when ‘told to’ – whereas Eager executes everything upon creation. But the story only begins here:

  • Eager is NOT devoid of Graph, and may in fact be mostly Graph, contrary to expectation. What it largely is, is executed Graph – this includes model & optimizer weights, comprising a great portion of the graph.

  • Eager rebuilds part of its own graph at execution; a direct consequence of Graph not being fully built — see profiler results. This has a computational overhead.

  • Eager is slower w/ Numpy inputs; per this Git comment & code, Numpy inputs in Eager include the overhead cost of copying tensors from CPU to GPU. Stepping through source code, data handling differences are clear; Eager directly passes Numpy, while Graph passes tensors which then evaluate to Numpy; uncertain of the exact process, but the latter should involve GPU-level optimizations. (A mitigation sketch follows this list.)

  • TF2 Eager is slower than TF1 Eager – this is… unexpected. See benchmarking results below. Differences span from negligible to significant, but are consistent. Unsure why it’s the case – if a TF dev clarifies, will update answer.
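
One possible mitigation for the Numpy copy overhead in the third point is to convert inputs to tensors once, outside the train loop (a sketch under the assumption that your feeding path accepts tensors; shapes are illustrative):

import numpy as np
import tensorflow as tf

x_np = np.random.randn(32, 140, 30).astype('float32')

# Convert once, outside the train loop, so each eager call reuses a
# device-resident tensor instead of re-copying the Numpy array every iteration
x_tf = tf.convert_to_tensor(x_np)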


TF2 vs. TF1: quoting relevant portions of a TF dev’s, Q. Scott Zhu’s, response – w/ bit of my emphasis & rewording:

In eager, the runtime needs to execute the ops and return the numerical value for every line of python code. The nature of single step execution causes it to be slow.

In TF2, Keras leverages tf.function to build its graph for training, eval and prediction. We call them “execution function” for the model. In TF1, the “execution function” was a FuncGraph, which shared some common component as TF function, but has a different implementation.

During the process, we somehow left an incorrect implementation for train_on_batch(), test_on_batch() and predict_on_batch(). They are still numerically correct, but the execution function for x_on_batch is a pure python function, rather than a tf.function wrapped python function. This will cause slowness

In TF2, we convert all input data into a tf.data.Dataset, by which we can unify our execution function to handle the single type of the inputs. There might be some overhead in the dataset conversion, and I think this is a one-time only overhead, rather than a per-batch cost

Regarding the last sentence of the paragraph above, and the last clause of the paragraph below:

To overcome the slowness in eager mode, we have @tf.function, which will turn a python function into a graph. When feed numerical value like np array, the body of the tf.function is converted into static graph, being optimized, and return the final value, which is fast and should have similar performance as TF1 graph mode.

I disagree – per my profiling results, which show Eager’s input data processing to be substantially slower than Graph’s. Also, unsure about tf.data.Dataset in particular, but Eager does repeatedly call multiple of the same data conversion methods – see profiler.
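
For concreteness, the @tf.function pattern referenced in the quote looks roughly like this in TF2 (a minimal sketch with illustrative names such as train_step; it is not the benchmark code used in this answer):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
opt = tf.keras.optimizers.Adam(1e-4)
loss_fn = tf.keras.losses.BinaryCrossentropy()

@tf.function  # traces the Python body into a static graph on first call
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = np.random.randn(32, 16).astype('float32')
y = np.random.randint(0, 2, (32, 1)).astype('float32')
train_step(x, y)  # first call traces (slow); subsequent calls reuse the graph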

Lastly, dev’s linked commit: Significant number of changes to support the Keras v2 loops.


Train Loops: depending on (1) Eager vs. Graph and (2) input data format, training will proceed with a distinct train loop – in TF2, _select_training_loop(), training.py, one of:

training_v2.Loop()
training_distributed.DistributionMultiWorkerTrainingLoop(
              training_v2.Loop()) # multi-worker mode
# Case 1: distribution strategy
training_distributed.DistributionMultiWorkerTrainingLoop(
            training_distributed.DistributionSingleWorkerTrainingLoop())
# Case 2: generator-like. Input is Python generator, or Sequence object,
# or a non-distributed Dataset or iterator in eager execution.
training_generator.GeneratorOrSequenceTrainingLoop()
training_generator.EagerDatasetOrIteratorTrainingLoop()
# Case 3: Symbolic tensors or Numpy array-like. This includes Datasets and iterators 
# in graph mode (since they generate symbolic tensors).
training_generator.GeneratorLikeTrainingLoop() # Eager
training_arrays.ArrayLikeTrainingLoop() # Graph

Each handles resource allocation differently, and bears consequences on performance & capability.


Train Loops: fit vs train_on_batch, keras vs. tf.keras: each of the four uses a different train loop, though perhaps not in every possible combination. keras’ fit, for example, uses a form of fit_loop, e.g. training_arrays.fit_loop(), and its train_on_batch may use K.function(). tf.keras has a more sophisticated hierarchy, described in part in the previous section.
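
For illustration, K.function() can be used directly to build such a graph-backed callable (a minimal sketch, assuming standalone keras with the TF1 backend; predict_fn is an illustrative name, not a keras internal):

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model
import keras.backend as K

ipt = Input((16,))
out = Dense(1)(ipt)
model = Model(ipt, out)

# Compiles a graph-backed callable from input placeholders to outputs,
# similar in spirit to what keras' train_on_batch builds internally
predict_fn = K.function([model.input], [model.output])
print(predict_fn([np.random.randn(32, 16)])[0].shape)  # (32, 1)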


Train Loops: documentation — relevant source docstring on some of the different execution methods:

Unlike other TensorFlow operations, we don’t convert python numerical inputs to tensors. Moreover, a new graph is generated for each distinct python numerical value

function instantiates a separate graph for every unique set of input shapes and datatypes.

A single tf.function object might need to map to multiple computation graphs under the hood. This should be visible only as performance (tracing graphs has a nonzero computational and memory cost)
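
This retracing can be observed directly, since Python side effects run only while tracing (a minimal sketch):

import tensorflow as tf

@tf.function
def f(x):
    print("tracing for shape", x.shape)  # Python side effect: runs only during tracing
    return tf.reduce_sum(x)

f(tf.ones((32, 140, 30)))   # traces graph #1
f(tf.ones((32, 140, 30)))   # same signature: cached graph, no print
f(tf.ones((32, 1400, 30)))  # new input shape: retraces into graph #2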


Input data processors: similar to above, the processor is selected case-by-case, depending on internal flags set according to runtime configurations (execution mode, data format, distribution strategy). The simplest case is with Eager, which works directly w/ Numpy arrays. For some specific examples, see this answer.


MODEL SIZE, DATA SIZE:

  • Is decisive; no single configuration crowned itself atop all model & data sizes.
  • Data size relative to model size is important; for small data & model, data transfer (e.g. CPU to GPU) overhead can dominate. Likewise, small-overhead processors can run slower on large data, as data conversion time comes to dominate (see convert_to_tensor in “PROFILER”; a timing sketch follows this list)
  • Speed differs per train loops’ and input data processors’ differing means of handling resources.
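
A rough way to see the per-call conversion cost on large inputs (a sketch; the shape is borrowed from the ‘large data’ benchmark below):

import numpy as np
import tensorflow as tf
from time import time

x_np = np.random.randn(32, 14000, 30).astype('float32')

t0 = time()
for _ in range(100):
    tf.convert_to_tensor(x_np)  # host-to-device copy paid on every call
print("convert_to_tensor: %.4f sec/iter" % ((time() - t0) / 100))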

BENCHMARKS: the ground meat. — Word Document, Excel Spreadsheet


Terminology:

  • %-less numbers are all seconds
  • % computed as (1 - longer_time / shorter_time)*100; rationale: we’re interested in the factor by which one is faster than the other; shorter / longer is actually a non-linear relation, not useful for direct comparison. E.g. with TF1 at 0.5 sec/iter and TF2 at 1.0 sec/iter, the TF2-vs-TF1 entry is -100% under the sign convention below
  • % sign determination:
    • TF2 vs TF1: + if TF2 is faster
    • GvE (Graph vs. Eager): + if Graph is faster
  • TF2 = TensorFlow 2.0.0 + Keras 2.3.1; TF1 = TensorFlow 1.14.0 + Keras 2.2.5

PROFILER:


PROFILER – Explanation: Spyder 3.3.6 IDE profiler.

  • Some functions are repeated in nests of others; hence, it’s hard to track down the exact separation between “data processing” and “training” functions, so there will be some overlap – as pronounced in the very last result.

  • % figures computed w.r.t. runtime minus build time

  • Build time computed by summing all (unique) runtimes which were called 1 or 2 times
  • Train time computed by summing all (unique) runtimes which were called the same # of times as the # of iterations, and some of their nests’ runtimes
  • Functions are profiled according to their original names, unfortunately (i.e. _func = func will profile as func), which mixes in build time – hence the need to exclude it

TESTING ENVIRONMENT:

  • Executed code at bottom w/ minimal background tasks running
  • GPU was “warmed up” w/ a few iterations before timing iterations, as suggested in this post
  • CUDA 10.0.130, cuDNN 7.6.0, TensorFlow 1.14.0, & TensorFlow 2.0.0 built from source, plus Anaconda
  • Python 3.7.4, Spyder 3.3.6 IDE
  • GTX 1070, Windows 10, 24GB DDR4 2.4-MHz RAM, i7-7700HQ 2.8-GHz CPU

METHODOLOGY:

  • Benchmark ‘small’, ‘medium’, & ‘large’ model & data sizes
  • Fix # of parameters for each model size, independent of input data size
  • “Larger” model has more parameters and layers
  • “Larger” data has a longer sequence, but same batch_size and num_channels
  • Models only use Conv1D and Dense ‘learnable’ layers; RNNs were avoided due to per-TF-version implementation differences
  • Always ran one train fit outside of benchmarking loop, to omit model & optimizer graph building
  • Not using sparse data (e.g. layers.Embedding()) or sparse targets (e.g. SparseCategoricalCrossEntropy())

LIMITATIONS: a “complete” answer would explain every possible train loop & iterator, but that’s surely beyond my time ability, nonexistent paycheck, or general necessity. The results are only as good as the methodology – interpret with an open mind.


CODE:

import numpy as np
import tensorflow as tf
import random
from termcolor import cprint
from time import time

from tensorflow.keras.layers import Input, Dense, Conv1D
from tensorflow.keras.layers import Dropout, GlobalAveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K
#from keras.layers import Input, Dense, Conv1D
#from keras.layers import Dropout, GlobalAveragePooling1D
#from keras.models import Model 
#from keras.optimizers import Adam
#import keras.backend as K

#tf.compat.v1.disable_eager_execution()
#tf.enable_eager_execution()

def reset_seeds(reset_graph_with_backend=None, verbose=1):
    if reset_graph_with_backend is not None:
        K = reset_graph_with_backend
        K.clear_session()
        tf.compat.v1.reset_default_graph()
        if verbose:
            print("KERAS AND TENSORFLOW GRAPHS RESET")

    np.random.seed(1)
    random.seed(2)
    if tf.__version__[0] == '2':
        tf.random.set_seed(3)
    else:
        tf.set_random_seed(3)
    if verbose:
        print("RANDOM SEEDS RESET")

print("TF version: {}".format(tf.__version__))
reset_seeds()

def timeit(func, iterations, *args, _verbose=0, **kwargs):
    t0 = time()
    for _ in range(iterations):
        func(*args, **kwargs)
        print(end='.'*int(_verbose))
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_model_small(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(128, 40, strides=4, padding='same')(ipt)
    x     = GlobalAveragePooling1D()(x)
    x     = Dropout(0.5)(x)
    x     = Dense(64, activation='relu')(x)
    out   = Dense(1,  activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_model_medium(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = ipt
    for filters in [64, 128, 256, 256, 128, 64]:
        x  = Conv1D(filters, 20, strides=1, padding='valid')(x)
    x     = GlobalAveragePooling1D()(x)
    x     = Dense(256, activation='relu')(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation='relu')(x)
    x     = Dense(64,  activation='relu')(x)
    out   = Dense(1,   activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_model_large(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(64,  400, strides=4, padding='valid')(ipt)
    x     = Conv1D(128, 200, strides=1, padding='valid')(x)
    for _ in range(40):
        x = Conv1D(256,  12, strides=1, padding='same')(x)
    x     = Conv1D(512,  20, strides=2, padding='valid')(x)
    x     = Conv1D(1028, 10, strides=2, padding='valid')(x)
    x     = Conv1D(256,   1, strides=1, padding='valid')(x)
    x     = GlobalAveragePooling1D()(x)
    x     = Dense(256, activation='relu')(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation='relu')(x)
    x     = Dense(64,  activation='relu')(x)    
    out   = Dense(1,   activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), \
           np.random.randint(0, 2, (batch_shape[0], 1))

def make_data_tf(batch_shape, n_batches, iters):
    data = np.random.randn(n_batches, *batch_shape),
    trgt = np.random.randint(0, 2, (n_batches, batch_shape[0], 1))
    return tf.data.Dataset.from_tensor_slices((data, trgt))#.repeat(iters)

batch_shape_small  = (32, 140,   30)
batch_shape_medium = (32, 1400,  30)
batch_shape_large  = (32, 14000, 30)

batch_shapes = batch_shape_small, batch_shape_medium, batch_shape_large
make_model_fns = make_model_small, make_model_medium, make_model_large
iterations = [200, 100, 50]
shape_names = ["Small data",  "Medium data",  "Large data"]
model_names = ["Small model", "Medium model", "Large model"]

def test_all(fit=False, tf_dataset=False):
    for model_fn, model_name, iters in zip(make_model_fns, model_names, iterations):
        for batch_shape, shape_name in zip(batch_shapes, shape_names):
            if (model_fn is make_model_large) and (batch_shape is batch_shape_small):
                continue
            reset_seeds(reset_graph_with_backend=K)
            if tf_dataset:
                data = make_data_tf(batch_shape, iters, iters)
            else:
                data = make_data(batch_shape)
            model = model_fn(batch_shape)

            if fit:
                if tf_dataset:
                    model.train_on_batch(data.take(1))
                    t0 = time()
                    model.fit(data, steps_per_epoch=iters)
                    print("Time/iter: %.4f sec" % ((time() - t0) / iters))
                else:
                    model.train_on_batch(*data)
                    timeit(model.fit, iters, *data, _verbose=1, verbose=0)
            else:
                model.train_on_batch(*data)
                timeit(model.train_on_batch, iters, *data, _verbose=1)
            cprint(">> {}, {} done <<\n".format(model_name, shape_name), 'blue')
            del model

test_all(fit=True, tf_dataset=False)

Tensorflow 2.0 - AttributeError: module 'tensorflow' has no attribute 'Session'

Question: Tensorflow 2.0 - AttributeError: module 'tensorflow' has no attribute 'Session'

When I am executing the command sess = tf.Session() in Tensorflow 2.0 environment, I am getting an error message as below:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'tensorflow' has no attribute 'Session'

System Information:

  • OS Platform and Distribution: Windows 10
  • Python Version: 3.7.1
  • Tensorflow Version: 2.0.0-alpha0 (installed with pip)

Steps to reproduce:

Installation:

  1. pip install --upgrade pip
  2. pip install tensorflow==2.0.0-alpha0
  3. pip install keras
  4. pip install numpy==1.16.2

Execution:

  1. Execute command: import tensorflow as tf
  2. Execute command: sess = tf.Session()

Answer 0

According to TF 1:1 Symbols Map, in TF 2.0 you should use tf.compat.v1.Session() instead of tf.Session()

https://docs.google.com/spreadsheets/d/1FLFJLzg7WNP6JHODX5q8BDgptKafq_slHpnHVbJIteQ/edit#gid=0

To get TF 1.x-like behaviour in TF 2.0, one can run

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

but then one cannot benefit from many of the improvements made in TF 2.0. For more details, please refer to the migration guide: https://www.tensorflow.org/guide/migrate


Answer 1

TF2 runs Eager Execution by default, thus removing the need for Sessions. If you want to run static graphs, the more proper way is to use tf.function() in TF2. While Session can still be accessed via tf.compat.v1.Session() in TF2, I would discourage using it. It may be helpful to demonstrate the difference by comparing hello worlds:

TF1.x hello world:

import tensorflow as tf
msg = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(msg))

TF2.x hello world:

import tensorflow as tf
msg = tf.constant('Hello, TensorFlow!')
tf.print(msg)

For more info, see Effective TensorFlow 2
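
For the static-graph route recommended above, a tf.function counterpart of the hello world might look like this (a minimal sketch; greet is an illustrative name):

import tensorflow as tf

@tf.function  # compiles the body into a graph; TF2's replacement for Session.run
def greet():
    return tf.constant('Hello, TensorFlow!')

tf.print(greet())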


Answer 2

I faced this problem when I first tried Python after installing windows10 + python3.7(64bit) + anaconda3 + jupyter notebook.

I solved this problem by referring to “https://vispud.blogspot.com/2019/05/tensorflow200a0-attributeerror-module.html”

I agree with

I believe “Session()” has been removed with TF 2.0.

I inserted two lines. One is tf.compat.v1.disable_eager_execution() and the other is sess = tf.compat.v1.Session()

My Hello.py is as follows:

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

hello = tf.constant('Hello, TensorFlow!')

sess = tf.compat.v1.Session()

print(sess.run(hello))

Answer 3

For TF2.x, you can do it like this.

import tensorflow as tf
with tf.compat.v1.Session() as sess:
    hello = tf.constant('hello world')
    print(sess.run(hello))

>>> b'hello world'


Answer 4

try this

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

hello = tf.constant('Hello, TensorFlow!')

sess = tf.compat.v1.Session()

print(sess.run(hello))

Answer 5

If this is your code, the correct solution is to rewrite it to not use Session(), since that’s no longer necessary in TensorFlow 2

If this is just code you’re running, you can downgrade to TensorFlow 1 by running

pip3 install --upgrade --force-reinstall tensorflow-gpu==1.15.0 

(or whatever the latest version of TensorFlow 1 is)


Answer 6

Tensorflow 2.x supports Eager Execution by default, hence Session is not supported.


Answer 7

Using Anaconda + Spyder (Python 3.7)

[code]

import tensorflow as tf
valor1 = tf.constant(2)
valor2 = tf.constant(3)
type(valor1)
print(valor1)
soma=valor1+valor2
type(soma)
print(soma)
sess = tf.compat.v1.Session()
with sess:
    print(sess.run(soma))

[console]

import tensorflow as tf
valor1 = tf.constant(2)
valor2 = tf.constant(3)
type(valor1)
print(valor1)
soma=valor1+valor2
type(soma)
Tensor("Const_8:0", shape=(), dtype=int32)
Out[18]: tensorflow.python.framework.ops.Tensor

print(soma)
Tensor("add_4:0", shape=(), dtype=int32)

sess = tf.compat.v1.Session()

with sess:
    print(sess.run(soma))
5

Answer 8

TF v2.0 supports Eager mode, as opposed to the Graph mode of v1.0; hence tf.Session() is not supported in v2.0. I would suggest rewriting your code to work in Eager mode.


Answer 9

import tensorflow as tf
sess = tf.Session()

This code will raise an AttributeError on version 2.x.

To use version 1.x code in version 2.x, try this:

import tensorflow.compat.v1 as tf
sess = tf.Session()

How to run Tensorflow on CPU

Question: How to run Tensorflow on CPU

I have installed the GPU version of tensorflow on Ubuntu 14.04.

I am on a GPU server where tensorflow can access the available GPUs.

I want to run tensorflow on the CPUs.

Normally I can use env CUDA_VISIBLE_DEVICES=0 to run on GPU no. 0.

How can I pick between the CPUs instead?

I am not interested in rewriting my code to use with tf.device("/cpu:0"):


Answer 0

You can apply the device_count parameter per tf.Session:

config = tf.ConfigProto(
        device_count = {'GPU': 0}
    )
sess = tf.Session(config=config)

See also protobuf config file:

tensorflow/core/framework/config.proto
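
If Keras runs on top of this TF1 session, the same config can be installed as the Keras backend session (a sketch under that assumption; K.set_session is the standard keras backend call):

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)
K.set_session(sess)  # Keras now uses this CPU-only session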


Answer 1

You can also set the environment variable to

CUDA_VISIBLE_DEVICES=""

without having to modify the source code.


Answer 2

If the above answers don’t work, try:

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

Answer 3

For me, only setting CUDA_VISIBLE_DEVICES to precisely -1 works:

Works:

import os
import tensorflow as tf

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

if tf.test.gpu_device_name():
    print('GPU found')
else:
    print("No GPU found")

# No GPU found

Does not work:

import os
import tensorflow as tf

os.environ['CUDA_VISIBLE_DEVICES'] = ''    

if tf.test.gpu_device_name():
    print('GPU found')
else:
    print("No GPU found")

# GPU found

Answer 4

Just use the code below.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

Answer 5

On some systems one has to specify:

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]=""  # or even "-1"

BEFORE importing tensorflow.


Answer 6

You could use tf.config.set_visible_devices. One possible function that allows you to set if and which GPUs to use is:

import tensorflow as tf

def set_gpu(gpu_ids_list):
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        try:
            gpus_used = [gpus[i] for i in gpu_ids_list]
            tf.config.set_visible_devices(gpus_used, 'GPU')
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
        except RuntimeError as e:
            # Visible devices must be set before GPUs have been initialized
            print(e)

Suppose you are on a system with 4 GPUs and you want to use only two of them, the one with id = 0 and the one with id = 2; then the first command of your code, immediately after importing the libraries, would be:

set_gpu([0, 2])

In your case, to use only the CPU, you can invoke the function with an empty list:

set_gpu([])

For completeness, if you want to avoid having the runtime initialization allocate all of the memory on the device, you can use tf.config.experimental.set_memory_growth. Finally, the function to manage which devices to use, occupying the GPUs’ memory dynamically, becomes:

import tensorflow as tf

def set_gpu(gpu_ids_list):
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        try:
            gpus_used = [gpus[i] for i in gpu_ids_list]
            tf.config.set_visible_devices(gpus_used, 'GPU')
            for gpu in gpus_used:
                tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
        except RuntimeError as e:
            # Visible devices must be set before GPUs have been initialized
            print(e)

Answer 7

Another possible solution, at the installation level, would be to look for the CPU-only variant: https://www.tensorflow.org/install/pip#package-location

In my case, this currently gives:

pip3 install https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow_cpu-2.2.0-cp38-cp38-win_amd64.whl

Just select the correct version. Bonus points for using a venv, as explained e.g. in this answer.


Difference between Variable and get_variable in TensorFlow

Question: Difference between Variable and get_variable in TensorFlow

As far as I know, Variable is the default operation for making a variable, and get_variable is mainly used for weight sharing.

On the one hand, there are some people suggesting using get_variable instead of the primitive Variable operation whenever you need a variable. On the other hand, I hardly see any use of get_variable in TensorFlow’s official documents and demos.

Thus I want to know some rules of thumb on how to correctly use these two mechanisms. Are there any “standard” principles?


Answer 0

I’d recommend to always use tf.get_variable(...) — it will make it way easier to refactor your code if you need to share variables at any time, e.g. in a multi-gpu setting (see the multi-gpu CIFAR example). There is no downside to it.

Pure tf.Variable is lower-level; at some point tf.get_variable() did not exist so some code still uses the low-level way.


Answer 1

tf.Variable is a class, and there are several ways to create tf.Variable including tf.Variable.__init__ and tf.get_variable.

tf.Variable.__init__: Creates a new variable with initial_value.

W = tf.Variable(<initial-value>, name=<optional-name>)

tf.get_variable: Gets an existing variable with these parameters or creates a new one. You can also use an initializer.

W = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None,
       regularizer=None, trainable=True, collections=None)

It’s very useful to use initializers such as xavier_initializer:

W = tf.get_variable("W", shape=[784, 256],
       initializer=tf.contrib.layers.xavier_initializer())

More information here.


Answer 2

I can find two main differences between one and the other:

  1. First is that tf.Variable will always create a new variable, whereas tf.get_variable gets an existing variable with specified parameters from the graph, and if it doesn’t exist, creates a new one.

  2. tf.Variable requires that an initial value be specified.

It is important to clarify that the function tf.get_variable prefixes the name with the current variable scope to perform reuse checks. For example:

with tf.variable_scope("one"):
    a = tf.get_variable("v", [1]) #a.name == "one/v:0"
with tf.variable_scope("one"):
    b = tf.get_variable("v", [1]) #ValueError: Variable one/v already exists
with tf.variable_scope("one", reuse = True):
    c = tf.get_variable("v", [1]) #c.name == "one/v:0"

with tf.variable_scope("two"):
    d = tf.get_variable("v", [1]) #d.name == "two/v:0"
    e = tf.Variable(1, name = "v", expected_shape = [1]) #e.name == "two/v_1:0"

assert(a is c)  #Assertion is true, they refer to the same object.
assert(a is d)  #AssertionError: they are different objects
assert(d is e)  #AssertionError: they are different objects

The last assertion error is interesting: Two variables with the same name under the same scope are supposed to be the same variable. But if you test the names of variables d and e you will realize that Tensorflow changed the name of variable e:

d.name   #d.name == "two/v:0"
e.name   #e.name == "two/v_1:0"

回答 3

另一个不同之处在于:一个在 ('__variable_store',) 集合中,而另一个不在。

请查看源代码

def _get_default_variable_store():
  store = ops.get_collection(_VARSTORE_KEY)
  if store:
    return store[0]
  store = _VariableStore()
  ops.add_to_collection(_VARSTORE_KEY, store)
  return store

让我说明一下:

import tensorflow as tf
from tensorflow.python.framework import ops

embedding_1 = tf.Variable(tf.constant(1.0, shape=[30522, 1024]), name="word_embeddings_1", dtype=tf.float32) 
embedding_2 = tf.get_variable("word_embeddings_2", shape=[30522, 1024])

graph = tf.get_default_graph()
collections = graph.collections

for c in collections:
    stores = ops.get_collection(c)
    print('collection %s: ' % str(c))
    for k, store in enumerate(stores):
        try:
            print('\t%d: %s' % (k, str(store._vars)))
        except:
            print('\t%d: %s' % (k, str(store)))
    print('')

输出:

collection ('__variable_store',): 
	0: {'word_embeddings_2': <tf.Variable 'word_embeddings_2:0' shape=(30522, 1024) dtype=float32_ref>}

Another difference is that one is in the ('__variable_store',) collection while the other is not.

Please see the source code:

def _get_default_variable_store():
  store = ops.get_collection(_VARSTORE_KEY)
  if store:
    return store[0]
  store = _VariableStore()
  ops.add_to_collection(_VARSTORE_KEY, store)
  return store

Let me illustrate that:

import tensorflow as tf
from tensorflow.python.framework import ops

embedding_1 = tf.Variable(tf.constant(1.0, shape=[30522, 1024]), name="word_embeddings_1", dtype=tf.float32) 
embedding_2 = tf.get_variable("word_embeddings_2", shape=[30522, 1024])

graph = tf.get_default_graph()
collections = graph.collections

for c in collections:
    stores = ops.get_collection(c)
    print('collection %s: ' % str(c))
    for k, store in enumerate(stores):
        try:
            print('\t%d: %s' % (k, str(store._vars)))
        except:
            print('\t%d: %s' % (k, str(store)))
    print('')

The output:

collection ('__variable_store',): 
	0: {'word_embeddings_2': <tf.Variable 'word_embeddings_2:0' shape=(30522, 1024) dtype=float32_ref>}


我可以在GPU上运行Keras模型吗?

问题:我可以在GPU上运行Keras模型吗?

我正在运行Keras模型,提交截止日期为36小时,如果我在cpu上训练我的模型大约需要50个小时,是否可以在gpu上运行Keras?

我正在使用Tensorflow后端,并在未安装anaconda的Jupyter笔记本上运行它。

I’m running a Keras model, with a submission deadline of 36 hours, if I train my model on the cpu it will take approx 50 hours, is there a way to run Keras on gpu?

I’m using Tensorflow backend and running it on my Jupyter notebook, without anaconda installed.


回答 0

是的,您可以在 GPU 上运行 Keras 模型。但需要先检查几件事:

  1. 您的系统具有 GPU(Nvidia 的,因为 AMD 尚不支持)
  2. 您已经安装了 GPU 版本的 Tensorflow
  3. 您已安装 CUDA(参见安装说明)
  4. 验证 Tensorflow 是否正在使用 GPU(检查 GPU 是否正常工作)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

要么

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

输出将是这样的:

[
  name: "/cpu:0"device_type: "CPU",
  name: "/gpu:0"device_type: "GPU"
]

完成所有这些操作后,您的模型将在GPU上运行:

要检查keras(> = 2.1.1)是否使用GPU:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()

祝一切顺利。

Yes, you can run Keras models on a GPU. A few things you will have to check first:

  1. Your system has a GPU (Nvidia, as AMD doesn’t work yet)
  2. You have installed the GPU version of Tensorflow
  3. You have installed CUDA (see the installation instructions)
  4. Verify that Tensorflow is running with the GPU (check that the GPU is working)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

OR

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

output will be something like this:

[
  name: "/cpu:0"device_type: "CPU",
  name: "/gpu:0"device_type: "GPU"
]

Once all this is done your model will run on GPU:

To Check if keras(>=2.1.1) is using GPU:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()

All the best.


回答 1

当然。我想您已经安装了TensorFlow for GPU。

导入keras后,需要添加以下块。我正在使用具有56核心cpu和gpu的计算机。

import keras
import tensorflow as tf


config = tf.ConfigProto( device_count = {'GPU': 1 , 'CPU': 56} ) 
sess = tf.Session(config=config) 
keras.backend.set_session(sess)

当然,这种用法会强制执行我的计算机的最大限制。您可以减少cpu和gpu消耗值。

Sure. I suppose that you have already installed TensorFlow for GPU.

You need to add the following block after importing keras. I am working on a machine which has a 56-core CPU and a GPU.

import keras
import tensorflow as tf


config = tf.ConfigProto( device_count = {'GPU': 1 , 'CPU': 56} ) 
sess = tf.Session(config=config) 
keras.backend.set_session(sess)

Of course, this usage enforces my machine’s maximum limits. You can decrease the CPU and GPU consumption values.


回答 2

2.0 兼容答案:虽然上面提到的答案详细说明了如何在 Keras 模型上使用 GPU,但我想说明在 Tensorflow 2.0 中如何实现。

要知道有多少个GPU可用,我们可以使用以下代码:

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

要找出您的操作和张量被分配到哪些设备,请将 tf.debugging.set_log_device_placement(True) 作为程序的第一条语句。

启用设备放置日志记录将导致打印任何Tensor分配或操作。例如,运行以下代码:

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

给出如下所示的输出:

Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor([[22. 28.] [49. 64.]], shape=(2, 2), dtype=float32)

有关更多信息,请参考此链接

2.0 Compatible Answer: While the above-mentioned answers explain in detail how to use a GPU with a Keras model, I want to explain how it can be done for Tensorflow Version 2.0.

To know how many GPUs are available, we can use the below code:

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

To find out which devices your operations and tensors are assigned to, put tf.debugging.set_log_device_placement(True) as the first statement of your program.

Enabling device placement logging causes any Tensor allocations or operations to be printed. For example, running the below code:

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

gives the Output shown below:

Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor([[22. 28.] [49. 64.]], shape=(2, 2), dtype=float32)

For more information, refer this link


回答 3

当然。如果您在 Tensorflow 或 CNTK 后端上运行,代码将默认在 GPU 设备上运行。但如果使用 Theano 后端,则可以使用以下

Theano 标志:

THEANO_FLAGS=device=gpu,floatX=float32 python my_keras_script.py

Of course. If you are running on the Tensorflow or CNTK backends, your code will run on your GPU devices by default. But if you use the Theano backend, you can use the following

Theano flags:

“THEANO_FLAGS=device=gpu,floatX=float32 python my_keras_script.py”


回答 4

在任务管理器中查看您的脚本是否在使用 GPU。如果不是,请检查您的 CUDA 版本是否与所用的 tensorflow 版本匹配,正如其他答案已经建议的那样。

此外,要让 tensorflow 使用 GPU,还需要与 CUDA 版本匹配的 CUDA DNN 库。从此处下载/解压,并将 DLL(例如 cudnn64_7.dll)放入 CUDA 的 bin 文件夹(例如 C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin)。

See if your script is running on the GPU in Task Manager. If not, check whether your CUDA version is the right one for the tensorflow version you are using, as the other answers have already suggested.

Additionally, a proper CUDA DNN library for the CUDA version is required to run GPU with tensorflow. Download/extract it from here and put the DLL (e.g., cudnn64_7.dll) into CUDA bin folder (e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin).


在Tensorflow中,获取图中所有张量的名称

问题:在Tensorflow中,获取图中所有张量的名称

我正在使用 Tensorflow 和 skflow 创建神经网络。由于某种原因,我想获取给定输入下某些内部张量的值,因此我使用 myClassifier.get_layer_value(input, "tensorName"),其中 myClassifier 是一个 skflow.estimators.TensorFlowEstimator。

但是,我发现很难找到张量名称的正确语法,即使知道它的名称也很困难(而且我对操作和张量感到困惑),因此我使用张量板来绘制图形并寻找名称。

有没有一种方法可以在不使用张量板的情况下枚举图中的所有张量?

I am creating neural nets with Tensorflow and skflow; for some reason I want to get the values of some inner tensors for a given input, so I am using myClassifier.get_layer_value(input, "tensorName"), myClassifier being a skflow.estimators.TensorFlowEstimator.

However, I find it difficult to find the correct syntax of the tensor name, even knowing its name (and I’m getting confused between operation and tensors), so I’m using tensorboard to plot the graph and look for the name.

Is there a way to enumerate all the tensors in a graph without using tensorboard?


回答 0

你可以做

[n.name for n in tf.get_default_graph().as_graph_def().node]

另外,如果要在 IPython 笔记本中进行原型开发,可以直接在笔记本中显示计算图,参见 Alexander 的 Deep Dream 笔记本中的 show_graph 函数。

You can do

[n.name for n in tf.get_default_graph().as_graph_def().node]

Also, if you are prototyping in an IPython notebook, you can show the graph directly in notebook, see show_graph function in Alexander’s Deep Dream notebook


回答 1

通过使用 get_operations,有一种比 Yaroslav 的回答稍快的方法。这是一个简单的示例:

import tensorflow as tf

a = tf.constant(1.3, name='const_a')
b = tf.Variable(3.1, name='variable_b')
c = tf.add(a, b, name='addition')
d = tf.multiply(c, a, name='multiply')

for op in tf.get_default_graph().get_operations():
    print(str(op.name))

There is a way to do it slightly faster than in Yaroslav’s answer by using get_operations. Here is a quick example:

import tensorflow as tf

a = tf.constant(1.3, name='const_a')
b = tf.Variable(3.1, name='variable_b')
c = tf.add(a, b, name='addition')
d = tf.multiply(c, a, name='multiply')

for op in tf.get_default_graph().get_operations():
    print(str(op.name))

回答 2

我将尝试总结答案:

要获取所有节点(类型tensorflow.core.framework.node_def_pb2.NodeDef):

all_nodes = [n for n in tf.get_default_graph().as_graph_def().node]

要获取所有操作(类型tensorflow.python.framework.ops.Operation):

all_ops = tf.get_default_graph().get_operations()

要获取所有变量(类型tensorflow.python.ops.resource_variable_ops.ResourceVariable):

all_vars = tf.global_variables()

获取所有张量(类型tensorflow.python.framework.ops.Tensor

all_tensors = [tensor for op in tf.get_default_graph().get_operations() for tensor in op.values()]

I’ll try to summarize the answers:

To get all nodes: (type tensorflow.core.framework.node_def_pb2.NodeDef)

all_nodes = [n for n in tf.get_default_graph().as_graph_def().node]

To get all ops: (type tensorflow.python.framework.ops.Operation)

all_ops = tf.get_default_graph().get_operations()

To get all variables: (type tensorflow.python.ops.resource_variable_ops.ResourceVariable)

all_vars = tf.global_variables()

To get all tensors: (type tensorflow.python.framework.ops.Tensor)

all_tensors = [tensor for op in tf.get_default_graph().get_operations() for tensor in op.values()]

To get the graph in Tensorflow 2, instead of tf.get_default_graph() you need to instantiate a tf.function first and access the graph attribute, for example:

graph = func.get_concrete_function().graph

where func is a tf.function
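
As a minimal sketch (the function body and input signature here are illustrative, not from the answer):

import tensorflow as tf

@tf.function
def func(x):
    return tf.matmul(x, x)

# Trace the function for a concrete input signature, then inspect its graph.
graph = func.get_concrete_function(
    tf.TensorSpec(shape=(2, 2), dtype=tf.float32)).graph
print([op.name for op in graph.get_operations()])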


回答 3

tf.all_variables() 可以为您获取所需的信息。

此外,今天在 TensorFlow Learn 中的一个提交在 estimator 中提供了 get_variable_names 函数,您可以用它轻松检索所有变量名称。

tf.all_variables() can get you the information you want.

Also, this commit made today in TensorFlow Learn that provides a function get_variable_names in estimator that you can use to retrieve all variable names easily.


回答 4

我认为这样做也可以:

print(tf.contrib.graph_editor.get_tensors(tf.get_default_graph()))

但是,与萨尔瓦多和雅罗斯拉夫的答案相比,我不知道哪个更好。

I think this will do too:

print(tf.contrib.graph_editor.get_tensors(tf.get_default_graph()))

But compared with Salvador’s and Yaroslav’s answers, I don’t know which one is better.


回答 5

接受的答案仅会为您提供带有名称的字符串列表。我更喜欢另一种方法,它使您(几乎)直接访问张量:

graph = tf.get_default_graph()
list_of_tuples = [op.values() for op in graph.get_operations()]

list_of_tuples现在包含每个张量,每个张量都在一个元组中。您还可以对其进行调整以直接获得张量:

graph = tf.get_default_graph()
list_of_tuples = [op.values()[0] for op in graph.get_operations()]

The accepted answer only gives you a list of strings with the names. I prefer a different approach, which gives you (almost) direct access to the tensors:

graph = tf.get_default_graph()
list_of_tuples = [op.values() for op in graph.get_operations()]

list_of_tuples now contains every tensor, each within a tuple. You could also adapt it to get the tensors directly:

graph = tf.get_default_graph()
list_of_tuples = [op.values()[0] for op in graph.get_operations()]

回答 6

由于OP要求张量的列表而不是操作/节点的列表,因此代码应略有不同:

graph = tf.get_default_graph()    
tensors_per_node = [node.values() for node in graph.get_operations()]
tensor_names = [tensor.name for tensors in tensors_per_node for tensor in tensors]

Since the OP asked for the list of the tensors instead of the list of operations/nodes, the code should be slightly different:

graph = tf.get_default_graph()    
tensors_per_node = [node.values() for node in graph.get_operations()]
tensor_names = [tensor.name for tensors in tensors_per_node for tensor in tensors]

回答 7

先前的答案很好,我只想分享我编写的从图中选择张量的实用函数:

def get_graph_op(graph, and_conds=None, op='and', or_conds=None):
    """Selects nodes' names in the graph if:
    - The name contains all items in and_conds
    - OR/AND depending on op
    - The name contains any item in or_conds

    Condition starting with a "!" are negated.
    Returns all ops if no optional arguments is given.

    Args:
        graph (tf.Graph): The graph containing sought tensors
        and_conds (list(str)), optional): Defaults to None.
            "and" conditions
        op (str, optional): Defaults to 'and'. 
            How to link the and_conds and or_conds:
            with an 'and' or an 'or'
        or_conds (list(str), optional): Defaults to None.
            "or conditions"

    Returns:
        list(str): list of relevant tensor names
    """
    assert op in {'and', 'or'}

    if and_conds is None:
        and_conds = ['']
    if or_conds is None:
        or_conds = ['']

    node_names = [n.name for n in graph.as_graph_def().node]

    ands = {
        n for n in node_names
        if all(
            cond in n if '!' not in cond
            else cond[1:] not in n
            for cond in and_conds
        )}

    ors = {
        n for n in node_names
        if any(
            cond in n if '!' not in cond
            else cond[1:] not in n
            for cond in or_conds
        )}

    if op == 'and':
        return [
            n for n in node_names
            if n in ands.intersection(ors)
        ]
    elif op == 'or':
        return [
            n for n in node_names
            if n in ands.union(ors)
        ]

因此,如果您的图中包含以下操作:

['model/classifier/dense/kernel',
'model/classifier/dense/kernel/Assign',
'model/classifier/dense/kernel/read',
'model/classifier/dense/bias',
'model/classifier/dense/bias/Assign',
'model/classifier/dense/bias/read',
'model/classifier/dense/MatMul',
'model/classifier/dense/BiasAdd',
'model/classifier/ArgMax/dimension',
'model/classifier/ArgMax']

然后运行

get_graph_op(tf.get_default_graph(), ['dense', '!kernel'], 'or', ['Assign'])

返回:

['model/classifier/dense/kernel/Assign',
'model/classifier/dense/bias',
'model/classifier/dense/bias/Assign',
'model/classifier/dense/bias/read',
'model/classifier/dense/MatMul',
'model/classifier/dense/BiasAdd']

Previous answers are good, I’d just like to share a utility function I wrote to select Tensors from a graph:

def get_graph_op(graph, and_conds=None, op='and', or_conds=None):
    """Selects nodes' names in the graph if:
    - The name contains all items in and_conds
    - OR/AND depending on op
    - The name contains any item in or_conds

    Condition starting with a "!" are negated.
    Returns all ops if no optional arguments is given.

    Args:
        graph (tf.Graph): The graph containing sought tensors
        and_conds (list(str)), optional): Defaults to None.
            "and" conditions
        op (str, optional): Defaults to 'and'. 
            How to link the and_conds and or_conds:
            with an 'and' or an 'or'
        or_conds (list(str), optional): Defaults to None.
            "or conditions"

    Returns:
        list(str): list of relevant tensor names
    """
    assert op in {'and', 'or'}

    if and_conds is None:
        and_conds = ['']
    if or_conds is None:
        or_conds = ['']

    node_names = [n.name for n in graph.as_graph_def().node]

    ands = {
        n for n in node_names
        if all(
            cond in n if '!' not in cond
            else cond[1:] not in n
            for cond in and_conds
        )}

    ors = {
        n for n in node_names
        if any(
            cond in n if '!' not in cond
            else cond[1:] not in n
            for cond in or_conds
        )}

    if op == 'and':
        return [
            n for n in node_names
            if n in ands.intersection(ors)
        ]
    elif op == 'or':
        return [
            n for n in node_names
            if n in ands.union(ors)
        ]

So if you have a graph with ops:

['model/classifier/dense/kernel',
'model/classifier/dense/kernel/Assign',
'model/classifier/dense/kernel/read',
'model/classifier/dense/bias',
'model/classifier/dense/bias/Assign',
'model/classifier/dense/bias/read',
'model/classifier/dense/MatMul',
'model/classifier/dense/BiasAdd',
'model/classifier/ArgMax/dimension',
'model/classifier/ArgMax']

Then running

get_graph_op(tf.get_default_graph(), ['dense', '!kernel'], 'or', ['Assign'])

returns:

['model/classifier/dense/kernel/Assign',
'model/classifier/dense/bias',
'model/classifier/dense/bias/Assign',
'model/classifier/dense/bias/read',
'model/classifier/dense/MatMul',
'model/classifier/dense/BiasAdd']

回答 8

这对我有用:

for n in tf.get_default_graph().as_graph_def().node:
    print('\n',n)

This worked for me:

for n in tf.get_default_graph().as_graph_def().node:
    print('\n',n)

批量归一化和 Dropout 的顺序?

问题:批量归一化和 Dropout 的顺序?

最初的问题是针对 TensorFlow 实现的。不过,答案适用于一般实现。这个通用答案对 TensorFlow 来说也是正确答案。

在 TensorFlow 中使用批量归一化和 Dropout 时(特别是使用 contrib.layers 时),我需要关心它们的顺序吗?

如果我在 Dropout 之后紧接着使用批量归一化,似乎可能会出问题。例如,如果批量归一化的平移量(shift)在训练时学到的是训练输出中较大尺度的数值,而在测试时(没有 Dropout)同样的平移量被应用到较小尺度的数值上(小是因为补偿了更多的输出),那么这个平移量可能就不准了。TensorFlow 的批量归一化层会自动对此进行补偿吗?还是由于某种我没想到的原因,这种情况根本不会发生?

另外,将两者一起使用时还有其他需要注意的陷阱吗?例如,假设就上述问题而言我已按正确顺序使用它们(假设存在正确顺序),在多个连续层上同时使用批量归一化和 Dropout 会有麻烦吗?我没有立即看出问题,但我可能遗漏了什么。

非常感谢!

更新:

实验测试似乎表明顺序确实重要。我用同一个网络运行了两次,只是把批量归一化和 Dropout 的顺序对调。当 Dropout 在批量归一化之前时,验证损失似乎随着训练损失的下降而上升;在另一种顺序下,两者都在下降。但就我的情况而言,变化很慢,因此经过更多训练后情况可能会改变,这也只是一次单独的测试。仍然希望得到一个更明确、更有依据的答案。

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?

It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I’m missing?

Also, are there other pitfalls to look out for in when using these two together? For example, assuming I’m using them in the correct order in regards to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don’t immediately see a problem with that, but I might be missing something.

Thank you much!

UPDATE:

An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reverse. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down. They’re both going down in the other case. But in my case the movements are slow, so things may change after more training and it’s just a single test. A more definitive and informed answer would still be appreciated.


回答 0

在 Ioffe 和 Szegedy 2015 中,作者指出“我们希望确保对于任何参数值,网络始终以期望的分布产生激活”。因此,批量归一化层实际上是插在卷积层/全连接层之后、但在送入 ReLu(或任何其他种类的)激活之前。详情请观看此视频大约 53 分钟处。

至于 Dropout,我认为 Dropout 是在激活层之后应用的。在 Dropout 论文的图 3b 中,隐藏层 l 的丢弃因子/概率矩阵 r(l) 被应用于 y(l),其中 y(l) 是应用激活函数 f 之后的结果。

因此,总而言之,使用批量归一化和 Dropout 的顺序为:

-> CONV/FC -> BatchNorm -> ReLu(或其他激活)-> Dropout -> CONV/FC ->

In the Ioffe and Szegedy 2015, the authors state that “we would like to ensure that for any parameter values, the network always produces activations with the desired distribution”. So the Batch Normalization Layer is actually inserted right after a Conv Layer/Fully Connected Layer, but before feeding into ReLu (or any other kinds of) activation. See this video at around time 53 min for more details.

As far as dropout goes, I believe dropout is applied after activation layer. In the dropout paper figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to it on y(l), where y(l) is the result after applying activation function f.

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->
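
As an illustration, a minimal Keras sketch of that ordering might look like this (the layer sizes and 28x28 input are assumptions, not part of the answer):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, 3, padding="same", input_shape=(28, 28, 1)),
    layers.BatchNormalization(),   # BN right after the conv, before activation
    layers.Activation("relu"),
    layers.Dropout(0.25),          # dropout after the activation
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])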


回答 1

正如评论中所指出的,这里有一份关于层顺序的绝佳阅读资源。我浏览了那些评论,这是我在互联网上找到的关于此主题的最佳资源。

我的 2 美分:

Dropout 旨在完全阻断来自某些神经元的信息,以确保神经元不会共同适应。因此,批量归一化必须放在 Dropout 之后,否则您就会通过归一化统计量传递信息。

如果仔细想想,在典型的机器学习问题中,这正是我们不在全部数据上计算均值和标准差、然后再划分训练、测试和验证集的原因。我们先划分,然后在训练集上计算统计量,并用它们对验证和测试数据集进行归一化和中心化。

所以我建议方案 1(这考虑了 pseudomarvin 对已接受答案的评论):

-> CONV/FC -> ReLu(或其他激活)-> Dropout -> BatchNorm -> CONV/FC

而不是方案 2:

-> CONV/FC -> BatchNorm -> ReLu(或其他激活)-> Dropout -> CONV/FC ->(即已接受答案)

请注意,这意味着与方案 1 下的网络相比,方案 2 下的网络应表现出更多过拟合,但 OP 按问题中所述做了一些测试,结果支持方案 2。

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments and it is the best resource on this topic I have found on the internet.

My 2 cents:

Dropout is meant to block information from certain neurons completely to make sure the neurons do not co-adapt. So, the batch normalization has to be after dropout otherwise you are passing information through normalization statistics.

If you think about it, in typical ML problems, this is the reason we don’t compute mean and standard deviation over entire data and then split it into train, test and validation sets. We split and then compute the statistics over the train set and use them to normalize and center the validation and test datasets

so i suggest Scheme 1 (This takes pseudomarvin’s comment on accepted answer into consideration)

-> CONV/FC -> ReLu(or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC -> (as in the accepted answer)

Please note that this means that the network under Scheme 2 should show more over-fitting than the network under Scheme 1, but the OP ran some tests (as mentioned in the question) and they support Scheme 2.


回答 2

通常,(有 BN 时)直接去掉 Dropout 即可:

  • “BN 在某些情况下消除了对 Dropout 的需求,因为直观上 BN 提供了与 Dropout 类似的正则化收益”
  • “ResNet、DenseNet 等架构不使用 Dropout”

有关更多详细信息,请参见 @Haramoz 在评论中提到的这篇论文 [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift]。

Usually, Just drop the Dropout(when you have BN):

  • “BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively”
  • “Architectures like ResNet, DenseNet, etc. not using Dropout

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift] as already mentioned by @Haramoz in the comments.


回答 3

我找到了一篇解释 Dropout 与批量归一化(BN)之间不协调的论文。其核心思想是他们所谓的“方差偏移”(variance shift)。这是因为 Dropout 在训练和测试阶段的行为不同,从而改变了 BN 所学习的输入统计量。主要观点可以在这张取自该论文的图中找到。

在此笔记本中可以找到此效应的一个小演示。

I found a paper that explains the disharmony between Dropout and Batch Norm(BN). The key idea is what they call the “variance shift”. This is due to the fact that dropout has a different behavior between training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure which is taken from this paper.

A small demo for this effect can be found in this notebook.


回答 4

根据这篇研究论文,为了获得更好的性能,我们应该在应用 Dropout 之前使用 BN。

Based on the research paper, for better performance we should use BN before applying Dropout.


回答 5

正确的顺序为:卷积 > 归一化 > 激活 > Dropout > 池化

The correct order is: Conv > Normalization > Activation > Dropout > Pooling


回答 6

Conv – Activation – DropOut – BatchNorm – Pool -> 测试损失:0.04261355847120285

Conv – Activation – DropOut – Pool – BatchNorm -> 测试损失:0.050065308809280396

Conv – Activation – BatchNorm – Pool – DropOut -> 测试损失:0.04911309853196144

Conv – Activation – BatchNorm – DropOut – Pool -> 测试损失:0.06809622049331665

Conv – BatchNorm – Activation – DropOut – Pool -> 测试损失:0.038886815309524536

Conv – BatchNorm – Activation – Pool – DropOut -> 测试损失:0.04126095026731491

Conv – BatchNorm – DropOut – Activation – Pool -> 测试损失:0.05142546817660332

Conv – DropOut – Activation – BatchNorm – Pool -> 测试损失:0.04827788099646568

Conv – DropOut – Activation – Pool – BatchNorm -> 测试损失:0.04722036048769951

Conv – DropOut – BatchNorm – Activation – Pool -> 测试损失:0.03238215297460556


在 MNIST 数据集上训练(20 个 epoch),使用 2 个卷积模块(见下文),每个模块之后均接:

model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

卷积层的内核大小为 (3,3),使用默认填充,激活函数为 elu。池化为 (2,2) 的 MaxPooling。损失为 categorical_crossentropy,优化器为 adam。

相应的 Dropout 概率分别为 0.2 和 0.3。特征图的数量分别为 32 和 64。

编辑: 当我按照某些答案中的建议删除Dropout时,它收敛得比我使用BatchNorm Dropout 时更快,但泛化能力却较差。

Conv – Activation – DropOut – BatchNorm – Pool –> Test_loss: 0.04261355847120285

Conv – Activation – DropOut – Pool – BatchNorm –> Test_loss: 0.050065308809280396

Conv – Activation – BatchNorm – Pool – DropOut –> Test_loss: 0.04911309853196144

Conv – Activation – BatchNorm – DropOut – Pool –> Test_loss: 0.06809622049331665

Conv – BatchNorm – Activation – DropOut – Pool –> Test_loss: 0.038886815309524536

Conv – BatchNorm – Activation – Pool – DropOut –> Test_loss: 0.04126095026731491

Conv – BatchNorm – DropOut – Activation – Pool –> Test_loss: 0.05142546817660332

Conv – DropOut – Activation – BatchNorm – Pool –> Test_loss: 0.04827788099646568

Conv – DropOut – Activation – Pool – BatchNorm –> Test_loss: 0.04722036048769951

Conv – DropOut – BatchNorm – Activation – Pool –> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with

model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

The Convolutional layers have a kernel size of (3,3) and default padding, and the activation is elu. The Pooling is a MaxPooling with pool size (2,2). Loss is categorical_crossentropy and the optimizer is adam.

The corresponding Dropout probability is 0.2 or 0.3, respectively. The amount of feature maps is 32 or 64, respectively.

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had a worse generalization ability than when I use BatchNorm and Dropout.
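
For reference, a hedged sketch of what one such convolutional module might look like, using the best-scoring ordering above (Conv – DropOut – BatchNorm – Activation – Pool); the functional style and exact arguments are assumptions, not the author’s code:

from tensorflow import keras
from tensorflow.keras import layers

def conv_module(x, filters, dropout_rate):
    # kernel (3,3) with default padding, elu activation, (2,2) MaxPooling,
    # as described above
    x = layers.Conv2D(filters, (3, 3))(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    return layers.MaxPooling2D((2, 2))(x)

inputs = keras.Input(shape=(28, 28, 1))
x = conv_module(inputs, 32, 0.2)   # first module: 32 feature maps, dropout 0.2
x = conv_module(x, 64, 0.3)        # second module: 64 feature maps, dropout 0.3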


回答 7

ConV/FC – BN – Sigmoid/tanh – Dropout。如果激活函数是 Relu 或其他,归一化和 Dropout 的顺序取决于您的任务。

ConV/FC – BN – Sigmoid/tanh – Dropout. If the activation function is Relu or otherwise, the order of normalization and dropout depends on your task.


回答 8

我阅读了 https://stackoverflow.com/a/40295999/8625228 的答案和评论中推荐的论文。

从 Ioffe 和 Szegedy(2015)的角度看,网络结构中只使用 BN。Li 等(2018)给出了统计和实验分析,表明从业者在 BN 之前使用 Dropout 时会出现方差偏移。因此,Li 等(2018)建议在所有 BN 层之后再应用 Dropout。

从 Ioffe 和 Szegedy(2015)的角度看,BN 位于激活函数内部/之前。然而,Chen 等(2019)使用了结合 Dropout 和 BN 的 IC 层,并建议在 ReLU 之后使用 BN。

出于稳妥考虑,我在网络中只使用 Dropout 或只使用 BN。

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134

I read the recommended papers in the answer and comments from https://stackoverflow.com/a/40295999/8625228

From Ioffe and Szegedy (2015)’s point of view, only use BN in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From Ioffe and Szegedy (2015)’s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer which combines dropout and BN, and recommend using BN after ReLU.

To be safe, I use only Dropout or only BN in a network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.


TensorFlow 中的 strides 参数

问题:TensorFlow 中的 strides 参数

我想了解 tf.nn.avg_pool、tf.nn.max_pool、tf.nn.conv2d 中的 strides 参数。

文档反复说:

步幅(strides):长度 >= 4 的整数列表。输入张量每个维度上滑动窗口的步幅。

我的问题是:

  1. 4个以上的整数分别代表什么?
  2. 对于卷积网络,为什么必须有 strides[0] = strides[3] = 1?
  3. 此示例中,我们看到了tf.reshape(_X,shape=[-1, 28, 28, 1])。为什么是-1?

遗憾的是,文档中使用-1进行重塑的示例并不能很好地解释这种情况。

I am trying to understand the strides argument in tf.nn.avg_pool, tf.nn.max_pool, tf.nn.conv2d.

The documentation repeatedly says

strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor.

My questions are:

  1. What do each of the 4+ integers represent?
  2. Why must they have strides[0] = strides[3] = 1 for convnets?
  3. In this example we see tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1?

Sadly the examples in the docs for reshape using -1 don’t translate too well to this scenario.


回答 0

池化和卷积运算会在输入张量上滑动一个“窗口”。以 tf.nn.conv2d 为例:如果输入张量有 4 个维度:[batch, height, width, channels],则卷积在 height、width 两个维度的二维窗口上进行。

strides确定窗口在每个维度上的移动量。典型用法是将第一个(批处理)和最后一个(深度)跨度设置为1。

让我们使用一个非常具体的示例:在32×32灰度输入图像上运行2-d卷积。我说灰度是因为输入图像的深度为1,这有助于使其简单。让该图像看起来像这样:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

让我们在一个示例(批处理大小= 1)上运行2×2卷积窗口。我们给卷积的输出通道深度为8。

卷积的输入为shape=[1, 32, 32, 1]

如果指定strides=[1,1,1,1]padding=SAME,则滤波器的输出将是[1,32,32,8]。

过滤器将首先为以下内容创建输出:

F(00 01
  10 11)

然后针对:

F(01 02
  11 12)

等等。然后它将移至第二行,计算:

F(10, 11
  20, 21)

然后

F(11, 12
  21, 22)

如果将跨度指定为[1、2、2、1],则不会重叠窗口。它将计算:

F(00, 01
  10, 11)

然后

F(02, 03
  12, 13)

对于池化运算符,步幅的作用方式类似。

问题 2:为什么卷积网络的步幅是 [1, x, y, 1]

第一个是批处理:您通常不想跳过批处理中的示例,否则您不应该首先将它们包括在内。:)

最后一个是卷积的深度:出于相同的原因,您通常不想跳过输入。

conv2d运算符比较笼统,因此您可以创建卷积以使窗口沿其他维度滑动,但这在卷积网络中并不常见。典型用途是在空间上使用它们。

为什么要重塑为 -1?-1 是一个占位符,表示“根据需要进行调整,以匹配整个张量所需的大小”。这是使代码独立于输入批大小的一种方法,这样您更改管道时就不必在代码各处调整批大小。

The pooling and convolutional ops slide a “window” across the input tensor. Using tf.nn.conv2d as an example: If the input tensor has 4 dimensions: [batch, height, width, channels], then the convolution operates on a 2D window on the height, width dimensions.

strides determines how much the window shifts by in each of the dimensions. The typical use sets the first (the batch) and last (the depth) stride to 1.

Let’s use a very concrete example: Running a 2-d convolution over a 32×32 greyscale input image. I say greyscale because then the input image has depth=1, which helps keep it simple. Let that image look like this:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

Let’s run a 2×2 convolution window over a single example (batch size = 1). We’ll give the convolution an output channel depth of 8.

The input to the convolution has shape=[1, 32, 32, 1].

If you specify strides=[1,1,1,1] with padding=SAME, then the output of the filter will be [1, 32, 32, 8].

The filter will first create an output for:

F(00 01
  10 11)

And then for:

F(01 02
  11 12)

and so on. Then it will move to the second row, calculating:

F(10, 11
  20, 21)

then

F(11, 12
  21, 22)

If you specify a stride of [1, 2, 2, 1] it won’t do overlapping windows. It will compute:

F(00, 01
  10, 11)

and then

F(02, 03
  12, 13)

The stride operates similarly for the pooling operators.

Question 2: Why strides [1, x, y, 1] for convnets

The first 1 is the batch: You don’t usually want to skip over examples in your batch, or you shouldn’t have included them in the first place. :)

The last 1 is the depth of the convolution: You don’t usually want to skip inputs, for the same reason.

The conv2d operator is more general, so you could create convolutions that slide the window along other dimensions, but that’s not a typical use in convnets. The typical use is to use them spatially.

Why reshape to -1 -1 is a placeholder that says “adjust as necessary to match the size needed for the full tensor.” It’s a way of making the code be independent of the input batch size, so that you can change your pipeline and not have to adjust the batch size everywhere in the code.
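
The shape arithmetic above can be checked directly; a small sketch (the random input and filter values are illustrative):

import tensorflow as tf

x = tf.random.normal([1, 32, 32, 1])   # [batch, height, width, channels]
w = tf.random.normal([2, 2, 1, 8])     # [filter_h, filter_w, in_ch, out_ch]

y1 = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
y2 = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding="SAME")
print(y1.shape)  # (1, 32, 32, 8) -- overlapping windows
print(y2.shape)  # (1, 16, 16, 8) -- stride 2 halves height and width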


回答 1

输入是4维的,格式为: [batch_size, image_rows, image_cols, number_of_colors]

通常,跨度定义了应用操作之间的重叠。对于conv2d,它指定卷积滤波器的连续应用之间的距离是多少。特定维度中的值1表示我们在每行/列应用运算符,值2表示每秒钟/以此类推。

关于1)对于卷积重要的值是2nd和3rd,它们表示卷积滤波器在沿行和列的应用中的重叠。值[1,2,2,1]表示我们要在每隔第二行和第二列上应用过滤器。

关于2)我不知道技术限制(可能是CuDNN要求),但通常人们会沿行或列尺寸使用步幅。在批处理大小上执行此操作不一定有意义。不确定最后一个尺寸。

关于3)为其中一个维设置-1表示“为第一维设置值,以使张量中的元素总数不变”。在我们的例子中,-1将等于batch_size。

The inputs are 4 dimensional and are of form: [batch_size, image_rows, image_cols, number_of_colors]

Strides, in general, define an overlap between applying operations. In the case of conv2d, it specifies what is the distance between consecutive applications of convolutional filters. The value of 1 in a specific dimension means that we apply the operator at every row/col, the value of 2 means every second, and so on.

Re 1) The values that matter for convolutions are 2nd and 3rd and they represent the overlap in the application of the convolutional filters along rows and columns. The value of [1, 2, 2, 1] says that we want to apply the filters on every second row and column.

Re 2) I don’t know the technical limitations (might be CuDNN requirement) but typically people use strides along the rows or columns dimensions. It doesn’t necessarily make sense to do it over batch size. Not sure of the last dimension.

Re 3) Setting -1 for one of the dimension means, “set the value for the first dimension so that the total number of elements in the tensor is unchanged”. In our case, the -1 will be equal to the batch_size.


回答 2

让我们从1-dim情况下的步幅开始。

假设你的 input = [1, 0, 2, 3, 0, 1, 1],kernel = [2, 1, 3],则卷积的结果是 [8, 11, 7, 9, 4]。它是通过在输入上滑动内核、进行逐元素相乘并求和计算出来的。像这样:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 +1 * 3
  • 4 = 0 * 2 +1 * 1 +1 * 3

在这里,我们每次滑动一个元素,但没有什么阻止您使用其他任何数字。这个数字就是您的步幅(stride)。您可以把它理解为:对步幅为 1 的卷积结果进行下采样,只取每第 s 个结果。

知道输入大小 i、内核大小 k、步幅 s 和填充 p,就可以轻松计算出卷积的输出大小:

o = ⌈(i + 2p − k + 1) / s⌉

这里 ⌈ ⌉ 表示向上取整运算。对于池化层,s = 1。


N 维的情况。

知道了一维情况的数学原理之后,只要明白每个维度是相互独立的,N 维的情况就很容易了:您只需在每个维度上分别滑动。这是一个二维示例。请注意,并非所有维度都必须使用相同的步幅。因此,对于 N 维输入/内核,您应提供 N 个步幅。


因此,现在很容易回答您的所有问题:

  1. 4 个以上的整数分别代表什么?conv2d、pool 的文档告诉您,此列表表示每个维度上的步幅。注意,步幅列表的长度与内核张量的秩相同。
  2. 为什么卷积网络必须有 strides[0] = strides[3] = 1?第一个维度是批大小,最后一个是通道。跳过批或通道都没有意义,所以将它们设为 1。对于宽度/高度,您可以跳过一些内容,这就是它们可能不为 1 的原因。
  3. tf.reshape(_X, shape=[-1, 28, 28, 1])。为什么是 -1?tf.reshape 的文档已有说明:

    如果形状的某个分量是特殊值 -1,则会计算该维度的大小,以使总大小保持不变。特别地,形状 [-1] 会展平为一维。形状中最多只能有一个分量为 -1。

Let’s start with what stride does in 1-dim case.

Let’s assume your input = [1, 0, 2, 3, 0, 1, 1] and kernel = [2, 1, 3] the result of the convolution is [8, 11, 7, 9, 4], which is calculated by sliding your kernel over the input, performing element-wise multiplication and summing everything. Like this:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 + 1 * 3
  • 4 = 0 * 2 + 1 * 1 + 1 * 3

Here we slide by one element, but nothing stops you by using any other number. This number is your stride. You can think about it as downsampling the result of the 1-strided convolution by just taking every s-th result.

Knowing the input size i, kernel size k, stride s and padding p you can easily calculate the output size of the convolution as:

o = ⌈(i + 2p − k + 1) / s⌉

Here ⌈ ⌉ denotes the ceiling operation. For a pooling layer s = 1.


N-dim case.

Knowing the math for a 1-dim case, n-dim case is easy once you see that each dim is independent. So you just slide each dimension separately. Here is an example for 2-d. Notice that you do not need to have the same stride at all the dimensions. So for an N-dim input/kernel you should provide N strides.


So now it is easy to answer all your questions:

  1. What do each of the 4+ integers represent? The docs for conv2d and pool tell you that this list represents the strides along each dimension. Notice that the length of the strides list is the same as the rank of the kernel tensor.
  2. Why must they have strides[0] = strides[3] = 1 for convnets? The first dimension is the batch size, the last is channels. There is no point in skipping either the batch or the channel dimension, so you make them 1. For width/height you can skip something, and that’s why they might not be 1.
  3. tf.reshape(_X, shape=[-1, 28, 28, 1]). Why -1? tf.reshape has it covered for you:

    If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.
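
A quick sketch of that last point (the batch size 50 is arbitrary):

import tensorflow as tf

batch = tf.zeros([50, 784])                    # 50 flattened MNIST rows
images = tf.reshape(batch, [-1, 28, 28, 1])    # -1 is inferred as 50
print(images.shape)  # (50, 28, 28, 1)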


回答 3

@dga 的解释非常出色,它对我的帮助让我感激不尽。同样地,我想分享我关于 stride 在 3D 卷积中如何工作的发现。

根据conv3d 上的TensorFlow文档,输入的形状必须按以下顺序排列:

[batch, in_depth, in_height, in_width, in_channels]

让我们使用一个示例从最右到左解释变量。假设输入形状为 input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

以下是有关如何使用步幅的摘要文档。

步幅(strides):长度 >= 5 的整数列表;长度为 5 的一维张量。输入每个维度上滑动窗口的步幅。必须满足 strides[0] = strides[4] = 1。

正如许多资料所指出的,步幅仅表示窗口或内核每次跳离最近元素(无论是数据帧还是像素)的步数(顺带转述一下)。

从以上文档可以看出,3D 中的步幅形如 strides = (1, X, Y, Z, 1)。

文档强调了这一点strides[0] = strides[4] = 1

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] 表示我们在归并的帧上应跳过多少。例如,如果我们有 16 帧,X=1 表示使用每一帧,X=2 表示每隔一帧使用一次,以此类推。

strides[y] 和 strides[z] 遵循 @dga 的解释,因此我不再重复那部分。

不过,在 keras 中,您只需指定一个包含 3 个整数的元组/列表,指定卷积沿每个空间维度的步幅,即 strides[x]、strides[y] 和 strides[z];strides[0] 和 strides[4] 已默认为 1。

我希望有人觉得这有帮助!

@dga has done a wonderful job explaining, and I can’t be thankful enough for how helpful it has been. In like manner, I would like to share my findings on how stride works in 3D convolution.

According to the TensorFlow documentation on conv3d, the shape of the input must be in this order:

[batch, in_depth, in_height, in_width, in_channels]

Let’s explain the variables from the extreme right to the left using an example. Assuming the input shape is input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

Below is a summary documentation for how stride is used.

strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1

As indicated in many works, strides simply mean how many steps away a window or kernel jumps away from the closest element, be it a data frame or pixel (this is paraphrased by the way).

From the above documentation, a stride in 3D will look like this strides = (1,X,Y,Z,1).

The documentation emphasizes that strides[0] = strides[4] = 1.

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] means how many skips we should make in the lumped frames. So for example, if we have 16 frames, X=1 means use every frame, X=2 means use every second frame, and so on.

strides[y] and strides[z] follow the explanation by @dga so I will not redo that part.

In keras however, you only need to specify a tuple/list of 3 integers, specifying the strides of the convolution along each spatial dimension, where spatial dimension is stride[x], strides[y] and strides[z]. strides[0] and strides[4] is already defaulted to 1.

I hope someone finds this helpful!
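
To connect this to the Keras point above, a minimal sketch (the layer sizes are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

# Conv3D takes only the three spatial strides; the batch and channel
# strides are implicitly 1.
model = keras.Sequential([
    layers.Conv3D(8, kernel_size=(3, 3, 3), strides=(1, 2, 2),
                  input_shape=(16, 112, 112, 3)),  # [frames, H, W, channels]
])
model.summary()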


TensorFlow中的tf.app.flags的目的是什么?

问题:TensorFlow中的tf.app.flags的目的是什么?

我在Tensorflow中阅读一些示例代码,发现以下代码

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
flags.DEFINE_integer('max_steps', 2000, 'Number of steps to run trainer.')
flags.DEFINE_integer('hidden1', 128, 'Number of units in hidden layer 1.')
flags.DEFINE_integer('hidden2', 32, 'Number of units in hidden layer 2.')
flags.DEFINE_integer('batch_size', 100, 'Batch size.  '
                 'Must divide evenly into the dataset sizes.')
flags.DEFINE_string('train_dir', 'data', 'Directory to put the training data.')
flags.DEFINE_boolean('fake_data', False, 'If true, uses fake data '
                 'for unit testing.')

tensorflow/tensorflow/g3doc/tutorials/mnist/fully_connected_feed.py

但我找不到有关 tf.app.flags 这种用法的任何文档。

我发现这些标志的实现位于 tensorflow/tensorflow/python/platform/default/_flags.py

显然,tf.app.flags 被以某种方式用于配置网络,那么为什么它不在 API 文档中?谁能解释一下这是怎么回事?

I am reading some example codes in Tensorflow, I found following code

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
flags.DEFINE_integer('max_steps', 2000, 'Number of steps to run trainer.')
flags.DEFINE_integer('hidden1', 128, 'Number of units in hidden layer 1.')
flags.DEFINE_integer('hidden2', 32, 'Number of units in hidden layer 2.')
flags.DEFINE_integer('batch_size', 100, 'Batch size.  '
                 'Must divide evenly into the dataset sizes.')
flags.DEFINE_string('train_dir', 'data', 'Directory to put the training data.')
flags.DEFINE_boolean('fake_data', False, 'If true, uses fake data '
                 'for unit testing.')

in tensorflow/tensorflow/g3doc/tutorials/mnist/fully_connected_feed.py

But I can’t find any docs about this usage of tf.app.flags.

And I found the implementation of this flags is in the tensorflow/tensorflow/python/platform/default/_flags.py

Obviously, this tf.app.flags is somehow used to configure a network, so why is it not in the API docs? Can anyone explain what is going on here?


回答 0

tf.app.flags 模块目前是 python-gflags 的一个瘦包装,因此该项目的文档是学习如何使用它的最佳资源。

请注意,该模块目前只是为了方便编写演示应用而打包的,从技术上讲它不是公共 API 的一部分,因此将来可能会更改。

我们建议您使用 argparse(它实现了 python-gflags 功能的一个子集)或任何您喜欢的库来实现自己的标志解析。

编辑:tf.app.flags 模块实际上并不是用 python-gflags 实现的,但它使用了类似的 API。

The tf.app.flags module is presently a thin wrapper around python-gflags, so the documentation for that project is the best resource for how to use it.

Note that this module is currently packaged as a convenience for writing demo apps, and is not technically part of the public API, so it may change in future.

We recommend that you implement your own flag parsing using argparse (which implements a subset of the functionality in python-gflags) or whatever library you prefer.

EDIT: The tf.app.flags module is not in fact implemented using python-gflags, but it uses a similar API.
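
For comparison, a minimal argparse sketch of the flags from the question (the names and defaults mirror the snippet above; the rest is illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.01,
                    help='Initial learning rate.')
parser.add_argument('--max_steps', type=int, default=2000,
                    help='Number of steps to run trainer.')
FLAGS, unparsed = parser.parse_known_args()
print(FLAGS.learning_rate)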


回答 1

tf.app.flags模块是Tensorflow提供的功能,用于为Tensorflow程序实现命令行标志。例如,您遇到的代码将执行以下操作:

flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')

第一个参数定义标志的名称,第二个参数定义默认值,以防执行文件时未指定标志。

因此,如果运行以下命令:

$ python fully_connected_feed.py --learning_rate 1.00

那么学习率将设置为1.00,如果未指定该标志,则将保持0.01。

如这篇文章所述,之所以没有相关文档,可能是因为这是 Google 内部要求其开发人员使用的东西。

此外,如文章中所述,与 argparse 等其他 Python 包提供的标志功能相比,使用 Tensorflow 标志有多个优势(尤其是在处理 Tensorflow 模型时),其中最重要的是可以向代码提供 Tensorflow 特有的信息,例如要使用哪个 GPU。

The tf.app.flags module is a functionality provided by Tensorflow to implement command line flags for your Tensorflow program. As an example, the code you came across would do the following:

flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')

The first parameter defines the name of the flag while the second defines the default value in case the flag is not specified while executing the file.

So if you run the following:

$ python fully_connected_feed.py --learning_rate 1.00

then the learning rate is set to 1.00 and will remain 0.01 if the flag is not specified.

As mentioned in this article, the docs are probably not present because this might be something that Google requires internally for its developers to use.

Also, as mentioned in the post, there are several advantages of using Tensorflow flags over flag functionality provided by other Python packages such as argparse especially when dealing with Tensorflow models, the most important being that you can supply Tensorflow specific information to the code such as information about which GPU to use.
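
Putting the answer’s pieces together, a minimal runnable sketch of this pattern (TF1-era API; the main() body and printout are illustrative, not from the answer):

import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')

def main(_):
    print('learning_rate =', FLAGS.learning_rate)

if __name__ == '__main__':
    tf.app.run()   # parses the flags, then calls main()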


回答 2

在Google,他们使用标记系统来设置参数的默认值。它类似于argparse。他们使用自己的标记系统,而不是argparse或sys.argv。

资料来源:我以前在那里工作过。

At Google, they use flag systems to set default values for arguments. It’s similar to argparse. They use their own flag system instead of argparse or sys.argv.

Source: I worked there before.


回答 3

使用 tf.app.run() 时,可以借助 tf.app.flags 非常方便地在线程之间传递变量。请参阅此处以进一步了解 tf.app.flags 的用法。

When you use tf.app.run(), you can transfer variables very conveniently between threads using tf.app.flags. See this for further usage of tf.app.flags.


回答 4

经过多次尝试后,我发现以下代码可以打印所有 FLAGS 键及其实际值:

for key in tf.app.flags.FLAGS.flag_values_dict():

  print(key, FLAGS[key].value)

After trying many times, I found this prints all FLAGS keys as well as their actual values:

for key in tf.app.flags.FLAGS.flag_values_dict():

  print(key, FLAGS[key].value)

Tensorflow后端的Keras能否被迫随意使用CPU或GPU?

问题:Tensorflow后端的Keras能否被迫随意使用CPU或GPU?

我安装了带 Tensorflow 后端和 CUDA 的 Keras。我有时想按需强制 Keras 使用 CPU。在不于虚拟环境中单独安装仅 CPU 版 Tensorflow 的情况下,这能做到吗?如果可以,该怎么做?如果后端是 Theano,可以设置标志,但我没有听说过可以通过 Keras 访问 Tensorflow 标志。

I have Keras installed with the Tensorflow backend and CUDA. I’d like to sometimes on demand force Keras to use CPU. Can this be done without say installing a separate CPU-only Tensorflow in a virtual environment? If so how? If the backend were Theano, the flags could be set, but I have not heard of Tensorflow flags accessible via Keras.


回答 0

如果要强制Keras使用CPU

方式1

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""

在导入Keras / Tensorflow之前。

方式二

运行脚本为

$ CUDA_VISIBLE_DEVICES="" ./your_keras_code.py

也可以看看

  1. https://github.com/keras-team/keras/issues/152
  2. https://github.com/fchollet/keras/issues/4613

If you want to force Keras to use CPU

Way 1

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""

before Keras / Tensorflow is imported.

Way 2

Run your script as

$ CUDA_VISIBLE_DEVICES="" ./your_keras_code.py

See also

  1. https://github.com/keras-team/keras/issues/152
  2. https://github.com/fchollet/keras/issues/4613

回答 1

一个相当分离的方法是使用

import tensorflow as tf
from keras import backend as K

num_cores = 4

if GPU:
    num_GPU = 1
    num_CPU = 1
if CPU:
    num_CPU = 1
    num_GPU = 0

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}
                       )

session = tf.Session(config=config)
K.set_session(session)

在此处,通过布尔值 GPU 和 CPU,我们严格定义允许 Tensorflow 会话访问的 GPU 和 CPU 数量,以此指示要用 GPU 还是 CPU 来运行代码。变量 num_GPU 和 num_CPU 定义这个数量。num_cores 则通过 intra_op_parallelism_threads 和 inter_op_parallelism_threads 设置可供使用的 CPU 内核数。

intra_op_parallelism_threads 变量指示计算图中单个节点内的并行操作允许使用的线程数(intra)。而 inter_op_parallelism_threads 变量定义了跨计算图节点的并行操作可使用的线程数(inter)。

allow_soft_placement 允许在满足以下任一条件时在 CPU 上运行操作:

  1. 该操作没有 GPU 实现

  2. 没有已知或已注册的 GPU 设备

  3. 需要与来自 CPU 的其他输入放在一起

所有这些都在我的类的构造函数中、任何其他操作之前执行,并且可以与我使用的任何模型或其他代码完全分离。

注意:这要求安装 tensorflow-gpu 和 cuda/cudnn,因为提供了使用 GPU 的选项。


A rather separable way of doing this is to use

import tensorflow as tf
from keras import backend as K

num_cores = 4

if GPU:
    num_GPU = 1
    num_CPU = 1
if CPU:
    num_CPU = 1
    num_GPU = 0

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}
                       )

session = tf.Session(config=config)
K.set_session(session)

Here, with booleans GPU and CPU, we indicate whether we would like to run our code with the GPU or CPU by rigidly defining the number of GPUs and CPUs the Tensorflow session is allowed to access. The variables num_GPU and num_CPU define this value. num_cores then sets the number of CPU cores available for usage via intra_op_parallelism_threads and inter_op_parallelism_threads.

The intra_op_parallelism_threads variable dictates the number of threads a parallel operation in a single node in the computation graph is allowed to use (intra), while the inter_op_parallelism_threads variable defines the number of threads accessible for parallel operations across the nodes of the computation graph (inter).

allow_soft_placement allows for operations to be run on the CPU if any of the following criterion are met:

  1. there is no GPU implementation for the operation

  2. there are no GPU devices known or registered

  3. there is a need to co-locate with other inputs from the CPU

All of this is executed in the constructor of my class before any other operations, and is completely separable from any model or other code I use.

Note: This requires tensorflow-gpu and cuda/cudnn to be installed because the option is given to use a GPU.



回答 2

这对我有用(win10),在导入keras之前放置:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

This worked for me (win10), place before you import keras:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

回答 3

只需导入 tensorflow 并使用 keras,就这么简单。

import tensorflow as tf
# your code here
with tf.device('/gpu:0'):
    model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Just import tensorflow and use keras, it’s that easy.

import tensorflow as tf
# your code here
with tf.device('/gpu:0'):
    model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

回答 4

按照 keras 教程,您可以像在常规 tensorflow 中一样,直接使用 tf.device 作用域:

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on GPU:0

with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on CPU:0

As per keras tutorial, you can simply use the same tf.device scope as in regular tensorflow:

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on GPU:0

with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on CPU:0

回答 5

我花了一些时间才弄清楚这一点。Thoma 的答案并不完整。假设您的程序是 test.py,您想用 gpu0 运行该程序,并让其他 GPU 保持空闲。

你应该写 CUDA_VISIBLE_DEVICES=0 python test.py

注意是 DEVICES 不是 DEVICE。

I just spent some time figure it out. Thoma’s answer is not complete. Say your program is test.py, you want to use gpu0 to run this program, and keep other gpus free.

You should write CUDA_VISIBLE_DEVICES=0 python test.py

Notice it’s DEVICES not DEVICE


回答 6

对于使用 PyCharm 并想强制使用 CPU 的人,可以在“运行/调试”配置的“环境变量”下添加以下行:

<OTHER_ENVIRONMENT_VARIABLES>;CUDA_VISIBLE_DEVICES=-1

For people working on PyCharm, and for forcing CPU, you can add the following line in the Run/Debug configuration, under Environment variables:

<OTHER_ENVIRONMENT_VARIABLES>;CUDA_VISIBLE_DEVICES=-1