Question: How do I get the currently available GPUs in TensorFlow?

I have a plan to use distributed TensorFlow, and I saw that TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have zero, one, or more GPUs, and I want to run my TensorFlow graph on the GPUs of as many machines as possible.

I found that when running tf.Session(), TensorFlow prints information about the GPUs in log messages like the ones below:

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

My question is: how do I get information about the currently available GPUs from TensorFlow? I can get the loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way. I also might restrict the visible GPUs intentionally using the CUDA_VISIBLE_DEVICES environment variable, so I don't want a way of getting GPU information from the OS kernel.

In short, I want a function like tf.get_available_gpus() that will return ['/gpu:0', '/gpu:1'] if there are two GPUs available in the machine. How can I implement this?


Answer 0

There is an undocumented method called device_lib.list_local_devices() that enables you to list the devices available in the local process. (N.B. As an undocumented method, this is subject to backwards incompatible changes.) The function returns a list of DeviceAttributes protocol buffer objects. You can extract a list of string device names for the GPU devices as follows:

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

Note that (at least up to TensorFlow 1.4), calling device_lib.list_local_devices() will run some initialization code that, by default, allocates all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_memory_fraction, or with allow_growth=True, to prevent all of the memory from being allocated. See this question for more details.
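
For example, a minimal sketch of such a session in TF 1.x (the 0.05 fraction is just an illustrative value):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Create a session first so device initialization does not grab all GPU
# memory; allow_growth makes TensorFlow allocate memory on demand.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of each GPU's memory that may be used:
# config.gpu_options.per_process_gpu_memory_fraction = 0.05
sess = tf.Session(config=config)

# Now listing devices will not allocate all of the GPU memory.
print([x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU'])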


Answer 1

You can check the full device list using the following code:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()
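
In an interactive session this prints a list of DeviceAttributes protos. A small sketch, if you just want each device's name and type:

from tensorflow.python.client import device_lib

# Print the name and type of every device visible to the local process.
for d in device_lib.list_local_devices():
    print(d.name, d.device_type)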

Answer 2

There is also a method in the test utilities. So all that has to be done is:

tf.test.is_gpu_available()

and/or

tf.test.gpu_device_name()

Look up the TensorFlow docs for the arguments.
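
For example (note that in TF 2.x these are deprecated in favor of tf.config.list_physical_devices('GPU')):

import tensorflow as tf

# True if TensorFlow can see at least one GPU; the optional cuda_only
# argument restricts the check to CUDA devices.
print(tf.test.is_gpu_available(cuda_only=True))

# The name of a GPU device if one is available, e.g. '/device:GPU:0',
# otherwise an empty string.
print(tf.test.gpu_device_name())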


Answer 3

In TensorFlow 2.0, you can use tf.config.experimental.list_physical_devices('GPU'):

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, "  Type:", gpu.device_type)

If you have two GPUs installed, it outputs this:

Name: /physical_device:GPU:0   Type: GPU
Name: /physical_device:GPU:1   Type: GPU

From 2.1, you can drop experimental:

gpus = tf.config.list_physical_devices('GPU')
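
To get something close to the list of device-name strings the question asks for, you can map over the returned objects. A small sketch (note the name format differs from the '/gpu:0' style):

import tensorflow as tf

# A TF 2.x analogue of the tf.get_available_gpus() the question asks for.
def get_available_gpus():
    return [gpu.name for gpu in tf.config.list_physical_devices('GPU')]

print(get_available_gpus())  # e.g. ['/physical_device:GPU:0', '/physical_device:GPU:1']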

See: https://www.tensorflow.org/api_docs/python/tf/config/list_physical_devices


Answer 4

The accepted answer gives you the number of GPUs, but it also allocates all of the memory on those GPUs, which may be unwanted for some applications. You can avoid this by creating a session with a fixed, lower memory limit before calling device_lib.list_local_devices().

I ended up using nvidia-smi to get the number of GPUs without allocating any memory on them.

import subprocess

# `nvidia-smi -L` prints one line per GPU (each line contains its UUID),
# so counting 'UUID' occurrences gives the GPU count without touching
# any GPU memory.
n = subprocess.check_output(["nvidia-smi", "-L"]).decode().count('UUID')
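
If you also want the device names rather than just the count, the same output can be parsed line by line; each line of nvidia-smi -L describes one GPU, e.g. "GPU 0: GeForce GTX 1080 (UUID: GPU-...)". A small sketch:

import subprocess

# One line per GPU, each containing the model name and UUID.
lines = subprocess.check_output(["nvidia-smi", "-L"]).decode().strip().splitlines()
for line in lines:
    print(line)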

Answer 5

Apart from the excellent explanation by Mrry, who suggested using device_lib.list_local_devices(), I can show you how to check GPU-related information from the command line.

Because currently only Nvidia GPUs work with NN frameworks, the answer covers only them. Nvidia has a page documenting how to use the /proc filesystem interface to obtain run-time information about the driver, any installed NVIDIA graphics cards, and the AGP status.

/proc/driver/nvidia/gpus/0..N/information

These files provide information about each installed NVIDIA graphics adapter (model name, IRQ, BIOS version, bus type). Note that the BIOS version is only available while X is running.

So you can run cat /proc/driver/nvidia/gpus/0/information from the command line and see information about your first GPU. It is easy to run this from Python, and you can probe the second, third, and fourth GPU until it fails, as in the sketch below.
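
A minimal sketch of that idea, assuming a Linux machine with the NVIDIA kernel driver loaded (newer drivers name these directories by PCI bus ID rather than 0..N, so globbing is safer than counting upwards):

import glob

# Each /proc/driver/nvidia/gpus/*/information file describes one GPU.
gpu_info_files = glob.glob('/proc/driver/nvidia/gpus/*/information')
print(len(gpu_info_files), 'GPU(s) found')
for path in gpu_info_files:
    with open(path) as f:
        print(f.read())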

Mrry's answer is definitely more robust, and I am not sure whether mine will work on non-Linux machines, but Nvidia's page provides other interesting information that not many people know about.


Answer 6

The following works in TensorFlow 2:

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, "  Type:", gpu.device_type)

From 2.1, you can drop experimental:

gpus = tf.config.list_physical_devices('GPU')

https://www.tensorflow.org/api_docs/python/tf/config/list_physical_devices


Answer 7

I have a GPU called NVIDIA GTX GeForce 1650 Ti in my machine, with tensorflow-gpu==2.2.0 installed.

Run the following two lines of code:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Output:

Num GPUs Available:  1

Answer 8

Use the following to check all the parts:

from __future__ import absolute_import, division, print_function, unicode_literals

# Importing these verifies they are installed, even though only
# tensorflow and tensorflow_hub are used below.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

version = tf.__version__
executing_eagerly = tf.executing_eagerly()
hub_version = hub.__version__
available = tf.config.experimental.list_physical_devices("GPU")

print("Version: ", version)
print("Eager mode: ", executing_eagerly)
print("Hub Version: ", hub_version)
print("GPU is", "available" if available else "NOT AVAILABLE")

Answer 9

Ensure you have the latest TensorFlow 2.x GPU build installed on your GPU-supporting machine, then execute the following code in Python:

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf 

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

You will get an output that looks like this:

2020-02-07 10:45:37.587838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-07 10:45:37.588896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
Num GPUs Available:  8

