Question: How can I explicitly free memory in Python?

I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:

  1. read an input file
  2. process the file and create a list of triangles, represented by their vertices
  3. output the vertices in the OFF format: a list of vertices followed by a list of triangles. The triangles are represented by indices into the list of vertices
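
For reference, a minimal OFF file looks roughly like this (illustrative values only; the line after the header gives the vertex, face and edge counts):

OFF
4 2 0
0.0 0.0 0.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
3 0 1 2
3 0 1 3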

The requirement of OFF that I print out the complete list of vertices before I print out the triangles means that I have to hold the list of triangles in memory before I write the output to file. In the meantime, I'm getting memory errors because of the sizes of the lists.

What is the best way to tell Python that I no longer need some of the data, and it can be freed?


Answer 0

According to the official Python documentation, you can force the garbage collector to release unreferenced memory with gc.collect(). Example:

import gc
gc.collect()
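
Note that gc.collect() only reclaims objects that nothing references any more, so in practice you drop your own references first. A minimal sketch (build_triangles and write_output are hypothetical placeholders, not functions from the question):

import gc

triangles = build_triangles()   # hypothetical: builds the large list of triangle objects
write_output(triangles)         # hypothetical: writes the OFF output
del triangles                   # drop the last reference to the list
unreachable = gc.collect()      # force a full collection; returns the number of unreachable objects found
print(unreachable)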

Answer 1

Unfortunately (depending on your version and release of Python) some types of objects use “free lists” which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory “earmarked” for only objects of a certain type and thereby unavailable to the “general fund”.

The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it’s done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.

In your use case, it seems that the best way for the subprocess to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean not the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you’re all done with them).
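
A minimal sketch of that pattern (build_triangles and the file names are assumptions for illustration, not part of the original answer): the subprocess builds the huge list, dumps it to an ordinary file, and exits, at which point the operating system reclaims all of its memory.

import multiprocessing as mp

def crunch(input_path, result_path):
    triangles = build_triangles(input_path)      # hypothetical, memory-hungry step
    with open(result_path, "w") as f:
        for a, b, c in triangles:                # assume each triangle is a triple of vertex indices
            f.write("%d %d %d\n" % (a, b, c))

if __name__ == "__main__":
    p = mp.Process(target=crunch, args=("input.txt", "triangles.tmp"))
    p.start()
    p.join()        # once this returns, the subprocess and all of its memory are gone
    # read triangles.tmp in the parent, and delete it explicitly when finished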


Answer 2

The del statement might be of use, but IIRC it isn't guaranteed to free the memory. The docs are here, and an explanation of why the memory isn't released is here.

I have heard people on Linux and Unix-type systems forking a python process to do some work, getting results and then killing it.

This article has notes on the Python garbage collector, but I think the lack of memory control is the downside of managed memory.
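
A rough sketch of that fork idea (POSIX-only; do_heavy_work is a hypothetical placeholder); here the child simply exits when it is done, and its memory goes with it:

import os

pid = os.fork()
if pid == 0:                 # child: do the memory-hungry work, then exit
    do_heavy_work()          # hypothetical; all of its memory disappears with the child
    os._exit(0)
else:                        # parent: wait for the child to finish
    os.waitpid(pid, 0)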


Answer 3

Python is garbage-collected, so if you reduce the size of your list, it will reclaim memory. You can also use the “del” statement to get rid of a variable completely:

biglist = list(range(10_000_000))   # some large, memory-hungry list
# ... work with biglist ...
del biglist                         # remove the name; with no other references left, the list can be reclaimed

Answer 4

You can’t explicitly free memory. What you need to do is to make sure you don’t keep references to objects. They will then be garbage collected, freeing the memory.

In your case, when you need large lists, you typically need to reorganize the code, usually by using generators/iterators instead. That way you don’t need to have the large lists in memory at all.

http://www.prasannatech.net/2009/07/introduction-python-generators.html
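
For example, a generator-based sketch (parse_triangle and process are hypothetical placeholders) that never materialises the full list:

def triangles(path):
    with open(path) as f:
        for line in f:
            yield parse_triangle(line)   # hypothetical: one triangle at a time

for tri in triangles("input.txt"):
    process(tri)                         # hypothetical per-triangle work; nothing accumulates in memory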


Answer 5

(del can be your friend, as it marks objects as deletable when there are no other references to them. Note that the CPython interpreter often keeps this memory for later use, so your operating system might not see the “freed” memory.)

Maybe you would not run into any memory problem in the first place if you used a more compact structure for your data. Lists of numbers, for instance, are much less memory-efficient than the format used by the standard array module or the third-party numpy module. You would save memory by putting your vertices in a NumPy 3xN array and your triangles in an N-element array.
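
A minimal sketch of that idea (the array sizes here are made up): the coordinates and indices live in contiguous typed buffers instead of millions of separate Python objects.

import numpy as np

n_vertices, n_triangles = 1_000_000, 2_000_000
vertices  = np.zeros((3, n_vertices), dtype=np.float64)   # x, y, z coordinates in one contiguous buffer
triangles = np.zeros((3, n_triangles), dtype=np.int32)    # three vertex indices per triangle
print(vertices.nbytes + triangles.nbytes)                 # raw bytes held by the two arrays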


Answer 6

I had a similar problem reading a graph from a file. The processing included the computation of a 200,000 × 200,000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations using gc.collect() fixed the memory-related aspect of the problem, but it caused a performance issue: I don’t know why, but even though the amount of used memory remained constant, each new call to gc.collect() took a little more time than the previous one, so the garbage collection quickly took up most of the computation time.

To fix both the memory and performance issues, I switched to a multithreading trick I read somewhere (sorry, I can no longer find the related post). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() every once in a while to free memory. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed, without the strange performance issue.

Practically it works like this:

from dask import delayed  # this module wraps the multithreading
def f(storage, index, chunk_size):  # the processing function
    # read the chunk of size chunk_size starting at index in the file
    # process it using data in storage if needed
    # append data needed for further computations to storage
    return storage

partial_result = delayed([])  # put into delayed() the constructor for your data structure
# I personally use "delayed(nx.Graph())" since I am creating a networkx Graph
chunk_size = 100  # ideally you want this as big as possible while still enabling the computations to fit in memory
for index in range(0, len(file), chunk_size):
    # we indicate to dask that we want to apply f to the arguments partial_result, index, chunk_size
    partial_result = delayed(f)(partial_result, index, chunk_size)

    # no computations are done yet!
    # dask will spawn a thread to run f(partial_result, index, chunk_size) once we call partial_result.compute()
    # passing the previous "partial_result" variable as an argument ensures that a chunk is only processed after the previous one is done
    # it also lets you use the results from the previous chunks of the file if needed

# this launches all the computations
result = partial_result.compute()

# one thread is spawned for each "delayed", one at a time, to compute its result
# dask then closes the thread, which solves the memory-freeing issue
# the strange performance issue with gc.collect() is also avoided

Answer 7

Others have posted some ways that you might be able to “coax” the Python interpreter into freeing the memory (or otherwise avoid having memory problems). Chances are you should try their ideas out first. However, I feel it is important to give you a direct answer to your question.

There isn’t really any way to directly tell Python to free memory. The fact of the matter is that if you want that low a level of control, you’re going to have to write an extension in C or C++.

That said, there are some tools to help with this:
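
For instance (this particular trick is an illustration, not taken from the original answer, and it is Linux/glibc specific), ctypes can be used to ask the C allocator to hand unused heap pages back to the operating system:

import ctypes

libc = ctypes.CDLL("libc.so.6")   # glibc only; not portable to other platforms
libc.malloc_trim(0)               # ask malloc to return free arena memory to the OS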


Answer 8

If you don’t care about vertex reuse, you could have two output files–one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.
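
A sketch of that approach (read_triangles is a hypothetical streaming parser; since vertices are not reused, triangle i simply refers to vertices 3*i, 3*i+1 and 3*i+2):

import shutil

with open("vertices.tmp", "w") as vf, open("triangles.tmp", "w") as tf:
    for i, tri in enumerate(read_triangles("input.txt")):   # hypothetical generator of triangles
        for x, y, z in tri:                                  # three (x, y, z) vertices per triangle
            vf.write("%f %f %f\n" % (x, y, z))
        tf.write("3 %d %d %d\n" % (3 * i, 3 * i + 1, 3 * i + 2))

with open("output.off", "w") as out:
    out.write("OFF\n")                        # a real OFF file also needs a counts line here
    with open("vertices.tmp") as vf:
        shutil.copyfileobj(vf, out)           # append all vertices without loading them into memory
    with open("triangles.tmp") as tf:
        shutil.copyfileobj(tf, out)           # then append all triangles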

