Python Memory Leaks

Question: Python Memory Leaks

I have a long-running script which, if left to run long enough, will consume all the memory on my system.

Without going into details about the script, I have two questions:

  1. Are there any “Best Practices” to follow, which will help prevent leaks from occurring?
  2. What techniques are there to debug memory leaks in Python?

Answer 0

Have a look at this article: Tracing python memory leaks

Also, note that the garbage collection module (gc) can actually have debug flags set; look at the set_debug function. Additionally, look at this code by Gnibbler for determining the types of objects that have been created after a call.
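A minimal sketch of both ideas: enabling a gc debug flag, and comparing per-type object counts before and after a suspect piece of code (the list comprehension here is just a hypothetical stand-in for the code under investigation):

```python
import gc
from collections import Counter

# Report objects the collector finds uncollectable (they are kept in gc.garbage).
gc.set_debug(gc.DEBUG_UNCOLLECTABLE)

def count_types():
    # Tally live, gc-tracked objects by type name.
    return Counter(type(o).__name__ for o in gc.get_objects())

before = count_types()
suspect = [[] for _ in range(1000)]   # stand-in for the code under investigation
after = count_types()

# Types whose instance count grew between the two snapshots are leak candidates.
growth = {name: after[name] - before[name]
          for name in after if after[name] > before[name]}
print(growth)
```

Here `growth` would show at least 1000 extra `list` objects, pointing straight at the suspect allocation site.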


Answer 1

I tried most of the options mentioned previously, but found this small and intuitive package to be the best: pympler.

It's quite straightforward to trace objects that were not garbage-collected; check this small example.

Install the package via pip install pympler:

from pympler.tracker import SummaryTracker
tracker = SummaryTracker()

# ... some code you want to investigate ...

tracker.print_diff()

The output shows you all the objects that have been added, plus the memory they consumed.

Sample output:

                                 types |   # objects |   total size
====================================== | =========== | ============
                                  list |        1095 |    160.78 KB
                                   str |        1093 |     66.33 KB
                                   int |         120 |      2.81 KB
                                  dict |           3 |       840 B
      frame (codename: create_summary) |           1 |       560 B
          frame (codename: print_diff) |           1 |       480 B

This package provides many more features. Check pympler's documentation, in particular the section Identifying memory leaks.


Answer 2

Let me recommend the mem_top tool I created.

It helped me solve a similar issue.

It instantly shows the top suspects for memory leaks in a Python program.


Answer 3

The tracemalloc module was integrated as a built-in module starting from Python 3.4, and apparently it's also available for prior versions of Python as a third-party library (though I haven't tested that).

This module is able to output the precise files and lines that allocated the most memory. IMHO, this information is infinitely more valuable than the number of allocated instances of each type (which ends up being a lot of tuples 99% of the time; that's a clue, but barely helps in most cases).

I recommend you use tracemalloc in combination with pyrasite. Nine times out of ten, running the top-10 snippet in a pyrasite shell will give you enough information and hints to fix the leak within 10 minutes. If you're still unable to find the cause of the leak, pyrasite-shell in combination with the other tools mentioned in this thread will probably give you some more hints. You should also take a look at all the extra helpers pyrasite provides (such as the memory viewer).
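A sketch of the "top lines" approach with tracemalloc alone, no attached process needed; the list of byte strings is a hypothetical leak standing in for the real workload:

```python
import tracemalloc

tracemalloc.start()

# Hypothetical leaky allocation standing in for the real workload.
leak = [bytes(1000) for _ in range(10_000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")   # group allocations by file and line

print("[ Top 10 lines by allocated size ]")
for stat in top_stats[:10]:
    print(stat)
```

The first entry would point at the list comprehension above, with roughly 10 MB attributed to that exact file and line.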


Answer 4

You should take a special look at your global or static data (long-lived data).

When this data grows without restriction, you can run into trouble in Python as well.

The garbage collector can only collect data that is no longer referenced. But your static data can hold on to data elements that should be freed.

Another problem can be reference cycles, but at least in theory the garbage collector should find and eliminate cycles, as long as they are not hooked onto some long-lived data.

What kinds of long-lived data are especially troublesome? Take a good look at any lists and dictionaries: they can grow without limit. With dictionaries you might not even see the trouble coming, since the number of keys in a dictionary may not be very visible to you when you access it.


Answer 5

To detect and locate memory leaks in long-running processes, e.g. in production environments, you can now use stackimpact. It uses tracemalloc underneath. More info in this post.


Answer 6

As far as best practices go, keep an eye out for recursive functions. In my case, I ran into issues with recursion where there didn't need to be any. A simplified example of what I was doing:

def my_function():
    # lots of memory-intensive operations,
    # like operating on images or huge dictionaries and lists
    ...
    my_flag = True
    if my_flag:  # restart the function if a certain flag is true
        my_function()

def main():
    my_function()

Operating in this recursive manner prevents the garbage collector from clearing out the remains of each run: every frame in the recursion stays alive, along with all of its local variables, until the innermost call returns, so memory usage grows with every pass.

My solution was to pull the recursive call out of my_function() and have main() handle when to call it again. That way the function ends naturally and cleans up after itself.

def my_function():
    # lots of memory-intensive operations,
    # like operating on images or huge dictionaries and lists
    ...
    my_flag = True
    ...
    return my_flag

def main():
    result = my_function()
    if result:
        my_function()
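To keep restarting for as long as the flag stays true, a loop in main() works the same way: each call's frame and locals are freed on return, since no frame ever waits on a recursive call. A runnable sketch, where the countdown in `state` is a stand-in for the real "run again" condition:

```python
def my_function(state):
    # stand-in for the memory-intensive work; each call's locals are
    # released as soon as the call returns
    state["remaining"] -= 1
    return state["remaining"] > 0   # True means "run again"

def main():
    state = {"remaining": 5}
    runs = 1
    while my_function(state):       # restart as long as the flag is True
        runs += 1
    return runs

print(main())  # 5
```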

Answer 7

Not sure about "best practices" for memory leaks in Python, but Python should clear its own memory via its garbage collector. So I would mainly start by checking for circular references of some sort, since pure reference counting won't reclaim those on its own (the cycle collector usually will, though cycles anchored to long-lived data, or, before Python 3.4, cycles containing objects with __del__ methods, can still stick around).
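A minimal demonstration of that split: reference counting alone leaves a cycle alive after the last outside reference disappears, and it takes the cycle collector to reclaim it.

```python
import gc

class Node:
    def __init__(self):
        self.partner = None

gc.collect()                  # flush any pre-existing garbage first

a, b = Node(), Node()
a.partner, b.partner = b, a   # a reference cycle: each node keeps the other alive

del a, b                      # no outside references remain, but the reference
                              # counts inside the cycle are still > 0

found = gc.collect()          # the cycle detector finds and reclaims the pair
print(found)
```

`found` reports at least the two Node objects (plus their instance dicts) as unreachable.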


Answer 8

This is by no means exhaustive advice. But the number one thing to keep in mind when writing code with an eye to avoiding future memory leaks (cycles) is to make sure that anything which accepts a reference to a callback stores that callback as a weak reference.
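A sketch of the pattern using weakref.WeakMethod (the Emitter/Listener names are hypothetical): the emitter holds callbacks weakly, so subscribing never keeps a listener alive on its own.

```python
import weakref

class Emitter:
    """Stores callbacks weakly so it never pins a listener in memory."""
    def __init__(self):
        self._callbacks = []

    def subscribe(self, bound_method):
        # WeakMethod handles bound methods; plain functions would use weakref.ref
        self._callbacks.append(weakref.WeakMethod(bound_method))

    def fire(self, value):
        for ref in list(self._callbacks):
            callback = ref()
            if callback is None:          # the listener was garbage-collected
                self._callbacks.remove(ref)
            else:
                callback(value)

class Listener:
    def __init__(self):
        self.seen = []
    def on_event(self, value):
        self.seen.append(value)

emitter = Emitter()
listener = Listener()
emitter.subscribe(listener.on_event)

emitter.fire(1)
print(listener.seen)             # [1]

del listener                     # with a strong reference, the emitter would
emitter.fire(2)                  # keep the listener alive forever; instead the
print(len(emitter._callbacks))   # dead reference is dropped: 0
```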