Python subprocess.Popen "OSError: [Errno 12] Cannot allocate memory"

Question: Python subprocess.Popen "OSError: [Errno 12] Cannot allocate memory"

Note: This question was originally asked here but the bounty time expired even though an acceptable answer was not actually found. I am re-asking this question including all details provided in the original question.

A python script is running a set of class functions every 60 seconds using the sched module:

# sc is a sched.scheduler instance
sc.enter(60, 1, self.doChecks, (sc, False))

The script is running as a daemonised process using the code here.

A number of class methods that are called as part of doChecks use the subprocess module to call system functions in order to get system statistics:

ps = subprocess.Popen(['ps', 'aux'], stdout=subprocess.PIPE).communicate()[0]

This runs fine for a period of time before the entire script crashes with the following error:

File "/home/admin/sd-agent/checks.py", line 436, in getProcesses
File "/usr/lib/python2.4/subprocess.py", line 533, in __init__
File "/usr/lib/python2.4/subprocess.py", line 835, in _get_handles
OSError: [Errno 12] Cannot allocate memory

The output of free -m on the server once the script has crashed is:

$ free -m
                  total       used       free     shared     buffers    cached
Mem:                894        345        549          0          0          0
-/+ buffers/cache:  345        549
Swap:                 0          0          0

The server is running CentOS 5.3. I am unable to reproduce the problem on my own CentOS boxes, and no other user has reported the same problem.

I have tried a number of things to debug this as suggested in the original question:

  1. Logging the output of free -m before and after the Popen call. There is no significant change in memory usage i.e. memory is not gradually being used up as the script runs.

  2. I added close_fds=True to the Popen call but this made no difference – the script still crashed with the same error. Suggested here and here.

  3. I checked the rlimits which showed (-1, -1) on both RLIMIT_DATA and RLIMIT_AS as suggested here.

  4. An article suggested that having no swap space might be the cause, but swap is actually available on demand (according to the web host), and this was also suggested as a bogus cause here.

  5. The processes are being closed because that is the behaviour of using .communicate() as backed up by the Python source code and comments here.
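
The rlimit check from item 3 can be repeated from inside the daemon itself, so that the values logged are those of the process that actually fails. The following is a minimal sketch using the standard resource module; RLIMIT_NPROC is added here as an assumption, because it also affects fork(), and the helper name current_rlimits is hypothetical.

```python
import resource

# Limits that can make fork()/clone() fail. On Linux, -1 means "unlimited".
# RLIMIT_NPROC is an addition to the two limits named in the question.
LIMIT_NAMES = ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_NPROC")


def current_rlimits(names=LIMIT_NAMES):
    """Return {name: (soft, hard)} for each limit this platform defines."""
    limits = {}
    for name in names:
        constant = getattr(resource, name, None)
        if constant is None:
            continue  # this limit does not exist on the current platform
        limits[name] = resource.getrlimit(constant)
    return limits


for name, (soft, hard) in sorted(current_rlimits().items()):
    print("%s soft=%s hard=%s" % (name, soft, hard))
```

Logging this once at daemon start-up and again just before the Popen call would show whether the daemonising code changes any limit.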

The entire checks file can be found on GitHub here, with the getProcesses function defined from line 442. This is called by doChecks(), starting at line 520.

The script was run with strace with the following output before the crash:

recv(4, "Total Accesses: 516662\nTotal kBy"..., 234, 0) = 234
gettimeofday({1250893252, 887805}, NULL) = 0
write(3, "2009-08-21 17:20:52,887 - checks"..., 91) = 91
gettimeofday({1250893252, 888362}, NULL) = 0
write(3, "2009-08-21 17:20:52,888 - checks"..., 74) = 74
gettimeofday({1250893252, 888897}, NULL) = 0
write(3, "2009-08-21 17:20:52,888 - checks"..., 67) = 67
gettimeofday({1250893252, 889184}, NULL) = 0
write(3, "2009-08-21 17:20:52,889 - checks"..., 81) = 81
close(4)                                = 0
gettimeofday({1250893252, 889591}, NULL) = 0
write(3, "2009-08-21 17:20:52,889 - checks"..., 63) = 63
pipe([4, 5])                            = 0
pipe([6, 7])                            = 0
fcntl64(7, F_GETFD)                     = 0
fcntl64(7, F_SETFD, FD_CLOEXEC)         = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7f12708) = -1 ENOMEM (Cannot allocate memory)
write(2, "Traceback (most recent call last"..., 35) = 35
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/agent."..., 52) = 52
open("/home/admin/sd-agent/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/daemon.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/home/admin/sd-agent/dae"..., 60) = 60
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/agent.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/agent."..., 54) = 54
open("/usr/lib/python2.4/sched.py", O_RDONLY|O_LARGEFILE) = 8
write(2, "  File \"/usr/lib/python2.4/sched"..., 55) = 55
fstat64(8, {st_mode=S_IFREG|0644, st_size=4054, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d28000
read(8, "\"\"\"A generally useful event sche"..., 4096) = 4054
write(2, "    ", 4)                     = 4
write(2, "void = action(*argument)\n", 25) = 25
close(8)                                = 0
munmap(0xb7d28000, 4096)                = 0
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/checks"..., 60) = 60
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/bin/sd-agent/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python24.zip/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/plat-linux2/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOMEM (Cannot allocate memory)
open("/usr/lib/python2.4/lib-tk/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/lib-dynload/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/usr/lib/python2.4/site-packages/checks.py", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "  File \"/usr/bin/sd-agent/checks"..., 64) = 64
open("/usr/lib/python2.4/subprocess.py", O_RDONLY|O_LARGEFILE) = 8
write(2, "  File \"/usr/lib/python2.4/subpr"..., 65) = 65
fstat64(8, {st_mode=S_IFREG|0644, st_size=39931, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d28000
read(8, "# subprocess - Subprocesses with"..., 4096) = 4096
read(8, "lso, the newlines attribute of t"..., 4096) = 4096
read(8, "code < 0:\n        print >>sys.st"..., 4096) = 4096
read(8, "alse does not exist on 2.2.0\ntry"..., 4096) = 4096
read(8, " p2cread\n        # c2pread    <-"..., 4096) = 4096
write(2, "    ", 4)                     = 4
write(2, "errread, errwrite)\n", 19)    = 19
close(8)                                = 0
munmap(0xb7d28000, 4096)                = 0
open("/usr/lib/python2.4/subprocess.py", O_RDONLY|O_LARGEFILE) = 8
write(2, "  File \"/usr/lib/python2.4/subpr"..., 71) = 71
fstat64(8, {st_mode=S_IFREG|0644, st_size=39931, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d28000
read(8, "# subprocess - Subprocesses with"..., 4096) = 4096
read(8, "lso, the newlines attribute of t"..., 4096) = 4096
read(8, "code < 0:\n        print >>sys.st"..., 4096) = 4096
read(8, "alse does not exist on 2.2.0\ntry"..., 4096) = 4096
read(8, " p2cread\n        # c2pread    <-"..., 4096) = 4096
read(8, "table(self, handle):\n           "..., 4096) = 4096
read(8, "rrno using _sys_errlist (or siml"..., 4096) = 4096
read(8, " p2cwrite = None, None\n         "..., 4096) = 4096
write(2, "    ", 4)                     = 4
write(2, "self.pid = os.fork()\n", 21)  = 21
close(8)                                = 0
munmap(0xb7d28000, 4096)                = 0
write(2, "OSError", 7)                  = 7
write(2, ": ", 2)                       = 2
write(2, "[Errno 12] Cannot allocate memor"..., 33) = 33
write(2, "\n", 1)                       = 1
unlink("/var/run/sd-agent.pid")         = 0
close(3)                                = 0
munmap(0xb7e0d000, 4096)                = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x589978}, {0xb89a60, [], SA_RESTORER, 0x589978}, 8) = 0
brk(0xa022000)                          = 0xa022000
exit_group(1)                           = ?

Answer 0

As a general rule (i.e. in vanilla kernels), fork/clone failures with ENOMEM occur specifically because of either an honest to God out-of-memory condition (dup_mm, dup_task_struct, alloc_pid, mpol_dup, mm_init etc. croak), or because security_vm_enough_memory_mm failed you while enforcing the overcommit policy.

Start by checking the vmsize of the process that failed to fork, at the time of the fork attempt, and then compare to the amount of free memory (physical and swap) as it relates to the overcommit policy (plug the numbers in.)
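
The comparison described above can be collected in one place. This is a rough sketch, assuming a Linux procfs with the usual "Key: value kB" layout in /proc/<pid>/status and /proc/meminfo; the function names are hypothetical, and inside a Virtuozzo container the reported numbers may not reflect what the host actually enforces.

```python
def parse_kb_fields(text):
    """Parse 'Key:   123 kB' lines, as found in /proc/meminfo or /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the exact form "<key>: <digits> kB".
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def read_text(path):
    with open(path) as handle:
        return handle.read()


def fork_headroom_report(pid="self"):
    """Gather the numbers needed to judge whether a fork is likely to be refused."""
    status = parse_kb_fields(read_text("/proc/%s/status" % pid))
    meminfo = parse_kb_fields(read_text("/proc/meminfo"))
    return {
        "overcommit_memory": int(read_text("/proc/sys/vm/overcommit_memory")),
        "VmSize_kB": status.get("VmSize"),
        "MemFree_kB": meminfo.get("MemFree"),
        "SwapFree_kB": meminfo.get("SwapFree"),
        "CommitLimit_kB": meminfo.get("CommitLimit"),
        "Committed_AS_kB": meminfo.get("Committed_AS"),
    }
```

Calling fork_headroom_report() and logging the result immediately before each Popen call would record the process size at the moment of the failing fork, which free -m alone does not show.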

In your particular case, note that Virtuozzo has additional checks in overcommit enforcement. Moreover, I’m not sure how much control you truly have, from within your container, over swap and overcommit configuration (in order to influence the outcome of the enforcement.)

Now, in order to actually move forward I’d say you’re left with two options:

  • switch to a larger instance, or
  • put some coding effort into more effectively controlling your script’s memory footprint

NOTE that the coding effort may be all for naught if it turns out that it’s not you, but some other guy co-located in a different instance on the same server as you, running amok.

Memory-wise, we already know that subprocess.Popen uses fork/clone under the hood, meaning that every time you call it you’re requesting once more as much memory as Python is already eating up, i.e. in the hundreds of additional MB, all in order to then exec a puny 10kB executable such as free or ps. In the case of an unfavourable overcommit policy, you’ll soon see ENOMEM.

Alternatives to fork that do not have this problem of copying the parent’s page tables etc. are vfork and posix_spawn. But if you do not feel like rewriting chunks of subprocess.Popen in terms of vfork/posix_spawn, consider using subprocess.Popen only once, at the beginning of your script (when Python’s memory footprint is minimal), to spawn a shell script that then runs free/ps/sleep and whatever else in a loop parallel to your script; poll the script’s output or read it synchronously, possibly from a separate thread if you have other stuff to take care of asynchronously. Do your data crunching in Python but leave the forking to the subordinate process.
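
The posix_spawn route mentioned above can be sketched as follows. Note the assumptions: os.posix_spawnp only exists in Python 3.8 and later (it is not available on the Python 2.4 used in the question), the helper name spawn_and_capture is hypothetical, and whether the call really avoids the fork-style memory accounting depends on the C library in use.

```python
import os


def spawn_and_capture(argv):
    """Run argv via posix_spawnp and return (exit_code, stdout_bytes)."""
    read_fd, write_fd = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_CLOSE, read_fd),     # child does not need the read end
        (os.POSIX_SPAWN_DUP2, write_fd, 1),  # child's stdout becomes the pipe
        (os.POSIX_SPAWN_CLOSE, write_fd),    # drop the now-duplicated descriptor
    ]
    pid = os.posix_spawnp(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(write_fd)  # parent keeps only the read end, so EOF arrives on child exit
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status), output
    return -os.WTERMSIG(status), output  # killed by a signal


if __name__ == "__main__":
    code, out = spawn_and_capture(["ps", "aux"])
    print(code, len(out))
```

The return convention (negative signal number when the child is killed) mirrors what subprocess reports in returncode, so the helper can stand in for the Popen(...).communicate()[0] line with little change to the caller.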

HOWEVER, in your particular case you can skip invoking ps and free altogether; that information is readily available to you in Python directly from procfs, whether you choose to access it yourself or via existing libraries and/or packages. If ps and free were the only utilities you were running, then you can do away with subprocess.Popen completely.
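
As an illustration of reading procfs directly, here is a minimal sketch of a ps-like listing that never forks. It assumes a Linux /proc where each numeric directory has a status file; the function names and the choice of fields (Name, VmRSS) are illustrative and not taken from the original script.

```python
import os


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def list_processes(proc_root="/proc"):
    """Yield (pid, name, rss) for every process visible under proc_root."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                status = parse_status(handle.read())
        except (IOError, OSError):
            continue  # the process exited while we were scanning
        # Kernel threads have no VmRSS line, hence the default.
        yield int(entry), status.get("Name", "?"), status.get("VmRSS", "0 kB")


if __name__ == "__main__":
    for pid, name, rss in sorted(list_processes()):
        print("%6d %-20s %s" % (pid, name, rss))
```

Because this only opens and reads files, it cannot fail in clone() the way the Popen call does, although it can still fail if the process is out of file descriptors or memory.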

Finally, whatever you do as far as subprocess.Popen is concerned, if your script leaks memory you will still hit the wall eventually. Keep an eye on it, and check for memory leaks.


Answer 1

Looking at the output of free -m it seems to me that you actually do not have swap memory available. I am not sure if in Linux the swap always will be available automatically on demand, but I was having the same problem and none of the answers here really helped me. Adding some swap memory however, fixed the problem in my case so since this might help other people facing the same problem, I post my answer on how to add a 1GB swap (on Ubuntu 12.04 but it should work similarly for other distributions.)

You can first check if there is any swap memory enabled.

$ sudo swapon -s

If it is empty, it means you don’t have any swap enabled. To add a 1GB swap:

$ sudo dd if=/dev/zero of=/swapfile bs=1024 count=1024k
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

Add the following line to the fstab to make the swap permanent.

$ sudo vim /etc/fstab

     /swapfile       none    swap    sw      0       0 

Source and more information can be found here.


Answer 2

Swap may not be the red herring previously suggested. How big is the Python process in question just before the ENOMEM?

Under kernel 2.6, /proc/sys/vm/swappiness controls how aggressively the kernel will turn to swap, and the overcommit* files control how much and how precisely the kernel may apportion memory with a wink and a nod. Like your Facebook relationship status, it’s complicated.

…but swap is actually available on demand (according to the web host)…

but not according to the output of your free(1) command, which shows no swap space recognized by your server instance. Now, your web host may certainly know much more than I about this topic, but virtual RHEL/CentOS systems I’ve used have reported swap available to the guest OS.

Adapting Red Hat KB Article 15252:

A Red Hat Enterprise Linux 5 system will run just fine with no swap space at all as long as the sum of anonymous memory and system V shared memory is less than about 3/4 the amount of RAM. … Systems with 4GB of RAM or less [are recommended to have] a minimum of 2GB of swap space.

Compare your /proc/sys/vm settings to a plain CentOS 5.3 installation. Add a swap file. Ratchet down swappiness and see if you live any longer.
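
To make the comparison of /proc/sys/vm settings repeatable, a small sketch like the following could be run on both machines. The choice of the three keys is an assumption based on the files named in this answer, and the function names are hypothetical.

```python
import os

VM_DIR = "/proc/sys/vm"
# Files discussed in this answer; extend the tuple to compare more settings.
VM_KEYS = ("swappiness", "overcommit_memory", "overcommit_ratio")


def read_vm_settings(vm_dir=VM_DIR, keys=VM_KEYS):
    """Return {key: value-as-string}, with None for files that cannot be read."""
    settings = {}
    for key in keys:
        try:
            with open(os.path.join(vm_dir, key)) as handle:
                settings[key] = handle.read().strip()
        except (IOError, OSError):
            settings[key] = None
    return settings


def diff_settings(mine, reference):
    """Return {key: (mine, reference)} for every key whose values differ."""
    changed = {}
    for key in sorted(set(mine) | set(reference)):
        if mine.get(key) != reference.get(key):
            changed[key] = (mine.get(key), reference.get(key))
    return changed


if __name__ == "__main__":
    print(read_vm_settings())
```

Saving the output of read_vm_settings() from the stock CentOS 5.3 install and passing it as reference to diff_settings() on the failing server lists only the settings that differ.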


Answer 3

For an easy fix, you could

echo 1 > /proc/sys/vm/overcommit_memory

if you’re sure that your system has enough memory. See Linux overcommit heuristic.


Answer 4

I continue to suspect that your customer/user has some kernel module or driver loaded which is interfering with the clone() system call (perhaps some obscure security enhancement, something like LIDS but more obscure?) or is somehow filling up some of the kernel data structures that are necessary for fork()/clone() to operate (process table, page tables, file descriptor tables, etc).

Here’s the relevant portion of the fork(2) man page:

ERRORS
       EAGAIN fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task  structure  for  the
              child.

       EAGAIN It  was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered.  To
              exceed this limit, the process must have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.

       ENOMEM fork() failed to allocate the necessary kernel structures because memory is tight.

I suggest having the user try this after booting into a stock, generic kernel and with only a minimal set of modules and drivers loaded (minimum necessary to run your application/script). From there, assuming it works in that configuration, they can perform a binary search between that and the configuration which exhibits the issue. This is standard sysadmin troubleshooting 101.

The relevant line in your strace is:

clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7f12708) = -1 ENOMEM (Cannot allocate memory)

… I know others have talked about swap and memory availability (and I would recommend that you set up at least a small swap partition, ironically even if it’s on a RAM disk: the code paths through the Linux kernel when it has even a tiny bit of swap available have been exercised far more extensively than those exception handling paths in which there is zero swap available).

However I suspect that this is still a red herring.

The fact that free is reporting 0 (ZERO) memory in use by the cache and buffers is very disturbing. I suspect that the free output … and possibly your application issue here, are caused by some proprietary kernel module which is interfering with the memory allocation in some way.

According to the man pages for fork()/clone() the fork() system call should return EAGAIN if your call would cause a resource limit violation (RLIMIT_NPROC) … however, it doesn’t say if EAGAIN is to be returned by other RLIMIT* violations. In any event if your target/host has some sort of weird Vormetric or other security settings (or even if your process is running under some weird SELinux policy) then it might be causing this -ENOMEM failure.

It’s pretty unlikely to be a normal run-of-the-mill Linux/UNIX issue. You’ve got something non-standard going on there.


Answer 5

Have you tried using:

(status,output) = commands.getstatusoutput("ps aux")

I thought this had fixed the exact same problem for me. But then my process ended up getting killed instead of failing to spawn, which is even worse.

After some testing I found that this only occurred on older versions of Python: it happens with 2.6.5 but not with 2.7.2.

My search led me to python-close_fds-issue, but unsetting close_fds did not solve the issue. It is still well worth a read.

I found that python was leaking file descriptors by just keeping an eye on it:

watch "ls /proc/$PYTHONPID/fd | wc -l"
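
The same descriptor count can be taken from inside the script and written to its log on every scheduler pass. This is a minimal sketch assuming a Linux /proc; the helper name count_open_fds is hypothetical.

```python
import os


def count_open_fds(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only).

    When counting for "self", the listing includes the descriptor that
    os.listdir opens to read the directory, so the figure is one higher
    than the number of descriptors the script itself holds. The offset is
    constant, so growth between calls still indicates a leak.
    """
    return len(os.listdir("/proc/%s/fd" % pid))


if __name__ == "__main__":
    print("open descriptors:", count_open_fds())
```

A count that climbs steadily across doChecks runs would point to leaked pipes from Popen rather than to memory pressure.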

Like you, I do want to capture the command’s output, and I do want to avoid OOM errors… but it looks like the only way is for people to use a less buggy version of Python. Not ideal…


Answer 6

munmap(0xb7d28000, 4096) = 0
write(2, "OSError", 7) = 7

I’ve seen sloppy code that looks like this:

serrno = errno;
some_Syscall(...);
if (serrno != errno)
    /* sound alarm: CATASTROPHIC ERROR !!! */

You should check to see if this is what is happening in the Python code. Errno is only valid if the preceding system call failed.

Edited to add:

You don’t say how long this process lives. Possible consumers of memory:
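
For comparison, at the Python level the error number is attached to the exception raised by the call that failed, so it is read from the exception object rather than from a global. The sketch below only illustrates that convention with a harmless ENOENT; the path and the helper name try_open are made up for the example. Note also that the strace output in the question shows clone() itself returning ENOMEM.

```python
import errno
import os


def try_open(path):
    """Return (fd, None) on success or (None, errno_code) on failure."""
    try:
        return os.open(path, os.O_RDONLY), None
    except OSError as exc:
        # exc.errno was captured when this specific call failed.
        return None, exc.errno


fd, code = try_open("/definitely/not/a/real/path")
print(errno.errorcode[code])  # prints ENOENT
```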

  • forked processes
  • unused data structures
  • shared libraries
  • memory mapped files

Answer 7

Maybe you can simply

$ sudo bash -c "echo vm.overcommit_memory=1 >> /etc/sysctl.conf"
$ sudo sysctl -p

It works for my case.

Reference: https://github.com/openai/gym/issues/110#issuecomment-220672405