How can I pass a list as a command-line argument with argparse?

Question: How can I pass a list as a command-line argument with argparse?

I am trying to pass a list as an argument to a command-line program. Is there an argparse option to pass a list as an option?

parser.add_argument('-l', '--list',
                      type=list, action='store',
                      dest='list',
                      help='<Required> Set flag',
                      required=True)

The script is called like this:

python test.py -l "265340 268738 270774 270817"

Answer 0

TL;DR

Use the nargs option or the 'append' setting of the action option (depending on how you want the user interface to behave).

nargs

parser.add_argument('-l','--list', nargs='+', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 2345 3456 4567

nargs='+' takes 1 or more arguments, nargs='*' takes zero or more.

append

parser.add_argument('-l','--list', action='append', help='<Required> Set flag', required=True)
# Use like:
# python arg.py -l 1234 -l 2345 -l 3456 -l 4567

With append you provide the option multiple times to build up the list.

Don’t use type=list!!! – There is probably no situation where you would want to use type=list with argparse. Ever.


Let’s take a look in more detail at some of the different ways one might try to do this, and the end result.

import argparse

parser = argparse.ArgumentParser()

# By default it will fail with multiple arguments.
parser.add_argument('--default')

# Telling the type to be a list will also fail for multiple arguments,
# but give incorrect results for a single argument.
parser.add_argument('--list-type', type=list)

# This will allow you to provide multiple arguments, but you will get
# a list of lists which is not desired.
parser.add_argument('--list-type-nargs', type=list, nargs='+')

# This is the correct way to handle accepting multiple arguments.
# '+' == 1 or more.
# '*' == 0 or more.
# '?' == 0 or 1.
# An int is an explicit number of arguments to accept.
parser.add_argument('--nargs', nargs='+')

# To make the input integers
parser.add_argument('--nargs-int-type', nargs='+', type=int)

# An alternate way to accept multiple inputs, but you must
# provide the flag once per input. Of course, you can use
# type=int here if you want.
parser.add_argument('--append-action', action='append')

# To show the results of the given option to screen.
for _, value in vars(parser.parse_args()).items():
    if value is not None:
        print(value)

Here is the output you can expect:

$ python arg.py --default 1234 2345 3456 4567
...
arg.py: error: unrecognized arguments: 2345 3456 4567

$ python arg.py --list-type 1234 2345 3456 4567
...
arg.py: error: unrecognized arguments: 2345 3456 4567

$ # Quotes won't help here... 
$ python arg.py --list-type "1234 2345 3456 4567"
['1', '2', '3', '4', ' ', '2', '3', '4', '5', ' ', '3', '4', '5', '6', ' ', '4', '5', '6', '7']

$ python arg.py --list-type-nargs 1234 2345 3456 4567
[['1', '2', '3', '4'], ['2', '3', '4', '5'], ['3', '4', '5', '6'], ['4', '5', '6', '7']]

$ python arg.py --nargs 1234 2345 3456 4567
['1234', '2345', '3456', '4567']

$ python arg.py --nargs-int-type 1234 2345 3456 4567
[1234, 2345, 3456, 4567]

$ # Negative numbers are handled perfectly fine out of the box.
$ python arg.py --nargs-int-type -1234 2345 -3456 4567
[-1234, 2345, -3456, 4567]

$ python arg.py --append-action 1234 --append-action 2345 --append-action 3456 --append-action 4567
['1234', '2345', '3456', '4567']

Takeaways:

  • Use nargs or action='append'
    • nargs can be more straightforward from a user perspective, but it can be unintuitive if there are positional arguments because argparse can’t tell what should be a positional argument and what belongs to the nargs; if you have positional arguments then action='append' may end up being a better choice.
    • The above is only true if nargs is given '*', '+', or '?'. If you provide an integer number (such as 4) then there will be no problem mixing options with nargs and positional arguments because argparse will know exactly how many values to expect for the option.
  • Don’t use quotes on the command line [1]
  • Don’t use type=list, as it will return a list of lists
    • This happens because under the hood argparse uses the value of type to coerce each individual argument to your chosen type, not the aggregate of all arguments.
    • You can use type=int (or whatever) to get a list of ints (or whatever)

[1]: I don’t mean in general; I mean that using quotes to pass a list to argparse is not what you want.
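
The takeaway about mixing nargs with positional arguments can be tried out with a small sketch. The parser below is hypothetical: the --nums, --pair, --add and path names are assumptions made for illustration, and the argument lists are passed explicitly so the snippet runs without a real command line.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog='arg.py')
    parser.add_argument('--nums', nargs='+', type=int)   # open-ended: 1 or more values
    parser.add_argument('--pair', nargs=2, type=int)     # exactly 2 values
    parser.add_argument('-a', '--add', action='append', type=int)  # one value per flag
    parser.add_argument('path')                          # a positional argument
    return parser

parser = build_parser()

# A fixed count leaves the trailing positional alone.
fixed = parser.parse_args(['--pair', '1', '2', 'data.txt'])
print(fixed.pair, fixed.path)

# So does append, because each flag takes exactly one value.
appended = parser.parse_args(['-a', '1', '-a', '2', 'data.txt'])
print(appended.add, appended.path)

# An open-ended nargs works when the positional comes first ...
first = parser.parse_args(['data.txt', '--nums', '1', '2'])
print(first.nums, first.path)

# ... but it swallows the positional when the positional comes last,
# so argparse reports an error and exits.
try:
    parser.parse_args(['--nums', '1', '2', 'data.txt'])
    greedy_failed = False
except SystemExit:
    greedy_failed = True
print(greedy_failed)
```

If the open-ended form is still preferred, putting the positional arguments before the option avoids the ambiguity.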


Answer 1

I prefer passing a delimited string, which I parse later in the script. The reasons are that the list can then hold any type (int or str), and that nargs sometimes causes problems when there are multiple optional arguments and positional arguments.

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('-l', '--list', help='delimited list input', type=str)
args = parser.parse_args()
# Split the delimited string and convert each item.
my_list = [int(item) for item in args.list.split(',')]

Then,

python test.py -l "265340,268738,270774,270817" [other arguments]

or,

python test.py -l 265340,268738,270774,270817 [other arguments]

will work fine. The delimiter can be a space too, although that requires quotes around the argument value, as in the example in the question.
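
As a variation on the same idea, the splitting can be moved into the type callable so that argparse reports malformed input itself. This is only a sketch: the helper name comma_separated_ints and dest='values' are assumptions, not part of the answer above.

```python
import argparse

def comma_separated_ints(text):
    """Hypothetical helper: turn '1,2,3' into [1, 2, 3]."""
    try:
        return [int(item) for item in text.split(',')]
    except ValueError:
        # argparse turns ArgumentTypeError into a normal usage error.
        raise argparse.ArgumentTypeError(
            "expected comma-separated integers, got %r" % text)

parser = argparse.ArgumentParser()
parser.add_argument('-l', '--list', dest='values', type=comma_separated_ints,
                    help='delimited list input')

args = parser.parse_args(['-l', '265340,268738,270774,270817'])
print(args.values)  # [265340, 268738, 270774, 270817]
```

The design choice here is that the rest of the script receives a ready-made list and never sees the raw string.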


Answer 2

In addition to nargs, you might want to use choices if you know the allowed values in advance:

>>> parser = argparse.ArgumentParser(prog='game.py')
>>> parser.add_argument('move', choices=['rock', 'paper', 'scissors'])
>>> parser.parse_args(['rock'])
Namespace(move='rock')
>>> parser.parse_args(['fire'])
usage: game.py [-h] {rock,paper,scissors}
game.py: error: argument move: invalid choice: 'fire' (choose from 'rock',
'paper', 'scissors')
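
To tie choices back to the list question, the two keywords can be combined, in which case each supplied value is checked against the allowed set. A minimal sketch, assuming a hypothetical moves argument:

```python
import argparse

parser = argparse.ArgumentParser(prog='game.py')
# nargs='+' collects a list; choices validates every element of it.
parser.add_argument('moves', nargs='+', choices=['rock', 'paper', 'scissors'])

args = parser.parse_args(['rock', 'paper', 'rock'])
print(args.moves)  # ['rock', 'paper', 'rock']

# A single invalid element makes argparse print an error and exit.
try:
    parser.parse_args(['rock', 'fire'])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)
```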

Answer 3

Using nargs parameter in argparse’s add_argument method

I use nargs='*' as an add_argument parameter. I specifically chose nargs='*' so that the option picks up its defaults when I don't pass any explicit arguments.

Including a code snippet as an example:

Example: temp_args1.py

Please note: the sample code below is written for Python 3. By changing the print statement format, it can run in Python 2.

#!/usr/local/bin/python3.6

from argparse import ArgumentParser

description = 'testing for passing multiple arguments and to get list of args'
parser = ArgumentParser(description=description)
parser.add_argument('-i', '--item', action='store', dest='alist',
                    type=str, nargs='*', default=['item1', 'item2', 'item3'],
                    help="Examples: -i item1 item2, -i item3")
opts = parser.parse_args()

print("List of items: {}".format(opts.alist))

Note: I am collecting multiple string arguments, which get stored in the list opts.alist. If you want a list of integers, change the type parameter of parser.add_argument to int.

Execution Result:

python3.6 temp_args1.py -i item5 item6 item7
List of items: ['item5', 'item6', 'item7']

python3.6 temp_args1.py -i item10
List of items: ['item10']

python3.6 temp_args1.py
List of items: ['item1', 'item2', 'item3']
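
One case the runs above do not cover is passing the flag with no values at all. The sketch below reuses the same option definition and suggests that the default only applies when the flag is omitted entirely; the explicit argument lists are there so it runs without a command line.

```python
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument('-i', '--item', dest='alist', nargs='*',
                    default=['item1', 'item2', 'item3'])

print(parser.parse_args([]).alist)           # flag omitted: the default list
print(parser.parse_args(['-i']).alist)       # flag given without values: []
print(parser.parse_args(['-i', 'x']).alist)  # flag given with one value: ['x']
```

If an empty list is not acceptable, nargs='+' makes argparse reject a bare -i instead.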

Answer 4

If you are intending to make a single switch take multiple parameters, then you use nargs='+'. If your example ‘-l’ is actually taking integers:

import argparse

a = argparse.ArgumentParser()
a.add_argument(
    '-l', '--list',  # either of these switches
    nargs='+',       # one or more parameters to this switch
    type=int,        # /parameters/ are ints
    dest='list',     # store in 'list'.
    default=[],      # since we're not specifying required.
)

print(a.parse_args("-l 123 234 345 456".split(' ')))
print(a.parse_args("-l 123 -l=234 -l345 --list 456".split(' ')))

Produces

Namespace(list=[123, 234, 345, 456])
Namespace(list=[456])  # Attention!

If you specify the same argument multiple times, the default action ('store') replaces the existing data.

The alternative is to use the append action:

import argparse

a = argparse.ArgumentParser()
a.add_argument(
    '-l', '--list',  # either of these switches
    type=int,        # /parameters/ are ints
    dest='list',     # store in 'list'.
    default=[],      # since we're not specifying required.
    action='append', # add to the list instead of replacing it
)

print(a.parse_args("-l 123 -l=234 -l345 --list 456".split(' ')))

Which produces

Namespace(list=[123, 234, 345, 456])

Or you can write a custom handler/action to parse comma-separated values so that you could do

-l 123,234,345 -l 456
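
A custom action along those lines might look like the following. This is a sketch under assumptions: the class name SplitCommaAppend and dest='values' are invented for illustration; only argparse.Action and its __call__ signature come from the standard library.

```python
import argparse

class SplitCommaAppend(argparse.Action):
    """Hypothetical action: split each value on commas and extend the list."""

    def __call__(self, parser, namespace, values, option_string=None):
        # Copy first so the shared default list is never mutated.
        items = list(getattr(namespace, self.dest, None) or [])
        try:
            items.extend(int(v) for v in values.split(','))
        except ValueError:
            raise argparse.ArgumentError(
                self, "expected comma-separated integers, got %r" % values)
        setattr(namespace, self.dest, items)

parser = argparse.ArgumentParser()
parser.add_argument('-l', '--list', dest='values',
                    action=SplitCommaAppend, default=[])

args = parser.parse_args(['-l', '123,234,345', '-l', '456'])
print(args.values)  # [123, 234, 345, 456]
```

Repeated flags and comma-separated values can then be mixed freely on one command line.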

Answer 5

In add_argument(), type is just a callable object that receives a string and returns the option value.

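import argparse
import ast

def arg_as_list(s):
    # Parse a Python literal such as "[1,2,3,4]" without executing code.
    try:
        v = ast.literal_eval(s)
    except (ValueError, SyntaxError):
        raise argparse.ArgumentTypeError("Argument \"%s\" is not a valid literal" % (s))
    if type(v) is not list:
        raise argparse.ArgumentTypeError("Argument \"%s\" is not a list" % (s))
    return v


parser = argparse.ArgumentParser()
parser.add_argument("--list", type=arg_as_list, default=[],
                    help="List of values")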

This allows:

$ ./tool --list "[1,2,3,4]"

Answer 6

If you have a nested list where the inner lists have different types and lengths and you would like to preserve the type, e.g.,

[[1, 2], ["foo", "bar"], [3.14, "baz", 20]]

then you can use the solution proposed by @sam-mason to this question, shown below:

from argparse import ArgumentParser
import json

parser = ArgumentParser()
parser.add_argument('-l', type=json.loads)
parser.parse_args(['-l', '[[1,2],["foo","bar"],[3.14,"baz",20]]'])

which gives:

Namespace(l=[[1, 2], ['foo', 'bar'], [3.14, 'baz', 20]])
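
Since json.loads accepts any JSON value, not just arrays, a thin wrapper can reject input that is not a list. The following is a sketch; the json_list helper and dest='nested' are assumptions made for this example.

```python
import argparse
import json

def json_list(text):
    """Hypothetical helper: parse JSON and insist on a top-level array."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError("not valid JSON: %s" % exc.msg)
    if not isinstance(value, list):
        raise argparse.ArgumentTypeError("expected a JSON array")
    return value

parser = argparse.ArgumentParser()
parser.add_argument('-l', dest='nested', type=json_list)

args = parser.parse_args(['-l', '[[1,2],["foo","bar"],[3.14,"baz",20]]'])
print(args.nested)  # [[1, 2], ['foo', 'bar'], [3.14, 'baz', 20]]
```

On a shell command line the JSON text usually needs single quotes around it so that the inner double quotes survive.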

Answer 7

I want to handle passing multiple lists, integer values and strings.

Helpful link => How to pass a Bash variable to Python?

def main(args):
    my_args = []
    for arg in args:
        if arg.startswith("[") and arg.endswith("]"):
            arg = arg.replace("[", "").replace("]", "")
            my_args.append(arg.split(","))
        else:
            my_args.append(arg)

    print(my_args)


if __name__ == "__main__":
    import sys
    main(sys.argv[1:])

Order is not important. To pass a list, enclose it in "[" and "]" and separate the items with commas.

Then,

python test.py my_string 3 "[1,2]" "[3,4,5]"

Output => ['my_string', '3', ['1', '2'], ['3', '4', '5']]. The my_args variable contains the arguments in order.


Answer 8

I think the most elegant solution is to pass a lambda function to “type”, as mentioned by Chepner. In addition to this, if you do not know beforehand what the delimiter of your list will be, you can also pass multiple delimiters to re.split:

# python3 test.py -l "abc xyz, 123"

import re
import argparse

parser = argparse.ArgumentParser(description='Process a list.')
parser.add_argument('-l', '--list',
                    type=lambda s: re.split(' |, ', s),
                    required=True,
                    help='comma or space delimited list of characters')

args = parser.parse_args()
print(args.list)


# Output: ['abc', 'xyz', '123']
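
The pattern ' |, ' splits on exactly one space or one comma-plus-space, so doubled delimiters leave empty strings in the result. A slightly more tolerant variant, sketched here with a hypothetical split_list helper and dest='items', splits on any run of commas and whitespace and drops empty tokens:

```python
import argparse
import re

def split_list(text):
    """Hypothetical helper: split on any run of commas and/or whitespace."""
    return [token for token in re.split(r'[,\s]+', text) if token]

parser = argparse.ArgumentParser(description='Process a list.')
parser.add_argument('-l', '--list', dest='items', type=split_list,
                    required=True,
                    help='comma or space delimited list of items')

args = parser.parse_args(['-l', 'abc xyz, 123'])
print(args.items)  # ['abc', 'xyz', '123']
```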

Flatten an irregular list of lists

Question: Flatten an irregular list of lists

Yes, I know this subject has been covered before (here, here, here, here), but as far as I know, all solutions, except for one, fail on a list like this:

L = [[[1, 2, 3], [4, 5]], 6]

Where the desired output is

[1, 2, 3, 4, 5, 6]

Or perhaps even better, an iterator. The only solution I saw that works for an arbitrary nesting is found in this question:

def flatten(x):
    result = []
    for el in x:
        if hasattr(el, "__iter__") and not isinstance(el, basestring):
            result.extend(flatten(el))
        else:
            result.append(el)
    return result

flatten(L)

Is this the best model? Did I overlook something? Any problems?


Answer 0

Using generator functions can make your example a little easier to read and probably boost the performance.

Python 2

import collections

def flatten(l):
    for el in l:
        if isinstance(el, collections.Iterable) and not isinstance(el, basestring):
            for sub in flatten(el):
                yield sub
        else:
            yield el

I used the Iterable ABC added in 2.6.

Python 3

In Python 3, the basestring is no more, but you can use a tuple of str and bytes to get the same effect there.

The yield from expression yields the items of a sub-generator one at a time. This syntax for delegating to a subgenerator was added in Python 3.3.

from collections.abc import Iterable

def flatten(l):
    for el in l:
        # Treat str and bytes as atoms rather than iterating over them.
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Answer 1

My solution:

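from collections.abc import Iterable


def flatten(x):
    # str and bytes are iterable too; treating them as atoms avoids infinite recursion.
    if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
        return [a for i in x for a in flatten(i)]
    else:
        return [x]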

A little more concise, but pretty much the same.


Answer 2

Generator using recursion and duck typing (updated for Python 3):

def flatten(L):
    for item in L:
        try:
            yield from flatten(item)
        except TypeError:
            yield item

list(flatten([[[1, 2, 3], [4, 5]], 6]))
# [1, 2, 3, 4, 5, 6]
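
One caveat with pure duck typing is that a string is iterable and each of its characters is again a one-character string, so the recursion never bottoms out and ends in a RecursionError. A possible guard, sketched below, yields string-like objects whole; the atomic parameter is an assumption added for this example.

```python
def flatten(items, atomic=(str, bytes, bytearray)):
    """Duck-typed flatten that yields string-like objects whole.

    The `atomic` parameter is an assumption added for this sketch.
    """
    for item in items:
        if isinstance(item, atomic):
            yield item
            continue
        try:
            yield from flatten(item, atomic)
        except TypeError:
            # item is not iterable, so it is a leaf
            yield item

print(list(flatten([[[1, 2, 3], ['ab', b'cd']], 6])))
# [1, 2, 3, 'ab', b'cd', 6]
```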

Answer 3

Generator version of @unutbu’s non-recursive solution, as requested by @Andrew in a comment:

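import collections.abc

def genflat(l, ltypes=collections.abc.Sequence):
    # Note: str is also a Sequence, so pass e.g. ltypes=(list, tuple) if the input contains strings.
    l = list(l)
    i = 0
    while i < len(l):
        # Splice nested sequences in place; an empty sequence simply disappears.
        while i < len(l) and isinstance(l[i], ltypes):
            l[i:i + 1] = l[i]
        if i < len(l):
            yield l[i]
            i += 1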

Slightly simplified version of this generator:

import collections.abc

def genflat(l, ltypes=collections.abc.Sequence):
    l = list(l)
    while l:
        while l and isinstance(l[0], ltypes):
            l[0:1] = l[0]
        if l: yield l.pop(0)
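
Both versions above copy the input and modify the copy from the front, and l.pop(0) costs time proportional to the list length. A different non-recursive formulation, which is not the code from this answer but a sketch under the assumption that str and bytes should stay whole, keeps an explicit stack of iterators instead:

```python
from collections.abc import Iterable

def flatten_stack(items, atomic=(str, bytes)):
    """Hypothetical iterative flatten using a stack of iterators."""
    stack = [iter(items)]
    while stack:
        try:
            item = next(stack[-1])
        except StopIteration:
            stack.pop()          # innermost iterable is exhausted
            continue
        if isinstance(item, Iterable) and not isinstance(item, atomic):
            stack.append(iter(item))   # descend without recursion
        else:
            yield item

print(list(flatten_stack([[[1, 2, 3], [4, 5]], 6])))
# [1, 2, 3, 4, 5, 6]
```

Because nesting depth only grows the stack list, this form is not bounded by the interpreter's recursion limit.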

Answer 4

Here is my functional version of recursive flatten which handles both tuples and lists, and lets you throw in any mix of positional arguments. Returns a generator which produces the entire sequence in order, arg by arg:

flatten = lambda *n: (e for a in n
    for e in (flatten(*a) if isinstance(a, (tuple, list)) else (a,)))

Usage:

l1 = ['a', ['b', ('c', 'd')]]
l2 = [0, 1, (2, 3), [[4, 5, (6, 7, (8,), [9]), 10]], (11,)]
print(list(flatten(l1, -2, -1, l2)))
['a', 'b', 'c', 'd', -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

Answer 5

This version of flatten avoids Python’s recursion limit (and thus works with arbitrarily deep, nested iterables). It is a generator which can handle strings and arbitrary iterables (even infinite ones).

import itertools as IT
from collections.abc import Iterable

def flatten(iterable, ltypes=Iterable):
    remainder = iter(iterable)
    while True:
        try:
            first = next(remainder)
        except StopIteration:
            # Input exhausted; a bare StopIteration inside a generator becomes RuntimeError (PEP 479).
            return
        if isinstance(first, ltypes) and not isinstance(first, (str, bytes)):
            remainder = IT.chain(first, remainder)
        else:
            yield first

Here are some examples demonstrating its use:

print(list(IT.islice(flatten(IT.repeat(1)),10)))
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(list(IT.islice(flatten(IT.chain(IT.repeat(2,3),
                                       {10,20,30},
                                       'foo bar'.split(),
                                       IT.repeat(1),)),10)))
# [2, 2, 2, 10, 20, 30, 'foo', 'bar', 1, 1]

print(list(flatten([[1,2,[3,4]]])))
# [1, 2, 3, 4]

seq = ([[chr(i),chr(i-32)] for i in range(ord('a'), ord('z')+1)] + list(range(0,9)))
print(list(flatten(seq)))
# ['a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'f', 'F', 'g', 'G', 'h', 'H',
# 'i', 'I', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'N', 'o', 'O', 'p', 'P',
# 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'v', 'V', 'w', 'W', 'x', 'X',
# 'y', 'Y', 'z', 'Z', 0, 1, 2, 3, 4, 5, 6, 7, 8]

Although flatten can handle infinite generators, it cannot handle infinite nesting:

def infinitely_nested():
    while True:
        yield IT.chain(infinitely_nested(), IT.repeat(1))

print(list(IT.islice(flatten(infinitely_nested()), 10)))
# hangs

Answer 6

Here’s another answer that is even more interesting…

import re

def Flatten(TheList):
    a = str(TheList)
    b,crap = re.subn(r'[\[,\]]', ' ', a)
    c = b.split()
    d = [int(x) for x in c]

    return(d)

Basically, it converts the nested list to a string, uses a regex to strip out the nested syntax, and then converts the result back to a (flattened) list. Because of the int() conversion, it only works when every element is an integer.


Answer 7

def flatten(xs):
    res = []
    def loop(ys):
        for i in ys:
            if isinstance(i, list):
                loop(i)
            else:
                res.append(i)
    loop(xs)
    return res

Answer 8

You could use deepflatten from the 3rd party package iteration_utilities:

>>> from iteration_utilities import deepflatten
>>> L = [[[1, 2, 3], [4, 5]], 6]
>>> list(deepflatten(L))
[1, 2, 3, 4, 5, 6]

>>> list(deepflatten(L, types=list))  # only flatten "inner" lists
[1, 2, 3, 4, 5, 6]

It’s an iterator, so you need to iterate it (for example by wrapping it with list or using it in a loop). Internally it uses an iterative approach instead of a recursive approach, and it’s written as a C extension, so it can be faster than pure Python approaches:

>>> %timeit list(deepflatten(L))
12.6 µs ± 298 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit list(deepflatten(L, types=list))
8.7 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit list(flatten(L))   # Cristian - Python 3.x approach from https://stackoverflow.com/a/2158532/5393381
86.4 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit list(flatten(L))   # Josh Lee - https://stackoverflow.com/a/2158522/5393381
107 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit list(genflat(L, list))  # Alex Martelli - https://stackoverflow.com/a/2159079/5393381
23.1 µs ± 710 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I’m the author of the iteration_utilities library.


Answer 9

It was fun trying to create a function that could flatten an irregular list in Python, but of course that is what Python is for (to make programming fun). The following generator works fairly well, with some caveats:

def flatten(iterable):
    try:
        for item in iterable:
            yield from flatten(item)
    except TypeError:
        yield iterable

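It will also break up datatypes that you might want left alone: bytes and bytearray objects are split into integers, and any non-empty str sends it into unbounded recursion, because iterating a one-character string yields that same string again. Also, the code relies on the fact that requesting an iterator from a non-iterable raises a TypeError.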

>>> L = [[[1, 2, 3], [4, 5]], 6]
>>> def flatten(iterable):
    try:
        for item in iterable:
            yield from flatten(item)
    except TypeError:
        yield iterable


>>> list(flatten(L))
[1, 2, 3, 4, 5, 6]
>>>

Edit:

I disagree with the previous implementation. The problem is that you should not be able to flatten something that is not an iterable. It is confusing and gives the wrong impression of the argument.

>>> list(flatten(123))
[123]
>>>

The following generator is almost the same as the first but does not have the problem of trying to flatten a non-iterable object. It fails as one would expect when an inappropriate argument is given to it.

def flatten(iterable):
    for item in iterable:
        try:
            yield from flatten(item)
        except TypeError:
            yield item

Testing the generator with the list that was provided works fine. However, the new code raises a TypeError when a non-iterable object is given to it. An example of the new behavior is shown below.

>>> L = [[[1, 2, 3], [4, 5]], 6]
>>> list(flatten(L))
[1, 2, 3, 4, 5, 6]
>>> list(flatten(123))
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    list(flatten(123))
  File "<pyshell#27>", line 2, in flatten
    for item in iterable:
TypeError: 'int' object is not iterable
>>>
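
As a hedged sketch of one way around the first caveat, the variant below treats text-like values as leaves instead of iterating into them. The name flatten_keep_text and the atomic parameter are illustrative assumptions, not part of the answer above.

```python
def flatten_keep_text(iterable, atomic=(str, bytes, bytearray)):
    """Recursively flatten `iterable`, yielding `atomic` instances whole."""
    for item in iterable:
        if isinstance(item, atomic):
            # Text-like values are iterable, but here they count as leaves.
            yield item
            continue
        try:
            yield from flatten_keep_text(item, atomic)
        except TypeError:
            # `item` is not iterable, so it is a leaf.
            yield item


print(list(flatten_keep_text([["ab", [1, 2]], b"cd", 3])))
# ['ab', 1, 2, b'cd', 3]
```

Passing a different tuple as atomic changes which types are kept whole; everything else behaves like the second generator above.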

Answer 10

Although an elegant and very Pythonic answer has already been selected, I would like to present my solution for review:

def flat(l):
    ret = []
    for i in l:
        if isinstance(i, (list, tuple)):
            ret.extend(flat(i))
        else:
            ret.append(i)
    return ret

Please tell me how good or bad this code is.


Answer 11

I prefer simple answers. No generators. No recursion or recursion limits. Just iteration:

def flatten(TheList):
    listIsNested = True

    while listIsNested:                 #outer loop
        keepChecking = False
        Temp = []

        for element in TheList:         #inner loop
            if isinstance(element,list):
                Temp.extend(element)
                keepChecking = True
            else:
                Temp.append(element)

        listIsNested = keepChecking     #determine if outer loop exits
        TheList = Temp[:]

    return TheList

This works with two loops: an inner for loop and an outer while loop.

The inner for loop iterates through the list. If it finds a list element, it (1) uses list.extend() to flatten that element by one level of nesting and (2) switches keepChecking to True. keepChecking controls the outer while loop: if it gets set to True, the outer loop runs the inner loop for another pass.

Those passes keep happening until no more nested lists are found. When a pass finally occurs in which none are found, keepChecking is never set to True, which means listIsNested becomes False and the outer while loop exits.

The flattened list is then returned.

Test-run

flatten([1,2,3,4,[100,200,300,[1000,2000,3000]]])

[1, 2, 3, 4, 100, 200, 300, 1000, 2000, 3000]
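
To make the pass-by-pass behavior visible, here is a small sketch that yields the list after every pass. The name flatten_passes is hypothetical and the loop is a restatement of the idea above, not the author's exact code.

```python
def flatten_passes(the_list):
    """Yield the list after each one-level pass until nothing is nested."""
    current = list(the_list)
    while any(isinstance(element, list) for element in current):
        next_pass = []
        for element in current:
            if isinstance(element, list):
                next_pass.extend(element)   # unwrap exactly one level
            else:
                next_pass.append(element)
        current = next_pass
        yield current


for step in flatten_passes([1, [2, [3, [4]]]]):
    print(step)
# [1, 2, [3, [4]]]
# [1, 2, 3, [4]]
# [1, 2, 3, 4]
```

Each pass removes one level of nesting, so the number of passes equals the nesting depth, and every pass walks the whole list again.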


Answer 12

Here’s a simple function that flattens lists of arbitrary depth. No recursion, to avoid stack overflow.

def flatten_list(nested_list):
    """Flatten an arbitrarily nested list, without recursion (to avoid
    stack overflows). Yields the items one by one; the original list is
    unchanged.

    >>> list(flatten_list([1, 2, 3, [4], [], [[[[[[[[[5]]]]]]]]]]))
    [1, 2, 3, 4, 5]
    >>> list(flatten_list([[1, 2], 3]))
    [1, 2, 3]

    """
    # A shallow copy is enough: only the top-level list is popped from, and
    # `sublist + nested_list` builds a new list instead of mutating sublists.
    # (copy.deepcopy is itself recursive and would fail on very deep input.)
    nested_list = list(nested_list)

    while nested_list:
        sublist = nested_list.pop(0)

        if isinstance(sublist, list):
            nested_list = sublist + nested_list
        else:
            yield sublist
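
Since list.pop(0) and list concatenation both cost time proportional to the list length, a possible variation is to keep the pending items in a collections.deque. This is a sketch under that assumption; flatten_deque is an illustrative name, not part of the answer above.

```python
from collections import deque


def flatten_deque(nested_list):
    """Yield the leaves of nested lists, left to right, without recursion."""
    pending = deque(nested_list)          # shallow copy; the input is not mutated
    while pending:
        item = pending.popleft()          # O(1), unlike list.pop(0)
        if isinstance(item, list):
            # extendleft reverses its argument, so reverse first to keep order
            pending.extendleft(reversed(item))
        else:
            yield item


print(list(flatten_deque([1, 2, 3, [4], [], [[[[5]]]], 6])))
# [1, 2, 3, 4, 5, 6]
```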

Answer 13

I’m surprised no one has thought of this. I don’t really follow the recursive answers that the more advanced people here wrote, so here is my attempt. The caveat is that it’s very specific to the OP’s use case (a nested list of integers):

import re

L = [[[1, 2, 3], [4, 5]], 6]
flattened_list = re.sub(r"[\[\]]", "", str(L)).replace(" ", "").split(",")  # raw string avoids invalid-escape warnings
new_list = list(map(int, flattened_list))
print(new_list)

Output:

[1, 2, 3, 4, 5, 6]

Answer 14

I didn’t go through all the answers already available here, but here is a one-liner I came up with, borrowing from Lisp’s first/rest style of list processing:

def flatten(l): return flatten(l[0]) + (flatten(l[1:]) if len(l) > 1 else []) if type(l) is list else [l]

Here is one simple and one not-so-simple case:

>>> flatten([1,[2,3],4])
[1, 2, 3, 4]

>>> flatten([1, [2, 3], 4, [5, [6, {'name': 'some_name', 'age':30}, 7]], [8, 9, [10, [11, [12, [13, {'some', 'set'}, 14, [15, 'some_string'], 16], 17, 18], 19], 20], 21, 22, [23, 24], 25], 26, 27, 28, 29, 30])
[1, 2, 3, 4, 5, 6, {'age': 30, 'name': 'some_name'}, 7, 8, 9, 10, 11, 12, 13, set(['set', 'some']), 14, 15, 'some_string', 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
>>> 
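
One thing the one-liner does not cover is an empty list, since l[0] raises IndexError there. A hedged multi-line sketch of the same first/rest idea with an explicit empty-list base case follows; flatten_first_rest is an illustrative name.

```python
def flatten_first_rest(l):
    """First/rest flattening that also accepts empty lists."""
    if type(l) is not list:
        return [l]                      # a leaf becomes a one-element list
    if not l:
        return []                       # base case: nothing left to flatten
    first, rest = l[0], l[1:]
    return flatten_first_rest(first) + flatten_first_rest(rest)


print(flatten_first_rest([1, [], [2, [3, []]], 4]))
# [1, 2, 3, 4]
```

As with the one-liner, recursing on the rest of the list costs one stack frame per element, so this style is only suitable for lists much shorter than the interpreter's recursion limit.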

Answer 15

When trying to answer such a question you really need to state the limitations of the code you propose as a solution. If it were only about performance I wouldn’t mind too much, but most of the code proposed as a solution (including the accepted answer) fails to flatten any list nested more than about 1000 levels deep.

When I say most of the code, I mean all code that uses any form of recursion (or calls a standard library function that is recursive). All of it fails because every recursive call grows the call stack by one frame, and Python’s default recursion limit is 1000.

If you’re not too familiar with the call stack, then maybe the following will help (otherwise you can just scroll to the Implementation).

Call stack size and recursive programming (dungeon analogy)

Finding the treasure and exit

Imagine you enter a huge dungeon with numbered rooms, looking for a treasure. You don’t know the place, but you have some indications on how to find the treasure. Each indication is a riddle (difficulty varies, and you can’t predict how hard they will be). You decide to think a little about a strategy to save time, and you make two observations:

  1. It’s hard (long) to find the treasure as you’ll have to solve (potentially hard) riddles to get there.
  2. Once the treasure is found, returning to the entrance may be easy: you just have to follow the same path in the other direction (though this needs a bit of memory to recall your path).

When entering the dungeon, you notice a small notebook. You decide to use it to write down every room you exit after solving a riddle (when entering a new room), so that you’ll be able to find your way back to the entrance. That’s a genius idea, and you won’t even spend a cent implementing your strategy.

You enter the dungeon and solve the first 1001 riddles with great success, but then something happens that you hadn’t planned for: there is no space left in the notebook you borrowed. You decide to abandon your quest, as you would rather go without the treasure than be lost forever inside the dungeon (which seems smart indeed).

Executing a recursive program

Basically, it’s exactly the same thing as finding the treasure. The dungeon is the computer’s memory, and your goal now is not to find a treasure but to compute some function (find f(x) for a given x). The indications are simply subroutines that will help you solve f(x). Your strategy is the same as the call stack’s strategy: the notebook is the stack, and the rooms are the functions’ return addresses:

x = ["over here", "am", "I"]
y = sorted(x) # You're about to enter a room named `sorted`, note down the current room address here so you can return back: 0x4004f4 (that room address looks weird)
# Seems like you went back from your quest using the return address 0x4004f4
# Let's see what you've collected 
print(' '.join(y))

The problem you encountered in the dungeon is the same here: the call stack has a finite size (here 1000), so if you enter too many functions without returning, you’ll fill the call stack and get an error that looks like “Dear adventurer, I’m very sorry but your notebook is full”: RecursionError: maximum recursion depth exceeded. Note that you don’t need recursion to fill the call stack, but it’s very unlikely that a non-recursive program calls 1000 functions without ever returning.

It’s also important to understand that once you return from a function, the call stack is freed from the address used (hence the name “stack”: return addresses are pushed before entering a function and popped when returning). In the special case of a simple recursion (a function f that calls itself once, over and over) you will enter f again and again until the computation is finished (until the treasure is found), and then return from f until you are back at the place where you first called f. The call stack is never freed from anything until the end, when it is freed from all return addresses one after the other.
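
The size of the "notebook" can be inspected directly. The following is a small sketch, assuming CPython, that reads the configured limit and measures how deep a trivial recursion actually gets; depth_reached is an illustrative helper name.

```python
import sys


def depth_reached(n=0):
    """Recurse until the interpreter refuses, then report the depth."""
    try:
        return depth_reached(n + 1)
    except RecursionError:
        # The notebook is full: report how many rooms were entered.
        return n


print(sys.getrecursionlimit())   # typically 1000 unless it was changed
print(depth_reached())           # somewhat below the limit
```

The measured depth is lower than the limit because the frames already on the stack when depth_reached is first called count as well.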

How to avoid this issue?

That’s actually pretty simple: “don’t use recursion if you don’t know how deep it can go”. That’s not always true, since in some languages tail-call recursion can be optimized (TCO). But Python does not do this, so even a “well written” recursive function will not optimize its stack use. There is an interesting post from Guido about this question: Tail Recursion Elimination.

There is a technique that you can use to make any recursive function iterative; we could call it bring your own notebook. In our particular case we are simply exploring a list, and entering a room is equivalent to entering a sublist, so the question to ask yourself is: how can I get back from a list to its parent list? The answer is not that complex. Repeat the following until the stack is empty:

  1. push the current list address and index in a stack when entering a new sublist (note that a list address+index is also an address, therefore we just use the exact same technique used by the call stack);
  2. every time an item is found, yield it (or add it to a list);
  3. once a list is fully explored, go back to the parent list using the stack return address (and index).
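
The three steps above can be sketched literally with a stack of (list, index) pairs. This is a minimal illustration under that reading; the name flatten_with_index_stack is an assumption, and the answer's own implementation further down uses iterators instead.

```python
def flatten_with_index_stack(nested):
    """Depth-first flatten using an explicit stack of (list, index) pairs."""
    flat_list = []
    stack = [(nested, 0)]                   # the "notebook": where to resume
    while stack:
        current, index = stack.pop()
        if index == len(current):
            continue                        # fully explored: back to the parent
        stack.append((current, index + 1))  # remember where to resume here
        item = current[index]
        if isinstance(item, list):
            stack.append((item, 0))         # enter the sublist
        else:
            flat_list.append(item)
    return flat_list


print(flatten_with_index_stack([0, [1, 2], 3, 4]))
# [0, 1, 2, 3, 4]
```

The stack never holds more entries than the nesting depth of the input, so memory use depends on depth rather than on Python's recursion limit.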

Also note that this is equivalent to a DFS in a tree where some nodes are sublists (A = [1, 2]) and some are simple items (0, 1, 2, 3, 4), for L = [0, [1,2], 3, 4]. The tree looks like this:

                    L
                    |
           -------------------
           |     |     |     |
           0   --A--   3     4
               |   |
               1   2

The DFS pre-order traversal is: L, 0, A, 1, 2, 3, 4. Remember, in order to implement an iterative DFS you also “need” a stack. The implementation I proposed above results in the following states (for the stack and the flat_list):

init.:  stack=[(L, 0)]
**0**:  stack=[(L, 0)],         flat_list=[0]
**A**:  stack=[(L, 1), (A, 0)], flat_list=[0]
**1**:  stack=[(L, 1), (A, 0)], flat_list=[0, 1]
**2**:  stack=[(L, 1), (A, 1)], flat_list=[0, 1, 2]
**3**:  stack=[(L, 2)],         flat_list=[0, 1, 2, 3]
**4**:  stack=[(L, 3)],         flat_list=[0, 1, 2, 3, 4]
return: stack=[],               flat_list=[0, 1, 2, 3, 4]

In this example, the maximum stack size is 2, because the input list (and therefore the tree) has depth 2.

Implementation

For the implementation, in Python you can simplify a little by using iterators instead of plain lists. References to the (sub)iterators are used to store the sublists’ return addresses (instead of storing both the list address and the index). This is not a big difference, but I feel it is more readable (and also a bit faster):

def flatten(iterable):
    return list(items_from(iterable))

def items_from(iterable):
    cursor_stack = [iter(iterable)]
    while cursor_stack:
        sub_iterable = cursor_stack[-1]
        try:
            item = next(sub_iterable)
        except StopIteration:   # post-order
            cursor_stack.pop()
            continue
        if is_list_like(item):  # pre-order
            cursor_stack.append(iter(item))
        elif item is not None:
            yield item          # in-order

def is_list_like(item):
    return isinstance(item, list)

Also, notice that is_list_like uses isinstance(item, list), which could be changed to handle more input types; here I just wanted the simplest version, where an iterable is just a list. But you could also do this:

def is_list_like(item):
    try:
        iter(item)
        return not isinstance(item, str)  # strings are not lists (hmm...) 
    except TypeError:
        return False

This treats strings as “simple items”, so flatten([["test", "a"], "b"]) returns ["test", "a", "b"] and not ["t", "e", "s", "t", "a", "b"]. Note that in this case iter(item) is called twice on each item; let’s pretend it’s an exercise for the reader to make this cleaner.

Testing and remarks on other implementations

Finally, remember that you can’t print a very deeply nested list L using print(L), because internally it makes recursive calls to __repr__ (RecursionError: maximum recursion depth exceeded while getting the repr of an object). For the same reason, flatten solutions that involve str will fail with the same error message.

If you need to test your solution, you can use this function to generate a simple nested list:

def build_deep_list(depth):
    """Returns a list of the form $l_{depth} = [depth-1, l_{depth-1}]$
    with $depth > 1$ and $l_1 = [0]$.
    """
    sub_list = [0]
    for d in range(1, depth):
        sub_list = [d, sub_list]
    return sub_list

For example, build_deep_list(5) returns [4, [3, [2, [1, [0]]]]].
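
As a hedged illustration of how such a test might look, the sketch below builds a list twice as deep as the current recursion limit and compares an iterator-stack flatten with a plain recursive one. Both flatten_iterative and flatten_recursive are restated here under illustrative names so the block runs on its own; unlike items_from above, this restatement does not skip None items.

```python
import sys


def build_deep_list(depth):
    sub_list = [0]
    for d in range(1, depth):
        sub_list = [d, sub_list]
    return sub_list


def flatten_iterative(nested):
    """Iterator-stack flatten: the stack lives on the heap, not the call stack."""
    result, cursor_stack = [], [iter(nested)]
    while cursor_stack:
        try:
            item = next(cursor_stack[-1])
        except StopIteration:
            cursor_stack.pop()
            continue
        if isinstance(item, list):
            cursor_stack.append(iter(item))
        else:
            result.append(item)
    return result


def flatten_recursive(nested):
    """A plain recursive flatten, used only for comparison."""
    result = []
    for item in nested:
        if isinstance(item, list):
            result.extend(flatten_recursive(item))
        else:
            result.append(item)
    return result


depth = sys.getrecursionlimit() * 2
deep = build_deep_list(depth)

# The iterative version handles the deep list and returns depth-1 down to 0.
assert flatten_iterative(deep) == list(range(depth - 1, -1, -1))

try:
    flatten_recursive(deep)
except RecursionError as exc:
    print("recursive version failed:", exc)
```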


Answer 16

Here’s the compiler.ast.flatten implementation in 2.7.5:

def flatten(seq):
    l = []
    for elt in seq:
        t = type(elt)
        if t is tuple or t is list:
            for elt2 in flatten(elt):
                l.append(elt2)
        else:
            l.append(elt)
    return l

There are better, faster methods (if you’ve reached this point, you have already seen them).

Also note:

Deprecated since version 2.6: The compiler package has been removed in Python 3.


Answer 17

Totally hacky, but I think it would work (depending on your data type):

import ast, re
flat_list = ast.literal_eval("[%s]" % re.sub(r"[\[\]]", "", str(the_list)))

Answer 18

Just use the funcy library: pip install funcy

import funcy


funcy.flatten([[[[1, 1], 1], 2], 3]) # returns generator
funcy.lflatten([[[[1, 1], 1], 2], 3]) # returns list

Answer 19

Here is another Python 2 approach. I’m not sure whether it’s the fastest, the most elegant, or the safest …

from collections import Iterable
from itertools import imap, repeat, chain


def flat(seqs, ignore=(int, long, float, basestring)):
    return repeat(seqs, 1) if any(imap(isinstance, repeat(seqs), ignore)) or not isinstance(seqs, Iterable) else chain.from_iterable(imap(flat, seqs))

It can ignore any specific (or derived) type you would like. It returns an iterator, so you can convert it to any specific container such as list, tuple or dict, or simply consume it in order to reduce the memory footprint. For better or worse, it can also handle an initial non-iterable object such as an int …

Note that most of the heavy lifting is done in C, since as far as I know that’s how itertools is implemented. So while it is recursive, AFAIK it isn’t bounded by Python’s recursion depth, since the function calls happen in C. This doesn’t mean you aren’t bounded by memory, though, especially on OS X, where the stack size has a hard limit as of today (OS X Mavericks) …

There is a slightly faster but less portable approach. Only use it if you can assume that the base elements of the input can be explicitly determined; otherwise you’ll get infinite recursion, and OS X, with its limited stack size, will throw a segmentation fault fairly quickly …

def flat(seqs, ignore={int, long, float, str, unicode}):
    return repeat(seqs, 1) if type(seqs) in ignore or not isinstance(seqs, Iterable) else chain.from_iterable(imap(flat, seqs))

Here we use a set to check the type, so it takes O(1) rather than O(number of types) to decide whether an element should be ignored. Of course, any value whose type is derived from one of the listed ignored types will not be recognized, which is why the set lists str and unicode explicitly, so use it with caution …

Tests:

import random

def test_flat(test_size=2000):
    def increase_depth(value, depth=1):
        for func in xrange(depth):
            value = repeat(value, 1)
        return value

    def random_sub_chaining(nested_values):
        for values in nested_values:
            yield chain((values,), chain.from_iterable(imap(next, repeat(nested_values, random.randint(1, 10)))))

    expected_values = zip(xrange(test_size), imap(str, xrange(test_size)))
    nested_values = random_sub_chaining((increase_depth(value, depth) for depth, value in enumerate(expected_values)))
    assert not any(imap(cmp, chain.from_iterable(expected_values), flat(chain(((),), nested_values, ((),)))))

>>> test_flat()
>>> list(flat([[[1, 2, 3], [4, 5]], 6]))
[1, 2, 3, 4, 5, 6]
>>>  

$ uname -a
Darwin Samys-MacBook-Pro.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64
$ python --version
Python 2.7.5

Answer 20

Without using any library:

def flat(l):
    def _flat(l, r):    
        if type(l) is not list:
            r.append(l)
        else:
            for i in l:
                r = r + flat(i)
        return r
    return _flat(l, [])



# example
test = [[1], [[2]], [3], [['a','b','c'] , [['z','x','y']], ['d','f','g']], 4]    
print(flat(test))  # prints [1, 2, 3, 'a', 'b', 'c', 'z', 'x', 'y', 'd', 'f', 'g', 4]

Answer 21

Using itertools.chain:

import itertools
from collections.abc import Iterable  # Python 3.3+; the plain `collections` alias was removed in 3.10

def list_flatten(lst):
    flat_lst = []
    for item in itertools.chain(lst):
        if isinstance(item, Iterable):
            item = list_flatten(item)
            flat_lst.extend(item)
        else:
            flat_lst.append(item)
    return flat_lst

Or without chaining:

def flatten(q, final):
    if not q:
        return
    if isinstance(q, list):
        if not isinstance(q[0], list):
            final.append(q[0])
        else:
            flatten(q[0], final)
        flatten(q[1:], final)
    else:
        final.append(q)

Answer 22

I used recursion to handle nested lists of any depth:

def combine_nlist(nlist,init=0,combiner=lambda x,y: x+y):
    '''
    apply function: combiner to a nested list element by element(treated as flatten list)
    '''
    current_value=init
    for each_item in nlist:
        if isinstance(each_item,list):
            current_value =combine_nlist(each_item,current_value,combiner)
        else:
            current_value = combiner(current_value,each_item)
    return current_value

Once combine_nlist is defined, it is easy to use it for flattening. Or you can combine the two into one function. I like my solution because it can be applied to any nested list.

def flatten_nlist(nlist):
    return combine_nlist(nlist,[],lambda x,y:x+[y])

Result:

In [379]: flatten_nlist([1,2,3,[4,5],[6],[[[7],8],9],10])
Out[379]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
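
Because combine_nlist is a fold over the leaves, the same function can be reused with other combiners. The sketch below restates it so the block runs on its own and shows a few such uses; append_in_place is a hypothetical helper, not part of the answer.

```python
def combine_nlist(nlist, init=0, combiner=lambda x, y: x + y):
    """Fold `combiner` over every leaf of a nested list, left to right."""
    current_value = init
    for each_item in nlist:
        if isinstance(each_item, list):
            current_value = combine_nlist(each_item, current_value, combiner)
        else:
            current_value = combiner(current_value, each_item)
    return current_value


def append_in_place(acc, item):
    """Combiner that grows one list instead of copying it at every step."""
    acc.append(item)
    return acc


nested = [1, 2, 3, [4, 5], [6], [[[7], 8], 9], 10]

print(combine_nlist(nested))                             # sum of leaves: 55
print(combine_nlist(nested, 0, lambda acc, _: acc + 1))  # number of leaves: 10
print(combine_nlist(nested, 1, max))                     # largest leaf: 10
print(combine_nlist(nested, [], append_in_place))        # flattened list
```

The lambda x,y: x+[y] combiner builds a new list for every leaf, so for long inputs an in-place combiner like the one sketched here may be noticeably cheaper.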

Answer 23

The easiest way is to use the morph library, installed with pip install morph.

The code is:

import morph

nested = [[[1, 2, 3], [4, 5]], 6]  # named `nested` to avoid shadowing the built-in `list`
flattened_list = morph.flatten(nested)  # returns [1, 2, 3, 4, 5, 6]

Answer 24

I am aware that there are already many awesome answers, but I wanted to add one that takes a functional-programming approach. This answer uses double recursion:

def flatten_list(seq):
    if not seq:
        return []
    elif isinstance(seq[0],list):
        return (flatten_list(seq[0])+flatten_list(seq[1:]))
    else:
        return [seq[0]]+flatten_list(seq[1:])

print(flatten_list([1,2,[3,[4],5],[6,7]]))

output:

[1, 2, 3, 4, 5, 6, 7]
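One trade-off of the double recursion is that seq[1:] copies the rest of the list on every call, and the recursion on the tail goes one frame deeper per element, so a long flat list can run into CPython's default recursion limit (about 1000 frames). A generator-based sketch that recurses only on nesting depth is shown below; iter_flatten is a name chosen here for illustration and is not part of the answer above.

```python
def iter_flatten(seq):
    """Yield the leaves of a nested list depth first, without slicing."""
    for item in seq:
        if isinstance(item, list):
            # Recurse only when we meet a sublist, so depth tracks nesting, not length.
            yield from iter_flatten(item)
        else:
            yield item


print(list(iter_flatten([1, 2, [3, [4], 5], [6, 7]])))
```

Wrapping the call in list() gives the same kind of result as flatten_list.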

Answer 25

I’m not sure if this is necessarily quicker or more effective, but this is what I do:

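import ast

def flatten(lst):
    # literal_eval only parses Python literals, so it cannot run arbitrary code the way eval can
    return ast.literal_eval('[' + str(lst).replace('[', '').replace(']', '') + ']')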

L = [[[1, 2, 3], [4, 5]], 6]
print(flatten(L))

The flatten function here turns the list into a string, takes out all of the square brackets, attaches square brackets back onto the ends, and turns it back into a list.

Although, if you knew you would have square brackets in your list in strings, like [[1, 2], "[3, 4] and [5]"], you would have to do something else.


Answer 26

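This is a simple implementation of flatten. It was written for Python 2; importing reduce from functools and calling print as a function lets it run on Python 3 as well:

from functools import reduce  # built in on Python 2, must be imported on Python 3

flatten = lambda l: reduce(lambda x, y: x + y, map(flatten, l), []) if isinstance(l, list) else [l]

test = [[1, 2, 3, [3, 4, 5], [6, 7, [8, 9, [10, [11, [12, 13, 14]]]]]],]
print(flatten(test))

# output: [1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]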

Answer 27

This will flatten a list or dictionary (or a list of lists, a dictionary of dictionaries, etc.). It assumes that the values are strings and builds a single string that joins the items with the separator argument my_sep. If you want, you can split the result on that separator afterward to get a list. It recurses whenever the next value is a list or a dictionary. Use the key argument to choose whether dictionary keys (key=True, the default) or values (key=False) are used.

def flatten_obj(n_obj, key=True, my_sep=''):
    my_string = ''
    if type(n_obj) == list:
        for val in n_obj:
            my_sep_setter = my_sep if my_string != '' else ''
            if type(val) == list or type(val) == dict:
                my_string += my_sep_setter + flatten_obj(val, key, my_sep)
            else:
                my_string += my_sep_setter + val
    elif type(n_obj) == dict:
        for k, v in n_obj.items():
            my_sep_setter = my_sep if my_string != '' else ''
            d_val = k if key else v
            if type(v) == list or type(v) == dict:
                my_string += my_sep_setter + flatten_obj(v, key, my_sep)
            else:
                my_string += my_sep_setter + d_val
    elif type(n_obj) == str:
        my_sep_setter = my_sep if my_string != '' else ''
        my_string += my_sep_setter + n_obj
        return my_string
    return my_string

print(flatten_obj(['just', 'a', ['test', 'to', 'try'], 'right', 'now', ['or', 'later', 'today'],
                [{'dictionary_test': 'test'}, {'dictionary_test_two': 'later_today'}, 'my power is 9000']], my_sep=', '))

yields:

just, a, test, to, try, right, now, or, later, today, dictionary_test, dictionary_test_two, my power is 9000

Answer 28

If you like recursion, this might be a solution of interest to you:

def f(E):
    if E==[]: 
        return []
    elif type(E) != list: 
        return [E]
    else:
        a = f(E[0])
        b = f(E[1:])
        a.extend(b)
        return a

I actually adapted this from some practice Scheme code that I had written a while back.

Enjoy!


Answer 29

I'm new to Python and come from a Lisp background. This is what I came up with (check out the variable names for lulz):

def flatten(lst):
    # Base case: nothing left to flatten.
    if not lst:
        return []
    car, *cdr = lst
    if isinstance(car, (list, tuple)):
        return flatten(car) + flatten(cdr)
    return [car] + flatten(cdr)

Seems to work. Test:

flatten((1,2,3,(4,5,6,(7,8,(((1,2)))))))

returns:

[1, 2, 3, 4, 5, 6, 7, 8, 1, 2]

How do I convert strings to integers in Python?

Question: How do I convert strings to integers in Python?

I have a tuple of tuples from a MySQL query like this:

T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))

I’d like to convert all the string elements into integers and put them back into a list of lists:

T2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

I tried to achieve it with eval but didn’t get any decent result yet.


Answer 0

int() is the Python standard built-in function to convert a string into an integer value. You call it with a string containing a number as the argument, and it returns the number converted to an integer:

print (int("1") + 1)

The above prints 2.

If you know the structure of T1 (a sequence of sequences, nested only one level deep), you could do this in Python 2:

T2 = [map(int, x) for x in T1]

In Python 3:

T2 = [list(map(int, x)) for x in T1]
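As a quick check, here is the Python 3 form applied to the data from the question. It also shows that int() accepts leading zeros and surrounding whitespace in a decimal string, so '07' becomes 7. This is only a small sketch using the question's own T1.

```python
T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))

# map() is lazy in Python 3, so each row is wrapped in list() to materialise it.
T2 = [list(map(int, row)) for row in T1]
print(T2)
```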

Answer 1

You can do this with a list comprehension:

T2 = [[int(column) for column in row] for row in T1]

The inner list comprehension ([int(column) for column in row]) builds a list of ints from a sequence of int-able objects, like decimal strings, in row. The outer list comprehension ([... for row in T1]) builds a list of the results of the inner list comprehension applied to each item in T1.

The code snippet will fail if any of the rows contain objects that can’t be converted by int. You’ll need a smarter function if you want to process rows containing non-decimal strings.

If you know the structure of the rows, you can replace the inner list comprehension with a call to a function of the row. For example:

T2 = [parse_a_row_of_T1(row) for row in T1]
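What such a row function looks like depends on the data. As one possible sketch, the hypothetical helper below (parse_row, with a made-up default parameter) converts what it can and substitutes a placeholder for items that int() rejects. The sample rows are invented for illustration.

```python
def parse_row(row, default=None):
    """Convert each item with int(); use `default` when conversion fails."""
    parsed = []
    for item in row:
        try:
            parsed.append(int(item))
        except (TypeError, ValueError):
            # int() raises ValueError for strings like 'n/a' and TypeError for None.
            parsed.append(default)
    return parsed


rows = (('13', '17'), ('07', 'n/a', None))
result = [parse_row(row) for row in rows]
print(result)
```

Whether to substitute a default, skip the item, or re-raise is a design choice that depends on what a bad value means in your data.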

Answer 2

I would prefer using only list comprehensions:

[[int(y) for y in x] for x in T1]

Answer 3

Instead of int(), use float(), which accepts decimal strings such as '3.5' as well as integer strings. Note that every result is then a float.


Answer 4

I agree with everyone's answers so far, but the problem is that they will crash if not every string is an integer.

If you want to exclude non-integers, then:

T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))
# The isdigit() filter belongs in the inner generator, where a is defined.
new_list = list(list(int(a) for a in b if a.isdigit()) for b in T1)

This keeps only the strings made up entirely of digits. The reason I don't use list comprehensions directly is that, in Python 2, a list comprehension leaks its loop variable into the enclosing scope (Python 3 no longer does this).


Answer 5

T3 = []

for i in range(len(T1)):
    T3.append([])                # start a new inner list for row i
    for j in range(len(T1[i])):
        b = int(T1[i][j])        # convert one string to an integer
        T3[i].append(b)

print(T3)

Answer 6

Try this.

x = "1"

x is a string because it has quotes around it, but it has a number in it.

x = int(x)

Since x has the number 1 in it, I can turn it into an integer.

To see if a string is a number, you can do this.

def is_number(var):
    try:
        int(var)  # raises ValueError if var is not a valid integer string
        return True
    except ValueError:
        return False

x = "1"

y = "test"

x_test = is_number(x)

print(x_test)

It should print True in IDLE, because x is a number.

y_test = is_number(y)

print(y_test)

It should print False in IDLE, because y is not a number.


Answer 7

Using a list comprehension together with map:

T2 = [list(map(int, row)) for row in T1]

Answer 8

In Python 3.5.1 things like these work:

c = input('Enter number:')
print (int(float(c)))
print (round(float(c)))

and

Enter number:  4.7
4
5

George.


Answer 9

See this function

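def parse_int(s):
    try:
        # float() accepts strings such as '10.5'; int() then truncates toward zero.
        return int(float(s))
    except (TypeError, ValueError, OverflowError):
        return None  # the input cannot be converted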

Then

val = parse_int('10')  # Return 10
val = parse_int('0')  # Return 0
val = parse_int('10.5')  # Return 10
val = parse_int('0.0')  # Return 0
val = parse_int('Ten')  # Return None

You can also check

if val is None:  # True if the input value cannot be converted
    pass  # Note: don't use 'if not val:', because a valid 0 is also falsy

Answer 10

Yet another functional solution for Python 2:

from functools import partial

map(partial(map, int), T1)

Python 3 will be a little bit messy though:

list(map(list, map(partial(map, int), T1)))

we can fix this with a wrapper

def oldmap(f, iterable):
    return list(map(f, iterable))

oldmap(partial(oldmap, int), T1)

Answer 11

If it’s only a tuple of tuples, something like rows=[map(int, row) for row in rows] will do the trick. (There’s a list comprehension and a call to map(f, lst), which is equal to [f(a) for a in lst], in there.)

Eval is not what you want to do, in case there’s something like __import__("os").unlink("importantsystemfile") in your database for some reason. Always validate your input (if with nothing else, the exception int() will raise if you have bad input).


Answer 12

You can do something like this:

T1 = (('13', '17', '18', '21', '32'),  
     ('07', '11', '13', '14', '28'),  
     ('01', '05', '06', '08', '15', '16'))  
new_list = list(list(int(a) for a in b if a.isdigit()) for b in T1)  
print(new_list)  

Answer 13

I want to share an available option that doesn’t seem to be mentioned here yet:

numpy.random.permutation(x)

Will generate a random permutation of array x. Not exactly what you asked for, but it is a potential solution to similar questions.


Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?

Question: Get statistics for each group (such as count, mean, etc.) using pandas GroupBy?

I have a data frame df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

Done this way, I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many values were used to compute each mean. For example, the first group has 8 values, the second has 10, and so on.

In short: How do I get group-wise statistics for a dataframe?


Answer 0

On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
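To see what the list form of agg returns, here is a small self-contained sketch. The data frame is made up for illustration and only reuses the column names from the question.

```python
import pandas as pd

# Hypothetical sample data with two groups: (A, B) and (C, D).
df = pd.DataFrame({
    'col1': ['A', 'A', 'C', 'C', 'C'],
    'col2': ['B', 'B', 'D', 'D', 'D'],
    'col3': [1.0, 3.0, 2.0, 4.0, 6.0],
    'col4': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Every remaining column gets both statistics, giving two-level column labels
# such as ('col3', 'mean') and ('col3', 'count').
stats = df.groupby(['col1', 'col2']).agg(['mean', 'count'])
print(stats)
```

Individual cells can be read with a row key and a column key, for example stats.loc[('A', 'B'), ('col3', 'mean')].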

Answer 1

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let’s use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
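In pandas 0.25 and later, named aggregation can build a similar flat table in a single agg call instead of several joins. The sketch below is an alternative to the join approach rather than the answer's own method, and it runs on made-up data.

```python
import pandas as pd

# Hypothetical sample data with two groups: (A, B) and (C, D).
df = pd.DataFrame({
    'col1': ['A', 'A', 'C', 'C', 'C'],
    'col2': ['B', 'B', 'D', 'D', 'D'],
    'col3': [1.0, 3.0, 2.0, 4.0, 6.0],
    'col4': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Each keyword is an output column name; each value is (source column, function).
summary = (
    df.groupby(['col1', 'col2'])
      .agg(
          counts=('col3', 'size'),
          col3_mean=('col3', 'mean'),
          col4_median=('col4', 'median'),
          col4_min=('col4', 'min'),
      )
      .reset_index()
)
print(summary)
```

The output columns come out flat and already named, so no rename step is needed.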



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
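The difference is easy to see on a tiny example. In the sketch below (made-up data, hypothetical column names key and val), size() counts every row in the group, while count() counts only the non-null values that actually enter the mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'A', 'A', 'B'],
    'val': [1.0, np.nan, 3.0, 5.0],
})

gb = df.groupby('key')['val']
report = pd.DataFrame({
    'rows': gb.size(),       # all rows in the group, NaN included
    'non_null': gb.count(),  # only the values that enter the mean
    'mean': gb.mean(),       # NaN is skipped silently
})
print(report)
```

For group A the mean of 2.0 is computed from two values even though the group has three rows.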


Answer 2

One Function to Rule Them All: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns: change ['C'] to [['C', 'D']], or remove the selection altogether, and the result is a DataFrame with MultiIndex columns.

You also get different statistics for string data. Here’s an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


Answer 3

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

Answer 4

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df



Answer 5

Create a groupby object and call methods on it, as in the example below:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Answer 6

Please try this code:

# transform('count') returns one value per original row, so it aligns with df
df['count_it'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
df

I think this will add a column called 'count_it' that holds, for every row, the number of non-null col3 values in that row's group.


Is it a good practice to use try-except-else in Python?

Question: Is it a good practice to use try-except-else in Python?

From time to time in Python, I see the block:

try:
   try_this(whatever)
except SomeException as exception:
   #Handle exception
else:
   return something

What is the reason for the try-except-else to exist?

I do not like that kind of programming, as it is using exceptions to perform flow control. However, if it is included in the language, there must be a good reason for it, isn't there?

It is my understanding that exceptions are not errors, and that they should only be used for exceptional conditions (e.g. I try to write a file into disk and there is no more space, or maybe I do not have permission), and not for flow control.

Normally I handle exceptions as:

something = some_default_value
try:
    something = try_this(whatever)
except SomeException as exception:
    #Handle exception
finally:
    return something

Or if I really do not want to return anything if an exception happens, then:

try:
    something = try_this(whatever)
    return something
except SomeException as exception:
    #Handle exception

Answer 0

“I do not know if it is out of ignorance, but I do not like that kind of programming, as it is using exceptions to perform flow control.”

In the Python world, using exceptions for flow control is common and normal.

Even the Python core developers use exceptions for flow control, and that style is heavily baked into the language (e.g., the iterator protocol uses StopIteration to signal loop termination).
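As a rough illustration of the StopIteration point, a for loop can be spelled out by hand. This is only a sketch of the protocol; manual_for is a made-up helper name, not part of the standard library:

```python
def manual_for(iterable):
    """Collect items the way a for loop does, using next() and StopIteration."""
    iterator = iter(iterable)
    collected = []
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            # Normal end of iteration, signalled with an exception.
            break
        else:
            collected.append(item)
    return collected
```

Calling manual_for([1, 2, 3]) walks the list exactly as a for loop would, and the exception is how the loop learns it is done.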

In addition, the try-except style is used to prevent the race conditions inherent in some of the “look-before-you-leap” constructs. For example, testing os.path.exists yields information that may be out of date by the time you use it. Likewise, Queue.full returns information that may be stale. The try-except-else style will produce more reliable code in these cases.
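To make the Queue.full case concrete, here is a minimal sketch. The helper name try_enqueue is an assumption for illustration; put_nowait and queue.Full come from the standard queue module:

```python
import queue


def try_enqueue(q, item):
    """Return True if item was queued, False if the queue was full."""
    try:
        # Attempt the operation directly instead of checking q.full() first;
        # another thread could fill the queue between the check and the put.
        q.put_nowait(item)
    except queue.Full:
        return False
    else:
        return True
```

With a queue created as queue.Queue(maxsize=1), the first call succeeds and the second reports that the queue is full, with no window in which the answer can go stale.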

“It is my understanding that exceptions are not errors, they should only be used for exceptional conditions”

In some other languages, that rule reflects cultural norms embodied in their libraries. The “rule” is also based in part on performance considerations for those languages.

The Python cultural norm is somewhat different. In many cases, you must use exceptions for control-flow. Also, the use of exceptions in Python does not slow the surrounding code and calling code as it does in some compiled languages (i.e. CPython already implements code for exception checking at every step, regardless of whether you actually use exceptions or not).

In other words, your understanding that “exceptions are for the exceptional” is a rule that makes sense in some other languages, but not for Python.

“However, if it is included in the language itself, there must be a good reason for it, isn’t there?”

Besides helping to avoid race-conditions, exceptions are also very useful for pulling error-handling outside loops. This is a necessary optimization in interpreted languages which do not tend to have automatic loop invariant code motion.
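A small sketch of what pulling the error handling outside the loop can look like. The function parse_all is a hypothetical example, not taken from the answer:

```python
def parse_all(tokens):
    """Convert every token to int, or return None if any token is invalid."""
    try:
        # One try block wraps the whole loop, so the loop body stays free
        # of per-item validity checks.
        return [int(token) for token in tokens]
    except ValueError:
        return None
```

The alternative would be to validate each token inside the loop before converting it, which repeats the check on every iteration.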

Also, exceptions can simplify code quite a bit in common situations where the ability to handle an issue is far removed from where the issue arose. For example, it is common to have top level user-interface code calling code for business logic which in turn calls low-level routines. Situations arising in the low-level routines (such as duplicate records for unique keys in database accesses) can only be handled in top-level code (such as asking the user for a new key that doesn’t conflict with existing keys). The use of exceptions for this kind of control-flow allows the mid-level routines to completely ignore the issue and be nicely decoupled from that aspect of flow-control.
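The layering described above might be sketched like this. All of the names here (DuplicateKeyError, insert_record, register_user, register_with_retry) are invented for illustration, and a plain dict stands in for the database:

```python
class DuplicateKeyError(Exception):
    """Hypothetical error raised by the low-level storage routine."""


def insert_record(table, key, value):
    # Low-level routine: detects the problem but cannot fix it.
    if key in table:
        raise DuplicateKeyError(key)
    table[key] = value


def register_user(table, name):
    # Mid-level business logic: no error handling at all.
    insert_record(table, name.lower(), {"name": name})


def register_with_retry(table, candidate_names):
    # Top-level code: the only layer that knows how to recover,
    # here by trying the next candidate name.
    for name in candidate_names:
        try:
            register_user(table, name)
        except DuplicateKeyError:
            continue
        else:
            return name
    return None
```

The mid-level function never mentions the duplicate-key case; the exception passes straight through it to the caller that can do something about it.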

There is a nice blog post on the indispensability of exceptions here.

Also, see this Stack Overflow answer: Are exceptions really for exceptional errors?

“What is the reason for the try-except-else to exist?”

The else-clause itself is interesting. It runs when there is no exception but before the finally-clause. That is its primary purpose.

Without the else-clause, the only option to run additional code before finalization would be the clumsy practice of adding the code to the try-clause. That is clumsy because it risks raising exceptions in code that wasn’t intended to be protected by the try-block.
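The ordering can be checked with a small sketch that records which clauses run. The helpers run and fail are made up for this example:

```python
def run(action, log):
    """Record the order in which the clauses execute."""
    try:
        log.append("try")
        action()
    except ValueError:
        log.append("except")
    else:
        # Runs only when the try block raised nothing; exceptions raised
        # here are not caught by the except clause above.
        log.append("else")
    finally:
        log.append("finally")
    return log


def fail():
    raise ValueError("boom")
```

A successful action gives try, else, finally; a failing one gives try, except, finally.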

The use-case of running additional unprotected code prior to finalization doesn’t arise very often. So, don’t expect to see many examples in published code. It is somewhat rare.

Another use-case for the else-clause is to perform actions that must occur when no exception occurs and that do not occur when exceptions are handled. For example:

recip = float('Inf')
try:
    recip = 1 / f(x)
except ZeroDivisionError:
    logging.info('Infinite result')
else:
    logging.info('Finite result')

Another example occurs in unittest runners:

try:
    tests_run += 1
    run_testcase(case)
except Exception:
    tests_failed += 1
    logging.exception('Failing test case: %r', case)
    print('F', end='')
else:
    logging.info('Successful test case: %r', case)
    print('.', end='')

Lastly, the most common use of an else-clause in a try-block is for a bit of beautification (aligning the exceptional outcomes and non-exceptional outcomes at the same level of indentation). This use is always optional and isn’t strictly necessary.


Answer 1

What is the reason for the try-except-else to exist?

A try block allows you to handle an expected error. The except block should only catch exceptions you are prepared to handle. If you handle an unexpected error, your code may do the wrong thing and hide bugs.

An else clause will execute if there were no errors, and by not executing that code in the try block, you avoid catching an unexpected error. Again, catching an unexpected error can hide bugs.

Example

For example:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
else:
    return something

The “try, except” suite has two optional clauses, else and finally. So it’s actually try-except-else-finally.

else will evaluate only if there is no exception from the try block. It allows us to simplify the more complicated code below:

no_error = None
try:
    try_this(whatever)
    no_error = True
except SomeException as the_exception:
    handle(the_exception)
if no_error:
    return something

So if we compare an else clause to the alternative (which might create bugs), we see that it reduces the lines of code, and we can have a more readable, maintainable, and less buggy code base.

finally

finally will execute no matter what, even if another line is being evaluated with a return statement.
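A tiny sketch of that behaviour, using a hypothetical read_value function that appends to a list so the cleanup is observable:

```python
def read_value(events):
    try:
        return "from try"
    finally:
        # Runs after the return value is computed but before the
        # function actually returns to its caller.
        events.append("cleanup")
```

The caller still receives "from try", and by then the list already contains "cleanup".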

Broken down with pseudo-code

It might help to break this down, in the smallest possible form that demonstrates all features, with comments. Assume this syntactically correct (but not runnable unless the names are defined) pseudo-code is in a function.

For example:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle_SomeException(the_exception)
    # Handle an instance of SomeException or a subclass of it.
except Exception as the_exception:
    generic_handle(the_exception)
    # Handle any other exception that inherits from Exception
    # - doesn't include GeneratorExit, KeyboardInterrupt, SystemExit
    # Avoid bare `except:`
else: # there was no exception whatsoever
    return something()
    # if no exception, the "something()" gets evaluated,
    # but the return will not be executed due to the return in the
    # finally block below.
finally:
    # this block will execute no matter what, even if no exception,
    # after "something" is eval'd but before that value is returned
    # but even if there is an exception.
    # a return here will hijack the return functionality. e.g.:
    return True # hijacks the return in the else clause above

It is true that we could include the code in the else block in the try block instead, where it would run if there were no exceptions, but what if that code itself raises an exception of the kind we’re catching? Leaving it in the try block would hide that bug.
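Here is one way the hidden bug could play out. The functions lookup_hidden, lookup_loud, and broken_transform are invented for this sketch:

```python
def lookup_hidden(config, key, transform):
    try:
        value = config[key]
        # Bug risk: a KeyError raised inside transform() is swallowed too.
        return transform(value)
    except KeyError:
        return None


def lookup_loud(config, key, transform):
    try:
        value = config[key]
    except KeyError:
        return None
    else:
        # Only the lookup is protected; errors in transform() propagate.
        return transform(value)


def broken_transform(value):
    return {}["missing"]  # raises KeyError because of a bug
```

With the buggy transform, lookup_hidden quietly returns None as if the key were absent, while lookup_loud lets the KeyError surface where it can be noticed.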

We want to minimize lines of code in the try block to avoid catching exceptions we did not expect, under the principle that if our code fails, we want it to fail loudly. This is a best practice.

It is my understanding that exceptions are not errors

In Python, most exceptions are errors.

We can view the exception hierarchy by using pydoc. For example, in Python 2:

$ python -m pydoc exceptions

or Python 3:

$ python -m pydoc builtins

Either command will give us the hierarchy. We can see that most kinds of Exception are errors, although Python uses some of them for things like ending for loops (StopIteration). This is Python 3’s hierarchy:

BaseException
    Exception
        ArithmeticError
            FloatingPointError
            OverflowError
            ZeroDivisionError
        AssertionError
        AttributeError
        BufferError
        EOFError
        ImportError
            ModuleNotFoundError
        LookupError
            IndexError
            KeyError
        MemoryError
        NameError
            UnboundLocalError
        OSError
            BlockingIOError
            ChildProcessError
            ConnectionError
                BrokenPipeError
                ConnectionAbortedError
                ConnectionRefusedError
                ConnectionResetError
            FileExistsError
            FileNotFoundError
            InterruptedError
            IsADirectoryError
            NotADirectoryError
            PermissionError
            ProcessLookupError
            TimeoutError
        ReferenceError
        RuntimeError
            NotImplementedError
            RecursionError
        StopAsyncIteration
        StopIteration
        SyntaxError
            IndentationError
                TabError
        SystemError
        TypeError
        ValueError
            UnicodeError
                UnicodeDecodeError
                UnicodeEncodeError
                UnicodeTranslateError
        Warning
            BytesWarning
            DeprecationWarning
            FutureWarning
            ImportWarning
            PendingDeprecationWarning
            ResourceWarning
            RuntimeWarning
            SyntaxWarning
            UnicodeWarning
            UserWarning
    GeneratorExit
    KeyboardInterrupt
    SystemExit
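If pydoc is not handy, a similar tree can be built at runtime. This is a sketch, assuming Python 3; exception_tree is a made-up helper, and the result also includes any exception classes defined by modules that happen to be imported:

```python
def exception_tree(cls=BaseException, depth=0):
    """Return indented class names for cls and all currently loaded subclasses."""
    lines = ["    " * depth + cls.__name__]
    for sub in sorted(cls.__subclasses__(), key=lambda c: c.__name__):
        lines.extend(exception_tree(sub, depth + 1))
    return lines


# Print just the ArithmeticError branch as a usage example.
print("\n".join(exception_tree(ArithmeticError)))
```

The full tree from exception_tree() shows StopIteration under Exception, while KeyboardInterrupt and SystemExit sit directly under BaseException.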

A commenter asked:

Say you have a method which pings an external API and you want to handle the exception at a class outside the API wrapper, do you simply return e from the method under the except clause where e is the exception object?

No, you don’t return the exception, just reraise it with a bare raise to preserve the stacktrace.

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    raise

Or, in Python 3, you can raise a new exception and preserve the backtrace with exception chaining:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    raise DifferentException from the_exception
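A runnable sketch of the chaining form, with hypothetical names (ConfigError, load_port) standing in for DifferentException and the surrounding function:

```python
class ConfigError(Exception):
    """Hypothetical higher-level error."""


def load_port(settings):
    try:
        return int(settings["port"])
    except KeyError as exc:
        # "from exc" stores the original exception on __cause__, so both
        # tracebacks are shown if the new error goes unhandled.
        raise ConfigError("port is not configured") from exc
```

A caller that catches ConfigError can still reach the original KeyError through the exception's __cause__ attribute.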

I elaborate in my answer here.


Answer 2

Python doesn’t subscribe to the idea that exceptions should only be used for exceptional cases. In fact, the idiom is ‘ask for forgiveness, not permission’. This means that using exceptions as a routine part of your flow control is perfectly acceptable and, in fact, encouraged.

This is generally a good thing, as working this way helps avoid some issues (as an obvious example, race conditions are often avoided), and it tends to make code a little more readable.

Imagine you have a situation where you take some user input which needs to be processed, but have a default which is already processed. The try: ... except: ... else: ... structure makes for very readable code:

try:
   raw_value = int(input())
except ValueError:
   value = some_processed_value
else: # no error occurred
   value = process_value(raw_value)

Compare to how it might work in other languages:

raw_value = input()
if valid_number(raw_value):
    value = process_value(int(raw_value))
else:
    value = some_processed_value

Note the advantages. There is no need to check that the value is valid and then parse it separately; both are done once. The code also follows a more logical progression: the main code path comes first, followed by ‘if it doesn’t work, do this’.

The example is naturally a little contrived, but it shows there are cases for this structure.


Answer 3

Is it a good practice to use try-except-else in python?

The answer to this is that it is context dependent. If you do this:

d = dict()
try:
    item = d['item']
except KeyError:
    item = 'default'

It demonstrates that you don’t know Python very well. This functionality is encapsulated in the dict.get method:

item = d.get('item', 'default')

The try/except block is a much more visually cluttered and verbose way of writing what can be efficiently executed in a single line with an atomic method. There are other cases where this is true.

However, that does not mean that we should avoid all exception handling. In some cases it is preferred, in order to avoid race conditions. Don’t check if a file exists; just attempt to open it, and catch the appropriate IOError. For the sake of simplicity and readability, try to encapsulate this or factor it out as appropriate.
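One possible way to encapsulate the file case is sketched below. The helper read_text_or_default is a made-up name, and treating every OSError as "use the default" is an assumption that may be too broad for real code:

```python
def read_text_or_default(path, default=""):
    """Return the file's text, or default if it cannot be opened or read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        # In Python 3, IOError is an alias of OSError; this covers missing
        # files and permission problems alike.
        return default
```

There is no os.path.exists check, so nothing can change between a check and the open.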

Read the Zen of Python, understanding that there are principles that are in tension, and be wary of dogma that relies too heavily on any one of the statements in it.


Answer 4

See the following example, which illustrates how try-except-else-finally behaves:

for i in range(3):
    try:
        y = 1 / i
    except ZeroDivisionError:
        print(f"\ti = {i}")
        print("\tError report: ZeroDivisionError")
    else:
        print(f"\ti = {i}")
        print(f"\tNo error report and y equals {y}")
    finally:
        print("Try block is run.")

Run it and you get:

    i = 0
    Error report: ZeroDivisionError
Try block is run.
    i = 1
    No error report and y equals 1.0
Try block is run.
    i = 2
    No error report and y equals 0.5
Try block is run.

Answer 5

You should be careful about using the finally block, as it is not the same thing as using an else block in a try/except. The finally block will run regardless of the outcome of the try/except.

In [10]: dict_ = {"a": 1}

In [11]: try:
   ....:     dict_["b"]
   ....: except KeyError:
   ....:     pass
   ....: finally:
   ....:     print("something")
   ....:     
something

As everyone has noted, using the else block makes your code more readable, and the block only runs when no exception is thrown:

In [14]: try:
   ....:     dict_["b"]
   ....: except KeyError:
   ....:     pass
   ....: else:
   ....:     print("something")
   ....:

Answer 6

Whenever you see this:

try:
    y = 1 / x
except ZeroDivisionError:
    pass
else:
    return y

Or even this:

try:
    return 1 / x
except ZeroDivisionError:
    return None

Consider this instead:

import contextlib
with contextlib.suppress(ZeroDivisionError):
    return 1 / x
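Placed inside a function, the suppress form behaves like the second try/except variant. A minimal sketch, with reciprocal as a made-up function name:

```python
import contextlib


def reciprocal(x):
    """Return 1 / x, or None when x is zero."""
    with contextlib.suppress(ZeroDivisionError):
        return 1 / x
    # Reached only if the exception was suppressed; the function then
    # falls off the end and returns None implicitly.
```

Note that the None result is implicit here, which some readers may find less obvious than an explicit return None in an except clause.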

Answer 7

Just because no-one else has posted this opinion, I would say

avoid else clauses in try/excepts because they’re unfamiliar to most people

Unlike the keywords try, except, and finally, the meaning of the else clause isn’t self-evident; it’s less readable. Because it’s not used very often, it’ll cause people that read your code to want to double-check the docs to be sure they understand what’s going on.

(I’m writing this answer precisely because I found a try/except/else in my codebase and it caused a wtf moment and forced me to do some googling).

So, wherever I see code like the OP example:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
else:
    # do some more processing in non-exception case
    return something

I would prefer to refactor to

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    return  # <1>
# do some more processing in non-exception case  <2>
return something
  • <1> explicit return, clearly shows that, in the exception case, we are finished working

  • <2> as a nice minor side-effect, the code that used to be in the else block is dedented by one level.


Answer 8

This is my simple snippet on how to understand the try-except-else-finally block in Python:

def div(a, b):
    try:
        a/b
    except ZeroDivisionError:
        print("Zero Division Error detected")
    else:
        print("No Zero Division Error")
    finally:
        print("Finally the division of %d/%d is done" % (a, b))

Let’s try div 1/1:

div(1, 1)
No Zero Division Error
Finally the division of 1/1 is done

Let’s try div 1/0

div(1, 0)
Zero Division Error detected
Finally the division of 1/0 is done

Answer 9

OP, YOU ARE CORRECT. The else after try/except in Python is ugly. It leads to another flow-control object where none is needed:

try:
    x = blah()
except:
    print("failed at blah()")
else:
    print("just succeeded with blah")

A totally clear equivalent is:

try:
    x = blah()
    print("just succeeded with blah")
except:
    print("failed at blah()")

This is far clearer than an else clause. The else after try/except is not frequently written, so it takes a moment to figure out what the implications are.

Just because you CAN do a thing, doesn’t mean you SHOULD do a thing.

Lots of features have been added to languages because someone thought it might come in handy. Trouble is, the more features, the less clear and obvious things are because people don’t usually use those bells and whistles.

Just my 5 cents here. I have to come along behind and clean up a lot of code written by 1st-year out of college developers who think they’re smart and want to write code in some uber-tight, uber-efficient way when that just makes it a mess to try and read / modify later. I vote for readability every day and twice on Sundays.


How do I resize an image using PIL and maintain its aspect ratio?

Question: How do I resize an image using PIL and maintain its aspect ratio?

Is there an obvious way to do this that I’m missing? I’m just trying to make thumbnails.


Answer 0

Define a maximum size. Then, compute a resize ratio by taking min(maxwidth/width, maxheight/height).

The proper size is oldsize*ratio.
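The ratio calculation can be sketched without any imaging library. The function fit_within is a made-up name, and rounding to whole pixels with a 1-pixel minimum is an assumption added here, not part of the answer:

```python
def fit_within(size, max_size):
    """Scale (width, height) to fit inside max_size, keeping the aspect ratio."""
    width, height = size
    max_width, max_height = max_size
    ratio = min(max_width / width, max_height / height)
    # Round to whole pixels and never go below 1 pixel per side.
    return max(1, round(width * ratio)), max(1, round(height * ratio))
```

For a 1920x1080 source and a 128x128 limit, the width is the tighter constraint, so the result is 128 wide and 72 high. Note that this formula also enlarges images that are smaller than the limit.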

There is of course also a library method to do this: the method Image.thumbnail.
Below is an (edited) example from the PIL documentation.

import os, sys
from PIL import Image  # with Pillow, Image lives in the PIL package

size = 128, 128

for infile in sys.argv[1:]:
    outfile = os.path.splitext(infile)[0] + ".thumbnail"
    if infile != outfile:
        try:
            im = Image.open(infile)
            # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
            im.thumbnail(size, Image.LANCZOS)
            im.save(outfile, "JPEG")
        except IOError:
            print("cannot create thumbnail for '%s'" % infile)

Answer 1

This script will resize an image (somepic.jpg) using PIL (Python Imaging Library) to a width of 300 pixels and a height proportional to the new width. It does this by determining what percentage 300 pixels is of the original width (img.size[0]) and then multiplying the original height (img.size[1]) by that percentage. Change “basewidth” to any other number to change the default width of your images.

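from PIL import Image

basewidth = 300
img = Image.open('somepic.jpg')
# Fraction of the original width that the new width represents.
wpercent = basewidth / float(img.size[0])
hsize = int(float(img.size[1]) * wpercent)
# Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10.
img = img.resize((basewidth, hsize), Image.LANCZOS)
img.save('sompic.jpg')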

Answer 2

I also recommend using PIL’s thumbnail method, because it takes care of all the ratio hassles for you.

One important hint, though: Replace

im.thumbnail(size)

with

im.thumbnail(size, Image.LANCZOS)

In old PIL versions, thumbnail used the Image.NEAREST filter by default, which gives good performance but poor quality. Image.LANCZOS is the current name for Image.ANTIALIAS, which was removed in Pillow 10.


Answer 3

Based on @tomvon’s answer, I ended up using the following (pick your case):

a) Resizing height (I know the new width, so I need the new height)

width, height = img.size
new_width  = 680
new_height = new_width * height // width  # integer division: resize needs whole pixels

b) Resizing width (I know the new height, so I need the new width)

width, height = img.size
new_height = 680
new_width  = new_height * width // height

Then just:

img = img.resize((new_width, new_height), Image.LANCZOS)

Answer 4

PIL already has an option that resizes and crops an image to a requested size:

from PIL import Image, ImageOps
img = ImageOps.fit(img, size, Image.LANCZOS)

Answer 5

from PIL import Image

img = Image.open('/your image path/image.jpg')  # input may be *.png, *.jpg, ...
new_width  = 200
new_height = 300
# Note: a fixed (width, height) only keeps the aspect ratio if it matches the source's ratio.
img = img.resize((new_width, new_height), Image.LANCZOS)
img.save('output image name.png')  # the format may be whatever you want: *.png, *.jpg, *.gif

Answer 6

If you are trying to maintain the same aspect ratio, then wouldn’t you resize by some percentage of the original size?

For example, half the original size

half = 0.5
out = im.resize(tuple(int(half * s) for s in im.size))

Answer 7

from PIL import Image
from resizeimage import resizeimage

def resize_file(in_file, out_file, size):
    # Open in binary mode; image files are not text.
    with open(in_file, 'rb') as fd:
        image = resizeimage.resize_thumbnail(Image.open(fd), size)
    image.save(out_file)
    image.close()

resize_file('foo.tif', 'foo_small.jpg', (256, 256))

I use this library:

pip install python-resize-image

Answer 8

If you don’t want or need to open the image with Pillow (for example, it is already a NumPy array), use this:

import numpy
from PIL import Image

new_img_arr = numpy.array(Image.fromarray(img_arr).resize((new_width, new_height), Image.LANCZOS))

Answer 9

Just updating this question with a more modern wrapper. This library wraps Pillow (a fork of PIL): https://pypi.org/project/python-resize-image/

It allows you to do something like this:

from PIL import Image
from resizeimage import resizeimage

fd_img = open('test-image.jpeg', 'rb')  # binary mode; image data is not text
img = Image.open(fd_img)
img = resizeimage.resize_width(img, 200)
img.save('test-image-width.jpeg', img.format)
fd_img.close()

Heaps more examples in the above link.


Answer 10

I was trying to resize some images for a slideshow video, so I wanted not just one maximum dimension but a maximum width and a maximum height (the size of the video frame).
And there was always the possibility of a portrait video…
The Image.thumbnail method was promising, but I could not make it upscale a smaller image.

Since I couldn’t find an obvious way to do that here (or in some other places), I wrote this function and am posting it for those who come later:

from PIL import Image

def get_resized_img(img_path, video_size):
    img = Image.open(img_path)
    width, height = video_size  # these are the MAX dimensions
    video_ratio = width / height
    img_ratio = img.size[0] / img.size[1]
    if video_ratio >= 1:  # the video is wide
        if img_ratio <= video_ratio:  # image is not wide enough
            width_new = int(height * img_ratio)
            size_new = width_new, height
        else:  # image is wider than video
            height_new = int(width / img_ratio)
            size_new = width, height_new
    else:  # the video is tall
        if img_ratio >= video_ratio:  # image is not tall enough
            height_new = int(width / img_ratio)
            size_new = width, height_new
        else:  # image is taller than video
            width_new = int(height * img_ratio)
            size_new = width_new, height
    return img.resize(size_new, resample=Image.LANCZOS)
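
Both outer branches of the function above end up applying the same rule: compare the image's aspect ratio with the frame's and let the tighter dimension decide. As a minimal sketch (the helper name `fit_within` is hypothetical and not part of Pillow), the size calculation can be written once with integer arithmetic:

```python
def fit_within(img_size, max_size):
    """Largest (width, height) with the aspect ratio of img_size that fits inside max_size."""
    img_w, img_h = img_size
    max_w, max_h = max_size
    # Cross-multiplied form of img_w / img_h >= max_w / max_h; avoids float rounding
    if img_w * max_h >= img_h * max_w:
        # Image is relatively wider than the frame: width is the limiting dimension
        return max_w, max(1, img_h * max_w // img_w)
    # Image is relatively taller than the frame: height is the limiting dimension
    return max(1, img_w * max_h // img_h), max_h


print(fit_within((400, 300), (1920, 1080)))    # small landscape image, wide frame
print(fit_within((3000, 1000), (1920, 1080)))  # panorama, wide frame
print(fit_within((400, 300), (1080, 1920)))    # landscape image, portrait frame
```

With Pillow the result could be passed on as `img.resize(fit_within(img.size, video_size), resample=Image.LANCZOS)`, which, unlike `Image.thumbnail`, also scales smaller images up.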

Answer 11

A simple method for keeping constrained ratios and passing a max width / height. Not the prettiest but gets the job done and is easy to understand:

import os
from PIL import Image

def resize(img_path, max_px_size, output_folder):
    with Image.open(img_path) as img:
        width_0, height_0 = img.size
        out_f_name = os.path.split(img_path)[-1]
        out_f_path = os.path.join(output_folder, out_f_name)

        if max((width_0, height_0)) <= max_px_size:
            print('writing {} to disk (no change from original)'.format(out_f_path))
            img.save(out_f_path)
            return

        if width_0 >= height_0:
            # Landscape or square: the width becomes max_px_size
            wpercent = max_px_size / float(width_0)
            hsize = int(float(height_0) * float(wpercent))
            img = img.resize((max_px_size, hsize), Image.LANCZOS)
        else:
            # Portrait: the height becomes max_px_size; resize() takes (width, height)
            hpercent = max_px_size / float(height_0)
            wsize = int(float(width_0) * float(hpercent))
            img = img.resize((wsize, max_px_size), Image.LANCZOS)

        print('writing {} to disk'.format(out_f_path))
        img.save(out_f_path)

Here’s a python script that uses this function to run batch image resizing.


Answer 12

I have updated the answer above by “tomvon”:

from PIL import Image

img = Image.open(image_path)

width, height = img.size[:2]

if height > width:
    baseheight = 64
    hpercent = (baseheight/float(img.size[1]))
    wsize = int((float(img.size[0])*float(hpercent)))
    img = img.resize((wsize, baseheight), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
    img.save('resized.jpg')
else:
    basewidth = 64
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.LANCZOS)
    img.save('resized.jpg')

Answer 13

My ugly example.

The function takes files named like “pic[0-9a-z].[extension]”, resizes them so that the shorter side is 120 px, crops the central 120×120 section, and saves the result as “ico[0-9a-z].[extension]”. It works with both portrait and landscape images:

import os
from PIL import Image

def imageResize(filepath):
    file_dir = os.path.split(filepath)
    out_path = os.path.join(file_dir[0], 'ico' + file_dir[1][3:])
    img = Image.open(filepath)

    # Scale so that the shorter side becomes 120 px; resize() needs integer sizes
    if img.size[0] > img.size[1]:
        aspect = img.size[1] / 120.0
        new_size = (int(round(img.size[0] / aspect)), 120)
    else:
        aspect = img.size[0] / 120.0
        new_size = (120, int(round(img.size[1] / aspect)))
    img = img.resize(new_size)

    # Crop the central 120x120 region
    if img.size[0] > img.size[1]:
        left = (img.size[0] - 120) // 2
        new_img = img.crop((left, 0, left + 120, 120))
    else:
        top = (img.size[1] - 120) // 2
        new_img = img.crop((0, top, 120, top + 120))

    new_img.save(out_path)

Answer 14

I resized the image this way and it works very well:

from io import BytesIO
from django.core.files.storage import FileSystemStorage
from django.core.files.uploadedfile import InMemoryUploadedFile
from PIL import Image


def imageResize(image):
    # image is the uploaded file object; open it with Pillow before resizing
    imageTemproary = Image.open(image)
    outputIoStream = BytesIO()
    imageTemproaryResized = imageTemproary.resize( (1920,1080), Image.LANCZOS)  # formerly Image.ANTIALIAS
    imageTemproaryResized.save(outputIoStream , format='PNG')
    outputIoStream.seek(0)
    # Name and content type match the PNG data written above
    uploadedImage = InMemoryUploadedFile(outputIoStream,'ImageField', "%s.png" % image.name.split('.')[0], 'image/png', outputIoStream.getbuffer().nbytes, None)

    ## For upload local folder
    fs = FileSystemStorage()
    filename = fs.save(uploadedImage.name, uploadedImage)
    return filename

Answer 15

I will also add a version of the resize that keeps the aspect ratio fixed. In this case, it adjusts the height to match the width of the new image, based on the initial aspect ratio, asp_rat, which is a float (!). To adjust the width to the height instead, you just need to comment out one line and uncomment the other in the else branch; the comments in the code show where.

You do not need the semicolons (;); I keep them just to remind myself of the syntax of languages I use more often.

from PIL import Image

img_path = "filename.png";
img = Image.open(img_path);     # puts our image to the buffer of the PIL.Image object

width, height = img.size;
asp_rat = width/height;

# Enter new width (in pixels)
new_width = 50;

# Enter new height (in pixels)
new_height = 54;

new_rat = new_width/new_height;

if (new_rat == asp_rat):
    img = img.resize((new_width, new_height), Image.LANCZOS);  # Image.ANTIALIAS is the older alias, removed in Pillow 10

# adjusts the height to match the width
# NOTE: if you want to adjust the width to the height, instead -> 
# uncomment the second line (new_width) and comment the first one (new_height)
else:
    new_height = round(new_width / asp_rat);
    #new_width = round(new_height * asp_rat);
    img = img.resize((new_width, new_height), Image.LANCZOS);

# usage: resize((x,y), resample)
# resample filter -> PIL.Image.BILINEAR, PIL.Image.NEAREST (default), PIL.Image.BICUBIC, etc..
# https://pillow.readthedocs.io/en/3.1.x/reference/Image.html#PIL.Image.Image.resize

# Enter the name under which you would like to save the new image
img.save("outputname.png");

And, it is done. I tried to document it as much as I can, so it is clear.

I hope it might be helpful to someone out there!


Answer 16

Open your image file

from PIL import Image
im = Image.open("image.png")

Use PIL Image.resize(size, resample=0) method, where you substitute (width, height) of your image for the size 2-tuple.

This will display your image at original size (display() is available in IPython/Jupyter notebooks):

display(im.resize((int(im.size[0]),int(im.size[1])), 0) )

This will display your image at 1/2 the size:

display(im.resize((int(im.size[0]/2),int(im.size[1]/2)), 0) )

This will display your image at 1/3 the size:

display(im.resize((int(im.size[0]/3),int(im.size[1]/3)), 0) )

This will display your image at 1/4 the size:

display(im.resize((int(im.size[0]/4),int(im.size[1]/4)), 0) )

etc etc


Answer 17

from PIL import Image
from resizeimage import resizeimage

def resize_file(in_file, out_file, size):
    # Open in binary mode; image data is not text
    with open(in_file, 'rb') as fd:
        image = resizeimage.resize_thumbnail(Image.open(fd), size)
    image.save(out_file)
    image.close()

resize_file('foo.tif', 'foo_small.jpg', (256, 256))

Answer 18

You can resize an image with the code below:

from PIL import Image
img=Image.open('Filename.jpg') # place the image in the working directory
print(img.size) # size is an attribute holding (width, height), not a method
new_img=img.resize((400,400))
new_img.save('new_filename.jpg')

How do I find the duplicates in a list and create another list with them?

Question: How do I find the duplicates in a list and create another list with them?

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.


Answer 0

To remove duplicates use set(a). To print duplicates, something like:

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

Note that Counter is not particularly efficient (timings) and probably overkill here. set will perform better. This code computes a list of unique elements in the source order:

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

or, more concisely:

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]    

I don’t recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).
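
To make the one-liner above less mysterious, here is a small sketch of what it relies on: `set.add` returns `None`, so `not seen.add(x)` is always `True` and serves only to record `x` as a side effect. The sample list is the one used earlier in this answer.

```python
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

seen = set()
# add() mutates the set and returns None
result = seen.add(1)
print(result)  # None

seen = set()
# `x not in seen` filters repeats; `not seen.add(x)` is always True and records x
uniq = [x for x in a if x not in seen and not seen.add(x)]
print(uniq)  # [1, 2, 3, 5, 6]
```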

To compute the list of duplicated elements without libraries:

seen = {}
dupes = []

for x in a:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print(no_dupes) # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print(dupes) # [[1], [3]]

Answer 1

>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])

Answer 2

You don’t need the count, just whether or not the item was seen before. Adapted that answer to this problem:

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

Just in case speed matters, here are some timings:

# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in l if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

Here are the results: (well done @JohnLaRooy!)

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

Interestingly, besides the timings themselves, the ranking also changes slightly when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy’s optimizations, whereas the method-caching approach I have suggested seems to gain almost nothing.

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

Apparently this effect is related to the “duplicatedness” of the input data. I set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop

Answer 3

You can use iteration_utilities.duplicates:

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

or if you only want one of each duplicate this can be combined with iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

It can also handle unhashable elements (however at the cost of performance):

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

That’s something that only a few of the other approaches here can handle.
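
For readers who cannot add a dependency, the following is a rough standard-library sketch of the same idea: track hashable items in a set and fall back to a list (linear search) for unhashable ones. It is not the implementation used by `iteration_utilities`; the function name `iter_duplicates` is made up for illustration.

```python
def iter_duplicates(iterable):
    """Yield every item that has already been seen earlier in the iterable."""
    seen_hashable = set()
    seen_unhashable = []  # linear search, so each unhashable item costs O(n)
    for item in iterable:
        try:
            is_dup = item in seen_hashable
            if not is_dup:
                seen_hashable.add(item)
        except TypeError:
            # Unhashable item such as a list: compare by equality instead
            is_dup = item in seen_unhashable
            if not is_dup:
                seen_unhashable.append(item)
        if is_dup:
            yield item


print(list(iter_duplicates([1, 1, 2, 1, 2, 3, 4, 2])))
print(list(iter_duplicates([[1], [2], [1], [3], [1]])))
```

The list fallback is what makes unhashable input quadratic in the worst case, which is the performance cost mentioned above.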

Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here.

The first benchmark included only a small range of list-lengths because some approaches have O(n**2) behavior.

In the graphs the y-axis represents the time, so a lower value means better. It’s also plotted log-log so the wide range of values can be visualized better:

[Figure: log-log plot of runtime against list size for all benchmarked approaches]

Removing the O(n**2) approaches I did another benchmark up to half a million elements in a list:

[Figure: log-log plot of runtime against list size, excluding the O(n**2) approaches]

As you can see, the iteration_utilities.duplicates approach is faster than any of the other approaches, and even chaining unique_everseen(duplicates(...)) was faster than or as fast as the other approaches.

One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.

However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn’t matter much which one is used (except for the 3 that had O(n**2) runtime).

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

Benchmark 1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

Benchmark 2

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

Disclaimer

This is from a third-party library I have written: iteration_utilities.


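Answer 4

I came across this question while looking into something related, and wondered why no one had offered a generator-based solution. Solving this problem would be:

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as lists get larger (note: it would have been better to use timeit, but this is illustrative).

I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is faster still for mostly sorted lists… It now also includes the pandas approach from @firelynx: slow, but not horribly so, and simple. Note: the sort/tee/zip approach is consistently fastest on my machine for large, mostly ordered lists, and moooeeeep is fastest for shuffled lists, but your mileage may vary.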

优点

  • 使用相同的代码非常容易地测试“任何”重复项

假设条件

  • 重复应仅报告一次
  • 重复的订单不需要保留
  • 重复项可能在列表中的任何位置

最快的解决方案,一百万个条目:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

经过测试的方法

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

“所有重复”测试的结果是一致的,在此数组中找到“第一个”重复项,然后是“所有”重复项:

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

当列表首先被洗牌时,排序的价格显而易见-效率显着下降,@ moooeeeep方法占主导地位,set和dict方法相似,但表现较差:

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  

I came across this question whilst looking into something related, and wondered why no one had offered a generator-based solution. Solving this problem would be:

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

I was concerned with scalability, so I tested several approaches, including naive ones that work well on small lists but scale horribly as lists get larger (note: it would have been better to use timeit, but this is illustrative).

I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is even faster again for mostly sorted lists… Now includes pandas approach from @firelynx — slow, but not horribly so, and simple. Note – sort/tee/zip approach is consistently fastest on my machine for large mostly ordered lists, moooeeeep is fastest for shuffled lists, but your mileage may vary.

Advantages

  • very quick and simple to test for ‘any’ duplicates using the same code

Assumptions

  • Duplicates should be reported once only
  • Duplicate order does not need to be preserved
  • Duplicates might be anywhere in the list

Fastest solution, 1m entries:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k
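
The snippet above is Python 2 (itertools.izip no longer exists in Python 3). As a rough sketch only, the same sort/tee/zip idea could look like this on Python 3. The name get_dupes_sorted and the sentinel object are my own assumptions, not part of the original answer; the sentinel replaces r = None so that a duplicated None would still be reported.

```python
import itertools


def get_dupes_sorted(values):
    """Yield each duplicated value once, in sorted order (sort/tee/zip)."""
    a, b = itertools.tee(sorted(values))
    next(b, None)                 # advance the second iterator by one item
    last = object()               # sentinel: never equal to real data
    for current, following in zip(a, b):
        # Adjacent equal items in the sorted data mean a duplicate;
        # `last` suppresses repeated reports of the same value.
        if current == following and current != last:
            yield current
            last = current


print(list(get_dupes_sorted([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])))  # [1, 2, 5]
```

Because it is a generator, checking for "any" duplicate only needs the first item, for example next(get_dupes_sorted(data), None), although the sort itself still processes the whole input.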

Approaches tested

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # yields every element that has already been seen (again on each further repeat)
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        fn(c).next()    # Raises StopIteration if there is no dupe
        return True     # Found a dupe (even a falsy one such as 0)
    except StopIteration:
        return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

The results for the ‘all dupes’ test were consistent, finding “first” duplicate then “all” duplicates in this array:

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

When the lists are shuffled first, the price of the sort becomes apparent: the efficiency drops noticeably and the @moooeeeep approach dominates, with the set and dict approaches being similar but lesser performers:

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  

Answer 5

Using pandas:

>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])

Answer 6

collections.Counter is new in Python 2.7, so it is not available on older interpreters:


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = [1,2,3,2,1,5,6,5,5,5]
>>> import collections
>>> print [x for x, y in collections.Counter(a).items() if y > 1]
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'Counter'

In an earlier version you can use a conventional dict instead:

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]
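
For comparison, here is a minimal sketch of the Counter version on an interpreter that has it (written for Python 3, so print is a function). The variable names counts and dupes are mine, chosen for illustration.

```python
from collections import Counter

a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]

# Counter maps each value to the number of times it occurs.
counts = Counter(a)

# Keep only the values that occur more than once.
dupes = [value for value, count in counts.items() if count > 1]

print(sorted(dupes))  # [1, 2, 5]
```

This does the same work as the dict loop above in a single pass over the list.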

Answer 7

Here’s a neat and concise solution –

for x in set(li):
    li.remove(x)

li = list(set(li))
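
The idea is that removing one copy of every distinct value leaves only the extra copies behind. The sketch below wraps the same steps in a function that works on a copy, so the caller's list is not modified; the name extra_copies is hypothetical. Note that list.remove() scans the list each time, so this is quadratic in the worst case.

```python
def extra_copies(li):
    """Return the values that occur more than once in li."""
    remaining = list(li)          # work on a copy; the input stays untouched
    for x in set(remaining):
        remaining.remove(x)       # drop the first occurrence of each distinct value
    # Whatever is left occurred at least twice; collapse repeats with set().
    return list(set(remaining))


print(sorted(extra_copies([1, 2, 3, 2, 1, 5, 6, 5, 5, 5])))  # [1, 2, 5]
```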

Answer 8

Probably the simplest way, without converting to a set, is something like the code below (note that it collects each distinct value once). This may be useful during an interview when they ask you not to use sets.

a=[1,2,3,3,3]
dup=[]
for each in a:
  if each not in dup:
    dup.append(each)
print(dup)

Or, to get two separate lists of unique values and duplicate values:

a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
  if each not in uniques:
    uniques.append(each)
  else:
    dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)

Answer 9

I would do this with pandas, because I use pandas a lot

import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

Gives

[3,6]

Probably isn’t very efficient, but it sure is less code than a lot of the other answers, so I thought I would contribute


Answer 10

The third example in the accepted answer gives an erroneous result and does not attempt to return duplicates. Here is the correct version:

number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set
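
To see why this works: set.add() always returns None, which is falsy, so the `or` clause records a first-time value in seen_set without letting it through the filter. The sketch below runs the same three lines on a concrete list; the sample data is my own, chosen for illustration.

```python
number_lst = [1, 1, 2, 3, 5, 3, 1]

seen_set = set()
# First sighting: `x in seen_set` is False, so seen_set.add(x) runs and
# returns None (falsy), and x is skipped. Later sightings pass the filter.
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set

print(duplicate_set)  # {1, 3}
print(unique_set)     # {2, 5}
```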

Answer 11

How about simply looping through each element in the list, checking its number of occurrences, and adding it to a set if it occurs at least twice? The set then holds the duplicates, which are printed at the end. Hope this helps someone out there.

myList = [2, 4, 6, 8, 4, 6, 12]
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4 , 6]

Answer 12

We can use itertools.groupby in order to find all the items that have dups:

from itertools import groupby

myList  = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups consecutive elements that are equal
for x, y in groupby(sorted(myList)):
    # list(y) returns all the occurrences of item x
    if len(list(y)) > 1:
        print(x)

The output will be:

4
6

Answer 13

I guess the most effective way to find duplicates in a list is:

from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

It builds one Counter from all the elements and another from the unique elements. Subtracting the second from the first leaves only the duplicates.
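
The subtraction works because Counter's `-` operator keeps only keys whose resulting count is positive. A small sketch of the intermediate values, using the same input as above (the variable names are mine):

```python
from collections import Counter

values = [1, 2, 3, 6, 5, 2]

all_counts = Counter(values)          # 2 is counted twice, everything else once
unique_counts = Counter(set(values))  # every distinct value counted exactly once

# Counts that drop to zero are discarded, so only values seen
# more than once survive the subtraction.
extra = all_counts - unique_counts

print(extra)        # Counter({2: 1})
print(list(extra))  # [2]
```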


Answer 14

A bit late, but maybe helpful for some. For a largish list, I found this worked for me.

l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]

This shows all of the duplicates, and only the duplicates, and preserves their order.


Answer 15

A very simple and quick way of finding dupes with one iteration in Python is:

testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
  try:
    testListDict[item] += 1
  except KeyError:   # first time this item is seen
    testListDict[item] = 1

print testListDict

Output will be as follows:

>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}

This and more in my blog http://www.howtoprogramwithpython.com


Answer 16

I am entering this discussion very late. Even so, I would like to deal with this problem using one-liners, because that’s the charm of Python. If we just want to get the duplicates into a separate list (or any collection), I would suggest doing it as below. Say we have a list with duplicates, which we can call ‘target’:

    target=[1,2,3,4,4,4,3,5,6,8,4,3]

Now if we want to get the duplicates, we can use the one-liner below:

    duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))

This code puts the duplicated records as keys and their counts as values into the dictionary ‘duplicates’. The ‘duplicates’ dictionary will look like this:

    {3: 3, 4: 4} # 3 occurs 3 times and 4 occurs 4 times

If you just want all the duplicated records in a list, the code is shorter still:

    duplicates=filter(lambda rec : target.count(rec)>1,target)

Output will be:

    [3, 4, 4, 4, 3, 4, 3]

This works perfectly in Python 2.7.x. In Python 3, filter returns an iterator, so wrap the second one-liner in list() to get the list shown above.
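
Both one-liners call target.count() for every element, which rescans the list each time. As a sketch only, the same two results can be produced on Python 3 with a single counting pass; using collections.Counter here is my substitution, not the author's method, and the name all_dupes is hypothetical.

```python
from collections import Counter

target = [1, 2, 3, 4, 4, 4, 3, 5, 6, 8, 4, 3]

counts = Counter(target)  # one pass over the list

# Duplicated values mapped to how often they occur.
duplicates = {value: n for value, n in counts.items() if n > 1}
print(duplicates)  # {3: 3, 4: 4}

# Every occurrence of a duplicated value, in original order.
all_dupes = [value for value in target if counts[value] > 1]
print(all_dupes)   # [3, 4, 4, 4, 3, 4, 3]
```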


Answer 17

Python 3.8 one-liner if you don’t care to write your own algorithm or use libraries:

from itertools import groupby

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

Prints item and count:

[(1, 2), (2, 2), (5, 4)]

groupby takes a grouping function so you can define your groupings in different ways and return additional Tuple fields as needed.

groupby is lazy so it shouldn’t be too slow.


Answer 18

Some other tests. Of course, doing…

set([x for x in l if l.count(x) > 1])

…is too costly. It’s about 500 times faster (longer arrays give an even bigger gain) to use the following final method:

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

Only 2 loops, no very costly l.count() operations.

Here is some code to compare the methods. The code is below; here is the output:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

The testing code:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()
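
The test script above targets Python 2 (print statements, iteritems). As a rough sketch, a smaller comparison on Python 3 could use the standard timeit module instead of a hand-rolled timer. The function names follow the answer; the data size, value range and seed are arbitrary choices of mine, and no particular timings are claimed.

```python
import random
import timeit
from collections import Counter


def dups_count(values):
    # Quadratic: list.count() rescans the list for every element.
    return set(x for x in values if values.count(x) > 1)


def dups_counter(values):
    # Linear: one counting pass, then one pass over the distinct values.
    return set(x for x, n in Counter(values).items() if n > 1)


random.seed(17)
data = [random.randrange(500) for _ in range(1000)]

# Both approaches must agree before the timings mean anything.
assert dups_count(data) == dups_counter(data)

t_count = timeit.timeit(lambda: dups_count(data), number=3)
t_counter = timeit.timeit(lambda: dups_counter(data), number=3)
print('dups_count:   %.4fs' % t_count)
print('dups_counter: %.4fs' % t_counter)
```

The gap between the two should widen as the list grows, which matches the point made above about l.count().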

Answer 19

Method 1:

list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))

Explanation: [val for idx, val in enumerate(input_list) if val in input_list[idx+1:]] is a list comprehension that returns an element if the same element appears again later in the list, that is, anywhere after its current index.

Example: input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]

Starting with the first element in the list, 42, at index 0, it checks whether 42 is present in input_list[1:] (i.e., from index 1 to the end of the list). Because 42 is present in input_list[1:], it returns 42.

Then it goes to the next element, 31, at index 1, and checks whether 31 is present in input_list[2:] (i.e., from index 2 to the end of the list). Because 31 is present in input_list[2:], it returns 31.

Similarly, it goes through all the elements in the list and returns only the repeated/duplicate elements in a list.

Because that list can itself contain the same duplicate several times, we need to keep just one of each. To do so, we call the Python built-in set(), which removes the repeats.

We are then left with a set rather than a list, so we convert it back to a list with list().
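
The two steps of Method 1 can be seen separately by running them on the example list. This is only an illustrative breakdown; the intermediate names repeated and dupes are mine.

```python
input_list = [42, 31, 42, 31, 3, 31, 31, 5, 6, 6, 6, 6, 6, 7, 42]

# Step 1: keep every element that occurs again later in the list.
repeated = [val for idx, val in enumerate(input_list)
            if val in input_list[idx + 1:]]
print(repeated)       # [42, 31, 42, 31, 31, 6, 6, 6, 6]

# Step 2: collapse the repeats with set(), then convert back to a list.
dupes = list(set(repeated))
print(sorted(dupes))  # [6, 31, 42]
```

Each membership test scans a slice of the list, so the approach is quadratic and best kept for short inputs.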

Method 2:

def dupes(ilist):
    temp_list = [] # initially, empty temporary list
    dupe_list = [] # initially, empty duplicate list
    for each in ilist:
        if each in temp_list: # Found a Duplicate element
            if not each in dupe_list: # Avoid duplicate elements in dupe_list
                dupe_list.append(each) # Add duplicate element to dupe_list
        else: 
            temp_list.append(each) # Add a new (non-duplicate) to temp_list

    return dupe_list

Explanation: Here we create two empty lists to start with. We then traverse all the elements of the list, checking whether each one already exists in temp_list (initially empty). If it is not in temp_list, we add it to temp_list using the append method.

If it already exists in temp_list, the current element is a duplicate, so we add it to dupe_list using the append method (unless it is already there).


Answer 20

raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

This removes duplicates by converting the list to a set (clean_list), then iterates over raw_list, removing each item from clean_list the first time it is seen. When an item is no longer found there, the raised ValueError is caught and the item is appended to duplicated_items, so every repeated occurrence is recorded, not just one per value.

If the index of each duplicated item is needed, enumerate the list and use the index (for index, item in enumerate(raw_list):). Note that clean_list.remove(item) scans the list linearly, so for large lists (thousands of elements or more) a set or dict lookup will be faster.
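
The enumerate idea can be sketched as follows. This is only one possible reading of "play around with the index": it groups positions by value in a dict, and the helper name duplicate_indices is an assumption, not part of the original answer.

```python
from collections import defaultdict


def duplicate_indices(items):
    """Map each repeated value to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, item in enumerate(items):
        positions[item].append(index)
    # Keep only the values that occur more than once.
    return {item: idxs for item, idxs in positions.items() if len(idxs) > 1}


raw_list = [1, 2, 3, 3, 4]
print(duplicate_indices(raw_list))  # {3: [2, 3]}
```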


Answer 21

Use the list.count() method to find the duplicate elements of a given list:

arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))
for i in arr:
    if arr.count(i)>1 and i not in dup:
        dup.append(i)
print(dup)
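
Because list.count() rescans the whole list for every element, the loop above does quadratic work. A hedged alternative, assuming the elements are hashable, is to count everything once with collections.Counter (the function name duplicates_by_count is illustrative only):

```python
from collections import Counter


def duplicates_by_count(items):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(items)  # one pass over the input
    return [item for item, n in counts.items() if n > 1]


print(duplicates_by_count([1, 2, 2, 3, 3, 3, 4]))  # [2, 3]
```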

Answer 22

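A one-liner, for fun, and for where a single statement is required. On Python 3, reduce must first be imported from functools, and a lambda can no longer unpack a tuple parameter, so the accumulator is indexed instead. The result is a (unique, duplicates) pair of sets.

from functools import reduce

(lambda iterable: reduce(lambda acc, item: (acc[0], acc[1] | {item}) if item in acc[0] else (acc[0] | {item}, acc[1]), iterable, (set(), set())))(some_iterable)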

Answer 23

list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
# Collect each distinct value once; note that this also appends to list2 while iterating over it.
[(lset.add(item), list2.append(item))
 for item in list2 if item not in lset]
print(list(lset))  # the distinct values of list2

Answer 24

One-line solution (lst is the input list; naming it list would shadow the built-in):

set([i for i in lst if sum([1 for a in lst if a == i]) > 1])

Answer 25

There are a lot of answers up here, but I think this is a relatively readable and easy-to-understand approach (the input list must already be sorted):

def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

Notes:

  • If you wish to preserve the duplication count, drop the conversion to set at the bottom and return the full list.
  • If you prefer a generator, replace duplicates.append(x) with yield x and remove the return statement at the bottom (you can convert the result to a set later).
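
The generator variant mentioned in the notes could look like the sketch below. The name iter_duplicates is made up for illustration, and the empty-input guard is an addition not present in the original function; the input is still assumed to be sorted.

```python
def iter_duplicates(sorted_items):
    """Yield every element that equals its predecessor in a sorted iterable."""
    iterator = iter(sorted_items)
    try:
        last = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for x in iterator:
        if x == last:
            yield x
        last = x


data = sorted([3, 1, 2, 3, 1, 3])        # [1, 1, 2, 3, 3, 3]
print(list(iter_duplicates(data)))       # [1, 3, 3]
print(set(iter_duplicates(data)))        # {1, 3}
```

Collecting into a list keeps the repeat counts, while collecting into a set gives one entry per duplicated value.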

Answer 26

Here’s a fast generator that uses a dict to store each element as a key with a boolean value for checking if the duplicate item has already been yielded.

For lists with all elements that are hashable types:

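def gen_dupes(array):
    unique = {}  # value -> True until it has been yielded as a duplicate
    for value in array:
        if value in unique:
            if unique[value]:
                # second occurrence: report it once, then mark it as yielded
                unique[value] = False
                yield value
        else:
            unique[value] = True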

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

For lists that might contain lists:

def gen_dupes(array):
    unique = {}  # hashable key -> True until it has been yielded as a duplicate
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)  # lists are unhashable; use a tuple as the key
            is_list = True

        if value in unique:
            if unique[value]:
                unique[value] = False
                if is_list:
                    value = list(value)

                yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]

Answer 27

def removeduplicates(a):
  seen = set()

  for i in a:
    if i not in seen:
      seen.add(i)
  return seen 

print(removeduplicates([1,1,2,2]))

Answer 28

When using toolz:

from toolz import frequencies, valfilter

a = [1, 2, 2, 3, 4, 5, 4]
list(valfilter(lambda count: count > 1, frequencies(a)).keys())
# [2, 4]

Answer 29

This is the way I had to do it, because I challenged myself not to use other methods:

def dupList(oldlist):
    if type(oldlist)==type((2,2)):
        oldlist=[x for x in oldlist]
    newList=[]
    newList=newList+oldlist
    oldlist=oldlist
    forbidden=[]
    checkPoint=0
    for i in range(len(oldlist)):
        #print 'start i', i
        if i in forbidden:
            continue
        else:
            for j in range(len(oldlist)):
                #print 'start j', j
                if j in forbidden:
                    continue
                else:
                    #print 'after Else'
                    if i!=j: 
                        #print 'i,j', i,j
                        #print oldlist
                        #print newList
                        if oldlist[j]==oldlist[i]:
                            #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
                            forbidden.append(j)
                            #print 'forbidden', forbidden
                            del newList[j-checkPoint]
                            #print newList
                            checkPoint=checkPoint+1
    return newList

So your sample works as:

>>> a = [1,2,3,3,3,4,5,6,6,7]
>>> dupList(a)
[1, 2, 3, 4, 5, 6, 7]

Shuffle DataFrame rows

Question: Shuffle DataFrame rows

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?


Answer 0

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
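
As a small illustration of how frac behaves, the sketch below uses a made-up four-row frame (the column values are placeholders loosely based on the question) and a fixed random_state so that reruns give the same order:

```python
import pandas as pd

# Hypothetical miniature of the DataFrame from the question.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})

shuffled = df.sample(frac=1, random_state=0)   # all rows, random order
half = df.sample(frac=0.5, random_state=0)     # half of the rows

print(len(shuffled))                            # 4
print(len(half))                                # 2
print(sorted(shuffled.index) == list(df.index)) # True: same rows, reordered
```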


Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Although the above operation may not look in-place, the memory profile below suggests that Python/pandas does not hold a second full copy once df has been reassigned. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), memory usage after the shuffle stays essentially flat. To check this yourself, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)


Answer 1

You can simply use sklearn for this

from sklearn.utils import shuffle
df = shuffle(df)

Answer 2

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep=r"\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered consecutively (0, 1, .., n-1, much like in your example), you can simply reset the index: df_shuffled.reset_index(drop=True)
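
Put together as a script, the permutation approach might look like the sketch below. The frame is rebuilt by hand rather than read from CSV, and np.random.default_rng with a seed is swapped in for np.random.permutation purely to make the run repeatable; both choices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(seed=0)
df_shuffled = df.iloc[rng.permutation(len(df))]   # positional reorder of the rows
df_renumbered = df_shuffled.reset_index(drop=True)

print(sorted(df_shuffled.index))   # [0, 1, 20, 21, 45, 46]: old labels travel with the rows
print(list(df_renumbered.index))   # [0, 1, 2, 3, 4, 5]
```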


Answer 3

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:

np.random.shuffle(df.values)

Under the hood, a DataFrame uses NumPy ndarrays as its data holder. (You can check this in the DataFrame source code.)

So np.random.shuffle() shuffles the array in place along its first axis, i.e. row by row. The index of the DataFrame, however, remains unshuffled. This relies on df.values returning a view of the underlying data, which is not guaranteed (for example with mixed column dtypes), so verify that df actually changed.

There are some points to consider, though.

  • The function returns None. If you want to keep a copy of the original object, you have to make it before you pass the object to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with other options to control the output. You may want that for development purposes.
  • sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (index, columns) of the DataFrame along with the ndarray it contains.

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle().

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915 sec. 8x faster

np.random.shuffle(nd)

0.8897626010002568 sec

DataFrame

df = sklearn.utils.shuffle(df)

0.3183923360193148 sec. 3x faster

np.random.shuffle(df.values)

0.9357550159329548 sec

Conclusion: If it is okay for the axis info (index, columns) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

Code used:

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)


Answer 4

(I don't have enough reputation to comment on the top post, so I hope someone else can do that for me.) A concern was raised about whether the first method:

df.sample(frac=1)

makes a deep copy or just changes the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.
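
The same check can be written without comparing raw id() values. The sketch below uses a made-up 100-row frame and only tests object identity and row order; it does not try to measure memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(300.0).reshape(100, 3), columns=["a", "b", "c"])
shuffled = df.sample(frac=1, random_state=0)

print(shuffled is df)                            # False: a new DataFrame object
print(list(df.index) == list(range(100)))        # True: the original order is untouched
print(sorted(shuffled.index) == list(df.index))  # True: same rows, new order
```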


Answer 5

Also useful: if you use this for machine learning and always want to split off the same data, you could use:

df.sample(n=len(df), random_state=42)

This makes sure that your random choice is always reproducible.
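
A quick way to convince yourself of the reproducibility is to shuffle twice with the same seed and compare the results. The frame below is a placeholder:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

first = df.sample(n=len(df), random_state=42)
second = df.sample(n=len(df), random_state=42)

# Same seed, same data: the two shuffles are identical, index order included.
print(first.equals(second))  # True
```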


Answer 6

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))

Answer 7

Shuffle the pandas data frame by taking an array (in this case the index positions), randomizing its order, and setting that array as the index of the data frame. Then sort the data frame by that index. The result is your shuffled dataframe:

import random
import pandas as pd
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()

output

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

Insert your data frame in place of mine in the code above.


Answer 8

Here is another way:

df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)


What do the Python file extensions .pyc, .pyd and .pyo stand for?

Question: What do the Python file extensions .pyc, .pyd and .pyo stand for?

What do these python file extensions mean?

  • .pyc
  • .pyd
  • .pyo

What are the differences between them and how are they generated from a *.py file?


Answer 0

  1. .py: This is normally the input source code that you’ve written.
  2. .pyc: This is the compiled bytecode. If you import a module, python will build a *.pyc file that contains the bytecode to make importing it again later easier (and faster).
  3. .pyo: This was a file format used before Python 3.5 for *.pyc files that were created with optimizations (-O) flag. (see the note below)
  4. .pyd: This is basically a windows dll file. http://docs.python.org/faq/windows.html#is-a-pyd-file-the-same-as-a-dll

Also for some further discussion on .pyc vs .pyo, take a look at: http://www.network-theory.co.uk/docs/pytut/CompiledPythonfiles.html (I’ve copied the important part below)

  • When the Python interpreter is invoked with the -O flag, optimized code is generated and stored in ‘.pyo’ files. The optimizer currently doesn’t help much; it only removes assert statements. When -O is used, all bytecode is optimized; .pyc files are ignored and .py files are compiled to optimized bytecode.
  • Passing two -O flags to the Python interpreter (-OO) will cause the bytecode compiler to perform optimizations that could in some rare cases result in malfunctioning programs. Currently only __doc__ strings are removed from the bytecode, resulting in more compact ‘.pyo’ files. Since some programs may rely on having these available, you should only use this option if you know what you’re doing.
  • A program doesn’t run any faster when it is read from a ‘.pyc’ or ‘.pyo’ file than when it is read from a ‘.py’ file; the only thing that’s faster about ‘.pyc’ or ‘.pyo’ files is the speed with which they are loaded.
  • When a script is run by giving its name on the command line, the bytecode for the script is never written to a ‘.pyc’ or ‘.pyo’ file. Thus, the startup time of a script may be reduced by moving most of its code to a module and having a small bootstrap script that imports that module. It is also possible to name a ‘.pyc’ or ‘.pyo’ file directly on the command line.

Note:

On 2015-09-13 the Python 3.5 release implemented PEP 488 and eliminated .pyo files. This means that .pyc files represent both unoptimized and optimized bytecode; optimized files are distinguished by an opt- tag in the file name (for example, .opt-1.pyc).
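
As a rough illustration of the PEP 488 naming, the standard library can report where the cached bytecode for a source file would be stored. The file name spam.py is a placeholder, and the exact interpreter tag in the result (for example cpython-311) depends on the Python that runs this.

```python
import importlib.util

# Where would the bytecode for spam.py be cached at each optimization level?
plain = importlib.util.cache_from_source("spam.py", optimization="")
opt1 = importlib.util.cache_from_source("spam.py", optimization=1)   # python -O
opt2 = importlib.util.cache_from_source("spam.py", optimization=2)   # python -OO

for path in (plain, opt1, opt2):
    print(path)
# Each path lives under __pycache__ and ends in .pyc, .opt-1.pyc and .opt-2.pyc respectively.
```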


Answer 1

  • .py – Regular script
  • .py3 – (rarely used) Python3 script. Python3 scripts usually end with “.py” not “.py3”, but I have seen that a few times
  • .pyc – compiled script (Bytecode)
  • .pyo – optimized pyc file (As of Python3.5, Python will only use pyc rather than pyo and pyc)
  • .pyw – Python script to run in Windowed mode, without a console; executed with pythonw.exe
  • .pyx – Cython src to be converted to C/C++
  • .pyd – Python extension module built as a Windows DLL
  • .pxd – Cython script which is equivalent to a C/C++ header
  • .pxi – Cython include file (not a MyPy stub; type stubs use .pyi)
  • .pyi – Stub file (PEP 484)
  • .pyz – Python script archive (PEP 441); this is a script containing compressed Python scripts (ZIP) in binary form after the standard Python script header
  • .pywz – Python script archive for MS-Windows (PEP 441); this is a script containing compressed Python scripts (ZIP) in binary form after the standard Python script header
  • .py[cod] – wildcard notation in “.gitignore” that means the file may be “.pyc”, “.pyo”, or “.pyd”.
  • .pth – a path configuration file; its contents are additional items (one per line) to be added to sys.path. See site module.

A larger list of additional Python file-extensions (mostly rare and unofficial) can be found at http://dcjtech.info/topic/python-file-extensions/


Error: “'dict' object has no attribute 'iteritems'”

Question: Error: “'dict' object has no attribute 'iteritems'”

I’m trying to use NetworkX to read a Shapefile and use the function write_shp() to generate the Shapefiles that will contain the nodes and edges, but when I try to run the code it gives me the following error:

Traceback (most recent call last):
  File "C:/Users/Felipe/PycharmProjects/untitled/asdf.py", line 4, in <module>
    nx.write_shp(redVial, "shapefiles")
  File "C:\Python34\lib\site-packages\networkx\readwrite\nx_shp.py", line 192, in write_shp
    for key, data in e[2].iteritems():
AttributeError: 'dict' object has no attribute 'iteritems'

I’m using Python 3.4 and installed NetworkX via pip install.

Before this error it had already given me another one that said “xrange does not exist” or something like that, so I looked it up and just changed xrange to range in the nx_shp.py file, which seemed to solve it.

From what I’ve read it could be related to the Python version (Python2 vs Python3).


Answer 0

As you are on Python 3, use dict.items() instead of dict.iteritems().

iteritems() was removed in Python 3, so you can't use this method anymore.

Take a look at Python 3.0 Wiki Built-in Changes section, where it is stated:

Removed dict.iteritems(), dict.iterkeys(), and dict.itervalues().

Instead: use dict.items(), dict.keys(), and dict.values() respectively.
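
Applied to a loop like the one in the traceback, the change is a one-word edit. The dictionary below is a hypothetical stand-in for the edge attributes (e[2]) that NetworkX iterates over:

```python
# Hypothetical edge attributes, standing in for e[2] in the traceback.
edge_data = {"name": "Main St", "length": 120}

collected = []
for key, data in edge_data.items():  # Python 3 spelling of iteritems()
    collected.append((key, data))

print(collected)                        # [('name', 'Main St'), ('length', 120)]
print(hasattr(edge_data, "iteritems"))  # False on Python 3
```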


Answer 1

In Python 2, dictionaries had both .items() and .iteritems(). dict.items() returned a list of the (key, value) tuples in the dictionary, [(k1,v1),(k2,v2),...]. It copied all the pairs into a new list, so for a very big dictionary the memory impact was very big.

So dict.iteritems() was added in later versions of Python 2. It returned an iterator object; the whole dictionary was not copied, so memory consumption was lower. People using Python 2 were taught to use dict.iteritems() instead of .items() for efficiency, as the following Python 2 code illustrates.

import timeit

d = {i:i*2 for i in xrange(10000000)}  
start = timeit.default_timer()
for key,value in d.items():
    tmp = key + value #do something like print
t1 = timeit.default_timer() - start

start = timeit.default_timer()
for key,value in d.iteritems():
    tmp = key + value
t2 = timeit.default_timer() - start

Output:

Time with d.items(): 9.04773592949
Time with d.iteritems(): 2.17707300186

In Python 3, they wanted to make it more efficient, so dict.items() now returns a lightweight view instead of a list, taking over the role of dict.iteritems(), and .iteritems() was removed as it was no longer needed.

You have used dict.iteritems() in Python 3, so it failed. Try dict.items(), which serves the same purpose as dict.iteritems() in Python 2. This is a small migration issue from Python 2 to Python 3.
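
To see why Python 3 no longer needs a separate iterator method, note that items() returns a view onto the dictionary rather than a copied list. A minimal sketch with a throwaway dictionary:

```python
d = {"a": 1}
view = d.items()             # a dict view, not a list copy

print(type(view).__name__)   # dict_items

d["b"] = 2                   # the view reflects later changes to the dict
print(list(view))            # [('a', 1), ('b', 2)]
```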


Answer 2

I had a similar problem (using 3.5) and lost half a day to it, but here is something that works. I am retired and just learning Python so I can help my grandson (12) with it.

mydict2={'Atlanta':78,'Macon':85,'Savannah':72}
maxval=(max(mydict2.values()))
print(maxval)
mykey=[key for key,value in mydict2.items()if value==maxval][0]
print(mykey)

Yields:

85
Macon

Answer 3

In Python2, dictionary.iteritems() is more efficient than dictionary.items() so in Python3, the functionality of dictionary.iteritems() has been migrated to dictionary.items() and iteritems() is removed. So you are getting this error.

Use dict.items() in Python3 which is same as dict.iteritems() of Python2.


Answer 4

The purpose of .iteritems() was to use less memory by yielding one result at a time while looping. Python 3 no longer has iteritems() because its .items() already behaves that way: it returns a view instead of building a list.

If you want code that supports both Python 2 and Python 3:

try:
    dict.iteritems                          # exists on Python 2 only
    iteritems = lambda d: d.iteritems()
except AttributeError:                      # Python 3
    iteritems = lambda d: iter(d.items())

Call it as iteritems(some_dict). This can help if you deploy your project on some other system and you aren't sure about the Python version.


Answer 5

As answered by RafaelC, Python 3 renamed dict.iteritems -> dict.items. Try a different version of the package. This will list the available versions (pip reports them in its error message):

python -m pip install yourOwnPackageHere==

Then rerun the command with the version you want to try after the == to install or switch to that version.