标签归档:python-3.x

是否有用于字符串自然排序的内置函数?

问题:是否有用于字符串自然排序的内置函数?

使用Python 3.x,我有一个要对其执行自然字母排序的字符串列表。

自然排序: Windows中文件的排序顺序。

例如,以下列表是自然排序的(我想要的):

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

这是上面列表的“排序”版本(我所拥有的):

['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

我正在寻找一种类似于第一个的排序功能。

I have a list of strings for which I would like to perform a natural alphabetical sort.

For instance, the following list is naturally sorted (what I want):

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

And here’s the “sorted” version of the above list (what I get using sorted()):

['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

I’m looking for a sort function which behaves like the first one.


回答 0

在PyPI上有一个名为natsort的第三方库(完整披露,我是软件包的作者)。对于您的情况,可以执行以下任一操作:

>>> from natsort import natsorted, ns
>>> x = ['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
>>> natsorted(x, key=lambda y: y.lower())
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsorted(x, alg=ns.IGNORECASE)  # or alg=ns.IC
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

您应该注意,它natsort使用通用算法,因此它几乎可以处理您向其抛出的任何输入。如果您想了解为什么选择一个库而不是滚动自己的函数的更多详细信息,请查阅natsort文档的“ 如何工作”页面,尤其是“ 到处都是特殊情况”!部分。


如果您需要排序键而不是排序功能,请使用以下公式之一。

>>> from natsort import natsort_keygen, ns
>>> l1 = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> l2 = l1[:]
>>> natsort_key1 = natsort_keygen(key=lambda y: y.lower())
>>> l1.sort(key=natsort_key1)
>>> l1
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsort_key2 = natsort_keygen(alg=ns.IGNORECASE)
>>> l2.sort(key=natsort_key2)
>>> l2
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

There is a third party library for this on PyPI called natsort (full disclosure, I am the package’s author). For your case, you can do either of the following:

>>> from natsort import natsorted, ns
>>> x = ['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
>>> natsorted(x, key=lambda y: y.lower())
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsorted(x, alg=ns.IGNORECASE)  # or alg=ns.IC
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

You should note that natsort uses a general algorithm so it should work for just about any input that you throw at it. If you want more details on why you might choose a library to do this rather than rolling your own function, check out the natsort documentation’s How It Works page, in particular the Special Cases Everywhere! section.


If you need a sorting key instead of a sorting function, use either of the below formulas.

>>> from natsort import natsort_keygen, ns
>>> l1 = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> l2 = l1[:]
>>> natsort_key1 = natsort_keygen(key=lambda y: y.lower())
>>> l1.sort(key=natsort_key1)
>>> l1
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsort_key2 = natsort_keygen(alg=ns.IGNORECASE)
>>> l2.sort(key=natsort_key2)
>>> l2
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

回答 1

试试这个:

import re

def natural_sort(l): 
    convert = lambda text: int(text) if text.isdigit() else text.lower() 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

输出:

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

从此处改编的代码:为人类分类:自然排序顺序

Try this:

import re

def natural_sort(l): 
    convert = lambda text: int(text) if text.isdigit() else text.lower() 
    alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ] 
    return sorted(l, key = alphanum_key)

Output:

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

Code adapted from here: Sorting for Humans : Natural Sort Order.


回答 2

这是Mark Byer答案的更多Python版本:

import re

def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
    return [int(text) if text.isdigit() else text.lower()
            for text in _nsre.split(s)]    

现在,这个功能可以作为在使用它的任何功能,就象是一把钥匙list.sortsortedmax,等。

作为lambda:

lambda s: [int(t) if t.isdigit() else t.lower() for t in re.split('(\d+)', s)]

Here’s a much more pythonic version of Mark Byer’s answer:

import re

def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
    return [int(text) if text.isdigit() else text.lower()
            for text in _nsre.split(s)]    

Now this function can be used as a key in any function that uses it, like list.sort, sorted, max, etc.

As a lambda:

lambda s: [int(t) if t.isdigit() else t.lower() for t in re.split('(\d+)', s)]

回答 3

我基于http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html编写了一个函数,该函数增加了仍可以传递您自己的“键”参数的功能。我需要这样做,以便执行包含更复杂对象(不仅仅是字符串)的自然列表。

import re

def natural_sort(list, key=lambda s:s):
    """
    Sort the list into natural alphanumeric order.
    """
    def get_alphanum_key_func(key):
        convert = lambda text: int(text) if text.isdigit() else text 
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    list.sort(key=sort_key)

例如:

my_list = [{'name':'b'}, {'name':'10'}, {'name':'a'}, {'name':'1'}, {'name':'9'}]
natural_sort(my_list, key=lambda x: x['name'])
print my_list
[{'name': '1'}, {'name': '9'}, {'name': '10'}, {'name': 'a'}, {'name': 'b'}]

I wrote a function based on http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html which adds the ability to still pass in your own ‘key’ parameter. I need this in order to perform a natural sort of lists that contain more complex objects (not just strings).

import re

def natural_sort(list, key=lambda s:s):
    """
    Sort the list into natural alphanumeric order.
    """
    def get_alphanum_key_func(key):
        convert = lambda text: int(text) if text.isdigit() else text 
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    list.sort(key=sort_key)

For example:

my_list = [{'name':'b'}, {'name':'10'}, {'name':'a'}, {'name':'1'}, {'name':'9'}]
natural_sort(my_list, key=lambda x: x['name'])
print my_list
[{'name': '1'}, {'name': '9'}, {'name': '10'}, {'name': 'a'}, {'name': 'b'}]

回答 4

data = ['elm13', 'elm9', 'elm0', 'elm1', 'Elm11', 'Elm2', 'elm10']

让我们分析数据。所有元素的数字容量为2。公共文字部分共有3个字母'elm'

因此,元素的最大长度为5。我们可以增加此值以确保(例如,增加到8)。

牢记这一点,我们有一个单线解决方案:

data.sort(key=lambda x: '{0:0>8}'.format(x).lower())

没有正则表达式和外部库!

print(data)

>>> ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'elm13']

说明:

for elm in data:
    print('{0:0>8}'.format(elm).lower())

>>>
0000elm0
0000elm1
0000elm2
0000elm9
000elm10
000elm11
000elm13
data = ['elm13', 'elm9', 'elm0', 'elm1', 'Elm11', 'Elm2', 'elm10']

Let’s analyse the data. The digit capacity of all elements is 2. And there are 3 letters in common literal part 'elm'.

So, the maximal length of element is 5. We can increase this value to make sure (for example, to 8).

Bearing that in mind, we’ve got a one-line solution:

data.sort(key=lambda x: '{0:0>8}'.format(x).lower())

without regular expressions and external libraries!

print(data)

>>> ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'elm13']

Explanation:

for elm in data:
    print('{0:0>8}'.format(elm).lower())

>>>
0000elm0
0000elm1
0000elm2
0000elm9
000elm10
000elm11
000elm13

回答 5

鉴于:

data=['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

类似于SergO的解决方案,没有外部库1-liner将是

data.sort(key=lambda x : int(x[3:]))

要么

sorted_data=sorted(data, key=lambda x : int(x[3:]))

说明:

此解决方案使用排序关键功能来定义将用于排序的功能。因为我们知道每个数据条目都以’elm’开头,所以排序函数会将字符串中第三个字符之后的部分(即int(x [3:]))转换为整数。如果数据的数字部分位于其他位置,则函数的这一部分将必须更改。

干杯

Given:

data=['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']

Similar to SergO’s solution, a 1-liner without external libraries would be:

data.sort(key=lambda x : int(x[3:]))

or

sorted_data=sorted(data, key=lambda x : int(x[3:]))

Explanation:

This solution uses the key feature of sort to define a function that will be employed for the sorting. Because we know that every data entry is preceded by ‘elm’ the sorting function converts to integer the portion of the string after the 3rd character (i.e. int(x[3:])). If the numerical part of the data is in a different location, then this part of the function would have to change.

Cheers


回答 6

现在, 有了更多*优雅(pythonic )的功能,只需触摸一下

有很多实现,尽管有些实现已经接近,但没有一个实现能够完全抓住现代python提供的优雅。

  • 使用python(3.5.1)测试
  • 包括一个附加列表,以证明当数字为中间字符串时它可以工作
  • 但是,没有测试,我假设如果您的列表很大,那么事先编译正则表达式会更有效
    • 如果这是错误的假设,我敢肯定有人会纠正我

速成
from re import compile, split    
dre = compile(r'(\d+)')
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
全码
#!/usr/bin/python3
# coding=utf-8
"""
Natural-Sort Test
"""

from re import compile, split

dre = compile(r'(\d+)')
mylist = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13', 'elm']
mylist2 = ['e0lm', 'e1lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm', 'e01lm']

mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
mylist2.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])

print(mylist)  
  # ['elm', 'elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
print(mylist2)  
  # ['e0lm', 'e1lm', 'e01lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm']

使用时的注意事项

  • from os.path import split
    • 您将需要区分进口

灵感来自

And now for something more* elegant (pythonic) -just a touch

There are many implementations out there, and while some have come close, none quite captured the elegance modern python affords.

  • Tested using python(3.5.1)
  • Included an additional list to demonstrate that it works when the numbers are mid string
  • Didn’t test, however, I am assuming that if your list was sizable it would be more efficient to compile the regex beforehand
    • I’m sure someone will correct me if this is an erroneous assumption

Quicky
from re import compile, split    
dre = compile(r'(\d+)')
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
Full-Code
#!/usr/bin/python3
# coding=utf-8
"""
Natural-Sort Test
"""

from re import compile, split

dre = compile(r'(\d+)')
mylist = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13', 'elm']
mylist2 = ['e0lm', 'e1lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm', 'e01lm']

mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
mylist2.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])

print(mylist)  
  # ['elm', 'elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
print(mylist2)  
  # ['e0lm', 'e1lm', 'e01lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm']

Caution when using

  • from os.path import split
    • you will need to differentiate the imports

Inspiration from


回答 7

这篇文章的价值

我的观点是提供可以普遍应用的非正则表达式解决方案。
我将创建三个函数:

  1. find_first_digit我从@AnuragUniyal借来的。它将查找字符串中第一个数字或非数字的位置。
  2. split_digits这是一个生成器,用于将字符串分成数字和非数字块。yield如果是数字,它也将是整数。
  3. natural_key只是包装split_digits成一个tuple。这是我们作为一个关键的使用sortedmaxmin

功能

def find_first_digit(s, non=False):
    for i, x in enumerate(s):
        if x.isdigit() ^ non:
            return i
    return -1

def split_digits(s, case=False):
    non = True
    while s:
        i = find_first_digit(s, non)
        if i == 0:
            non = not non
        elif i == -1:
            yield int(s) if s.isdigit() else s if case else s.lower()
            s = ''
        else:
            x, s = s[:i], s[i:]
            yield int(x) if x.isdigit() else x if case else x.lower()

def natural_key(s, *args, **kwargs):
    return tuple(split_digits(s, *args, **kwargs))

我们可以看到它是通用的,因为我们可以有多个数字块:

# Note that the key has lower case letters
natural_key('asl;dkfDFKJ:sdlkfjdf809lkasdjfa_543_hh')

('asl;dkfdfkj:sdlkfjdf', 809, 'lkasdjfa_', 543, '_hh')

或区分大小写:

natural_key('asl;dkfDFKJ:sdlkfjdf809lkasdjfa_543_hh', True)

('asl;dkfDFKJ:sdlkfjdf', 809, 'lkasdjfa_', 543, '_hh')

我们可以看到它以适当的顺序对OP的列表进行了排序

sorted(
    ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13'],
    key=natural_key
)

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

但是它也可以处理更复杂的列表:

sorted(
    ['f_1', 'e_1', 'a_2', 'g_0', 'd_0_12:2', 'd_0_1_:2'],
    key=natural_key
)

['a_2', 'd_0_1_:2', 'd_0_12:2', 'e_1', 'f_1', 'g_0']

我的正则表达式将是

def int_maybe(x):
    return int(x) if str(x).isdigit() else x

def split_digits_re(s, case=False):
    parts = re.findall('\d+|\D+', s)
    if not case:
        return map(int_maybe, (x.lower() for x in parts))
    else:
        return map(int_maybe, parts)
    
def natural_key_re(s, *args, **kwargs):
    return tuple(split_digits_re(s, *args, **kwargs))

Value Of This Post

My point is to offer a non regex solution that can be applied generally.
I’ll create three functions:

  1. find_first_digit which I borrowed from @AnuragUniyal. It will find the position of the first digit or non-digit in a string.
  2. split_digits which is a generator that picks apart a string into digit and non digit chunks. It will also yield integers when it is a digit.
  3. natural_key just wraps split_digits into a tuple. This is what we use as a key for sorted, max, min.

Functions

def find_first_digit(s, non=False):
    for i, x in enumerate(s):
        if x.isdigit() ^ non:
            return i
    return -1

def split_digits(s, case=False):
    non = True
    while s:
        i = find_first_digit(s, non)
        if i == 0:
            non = not non
        elif i == -1:
            yield int(s) if s.isdigit() else s if case else s.lower()
            s = ''
        else:
            x, s = s[:i], s[i:]
            yield int(x) if x.isdigit() else x if case else x.lower()

def natural_key(s, *args, **kwargs):
    return tuple(split_digits(s, *args, **kwargs))

We can see that it is general in that we can have multiple digit chunks:

# Note that the key has lower case letters
natural_key('asl;dkfDFKJ:sdlkfjdf809lkasdjfa_543_hh')

('asl;dkfdfkj:sdlkfjdf', 809, 'lkasdjfa_', 543, '_hh')

Or leave as case sensitive:

natural_key('asl;dkfDFKJ:sdlkfjdf809lkasdjfa_543_hh', True)

('asl;dkfDFKJ:sdlkfjdf', 809, 'lkasdjfa_', 543, '_hh')

We can see that it sorts the OP’s list in the appropriate order

sorted(
    ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13'],
    key=natural_key
)

['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

But it can handle more complicated lists as well:

sorted(
    ['f_1', 'e_1', 'a_2', 'g_0', 'd_0_12:2', 'd_0_1_:2'],
    key=natural_key
)

['a_2', 'd_0_1_:2', 'd_0_12:2', 'e_1', 'f_1', 'g_0']

My regex equivalent would be

def int_maybe(x):
    return int(x) if str(x).isdigit() else x

def split_digits_re(s, case=False):
    parts = re.findall('\d+|\D+', s)
    if not case:
        return map(int_maybe, (x.lower() for x in parts))
    else:
        return map(int_maybe, parts)
    
def natural_key_re(s, *args, **kwargs):
    return tuple(split_digits_re(s, *args, **kwargs))

回答 8

一种选择是使用扩展形式http://wiki.answers.com/Q/What_does_expanded_form_mean将字符串转换为元组并替换数字

这样,a90将变为(“ a”,90,0),而a1将变为(“ a”,1)

下面是一些示例代码(由于它从数字中删除前导0的方式,因此效率不高)

alist=["something1",
    "something12",
    "something17",
    "something2",
    "something25and_then_33",
    "something25and_then_34",
    "something29",
    "beta1.1",
    "beta2.3.0",
    "beta2.33.1",
    "a001",
    "a2",
    "z002",
    "z1"]

def key(k):
    nums=set(list("0123456789"))
        chars=set(list(k))
    chars=chars-nums
    for i in range(len(k)):
        for c in chars:
            k=k.replace(c+"0",c)
    l=list(k)
    base=10
    j=0
    for i in range(len(l)-1,-1,-1):
        try:
            l[i]=int(l[i])*base**j
            j+=1
        except:
            j=0
    l=tuple(l)
    print l
    return l

print sorted(alist,key=key)

输出:

('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 1)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 7)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 3)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 4)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 9)
('b', 'e', 't', 'a', 1, '.', 1)
('b', 'e', 't', 'a', 2, '.', 3, '.')
('b', 'e', 't', 'a', 2, '.', 30, 3, '.', 1)
('a', 1)
('a', 2)
('z', 2)
('z', 1)
['a001', 'a2', 'beta1.1', 'beta2.3.0', 'beta2.33.1', 'something1', 'something2', 'something12', 'something17', 'something25and_then_33', 'something25and_then_34', 'something29', 'z1', 'z002']

One option is to turn the string into a tuple and replace digits using expanded form http://wiki.answers.com/Q/What_does_expanded_form_mean

that way a90 would become (“a”,90,0) and a1 would become (“a”,1)

below is some sample code (which isn’t very efficient due to the way It removes leading 0’s from numbers)

alist=["something1",
    "something12",
    "something17",
    "something2",
    "something25and_then_33",
    "something25and_then_34",
    "something29",
    "beta1.1",
    "beta2.3.0",
    "beta2.33.1",
    "a001",
    "a2",
    "z002",
    "z1"]

def key(k):
    nums=set(list("0123456789"))
        chars=set(list(k))
    chars=chars-nums
    for i in range(len(k)):
        for c in chars:
            k=k.replace(c+"0",c)
    l=list(k)
    base=10
    j=0
    for i in range(len(l)-1,-1,-1):
        try:
            l[i]=int(l[i])*base**j
            j+=1
        except:
            j=0
    l=tuple(l)
    print l
    return l

print sorted(alist,key=key)

output:

('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 1)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 7)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 3)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 4)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 9)
('b', 'e', 't', 'a', 1, '.', 1)
('b', 'e', 't', 'a', 2, '.', 3, '.')
('b', 'e', 't', 'a', 2, '.', 30, 3, '.', 1)
('a', 1)
('a', 2)
('z', 2)
('z', 1)
['a001', 'a2', 'beta1.1', 'beta2.3.0', 'beta2.33.1', 'something1', 'something2', 'something12', 'something17', 'something25and_then_33', 'something25and_then_34', 'something29', 'z1', 'z002']

回答 9

根据此处的答案,我编写了一个natural_sorted行为类似于内置函数的函数sorted

# Copyright (C) 2018, Benjamin Drung <bdrung@posteo.de>
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

import re

def natural_sorted(iterable, key=None, reverse=False):
    """Return a new naturally sorted list from the items in *iterable*.

    The returned list is in natural sort order. The string is ordered
    lexicographically (using the Unicode code point number to order individual
    characters), except that multi-digit numbers are ordered as a single
    character.

    Has two optional arguments which must be specified as keyword arguments.

    *key* specifies a function of one argument that is used to extract a
    comparison key from each list element: ``key=str.lower``.  The default value
    is ``None`` (compare the elements directly).

    *reverse* is a boolean value.  If set to ``True``, then the list elements are
    sorted as if each comparison were reversed.

    The :func:`natural_sorted` function is guaranteed to be stable. A sort is
    stable if it guarantees not to change the relative order of elements that
    compare equal --- this is helpful for sorting in multiple passes (for
    example, sort by department, then by salary grade).
    """
    prog = re.compile(r"(\d+)")

    def alphanum_key(element):
        """Split given key in list of strings and digits"""
        return [int(c) if c.isdigit() else c for c in prog.split(key(element)
                if key else element)]

    return sorted(iterable, key=alphanum_key, reverse=reverse)

源代码也可以在我的GitHub代码段存储库中找到:https//github.com/bdrung/snippets/blob/master/natural_sorted.py

Based on the answers here, I wrote a natural_sorted function that behaves like the built-in function sorted:

# Copyright (C) 2018, Benjamin Drung <bdrung@posteo.de>
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

import re

def natural_sorted(iterable, key=None, reverse=False):
    """Return a new naturally sorted list from the items in *iterable*.

    The returned list is in natural sort order. The string is ordered
    lexicographically (using the Unicode code point number to order individual
    characters), except that multi-digit numbers are ordered as a single
    character.

    Has two optional arguments which must be specified as keyword arguments.

    *key* specifies a function of one argument that is used to extract a
    comparison key from each list element: ``key=str.lower``.  The default value
    is ``None`` (compare the elements directly).

    *reverse* is a boolean value.  If set to ``True``, then the list elements are
    sorted as if each comparison were reversed.

    The :func:`natural_sorted` function is guaranteed to be stable. A sort is
    stable if it guarantees not to change the relative order of elements that
    compare equal --- this is helpful for sorting in multiple passes (for
    example, sort by department, then by salary grade).
    """
    prog = re.compile(r"(\d+)")

    def alphanum_key(element):
        """Split given key in list of strings and digits"""
        return [int(c) if c.isdigit() else c for c in prog.split(key(element)
                if key else element)]

    return sorted(iterable, key=alphanum_key, reverse=reverse)

The source code is also available in my GitHub snippets repository: https://github.com/bdrung/snippets/blob/master/natural_sorted.py


回答 10

上面的答案对于显示的特定示例很好,但是对于更自然的自然排序问题却错过了几个有用的案例。我只是被其中一种情况所困扰,因此创建了一个更彻底的解决方案:

def natural_sort_key(string_or_number):
    """
    by Scott S. Lawton <scott@ProductArchitect.com> 2014-12-11; public domain and/or CC0 license

    handles cases where simple 'int' approach fails, e.g.
        ['0.501', '0.55'] floating point with different number of significant digits
        [0.01, 0.1, 1]    already numeric so regex and other string functions won't work (and aren't required)
        ['elm1', 'Elm2']  ASCII vs. letters (not case sensitive)
    """

    def try_float(astring):
        try:
            return float(astring)
        except:
            return astring

    if isinstance(string_or_number, basestring):
        string_or_number = string_or_number.lower()

        if len(re.findall('[.]\d', string_or_number)) <= 1:
            # assume a floating point value, e.g. to correctly sort ['0.501', '0.55']
            # '.' for decimal is locale-specific, e.g. correct for the Anglosphere and Asia but not continental Europe
            return [try_float(s) for s in re.split(r'([\d.]+)', string_or_number)]
        else:
            # assume distinct fields, e.g. IP address, phone number with '.', etc.
            # caveat: might want to first split by whitespace
            # TBD: for unicode, replace isdigit with isdecimal
            return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_or_number)]
    else:
        # consider: add code to recurse for lists/tuples and perhaps other iterables
        return string_or_number

测试代码和一些链接(StackOverflow的打开和关闭)在这里:http : //productarchitect.com/code/better-natural-sort.py

欢迎反馈。这并不是一个确定的解决方案。只是前进了一步。

The above answers are good for the specific example that was shown, but miss several useful cases for the more general question of natural sort. I just got bit by one of those cases, so created a more thorough solution:

def natural_sort_key(string_or_number):
    """
    by Scott S. Lawton <scott@ProductArchitect.com> 2014-12-11; public domain and/or CC0 license

    handles cases where simple 'int' approach fails, e.g.
        ['0.501', '0.55'] floating point with different number of significant digits
        [0.01, 0.1, 1]    already numeric so regex and other string functions won't work (and aren't required)
        ['elm1', 'Elm2']  ASCII vs. letters (not case sensitive)
    """

    def try_float(astring):
        try:
            return float(astring)
        except:
            return astring

    if isinstance(string_or_number, basestring):
        string_or_number = string_or_number.lower()

        if len(re.findall('[.]\d', string_or_number)) <= 1:
            # assume a floating point value, e.g. to correctly sort ['0.501', '0.55']
            # '.' for decimal is locale-specific, e.g. correct for the Anglosphere and Asia but not continental Europe
            return [try_float(s) for s in re.split(r'([\d.]+)', string_or_number)]
        else:
            # assume distinct fields, e.g. IP address, phone number with '.', etc.
            # caveat: might want to first split by whitespace
            # TBD: for unicode, replace isdigit with isdecimal
            return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_or_number)]
    else:
        # consider: add code to recurse for lists/tuples and perhaps other iterables
        return string_or_number

Test code and several links (on and off of StackOverflow) are here: http://productarchitect.com/code/better-natural-sort.py

Feedback welcome. That’s not meant to be a definitive solution; just a step forward.


回答 11

最有可能functools.cmp_to_key()与python sort的底层实现紧密相关。此外,cmp参数是旧式的。现代方法是将输入项转换为支持所需的丰富比较操作的对象。

在CPython 2.x中,即使尚未实现各自的丰富比较运算符,也可以对不同类型的对象进行排序。在CPython 3.x下,不同类型的对象必须显式支持比较。请参阅Python如何比较string和int?链接到官方文档。大多数答案取决于这种隐式排序。切换到Python 3.x将需要一种新类型来实现和统一数字和字符串之间的比较。

Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
>>> (0,"foo") < ("foo",0)
True  
Python 3.5.2 (default, Oct 14 2016, 12:54:53) 
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: unorderable types: int() < str()

有三种不同的方法。第一种使用嵌套类来利用Python的Iterable比较算法。第二个将嵌套展开为单个类。第三类放弃了子类化str,而是专注于性能。都是定时的;第二个速度快两倍,而第三个速度快六倍。子类化str不是必需的,并且一开始可能不是一个好主意,但是确实带来了一些便利。

复制排序字符以强制按大小写排序,并交换大小写以强制小写字母优先排序;这是“自然排序”的典型定义。我无法决定分组的类型;有些人可能更喜欢以下内容,这也带来了显着的性能优势:

d = lambda s: s.lower()+s.swapcase()

如果使用了比较运算符,则将它们设置为,object因此不会被忽略functools.total_ordering

import functools
import itertools


@functools.total_ordering
class NaturalStringA(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda c, s: [ c.NaturalStringPart("".join(v))
                        for k,v in
                       itertools.groupby(s, c.isdigit)
                     ]
    d = classmethod(d)
    @functools.total_ordering
    class NaturalStringPart(str):
        d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
        d = staticmethod(d)
        def __lt__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) < int(other)
            except ValueError:
                if self.isdigit():
                    return True
                elif other.isdigit():
                    return False
                else:
                    return self.d(self) < self.d(other)
        def __eq__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) == int(other)
            except ValueError:
                if self.isdigit() or other.isdigit():
                    return False
                else:
                    return self.d(self) == self.d(other)
        __le__ = object.__le__
        __ne__ = object.__ne__
        __gt__ = object.__gt__
        __ge__ = object.__ge__
    def __lt__(self, other):
        return self.d(self) < self.d(other)
    def __eq__(self, other):
        return self.d(self) == self.d(other)
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools


@functools.total_ordering
class NaturalStringB(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
    d = staticmethod(d)
    def __lt__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
        return False
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None or o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return False
        return True
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools
import enum


class OrderingType(enum.Enum):
    PerWordSwapCase         = lambda s: s.lower()+s.swapcase()
    PerCharacterSwapCase    = lambda s: "".join(c.lower()+c.swapcase() for c in s)


class NaturalOrdering:
    @classmethod
    def by(cls, ordering):
        def wrapper(string):
            return cls(string, ordering)
        return wrapper
    def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
        self.string = string
        self.groups = [ (k,int("".join(v)))
                            if k else
                        (k,ordering("".join(v)))
                            for k,v in
                        itertools.groupby(string, str.isdigit)
                      ]
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , self.string
            )
    def __lesser(self, other, default):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return s_v < o_v
        return default
    def __lt__(self, other):
        return self.__lesser(other, default=False)
    def __le__(self, other):
        return self.__lesser(other, default=True)
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None or o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return False
        return True
    # functools.total_ordering doesn't create single-call wrappers if both
    # __le__ and __lt__ exist, so do it manually.
    def __gt__(self, other):
        op_result = self.__le__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    def __ge__(self, other):
        op_result = self.__lt__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    # __ne__ is the only implied ordering relationship, it automatically
    # delegates to __eq__
>>> import natsort
>>> import timeit
>>> l1 = ['Apple', 'corn', 'apPlE', 'arbour', 'Corn', 'Banana', 'apple', 'banana']
>>> l2 = list(map(str, range(30)))
>>> l3 = ["{} {}".format(x,y) for x in l1 for y in l2]
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringA)', number=10000, globals=globals()))
362.4729259099986
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringB)', number=10000, globals=globals()))
189.7340817489967
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalOrdering.by(OrderingType.PerCharacterSwapCase))', number=10000, globals=globals()))
69.34636392899847
>>> print(timeit.timeit('natsort.natsorted(l3+["0"], alg=natsort.ns.GROUPLETTERS | natsort.ns.LOWERCASEFIRST)', number=10000, globals=globals()))
98.2531585780016

自然排序既复杂又模糊地定义为问题。不要忘记unicodedata.normalize(...)事先运行,并考虑使用str.casefold()而不是str.lower()。我可能没有考虑过细微的编码问题。因此,我暂时推荐natsort库。我快速浏览了github仓库;代码维护一直很出色。

我见过的所有算法都取决于技巧,例如复制和降低字符以及交换大小写。虽然这使运行时间加倍,但是另一种方法将要求对输入字符集进行自然排序。我认为这不是unicode规范的一部分,并且由于unicode位数比得多[0-9],因此创建这种排序同样令人生畏。如果要进行区域设置比较,请locale.strxfrm按照Python的Sorting HOW TO准备字符串。

Most likely functools.cmp_to_key() is closely tied to the underlying implementation of python’s sort. Besides, the cmp parameter is legacy. The modern way is to transform the input items into objects that support the desired rich comparison operations.

Under CPython 2.x, objects of disparate types can be ordered even if the respective rich comparison operators haven’t been implemented. Under CPython 3.x, objects of different types must explicitly support the comparison. See How does Python compare string and int? which links to the official documentation. Most of the answers depend on this implicit ordering. Switching to Python 3.x will require a new type to implement and unify comparisons between numbers and strings.

Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
>>> (0,"foo") < ("foo",0)
True  
Python 3.5.2 (default, Oct 14 2016, 12:54:53) 
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: unorderable types: int() < str()

There are three different approaches. The first uses nested classes to take advantage of Python’s Iterable comparison algorithm. The second unrolls this nesting into a single class. The third foregoes subclassing str to focus on performance. All are timed; the second is twice as fast while the third almost six times faster. Subclassing str isn’t required, and was probably a bad idea in the first place, but it does come with certain conveniences.

The sort characters are duplicated to force ordering by case, and case-swapped to force lower case letter to sort first; this is the typical definition of “natural sort”. I couldn’t decide on the type of grouping; some might prefer the following, which also brings significant performance benefits:

d = lambda s: s.lower()+s.swapcase()

Where utilized, the comparison operators are set to that of object so they won’t be ignored by functools.total_ordering.

import functools
import itertools


@functools.total_ordering
class NaturalStringA(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda c, s: [ c.NaturalStringPart("".join(v))
                        for k,v in
                       itertools.groupby(s, c.isdigit)
                     ]
    d = classmethod(d)
    @functools.total_ordering
    class NaturalStringPart(str):
        d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
        d = staticmethod(d)
        def __lt__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) < int(other)
            except ValueError:
                if self.isdigit():
                    return True
                elif other.isdigit():
                    return False
                else:
                    return self.d(self) < self.d(other)
        def __eq__(self, other):
            if not isinstance(self, type(other)):
                return NotImplemented
            try:
                return int(self) == int(other)
            except ValueError:
                if self.isdigit() or other.isdigit():
                    return False
                else:
                    return self.d(self) == self.d(other)
        __le__ = object.__le__
        __ne__ = object.__ne__
        __gt__ = object.__gt__
        __ge__ = object.__ge__
    def __lt__(self, other):
        return self.d(self) < self.d(other)
    def __eq__(self, other):
        return self.d(self) == self.d(other)
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools


@functools.total_ordering
class NaturalStringB(str):
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , super().__repr__()
            )
    d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
    d = staticmethod(d)
    def __lt__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return s_v < o_v
        return False
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
        zipped = itertools.zip_longest(*groups)
        for s,o in zipped:
            if s is None or o is None:
                return False
            s_k, s_v = s[0], "".join(s[1])
            o_k, o_v = o[0], "".join(o[1])
            if s_k and o_k:
                s_v, o_v = int(s_v), int(o_v)
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                s_v, o_v = self.d(s_v), self.d(o_v)
                if s_v == o_v:
                    continue
                return False
        return True
    __le__ = object.__le__
    __ne__ = object.__ne__
    __gt__ = object.__gt__
    __ge__ = object.__ge__
import functools
import itertools
import enum


class OrderingType(enum.Enum):
    PerWordSwapCase         = lambda s: s.lower()+s.swapcase()
    PerCharacterSwapCase    = lambda s: "".join(c.lower()+c.swapcase() for c in s)


class NaturalOrdering:
    @classmethod
    def by(cls, ordering):
        def wrapper(string):
            return cls(string, ordering)
        return wrapper
    def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
        self.string = string
        self.groups = [ (k,int("".join(v)))
                            if k else
                        (k,ordering("".join(v)))
                            for k,v in
                        itertools.groupby(string, str.isdigit)
                      ]
    def __repr__(self):
        return "{}({})".format\
            ( type(self).__name__
            , self.string
            )
    def __lesser(self, other, default):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None:
                return True
            if o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return s_v < o_v
            elif s_k:
                return True
            elif o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return s_v < o_v
        return default
    def __lt__(self, other):
        return self.__lesser(other, default=False)
    def __le__(self, other):
        return self.__lesser(other, default=True)
    def __eq__(self, other):
        if not isinstance(self, type(other)):
            return NotImplemented
        for s,o in itertools.zip_longest(self.groups, other.groups):
            if s is None or o is None:
                return False
            s_k, s_v = s
            o_k, o_v = o
            if s_k and o_k:
                if s_v == o_v:
                    continue
                return False
            elif s_k or o_k:
                return False
            else:
                if s_v == o_v:
                    continue
                return False
        return True
    # functools.total_ordering doesn't create single-call wrappers if both
    # __le__ and __lt__ exist, so do it manually.
    def __gt__(self, other):
        op_result = self.__le__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    def __ge__(self, other):
        op_result = self.__lt__(other)
        if op_result is NotImplemented:
            return op_result
        return not op_result
    # __ne__ is the only implied ordering relationship, it automatically
    # delegates to __eq__
>>> import natsort
>>> import timeit
>>> l1 = ['Apple', 'corn', 'apPlE', 'arbour', 'Corn', 'Banana', 'apple', 'banana']
>>> l2 = list(map(str, range(30)))
>>> l3 = ["{} {}".format(x,y) for x in l1 for y in l2]
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringA)', number=10000, globals=globals()))
362.4729259099986
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringB)', number=10000, globals=globals()))
189.7340817489967
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalOrdering.by(OrderingType.PerCharacterSwapCase))', number=10000, globals=globals()))
69.34636392899847
>>> print(timeit.timeit('natsort.natsorted(l3+["0"], alg=natsort.ns.GROUPLETTERS | natsort.ns.LOWERCASEFIRST)', number=10000, globals=globals()))
98.2531585780016

Natural sorting is both pretty complicated and vaguely defined as a problem. Don’t forget to run unicodedata.normalize(...) beforehand, and consider use str.casefold() rather than str.lower(). There are probably subtle encoding issues I haven’t considered. So I tentatively recommend the natsort library. I took a quick glance at the github repository; the code maintenance has been stellar.

All the algorithms I’ve seen depend on tricks such as duplicating and lowering characters, and swapping case. While this doubles the running time, an alternative would require a total natural ordering on the input character set. I don’t think this is part of the unicode specification, and since there are many more unicode digits than [0-9], creating such a sorting would be equally daunting. If you want locale-aware comparisons, prepare your strings with locale.strxfrm per Python’s Sorting HOW TO.


回答 12

让我对这种需要提出自己的看法:

from typing import Tuple, Union, Optional, Generator


StrOrInt = Union[str, int]


# On Python 3.6, string concatenation is REALLY fast
# Tested myself, and this fella also tested:
# https://blog.ganssle.io/articles/2019/11/string-concat.html
def griter(s: str) -> Generator[StrOrInt, None, None]:
    last_was_digit: Optional[bool] = None
    cluster: str = ""
    for c in s:
        if last_was_digit is None:
            last_was_digit = c.isdigit()
            cluster += c
            continue
        if c.isdigit() != last_was_digit:
            if last_was_digit:
                yield int(cluster)
            else:
                yield cluster
            last_was_digit = c.isdigit()
            cluster = ""
        cluster += c
    if last_was_digit:
        yield int(cluster)
    else:
        yield cluster
    return


def grouper(s: str) -> Tuple[StrOrInt, ...]:
    return tuple(griter(s))

现在,如果我们有这样的列表:

filelist = [
    'File3', 'File007', 'File3a', 'File10', 'File11', 'File1', 'File4', 'File5',
    'File9', 'File8', 'File8b1', 'File8b2', 'File8b11', 'File6'
]

我们可以简单地使用key=kwarg进行自然排序:

>>> sorted(filelist, key=grouper)
['File1', 'File3', 'File3a', 'File4', 'File5', 'File6', 'File007', 'File8', 
'File8b1', 'File8b2', 'File8b11', 'File9', 'File10', 'File11']

当然,这里的缺点是,因为现在,函数将大写字母排在小写字母之前。

我将把不区分大小写的石斑鱼的实现留给读者:-)

Let me submit my own take on this need:

from typing import Tuple, Union, Optional, Generator


StrOrInt = Union[str, int]


# On Python 3.6, string concatenation is REALLY fast
# Tested myself, and this fella also tested:
# https://blog.ganssle.io/articles/2019/11/string-concat.html
def griter(s: str) -> Generator[StrOrInt, None, None]:
    last_was_digit: Optional[bool] = None
    cluster: str = ""
    for c in s:
        if last_was_digit is None:
            last_was_digit = c.isdigit()
            cluster += c
            continue
        if c.isdigit() != last_was_digit:
            if last_was_digit:
                yield int(cluster)
            else:
                yield cluster
            last_was_digit = c.isdigit()
            cluster = ""
        cluster += c
    if last_was_digit:
        yield int(cluster)
    else:
        yield cluster
    return


def grouper(s: str) -> Tuple[StrOrInt, ...]:
    return tuple(griter(s))

Now if we have the list like such:

filelist = [
    'File3', 'File007', 'File3a', 'File10', 'File11', 'File1', 'File4', 'File5',
    'File9', 'File8', 'File8b1', 'File8b2', 'File8b11', 'File6'
]

We can simply use the key= kwarg to do a natural sort:

>>> sorted(filelist, key=grouper)
['File1', 'File3', 'File3a', 'File4', 'File5', 'File6', 'File007', 'File8', 
'File8b1', 'File8b2', 'File8b11', 'File9', 'File10', 'File11']

The drawback here is of course, as it is now, the function will sort uppercase letters before lowercase letters.

I’ll leave the implementation of a case-insenstive grouper to the reader :-)


回答 13

我建议您只使用key关键字参数of sorted来获得所需的列表
,例如:

to_order= [e2,E1,e5,E4,e3]
ordered= sorted(to_order, key= lambda x: x.lower())
    # ordered should be [E1,e2,e3,E4,e5]

I suggest you simply use the key keyword argument of sorted to achieve your desired list
For example:

to_order= [e2,E1,e5,E4,e3]
ordered= sorted(to_order, key= lambda x: x.lower())
    # ordered should be [E1,e2,e3,E4,e5]

回答 14

在@Mark Byers回答之后,这是一个接受key参数的改编,并且更符合PEP8。

def natsorted(seq, key=None):
    def convert(text):
        return int(text) if text.isdigit() else text

    def alphanum(obj):
        if key is not None:
            return [convert(c) for c in re.split(r'([0-9]+)', key(obj))]
        return [convert(c) for c in re.split(r'([0-9]+)', obj)]

    return sorted(seq, key=alphanum)

我也做了一个要点

Following @Mark Byers answer, here is an adaptation which accepts the key parameter, and is more PEP8-compliant.

def natsorted(seq, key=None):
    def convert(text):
        return int(text) if text.isdigit() else text

    def alphanum(obj):
        if key is not None:
            return [convert(c) for c in re.split(r'([0-9]+)', key(obj))]
        return [convert(c) for c in re.split(r'([0-9]+)', obj)]

    return sorted(seq, key=alphanum)

I also made a Gist


回答 15

Claudiu对Mark Byer答案的改进;-)

import re

def natural_sort_key(s, _re=re.compile(r'(\d+)')):
    return [int(t) if i & 1 else t.lower() for i, t in enumerate(_re.split(s))]

...
my_naturally_sorted_list = sorted(my_list, key=natural_sort_key)

顺便说一句,也许不是每个人都记得函数参数默认值是在def时间评估的

An improvement on Claudiu’s improvement on Mark Byers’ answer ;-)

import re

def natural_sort_key(s, _re=re.compile(r'(\d+)')):
    return [int(t) if i & 1 else t.lower() for i, t in enumerate(_re.split(s))]

...
my_naturally_sorted_list = sorted(my_list, key=natural_sort_key)

BTW, maybe not everyone remembers that function argument defaults are evaluated at def time


回答 16

a = ['H1', 'H100', 'H10', 'H3', 'H2', 'H6', 'H11', 'H50', 'H5', 'H99', 'H8']
b = ''
c = []

def bubble(bad_list):#bubble sort method
        length = len(bad_list) - 1
        sorted = False

        while not sorted:
                sorted = True
                for i in range(length):
                        if bad_list[i] > bad_list[i+1]:
                                sorted = False
                                bad_list[i], bad_list[i+1] = bad_list[i+1], bad_list[i] #sort the integer list 
                                a[i], a[i+1] = a[i+1], a[i] #sort the main list based on the integer list index value

for a_string in a: #extract the number in the string character by character
        for letter in a_string:
                if letter.isdigit():
                        #print letter
                        b += letter
        c.append(b)
        b = ''

print 'Before sorting....'
print a
c = map(int, c) #converting string list into number list
print c
bubble(c)

print 'After sorting....'
print c
print a

致谢

泡泡排序作业

如何在python中一次读取一个字母的字符串

a = ['H1', 'H100', 'H10', 'H3', 'H2', 'H6', 'H11', 'H50', 'H5', 'H99', 'H8']
b = ''
c = []

def bubble(bad_list):#bubble sort method
        length = len(bad_list) - 1
        sorted = False

        while not sorted:
                sorted = True
                for i in range(length):
                        if bad_list[i] > bad_list[i+1]:
                                sorted = False
                                bad_list[i], bad_list[i+1] = bad_list[i+1], bad_list[i] #sort the integer list 
                                a[i], a[i+1] = a[i+1], a[i] #sort the main list based on the integer list index value

for a_string in a: #extract the number in the string character by character
        for letter in a_string:
                if letter.isdigit():
                        #print letter
                        b += letter
        c.append(b)
        b = ''

print 'Before sorting....'
print a
c = map(int, c) #converting string list into number list
print c
bubble(c)

print 'After sorting....'
print c
print a

Acknowledgments:

Bubble Sort Homework

How to read a string one letter at a time in python


回答 17

>>> import re
>>> sorted(lst, key=lambda x: int(re.findall(r'\d+$', x)[0]))
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> import re
>>> sorted(lst, key=lambda x: int(re.findall(r'\d+$', x)[0]))
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

如何在python 3.x中使用string.replace()

问题:如何在python 3.x中使用string.replace()

在python 3.x上不推荐使用string.replace()。这样做的新方法是什么?

The string.replace() is deprecated on python 3.x. What is the new way of doing this?


回答 0

与2.x中一样,使用str.replace()

例:

>>> 'Hello world'.replace('world', 'Guido')
'Hello Guido'

As in 2.x, use str.replace().

Example:

>>> 'Hello world'.replace('world', 'Guido')
'Hello Guido'

回答 1

replace()<class 'str'>python3中的一种方法:

>>> 'hello, world'.replace(',', ':')
'hello: world'

replace() is a method of <class 'str'> in python3:

>>> 'hello, world'.replace(',', ':')
'hello: world'

回答 2

python 3中的replace()方法仅用于:

a = "This is the island of istanbul"
print (a.replace("is" , "was" , 3))

#3 is the maximum replacement that can be done in the string#

>>> Thwas was the wasland of istanbul

# Last substring 'is' in istanbul is not replaced by was because maximum of 3 has already been reached

The replace() method in python 3 is used simply by:

a = "This is the island of istanbul"
print (a.replace("is" , "was" , 3))

#3 is the maximum replacement that can be done in the string#

>>> Thwas was the wasland of istanbul

# Last substring 'is' in istanbul is not replaced by was because maximum of 3 has already been reached

回答 3

您可以使用str.replace()作为str.replace() 。假设您有一个类似的字符串,'Testing PRI/Sec (#434242332;PP:432:133423846,335)'并且您想要将所有'#',':',';','/'符号替换为'-'。您可以通过这种方式(常规方式)进行替换,

>>> str = 'Testing PRI/Sec (#434242332;PP:432:133423846,335)'
>>> str = str.replace('#', '-')
>>> str = str.replace(':', '-')
>>> str = str.replace(';', '-')
>>> str = str.replace('/', '-')
>>> str
'Testing PRI-Sec (-434242332-PP-432-133423846,335)'

或这样(str.replace()的链)

>>> str = 'Testing PRI/Sec (#434242332;PP:432:133423846,335)'.replace('#', '-').replace(':', '-').replace(';', '-').replace('/', '-')
>>> str
'Testing PRI-Sec (-434242332-PP-432-133423846,335)'

You can use str.replace() as a chain of str.replace(). Think you have a string like 'Testing PRI/Sec (#434242332;PP:432:133423846,335)' and you want to replace all the '#',':',';','/' sign with '-'. You can replace it either this way(normal way),

>>> str = 'Testing PRI/Sec (#434242332;PP:432:133423846,335)'
>>> str = str.replace('#', '-')
>>> str = str.replace(':', '-')
>>> str = str.replace(';', '-')
>>> str = str.replace('/', '-')
>>> str
'Testing PRI-Sec (-434242332-PP-432-133423846,335)'

or this way(chain of str.replace())

>>> str = 'Testing PRI/Sec (#434242332;PP:432:133423846,335)'.replace('#', '-').replace(':', '-').replace(';', '-').replace('/', '-')
>>> str
'Testing PRI-Sec (-434242332-PP-432-133423846,335)'

回答 4

试试这个:

mystring = "This Is A String"
print(mystring.replace("String","Text"))

Try this:

mystring = "This Is A String"
print(mystring.replace("String","Text"))

回答 5

仅供参考,将一些字符附加到字符串内任意位置固定的单词(例如,通过添加后缀-ly来将形容词更改为副词)时,可以将后缀放在行的末尾以提高可读性。为此,请split()在内部使用replace()

s="The dog is large small"
ss=s.replace(s.split()[3],s.split()[3]+'ly')
ss
'The dog is largely small'

FYI, when appending some characters to an arbitrary, position-fixed word inside the string (e.g. changing an adjective to an adverb by adding the suffix -ly), you can put the suffix at the end of the line for readability. To do this, use split() inside replace():

s="The dog is large small"
ss=s.replace(s.split()[3],s.split()[3]+'ly')
ss
'The dog is largely small'

回答 6

ss = s.replace(s.split()[1], +s.split()[1] + 'gy')
# should have no plus after the comma --i.e.,
ss = s.replace(s.split()[1], s.split()[1] + 'gy')
ss = s.replace(s.split()[1], +s.split()[1] + 'gy')
# should have no plus after the comma --i.e.,
ss = s.replace(s.split()[1], s.split()[1] + 'gy')

为什么(’x’,)中的’x’比’x’==’x’快?

问题:为什么(’x’,)中的’x’比’x’==’x’快?

>>> timeit.timeit("'x' in ('x',)")
0.04869917374131205
>>> timeit.timeit("'x' == 'x'")
0.06144205736110564

也适用于具有多个元素的元组,两个版本似乎线性增长:

>>> timeit.timeit("'x' in ('x', 'y')")
0.04866674801541748
>>> timeit.timeit("'x' == 'x' or 'x' == 'y'")
0.06565782838087131
>>> timeit.timeit("'x' in ('y', 'x')")
0.08975995576448526
>>> timeit.timeit("'x' == 'y' or 'x' == 'y'")
0.12992391047427532

基于此,我认为我应该完全开始in在任何地方而不是在所有地方使用==

>>> timeit.timeit("'x' in ('x',)")
0.04869917374131205
>>> timeit.timeit("'x' == 'x'")
0.06144205736110564

Also works for tuples with multiple elements, both versions seem to grow linearly:

>>> timeit.timeit("'x' in ('x', 'y')")
0.04866674801541748
>>> timeit.timeit("'x' == 'x' or 'x' == 'y'")
0.06565782838087131
>>> timeit.timeit("'x' in ('y', 'x')")
0.08975995576448526
>>> timeit.timeit("'x' == 'y' or 'x' == 'y'")
0.12992391047427532

Based on this, I think I should totally start using in everywhere instead of ==!


回答 0

正如我对大卫·沃尔沃(David Wolever)所提到的那样,这不仅仅是眼神。两种方法都发送到is; 你可以通过做证明

min(Timer("x == x", setup="x = 'a' * 1000000").repeat(10, 10000))
#>>> 0.00045456900261342525

min(Timer("x == y", setup="x = 'a' * 1000000; y = 'a' * 1000000").repeat(10, 10000))
#>>> 0.5256857610074803

第一个只能如此之快,因为它通过身份检查。

为了找出为什么一个比另一个要花更长的时间,让我们追溯执行。

它们都以开头ceval.cCOMPARE_OP因为这是所涉及的字节码

TARGET(COMPARE_OP) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = cmp_outcome(oparg, left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();
}

这会从堆栈中弹出值(从技术上讲,它只会弹出一个)

PyObject *right = POP();
PyObject *left = TOP();

并运行比较:

PyObject *res = cmp_outcome(oparg, left, right);

cmp_outcome 这是:

static PyObject *
cmp_outcome(int op, PyObject *v, PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS: ...
    case PyCmp_IS_NOT: ...
    case PyCmp_IN:
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN: ...
    case PyCmp_EXC_MATCH: ...
    default:
        return PyObject_RichCompare(v, w, op);
    }
    v = res ? Py_True : Py_False;
    Py_INCREF(v);
    return v;
}

这是路径分开的地方。该PyCmp_IN分支不

int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
    if (sqm != NULL && sqm->sq_contains != NULL)
        return (*sqm->sq_contains)(seq, ob);
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

请注意,元组定义为

static PySequenceMethods tuple_as_sequence = {
    ...
    (objobjproc)tuplecontains,                  /* sq_contains */
};

PyTypeObject PyTuple_Type = {
    ...
    &tuple_as_sequence,                         /* tp_as_sequence */
    ...
};

所以分公司

if (sqm != NULL && sqm->sq_contains != NULL)

将采用和*sqm->sq_contains,即功能(objobjproc)tuplecontains

这确实

static int
tuplecontains(PyTupleObject *a, PyObject *el)
{
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i)
        cmp = PyObject_RichCompareBool(el, PyTuple_GET_ITEM(a, i),
                                           Py_EQ);
    return cmp;
}

…等等,那不是PyObject_RichCompareBool其他分支所采取的吗?不,那是PyObject_RichCompare

该代码路径很短,因此很可能取决于这两者的速度。让我们比较一下。

int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    ...
}

代码路径PyObject_RichCompareBool几乎立即终止。对于PyObject_RichCompare,它确实

PyObject *
PyObject_RichCompare(PyObject *v, PyObject *w, int op)
{
    PyObject *res;

    assert(Py_LT <= op && op <= Py_GE);
    if (v == NULL || w == NULL) { ... }
    if (Py_EnterRecursiveCall(" in comparison"))
        return NULL;
    res = do_richcompare(v, w, op);
    Py_LeaveRecursiveCall();
    return res;
}

Py_EnterRecursiveCall/ Py_LeaveRecursiveCall组合不采取在前面的路径,但这些都是比较快的宏将递增和递减一些全局后短路。

do_richcompare 确实:

static PyObject *
do_richcompare(PyObject *v, PyObject *w, int op)
{
    richcmpfunc f;
    PyObject *res;
    int checked_reverse_op = 0;

    if (v->ob_type != w->ob_type && ...) { ... }
    if ((f = v->ob_type->tp_richcompare) != NULL) {
        res = (*f)(v, w, op);
        if (res != Py_NotImplemented)
            return res;
        ...
    }
    ...
}

这做一些快速的检查,以电话v->ob_type->tp_richcompare

PyTypeObject PyUnicode_Type = {
    ...
    PyUnicode_RichCompare,      /* tp_richcompare */
    ...
};

哪个

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;
    PyObject *v;

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
        Py_RETURN_NOTIMPLEMENTED;

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)
        return NULL;

    if (left == right) {
        switch (op) {
        case Py_EQ:
        case Py_LE:
        case Py_GE:
            /* a string is equal to itself */
            v = Py_True;
            break;
        case Py_NE:
        case Py_LT:
        case Py_GT:
            v = Py_False;
            break;
        default:
            ...
        }
    }
    else if (...) { ... }
    else { ...}
    Py_INCREF(v);
    return v;
}

即,此快捷方式left == right仅在…之后

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)

所有路径都看起来像这样(手动递归内联,展开和修剪已知分支)

POP()                           # Stack stuff
TOP()                           #
                                #
case PyCmp_IN:                  # Dispatch on operation
                                #
sqm != NULL                     # Dispatch to builtin op
sqm->sq_contains != NULL        #
*sqm->sq_contains               #
                                #
cmp == 0                        # Do comparison in loop
i < Py_SIZE(a)                  #
v == w                          #
op == Py_EQ                     #
++i                             # 
cmp == 0                        #
                                #
res < 0                         # Convert to Python-space
res ? Py_True : Py_False        #
Py_INCREF(v)                    #
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

POP()                           # Stack stuff
TOP()                           #
                                #
default:                        # Dispatch on operation
                                #
Py_LT <= op                     # Checking operation
op <= Py_GE                     #
v == NULL                       #
w == NULL                       #
Py_EnterRecursiveCall(...)      # Recursive check
                                #
v->ob_type != w->ob_type        # More operation checks
f = v->ob_type->tp_richcompare  # Dispatch to builtin op
f != NULL                       #
                                #
!PyUnicode_Check(left)          # ...More checks
!PyUnicode_Check(right))        #
PyUnicode_READY(left) == -1     #
PyUnicode_READY(right) == -1    #
left == right                   # Finally, doing comparison
case Py_EQ:                     # Immediately short circuit
Py_INCREF(v);                   #
                                #
res != Py_NotImplemented        #
                                #
Py_LeaveRecursiveCall()         # Recursive check
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

现在,PyUnicode_CheckPyUnicode_READY相当便宜,因为他们只检查了几个领域,但它应该是显而易见的是,上面一个是较小的代码路径,它具有较少的函数调用,只有一个开关语句是只是有点薄。

TL; DR:

都派往if (left_pointer == right_pointer); 不同之处在于他们为达到目标需要做多少工作。in只是做得更少。

As I mentioned to David Wolever, there’s more to this than meets the eye; both methods dispatch to is; you can prove this by doing

min(Timer("x == x", setup="x = 'a' * 1000000").repeat(10, 10000))
#>>> 0.00045456900261342525

min(Timer("x == y", setup="x = 'a' * 1000000; y = 'a' * 1000000").repeat(10, 10000))
#>>> 0.5256857610074803

The first can only be so fast because it checks by identity.

To find out why one would take longer than the other, let’s trace through execution.

They both start in ceval.c, from COMPARE_OP since that is the bytecode involved

TARGET(COMPARE_OP) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *res = cmp_outcome(oparg, left, right);
    Py_DECREF(left);
    Py_DECREF(right);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    PREDICT(POP_JUMP_IF_FALSE);
    PREDICT(POP_JUMP_IF_TRUE);
    DISPATCH();
}

This pops the values from the stack (technically it only pops one)

PyObject *right = POP();
PyObject *left = TOP();

and runs the compare:

PyObject *res = cmp_outcome(oparg, left, right);

cmp_outcome is this:

static PyObject *
cmp_outcome(int op, PyObject *v, PyObject *w)
{
    int res = 0;
    switch (op) {
    case PyCmp_IS: ...
    case PyCmp_IS_NOT: ...
    case PyCmp_IN:
        res = PySequence_Contains(w, v);
        if (res < 0)
            return NULL;
        break;
    case PyCmp_NOT_IN: ...
    case PyCmp_EXC_MATCH: ...
    default:
        return PyObject_RichCompare(v, w, op);
    }
    v = res ? Py_True : Py_False;
    Py_INCREF(v);
    return v;
}

This is where the paths split. The PyCmp_IN branch does

int
PySequence_Contains(PyObject *seq, PyObject *ob)
{
    Py_ssize_t result;
    PySequenceMethods *sqm = seq->ob_type->tp_as_sequence;
    if (sqm != NULL && sqm->sq_contains != NULL)
        return (*sqm->sq_contains)(seq, ob);
    result = _PySequence_IterSearch(seq, ob, PY_ITERSEARCH_CONTAINS);
    return Py_SAFE_DOWNCAST(result, Py_ssize_t, int);
}

Note that a tuple is defined as

static PySequenceMethods tuple_as_sequence = {
    ...
    (objobjproc)tuplecontains,                  /* sq_contains */
};

PyTypeObject PyTuple_Type = {
    ...
    &tuple_as_sequence,                         /* tp_as_sequence */
    ...
};

So the branch

if (sqm != NULL && sqm->sq_contains != NULL)

will be taken and *sqm->sq_contains, which is the function (objobjproc)tuplecontains, will be taken.

This does

static int
tuplecontains(PyTupleObject *a, PyObject *el)
{
    Py_ssize_t i;
    int cmp;

    for (i = 0, cmp = 0 ; cmp == 0 && i < Py_SIZE(a); ++i)
        cmp = PyObject_RichCompareBool(el, PyTuple_GET_ITEM(a, i),
                                           Py_EQ);
    return cmp;
}

…Wait, wasn’t that PyObject_RichCompareBool what the other branch took? Nope, that was PyObject_RichCompare.

That code path was short so it likely just comes down to the speed of these two. Let’s compare.

int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    ...
}

The code path in PyObject_RichCompareBool pretty much immediately terminates. For PyObject_RichCompare, it does

PyObject *
PyObject_RichCompare(PyObject *v, PyObject *w, int op)
{
    PyObject *res;

    assert(Py_LT <= op && op <= Py_GE);
    if (v == NULL || w == NULL) { ... }
    if (Py_EnterRecursiveCall(" in comparison"))
        return NULL;
    res = do_richcompare(v, w, op);
    Py_LeaveRecursiveCall();
    return res;
}

The Py_EnterRecursiveCall/Py_LeaveRecursiveCall combo are not taken in the previous path, but these are relatively quick macros that’ll short-circuit after incrementing and decrementing some globals.

do_richcompare does:

static PyObject *
do_richcompare(PyObject *v, PyObject *w, int op)
{
    richcmpfunc f;
    PyObject *res;
    int checked_reverse_op = 0;

    if (v->ob_type != w->ob_type && ...) { ... }
    if ((f = v->ob_type->tp_richcompare) != NULL) {
        res = (*f)(v, w, op);
        if (res != Py_NotImplemented)
            return res;
        ...
    }
    ...
}

This does some quick checks to call v->ob_type->tp_richcompare which is

PyTypeObject PyUnicode_Type = {
    ...
    PyUnicode_RichCompare,      /* tp_richcompare */
    ...
};

which does

PyObject *
PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
{
    int result;
    PyObject *v;

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))
        Py_RETURN_NOTIMPLEMENTED;

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)
        return NULL;

    if (left == right) {
        switch (op) {
        case Py_EQ:
        case Py_LE:
        case Py_GE:
            /* a string is equal to itself */
            v = Py_True;
            break;
        case Py_NE:
        case Py_LT:
        case Py_GT:
            v = Py_False;
            break;
        default:
            ...
        }
    }
    else if (...) { ... }
    else { ...}
    Py_INCREF(v);
    return v;
}

Namely, this shortcuts on left == right… but only after doing

    if (!PyUnicode_Check(left) || !PyUnicode_Check(right))

    if (PyUnicode_READY(left) == -1 ||
        PyUnicode_READY(right) == -1)

All in all the paths then look something like this (manually recursively inlining, unrolling and pruning known branches)

POP()                           # Stack stuff
TOP()                           #
                                #
case PyCmp_IN:                  # Dispatch on operation
                                #
sqm != NULL                     # Dispatch to builtin op
sqm->sq_contains != NULL        #
*sqm->sq_contains               #
                                #
cmp == 0                        # Do comparison in loop
i < Py_SIZE(a)                  #
v == w                          #
op == Py_EQ                     #
++i                             # 
cmp == 0                        #
                                #
res < 0                         # Convert to Python-space
res ? Py_True : Py_False        #
Py_INCREF(v)                    #
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

vs

POP()                           # Stack stuff
TOP()                           #
                                #
default:                        # Dispatch on operation
                                #
Py_LT <= op                     # Checking operation
op <= Py_GE                     #
v == NULL                       #
w == NULL                       #
Py_EnterRecursiveCall(...)      # Recursive check
                                #
v->ob_type != w->ob_type        # More operation checks
f = v->ob_type->tp_richcompare  # Dispatch to builtin op
f != NULL                       #
                                #
!PyUnicode_Check(left)          # ...More checks
!PyUnicode_Check(right))        #
PyUnicode_READY(left) == -1     #
PyUnicode_READY(right) == -1    #
left == right                   # Finally, doing comparison
case Py_EQ:                     # Immediately short circuit
Py_INCREF(v);                   #
                                #
res != Py_NotImplemented        #
                                #
Py_LeaveRecursiveCall()         # Recursive check
                                #
Py_DECREF(left)                 # Stack stuff
Py_DECREF(right)                #
SET_TOP(res)                    #
res == NULL                     #
DISPATCH()                      #

Now, PyUnicode_Check and PyUnicode_READY are pretty cheap since they only check a couple of fields, but it should be obvious that the top one is a smaller code path, it has fewer function calls, only one switch statement and is just a bit thinner.

TL;DR:

Both dispatch to if (left_pointer == right_pointer); the difference is just how much work they do to get there. in just does less.


回答 1

这里有三个因素在起作用,共同产生这种令人惊讶的行为。

首先:in操作员使用快捷方式并检查身份(x is y),然后再检查是否等于(x == y):

>>> n = float('nan')
>>> n in (n, )
True
>>> n == n
False
>>> n is n
True

第二:由于Python的字符串interning,两个"x"in "x" in ("x", )是相同的:

>>> "x" is "x"
True

(大警告:这是实现特定的行为!is应该永远不会被用来比较字符串,因为它有时给了惊人的答复;例如"x" * 100 is "x" * 100 ==> False

第三:在详细Veedrac的梦幻般的回答tuple.__contains__x in (y, )大致相当于(y, ).__contains__(x))获取进行标识检查的速度比点str.__eq__(再次,x == y就是大致相当于x.__eq__(y))一样。

您可以看到这一点的证据,因为x in (y, )它比逻辑上等效的要慢得多x == y

In [18]: %timeit 'x' in ('x', )
10000000 loops, best of 3: 65.2 ns per loop

In [19]: %timeit 'x' == 'x'    
10000000 loops, best of 3: 68 ns per loop

In [20]: %timeit 'x' in ('y', ) 
10000000 loops, best of 3: 73.4 ns per loop

In [21]: %timeit 'x' == 'y'    
10000000 loops, best of 3: 56.2 ns per loop

这种x in (y, )情况的速度较慢,因为在is比较失败之后,in运算符将退回到常规的相等性检查(即使用==),因此比较所需的时间与相同,因此==由于创建元组的开销,整个操作的速度变慢了。 ,遍历其成员等。

还请注意,只有在以下a in (b, )情况下更快a is b

In [48]: a = 1             

In [49]: b = 2

In [50]: %timeit a is a or a == a
10000000 loops, best of 3: 95.1 ns per loop

In [51]: %timeit a in (a, )      
10000000 loops, best of 3: 140 ns per loop

In [52]: %timeit a is b or a == b
10000000 loops, best of 3: 177 ns per loop

In [53]: %timeit a in (b, )      
10000000 loops, best of 3: 169 ns per loop

(为什么a in (b, )要比这快a is b or a == b?我猜想虚拟机指令会更少—  a in (b, )只有〜3条指令,其中a is b or a == b会有更多的VM指令)

Veedrac的答案- https://stackoverflow.com/a/28889838/71522 -进入每个过程具体是什么情况更多的细节==in,是非常值得读。

There are three factors at play here which, combined, produce this surprising behavior.

First: the in operator takes a shortcut and checks identity (x is y) before it checks equality (x == y):

>>> n = float('nan')
>>> n in (n, )
True
>>> n == n
False
>>> n is n
True

Second: because of Python’s string interning, both "x"s in "x" in ("x", ) will be identical:

>>> "x" is "x"
True

(big warning: this is implementation-specific behavior! is should never be used to compare strings because it will give surprising answers sometimes; for example "x" * 100 is "x" * 100 ==> False)

Third: as detailed in Veedrac’s fantastic answer, tuple.__contains__ (x in (y, ) is roughly equivalent to (y, ).__contains__(x)) gets to the point of performing the identity check faster than str.__eq__ (again, x == y is roughly equivalent to x.__eq__(y)) does.

You can see evidence for this because x in (y, ) is significantly slower than the logically equivalent, x == y:

In [18]: %timeit 'x' in ('x', )
10000000 loops, best of 3: 65.2 ns per loop

In [19]: %timeit 'x' == 'x'    
10000000 loops, best of 3: 68 ns per loop

In [20]: %timeit 'x' in ('y', ) 
10000000 loops, best of 3: 73.4 ns per loop

In [21]: %timeit 'x' == 'y'    
10000000 loops, best of 3: 56.2 ns per loop

The x in (y, ) case is slower because, after the is comparison fails, the in operator falls back to normal equality checking (i.e., using ==), so the comparison takes about the same amount of time as ==, rendering the entire operation slower because of the overhead of creating the tuple, walking its members, etc.

Note also that a in (b, ) is only faster when a is b:

In [48]: a = 1             

In [49]: b = 2

In [50]: %timeit a is a or a == a
10000000 loops, best of 3: 95.1 ns per loop

In [51]: %timeit a in (a, )      
10000000 loops, best of 3: 140 ns per loop

In [52]: %timeit a is b or a == b
10000000 loops, best of 3: 177 ns per loop

In [53]: %timeit a in (b, )      
10000000 loops, best of 3: 169 ns per loop

(why is a in (b, ) faster than a is b or a == b? My guess would be fewer virtual machine instructions — a in (b, ) is only ~3 instructions, where a is b or a == b will be quite a few more VM instructions)

Veedrac’s answer — https://stackoverflow.com/a/28889838/71522 — goes into much more detail on specifically what happens during each of == and in and is well worth the read.


为什么Python3中没有xrange函数?

问题:为什么Python3中没有xrange函数?

最近,我开始使用Python3,它缺少xrange的好处。

简单的例子:

1) Python2:

from time import time as t
def count():
  st = t()
  [x for x in xrange(10000000) if x%4 == 0]
  et = t()
  print et-st
count()

2) Python3:

from time import time as t

def xrange(x):

    return iter(range(x))

def count():
    st = t()
    [x for x in xrange(10000000) if x%4 == 0]
    et = t()
    print (et-st)
count()

结果分别是:

1) 1.53888392448 2) 3.215819835662842

这是为什么?我的意思是,为什么xrange被删除了?这是学习的好工具。对于初学者来说,就像我自己一样,就像我们都处在某个时刻。为什么要删除它?有人可以指出我正确的PEP,我找不到它。

干杯。

Recently I started using Python3 and it’s lack of xrange hurts.

Simple example:

1) Python2:

from time import time as t
def count():
  st = t()
  [x for x in xrange(10000000) if x%4 == 0]
  et = t()
  print et-st
count()

2) Python3:

from time import time as t

def xrange(x):

    return iter(range(x))

def count():
    st = t()
    [x for x in xrange(10000000) if x%4 == 0]
    et = t()
    print (et-st)
count()

The results are, respectively:

1) 1.53888392448 2) 3.215819835662842

Why is that? I mean, why xrange’s been removed? It’s such a great tool to learn. For the beginners, just like myself, like we all were at some point. Why remove it? Can somebody point me to the proper PEP, I can’t find it.

Cheers.


回答 0

进行一些性能评估,timeit而不是尝试使用手动进行time

首先,Apple 2.7.2 64位:

In [37]: %timeit collections.deque((x for x in xrange(10000000) if x%4 == 0), maxlen=0)
1 loops, best of 3: 1.05 s per loop

现在,python.org 3.3.0 64位:

In [83]: %timeit collections.deque((x for x in range(10000000) if x%4 == 0), maxlen=0)
1 loops, best of 3: 1.32 s per loop

In [84]: %timeit collections.deque((x for x in xrange(10000000) if x%4 == 0), maxlen=0)
1 loops, best of 3: 1.31 s per loop

In [85]: %timeit collections.deque((x for x in iter(range(10000000)) if x%4 == 0), maxlen=0) 
1 loops, best of 3: 1.33 s per loop

显然,3.x range确实比2.x慢一些xrange。OP的xrange功能与此无关。(不足为奇,因为__iter__在循环中发生的任何事情的1000 万次调用中,对插槽的一次性调用不太可能可见,但有人提出来了。)

但这仅慢了30%。OP如何使速度慢2倍?好吧,如果我使用32位Python重复相同的测试,则得到1.58和3.12。因此,我的猜测是,这是3.x针对64位性能进行了优化(以损害32位的方式)的又一案例。

但这真的重要吗?再次使用3.3.0 64位进行检查:

In [86]: %timeit [x for x in range(10000000) if x%4 == 0]
1 loops, best of 3: 3.65 s per loop

因此,构建所需的list时间是整个迭代的两倍以上。

至于“比Python 2.6+消耗更多的资源”,在我的测试中,看起来3.x range的大小与2.x的大小完全相同xrange,即使它的大小是10x的大小,也可以构建不必要的列表问题仍然比范围迭代可能做的任何事情多出10000000x。

那么显式for循环而不是内部的C循环又deque如何呢?

In [87]: def consume(x):
   ....:     for i in x:
   ....:         pass
In [88]: %timeit consume(x for x in range(10000000) if x%4 == 0)
1 loops, best of 3: 1.85 s per loop

因此,在for语句中浪费的时间几乎与迭代的实际工作中所浪费的时间一样多range

如果您担心优化范围对象的迭代,则可能在错误的位置。


同时,xrange无论人们告诉您相同的内容多少次,您都会不断询问为什么要删除它,但是我会再次重复:它没有删除:它被重命名为range,而2.x range是被删除的东西。

这是3.3 range对象是2.x xrange对象(而不是2.x range函数)的直接后代的证明:3.3range2.7xrange的源。您甚至可以查看更改历史记录(我相信,该更改已链接到替换文件中任何位置的字符串“ xrange”的最后一个实例的更改)。

那么,为什么它变慢?

好吧,其中之一是,他们添加了许多新功能。另外,他们已经在整个地方(尤其是在迭代过程中)进行了各种具有较小副作用的更改。尽管有时有时会稍微低估不太重要的案例,但仍有大量工作可以极大地优化各种重要案例。将所有这些加起来,令迭代range速度现在变慢会令我感到惊讶。这是最重要的案例之一,没有人会足够关注。没有人会遇到现实中的用例,这种性能差异是他们代码中的热点。

Some performance measurements, using timeit instead of trying to do it manually with time.

First, Apple 2.7.2 64-bit:

In [37]: %timeit collections.deque((x for x in xrange(10000000) if x%4 == 0), maxlen=0)
1 loops, best of 3: 1.05 s per loop

Now, python.org 3.3.0 64-bit:

In [83]: %timeit collections.deque((x for x in range(10000000) if x%4 == 0), maxlen=0)
1 loops, best of 3: 1.32 s per loop

In [84]: %timeit collections.deque((x for x in xrange(10000000) if x%4 == 0), maxlen=0)
1 loops, best of 3: 1.31 s per loop

In [85]: %timeit collections.deque((x for x in iter(range(10000000)) if x%4 == 0), maxlen=0) 
1 loops, best of 3: 1.33 s per loop

Apparently, 3.x range really is a bit slower than 2.x xrange. And the OP’s xrange function has nothing to do with it. (Not surprising, as a one-time call to the __iter__ slot isn’t likely to be visible among 10000000 calls to whatever happens in the loop, but someone brought it up as a possibility.)

But it’s only 30% slower. How did the OP get 2x as slow? Well, if I repeat the same tests with 32-bit Python, I get 1.58 vs. 3.12. So my guess is that this is yet another of those cases where 3.x has been optimized for 64-bit performance in ways that hurt 32-bit.

But does it really matter? Check this out, with 3.3.0 64-bit again:

In [86]: %timeit [x for x in range(10000000) if x%4 == 0]
1 loops, best of 3: 3.65 s per loop

So, building the list takes more than twice as long than the entire iteration.

And as for “consumes much more resources than Python 2.6+”, from my tests, it looks like a 3.x range is exactly the same size as a 2.x xrange—and, even if it were 10x as big, building the unnecessary list is still about 10000000x more of a problem than anything the range iteration could possibly do.

And what about an explicit for loop instead of the C loop inside deque?

In [87]: def consume(x):
   ....:     for i in x:
   ....:         pass
In [88]: %timeit consume(x for x in range(10000000) if x%4 == 0)
1 loops, best of 3: 1.85 s per loop

So, almost as much time wasted in the for statement as in the actual work of iterating the range.

If you’re worried about optimizing the iteration of a range object, you’re probably looking in the wrong place.


Meanwhile, you keep asking why xrange was removed, no matter how many times people tell you the same thing, but I’ll repeat it again: It was not removed: it was renamed to range, and the 2.x range is what was removed.

Here’s some proof that the 3.3 range object is a direct descendant of the 2.x xrange object (and not of the 2.x range function): the source to 3.3 range and 2.7 xrange. You can even see the change history (linked to, I believe, the change that replaced the last instance of the string “xrange” anywhere in the file).

So, why is it slower?

Well, for one, they’ve added a lot of new features. For another, they’ve done all kinds of changes all over the place (especially inside iteration) that have minor side effects. And there’d been a lot of work to dramatically optimize various important cases, even if it sometimes slightly pessimizes less important cases. Add this all up, and I’m not surprised that iterating a range as fast as possible is now a bit slower. It’s one of those less-important cases that nobody would ever care enough to focus on. No one is likely to ever have a real-life use case where this performance difference is the hotspot in their code.


回答 1

Python3的范围 Python2的xrange。无需在其周围包装迭代器。要获得Python3中的实际列表,您需要使用list(range(...))

如果您想要适用于Python2和Python3的产品,请尝试以下操作

try:
    xrange
except NameError:
    xrange = range

Python3’s range is Python2’s xrange. There’s no need to wrap an iter around it. To get an actual list in Python3, you need to use list(range(...))

If you want something that works with Python2 and Python3, try this

try:
    xrange
except NameError:
    xrange = range

回答 2

Python 3的range类型与Python 2的类型一样xrange。我不确定为什么会看到速度变慢,因为函数直接返回的迭代器xrange正是您返回的迭代器range

我无法在系统上重现速度下降的情况。这是我的测试方式:

Python 2,具有xrange

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import timeit
>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=100)
18.631936646865853

Python 3,range速度稍微快一点:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import timeit
>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)
17.31399508687869

我最近了解到Python 3的range类型还具有其他一些简洁的功能,例如对切片的支持:range(10,100,2)[5:25:5]is range(20, 60, 10)

Python 3’s range type works just like Python 2’s xrange. I’m not sure why you’re seeing a slowdown, since the iterator returned by your xrange function is exactly what you’d get if you iterated over range directly.

I’m not able to reproduce the slowdown on my system. Here’s how I tested:

Python 2, with xrange:

Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import timeit
>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=100)
18.631936646865853

Python 3, with range is a tiny bit faster:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import timeit
>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)
17.31399508687869

I recently learned that Python 3’s range type has some other neat features, such as support for slicing: range(10,100,2)[5:25:5] is range(20, 60, 10)!


回答 3

修复python2代码的一种方法是:

import sys

if sys.version_info >= (3, 0):
    def xrange(*args, **kwargs):
        return iter(range(*args, **kwargs))

One way to fix up your python2 code is:

import sys

if sys.version_info >= (3, 0):
    def xrange(*args, **kwargs):
        return iter(range(*args, **kwargs))

回答 4

Python 2中的xrange是一个生成器,它实现了迭代器,而range只是一个函数。在Python3中,我不知道为什么将其从xrange中删除。

xrange from Python 2 is a generator and implements iterator while range is just a function. In Python3 I don’t know why was dropped off the xrange.


回答 5

comp:〜$ python Python 2.7.6(默认,2015年6月22日,17:58:13)[gcc 4.8.2]在linux2上

>>> import timeit
>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=100)

5.656799077987671

>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=100)

5.579368829727173

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

21.54827117919922

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

22.014557123184204

当timeit number = 1参数时:

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=1)

0.2245171070098877

>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=1)

0.10750913619995117

comp:〜$ python3 Python 3.4.3(默认,2015年10月14日,20:28:29)[GCC 4.8.4]在Linux上

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

9.113872020003328

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

9.07014398300089

timeit number = 1,2,3,4参数可以快速且线性地工作:

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=1)

0.09329321900440846

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=2)

0.18501482300052885

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=3)

0.2703447980020428

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=4)

0.36209142999723554

因此,看来如果我们测量1个运行循环周期,例如timeit.timeit(“ [x表示x在range(1000000)中,如果x%4]”,number = 1)(正如我们在实际代码中实际使用的那样),python3的运行速度足够快,但在重复循环中,python 2 xrange()在速度上胜过python 3中的range()。

comp:~$ python Python 2.7.6 (default, Jun 22 2015, 17:58:13) [GCC 4.8.2] on linux2

>>> import timeit
>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=100)

5.656799077987671

>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=100)

5.579368829727173

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

21.54827117919922

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

22.014557123184204

With timeit number=1 param:

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=1)

0.2245171070098877

>>> timeit.timeit("[x for x in xrange(1000000) if x%4]",number=1)

0.10750913619995117

comp:~$ python3 Python 3.4.3 (default, Oct 14 2015, 20:28:29) [GCC 4.8.4] on linux

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

9.113872020003328

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=100)

9.07014398300089

With timeit number=1,2,3,4 param works quick and in linear way:

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=1)

0.09329321900440846

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=2)

0.18501482300052885

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=3)

0.2703447980020428

>>> timeit.timeit("[x for x in range(1000000) if x%4]",number=4)

0.36209142999723554

So it seems if we measure 1 running loop cycle like timeit.timeit(“[x for x in range(1000000) if x%4]”,number=1) (as we actually use in real code) python3 works quick enough, but in repeated loops python 2 xrange() wins in speed against range() from python 3.


TypeError:method()接受1个位置参数,但给出了2个

问题:TypeError:method()接受1个位置参数,但给出了2个

如果我有课…

class MyClass:

    def method(arg):
        print(arg)

…我用来创建对象的…

my_object = MyClass()

我这样称呼method("foo")

>>> my_object.method("foo")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: method() takes exactly 1 positional argument (2 given)

…为什么当我只给出一个参数时,Python告诉我给它两个参数?

If I have a class…

class MyClass:

    def method(arg):
        print(arg)

…which I use to create an object…

my_object = MyClass()

…on which I call method("foo") like so…

>>> my_object.method("foo")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: method() takes exactly 1 positional argument (2 given)

…why does Python tell me I gave it two arguments, when I only gave one?


回答 0

在Python中,这是:

my_object.method("foo")

…是语法糖,口译员在后台将其翻译为:

MyClass.method(my_object, "foo")

您可以看到,它确实有两个参数-从调用者的角度来看,只是第一个参数是隐式的。

这是因为大多数方法会对被调用的对象进行某些处理,因此需要某种方法在该方法内部引用该对象。按照惯例,第一个参数self在方法定义内调用:

class MyNewClass:

    def method(self, arg):
        print(self)
        print(arg)

如果您呼叫method("foo")的实例MyNewClass,它会按预期运作:

>>> my_new_object = MyNewClass()
>>> my_new_object.method("foo")
<__main__.MyNewClass object at 0x29045d0>
foo

有时(但不经常),您实际上不在乎您的方法所绑定的对象,在这种情况下,您可以使用内置函数来修饰该方法,staticmethod()例如:

class MyOtherClass:

    @staticmethod
    def method(arg):
        print(arg)

…在这种情况下,您无需self在方法定义中添加参数,它仍然有效:

>>> my_other_object = MyOtherClass()
>>> my_other_object.method("foo")
foo

In Python, this:

my_object.method("foo")

…is syntactic sugar, which the interpreter translates behind the scenes into:

MyClass.method(my_object, "foo")

…which, as you can see, does indeed have two arguments – it’s just that the first one is implicit, from the point of view of the caller.

This is because most methods do some work with the object they’re called on, so there needs to be some way for that object to be referred to inside the method. By convention, this first argument is called self inside the method definition:

class MyNewClass:

    def method(self, arg):
        print(self)
        print(arg)

If you call method("foo") on an instance of MyNewClass, it works as expected:

>>> my_new_object = MyNewClass()
>>> my_new_object.method("foo")
<__main__.MyNewClass object at 0x29045d0>
foo

Occasionally (but not often), you really don’t care about the object that your method is bound to, and in that circumstance, you can decorate the method with the builtin staticmethod() function to say so:

class MyOtherClass:

    @staticmethod
    def method(arg):
        print(arg)

…in which case you don’t need to add a self argument to the method definition, and it still works:

>>> my_other_object = MyOtherClass()
>>> my_other_object.method("foo")
foo

回答 1

遇到此类错误时要考虑的其他事项:

我遇到了这个错误消息,发现这篇文章很有帮助。事实证明,我重写了__init__()存在对象继承的位置。

继承的示例相当长,因此我将跳到一个不使用继承的更简单的示例:

class MyBadInitClass:
    def ___init__(self, name):
        self.name = name

    def name_foo(self, arg):
        print(self)
        print(arg)
        print("My name is", self.name)


class MyNewClass:
    def new_foo(self, arg):
        print(self)
        print(arg)


my_new_object = MyNewClass()
my_new_object.new_foo("NewFoo")
my_bad_init_object = MyBadInitClass(name="Test Name")
my_bad_init_object.name_foo("name foo")

结果是:

<__main__.MyNewClass object at 0x033C48D0>
NewFoo
Traceback (most recent call last):
  File "C:/Users/Orange/PycharmProjects/Chapter9/bad_init_example.py", line 41, in <module>
    my_bad_init_object = MyBadInitClass(name="Test Name")
TypeError: object() takes no parameters

PyCharm没有抓住这种错别字。Notepad ++也没有(其他编辑器/ IDE也可能)。

当然,这是一个“不带任何参数”的TypeError,与期望得到一个的“得到两个”没有太大区别,就Python中的对象初始化而言。

解决主题:在语法上正确的情况下将使用重载初始化程序,但在语法上正确的情况下将被使用,而是使用内置初始化程序。该对象不会期望/处理此问题,并且会引发错误。

如果出现sytax错误:修复很简单,只需编辑自定义init语句即可:

def __init__(self, name):
    self.name = name

Something else to consider when this type of error is encountered:

I was running into this error message and found this post helpful. Turns out in my case I had overridden an __init__() where there was object inheritance.

The inherited example is rather long, so I’ll skip to a more simple example that doesn’t use inheritance:

class MyBadInitClass:
    def ___init__(self, name):
        self.name = name

    def name_foo(self, arg):
        print(self)
        print(arg)
        print("My name is", self.name)


class MyNewClass:
    def new_foo(self, arg):
        print(self)
        print(arg)


my_new_object = MyNewClass()
my_new_object.new_foo("NewFoo")
my_bad_init_object = MyBadInitClass(name="Test Name")
my_bad_init_object.name_foo("name foo")

Result is:

<__main__.MyNewClass object at 0x033C48D0>
NewFoo
Traceback (most recent call last):
  File "C:/Users/Orange/PycharmProjects/Chapter9/bad_init_example.py", line 41, in <module>
    my_bad_init_object = MyBadInitClass(name="Test Name")
TypeError: object() takes no parameters

PyCharm didn’t catch this typo. Nor did Notepad++ (other editors/IDE’s might).

Granted, this is a “takes no parameters” TypeError, it isn’t much different than “got two” when expecting one, in terms of object initialization in Python.

Addressing the topic: An overloading initializer will be used if syntactically correct, but if not it will be ignored and the built-in used instead. The object won’t expect/handle this and the error is thrown.

In the case of the sytax error: The fix is simple, just edit the custom init statement:

def __init__(self, name):
    self.name = name

回答 2

简单来说。

在Python中,您应该将self参数作为第一个参数添加到类中所有已定义的方法中:

class MyClass:
  def method(self, arg):
    print(arg)

然后,您可以根据自己的直觉使用您的方法:

>>> my_object = MyClass()
>>> my_object.method("foo")
foo

这应该可以解决您的问题:)

为了更好地理解,您还可以阅读以下问题的答案:自我的目的是什么?

In simple words.

In Python you should add self argument as the first argument to all defined methods in classes:

class MyClass:
  def method(self, arg):
    print(arg)

Then you can use your method according to your intuition:

>>> my_object = MyClass()
>>> my_object.method("foo")
foo

This should solve your problem :)

For a better understanding, you can also read the answers to this question: What is the purpose of self?


回答 3

Python的新手**,以错误的方式使用Python的功能时遇到了这个问题。尝试从某处调用此定义:

def create_properties_frame(self, parent, **kwargs):

使用没有双星的通话会导致问题:

self.create_properties_frame(frame, kw_gsp)

TypeError:create_properties_frame()接受2个位置参数,但给出了3个

解决方案是**在参数中添加:

self.create_properties_frame(frame, **kw_gsp)

Newcomer to Python, I had this issue when I was using the Python’s ** feature in a wrong way. Trying to call this definition from somewhere:

def create_properties_frame(self, parent, **kwargs):

using a call without a double star was causing the problem:

self.create_properties_frame(frame, kw_gsp)

TypeError: create_properties_frame() takes 2 positional arguments but 3 were given

The solution is to add ** to the argument:

self.create_properties_frame(frame, **kw_gsp)

回答 4

当您未指定参数No __init__()或任何其他寻找方法时,就会发生这种情况。

例如:

class Dog:
    def __init__(self):
        print("IN INIT METHOD")

    def __unicode__(self,):
        print("IN UNICODE METHOD")

    def __str__(self):
        print("IN STR METHOD")

obj=Dog("JIMMY",1,2,3,"WOOF")

当您运行上述程序时,它给您这样的错误:

TypeError:__init __()接受1个位置参数,但给出了6个

我们如何摆脱这件事?

只需传递参数,__init__()寻找什么方法

class Dog:
    def __init__(self, dogname, dob_d, dob_m, dob_y, dogSpeakText):
        self.name_of_dog = dogname
        self.date_of_birth = dob_d
        self.month_of_birth = dob_m
        self.year_of_birth = dob_y
        self.sound_it_make = dogSpeakText

    def __unicode__(self, ):
        print("IN UNICODE METHOD")

    def __str__(self):
        print("IN STR METHOD")


obj = Dog("JIMMY", 1, 2, 3, "WOOF")
print(id(obj))

It occurs when you don’t specify the no of parameters the __init__() or any other method looking for.

For example:

class Dog:
    def __init__(self):
        print("IN INIT METHOD")

    def __unicode__(self,):
        print("IN UNICODE METHOD")

    def __str__(self):
        print("IN STR METHOD")

obj=Dog("JIMMY",1,2,3,"WOOF")

When you run the above programme, it gives you an error like that:

TypeError: __init__() takes 1 positional argument but 6 were given

How we can get rid of this thing?

Just pass the parameters, what __init__() method looking for

class Dog:
    def __init__(self, dogname, dob_d, dob_m, dob_y, dogSpeakText):
        self.name_of_dog = dogname
        self.date_of_birth = dob_d
        self.month_of_birth = dob_m
        self.year_of_birth = dob_y
        self.sound_it_make = dogSpeakText

    def __unicode__(self, ):
        print("IN UNICODE METHOD")

    def __str__(self):
        print("IN STR METHOD")


obj = Dog("JIMMY", 1, 2, 3, "WOOF")
print(id(obj))

回答 5

您实际上应该创建一个类:

class accum:
    def __init__(self):
        self.acc = 0
    def accumulator(self, var2add, end):
        if not end:
            self.acc+=var2add
    return self.acc

You should actually create a class:

class accum:
    def __init__(self):
        self.acc = 0
    def accumulator(self, var2add, end):
        if not end:
            self.acc+=var2add
    return self.acc

回答 6

就我而言,我忘记添加 ()

我正在这样调用方法

obj = className.myMethod

但是应该是这样

obj = className.myMethod()

In my case, I forgot to add the ()

I was calling the method like this

obj = className.myMethod

But it should be is like this

obj = className.myMethod()

回答 7

cls参数传递到@classmethod以解决此问题。

@classmethod
def test(cls):
    return ''

Pass cls parameter into @classmethod to resolve this problem.

@classmethod
def test(cls):
    return ''

如何将输入读取为数字?

问题:如何将输入读取为数字?

为什么在下面的代码中使用xy字符串而不是整数?

(注意:在Python 2.x中使用raw_input()。在Python 3.x中使用input()。在Python 3.x中raw_input()被重命名为input()

play = True

while play:

    x = input("Enter a number: ")
    y = input("Enter a number: ")

    print(x + y)
    print(x - y)
    print(x * y)
    print(x / y)
    print(x % y)

    if input("Play again? ") == "no":
        play = False

Why are x and y strings instead of ints in the below code?

(Note: in Python 2.x use raw_input(). In Python 3.x use input(). raw_input() was renamed to input() in Python 3.x)

play = True

while play:

    x = input("Enter a number: ")
    y = input("Enter a number: ")

    print(x + y)
    print(x - y)
    print(x * y)
    print(x / y)
    print(x % y)

    if input("Play again? ") == "no":
        play = False

回答 0

TLDR

  • Python 3不会评估input函数接收到的数据,但是Python 2的input函数会评估(阅读下一节以了解含义)。
  • inputraw_input函数相当于Python 2与Python 3 。

Python 2.x

有两个函数用于获取用户输入,分别称为inputraw_input。它们之间的区别是,raw_input不评估数据并以字符串形式按原样返回。但是,input将对您输入的内容进行评估,评估结果将返回。例如,

>>> import sys
>>> sys.version
'2.7.6 (default, Mar 22 2014, 22:59:56) \n[GCC 4.8.2]'
>>> data = input("Enter a number: ")
Enter a number: 5 + 17
>>> data, type(data)
(22, <type 'int'>)

5 + 17评估数据,结果为22。当它对表达式求值时5 + 17,它将检测到您要添加两个数字,因此结果也将是同一int类型。因此,类型转换是免费完成的,并22作为的结果返回input并存储在data变量中。您可以将其input视为raw_input带有eval调用的组合。

>>> data = eval(raw_input("Enter a number: "))
Enter a number: 5 + 17
>>> data, type(data)
(22, <type 'int'>)

注意:input在Python 2.x 中使用时应小心。我在这个答案中解释了为什么在使用它时要小心。

但是,raw_input不评估输入并以字符串形式原样返回。

>>> import sys
>>> sys.version
'2.7.6 (default, Mar 22 2014, 22:59:56) \n[GCC 4.8.2]'
>>> data = raw_input("Enter a number: ")
Enter a number: 5 + 17
>>> data, type(data)
('5 + 17', <type 'str'>)

Python 3.x

Python 3.x input和Python 2.x raw_input类似,raw_input在Python 3.x中不可用。

>>> import sys
>>> sys.version
'3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'
>>> data = input("Enter a number: ")
Enter a number: 5 + 17
>>> data, type(data)
('5 + 17', <class 'str'>)

要回答您的问题,由于Python 3.x不会评估和转换数据类型,因此必须使用显式转换为ints int,如下所示

x = int(input("Enter a number: "))
y = int(input("Enter a number: "))

您可以接受任意基数的数字,并使用int函数将其直接转换为10基数

>>> data = int(input("Enter a number: "), 8)
Enter a number: 777
>>> data
511
>>> data = int(input("Enter a number: "), 16)
Enter a number: FFFF
>>> data
65535
>>> data = int(input("Enter a number: "), 2)
Enter a number: 10101010101
>>> data
1365

第二个参数告诉输入的数字的基础是什么,然后在内部对其进行理解和转换。如果输入的数据有误,将抛出ValueError

>>> data = int(input("Enter a number: "), 2)
Enter a number: 1234
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: invalid literal for int() with base 2: '1234'

对于可以包含小数部分的值,类型应为float而不是int

x = float(input("Enter a number:"))

除此之外,您的程序可以像这样进行一些更改

while True:
    ...
    ...
    if input("Play again? ") == "no":
        break

您可以play使用break和摆脱变量while True

TLDR

  • Python 3 doesn’t evaluate the data received with input function, but Python 2’s input function does (read the next section to understand the implication).
  • Python 2’s equivalent of Python 3’s input is the raw_input function.

Python 2.x

There were two functions to get user input, called input and raw_input. The difference between them is, raw_input doesn’t evaluate the data and returns as it is, in string form. But, input will evaluate whatever you entered and the result of evaluation will be returned. For example,

>>> import sys
>>> sys.version
'2.7.6 (default, Mar 22 2014, 22:59:56) \n[GCC 4.8.2]'
>>> data = input("Enter a number: ")
Enter a number: 5 + 17
>>> data, type(data)
(22, <type 'int'>)

The data 5 + 17 is evaluated and the result is 22. When it evaluates the expression 5 + 17, it detects that you are adding two numbers and so the result will also be of the same int type. So, the type conversion is done for free and 22 is returned as the result of input and stored in data variable. You can think of input as the raw_input composed with an eval call.

>>> data = eval(raw_input("Enter a number: "))
Enter a number: 5 + 17
>>> data, type(data)
(22, <type 'int'>)

Note: you should be careful when you are using input in Python 2.x. I explained why one should be careful when using it, in this answer.

But, raw_input doesn’t evaluate the input and returns as it is, as a string.

>>> import sys
>>> sys.version
'2.7.6 (default, Mar 22 2014, 22:59:56) \n[GCC 4.8.2]'
>>> data = raw_input("Enter a number: ")
Enter a number: 5 + 17
>>> data, type(data)
('5 + 17', <type 'str'>)

Python 3.x

Python 3.x’s input and Python 2.x’s raw_input are similar and raw_input is not available in Python 3.x.

>>> import sys
>>> sys.version
'3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'
>>> data = input("Enter a number: ")
Enter a number: 5 + 17
>>> data, type(data)
('5 + 17', <class 'str'>)

Solution

To answer your question, since Python 3.x doesn’t evaluate and convert the data type, you have to explicitly convert to ints, with int, like this

x = int(input("Enter a number: "))
y = int(input("Enter a number: "))

You can accept numbers of any base and convert them directly to base-10 with the int function, like this

>>> data = int(input("Enter a number: "), 8)
Enter a number: 777
>>> data
511
>>> data = int(input("Enter a number: "), 16)
Enter a number: FFFF
>>> data
65535
>>> data = int(input("Enter a number: "), 2)
Enter a number: 10101010101
>>> data
1365

The second parameter tells what is the base of the numbers entered and then internally it understands and converts it. If the entered data is wrong it will throw a ValueError.

>>> data = int(input("Enter a number: "), 2)
Enter a number: 1234
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: invalid literal for int() with base 2: '1234'

For values that can have a fractional component, the type would be float rather than int:

x = float(input("Enter a number:"))

Apart from that, your program can be changed a little bit, like this

while True:
    ...
    ...
    if input("Play again? ") == "no":
        break

You can get rid of the play variable by using break and while True.


回答 1

在Python 3.x中,raw_input已重命名为,inputinput删除了Python2.x 。

这意味着,就像Python 3.x中的一样raw_inputinput总是返回一个字符串对象。

要解决此问题,您需要通过将它们输入以下内容来将这些输入明确地变成整数int

x = int(input("Enter a number: "))
y = int(input("Enter a number: "))

In Python 3.x, raw_input was renamed to input and the Python 2.x input was removed.

This means that, just like raw_input, input in Python 3.x always returns a string object.

To fix the problem, you need to explicitly make those inputs into integers by putting them in int:

x = int(input("Enter a number: "))
y = int(input("Enter a number: "))

回答 2

对于单行中的多个整数,map可能会更好。

arr = map(int, raw_input().split())

如果数字已知(例如2个整数),则可以使用

num1, num2 = map(int, raw_input().split())

For multiple integer in a single line, map might be better.

arr = map(int, raw_input().split())

If the number is already known, (like 2 integers), you can use

num1, num2 = map(int, raw_input().split())

回答 3

input()(Python 3)和raw_input()(Python 2)始终返回字符串。使用显式将结果转换为整数int()

x = int(input("Enter a number: "))
y = int(input("Enter a number: "))

input() (Python 3) and raw_input() (Python 2) always return strings. Convert the result to integer explicitly with int().

x = int(input("Enter a number: "))
y = int(input("Enter a number: "))

回答 4

多个问题需要在单行上输入多个整数。最好的方法是一行输入整个数字字符串,然后将它们拆分为整数。这是Python 3版本:

a = []
p = input()
p = p.split()      
for i in p:
    a.append(int(i))

也可以使用列表理解

p = input().split("whatever the seperator is")

并将所有输入从字符串转换为整数,我们执行以下操作

x = [int(i) for i in p]
print(x, end=' ')

应以直线打印列表元素。

Multiple questions require input for several integers on single line. The best way is to input the whole string of numbers one one line and then split them to integers. Here is a Python 3 version:

a = []
p = input()
p = p.split()      
for i in p:
    a.append(int(i))

Also a list comprehension can be used

p = input().split("whatever the seperator is")

And to convert all the inputs from string to int we do the following

x = [int(i) for i in p]
print(x, end=' ')

shall print the list elements in a straight line.


回答 5

转换为整数:

my_number = int(input("enter the number"))

对于浮点数类似:

my_decimalnumber = float(input("enter the number"))

Convert to integers:

my_number = int(input("enter the number"))

Similarly for floating point numbers:

my_decimalnumber = float(input("enter the number"))

回答 6

n=int(input())
for i in range(n):
    n=input()
    n=int(n)
    arr1=list(map(int,input().split()))

for循环应运行’n’次。第二个“ n”是数组的长度。最后一条语句将整数映射到列表,并以空格分隔的形式接受输入。您还可以在for循环的末尾返回数组。

n=int(input())
for i in range(n):
    n=input()
    n=int(n)
    arr1=list(map(int,input().split()))

the for loop shall run ‘n’ number of times . the second ‘n’ is the length of the array. the last statement maps the integers to a list and takes input in space separated form . you can also return the array at the end of for loop.


回答 7

我在解决CodeChef上的问题时遇到了输入整数的问题,该问题应从一行读取两个以空格分隔的整数。

虽然int(input())对于单个整数就足够了,但是我没有找到直接输入两个整数的方法。我尝试了这个:

num = input()
num1 = 0
num2 = 0

for i in range(len(num)):
    if num[i] == ' ':
        break

num1 = int(num[:i])
num2 = int(num[i+1:])

现在,我将num1和num2用作整数。希望这可以帮助。

I encountered a problem of taking integer input while solving a problem on CodeChef, where two integers – separated by space – should be read from one line.

While int(input()) is sufficient for a single integer, I did not find a direct way to input two integers. I tried this:

num = input()
num1 = 0
num2 = 0

for i in range(len(num)):
    if num[i] == ' ':
        break

num1 = int(num[:i])
num2 = int(num[i+1:])

Now I use num1 and num2 as integers. Hope this helps.


回答 8

def dbz():
    try:
        r = raw_input("Enter number:")
        if r.isdigit():
            i = int(raw_input("Enter divident:"))
            d = int(r)/i
            print "O/p is -:",d
        else:
            print "Not a number"
    except Exception ,e:
        print "Program halted incorrect data entered",type(e)
dbz()

Or 

num = input("Enter Number:")#"input" will accept only numbers
def dbz():
    try:
        r = raw_input("Enter number:")
        if r.isdigit():
            i = int(raw_input("Enter divident:"))
            d = int(r)/i
            print "O/p is -:",d
        else:
            print "Not a number"
    except Exception ,e:
        print "Program halted incorrect data entered",type(e)
dbz()

Or 

num = input("Enter Number:")#"input" will accept only numbers

回答 9

尽管在你的榜样,int(input(...))做的伎俩在任何情况下,python-futurebuiltins.input是值得考虑的,因为这可以确保你的代码同时适用于Python 2和3 ,并禁用Python2的违约行为,input努力成为‘聪明的’关于输入数据类型(builtins.input基本上只是的行为类似于raw_input)。

While in your example, int(input(...)) does the trick in any case, python-future‘s builtins.input is worth consideration since that makes sure your code works for both Python 2 and 3 and disables Python2’s default behaviour of input trying to be “clever” about the input data type (builtins.input basically just behaves like raw_input).


如何在Python 3中使用过滤,映射和归约

问题:如何在Python 3中使用过滤,映射和归约

filter,,map并且reduce可以在Python 2中完美运行。这是一个示例:

>>> def f(x):
        return x % 2 != 0 and x % 3 != 0
>>> filter(f, range(2, 25))
[5, 7, 11, 13, 17, 19, 23]

>>> def cube(x):
        return x*x*x
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]

>>> def add(x,y):
        return x+y
>>> reduce(add, range(1, 11))
55

但是在Python 3中,我收到以下输出:

>>> filter(f, range(2, 25))
<filter object at 0x0000000002C14908>

>>> map(cube, range(1, 11))
<map object at 0x0000000002C82B70>

>>> reduce(add, range(1, 11))
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    reduce(add, range(1, 11))
NameError: name 'reduce' is not defined

如果有人可以向我解释为什么,我将不胜感激。

代码的屏幕截图,用于进一步说明:

Python 2和3的IDLE会话并排

filter, map, and reduce work perfectly in Python 2. Here is an example:

>>> def f(x):
        return x % 2 != 0 and x % 3 != 0
>>> filter(f, range(2, 25))
[5, 7, 11, 13, 17, 19, 23]

>>> def cube(x):
        return x*x*x
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]

>>> def add(x,y):
        return x+y
>>> reduce(add, range(1, 11))
55

But in Python 3, I receive the following outputs:

>>> filter(f, range(2, 25))
<filter object at 0x0000000002C14908>

>>> map(cube, range(1, 11))
<map object at 0x0000000002C82B70>

>>> reduce(add, range(1, 11))
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    reduce(add, range(1, 11))
NameError: name 'reduce' is not defined

I would appreciate if someone could explain to me why this is.

Screenshot of code for further clarity:

IDLE sessions of Python 2 and 3 side-by-side


回答 0

您可以阅读Python 3.0的新增功能中的更改。从2.x升级到3.x时,应该仔细阅读它,因为已经做了很多更改。

此处的完整答案是文档中的引号。

视图和迭代器而不是列表

一些著名的API不再返回列表:

  • […]
  • map()filter()返回迭代器。如果您确实需要列表,则可以使用快速解决方案,例如list(map(...)),但是更好的解决方案通常是使用列表理解(特别是当原始代码使用lambda时),或者重写代码以使其根本不需要列表。map()该函数的副作用特别棘手。正确的转换是使用常规for循环(因为创建列表将很浪费)。
  • […]

内建

  • […]
  • 已删除reduce()functools.reduce()如果确实需要,请使用;但是,在99%的时间里,显式for循环更易于阅读。
  • […]

You can read about the changes in What’s New In Python 3.0. You should read it thoroughly when you move from 2.x to 3.x since a lot has been changed.

The whole answer here are quotes from the documentation.

Views And Iterators Instead Of Lists

Some well-known APIs no longer return lists:

  • […]
  • map() and filter() return iterators. If you really need a list, a quick fix is e.g. list(map(...)), but a better fix is often to use a list comprehension (especially when the original code uses lambda), or rewriting the code so it doesn’t need a list at all. Particularly tricky is map() invoked for the side effects of the function; the correct transformation is to use a regular for loop (since creating a list would just be wasteful).
  • […]

Builtins

  • […]
  • Removed reduce(). Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.
  • […]

回答 1

的功能mapfilter被有意改为返回迭代器,并减少从被除去的内置和放置在functools.reduce

因此,对于filtermap,您可以将它们包装起来以list()像以前一样查看结果。

>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> list(filter(f, range(2, 25)))
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x): return x*x*x
...
>>> list(map(cube, range(1, 11)))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>> import functools
>>> def add(x,y): return x+y
...
>>> functools.reduce(add, range(1, 11))
55
>>>

现在的建议是,用生成器表达式或列表推导替换map和filter的用法。例:

>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> [i for i in range(2, 25) if f(i)]
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x): return x*x*x
...
>>> [cube(i) for i in range(1, 11)]
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>>

他们说for循环在99%的时间里比减少容易阅读,但是我还是坚持functools.reduce

编辑:99%的数字直接从Guido van Rossum编写的“ Python 3.0的新功能”页面中提取。

The functionality of map and filter was intentionally changed to return iterators, and reduce was removed from being a built-in and placed in functools.reduce.

So, for filter and map, you can wrap them with list() to see the results like you did before.

>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> list(filter(f, range(2, 25)))
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x): return x*x*x
...
>>> list(map(cube, range(1, 11)))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>> import functools
>>> def add(x,y): return x+y
...
>>> functools.reduce(add, range(1, 11))
55
>>>

The recommendation now is that you replace your usage of map and filter with generators expressions or list comprehensions. Example:

>>> def f(x): return x % 2 != 0 and x % 3 != 0
...
>>> [i for i in range(2, 25) if f(i)]
[5, 7, 11, 13, 17, 19, 23]
>>> def cube(x): return x*x*x
...
>>> [cube(i) for i in range(1, 11)]
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
>>>

They say that for loops are 99 percent of the time easier to read than reduce, but I’d just stick with functools.reduce.

Edit: The 99 percent figure is pulled directly from the What’s New In Python 3.0 page authored by Guido van Rossum.


回答 2

作为其他答案的补充,对于上下文管理器来说,这听起来像是一个很好的用例,它将重新将这些函数的名称映射为返回列表并引入reduce全局命名空间的函数。

快速实现可能如下所示:

from contextlib import contextmanager    

@contextmanager
def noiters(*funcs):
    if not funcs: 
        funcs = [map, filter, zip] # etc
    from functools import reduce
    globals()[reduce.__name__] = reduce
    for func in funcs:
        globals()[func.__name__] = lambda *ar, func = func, **kwar: list(func(*ar, **kwar))
    try:
        yield
    finally:
        del globals()[reduce.__name__]
        for func in funcs: globals()[func.__name__] = func

用法如下所示:

with noiters(map):
    from operator import add
    print(reduce(add, range(1, 20)))
    print(map(int, ['1', '2']))

哪些打印:

190
[1, 2]

只是我的2美分:-)

As an addendum to the other answers, this sounds like a fine use-case for a context manager that will re-map the names of these functions to ones which return a list and introduce reduce in the global namespace.

A quick implementation might look like this:

from contextlib import contextmanager    

@contextmanager
def noiters(*funcs):
    if not funcs: 
        funcs = [map, filter, zip] # etc
    from functools import reduce
    globals()[reduce.__name__] = reduce
    for func in funcs:
        globals()[func.__name__] = lambda *ar, func = func, **kwar: list(func(*ar, **kwar))
    try:
        yield
    finally:
        del globals()[reduce.__name__]
        for func in funcs: globals()[func.__name__] = func

With a usage that looks like this:

with noiters(map):
    from operator import add
    print(reduce(add, range(1, 20)))
    print(map(int, ['1', '2']))

Which prints:

190
[1, 2]

Just my 2 cents :-)


回答 3

由于该reduce方法已从Python3的内置函数中删除,因此请不要忘记functools在您的代码中导入。请查看下面的代码段。

import functools
my_list = [10,15,20,25,35]
sum_numbers = functools.reduce(lambda x ,y : x+y , my_list)
print(sum_numbers)

Since the reduce method has been removed from the built in function from Python3, don’t forget to import the functools in your code. Please look at the code snippet below.

import functools
my_list = [10,15,20,25,35]
sum_numbers = functools.reduce(lambda x ,y : x+y , my_list)
print(sum_numbers)

回答 4

以下是Filter,map和reduce函数的示例。

数字= [10,11,12,22,34,43,54,34,67,87,88,98,99,87,44,66]

//过滤

奇数=列表(filter(lambda x:x%2!= 0,数字))

打印(奇数)

//地图

multipleOf2 = list(map(lambda x:x * 2,数字))

打印(multiplyOf2)

//降低

由于不常用reduce函数,因此已从Python 3的内置函数中删除了它。functools模块中仍提供了reduce函数,因此您可以执行以下操作:

从functools进口减少

sumOfNumbers = reduce(lambda x,y:x + y,数字)

打印(sumOfNumbers)

Here are the examples of Filter, map and reduce functions.

numbers = [10,11,12,22,34,43,54,34,67,87,88,98,99,87,44,66]

//Filter

oddNumbers = list(filter(lambda x: x%2 != 0, numbers))

print(oddNumbers)

//Map

multiplyOf2 = list(map(lambda x: x*2, numbers))

print(multiplyOf2)

//Reduce

The reduce function, since it is not commonly used, was removed from the built-in functions in Python 3. It is still available in the functools module, so you can do:

from functools import reduce

sumOfNumbers = reduce(lambda x,y: x+y, numbers)

print(sumOfNumbers)


回答 5

map,filter和reduce的优点之一是当您将它们“链接”在一起以进行复杂的操作时,它们变得清晰易读。但是,内置语法不清晰,全都是“向后的”。因此,我建议使用该PyFunctional软件包(https://pypi.org/project/PyFunctional/)。 这是两者的比较:

flight_destinations_dict = {'NY': {'London', 'Rome'}, 'Berlin': {'NY'}}

Py功能版本

非常清晰的语法。你可以说:

“我有一个飞行目的地序列。如果城市位于dict值中,我想从中获得dict键。最后,过滤掉我在流程中创建的空列表。”

from functional import seq  # PyFunctional package to allow easier syntax

def find_return_flights_PYFUNCTIONAL_SYNTAX(city, flight_destinations_dict):
    return seq(flight_destinations_dict.items()) \
        .map(lambda x: x[0] if city in x[1] else []) \
        .filter(lambda x: x != []) \

默认Python版本

都是倒退。您需要说:

“好的,所以有一个列表。我想从中过滤出空列表。为什么?因为如果城市位于dict值中,我首先得到了dict键。哦,我要执行的列表是flight_destinations_dict。 ”

def find_return_flights_DEFAULT_SYNTAX(city, flight_destinations_dict):
    return list(
        filter(lambda x: x != [],
               map(lambda x: x[0] if city in x[1] else [], flight_destinations_dict.items())
               )
    )

One of the advantages of map, filter and reduce is how legible they become when you “chain” them together to do something complex. However, the built-in syntax isn’t legible and is all “backwards”. So, I suggest using the PyFunctional package (https://pypi.org/project/PyFunctional/). Here’s a comparison of the two:

flight_destinations_dict = {'NY': {'London', 'Rome'}, 'Berlin': {'NY'}}

PyFunctional version

Very legible syntax. You can say:

“I have a sequence of flight destinations. Out of which I want to get the dict key if city is in the dict values. Finally, filter out the empty lists I created in the process.”

from functional import seq  # PyFunctional package to allow easier syntax

def find_return_flights_PYFUNCTIONAL_SYNTAX(city, flight_destinations_dict):
    return seq(flight_destinations_dict.items()) \
        .map(lambda x: x[0] if city in x[1] else []) \
        .filter(lambda x: x != []) \

Default Python version

It’s all backwards. You need to say:

“OK, so, there’s a list. I want to filter empty lists out of it. Why? Because I first got the dict key if the city was in the dict values. Oh, the list I’m doing this to is flight_destinations_dict.”

def find_return_flights_DEFAULT_SYNTAX(city, flight_destinations_dict):
    return list(
        filter(lambda x: x != [],
               map(lambda x: x[0] if city in x[1] else [], flight_destinations_dict.items())
               )
    )

在Python中打印多个参数

问题:在Python中打印多个参数

这只是我的代码的一部分:

print("Total score for %s is %s  ", name, score)

但我希望它打印出来:

“(姓名)的总分是(分数)”

其中name是列表中的变量,score是整数。如果有帮助的话,这就是Python 3.3。

This is just a snippet of my code:

print("Total score for %s is %s  ", name, score)

But I want it to print out:

“Total score for (name) is (score)”

where name is a variable in a list and score is an integer. This is Python 3.3 if that helps at all.


回答 0

有很多方法可以做到这一点。要使用%-formatting 修复当前代码,您需要传入一个元组:

  1. 将其作为元组传递:

    print("Total score for %s is %s" % (name, score))

具有单个元素的元组看起来像('this',)

这是其他一些常见的实现方法:

  1. 将其作为字典传递:

    print("Total score for %(n)s is %(s)s" % {'n': name, 's': score})

还有一种新型的字符串格式,可能更容易阅读:

  1. 使用新型的字符串格式:

    print("Total score for {} is {}".format(name, score))
  2. 使用带有数字的新型字符串格式(可用于重新排序或多次打印相同的字符):

    print("Total score for {0} is {1}".format(name, score))
  3. 使用具有显式名称的新型字符串格式:

    print("Total score for {n} is {s}".format(n=name, s=score))
  4. 连接字符串:

    print("Total score for " + str(name) + " is " + str(score))

我认为最清楚的两个是:

  1. 只需将值作为参数传递:

    print("Total score for", name, "is", score)

    如果您不希望print在上面的示例中自动插入空格,请更改sep参数:

    print("Total score for ", name, " is ", score, sep='')

    如果您使用的是Python 2,将不能使用最后两个,因为print这不是Python 2中的函数。不过,您可以从__future__以下方式导入此行为:

    from __future__ import print_function
  2. f在Python 3.6中使用新的-string格式:

    print(f'Total score for {name} is {score}')

There are many ways to do this. To fix your current code using %-formatting, you need to pass in a tuple:

  1. Pass it as a tuple:

    print("Total score for %s is %s" % (name, score))
    

A tuple with a single element looks like ('this',).

Here are some other common ways of doing it:

  1. Pass it as a dictionary:

    print("Total score for %(n)s is %(s)s" % {'n': name, 's': score})
    

There’s also new-style string formatting, which might be a little easier to read:

  1. Use new-style string formatting:

    print("Total score for {} is {}".format(name, score))
    
  2. Use new-style string formatting with numbers (useful for reordering or printing the same one multiple times):

    print("Total score for {0} is {1}".format(name, score))
    
  3. Use new-style string formatting with explicit names:

    print("Total score for {n} is {s}".format(n=name, s=score))
    
  4. Concatenate strings:

    print("Total score for " + str(name) + " is " + str(score))
    

The clearest two, in my opinion:

  1. Just pass the values as parameters:

    print("Total score for", name, "is", score)
    

    If you don’t want spaces to be inserted automatically by print in the above example, change the sep parameter:

    print("Total score for ", name, " is ", score, sep='')
    

    If you’re using Python 2, won’t be able to use the last two because print isn’t a function in Python 2. You can, however, import this behavior from __future__:

    from __future__ import print_function
    
  2. Use the new f-string formatting in Python 3.6:

    print(f'Total score for {name} is {score}')
    

回答 1

有很多打印方法。

让我们看另一个例子。

a = 10
b = 20
c = a + b

#Normal string concatenation
print("sum of", a , "and" , b , "is" , c) 

#convert variable into str
print("sum of " + str(a) + " and " + str(b) + " is " + str(c)) 

# if you want to print in tuple way
print("Sum of %s and %s is %s: " %(a,b,c))  

#New style string formatting
print("sum of {} and {} is {}".format(a,b,c)) 

#in case you want to use repr()
print("sum of " + repr(a) + " and " + repr(b) + " is " + repr(c))

EDIT :

#New f-string formatting from Python 3.6:
print(f'Sum of {a} and {b} is {c}')

There are many ways to print that.

Let’s have a look with another example.

a = 10
b = 20
c = a + b

#Normal string concatenation
print("sum of", a , "and" , b , "is" , c) 

#convert variable into str
print("sum of " + str(a) + " and " + str(b) + " is " + str(c)) 

# if you want to print in tuple way
print("Sum of %s and %s is %s: " %(a,b,c))  

#New style string formatting
print("sum of {} and {} is {}".format(a,b,c)) 

#in case you want to use repr()
print("sum of " + repr(a) + " and " + repr(b) + " is " + repr(c))

EDIT :

#New f-string formatting from Python 3.6:
print(f'Sum of {a} and {b} is {c}')

回答 2

使用方法.format()

print("Total score for {0} is {1}".format(name, score))

要么:

// Recommended, more readable code

print("Total score for {n} is {s}".format(n=name, s=score))

要么:

print("Total score for" + name + " is " + score)

要么:

`print("Total score for %s is %d" % (name, score))`

Use: .format():

print("Total score for {0} is {1}".format(name, score))

Or:

// Recommended, more readable code

print("Total score for {n} is {s}".format(n=name, s=score))

Or:

print("Total score for" + name + " is " + score)

Or:

`print("Total score for %s is %d" % (name, score))`

回答 3

在Python 3.6中,f-string它更加干净。

在早期版本中:

print("Total score for %s is %s. " % (name, score))

在Python 3.6中:

print(f'Total score for {name} is {score}.')

会做。

它更高效,更优雅。

In Python 3.6, f-string is much cleaner.

In earlier version:

print("Total score for %s is %s. " % (name, score))

In Python 3.6:

print(f'Total score for {name} is {score}.')

will do.

It is more efficient and elegant.


回答 4

保持简单,我个人喜欢字符串连接:

print("Total score for " + name + " is " + score)

它同时适用于Python 2.7和3.X。

注意:如果score是一个int,则应将其转换为str

print("Total score for " + name + " is " + str(score))

Keeping it simple, I personally like string concatenation:

print("Total score for " + name + " is " + score)

It works with both Python 2.7 an 3.X.

NOTE: If score is an int, then, you should convert it to str:

print("Total score for " + name + " is " + str(score))

回答 5

你试一试:

print("Total score for", name, "is", score)

Just try:

print("Total score for", name, "is", score)

回答 6

只要遵循这个

idiot_type = "the biggest idiot"
year = 22
print("I have been {} for {} years ".format(idiot_type, years))

要么

idiot_type = "the biggest idiot"
year = 22
print("I have been %s for %s years."% (idiot_type, year))

忘记所有其他格式,否则大脑将无法映射所有格式。

Just follow this

idiot_type = "the biggest idiot"
year = 22
print("I have been {} for {} years ".format(idiot_type, years))

OR

idiot_type = "the biggest idiot"
year = 22
print("I have been %s for %s years."% (idiot_type, year))

And forget all others, else the brain won’t be able to map all the formats.


回答 7

print("Total score for %s is %s  " % (name, score))

%s可以替换为%d%f

print("Total score for %s is %s  " % (name, score))

%s can be replace by %d or %f


回答 8

用途f-string

print(f'Total score for {name} is {score}')

要么

用途.format

print("Total score for {} is {}".format(name, score))

Use f-string:

print(f'Total score for {name} is {score}')

Or

Use .format:

print("Total score for {} is {}".format(name, score))

回答 9

如果score是数字,则

print("Total score for %s is %d" % (name, score))

如果score是一个字符串,则

print("Total score for %s is %s" % (name, score))

如果score是数字,%d则为,如果是字符串%s,则为,如果score是浮点型,则为%f

If score is a number, then

print("Total score for %s is %d" % (name, score))

If score is a string, then

print("Total score for %s is %s" % (name, score))

If score is a number, then it’s %d, if it’s a string, then it’s %s, if score is a float, then it’s %f


回答 10

这是我的工作:

print("Total score for " + name + " is " + score)

请记住在for前后放置一个空格is

This is what I do:

print("Total score for " + name + " is " + score)

Remember to put a space after for and before and after is.


NameError:全局名称“ xrange”未在Python 3中定义

问题:NameError:全局名称“ xrange”未在Python 3中定义

运行python程序时出现错误:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\_sandbox.py", line 110, in <module>
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\_sandbox.py", line 27, in __init__
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\class\inventory.py", line 17, in __init__
builtins.NameError: global name 'xrange' is not defined

游戏是从这里开始的

是什么导致此错误?

I am getting an error when running a python program:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\_sandbox.py", line 110, in <module>
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\_sandbox.py", line 27, in __init__
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\class\inventory.py", line 17, in __init__
builtins.NameError: global name 'xrange' is not defined

The game is from here.

What causes this error?


回答 0

您正在尝试使用Python 3运行Python 2代码库。在Python 3中xrange()已重命名为range()

而是使用Python 2运行游戏。不要试图将它移植,除非你知道自己在做什么,很可能会出现超越更多的问题xrange()range()

作为记录,您看到的不是语法错误,而是运行时异常。


如果您确实知道自己在做什么,并且正在积极地使Python 2代码库与Python 3兼容,则可以通过将全局名称添加为模块的别名来桥接代码range。(请注意,您可能必须更新range()Python 2代码库中的所有现有用法,list(range(...))以确保仍然在Python 3中获得列表对象):

try:
    # Python 2
    xrange
except NameError:
    # Python 3, xrange is now named range
    xrange = range

# Python 2 code that uses xrange(...) unchanged, and any
# range(...) replaced with list(range(...))

或更换的所有用途xrange(...)range(...)在代码库,然后使用不同的垫片,使与Python 2的Python语法3兼容:

try:
    # Python 2 forward compatibility
    range = xrange
except NameError:
    pass

# Python 2 code transformed from range(...) -> list(range(...)) and
# xrange(...) -> range(...).

对于希望长期与Python 3兼容的代码库而言,后者是更可取的,因此只要有可能,便更容易使用Python 3语法。

You are trying to run a Python 2 codebase with Python 3. xrange() was renamed to range() in Python 3.

Run the game with Python 2 instead. Don’t try to port it unless you know what you are doing, most likely there will be more problems beyond xrange() vs. range().

For the record, what you are seeing is not a syntax error but a runtime exception instead.


If you do know what your are doing and are actively making a Python 2 codebase compatible with Python 3, you can bridge the code by adding the global name to your module as an alias for range. (Take into account that you may have to update any existing range() use in the Python 2 codebase with list(range(...)) to ensure you still get a list object in Python 3):

try:
    # Python 2
    xrange
except NameError:
    # Python 3, xrange is now named range
    xrange = range

# Python 2 code that uses xrange(...) unchanged, and any
# range(...) replaced with list(range(...))

or replace all uses of xrange(...) with range(...) in the codebase and then use a different shim to make the Python 3 syntax compatible with Python 2:

try:
    # Python 2 forward compatibility
    range = xrange
except NameError:
    pass

# Python 2 code transformed from range(...) -> list(range(...)) and
# xrange(...) -> range(...).

The latter is preferable for codebases that want to aim to be Python 3 compatible only in the long run, it is easier to then just use Python 3 syntax whenever possible.


回答 1

添加xrange=range您的代码:)对我有用。

add xrange=range in your code :) It works to me.


回答 2

我加入这个解决进口问题的
更多信息

from past.builtins import xrange

I solved the issue by adding this import
More info

from past.builtins import xrange

回答 3

在python 2.x中,xrange用于返回生成器,而range用于返回列表。在python 3.x中,xrange已被删除,并且range返回一个生成器,就像python 2.x中的xrange一样。因此,在python 3.x中,您需要使用range而不是xrange。

in python 2.x, xrange is used to return a generator while range is used to return a list. In python 3.x , xrange has been removed and range returns a generator just like xrange in python 2.x. Therefore, in python 3.x you need to use range rather than xrange.


回答 4

更换

Python 2 xrange

Python 3 range

休息都一样。

Replace

Python 2 xrange to

Python 3 range

Rest all same.


回答 5

我同意最后一个答案。但是还有另一种方法可以解决此问题。您可以下载名为future的软件包,例如pip install future。然后在.py文件中输入“ from past.builtins import xrange”。此方法用于文件中有很多xrange的情况。

I agree with the last answer.But there is another way to solve this problem.You can download the package named future,such as pip install future.And in your .py file input this “from past.builtins import xrange”.This method is for the situation that there are many xranges in your file.


如何更正TypeError:散列之前必须对Unicode对象进行编码?

问题:如何更正TypeError:散列之前必须对Unicode对象进行编码?

我有这个错误:

Traceback (most recent call last):
  File "python_md5_cracker.py", line 27, in <module>
  m.update(line)
TypeError: Unicode-objects must be encoded before hashing

当我尝试在Python 3.2.2中执行以下代码时:

import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides?  ")
wordlist = input("What is your wordlist?  (Enter the file name)  ")
try:
  hashdocument = open(hash_file, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  hash = hashdocument.readline()
  hash = hash.replace("\n", "")

try:
  wordlistfile = open(wordlist, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  pass
for line in wordlistfile:
  # Flush the buffer (this caused a massive problem when placed 
  # at the beginning of the script, because the buffer kept getting
  # overwritten, thus comparing incorrect hashes)
  m = hashlib.md5()
  line = line.replace("\n", "")
  m.update(line)
  word_hash = m.hexdigest()
  if word_hash == hash:
    print("Collision! The word corresponding to the given hash is", line)
    input()
    sys.exit()

print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()

I have this error:

Traceback (most recent call last):
  File "python_md5_cracker.py", line 27, in <module>
  m.update(line)
TypeError: Unicode-objects must be encoded before hashing

when I try to execute this code in Python 3.2.2:

import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides?  ")
wordlist = input("What is your wordlist?  (Enter the file name)  ")
try:
  hashdocument = open(hash_file, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  hash = hashdocument.readline()
  hash = hash.replace("\n", "")

try:
  wordlistfile = open(wordlist, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  pass
for line in wordlistfile:
  # Flush the buffer (this caused a massive problem when placed 
  # at the beginning of the script, because the buffer kept getting
  # overwritten, thus comparing incorrect hashes)
  m = hashlib.md5()
  line = line.replace("\n", "")
  m.update(line)
  word_hash = m.hexdigest()
  if word_hash == hash:
    print("Collision! The word corresponding to the given hash is", line)
    input()
    sys.exit()

print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()

回答 0

它可能正在寻找来自的字符编码wordlistfile

wordlistfile = open(wordlist,"r",encoding='utf-8')

或者,如果您逐行工作:

line.encode('utf-8')

It is probably looking for a character encoding from wordlistfile.

wordlistfile = open(wordlist,"r",encoding='utf-8')

Or, if you’re working on a line-by-line basis:

line.encode('utf-8')

回答 1

您必须定义encoding formatlike utf-8,尝试这种简单的方法,

本示例使用SHA256算法生成一个随机数:

>>> import hashlib
>>> hashlib.sha256(str(random.getrandbits(256)).encode('utf-8')).hexdigest()
'cd183a211ed2434eac4f31b317c573c50e6c24e3a28b82ddcb0bf8bedf387a9f'

You must have to define encoding format like utf-8, Try this easy way,

This example generates a random number using the SHA256 algorithm:

>>> import hashlib
>>> hashlib.sha256(str(random.getrandbits(256)).encode('utf-8')).hexdigest()
'cd183a211ed2434eac4f31b317c573c50e6c24e3a28b82ddcb0bf8bedf387a9f'

回答 2

要存储密码(PY3):

import hashlib, os
password_salt = os.urandom(32).hex()
password = '12345'

hash = hashlib.sha512()
hash.update(('%s%s' % (password_salt, password)).encode('utf-8'))
password_hash = hash.hexdigest()

To store the password (PY3):

import hashlib, os
password_salt = os.urandom(32).hex()
password = '12345'

hash = hashlib.sha512()
hash.update(('%s%s' % (password_salt, password)).encode('utf-8'))
password_hash = hash.hexdigest()

回答 3

该错误已说明您必须执行的操作。MD5对字节进行操作,因此您必须将Unicode字符串编码为bytes,例如使用line.encode('utf-8')

The error already says what you have to do. MD5 operates on bytes, so you have to encode Unicode string into bytes, e.g. with line.encode('utf-8').


回答 4

请首先查看答案。

现在,该错误信息是明确的:你只能使用字节,而不是Python字符串(曾经被认为是unicode在Python <3),所以你必须与你的首选编码编码字符串:utf-32utf-16utf-8或受限制的甚至是一个8-位编码(有些可能称为代码页)。

从文件中读取时,Python 3将自动将单词列表文件中的字节解码为Unicode。我建议你这样做:

m.update(line.encode(wordlistfile.encoding))

因此,推送到md5算法的编码数据的编码方式与基础文件完全相同。

Please take a look first at that answer.

Now, the error message is clear: you can only use bytes, not Python strings (what used to be unicode in Python < 3), so you have to encode the strings with your preferred encoding: utf-32, utf-16, utf-8 or even one of the restricted 8-bit encodings (what some might call codepages).

The bytes in your wordlist file are being automatically decoded to Unicode by Python 3 as you read from the file. I suggest you do:

m.update(line.encode(wordlistfile.encoding))

so that the encoded data pushed to the md5 algorithm are encoded exactly like the underlying file.


回答 5

import hashlib
string_to_hash = '123'
hash_object = hashlib.sha256(str(string_to_hash).encode('utf-8'))
print('Hash', hash_object.hexdigest())
import hashlib
string_to_hash = '123'
hash_object = hashlib.sha256(str(string_to_hash).encode('utf-8'))
print('Hash', hash_object.hexdigest())

回答 6

您可以以二进制模式打开文件:

import hashlib

with open(hash_file) as file:
    control_hash = file.readline().rstrip("\n")

wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
    if hashlib.md5(line.rstrip(b'\n\r')).hexdigest() == control_hash:
       # collision

You could open the file in binary mode:

import hashlib

with open(hash_file) as file:
    control_hash = file.readline().rstrip("\n")

wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
    if hashlib.md5(line.rstrip(b'\n\r')).hexdigest() == control_hash:
       # collision

回答 7

编码这行为我固定。

m.update(line.encode('utf-8'))

encoding this line fixed it for me.

m.update(line.encode('utf-8'))

回答 8

如果是单行字符串。用b或B包裹它。例如:

variable = b"This is a variable"

要么

variable2 = B"This is also a variable"

If it’s a single line string. wrapt it with b or B. e.g:

variable = b"This is a variable"

or

variable2 = B"This is also a variable"

回答 9

该程序是上述MD5破解程序的无错误和增强版本,该MD5破解程序读取包含哈希密码列表的文件,并根据英语词典单词列表中的哈希单词检查该文件。希望对您有所帮助。

我从以下链接https://github.com/dwyl/english-words下载了英语词典

# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words 

import hashlib, sys

hash_file = 'exercise\hashed.txt'
wordlist = 'data_sets\english_dictionary\words.txt'

try:
    hashdocument = open(hash_file,'r')
except IOError:
    print('Invalid file.')
    sys.exit()
else:
    count = 0
    for hash in hashdocument:
        hash = hash.rstrip('\n')
        print(hash)
        i = 0
        with open(wordlist,'r') as wordlistfile:
            for word in wordlistfile:
                m = hashlib.md5()
                word = word.rstrip('\n')            
                m.update(word.encode('utf-8'))
                word_hash = m.hexdigest()
                if word_hash==hash:
                    print('The word, hash combination is ' + word + ',' + hash)
                    count += 1
                    break
                i += 1
        print('Itiration is ' + str(i))
    if count == 0:
        print('The hash given does not correspond to any supplied word in the wordlist.')
    else:
        print('Total passwords identified is: ' + str(count))
sys.exit()

This program is the bug free and enhanced version of the above MD5 cracker that reads the file containing list of hashed passwords and checks it against hashed word from the English dictionary word list. Hope it is helpful.

I downloaded the English dictionary from the following link https://github.com/dwyl/english-words

# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words 

import hashlib, sys

hash_file = 'exercise\hashed.txt'
wordlist = 'data_sets\english_dictionary\words.txt'

try:
    hashdocument = open(hash_file,'r')
except IOError:
    print('Invalid file.')
    sys.exit()
else:
    count = 0
    for hash in hashdocument:
        hash = hash.rstrip('\n')
        print(hash)
        i = 0
        with open(wordlist,'r') as wordlistfile:
            for word in wordlistfile:
                m = hashlib.md5()
                word = word.rstrip('\n')            
                m.update(word.encode('utf-8'))
                word_hash = m.hexdigest()
                if word_hash==hash:
                    print('The word, hash combination is ' + word + ',' + hash)
                    count += 1
                    break
                i += 1
        print('Itiration is ' + str(i))
    if count == 0:
        print('The hash given does not correspond to any supplied word in the wordlist.')
    else:
        print('Total passwords identified is: ' + str(count))
sys.exit()