How to read a file line by line in Python?

Question: How to read a file line by line in Python?

In pre-historic times (Python 1.4) we did:

fp = open('filename.txt')
while 1:
    line = fp.readline()
    if not line:
        break
    print line

after Python 2.1, we did:

for line in open('filename.txt').xreadlines():
    print line

before we got the convenient iterator protocol in Python 2.3, and could do:

for line in open('filename.txt'):
    print line

I’ve seen some examples using the more verbose:

with open('filename.txt') as fp:
    for line in fp:
        print line

is this the preferred method going forwards?

[edit] I get that the with statement ensures closing of the file… but why isn’t that included in the iterator protocol for file objects?


Answer 0

There is exactly one reason why the following is preferred:

with open('filename.txt') as fp:
    for line in fp:
        print line

We are all spoiled by CPython’s relatively deterministic reference-counting scheme for garbage collection. Other, hypothetical implementations of Python will not necessarily close the file “quickly enough” without the with block if they use some other scheme to reclaim memory.

In such an implementation, you might get a “too many files open” error from the OS if your code opens files faster than the garbage collector calls finalizers on orphaned file handles. The usual workaround is to trigger the GC immediately, but this is a nasty hack and it has to be done by every function that could encounter the error, including those in libraries. What a nightmare.

Or you could just use the with block.
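A minimal sketch of the point above, written for Python 3 rather than the Python 2 used in the thread. The file name `example.txt` and its contents are made up for illustration. It shows that the handle is closed as soon as the with block exits, however the interpreter reclaims memory:

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as out:
    out.write('first\nsecond\n')

lines = []
with open(path) as fp:
    for line in fp:
        lines.append(line.rstrip('\n'))

# The with block has exited, so the handle is already closed,
# without waiting for any garbage collector.
print(fp.closed)
print(lines)
```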

Bonus Question

(Stop reading now if you are only interested in the objective aspects of the question.)

Why isn’t that included in the iterator protocol for file objects?

This is a subjective question about API design, so I have a subjective answer in two parts.

On a gut level, this feels wrong, because it makes the iterator protocol do two separate things (iterate over lines and close the file handle), and it’s often a bad idea to make a simple-looking function do two actions. In this case, it feels especially bad because iterators relate in a quasi-functional, value-based way to the contents of a file, but managing file handles is a completely separate task. Squashing both, invisibly, into one action is surprising to humans who read the code and makes it more difficult to reason about program behavior.

Other languages have essentially come to the same conclusion. Haskell briefly flirted with so-called “lazy IO” which allows you to iterate over a file and have it automatically closed when you get to the end of the stream, but it’s almost universally discouraged to use lazy IO in Haskell these days, and Haskell users have mostly moved to more explicit resource management like Conduit which behaves more like the with block in Python.

On a technical level, there are some things you may want to do with a file handle in Python which would not work as well if iteration closed the file handle. For example, suppose I need to iterate over the file twice:

with open('filename.txt') as fp:
    for line in fp:
        ...
    fp.seek(0)
    for line in fp:
        ...

While this is a less common use case, consider the fact that I might have just added the three lines of code at the bottom to an existing code base which originally had the top three lines. If iteration closed the file, I wouldn’t be able to do that. So keeping iteration and resource management separate makes it easier to compose chunks of code into a larger, working Python program.

Composability is one of the most important usability features of a language or API.


Answer 1

Yes,

with open('filename.txt') as fp:
    for line in fp:
        print line

is the way to go.

It is not more verbose. It is safer.


Answer 2

If you’re put off by the extra line, you can use a wrapper function like so:

def with_iter(iterable):
    with iterable as iter:
        for item in iter:
            yield item

for line in with_iter(open('...')):
    ...

In Python 3.3 and later, yield from makes this even shorter:

def with_iter(iterable):
    with iterable as iter:
        yield from iter
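
One caveat worth illustrating, as a sketch under Python 3: because with_iter is a generator, its with block only exits once the generator is exhausted or explicitly closed. If the consuming loop stops early, the file stays open until then. The sample file and the variable names below are hypothetical.

```python
import os
import tempfile

def with_iter(context_iterable):
    # Same idea as the wrapper above: enter the context, then yield its items.
    with context_iterable as it:
        yield from it

# Hypothetical sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as out:
    out.write('a\nb\nc\n')

fp = open(path)
gen = with_iter(fp)
first = next(gen)             # consume one line, then stop early
still_open = not fp.closed    # the with block has not exited yet
gen.close()                   # finishes the generator, which exits the with block
closed_after = fp.closed
```

So the wrapper saves a line, but it ties the moment of closing to how the generator is consumed, which the plain with block does not.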

Answer 3

# Python 2 only: file.xreadlines() was removed in Python 3.
f = open('test.txt','r')
for line in f.xreadlines():
    print line
f.close()

Can I use `pip` instead of `easy_install` for `python setup.py install` dependency resolution?

Question: Can I use `pip` instead of `easy_install` for `python setup.py install` dependency resolution?

python setup.py install will automatically install packages listed in requires=[] using easy_install. How do I get it to use pip instead?


Answer 0

Yes you can. You can install a package from a tarball or a folder, on the web or your computer. For example:

Install from tarball on web

pip install https://pypi.python.org/packages/source/r/requests/requests-2.3.0.tar.gz

Install from local tarball

wget https://pypi.python.org/packages/source/r/requests/requests-2.3.0.tar.gz
pip install requests-2.3.0.tar.gz

Install from local folder

tar -zxvf requests-2.3.0.tar.gz
cd requests-2.3.0
pip install .

You can delete the requests-2.3.0 folder.

Install from local folder (editable mode)

pip install -e .

This installs the package in editable mode. Any changes you make to the code will immediately apply across the system. This is useful if you are the package developer and want to test changes. It also means you can’t delete the folder without breaking the install.


Answer 1

You can build a source archive with python setup.py sdist first and then pip install the resulting file. You can also run pip install -e . which is like python setup.py develop.


Answer 2

If you are really set on using python setup.py install you could try something like this:

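import subprocess
import sys

from setuptools import setup, find_packages
from setuptools.command.install import install as InstallCommand


class Install(InstallCommand):
    """ Customized setuptools install command which uses pip. """

    def run(self, *args, **kwargs):
        # pip.main() was removed from pip's public API in pip 10,
        # so invoke pip as a subprocess of the current interpreter instead.
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '.'])
        InstallCommand.run(self, *args, **kwargs)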


setup(
    name='your_project',
    version='0.0.1a',
    cmdclass={
        'install': Install,
    },
    packages=find_packages(),
    install_requires=['simplejson']
)

How to perform element-wise multiplication of two lists?

Question: How to perform element-wise multiplication of two lists?

I want to perform an element-wise multiplication, multiplying two lists together by value in Python, as we can do in Matlab.

This is how I would do it in Matlab.

a = [1,2,3,4]
b = [2,3,4,5]
a .* b = [2, 6, 12, 20]

A list comprehension would give 16 list entries, for every combination x * y of x from a and y from b. Unsure of how to map this.

If anyone is interested why, I have a dataset, and want to multiply it by Numpy.linspace(1.0, 0.5, num=len(dataset)) =).


Answer 0

Use a list comprehension with zip():

[a*b for a,b in zip(lista,listb)]
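
As a small runnable sketch of the same idea (the function name elementwise_product is made up here, and the length check is an addition, since zip() silently stops at the shorter input):

```python
def elementwise_product(xs, ys):
    """Multiply two equal-length sequences pairwise and return a list."""
    if len(xs) != len(ys):
        # zip() would silently truncate to the shorter input, so check first.
        raise ValueError('inputs must have the same length')
    return [x * y for x, y in zip(xs, ys)]

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
print(elementwise_product(a, b))  # [2, 6, 12, 20]
```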

Answer 1

Since you’re already using numpy, it makes sense to store your data in a numpy array rather than a list. Once you do this, you get things like element-wise products for free:

In [1]: import numpy as np

In [2]: a = np.array([1,2,3,4])

In [3]: b = np.array([2,3,4,5])

In [4]: a * b
Out[4]: array([ 2,  6, 12, 20])

Answer 2

Use np.multiply(a,b):

import numpy as np
a = [1,2,3,4]
b = [2,3,4,5]
np.multiply(a,b)

Answer 3

You can multiply the elements pairwise in a loop. The shorthand for doing that is

ab = [a[i]*b[i] for i in range(len(a))]

Answer 4

Yet another answer:

-1 … requires import
+1 … is very readable

import operator
a = [1,2,3,4]
b = [10,11,12,13]

list(map(operator.mul, a, b))

outputs [10, 22, 36, 52]


Answer 5

Fairly intuitive way of doing this:

a = [1,2,3,4]
b = [2,3,4,5]
ab = []                        #Create empty list
for i in range(0, len(a)):
     ab.append(a[i]*b[i])      #Adds each element to the list

Answer 6

You can multiply using map with a lambda:

foo=[1,2,3,4]
bar=[1,2,5,55]
l=map(lambda x,y:x*y,foo,bar)

Answer 7

For large lists, we can do it the iter-way:

product_iter_object = itertools.imap(operator.mul, [1,2,3,4], [2,3,4,5])

Each call to product_iter_object.next() returns the next element of the output.

The output would be the length of the shorter of the two input lists.
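
itertools.imap exists only in Python 2. A rough Python 3 equivalent, as a sketch: the built-in map() is already lazy there, and next() is a built-in function rather than a method. The extra element in the second list is added here only to show the truncation.

```python
import operator

# In Python 3, map() is already lazy, like itertools.imap in Python 2.
products = map(operator.mul, [1, 2, 3, 4], [2, 3, 4, 5, 6])

first = next(products)   # computed on demand
rest = list(products)    # remaining items; the unmatched 6 is never used
print(first, rest)
```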


Answer 8

Create an array of ones, multiply it by each list, then convert the array back to a list:

import numpy as np

a = [1,2,3,4]
b = [2,3,4,5]

c = (np.ones(len(a))*a*b).tolist()

[2.0, 6.0, 12.0, 20.0]

Answer 9

gahooa’s answer is correct for the question as phrased in the heading, but if the lists are already in numpy format or have more than about ten elements, it will be MUCH faster (up to roughly 3 orders of magnitude in the timings below), as well as more readable, to do simple numpy multiplication as suggested by NPE. I get these timings:

0.0049ms -> N = 4, a = [i for i in range(N)], c = [a*b for a,b in zip(a, b)]
0.0075ms -> N = 4, a = [i for i in range(N)], c = a * b
0.0167ms -> N = 4, a = np.arange(N), c = [a*b for a,b in zip(a, b)]
0.0013ms -> N = 4, a = np.arange(N), c = a * b
0.0171ms -> N = 40, a = [i for i in range(N)], c = [a*b for a,b in zip(a, b)]
0.0095ms -> N = 40, a = [i for i in range(N)], c = a * b
0.1077ms -> N = 40, a = np.arange(N), c = [a*b for a,b in zip(a, b)]
0.0013ms -> N = 40, a = np.arange(N), c = a * b
0.1485ms -> N = 400, a = [i for i in range(N)], c = [a*b for a,b in zip(a, b)]
0.0397ms -> N = 400, a = [i for i in range(N)], c = a * b
1.0348ms -> N = 400, a = np.arange(N), c = [a*b for a,b in zip(a, b)]
0.0020ms -> N = 400, a = np.arange(N), c = a * b

i.e. from the following test program.

import timeit

init = ['''
import numpy as np
N = {}
a = {}
b = np.linspace(0.0, 0.5, len(a))
'''.format(i, j) for i in [4, 40, 400] 
                  for j in ['[i for i in range(N)]', 'np.arange(N)']]

func = ['''c = [a*b for a,b in zip(a, b)]''',
'''c = a * b''']

for i in init:
  for f in func:
    lines = i.split('\n')
    print('{:6.4f}ms -> {}, {}, {}'.format(
           timeit.timeit(f, setup=i, number=1000), lines[2], lines[3], f))

Answer 10

You can use enumerate:

a = [1, 2, 3, 4]
b = [2, 3, 4, 5]

ab = [val * b[i] for i, val in enumerate(a)]

Answer 11

The map function can be very useful here. Using map we can apply any function to each element of an iterable.

Python 3.x

>>> def my_mul(x,y):
...     return x*y
...
>>> a = [1,2,3,4]
>>> b = [2,3,4,5]
>>>
>>> list(map(my_mul,a,b))
[2, 6, 12, 20]
>>>

Of course:

map(f, iterable)

is equivalent to

[f(x) for x in iterable]

So we can get our solution via:

>>> [my_mul(x,y) for x, y in zip(a,b)]
[2, 6, 12, 20]
>>>

In Python 2.x, map() applies a function to each element of an iterable and constructs a new list. In Python 3.x, map() constructs an iterator instead of a list.

Instead of my_mul we could use the mul operator:

Python 2.7

>>> from operator import mul # import mul operator
>>> a = [1,2,3,4]
>>> b = [2,3,4,5]
>>> map(mul,a,b)
[2, 6, 12, 20]
>>>

Python 3.5+

>>> from operator import mul
>>> a = [1,2,3,4]
>>> b = [2,3,4,5]
>>> [*map(mul,a,b)]
[2, 6, 12, 20]
>>>

Please note that since map() constructs an iterator, we use the * iterable unpacking operator to get a list. The unpacking approach is a bit faster than the list constructor:

>>> list(map(mul,a,b))
[2, 6, 12, 20]
>>>

Answer 12

To maintain the list type, and do it in one line (after importing numpy as np, of course):

list(np.array([1,2,3,4]) * np.array([2,3,4,5]))

or

list(np.array(a) * np.array(b))

Answer 13

You can use this for lists of the same length. Note that it returns the sum of the element-wise products (a dot product) rather than a list:

def lstsum(a, b):
    c = 0
    pos = 0
    for element in a:
        c += element * b[pos]
        pos += 1
    return c

Convert int to ASCII and back in Python

Question: Convert int to ASCII and back in Python

I’m working on making a URL shortener for my site, and my current plan (I’m open to suggestions) is to use a node ID to generate the shortened URL. So, in theory, node 26 might be short.com/z, node 1 might be short.com/a, node 52 might be short.com/Z, and node 104 might be short.com/ZZ. When a user goes to that URL, I need to reverse the process (obviously).

I can think of some kludgy ways to go about this, but I’m guessing there are better ones. Any suggestions?


Answer 0

ASCII to int:

ord('a')

gives 97

And back to a string:

  • in Python2: str(unichr(97))
  • in Python3: chr(97)

gives 'a'


Answer 1

>>> ord("a")
97
>>> chr(97)
'a'

Answer 2

If multiple characters are bound inside a single integer/long, as was my issue:

s = '0123456789'
nchars = len(s)
# string to int or long. Type depends on nchars
x = sum(ord(s[byte])<<8*(nchars-byte-1) for byte in range(nchars))
# int or long to string
''.join(chr((x>>8*(nchars-byte-1))&0xFF) for byte in range(nchars))

Yields '0123456789' and x = 227581098929683594426425L
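
The snippet above targets Python 2 (hence the trailing L on the long integer). In Python 3 the same packing can be sketched with int.from_bytes and int.to_bytes. Deriving the byte count from bit_length() is a choice made here, and it would drop leading NUL bytes, so store the length separately if that matters.

```python
s = '0123456789'

# Pack the ASCII bytes into one integer, most significant byte first.
x = int.from_bytes(s.encode('ascii'), 'big')

# Unpack: the byte length must be known (here derived from the bit length).
nbytes = (x.bit_length() + 7) // 8
restored = x.to_bytes(nbytes, 'big').decode('ascii')

# Cross-check against the shift-and-sum formulation used above.
manual = sum(ord(c) << 8 * (len(s) - i - 1) for i, c in enumerate(s))
print(x == manual, restored)
```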


Answer 3

What about Base58-encoding the node ID, as Flickr does, for example?

# note the missing lowercase L and the zero etc.
BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'

def shorten(node_id):
    # Build the Base58 digits from least to most significant.
    url = ''
    while node_id >= 58:
        node_id, mod = divmod(node_id, 58)
        url = BASE58[mod] + url
    return 'http://short.com/%s' % (BASE58[node_id] + url)


Turning that back into a number isn’t a big deal either.
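
A sketch of the round trip, to make the reverse step concrete. The function names encode58 and decode58 are made up for this example; the alphabet is the one given above.

```python
BASE58 = '123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ'

def encode58(node_id):
    """Encode a non-negative integer using the alphabet above."""
    if node_id < 0:
        raise ValueError('node_id must be non-negative')
    digits = ''
    while node_id >= 58:
        node_id, mod = divmod(node_id, 58)
        digits = BASE58[mod] + digits
    return BASE58[node_id] + digits

def decode58(text):
    """Reverse encode58: treat each character as a base-58 digit."""
    value = 0
    for ch in text:
        value = value * 58 + BASE58.index(ch)  # ValueError on unknown characters
    return value

print(encode58(104), decode58(encode58(104)))
```

Decoding is just positional notation: each character contributes its index in the alphabet, and the running value is multiplied by 58 before adding the next digit.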


Answer 4

Use hex(id)[2:] and int(urlpart, 16). There are other options. base32 encoding your id could work as well, but I don’t know that there’s any library that does base32 encoding built into Python.

Apparently a base32 encoder was introduced in Python 2.4 with the base64 module. You might try using b32encode and b32decode. You should give True for both the casefold and map01 options to b32decode in case people write down your shortened URLs.

Actually, I take that back. I still think base32 encoding is a good idea, but that module is not useful for the case of URL shortening. You could look at the implementation in the module and make your own for this specific case. :-)
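
For the hex suggestion in the first paragraph, a minimal sketch of the round trip; the helper names to_slug and from_slug are hypothetical:

```python
def to_slug(node_id):
    # format(n, 'x') gives the same digits as hex(n)[2:] for non-negative n
    return format(node_id, 'x')

def from_slug(slug):
    # int() with base 16 accepts upper- and lower-case digits
    return int(slug, 16)

print(to_slug(104), from_slug(to_slug(104)))
```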


How do I get Pyflakes to ignore a statement?

Question: How do I get Pyflakes to ignore a statement?

A lot of our modules start with:

try:
    import json
except ImportError:
    from django.utils import simplejson as json  # Python 2.4 fallback.

…and it’s the only Pyflakes warning in the entire file:

foo/bar.py:14: redefinition of unused 'json' from line 12

How can I get Pyflakes to ignore this?

(Normally I’d go read the docs but the link is broken. If nobody has an answer, I’ll just read the source.)


Answer 0

If you can use flake8 instead – which wraps pyflakes as well as the pep8 checker – a line ending with

# NOQA

(in which the space is significant – 2 spaces between the end of the code and the #, one between it and the NOQA text) will tell the checker to ignore any errors on that line.
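
Applied to the import from the question, this could look like the sketch below, assuming flake8 is the checker in use. The comment goes on the fallback line, since that is the line the warning points at.

```python
try:
    import json
except ImportError:
    # Fallback import from the question; flake8 skips lines marked with noqa.
    from django.utils import simplejson as json  # noqa

print(json.dumps({'ok': True}))
```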


Answer 1

I know this was asked some time ago and already has an answer.

But I wanted to add what I usually use:

try:
    import json
    assert json  # silence pyflakes
except ImportError:
    from django.utils import simplejson as json  # Python 2.4 fallback.

Answer 2

Yep, unfortunately divmod.org is down, together with all its goodies.

Looking at the pyflakes code, it seems to me that pyflakes is designed so that it is easy to use as an “embedded fast checker”.

To implement ignore functionality, you will need to write your own wrapper that calls the pyflakes checker.

Here you can find an idea: http://djangosnippets.org/snippets/1762/

Note that the above snippet only works for comments placed on the same line. To ignore a whole block, you might want to add ‘pyflakes:ignore’ to the block docstring and filter based on node.doc.

Good luck!

I am using pocket-lint for all kinds of static code analysis. Here are the changes made in pocket-lint for ignoring pyflakes: https://code.launchpad.net/~adiroiban/pocket-lint/907742/+merge/102882


Answer 3

To quote from the github issue ticket:

While the fix is still coming, this is how it can be worked around, if you’re wondering:

try:
    from unittest.runner import _WritelnDecorator
    _WritelnDecorator; # workaround for pyflakes issue #13
except ImportError:
    from unittest import _WritelnDecorator

Substitute unittest and _WritelnDecorator with the entities (modules, functions, classes) you need.

deemoowoor


Answer 4

Here is a monkey patch for pyflakes that adds a # bypass_pyflakes comment option.

bypass_pyflakes.py

#!/usr/bin/env python

from pyflakes.scripts import pyflakes
from pyflakes.checker import Checker


def report_with_bypass(self, messageClass, *args, **kwargs):
    text_lineno = args[0] - 1
    with open(self.filename, 'r') as code:
        if code.readlines()[text_lineno].find('bypass_pyflakes') >= 0:
            return
    self.messages.append(messageClass(self.filename, *args, **kwargs))

# monkey patch checker to support bypass
Checker.report = report_with_bypass

pyflakes.main()

If you save this as bypass_pyflakes.py, then you can invoke it as python bypass_pyflakes.py myfile.py.

http://chase-seibert.github.com/blog/2013/01/11/bypass_pyflakes.html


Answer 5

You can also import with __import__. It’s not pythonic, but pyflakes does not warn you anymore. See documentation for __import__ .

try:
    import json
except ImportError:
    __import__('django.utils', globals(), locals(), ['json'], -1)

Answer 6

I created a little shell script with some awk magic to help me. With this all lines with import typing, from typing import or #$ (latter is a special comment I am using here) are excluded ($1 is the file name of the Python script):

result=$(pyflakes -- "$1" 2>&1)

# check whether there is any output
if [ "$result" ]; then

    # lines to exclude
    excl=$(awk 'BEGIN { ORS="" } /(#\$)|(import +typing)|(from +typing +import )/ { print sep NR; sep="|" }' "$1")

    # exclude lines if there are any (otherwise we get invalid regex)
    [ "$excl" ] &&
        result=$(awk "! /^[^:]+:(${excl}):/" <<< "$result")

fi

# now echo "$result" or such ...

Basically, it notes the line numbers and dynamically creates a regex out of them.


Should I use scipy.pi, numpy.pi, or math.pi?

Question: Should I use scipy.pi, numpy.pi, or math.pi?

In a project using SciPy and NumPy, should I use scipy.pi, numpy.pi, or math.pi?


Answer 0

>>> import math
>>> import numpy as np
>>> import scipy
>>> math.pi == np.pi == scipy.pi
True

So it doesn’t matter, they are all the same value.

The only reason all three modules provide a pi value is so if you are using just one of the three modules, you can conveniently have access to pi without having to import another module. They’re not providing different values for pi.


Answer 1

One thing to note is that not all libraries will use the same meaning for pi, of course, so it never hurts to know what you’re using. For example, the symbolic math library Sympy’s representation of pi is not the same as math and numpy:

import math
import numpy
import scipy
import sympy

print(math.pi == numpy.pi)
> True
print(math.pi == scipy.pi)
> True
print(math.pi == sympy.pi)
> False

Removing characters except digits from a string using Python?

Question: Removing characters except digits from a string using Python?

How can I remove all characters except numbers from string?


Answer 0

In Python 2.*, by far the fastest approach is the .translate method:

>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>> 

string.maketrans makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256)) (just faster to make;-). .translate applies the translation table (which here is irrelevant since all essentially means identity) AND deletes characters present in the second argument — the key part.

.translate works very differently on Unicode strings (and strings in Python 3 — I do wish questions specified which major-release of Python is of interest!) — not quite this simple, not quite this fast, though still quite usable.

Back to 2.*, the performance difference is impressive…:

$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop

Speeding things up by 7-8 times is hardly peanuts, so the translate method is well worth knowing and using. The other popular non-RE approach…:

$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop

is 50% slower than RE, so the .translate approach beats it by over an order of magnitude.

In Python 3, or for Unicode, you need to pass .translate a mapping (with ordinals, not characters directly, as keys) that returns None for what you want to delete. Here’s a convenient way to express this for deletion of “everything but” a few characters:

import string

class Del:
  def __init__(self, keep=string.digits):
    self.comp = dict((ord(c),c) for c in keep)
  def __getitem__(self, k):
    return self.comp.get(k)

DD = Del()

x='aaa12333bb445bb54b5b52'
x.translate(DD)

also emits '1233344554552'. However, putting this in xx.py we have…:

$ python3.1 -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop

…which shows that the performance advantage disappears for this kind of “deletion” task and becomes a performance decrease.
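For readers on Python 3, the "mapping that returns None" idea can also be sketched without a custom class. This is only an illustration, assuming that building a small table per input string is acceptable; `keep_only` is a hypothetical helper name that does not appear in the original answer, and no timing claim is made for it.

```python
import string

def keep_only(text, keep=string.digits):
    # Map every distinct character that is not in `keep` to None,
    # which tells str.translate() to delete it.
    table = {ord(ch): None for ch in set(text) if ch not in keep}
    return text.translate(table)

print(keep_only("aaa12333bb445bb54b5b52"))  # 1233344554552
```

Unlike the Del class, the table here is rebuilt for each call, so it probably only makes sense for occasional use.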


Answer 1

Use re.sub, like so:

>>> import re
>>> re.sub(r'\D', '', 'aas30dsa20')
'3020'

\D matches any non-digit character, so the code above essentially replaces every non-digit character with the empty string.

Or you can use filter, like so (in Python 2):

>>> filter(str.isdigit, 'aas30dsa20')
'3020'

Since in Python 3 filter returns an iterator instead of a string, you can use the following instead:

>>> ''.join(filter(str.isdigit, 'aas30dsa20'))
'3020'
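One difference between the two approaches may matter for non-ASCII input. The sketch below assumes Python 3 and uses a made-up sample string; it shows that str.isdigit() accepts some characters (such as a superscript two) that the regex class \d does not.

```python
import re

sample = "a1\u00b2b2"  # contains SUPERSCRIPT TWO between the ASCII digits

via_regex = re.sub(r"\D", "", sample)               # \d covers decimal digits only
via_isdigit = "".join(filter(str.isdigit, sample))  # isdigit() also accepts the superscript

print(via_regex)    # 12
print(via_isdigit)  # 1²2
```

If only 0-9 should survive, an explicit class such as [^0-9] (or the re.ASCII flag) is probably the safer choice.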

Answer 2

s=''.join(i for i in s if i.isdigit())

Another generator variant.


Answer 3

You can use filter:

filter(lambda x: x.isdigit(), "dasdasd2313dsa")

On Python 3 you have to join the result (kinda ugly :( )

''.join(filter(lambda x: x.isdigit(), "dasdasd2313dsa"))

Answer 4

Along the lines of bayer’s answer:

''.join(i for i in s if i.isdigit())

Answer 5

You can easily do it using a regex:

>>> import re
>>> re.sub(r"\D", "", "£70,000")
'70000'

Answer 6

x.translate(None, string.digits)

will delete all digits from the string (this two-argument form is Python 2 str.translate). To delete letters and keep the digits, do this:

x.translate(None, string.letters)

Answer 7

The OP mentions in the comments that he wants to keep the decimal point. This can be done with the re.sub method (as per the second and IMHO best answer) by explicitly listing the characters to keep, e.g.

>>> re.sub(r"[^0123456789.]", "", "poo123.4and5fish")
'123.45'

Answer 8

A fast version for Python 3:

# xx3.py
from collections import defaultdict
import string
_NoneType = type(None)

def keeper(keep):
    table = defaultdict(_NoneType)
    table.update({ord(c): c for c in keep})
    return table

digit_keeper = keeper(string.digits)

Here’s a performance comparison vs. regex:

$ python3.3 -mtimeit -s'import xx3; x="aaa12333bb445bb54b5b52"' 'x.translate(xx3.digit_keeper)'
1000000 loops, best of 3: 1.02 usec per loop
$ python3.3 -mtimeit -s'import re; r = re.compile(r"\D"); x="aaa12333bb445bb54b5b52"' 'r.sub("", x)'
100000 loops, best of 3: 3.43 usec per loop

So it’s a little bit more than 3 times faster than regex, for me. It’s also faster than class Del above, because defaultdict does all its lookups in C, rather than (slow) Python. Here’s that version on my same system, for comparison.

$ python3.3 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
100000 loops, best of 3: 13.6 usec per loop
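As a usage sketch of the keeper() table above (repeated here so the snippet is self-contained; `decimal_keeper` is a hypothetical extra table added only for illustration):

```python
from collections import defaultdict
import string

_NoneType = type(None)

def keeper(keep):
    # Code points not listed in `keep` fall back to NoneType(), i.e. None,
    # which str.translate() treats as "delete this character".
    table = defaultdict(_NoneType)
    table.update({ord(c): c for c in keep})
    return table

digit_keeper = keeper(string.digits)
decimal_keeper = keeper(string.digits + ".")

print("aaa12333bb445bb54b5b52".translate(digit_keeper))  # 1233344554552
print("poo123.4and5fish".translate(decimal_keeper))      # 123.45
```

One thing to keep in mind: a defaultdict stores every missing key it is asked for, so the table grows by one entry per distinct deleted character it encounters.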

Answer 9

Use a generator expression:

>>> s = "foo200bar"
>>> new_s = "".join(i for i in s if i in "0123456789")

Answer 10

Ugly but works:

>>> s
'aaa12333bb445bb54b5b52'
>>> a = ''.join(filter(lambda x : x.isdigit(), s))
>>> a
'1233344554552'
>>>

Answer 11

$ python -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'

100000 loops, best of 3: 2.48 usec per loop

$ python -mtimeit -s'import re; x="aaa12333bab445bb54b5b52"' '"".join(re.findall("[a-z]+",x))'

100000 loops, best of 3: 2.02 usec per loop

$ python -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'

100000 loops, best of 3: 2.37 usec per loop

$ python -mtimeit -s'import re; x="aaa12333bab445bb54b5b52"' '"".join(re.findall("[a-z]+",x))'

100000 loops, best of 3: 1.97 usec per loop

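I had observed that join with findall is faster than sub. (Note that the findall pattern above collects the letters rather than the digits; r"\d+" would be the digit-keeping equivalent.)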


Answer 12

You can read each character. If it is a digit, include it in the answer. The str.isdigit() method tells you whether a character is a digit.

your_input = '12kjkh2nnk34l34'
your_output = ''.join(c for c in your_input if c.isdigit())
print(your_output) # '1223434'

Answer 13

Not a one liner but very simple:

buffer = ""
some_str = "aas30dsa20"

for char in some_str:
    if char.isdigit():  # keep only the digits
        buffer += char

print(buffer)  # 3020

Answer 14

I used this. 'letters' should contain all the characters that you want to get rid of:

Output = Input.translate({ord(i): None for i in 'letters'})

Example:

Input = "I would like 20 dollars for that suit"
# include uppercase letters and the space so that only the digits remain
letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
Output = Input.translate({ord(i): None for i in letters})
print(Output)

Output: 20


Define a lambda expression that raises an Exception

Question: Define a lambda expression that raises an Exception

How can I write a lambda expression that’s equivalent to:

def x():
    raise Exception()

The following is not allowed:

y = lambda : raise Exception()

Answer 0

There is more than one way to skin a Python:

y = lambda: (_ for _ in ()).throw(Exception('foobar'))

Lambdas accept only expressions, not statements. Since raise ex is a statement, you could write a general-purpose raiser:

def raise_(ex):
    raise ex

y = lambda: raise_(Exception('foobar'))

But if your goal is to avoid a def, this obviously doesn’t cut it. It does, however, allow you to conditionally raise exceptions, e.g.:

y = lambda x: 2*x if x < 10 else raise_(Exception('foobar'))

Alternatively you can raise an exception without defining a named function. All you need is a strong stomach (and 2.x for the given code):

type(lambda:0)(type((lambda:0).func_code)(
  1,1,1,67,'|\0\0\202\1\0',(),(),('x',),'','',1,''),{}
)(Exception())

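And a strong-stomach solution for Python 3 (written against the code constructor of the time; Python 3.8 added a posonlyargcount parameter, so the argument list below needs adjusting on newer versions):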

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)(Exception())

Thanks @WarrenSpencer for pointing out a very simple answer if you don’t care which exception is raised: y = lambda: 1/0.
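The two less exotic options from this answer can be tried together in a small Python 3 sketch. The names `double_small` and `always_fails` and the exception messages are made up for illustration.

```python
def raise_(ex):
    """Raise the given exception; callable wherever an expression is required."""
    raise ex

# Conditional raise inside a lambda, via the helper function.
double_small = lambda x: 2 * x if x < 10 else raise_(ValueError("too big"))

# No helper at all: throw the exception into an empty generator expression.
always_fails = lambda: (_ for _ in ()).throw(RuntimeError("foobar"))

print(double_small(4))  # 8
for call in (lambda: double_small(10), always_fails):
    try:
        call()
    except (ValueError, RuntimeError) as exc:
        print(type(exc).__name__, exc)
```

The helper version reads more naturally; the generator trick is mainly useful when a def is genuinely not an option.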


Answer 1

How about (Python 3, where exec is a function):

lambda x: exec('raise(Exception(x))')

Answer 2

Actually, there is a way, but it’s very contrived.

You can create a code object using the compile() built-in function. This allows you to use the raise statement (or any other statement, for that matter), but it raises another challenge: executing the code object. The usual way would be to use the exec statement, but that leads you back to the original problem, namely that you can’t execute statements in a lambda (or an eval(), for that matter).

The solution is a hack. Callables like the result of a lambda expression all have an attribute __code__, which can actually be replaced. So, if you create a callable and replace its __code__ value with the code object from above, you get something that can be evaluated without using statements. Achieving all this, though, results in very obscure code:

map(lambda x, y, z: x.__setattr__(y, z) or x, [lambda: 0], ["__code__"], [compile("raise Exception", "", "single"])[0]()

The above does the following:

  • the compile() call creates a code object that raises the exception;

  • the lambda: 0 returns a callable that does nothing but return the value 0 — this is used to execute the above code object later;

  • the lambda x, y, z creates a function that calls the __setattr__ method of the first argument with the remaining arguments, AND RETURNS THE FIRST ARGUMENT! This is necessary, because __setattr__ itself returns None;

  • the map() call takes the result of lambda: 0 and, using the lambda x, y, z, replaces its __code__ object with the result of the compile() call. The result of this map operation is a list with one entry, the one returned by lambda x, y, z, which is why we need this lambda: if we used __setattr__ right away, we would lose the reference to the lambda: 0 object!

  • finally, the first (and only) element of the list returned by the map() call is executed, resulting in the code object being called, ultimately raising the desired exception.

It works (tested in Python 2.6), but it’s definitely not pretty.

One last note: if you have access to the types module (which would require an import statement before your eval), you can shorten this code a bit: using types.FunctionType() you can create a function that will execute the given code object, so you won’t need the hack of creating a dummy function with lambda: 0 and replacing the value of its __code__ attribute.
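A minimal Python 3 sketch of that last note follows. It assumes the module's own globals() are passed as the function's globals so that the name Exception resolves; the file name "<lambda-raise>" is an arbitrary placeholder.

```python
import types

# Compile a statement into a code object, then wrap it in a function object.
code = compile("raise Exception('foobar')", "<lambda-raise>", "exec")
raiser = types.FunctionType(code, globals())

try:
    raiser()
except Exception as exc:
    print(type(exc).__name__, exc)  # Exception foobar
```

As an expression this collapses to types.FunctionType(compile(...), globals())(), which is shorter than the map() trick but still nothing one would want in production code.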


Answer 3

Functions created with lambda forms cannot contain statements.


Answer 4

If all you want is a lambda expression that raises an arbitrary exception, you can accomplish this with an illegal expression. For instance, lambda x: [][0] will attempt to access the first element in an empty list, which will raise an IndexError.

PLEASE NOTE: This is a hack, not a feature. Do not use this in any (non code-golf) code that another human being might see or use.


Answer 5

I’d like to give an explanation of the UPDATE 3 of the answer provided by Marcelo Cantos:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)(Exception())

Explanation

lambda: 0 is an instance of the builtins.function class.
type(lambda: 0) is the builtins.function class.
(lambda: 0).__code__ is a code object.
A code object is an object which holds the compiled bytecode among other things. It is defined here in CPython https://github.com/python/cpython/blob/master/Include/code.h. Its methods are implemented here https://github.com/python/cpython/blob/master/Objects/codeobject.c. We can run the help on the code object:

Help on code object:

class code(object)
 |  code(argcount, kwonlyargcount, nlocals, stacksize, flags, codestring,
 |        constants, names, varnames, filename, name, firstlineno,
 |        lnotab[, freevars[, cellvars]])
 |  
 |  Create a code object.  Not for the faint of heart.

type((lambda: 0).__code__) is the code class.
So when we say

type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b'')

we are calling the constructor of the code object with the following arguments:

  • argcount=1
  • kwonlyargcount=0
  • nlocals=1
  • stacksize=1
  • flags=67
  • codestring=b'|\0\202\1\0'
  • constants=()
  • names=()
  • varnames=('x',)
  • filename=''
  • name=''
  • firstlineno=1
  • lnotab=b''

You can read about what the arguments mean in the definition of the PyCodeObject https://github.com/python/cpython/blob/master/Include/code.h. The value of 67 for the flags argument is for example CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE.
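The flags arithmetic can be checked by hand. In the sketch below the three constants are hard-coded rather than imported; their values are an assumption taken from CPython's Include/code.h for the versions discussed here.

```python
# Flag values as defined in CPython's Include/code.h (hard-coded assumption).
CO_OPTIMIZED = 0x0001
CO_NEWLOCALS = 0x0002
CO_NOFREE = 0x0040

flags = CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE
print(flags)  # 67

# Decompose a flags value back into the names of the bits that are set.
names = {
    CO_OPTIMIZED: "CO_OPTIMIZED",
    CO_NEWLOCALS: "CO_NEWLOCALS",
    CO_NOFREE: "CO_NOFREE",
}
print([name for bit, name in names.items() if 67 & bit])
```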

The most important argument is the codestring, which contains the instruction opcodes. Let’s see what they mean.

>>> import dis
>>> dis.dis(b'|\0\202\1\0')
          0 LOAD_FAST                0 (0)
          2 RAISE_VARARGS            1
          4 <0>

The documentation of opcodes can be found here https://docs.python.org/3.8/library/dis.html#python-bytecode-instructions. The first byte is the opcode for LOAD_FAST, the second byte is its argument, i.e. 0.

LOAD_FAST(var_num)
    Pushes a reference to the local co_varnames[var_num] onto the stack.

So we push the reference to x onto the stack. varnames is a tuple of strings containing only 'x'. We push the only argument of the function we are defining onto the stack.

The next byte is the opcode for RAISE_VARARGS and the next byte is its argument i.e. 1.

RAISE_VARARGS(argc)
    Raises an exception using one of the 3 forms of the raise statement, depending on the value of argc:
        0: raise (re-raise previous exception)
        1: raise TOS (raise exception instance or type at TOS)
        2: raise TOS1 from TOS (raise exception instance or type at TOS1 with __cause__ set to TOS)

The TOS is the top-of-stack. Since we pushed the first argument (x) of our function to the stack and argc is 1 we will raise the x if it is an exception instance or make an instance of x and raise it otherwise.
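At the source level, the same behaviour can be observed with an ordinary function. This is only a sketch of the "instance or class" rule; `raise_it` is a hypothetical name and KeyError is an arbitrary choice of exception.

```python
def raise_it(x):
    # Source-level counterpart of "push x, then raise what is on top of the stack".
    raise x

caught = []
for arg in (KeyError, KeyError("boom")):  # first a class, then an instance
    try:
        raise_it(arg)
    except KeyError as exc:
        caught.append(exc)

print([type(e).__name__ for e in caught])  # ['KeyError', 'KeyError']
print([e.args for e in caught])            # [(), ('boom',)]
```

In the first case Python instantiates the class with no arguments before raising it, which is why args is empty.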

The last byte i.e. 0 is not used. It is not a valid opcode. It might as well not be there.

Going back to the code snippet we are analyzing:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)(Exception())

We called the constructor of the code object:

type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b'')

We pass the code object and an empty dictionary to the constructor of a function object:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)

Let’s call help on a function object to see what the arguments mean.

Help on class function in module builtins:

class function(object)
 |  function(code, globals, name=None, argdefs=None, closure=None)
 |  
 |  Create a function object.
 |  
 |  code
 |    a code object
 |  globals
 |    the globals dictionary
 |  name
 |    a string that overrides the name from the code object
 |  argdefs
 |    a tuple that specifies the default argument values
 |  closure
 |    a tuple that supplies the bindings for free variables

We then call the constructed function passing an Exception instance as an argument. Consequently we called a lambda function which raises an exception. Let’s run the snippet and see that it indeed works as intended.

>>> type(lambda: 0)(type((lambda: 0).__code__)(
...     1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
... )(Exception())
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "", line 1, in 
Exception

Improvements

We saw that the last byte of the bytecode is useless. Let’s not clutter this complicated expression needlessly. Let’s remove that byte. Also, if we want to golf a little, we could omit the instantiation of Exception and instead pass the Exception class as an argument. Those changes would result in the following code:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1',(),(),('x',),'','',1,b''),{}
)(Exception)

When we run it we will get the same result as before. It’s just shorter.


What exactly is Python's file.flush() doing?

Question: What exactly is Python's file.flush() doing?

I found this in the Python documentation for File Objects:

flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.

So my question is: what exactly is Python’s flush doing? I thought that it forces the data to be written to disk, but now I see that it doesn’t. Why?


Answer 0

There are typically two levels of buffering involved:

  1. Internal buffers
  2. Operating system buffers

The internal buffers are buffers created by the runtime/library/language that you’re programming against and are meant to speed things up by avoiding system calls for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.

However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.

If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.

So, in order to help with that you have the flush and fsync methods, on their respective objects.

The first, flush, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.

Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been “permanently” stored on disk.

To do that, you need to call the os.fsync method which ensures all operating system buffers are synchronized with the storage devices they’re for, in other words, that method will copy data from the operating system buffers to the disk.

Typically you don’t need to bother with either method, but if you’re in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.
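A minimal sketch of "both calls as instructed", assuming a scratch file in a temporary directory (the path and the contents are made up for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical scratch file

with open(path, "w") as fp:
    fp.write("important data\n")
    fp.flush()             # program buffer -> operating system buffer
    os.fsync(fp.fileno())  # operating system buffer -> storage device

with open(path) as fp:
    print(fp.read(), end="")  # important data
```

The order matters: calling os.fsync() before flush() would sync the file without the data that is still sitting in the program's own buffer.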


Addendum in 2018.

Note that disks with cache mechanisms are now much more common than back in 2013, so now there are even more levels of caching and buffers involved. I assume these buffers will be handled by the sync/flush calls as well, but I don’t really know.


Answer 1

Because the operating system may not do so. The flush operation forces the file data into the file cache in RAM, and from there it’s the OS’s job to actually send it to the disk.


Answer 2

It flushes the internal buffer, which is supposed to cause the OS to write out the buffer to the file.[1] Python uses the OS’s default buffering unless you configure it to do otherwise.

But sometimes the OS still chooses not to cooperate. Especially with wonderful things like write-delays in Windows/NTFS. Basically the internal buffer is flushed, but the OS buffer is still holding on to it. So you have to tell the OS to write it to disk with os.fsync() in those cases.

[1] http://docs.python.org/library/stdtypes.html


Answer 3

Basically, flush() cleans out your RAM buffer. Its real power is that it lets you continue to write to the file afterwards, but it shouldn’t be thought of as the best/safest write-to-file feature. It’s flushing your RAM for more data to come, that is all. If you want to ensure data gets written to the file, use close() instead (which also flushes; like flush(), it does not call os.fsync()).


Getting the number of elements in an iterator in Python

Question: Getting the number of elements in an iterator in Python

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?


Answer 0

No. It’s not possible.

Example:

import random

def gen(n):
    for i in range(n):  # xrange on Python 2
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

Length of iterator is unknown until you iterate through it.


Answer 1

This code should work:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works when the iterator has no items:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.
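The exhaustion effect is easy to see in a short sketch (`count_items` is a hypothetical wrapper around the sum() idiom above):

```python
def count_items(iterable):
    """Count items by consuming the iterable one element at a time."""
    return sum(1 for _ in iterable)

squares = (n * n for n in range(50))
print(count_items(squares))  # 50
print(count_items(squares))  # 0, the generator is already exhausted

# Keeping the data means materialising it first.
kept = list(n * n for n in range(50))
print(len(kept), kept[:3])   # 50 [0, 1, 4]
```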


Answer 2

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.


Answer 3

You cannot (unless the type of a particular iterator implements some specific methods that make it possible).

Generally, you can count iterator items only by consuming the iterator. Probably one of the most efficient ways:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).
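With that substitution applied, a Python 3 version of the same function might look as follows (a sketch; the sample inputs are arbitrary):

```python
import itertools
from collections import deque

def count_iter_items(iterable):
    """Consume an iterable without storing its items; return the number of items."""
    counter = itertools.count()
    # zip() pulls one value from `counter` for every item of `iterable`;
    # a deque with maxlen=0 discards each pair immediately.
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)

print(count_iter_items(x for x in range(1000)))  # 1000
print(count_iter_items(iter([])))                # 0
```

The trick relies on zip() asking the iterable first: when the iterable is exhausted, the counter is not advanced, so the next value of the counter equals the number of items seen.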


Answer 4

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it is an undocumented implementation detail (following message in thread) that could very well vanish or summon nasal demons instead.

Otherwise, no. An iterator is just an object that exposes only the next() method (__next__() in Python 3). You can call it as many times as required, and it may or may not eventually raise StopIteration. Luckily, this behaviour is transparent to the coder most of the time. :)
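
For what it's worth, since Python 3.4 the hint is reachable through operator.length_hint(). The sketch below is only an illustration of how the hint behaves on a list iterator versus a generator; the variable names are made up for the example, and the value returned is an estimate, not a guaranteed length.

```python
import operator

# A list iterator knows how many items remain and reports it as a hint.
numbers = iter([1, 2, 3])
hint_before = operator.length_hint(numbers)
next(numbers)
hint_after = operator.length_hint(numbers)

# A generator offers no hint, so the supplied default is returned instead.
generator = (n for n in range(3))
hint_generator = operator.length_hint(generator, -1)

print(hint_before, hint_after, hint_generator)
```

Because the hint may be missing or inexact, it is better suited to presizing a container than to answering "how long is this iterator?".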


Answer 5

I like the cardinality package for this: it is very lightweight and tries to use the fastest implementation available, depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

import collections

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

Answer 6

So, for those who would like a summary of that discussion, here are the final results for counting a generator expression of 50 million items using:

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertools),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by execution performance (including memory consumption). They may surprise you:

1: test_list.py:8: 0.492 KiB
    gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))
('list, sec', 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB
    gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])
('list_compr, sec', 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB
    gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()
('sum, sec', 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB
    d = deque(enumerate(iterable, 1), maxlen=1)
test_ilen.py:10: 0.875 KiB
    gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)
('ilen, sec', 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB
    gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)
('reduce, sec', 13.436614598002052)

So, in this test, len(list(gen)) is both the fastest and the one with the smallest reported memory figure.
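
For anyone who wants to repeat a comparison of this kind, here is a rough sketch. It is not the script that produced the numbers above: the measure() helper, the candidate list and the much smaller item count are assumptions made for illustration, and it records peak traced memory rather than the per-line figures shown in the listing.

```python
import time
import tracemalloc
from functools import reduce


def measure(label, count_fn, n=100_000):
    """Time count_fn on a fresh generator of n items and record peak traced memory."""
    gen = (i for i in range(n))
    tracemalloc.start()
    t0 = time.monotonic()
    result = count_fn(gen)
    elapsed = time.monotonic() - t0
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return label, result, elapsed, peak


# Each candidate takes a generator and returns the number of items it yielded.
candidates = [
    ("list", lambda g: len(list(g))),
    ("list_compr", lambda g: len([_ for _ in g])),
    ("sum", lambda g: sum(1 for _ in g)),
    ("reduce", lambda g: reduce(lambda c, _: c + 1, g, 0)),
]

results = [measure(label, fn) for label, fn in candidates]
for label, result, elapsed, peak in results:
    print(f"{label:<10} count={result} time={elapsed:.4f}s peak={peak / 1024:.1f} KiB")
```

Timings and memory figures depend heavily on the Python version and the machine, so treat any ranking as specific to the environment it was measured in.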


Answer 7

An iterator is just an object that holds a pointer to the next object to be read from some kind of buffer or stream. It is like a linked list: you don't know how many items you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what comes next by reference instead of using indexing (but, as you saw, you lose the ability to see how many entries remain).


Answer 8

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to PySAM, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by reading a number of alignments, taking the location of the file pointer at that point, and extrapolating based on the total size of the file. This is enough to implement a progress bar, but it is not a method of counting alignments in constant time.
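
The extrapolation idea can be sketched on an ordinary newline-delimited file. This is only an illustration under stated assumptions: estimate_record_count() and its sample_size parameter are hypothetical names, it uses plain file I/O rather than any pysam API, and it assumes records are of roughly similar size.

```python
import os
import tempfile


def estimate_record_count(path, sample_size=100):
    """Estimate the number of newline-terminated records in a file.

    Reads up to sample_size records, then extrapolates from the bytes
    consumed so far to the total file size.
    """
    total_bytes = os.path.getsize(path)
    read = 0
    with open(path, "rb") as fp:
        # Binary mode keeps tell() meaningful as a byte offset.
        while read < sample_size and fp.readline():
            read += 1
        consumed = fp.tell()
    if read < sample_size or consumed == 0:
        return read  # the whole file was read, so the count is exact
    return round(read * total_bytes / consumed)


# Demo on a temporary file holding 1000 identical 7-byte records.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"record\n" * 1000)
    demo_path = tmp.name

estimate = estimate_record_count(demo_path)
os.remove(demo_path)
print(estimate)
```

With records of varying size the result is only approximate, which is fine for a progress bar but not for an exact count.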


Answer 9

A quick benchmark (Python 2 code, run with IPython's %timeit):

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

That is, the simple count_iter_items is the way to go.

Adjusting this for Python 3 (zip in place of itertools.izip, range in place of xrange):

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 10

There are two ways to get the length of “something” on a computer.

The first way is to store a count. This requires anything that touches the file/data to update it (or a class that only exposes interfaces, but it boils down to the same thing).

The other way is to iterate over it and count how big it is.


Answer 11

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.


Answer 12

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.


Answer 13

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
...
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
48
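
If pulling in a package is not an option, the same idea can be sketched in a few lines. The CountingIterator class below is a hypothetical stand-in written for this illustration; it is not the implementation of jaraco.itertools.Counter.

```python
class CountingIterator:
    """Hypothetical wrapper: yields items unchanged and counts them as they pass."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._iterator)  # StopIteration propagates at the end
        self.count += 1
        return item


items = CountingIterator(x for x in range(100) if x % 3 == 0)
consumed = list(items)
print(items.count)
```

The count is only meaningful for the items consumed so far, so read it after the loop that uses the iterator has finished.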

Answer 14

def count_iter(iterable):
    # Count items by consuming the iterable; the names avoid shadowing the built-ins iter and sum.
    total = 0
    for _ in iterable:
        total += 1
    return total

Answer 15

Presumably, you want to count the number of items without exhausting the iterator, so that you can use it again later. This is possible with copy or deepcopy, provided the object supports copying (generators, for example, do not):

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is 10, followed by "Finding the length did not exhaust the iterator!"
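
Note that range is a re-iterable sequence rather than an iterator, so it would not have been exhausted in any case. The sketch below, with variable names chosen only for illustration, tries the same helper on a real list iterator, which can be copied, and on a generator, which cannot.

```python
import copy


def get_iter_len(iterator):
    # Count the items of a copy, leaving the original untouched.
    return sum(1 for _ in copy.copy(iterator))


list_iterator = iter([10, 20, 30])
length = get_iter_len(list_iterator)
remaining = list(list_iterator)  # the original still yields every item

generator = (x for x in range(3))
try:
    get_iter_len(generator)
    generator_copied = True
except TypeError:
    # Generators cannot be copied, so this approach does not apply to them.
    generator_copied = False

print(length, remaining, generator_copied)
```

So the copy trick helps only for iterators that happen to support copying; for arbitrary generators the items still have to be consumed to be counted.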

Optionally (and unadvisedly), you can shadow the built-in len function as follows:

import copy

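def len(obj, *, len=len):
    # The keyword-only default captures the built-in len before it is shadowed.
    if hasattr(obj, "__len__"):
        return len(obj)
    if hasattr(obj, "__next__"):
        # Count a copy so the original iterator is left untouched.
        return sum(1 for _ in copy.copy(obj))
    return len(obj)  # raises TypeError, as the built-in would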