Python “extend” for a dictionary

Question: Python “extend” for a dictionary

Which is the best way to extend a dictionary with another one? For instance:

>>> a = { "a" : 1, "b" : 2 }
>>> b = { "c" : 3, "d" : 4 }
>>> a
{'a': 1, 'b': 2}
>>> b
{'c': 3, 'd': 4}

I’m looking for any operation to obtain this avoiding for loop:

{ "a" : 1, "b" : 2, "c" : 3, "d" : 4 }

I wish to do something like:

a.extend(b)  # This does not work

Answer 1

A beautiful gem in this closed question:

The “oneliner way”, altering neither of the input dicts, is

basket = dict(basket_one, **basket_two)

Learn what **basket_two (the **) means here.

In case of conflict, the items from basket_two will override the ones from basket_one. As one-liners go, this is pretty readable and transparent, and I have no compunction against using it any time a dict that’s a mix of two others comes in handy (any reader who has trouble understanding it will in fact be very well served by the way this prompts him or her towards learning about dict and the ** form;-). So, for example, uses like:

x = mungesomedict(dict(adict, **anotherdict))

are reasonably frequent occurrences in my code.

Originally submitted by Alex Martelli

Note: In Python 3, this will only work if every key in basket_two is a string.
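
To illustrate the note above, here is a quick sketch of the Python 3 caveat (the {**a, **b} form shown in a later answer requires Python 3.5+):

a = {"a": 1}
b = {2: "two"}       # non-string key

# dict(a, **b)       # TypeError: keywords must be strings
merged = {**a, **b}  # works: {'a': 1, 2: 'two'}
print(merged)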


Answer 2

Have you tried using dictionary comprehension with dictionary mapping:

a = {'a': 1, 'b': 2}
b = {'c': 3, 'd': 4}

c = {**a, **b}
# c = {"a": 1, "b": 2, "c": 3, "d": 4}

Another way of doing it is by using dict(iterable, **kwarg):

c = dict(a, **b)
# c = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

In Python 3.9 you can merge two dicts using the union operator |:

# use the merging operator |
c = a | b
# c = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

Answer 3

a.update(b)

Will add keys and values from b to a, overwriting if there’s already a value for a key.
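
For example, a small demonstration of the overwriting behaviour:

a = {"a": 1, "b": 2}
b = {"b": 20, "c": 3}
a.update(b)
print(a)  # {'a': 1, 'b': 20, 'c': 3} -- b's value wins for the shared key 'b'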


Answer 4

As others have mentioned, a.update(b) for some dicts a and b will achieve the result you’ve asked for in your question. However, I want to point out that many times I have seen the extend method of mapping/set objects desire that in the syntax a.extend(b), a‘s values should NOT be overwritten by b‘s values. a.update(b) overwrites a‘s values, and so isn’t a good choice for extend.

Note that some languages call this method defaults or inject, as it can be thought of as a way of injecting b’s values (which might be a set of default values) in to a dictionary without overwriting values that might already exist.

Of course, you could simple note that a.extend(b) is nearly the same as b.update(a); a=b. To remove the assignment, you could do it thus:

def extend(a,b):
    """Create a new dictionary with a's properties extended by b,
    without overwriting.

    >>> extend({'a':1,'b':2},{'b':3,'c':4})
    {'a': 1, 'c': 4, 'b': 2}
    """
    return dict(b,**a)

Thanks to Tom Leys for that smart idea using a side-effect-less dict constructor for extend.


Answer 5

You can also use Python’s collections.ChainMap, which was introduced in Python 3.3.

from collections import ChainMap
c = ChainMap(a, b)
c['a'] # returns 1

This has a few possible advantages, depending on your use-case. They are explained in more detail here, but I’ll give a brief overview:

  • A ChainMap only uses views of the dictionaries, so no data is actually copied. This results in faster chaining (but slower lookups).
  • No keys are actually overwritten so, if necessary, you know whether the data comes from a or b.

This mainly makes it useful for things like configuration dictionaries.
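
A small sketch of the view behaviour described above:

from collections import ChainMap

a = {"a": 1, "b": 2}
b = {"c": 3, "d": 4}
c = ChainMap(a, b)

print(c["a"])   # 1 -- found in a, the first map
a["a"] = 100    # changes to the underlying dicts are visible through c
print(c["a"])   # 100
print(dict(c))  # materialize a plain merged dict if needed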


Answer 6

In case you need it as a class, you can subclass dict and use the update method:

class a(dict):
    def __init__(self, b):
        super().__init__()
        # some stuff
        self.update(b)  # merge b's entries into this dict
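
A brief usage sketch, assuming the fixed definition above:

b = {"c": 3, "d": 4}
obj = a(b)  # obj starts out with b's items merged in
print(obj)  # {'c': 3, 'd': 4}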

How do I access the ith column of a NumPy multidimensional array?

Question: How do I access the ith column of a NumPy multidimensional array?

Suppose I have:

test = numpy.array([[1, 2], [3, 4], [5, 6]])

test[i] gets me the ith row of the array (e.g. [1, 2]). How can I access the ith column? (e.g. [1, 3, 5]). Also, would this be an expensive operation?


Answer 0

>>> test[:,0]
array([1, 3, 5])

Similarly,

>>> test[1,:]
array([3, 4])

lets you access rows. This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It’s certainly much quicker than accessing each element in a loop.
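
As a quick check that the column slice is a cheap view rather than a copy:

import numpy as np

test = np.array([[1, 2], [3, 4], [5, 6]])
col = test[:, 0]
print(col.base is test)  # True -- a view, no data copied
col[0] = 99
print(test[0, 0])        # 99 -- writes through to the original array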


Answer 1

If you want to access more than one column at a time, you could do:

>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
       [3, 5],
       [6, 8]])

Answer 2

>>> test[:,0]
array([1, 3, 5])

this command gives you a row vector. That’s fine if you just want to loop over it, but if you want to hstack it with some other array of dimension 3xN, you will get

ValueError: all the input arrays must have same number of dimensions

while

>>> test[:,[0]]
array([[1],
       [3],
       [5]])

gives you a column vector, so that you can do concatenate or hstack operation.

e.g.

>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
       [3, 4, 3],
       [5, 6, 5]])

Answer 3

You could also transpose and return a row:

In [4]: test.T[0]
Out[4]: array([1, 3, 5])

Answer 4

To get several independent columns, just:

> test[:,[0,2]]

you will get columns 0 and 2.


Answer 5

Although the question has been answered, let me mention some nuances.

Let’s say you are interested in the first column of the array

arr = numpy.array([[1, 2],
                   [3, 4],
                   [5, 6]])

As you already know from other answers, to get it in the form of “row vector” (array of shape (3,)), you use slicing:

arr_c1_ref = arr[:, 1]  # creates a reference to the 1st column of the arr
arr_c1_copy = arr[:, 1].copy()  # creates a copy of the 1st column of the arr

To check if an array is a view or a copy of another array you can do the following:

arr_c1_ref.base is arr  # True
arr_c1_copy.base is arr  # False

see ndarray.base.

Besides the obvious difference between the two (modifying arr_c1_ref will affect arr), the number of byte-steps for traversing each of them is different:

arr_c1_ref.strides[0]  # 8 bytes
arr_c1_copy.strides[0]  # 4 bytes

see strides. Why is this important? Imagine that you have a very big array A instead of the arr:

A = np.random.randint(2, size=(10000,10000), dtype='int32')
A_c1_ref = A[:, 1] 
A_c1_copy = A[:, 1].copy()

and you want to compute the sum of all the elements of the first column, i.e. A_c1_ref.sum() or A_c1_copy.sum(). Using the copied version is much faster:

%timeit A_c1_ref.sum()  # ~248 µs
%timeit A_c1_copy.sum()  # ~12.8 µs

This is due to the different number of strides mentioned before:

A_c1_ref.strides[0]  # 40000 bytes
A_c1_copy.strides[0]  # 4 bytes

Although it might seem that using column copies is better, it is not always true for the reason that making a copy takes time and uses more memory (in this case it took me approx. 200 µs to create the A_c1_copy). However if we need the copy in the first place, or we need to do many different operations on a specific column of the array and we are ok with sacrificing memory for speed, then making a copy is the way to go.

In the case that we are interested in working mostly with columns, it could be a good idea to create our array in column-major (‘F’) order instead of the row-major (‘C’) order (which is the default), and then do the slicing as before to get a column without copying it:

A = np.asfortranarray(A)  # or np.array(A, order='F')
A_c1_ref = A[:, 1]
A_c1_ref.strides[0]  # 4 bytes
%timeit A_c1_ref.sum()  # ~12.6 µs vs ~248 µs

Now, performing the sum operation (or any other) on a column-view is much faster.

Finally let me note that transposing an array and using row-slicing is the same as using the column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array.

A.T[1,:].strides[0]  # 40000

Answer 6

>>> test
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

>>> ncol = test.shape[1]
>>> ncol
5L

Then you can select the 2nd – 4th column this way:

>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
       [6, 7, 8]])

How do I correctly clean up a Python object?

Question: How do I correctly clean up a Python object?

class Package:
    def __init__(self):
        self.files = []

    # ...

    def __del__(self):
        for file in self.files:
            os.unlink(file)

__del__(self) above fails with an AttributeError exception. I understand Python doesn’t guarantee the existence of “global variables” (member data in this context?) when __del__() is invoked. If that is the case and this is the reason for the exception, how do I make sure the object destructs properly?


Answer 0

I’d recommend using Python’s with statement for managing resources that need to be cleaned up. The problem with using an explicit close() statement is that you have to worry about people forgetting to call it at all or forgetting to place it in a finally block to prevent a resource leak when an exception occurs.

To use the with statement, create a class with the following methods:

  def __enter__(self)
  def __exit__(self, exc_type, exc_value, traceback)

In your example above, you’d use

class Package:
    def __init__(self):
        self.files = []

    def __enter__(self):
        return self

    # ...

    def __exit__(self, exc_type, exc_value, traceback):
        for file in self.files:
            os.unlink(file)

Then, when someone wanted to use your class, they’d do the following:

with Package() as package_obj:
    # use package_obj

The variable package_obj will be an instance of type Package (it’s the value returned by the __enter__ method). Its __exit__ method will automatically be called, regardless of whether or not an exception occurs.

You could even take this approach a step further. In the example above, someone could still instantiate Package using its constructor without using the with clause. You don’t want that to happen. You can fix this by creating a PackageResource class that defines the __enter__ and __exit__ methods. Then, the Package class would be defined strictly inside the __enter__ method and returned. That way, the caller never could instantiate the Package class without using a with statement:

class PackageResource:
    def __enter__(self):
        class Package:
            ...
        self.package_obj = Package()
        return self.package_obj

    def __exit__(self, exc_type, exc_value, traceback):
        self.package_obj.cleanup()

You’d use this as follows:

with PackageResource() as package_obj:
    # use package_obj
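
Putting it together, here is a minimal runnable sketch; the add_file helper and the use of tempfile are my additions for illustration:

import os
import tempfile

class Package:
    def __init__(self):
        self.files = []

    def add_file(self):
        # create a real temporary file and track it for cleanup
        fd, path = tempfile.mkstemp()
        os.close(fd)
        self.files.append(path)
        return path

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for file in self.files:
            os.unlink(file)

with Package() as package_obj:
    path = package_obj.add_file()
    print(os.path.exists(path))  # True
print(os.path.exists(path))      # False -- cleaned up on exit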

Answer 1

The standard way is to use atexit.register:

# package.py
import atexit
import os

class Package:
    def __init__(self):
        self.files = []
        atexit.register(self.cleanup)

    def cleanup(self):
        print("Running cleanup...")
        for file in self.files:
            print("Unlinking file: {}".format(file))
            # os.unlink(file)

But you should keep in mind that this will persist all created instances of Package until Python is terminated.

Demo using the code above saved as package.py:

$ python
>>> from package import *
>>> p = Package()
>>> q = Package()
>>> q.files = ['a', 'b', 'c']
>>> quit()
Running cleanup...
Unlinking file: a
Unlinking file: b
Unlinking file: c
Running cleanup...
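
If keeping every instance alive until exit is a concern, one possible variation (my suggestion, not part of the original answer) is to unregister the callback once cleanup has been run explicitly, using atexit.unregister (Python 3):

import atexit
import os

class Package:
    def __init__(self):
        self.files = []
        atexit.register(self.cleanup)

    def cleanup(self):
        for file in self.files:
            os.unlink(file)
        self.files = []
        atexit.unregister(self.cleanup)  # don't run again at interpreter exit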

Answer 2

As an appendix to Clint’s answer, you can simplify PackageResource using contextlib.contextmanager:

import contextlib

@contextlib.contextmanager
def packageResource():
    class Package:
        ...
    package = Package()
    yield package
    package.cleanup()

Alternatively, though probably not as Pythonic, you can override Package.__new__:

class Package(object):
    def __new__(cls, *args, **kwargs):
        @contextlib.contextmanager
        def packageResource():
            # adapt arguments if superclass takes some!
            package = super(Package, cls).__new__(cls)
            package.__init__(*args, **kwargs)
            yield package
            package.cleanup()
        return packageResource()  # without this, __new__ would return None

    def __init__(self, *args, **kwargs):
        ...

and simply use with Package(...) as package.

To get things shorter, name your cleanup function close and use contextlib.closing, in which case you can either use the unmodified Package class via with contextlib.closing(Package(...)) or override its __new__ to the simpler

class Package(object):
    def __new__(cls, *args, **kwargs):
        package = super(Package, cls).__new__(cls)
        package.__init__(*args, **kwargs)
        return contextlib.closing(package)

And this constructor is inherited, so you can simply inherit, e.g.

class SubPackage(Package):
    def close(self):
        pass

Answer 3

I don’t think that it’s possible for instance members to be removed before __del__ is called. My guess would be that the reason for your particular AttributeError is somewhere else (maybe you mistakenly remove self.file elsewhere).

However, as the others pointed out, you should avoid using __del__. The main reason for this is that instances with __del__ will not be garbage collected (they will only be freed when their refcount reaches 0). Therefore, if your instances are involved in circular references, they will live in memory for as long as the application runs. (I may be mistaken about all this though, I’d have to read the gc docs again, but I’m rather sure it works like this.)


Answer 4

A better alternative is to use weakref.finalize. See the examples in “Finalizer Objects” and “Comparing finalizers with __del__() methods” in the weakref documentation.
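
A minimal sketch along the lines of the example in the weakref documentation, adapted to the Package class from the question; the callback must not hold a reference to self, so the file list is passed in directly:

import os
import weakref

class Package:
    def __init__(self):
        self.files = []
        # runs when the instance is garbage collected or at interpreter
        # exit, whichever comes first; note it never references self
        self._finalizer = weakref.finalize(self, Package._cleanup, self.files)

    @staticmethod
    def _cleanup(files):
        for file in files:
            os.unlink(file)

    def remove(self):
        self._finalizer()  # idempotent: cleanup runs at most once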


Answer 5

I think the problem could be in __init__ if there is more code than shown?

__del__ will be called even when __init__ has not been executed properly or threw an exception.

Source
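
A small hypothetical sketch reproducing the AttributeError from the question this way:

import os

class Package:
    def __init__(self):
        raise RuntimeError("failed before self.files was assigned")
        self.files = []  # never reached

    def __del__(self):
        # self.files was never set, so this raises AttributeError,
        # which Python reports as "Exception ignored in: ..."
        for file in self.files:
            os.unlink(file)

try:
    Package()
except RuntimeError:
    pass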


Answer 6

Here is a minimal working skeleton:

class SkeletonFixture:

    def __init__(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        pass

    def method(self):
        pass


with SkeletonFixture() as fixture:
    fixture.method()

Important: return self


If you’re like me, and overlook the return self part (of Clint Miller’s correct answer), you will be staring at this nonsense:

Traceback (most recent call last):
  File "tests/simplestpossible.py", line 17, in <module>                                                                                                                                                          
    fixture.method()                                                                                                                                                                                              
AttributeError: 'NoneType' object has no attribute 'method'

Hope it helps the next person.


Answer 7

Just wrap your destructor with a try/except statement and it will not throw an exception if your globals are already disposed of.

Edit

Try this:

from weakref import proxy

class MyList(list): pass

class Package:
    def __init__(self):
        self.__del__.im_func.files = MyList([1,2,3,4])
        self.files = proxy(self.__del__.im_func.files)

    def __del__(self):
        print self.__del__.im_func.files

It will stuff the file list in the del function that is guaranteed to exist at the time of call. The weakref proxy is to prevent Python, or yourself from deleting the self.files variable somehow (if it is deleted, then it will not affect the original file list). If it is not the case that this is being deleted even though there are more references to the variable, then you can remove the proxy encapsulation.


Answer 8

It seems that the idiomatic way to do this is to provide a close() method (or similar), and call it explicitly.


Is there a NumPy function to return the first index of something in an array?

Question: Is there a NumPy function to return the first index of something in an array?

I know there is a method for a Python list to return the first index of something:

>>> l = [1, 2, 3]
>>> l.index(2)
1

Is there something like that for NumPy arrays?


Answer 0

Yes, here is the answer given a NumPy array, array, and a value, item, to search for:

itemindex = numpy.where(array==item)

The result is a tuple with first all the row indices, then all the column indices.

For example, if an array is two-dimensional and contains your item at two locations, then

array[itemindex[0][0]][itemindex[1][0]]

would be equal to your item and so would

array[itemindex[0][1]][itemindex[1][1]]

numpy.where
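
For instance, a small runnable sketch with a 2-D array that contains the item at two locations:

import numpy as np

array = np.array([[1, 2, 3],
                  [4, 2, 6]])
item = 2

itemindex = np.where(array == item)
print(itemindex)  # (array([0, 1]), array([1, 1]))
print(array[itemindex[0][0]][itemindex[1][0]])  # 2, at row 0, column 1
print(array[itemindex[0][1]][itemindex[1][1]])  # 2, at row 1, column 1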


Answer 1

If you need the index of the first occurrence of only one value, you can use nonzero (or where, which amounts to the same thing in this case):

>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6

If you need the first index of each of many values, you could obviously do the same as above repeatedly, but there is a trick that may be faster. The following finds the indices of the first element of each subsequence:

>>> nonzero(r_[1, diff(t)[:-1]])
(array([0, 3, 5, 6, 7, 8]),)

Notice that it finds the beginnings of both subsequences of 3s and both subsequences of 8s:

[1, 1, 1, 2, 2, 3, 8, 3, 8, 8]

So it’s slightly different than finding the first occurrence of each value. In your program, you may be able to work with a sorted version of t to get what you want:

>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)[:-1]])
(array([0, 3, 5, 7]),)

Answer 2

You can also convert a NumPy array to a list on the fly and get its index. For example,

l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i will return index of 2
print i

It will print 1.


Answer 3

Just to add a very performant and handy alternative based on np.ndenumerate to find the first index:

from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numbas type inference.

This is pretty fast and deals naturally with multidimensional arrays:

>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2

>>> index(arr1, 2)
(2, 2, 2)

>>> arr2 = np.ones(20)
>>> arr2[5] = 2

>>> index(arr2, 2)
(5,)

This can be much faster (because it’s short-circuiting the operation) than any approach using np.where or np.nonzero.


However np.argwhere could also deal gracefully with multidimensional arrays (you would need to manually cast it to a tuple and it’s not short-circuited) but it would fail if no match is found:

>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)

Answer 4

If you’re going to use this as an index into something else, you can use boolean indices if the arrays are broadcastable; you don’t need explicit indices. The absolute simplest way to do this is to simply index based on a truth value.

other_array[first_array == item]

Any boolean operation works:

a = numpy.arange(100)
other_array[first_array > 50]

The nonzero method takes booleans, too:

index = numpy.nonzero(first_array == item)[0][0]

The two zeros are for the tuple of indices (assuming first_array is 1D) and then the first item in the array of indices.
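
A self-contained sketch of the idea, with hypothetical first_array and other_array:

import numpy as np

first_array = np.array([10, 20, 30, 40])
other_array = np.array(["a", "b", "c", "d"])
item = 30

print(other_array[first_array == item])  # ['c']
print(other_array[first_array > 15])     # ['b' 'c' 'd']

index = np.nonzero(first_array == item)[0][0]
print(index)                             # 2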


Answer 5

l.index(x) returns the smallest i such that i is the index of the first occurrence of x in the list.

One can safely assume that the index() function in Python is implemented so that it stops after finding the first match, and this results in an optimal average performance.

To find an element in a NumPy array, stopping after the first match, use an iterator (ndenumerate).

In [67]: l=range(100)

In [68]: l.index(2)
Out[68]: 2

NumPy array:

In [69]: a = np.arange(100)

In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)

Note that both index() and next raise an error if the element is not found. With next, one can use a second argument to return a special value in case the element is not found, e.g.

In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)

There are other functions in NumPy (argmax, where, and nonzero) that can be used to find an element in an array, but they all have the drawback of going through the whole array looking for all occurrences, thus not being optimized for finding the first element. Note also that where and nonzero return arrays, so you need to select the first element to get the index.

In [71]: np.argmax(a==2)
Out[71]: 2

In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)

In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)

Time comparison

Just checking that for large arrays the solution using an iterator is faster when the searched item is at the beginning of the array (using %timeit in the IPython shell):

In [285]: a = np.arange(100000)

In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop

In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop

In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop

This is an open NumPy GitHub issue.

See also: Numpy: find first index of value fast


Answer 6

For one-dimensional sorted arrays, it is much simpler and more efficient (O(log n)) to use numpy.searchsorted, which returns a NumPy integer (position). For example,

arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)

Just make sure the array is already sorted

Also check whether the returned index i actually refers to the searched element, since searchsorted’s main objective is to find the index where the element should be inserted to maintain order.

if arr[i] == 3:
    print("present")
else:
    print("not present")

Answer 7

To index on any criteria, you can do something like the following:

In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
   .....:         print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4

And here’s a quick function to do what list.index() does, except it doesn’t raise an exception if the item isn’t found. Beware: this is probably very slow on large arrays. You can monkey-patch this onto arrays if you’d rather use it as a method.

def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass

In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]

Answer 8

For 1D arrays, I’d recommend np.flatnonzero(array == value)[0], which is equivalent to both np.nonzero(array == value)[0][0] and np.where(array == value)[0][0] but avoids the ugliness of unboxing a 1-element tuple.
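
For example, a quick sketch:

import numpy as np

array = np.array([10, 20, 30, 20])
value = 20

print(np.flatnonzero(array == value))     # [1 3]
print(np.flatnonzero(array == value)[0])  # 1 -- first index, no tuple unboxing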


Answer 9

An alternative to selecting the first element from np.where() is to use a generator expression together with enumerate, such as:

>>> import numpy as np
>>> x = np.arange(100)   # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2

For a two-dimensional array one would do:

>>> x = np.arange(100).reshape(10,10)   # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x) 
...            for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)

The advantage of this approach is that it stops checking the elements of the array after the first match is found, whereas np.where checks all elements for a match. A generator expression would be faster if there’s a match early in the array.


Answer 10

There are lots of operations in NumPy that could perhaps be put together to accomplish this. This will return the indices of elements equal to item:

numpy.nonzero(array == item)

You could then take the first element of each array in the returned tuple to get a single index.


Answer 11

The numpy_indexed package (disclaimer, I am its author) contains a vectorized equivalent of list.index for numpy.ndarray; that is:

sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]

import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx)   # [2, -1]

This solution has vectorized performance, generalizes to ndarrays, and has various ways of dealing with missing values.


Answer 12

Note: this is for Python 2.7.

You can use a lambda function to deal with the problem, and it works on both NumPy arrays and lists.

your_list = [11, 22, 23, 44, 55]
result = filter(lambda x:your_list[x]>30, range(len(your_list)))
#result: [3, 4]

import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x: your_numpy_array[x] > 30, range(len(your_numpy_array)))
#result: [3, 4]

And you can use

result[0]

to get the first index of the filtered elements.

For Python 3, use

list(result)

instead of

result

Is it worth using Python’s re.compile?

Question: Is it worth using Python’s re.compile?

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')

Answer 0

I’ve had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I’ve found the difference to be negligible.

EDIT: After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you’re really only changing WHEN the regex gets compiled, and shouldn’t be saving much time at all – only the time it takes to check the cache (a key lookup on an internal dict type).

From module re.py (comments are mine):

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):

    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None: return p

    # ...
    # Does actual compilation on cache miss
    # ...

    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.


Answer 1

For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you’d rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it’s nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It’s certainly possible to store strings and pass them to re.match; however, that’s less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.


Answer 2

FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you’re going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don’t think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.


Answer 3

Here’s a simple test case:

~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 're.match("[0-9]{3}-[0-9]{3}-[0-9]{4}", "123-123-1234")'; done
1 loops, best of 3: 3.1 usec per loop
10 loops, best of 3: 2.41 usec per loop
100 loops, best of 3: 2.24 usec per loop
1000 loops, best of 3: 2.21 usec per loop
10000 loops, best of 3: 2.23 usec per loop
100000 loops, best of 3: 2.24 usec per loop
1000000 loops, best of 3: 2.31 usec per loop

with re.compile:

~$ for x in 1 10 100 1000 10000 100000 1000000; do python -m timeit -n $x -s 'import re' 'r = re.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}")' 'r.match("123-123-1234")'; done
1 loops, best of 3: 1.91 usec per loop
10 loops, best of 3: 0.691 usec per loop
100 loops, best of 3: 0.701 usec per loop
1000 loops, best of 3: 0.684 usec per loop
10000 loops, best of 3: 0.682 usec per loop
100000 loops, best of 3: 0.694 usec per loop
1000000 loops, best of 3: 0.702 usec per loop

So, it would seem that compiling is faster in this simple case, even if you only match once.


Answer 4

I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code shows the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.

Here’s my test:

#!/usr/bin/env python
import re
import time

def timed(func):
    def wrapper(*args):
        t = time.time()
        result = func(*args)
        t = time.time() - t
        print '%s took %.3f seconds.' % (func.func_name, t)
        return result
    return wrapper

regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average    2 never"

@timed
def noncompiled():
    a = 0
    for x in xrange(1000000):
        m = re.match(regularExpression, testString)
        a += int(m.group(1))
    return a

@timed
def compiled():
    a = 0
    rgx = re.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiled():
    a = 0
    rgx = re.sre_compile.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a


@timed
def compiledInLoop():
    a = 0
    for x in xrange(1000000):
        rgx = re.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiledInLoop():
    a = 0
    for x in xrange(10000):
        rgx = re.sre_compile.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5

And here is the output on my machine:

$ regexTest.py 
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 =  2000000
r2 =  2000000
r3 =  2000000
r4 =  2000000
r5 =  20000

The ‘reallyCompiled’ methods use the internal interface, which bypasses the cache. Note that the one that compiles on each loop iteration is only iterated 10,000 times, not one million.


Answer 5

I agree with Honest Abe that the match(...) calls in the given examples are different. They are not one-to-one comparisons and thus the outcomes vary. To simplify my reply, I use A, B, C, D for those functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

Running this piece of code:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)

is same as running this code:

re.match('hello', 'hello world')          # (C)

Because, when looked into the source re.py, (A + B) means:

h = re._compile('hello')                  # (D)
h.match('hello world')

and (C) is actually:

re._compile('hello').match('hello world')

So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D) which is also called by (A). In other words, (C) = (A) + (B). Therefore, comparing (A + B) inside a loop has same result as (C) inside a loop.

George’s regexTest.py proved this for us.

noncompiled took 4.555 seconds.           # (C) in a loop
compiledInLoop took 4.620 seconds.        # (A + B) in a loop
compiled took 2.323 seconds.              # (A) once + (B) in a loop

What everyone is interested in is how to get the 2.323-second result. To make sure compile(...) only gets called once, we need to store the compiled regex object in memory. If we are using a class, we could store the object and reuse it every time our function gets called.

class Foo:
    regex = re.compile('hello')
    def my_function(self, text):
        return self.regex.match(text)

If we are not using a class (which is my situation today), then I have no comment. I’m still learning to use global variables in Python, and I know globals are considered a bad thing.

One more point: I believe the (A) + (B) approach has the upper hand. Here are some facts I have observed (please correct me if I’m wrong):

  1. Calling (A) once does one search in _cache followed by one sre_compile.compile() to create a regex object. Calling (A) twice does two searches and one compile (because the regex object is cached).

  2. If _cache gets flushed in between, the regex object is released from memory and Python needs to compile again. (Someone suggested that Python won’t recompile.)

  3. If we keep the regex object by using (A), it will still get into _cache and may get flushed somehow. But our code keeps a reference to it, so the regex object will not be released from memory. Thus, Python need not compile again.

  4. The 2-second difference between George’s compiledInLoop and compiled tests is mainly the time required to build the key and search the _cache, not the compile time of the regex.

  5. George’s reallyCompiled test shows what happens if it really redoes the compile every time: it is 100x slower (he reduced the loop from 1,000,000 to 10,000).

Here are the only cases where (A + B) is better than (C):

  1. If we can cache a reference to the regex object inside a class.
  2. If we need to call (B) repeatedly (inside a loop or multiple times), we must cache the reference to the regex object outside the loop.

Cases where (C) is good enough:

  1. We cannot cache a reference.
  2. We only use it once in a while.
  3. Overall, we don’t have too many regexes (assuming the compiled ones never get flushed).

Just a recap, here are the A B C:

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)
re.match('hello', 'hello world')          # (C)

Thanks for reading.
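
For the "not using a class" case mentioned above, a common pattern is to bind the compiled object to a module-level constant instead; a minimal sketch (the names HELLO_RE and has_hello are illustrative, not from the answer):

import re

# Compiled once at import time; holding this reference keeps the pattern
# object alive regardless of what happens to re's internal _cache.
HELLO_RE = re.compile('hello')

def has_hello(text):
    # Reuses the already-compiled pattern: no cache lookup, no recompile.
    return HELLO_RE.match(text) is not None

print(has_hello('hello world'))  # True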


回答 6

通常,是否使用re.compile几乎没有区别。在内部,所有功能都是通过编译步骤实现的:

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)

另外,re.compile()绕过了额外的间接和缓存逻辑:

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

除了使用re.compile带来的小幅速度优势外,人们还喜欢它带来的可读性:为可能很复杂的模式规范命名,并将其与应用这些模式的业务逻辑分离开来:

#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

#### Applications ########################################################

if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()

请注意,另一位回答者错误地认为pyc文件直接存储了已编译的模式;实际上,每次加载pyc文件时都会重新构建它们:

>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
        f.read(8)
        dis(marshal.load(f))

  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE

上面的反汇编来自PYC文件,其中tmp.py包含:

import re
lc_vowels = re.compile(r'[aeiou]{2,5}')

Mostly, there is little difference whether you use re.compile or not. Internally, all of the functions are implemented in terms of a compile step:

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    return _compile(pattern, flags).finditer(string)

In addition, re.compile() bypasses the extra indirection and caching logic:

_cache = {}

_pattern_type = type(sre_compile.compile("", 0))

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    try:
        p, loc = _cache[type(pattern), pattern, flags]
        if loc is None or loc == _locale.setlocale(_locale.LC_CTYPE):
            return p
    except KeyError:
        pass
    if isinstance(pattern, _pattern_type):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            _cache.clear()
        if p.flags & LOCALE:
            if not _locale:
                return p
            loc = _locale.setlocale(_locale.LC_CTYPE)
        else:
            loc = None
        _cache[type(pattern), pattern, flags] = p, loc
    return p

In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where they are applied:

#### Patterns ############################################################
number_pattern = re.compile(r'\d+(\.\d*)?')    # Integer or decimal number
assign_pattern = re.compile(r':=')             # Assignment operator
identifier_pattern = re.compile(r'[A-Za-z]+')  # Identifiers
whitespace_pattern = re.compile(r'[\t ]+')     # Spaces and tabs

#### Applications ########################################################

if whitespace_pattern.match(s): business_logic_rule_1()
if assign_pattern.match(s): business_logic_rule_2()

Note, one other respondent incorrectly believed that pyc files stored compiled patterns directly; however, in reality they are rebuilt each time when the PYC is loaded:

>>> from dis import dis
>>> with open('tmp.pyc', 'rb') as f:
        f.read(8)
        dis(marshal.load(f))

  1           0 LOAD_CONST               0 (-1)
              3 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (re)
              9 STORE_NAME               0 (re)

  3          12 LOAD_NAME                0 (re)
             15 LOAD_ATTR                1 (compile)
             18 LOAD_CONST               2 ('[aeiou]{2,5}')
             21 CALL_FUNCTION            1
             24 STORE_NAME               2 (lc_vowels)
             27 LOAD_CONST               1 (None)
             30 RETURN_VALUE

The above disassembly comes from the PYC file for a tmp.py containing:

import re
lc_vowels = re.compile(r'[aeiou]{2,5}')

回答 7

通常,我发现在编译模式时使用标志(例如re.I)比内联使用标志更容易(至少更容易记住用法)。

>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']

>>> re.findall('(?i)foo','some string FoO bar')
['FoO']

In general, I find it is easier to use flags like re.I when compiling patterns (at least easier to remember how) than to use flags inline.

>>> foo_pat = re.compile('foo',re.I)
>>> foo_pat.findall('some string FoO bar')
['FoO']

vs

>>> re.findall('(?i)foo','some string FoO bar')
['FoO']

回答 8

使用给定的示例:

h = re.compile('hello')
h.match('hello world')

上面的示例中的match方法与以下使用的方法不同:

re.match('hello', 'hello world')

re.compile()返回一个正则表达式对象,这意味着h是一个regex对象。

regex对象具有自己的match方法,该方法带有可选的pos和endpos参数:

regex.match(string[, pos[, endpos]])

pos

可选的第二个参数pos给出搜索开始处在字符串中的索引,默认为0。这并不完全等同于对字符串做切片:'^'模式字符只在字符串的真正开头以及紧跟换行符之后的位置匹配,而不一定在搜索开始的索引处匹配。

endpos

可选参数endpos限制字符串被搜索的范围:就好像字符串只有endpos个字符长,因此只会在从pos到endpos - 1的字符中搜索匹配。如果endpos小于pos,则找不到任何匹配;否则,如果rx是已编译的正则表达式对象,rx.search(string, 0, 50)等效于rx.search(string[:50], 0)。

regex对象的search、findall和finditer方法也支持这些参数。

如您所见,re.match(pattern, string, flags=0)并不支持这些参数,其search、findall和finditer对应函数也不支持。

match对象具有与这些参数对应的属性:

match.pos

传递给正则表达式对象的search()或match()方法的pos值。这是RE引擎开始寻找匹配项的字符串索引。

match.endpos

传递给正则表达式对象的search()或match()方法的endpos值。这是RE引擎不会越过的字符串索引。


一个正则表达式对象有两个独特的,可能有用的,属性:

regex.groups

模式中捕获组的数量。

regex.groupindex

一个字典,将由(?P<name>)定义的所有符号组名映射到组号。如果模式中未使用符号组,则该字典为空。


最后,match对象具有以下属性:

match.re

其match()或search()方法产生了此match实例的那个正则表达式对象。

Using the given examples:

h = re.compile('hello')
h.match('hello world')

The match method in the example above is not the same as the one used below:

re.match('hello', 'hello world')

re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

endpos

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

The regex object’s search, findall, and finditer methods also support these parameters.

re.match(pattern, string, flags=0) does not support them as you can see,
nor does its search, findall, and finditer counterparts.

A match object has attributes that complement these parameters:

match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.


A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P<name>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.


And finally, a match object has this attribute:

match.re

The regular expression object whose match() or search() method produced this match instance.
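
A minimal runnable sketch of the parameters and attributes described above (outputs shown as comments):

import re

rx = re.compile(r'world')

# pos/endpos are only accepted by the compiled pattern's methods.
m = rx.search('hello world', 6)          # start looking at index 6
print(m.group(), m.pos, m.endpos)        # world 6 11

# endpos truncates the search as if the string were that long.
print(rx.search('hello world', 0, 5))    # None: 'world' lies beyond endpos

# groups / groupindex on the pattern object.
named = re.compile(r'(?P<word>\w+) (?P<num>\d+)')
print(named.groups)                      # 2
print(dict(named.groupindex))            # {'word': 1, 'num': 2}

# match.re points back at the pattern that produced the match.
m2 = named.match('abc 123')
print(m2.re is named)                    # True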


回答 9

除了性能差异外,使用re.compile和使用编译后的正则表达式对象进行匹配(无论与正则表达式相关的任何操作)都使语义在Python运行时更清晰。

我有一些调试一些简单代码的痛苦经历:

compare = lambda s, p: re.match(p, s)

后来我在下面的代码中使用compare:

[x for x in data if compare(patternPhrases, x[columnIndex])]

其中patternPhrases应该是包含正则表达式字符串的变量,x[columnIndex]是包含字符串的变量。

我遇到了patternPhrases与某些预期字符串不匹配的问题!

但是,如果我使用re.compile形式:

compare = lambda s, p: p.match(s)

然后在

[x for x in data if compare(patternPhrases, x[columnIndex])]

Python会抱怨“字符串没有match属性”,因为按照compare中的位置参数映射,x[columnIndex]被当作正则表达式使用了!而我真正的意思是:

compare = lambda p, s: p.match(s)

在我的情况下,当正则表达式的值无法被肉眼直接看到时,使用re.compile能更明确地表达其用途,因此我可以从Python的运行时检查中获得更多帮助。

因此,这次教训的寓意是:当正则表达式不仅仅是字面字符串时,我应该使用re.compile,让Python帮助我验证自己的假设。

Performance difference aside, using re.compile and using the compiled regular expression object to do match (whatever regular expression related operations) makes the semantics clearer to Python run-time.

I had some painful experience of debugging some simple code:

compare = lambda s, p: re.match(p, s)

and later I’d use compare in

[x for x in data if compare(patternPhrases, x[columnIndex])]

where patternPhrases is supposed to be a variable containing regular expression string, x[columnIndex] is a variable containing string.

I had trouble that patternPhrases did not match some expected string!

But if I used the re.compile form:

compare = lambda s, p: p.match(s)

then in

[x for x in data if compare(patternPhrases, x[columnIndex])]

Python would have complained that “string does not have attribute of match”, since by the positional argument mapping in compare, x[columnIndex] is used as the regular expression, when I actually meant:

compare = lambda p, s: p.match(s)

In my case, using re.compile is more explicit about the purpose of the regular expression when its value is hidden from the naked eye, thus I could get more help from Python’s run-time checking.

So the moral of my lesson is that when the regular expression is not just a literal string, I should use re.compile to let Python help me assert my assumption.
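
For illustration, a minimal reproduction of the failure mode described above (the exact error wording varies slightly by Python version):

import re

compare = lambda p, s: p.match(s)   # intended order: pattern first

pat = re.compile('hello')
print(compare(pat, 'hello world'))  # works: a match object

# Swapping the arguments fails loudly instead of silently mismatching:
try:
    compare('hello world', pat)
except AttributeError as e:
    print(e)  # 'str' object has no attribute 'match'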


回答 10

使用re.compile()还有一个额外的好处:可以借助re.VERBOSE向我的正则表达式模式添加注释

pattern = '''
hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
'''

re.search(pattern, 'hello world', re.VERBOSE)

尽管这不会影响代码的运行速度,但我喜欢这样做,因为这是我注释习惯的一部分。我非常不喜欢在两个月后想修改代码时,还要花时间去回忆当初代码背后的逻辑。

There is one addition perk of using re.compile(), in the form of adding comments to my regex patterns using re.VERBOSE

pattern = '''
hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
'''

re.search(pattern, 'hello world', re.VERBOSE)

Although this does not affect the speed of running your code, I like to do it this way as it is part of my commenting habit. I thoroughly dislike spending time trying to remember the logic behind my code two months down the line when I want to make modifications.
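
Since the stated perk is about re.compile(), here is the same idea with the flag supplied at compile time instead of per call; a minimal sketch:

import re

# The comments live with the pattern; under re.VERBOSE, unescaped whitespace
# in the pattern is ignored, hence [ ] to match a literal space.
hello_world = re.compile(r'''
    hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
    ''', re.VERBOSE)

print(hello_world.search('hello world'))  # <re.Match object; span=(0, 11), ...>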


回答 11

根据Python 文档

序列

prog = re.compile(pattern)
result = prog.match(string)

相当于

result = re.match(pattern, string)

但是,当表达式将在单个程序中多次使用时,使用re.compile()并保存生成的正则表达式对象以供重用会更有效。

所以我的结论是,如果您要为许多不同的文本匹配相同的模式,则最好对其进行预编译。

According to the Python documentation:

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

So my conclusion is, if you are going to match the same pattern for many different texts, you better precompile it.


回答 12

有趣的是,编译对我来说确实更有效(Win XP上的Python 2.5.2):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

按原样运行一次上面的代码,再把两行if以相反的方式注释后运行一次,可以看到编译后的regex速度快一倍

Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile('(\w+)\s+[0-9_]?\s+\w*')
str = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match('(\w+)\s+[0-9_]?\s+\w*', str):
    #~ if rgx.match(str):
        a += 1

print time.time() - t

Running the above code once as is, and once with the two if lines commented the other way around, the compiled regex is twice as fast


回答 13

在绊倒这里的讨论之前,我进行了此测试。但是,运行它后,我认为我至少会发布结果。

我借用了杰夫·弗里德尔(Jeff Friedl)的《精通正则表达式》(Mastering Regular Expressions)中的示例并对其做了改编。这是在运行OSX 10.6的Macbook(2GHz Intel Core 2 Duo,4GB内存)上,Python版本是2.6.1。

运行1-使用re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

运行2-不使用re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds

I ran this test before stumbling upon the discussion here. However, having run it I thought I’d at least post my results.

I stole and bastardized the example in Jeff Friedl’s “Mastering Regular Expressions”. This is on a macbook running OSX 10.6 (2Ghz intel core 2 duo, 4GB ram). Python version is 2.6.1.

Run 1 – using re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

Run 2 – Not using re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds

回答 14

这个答案可能来得有些晚,但这是一个有趣的发现。如果您打算多次使用同一个正则表达式,使用compile确实可以节省时间(文档中也提到了这一点)。从下面可以看到,直接在已编译的正则表达式对象上调用match方法最快;把已编译的正则表达式传给re.match反而最慢;而把模式字符串传给re.match则介于两者之间。

>>> ipr = r'\D+((([0-2][0-5]?[0-5]?)\.){3}([0-2][0-5]?[0-5]?))\D+'
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.5077415757028423
>>> ipr = re.compile(ipr)
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.8324008992184038
>>> average(*timeit.repeat("ipr.match('abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
0.9187896518778871

This answer might be arriving late but is an interesting find. Using compile can really save you time if you are planning on using the regex multiple times (this is also mentioned in the docs). Below you can see that using a compiled regex is the fastest when the match method is directly called on it. Passing a compiled regex to re.match makes it the slowest, and passing re.match the pattern string is somewhere in the middle.

>>> ipr = r'\D+((([0-2][0-5]?[0-5]?)\.){3}([0-2][0-5]?[0-5]?))\D+'
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.5077415757028423
>>> ipr = re.compile(ipr)
>>> average(*timeit.repeat("re.match(ipr, 'abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
1.8324008992184038
>>> average(*timeit.repeat("ipr.match('abcd100.10.255.255 ')", globals={'ipr': ipr, 're': re}))
0.9187896518778871
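
Note that average() in the session above is not defined by the answer; presumably it is a small helper over the timings returned by timeit.repeat. A plausible sketch:

def average(*numbers):
    # Hypothetical helper matching the session above: mean of the timings.
    return sum(numbers) / len(numbers)

print(average(1.0, 2.0, 3.0))  # 2.0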

回答 15

除了性能之外。

当我开始学习regex时,使用compile帮助我区分了以下几个概念:
1. 模块(re)
2. regex对象
3. match对象

#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'

作为补充,我制作了一份详尽的re模块速查表,以供参考。

regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
            'capturing_group' : ['()', '(?:)', '(?!)', '|', '\\', 'backreferences and named group'],
            'repetition'      : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead'  : ['(?=...)', '(?!...)'],
            'lookbehind' : ['(?<=...)', '(?<!...)'],
            'capturing'  : ['(?P<name>...)', '(?P=name)', '(?:)']},
'escapes':{'anchor'          : [r'^', r'\b', r'$'],
          'non_printable'   : [r'\n', r'\t', r'\r', r'\f', r'\v'],
          'shorthand'       : [r'\d', r'\w', r'\s']},
'methods': {'matching' : ['search', 'match', 'findall', 'finditer'],
            'modifying': ['split', 'sub']},
'match_object': ['group', 'groups', 'groupdict', 'start', 'end', 'span']
}

Besides the performance.

Using compile helps me to distinguish the concepts of
1. module(re),
2. regex object
3. match object
When I started learning regex

#regex object
regex_object = re.compile(r'[a-zA-Z]+')
#match object
match_object = regex_object.search('1.Hello')
#matching content
match_object.group()
output:
Out[60]: 'Hello'
V.S.
re.search(r'[a-zA-Z]+','1.Hello').group()
Out[61]: 'Hello'

As a complement, I made an exhaustive cheatsheet of the re module for your reference.

regex = {
'brackets':{'single_character': ['[]', '.', {'negate':'^'}],
            'capturing_group' : ['()', '(?:)', '(?!)', '|', '\\', 'backreferences and named group'],
            'repetition'      : ['{}', '*?', '+?', '??', 'greedy v.s. lazy ?']},
'lookaround' :{'lookahead'  : ['(?=...)', '(?!...)'],
            'lookbehind' : ['(?<=...)', '(?<!...)'],
            'capturing'  : ['(?P<name>...)', '(?P=name)', '(?:)']},
'escapes':{'anchor'          : [r'^', r'\b', r'$'],
          'non_printable'   : [r'\n', r'\t', r'\r', r'\f', r'\v'],
          'shorthand'       : [r'\d', r'\w', r'\s']},
'methods': {'matching' : ['search', 'match', 'findall', 'finditer'],
            'modifying': ['split', 'sub']},
'match_object': ['group', 'groups', 'groupdict', 'start', 'end', 'span']
}

回答 16

我非常尊重上面的所有答案。在我看来:是的!与其每次都一遍又一遍地编译正则表达式,当然更值得使用re.compile。

使用re.compile可以使您的代码更灵活,因为您可以直接调用已编译的regex,而不必一再重新编译。这在以下方面使您受益:

  1. 处理器开销
  2. 时间复杂度
  3. 使正则表达式通用(可用于findall、search、match)
  4. 并使您的程序看起来很酷

范例:

  example_string = "The room number of her room is 26A7B."
  find_alpha_numeric_string = re.compile(r"\b\w+\b")

在Findall中使用

 find_alpha_numeric_string.findall(example_string)

在搜索中使用

  find_alpha_numeric_string.search(example_string)

同样,您可以将其用于:匹配和替换

I really respect all the above answers. In my opinion: yes! It is surely worth using re.compile instead of compiling the regex again and again, every time.

Using re.compile makes your code more dynamic, as you can call the already compiled regex instead of compiling it again and again. This benefits you in these cases:

  1. Processor effort
  2. Time complexity
  3. Makes the regex universal (can be used in findall, search, match)
  4. And makes your program look cool

Example :

  example_string = "The room number of her room is 26A7B."
  find_alpha_numeric_string = re.compile(r"\b\w+\b")

Using in Findall

 find_alpha_numeric_string.findall(example_string)

Using in search

  find_alpha_numeric_string.search(example_string)

Similarly, you can use it for match and substitute.


回答 17

这是一个很好的问题。您经常看到人们毫无理由地使用re.compile,这会降低可读性。但确实有很多时候需要预编译表达式,比如在循环中重复使用它的时候。

就像关于编程的一切(实际上生活中的一切)一样。应用常识。

This is a good question. You often see people use re.compile without reason. It lessens readability. But sure there are lots of times when pre-compiling the expression is called for, like when you use it repeatedly in a loop or some such.

It’s like everything about programming (everything in life actually). Apply common sense.


回答 18

(几个月后)围绕re.match(或任何其他函数)添加自己的缓存其实很容易:

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

Wibni(wouldn't it be nice if,“要是能有就好了”):cachehint( size= )、cacheinfo() -> size, hits, nclear……

(months later) it’s easy to add your own cache around re.match, or anything else for that matter —

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

A wibni, wouldn’t it be nice if: cachehint( size= ), cacheinfo() -> size, hits, nclear …
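
For what it’s worth, the standard library’s functools.lru_cache provides exactly the wished-for introspection via cache_info(); a sketch of a similar wrapper built on it (not the answer’s original code):

import re
from functools import lru_cache

@lru_cache(maxsize=512)
def compile_cached(pattern, flags=0):
    # lru_cache keeps the compiled objects and counts hits/misses for us.
    return re.compile(pattern, flags)

def match(pattern, string, flags=0):
    return compile_cached(pattern, flags).match(string)

match('hello', 'hello world')
match('hello', 'hello again')
print(compile_cached.cache_info())
# CacheInfo(hits=1, misses=1, maxsize=512, currsize=1)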


回答 19

我有过把已编译的正则表达式运行数千次与即时编译相对比的大量经验,并没有注意到任何可察觉的差异。

对已接受答案的投票让人以为@Triptych所说的在所有情况下都成立,但事实未必如此。一个很大的区别在于,当您必须决定函数参数是接受正则表达式字符串还是已编译的正则表达式对象时:

>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
... h=re.compile('hello')
... """, stmt="f(h, 'hello world')")
0.32881879806518555
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: re.compile(x).match(y)   # compiles when called
... """, stmt="f('hello', 'hello world')")
0.809190034866333

如果您需要重用正则表达式,最好还是先编译好它们。

请注意,上面timeit中的示例模拟的是:在导入时一次性创建已编译的regex对象,与在需要匹配时“即时”创建相对比。

I’ve had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference

The votes on the accepted answer lead to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference is when you have to decide whether to accept a regex string or a compiled regex object as a parameter to a function:

>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
... h=re.compile('hello')
... """, stmt="f(h, 'hello world')")
0.32881879806518555
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: re.compile(x).match(y)   # compiles when called
... """, stmt="f('hello', 'hello world')")
0.809190034866333

It is always better to compile your regexes in case you need to reuse them.

Note the example in the timeit above simulates creation of a compiled regex object once at import time versus “on-the-fly” when required for a match.


回答 20

作为一个替代的答案,如我所见,以前没有提到过,我将继续引用Python 3文档

您应该使用这些模块级函数,还是应该自己获取模式并调用其方法?如果在循环中访问某个正则表达式,预编译它可以节省一些函数调用。在循环之外,由于内部缓存的存在,差别不大。

As an alternative answer, as I see that it hasn’t been mentioned before, I’ll go ahead and quote the Python 3 docs:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you’re accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there’s not much difference thanks to the internal cache.


回答 21

这是一个示例:其中使用re.compile的速度要快50倍以上,正如评论中所要求的那样。

这一点与我在上面评论中提到的观点相同:当您的使用方式无法从编译缓存中获得多少好处时,使用re.compile可能会带来显著优势。至少在一种特定情况下(我在实践中遇到过),即满足以下所有条件时,就会发生这种情况:

  • 您有很多正则表达式模式(超过re._MAXCACHE个,当前默认值为512),并且
  • 您会多次使用这些正则表达式,并且
  • 同一模式的连续两次使用之间间隔了超过re._MAXCACHE个其他正则表达式,因此每个正则表达式在两次连续使用之间都会被从缓存中清除。

import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

我在笔记本电脑上得到的示例输出(Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

我没有使用timeit,因为差异实在太明显了,而且我每次得到的数字在定性上都差不多。请注意,即使不使用re.compile,多次使用同一个regex再换下一个也不算太糟(只比使用re.compile慢大约2倍);但按另一种顺序(在许多regex之间循环)则明显更糟,正如预期的那样。另外,增大缓存大小也有效:只需在上面的setup()中设置re._MAXCACHE = len(patterns)(当然,我不建议在生产环境中这样做,因为带下划线的名称通常是“私有的”),就能把约23秒降回约0.7秒,这也符合我们的理解。

Here is an example where using re.compile is over 50 times faster, as requested.

The point is just the same as what I made in the comment above, namely, using re.compile can be a significant advantage when your usage is such as to not benefit much from the compilation cache. This happens at least in one particular case (that I ran into in practice), namely when all of the following are true:

  • You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and
  • you use these regexes a lot of times, and
  • your consecutive usages of the same pattern are separated by more than re._MAXCACHE other regexes in between, so that each one gets flushed from the cache between consecutive usages.

import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

Example output I get on my laptop (Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

I didn’t bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn’t so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don’t recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.


回答 22

使用第二个版本时,正则表达式在使用前才被编译。如果要执行很多次,绝对最好先编译;如果只是一次性的匹配,每次临时编译也没问题。

Regular expressions are compiled before being used when using the second version. If you are going to execute it many times, it is definitely better to compile it first. If not, compiling every time you match, for one-offs, is fine.
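
A minimal sketch of the two usages being contrasted:

import re

# One-off: compiling inline each time is fine.
print(re.search(r'\d+', 'order 66'))

# Executed many times: compile first, then reuse the pattern object.
digits = re.compile(r'\d+')
for line in ['a1', 'b22', 'c333']:
    print(digits.search(line).group())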


回答 23

易读性/认知负荷偏好

对我来说,主要的收获是:我只需要记住和阅读复杂正则表达式API语法的一种形式,即<compiled_pattern>.method(xxx)形式,而不必同时记住它和re.func(<pattern>, xxx)两种形式。

确实,re.compile(<pattern>)多了一点样板代码。

但是就正则表达式而言,额外的编译步骤不太可能是造成认知负担的主要原因。实际上,在复杂的模式上,您甚至可以通过将声明与随后在其上调用的任何regex方法分开来获得清晰度。

我倾向于首先在Regex101之类的网站中甚至在一个单独的最小测试脚本中调整复杂的模式,然后将它们引入我的代码中,因此将声明与使用分开也是适合我的工作流程的。

Legibility/cognitive load preference

To me, the main gain is that I only need to remember, and read, one form of the complicated regex API syntax – the <compiled_pattern>.method(xxx) form rather than that and the re.func(<pattern>, xxx) form.

The re.compile(<pattern>) is a bit of extra boilerplate, true.

But where regex are concerned, that extra compile step is unlikely to be a big cause of cognitive load. And in fact, on complicated patterns, you might even gain clarity from separating the declaration from whatever regex method you then invoke on it.

I tend to first tune complicated patterns in a website like Regex101, or even in a separate minimal test script, then bring them into my code, so separating the declaration from its use fits my workflow as well.


回答 24

我想论证:预编译在概念上和“文学上”(如“文学编程”(literate programming)所说的那样)都是有利的。看一下下面的代码片段:

from re import compile as _Re

class TYPO:

  def text_has_foobar( self, text ):
    return self._text_has_foobar_re_search( text ) is not None
  _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()

在您的应用程序中,您将编写:

from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar' ) )

就功能而言,这几乎是最简单的写法了。由于这个示例很短,我把获取_text_has_foobar_re_search的方式压缩到了一行。这段代码的缺点是,在TYPO库对象的整个生存期内,它都会占用少量内存;优点是,进行foobar搜索时,只需要两次函数调用和两次类字典查找。re缓存了多少个正则表达式以及该缓存的开销,在这里都无关紧要。

将此与以下更常用的样式进行比较:

import re

class Typo:

  def text_has_foobar( self, text ):
    return re.compile( r"""(?i)foobar""" ).search( text ) is not None

在应用程序中:

typo = Typo()
print( typo.text_has_foobar( 'FOObar' ) )

我坦然承认我的风格对于Python来说非常不寻常,甚至值得商榷。但是,在更贴近Python通常用法的那个示例中,为了进行一次匹配,我们必须实例化一个对象、做三次实例字典查找、执行三次函数调用;另外,当使用超过100个正则表达式时,我们可能会遇到re的缓存问题。同样,正则表达式被隐藏在方法体内,大多数情况下这并不是个好主意。

应当说,这些措施的每个子集(有针对性的别名导入语句、在适用处使用方法别名、减少函数调用和对象字典查找)都有助于降低计算和概念上的复杂性。

i’d like to motivate that pre-compiling is both conceptually and ‘literately’ (as in ‘literate programming’) advantageous. have a look at this code snippet:

from re import compile as _Re

class TYPO:

  def text_has_foobar( self, text ):
    return self._text_has_foobar_re_search( text ) is not None
  _text_has_foobar_re_search = _Re( r"""(?i)foobar""" ).search

TYPO = TYPO()

in your application, you’d write:

from TYPO import TYPO
print( TYPO.text_has_foobar( 'FOObar' ) )

this is about as simple in terms of functionality as it can get. because this example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you’ll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

compare this with the more usual style, below:

import re

class Typo:

  def text_has_foobar( self, text ):
    return re.compile( r"""(?i)foobar""" ).search( text ) is not None

In the application:

typo = Typo()
print( typo.text_has_foobar( 'FOObar' ) )

I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.

be it said that every subset of measures—targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups—can help reduce computational and conceptual complexity.


回答 25

我的理解是,这两个示例实际上是等效的。唯一的区别是,在第一个实例中,您可以在其他地方重用已编译的正则表达式,而无需再次对其进行编译。

这是给您的参考:http://diveintopython3.ep.io/refactoring.html

用字符串'M'调用已编译模式对象的search函数,与用正则表达式和字符串'M'一起调用re.search所完成的事情相同,只是快得多。(实际上,re.search函数只是编译正则表达式,然后替您调用所得模式对象的search方法。)

My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here’s a reference for you: http://diveintopython3.ep.io/refactoring.html

Calling the compiled pattern object’s search function with the string ‘M’ accomplishes the same thing as calling re.search with both the regular expression and the string ‘M’. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object’s search method for you.)


创建一个空的Pandas DataFrame,然后填充它?

问题:创建一个空的Pandas DataFrame,然后填充它?

我从这里的pandas DataFrame文档开始:http ://pandas.pydata.org/pandas-docs/stable/dsintro.html

我想用类似时间序列计算得到的值迭代地填充DataFrame。基本上,我想用列A、B和时间戳行来初始化DataFrame,值全为0或全为NaN。

然后,我会添加初始值,并遍历这些数据,根据前一行计算新行,比如row[A][t] = row[A][t-1]+1这样。

我目前正在使用下面的代码,但是我觉得这很丑陋,必须有一种直接使用DataFrame进行此操作的方法,或者通常来说是一种更好的方法。注意:我正在使用Python 2.7。

import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict

I’m starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

I’d like to iteratively fill the DataFrame with values in a time series kind of calculation. So basically, I’d like to initialize the DataFrame with columns A, B and timestamp rows, all 0 or all NaN.

I’d then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.

I’m currently using the code as below, but I feel it’s kind of ugly and there must be a way to do this with a DataFrame directly, or just a better way in general. Note: I’m using Python 2.7.

import datetime as dt
import pandas as pd
import scipy as s

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()

    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )

    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]

    print valdict

回答 0

这里有一些建议:

使用date_range作为索引:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

注意:我们只需编写以下内容,就可以创建一个空的DataFrame(带有NaN):

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

要对数据进行这些类型的计算,请使用numpy数组:

data = np.array([np.arange(10)]*3).T

因此,我们可以创建DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9

Here’s a couple of suggestions:

Use date_range for the index:

import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')

columns = ['A','B', 'C']

Note: we could create an empty DataFrame (with NaNs) simply by writing:

df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs

To do these type of calculations for the data, use a numpy array:

data = np.array([np.arange(10)]*3).T

Hence we can create the DataFrame:

In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]: 
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9

回答 1

如果您只想创建一个空的数据框并在以后用一些传入的数据框填充它,请尝试以下操作:

newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional 

在此示例中,我使用此pandas文档创建一个新的数据框,然后使用append将oldDF中的数据写入newDF。

如果我必须不断地把来自多个oldDF的新数据追加到这个newDF中,我只需用一个for循环来迭代调用pandas.DataFrame.append()。

If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:

newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print newDF.head() #again optional 

In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.

If I have to keep appending new data into this newDF from more than one oldDFs, I just use a for loop to iterate over pandas.DataFrame.append()


回答 2

创建数据框的正确方法

TLDR;(只需阅读粗体文字)

这里的大多数答案将告诉您如何创建一个空的DataFrame并将其填写,但是没有人会告诉您这是一件坏事。

这是我的建议:等待直到您确定拥有所有需要使用的数据。使用列表收集数据,然后在准备好时初始化DataFrame。

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

先追加到列表、再一次性创建DataFrame,总是比创建一个空的DataFrame(或全为NaN的DataFrame)再反复向其追加更省资源。列表占用的内存也更少,而且是更轻量的数据结构,便于处理、追加和删除(如果需要)。

此方法的另一个优点是dtypes可以自动推断(而不是把所有列都设为object)。

最后一个优点是会为您的数据自动创建RangeIndex,因此少了一件要操心的事(看看下面糟糕的append和loc方法,您会发现这两种方法中都有需要妥善处理索引的地方)。


你不应该做的事情

在循环内append或concat

这是我见过的初学者犯的最大错误:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.Series({'A': i, 'B': b, 'C': c})], ignore_index=True)

每一次append或concat操作都会重新分配内存。再加上循环,就成了二次复杂度的操作。引自df.append文档页面:

迭代地将行添加到DataFrame可能比单次连接更耗费计算资源。更好的解决方案是先将这些行添加到列表中,然后一次性将该列表与原始DataFrame连接起来。

与df.append相关的另一个错误是,用户往往忘记append不是就地操作的函数,因此必须把结果赋值回去。您还得操心dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

处理object列从来都不是一件好事,因为pandas无法对这些列上的操作进行向量化。您需要执行以下操作来修复它:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

在循环内使用loc

我还见过用loc向创建为空的DataFrame追加数据:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

和之前一样,您没有预先分配所需的内存,因此每次创建新行时内存都会重新增长。这和append一样糟糕,而且更丑陋。

全为NaN的空DataFrame

再有,就是创建一个全为NaN的DataFrame,以及随之而来的所有问题。

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

它和前面的方法一样,会创建一个全是object列的DataFrame。

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

向其追加数据仍然存在上述方法的所有问题。

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

实践出真知

对这些方法进行计时,是了解它们在内存和实用性上差异大小的最快方式。

(图:各方法的运行耗时基准对比)

基准测试代码,以供参考。

The Right Way™ to Create a DataFrame

TLDR; (just read the bold text)

Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.

Here is my advice: Wait until you are sure you have all the data you need to work with. Use a list to collect your data, then initialise a DataFrame when you are ready.

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again. Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).

The other advantage of this method is dtypes are automatically inferred (rather than assigning object to all of them).

The last advantage is that a RangeIndex is automatically created for your data, so it is one less thing to worry about (take a look at the poor append and loc methods below, you will see elements in both that require handling the index appropriately).


Things you should NOT do

append or concat inside a loop

Here is the biggest mistake I’ve seen from beginners:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.Series({'A': i, 'B': b, 'C': c})], ignore_index=True)

Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation. From the df.append doc page:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:

df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object

Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:

df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object

loc inside a loop

I have also seen loc used to append to a DataFrame that was created empty:

df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]

As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It’s just as bad as append, and even more ugly.

Empty DataFrame of NaNs

And then, there’s creating a DataFrame of NaNs, and all the caveats associated therewith.

df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN

It creates a DataFrame of object columns, like the others.

df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object

Appending still has all the issues as the methods above.

for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

(Figure: timing benchmark comparing the methods)

Benchmarking code for reference.
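
The linked benchmarking code is not reproduced in this archive; a minimal timeit sketch along the same lines (absolute numbers vary by machine, and DataFrame.append, matching this answer’s era, was removed in pandas 2.0) might look like:

import timeit

setup = 'import pandas as pd'

# Grow an empty DataFrame row by row (the anti-pattern).
grow = '''
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(1000):
    df = df.append({'A': i, 'B': i, 'C': i}, ignore_index=True)
'''

# Collect rows in a list, then build the DataFrame once.
collect = '''
data = []
for i in range(1000):
    data.append([i, i, i])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
'''

print('append in loop :', min(timeit.repeat(grow, setup, number=1, repeat=3)))
print('list then build:', min(timeit.repeat(collect, setup, number=1, repeat=3)))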


回答 3

用列名初始化空框架

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

将新记录添加到框架

my_df.loc[len(my_df)] = [2, 4, 5]

您可能还想通过字典:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

将另一个框架附加到现有框架

col_names =  ['A', 'B', 'C']
my_df2  = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)

性能考量

如果要在循环内添加行,请考虑性能问题。对于大约前1000条记录,“my_df.loc”的性能较好,但随着循环中记录数的增加,它会逐渐变慢。

如果您打算在一个大循环中执行此操作(例如1000万条记录左右),那么最好把这两种方法混合使用:用iloc填充一个数据框,等大小达到1000左右时把它追加到原始数据框,然后清空临时数据框。这能把性能提升大约10倍。

Initialize empty frame with column names

import pandas as pd

col_names =  ['A', 'B', 'C']
my_df  = pd.DataFrame(columns = col_names)
my_df

Add a new record to a frame

my_df.loc[len(my_df)] = [2, 4, 5]

You also might want to pass a dictionary:

my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic 

Append another frame to your existing frame

col_names =  ['A', 'B', 'C']
my_df2  = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)

Performance considerations

If you are adding rows inside a loop, consider performance issues. For around the first 1000 records “my_df.loc” performance is better, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of these two: fill a dataframe with iloc until the size gets to around 1000, then append it to the original dataframe and empty the temp dataframe. This would boost your performance by around 10 times.
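
A hedged sketch of that mixture (the names, the chunk size of 1000, and the loop body are illustrative, and it targets the older pandas API this answer assumes, where DataFrame.append still exists):

import pandas as pd

col_names = ['A', 'B', 'C']
chunk_size = 1000                           # illustrative threshold

result = pd.DataFrame(columns=col_names)
temp = pd.DataFrame(index=range(chunk_size), columns=col_names)
filled = 0

for i in range(10000):                      # stand-in for the big loop
    temp.iloc[filled] = [i, i * 2, i * 3]   # cheap positional write
    filled += 1
    if filled == chunk_size:                # flush the full chunk...
        result = result.append(temp, ignore_index=True)
        temp = pd.DataFrame(index=range(chunk_size), columns=col_names)
        filled = 0

if filled:                                  # ...and any remainder
    result = result.append(temp.iloc[:filled], ignore_index=True)

print(len(result))                          # 10000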


回答 4

假设有19行的数据框

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

保持A列不变

test['A']=10

保持列b为循环给出的变量

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

您可以把pd.Series([x], index = [x])中的第一个x替换为任何值。

Assume a dataframe with 19 rows

index=range(0,19)
index

columns=['A']
test = pd.DataFrame(index=index, columns=columns)

Keeping Column A as a constant

test['A']=10

Keeping column b as a variable given by a loop

for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])

You can replace the first x in pd.Series([x], index = [x]) with any value


您是否应该始终偏爱xrange()而不是range()?

问题:您是否应该始终偏爱xrange()而不是range()?

为什么或者为什么不?

Why or why not?


回答 0

对于性能,尤其是在较大范围内进行迭代时,xrange()通常会更好。但是,在某些情况下,您可能更喜欢range()

  • 在Python 3中,range()做的是xrange()过去做的事,而xrange()已不存在。如果想编写能同时在Python 2和Python 3上运行的代码,就不能使用xrange()。

  • range()在某些情况下实际上更快,例如多次迭代同一序列时。xrange()每次都必须重新构造整数对象,而range()拥有真正的整数对象。(不过,在内存方面它总是表现得更差。)

  • xrange()在需要真实列表的所有情况下都不可用。例如,它不支持切片或任何列表方法。

[编辑] 有几篇帖子提到2to3工具将如何升级range()。为了记录在案,下面是对range()和xrange()的一些用法示例运行该工具的输出:

RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: ws_comma
--- range_test.py (original)
+++ range_test.py (refactored)
@@ -1,7 +1,7 @@

 for x in range(20):
-    a=range(20)
+    a=list(range(20))
     b=list(range(20))
     c=[x for x in range(20)]
     d=(x for x in range(20))
-    e=xrange(20)
+    e=range(20)

如您所见,在for循环或推导式中使用时,或者已经用list()包装的地方,range保持不变。

For performance, especially when you’re iterating over a large range, xrange() is usually better. However, there are still a few cases why you might prefer range():

  • In python 3, range() does what xrange() used to do and xrange() does not exist. If you want to write code that will run on both Python 2 and Python 3, you can’t use xrange().

  • range() can actually be faster in some cases – eg. if iterating over the same sequence multiple times. xrange() has to reconstruct the integer object every time, but range() will have real integer objects. (It will always perform worse in terms of memory however)

  • xrange() isn’t usable in all cases where a real list is needed. For instance, it doesn’t support slices, or any list methods.

[Edit] There are a couple of posts mentioning how range() will be upgraded by the 2to3 tool. For the record, here’s the output of running the tool on some sample usages of range() and xrange()

RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: ws_comma
--- range_test.py (original)
+++ range_test.py (refactored)
@@ -1,7 +1,7 @@

 for x in range(20):
-    a=range(20)
+    a=list(range(20))
     b=list(range(20))
     c=[x for x in range(20)]
     d=(x for x in range(20))
-    e=xrange(20)
+    e=range(20)

As you can see, when used in a for loop or comprehension, or where already wrapped with list(), range is left unchanged.
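
To illustrate the second bullet (iterating over the same sequence multiple times), a small Python 2 sketch, since xrange() no longer exists in Python 3:

# Python 2 sketch: iterating the same sequence many times.
import timeit

# range() builds the int objects once; each pass over the list reuses them.
print timeit.timeit('for i in seq: pass',
                    setup='seq = range(10000)', number=1000)

# xrange() produces the int objects afresh on every pass.
print timeit.timeit('for i in seq: pass',
                    setup='seq = xrange(10000)', number=1000)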


回答 1

不,它们都有各自的用途:

迭代时使用xrange(),因为它可以节省内存。比如:

for x in xrange(1, one_zillion):

而不是:

for x in range(1, one_zillion):

另一方面,如果您确实想要一个数字列表,请使用range()。

multiples_of_seven = range(7,100,7)
print "Multiples of seven < 100: ", multiples_of_seven

No, they both have their uses:

Use xrange() when iterating, as it saves memory. Say:

for x in xrange(1, one_zillion):

rather than:

for x in range(1, one_zillion):

On the other hand, use range() if you actually want a list of numbers.

multiples_of_seven = range(7,100,7)
print "Multiples of seven < 100: ", multiples_of_seven

回答 2

仅当需要实际列表时,才应优先使用range()而不是xrange()。例如,当您想修改range()返回的列表,或者想对其进行切片时。对于迭代,甚至只是普通索引,xrange()都能正常工作(而且通常效率高得多)。对于非常小的列表,range()会比xrange()快一点点,但取决于您的硬件和各种其他细节,收支平衡点可能出现在长度为1或2的结果上;不必为此担心。优先使用xrange()。

You should favour range() over xrange() only when you need an actual list. For instance, when you want to modify the list returned by range(), or when you wish to slice it. For iteration or even just normal indexing, xrange() will work fine (and usually much more efficiently). There is a point where range() is a bit faster than xrange() for very small lists, but depending on your hardware and various other details, the break-even can be at a result of length 1 or 2; not something to worry about. Prefer xrange().


回答 3

另一个区别是xrange()不支持大于C int的数字,因此,如果想要一个利用Python内置大数支持的范围,就必须使用range()。

Python 2.7.3 (default, Jul 13 2012, 22:29:01) 
[GCC 4.7.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> range(123456787676676767676676,123456787676676767676679)
[123456787676676767676676L, 123456787676676767676677L, 123456787676676767676678L]
>>> xrange(123456787676676767676676,123456787676676767676679)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

Python 3没有这个问题:

Python 3.2.3 (default, Jul 14 2012, 01:01:48) 
[GCC 4.7.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> range(123456787676676767676676,123456787676676767676679)
range(123456787676676767676676, 123456787676676767676679)

One other difference is that xrange() can’t support numbers bigger than C ints, so if you want to have a range using python’s built in large number support, you have to use range().

Python 2.7.3 (default, Jul 13 2012, 22:29:01) 
[GCC 4.7.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> range(123456787676676767676676,123456787676676767676679)
[123456787676676767676676L, 123456787676676767676677L, 123456787676676767676678L]
>>> xrange(123456787676676767676676,123456787676676767676679)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

Python 3 does not have this problem:

Python 3.2.3 (default, Jul 14 2012, 01:01:48) 
[GCC 4.7.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> range(123456787676676767676676,123456787676676767676679)
range(123456787676676767676676, 123456787676676767676679)

回答 4

xrange()效率更高,因为它无需生成对象列表,而只是一次生成一个对象。一次只能有一个整数,而不是100个整数及其所有开销和放入它们的列表。生成速度更快,内存使用更好,代码更高效。

除非我特别需要某件事的清单,否则我总是会支持 xrange()

xrange() is more efficient because instead of generating a list of objects, it just generates one object at a time. Instead of 100 integers, and all of their overhead, and the list to put them in, you just have one integer at a time. Faster generation, better memory use, more efficient code.

Unless I specifically need a list for something, I always favor xrange()
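
To make the memory point concrete, a hedged sketch using sys.getsizeof in Python 2; the byte counts below are from a 64-bit CPython 2.7 build, exclude the integer objects themselves, and will differ on other builds:

>>> import sys
>>> sys.getsizeof(range(1000000))    # a real list: header + one pointer per item
8000072
>>> sys.getsizeof(xrange(1000000))   # a fixed-size lazy object
40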


回答 5

range()返回一个列表,xrange()返回一个xrange对象。

xrange()更快,内存使用效率更高。但是收益不是很大。

列表使用的额外内存当然不只是浪费,列表还具有更多功能(切片,重复,插入等)。确切的差异可以在文档中找到。没有硬性规定,请使用所需的。

Python 3.0仍在开发中,但是IIRC range()与2.X的xrange()非常相似,并且list(range())可用于生成列表。

range() returns a list, xrange() returns an xrange object.

xrange() is a bit faster, and a bit more memory efficient. But the gain is not very large.

The extra memory used by a list is of course not just wasted, lists have more functionality (slice, repeat, insert, …). Exact differences can be found in the documentation. There is no hard and fast rule, use what is needed.

Python 3.0 is still in development, but IIRC range() will be very similar to xrange() of 2.X and list(range()) can be used to generate lists.


回答 6

我只想说,获取具有切片和索引功能的xrange对象并不难。我编写了一些代码,这些代码相当有效,并且在计数(迭代)时与xrange一样快。

from __future__ import division

def read_xrange(xrange_object):
    # returns the xrange object's start, stop, and step
    start = xrange_object[0]
    if len(xrange_object) > 1:
       step = xrange_object[1] - xrange_object[0]
    else:
        step = 1
    stop = xrange_object[-1] + step
    return start, stop, step

class Xrange(object):
    ''' creates an xrange-like object that supports slicing and indexing.
    ex: a = Xrange(20)
    a.index(10)
    will work

    Also a[:5]
    will return another Xrange object with the specified attributes

    Also allows for the conversion from an existing xrange object
    '''
    def __init__(self, *inputs):
        # allow inputs of xrange objects
        if len(inputs) == 1:
            test, = inputs
            if type(test) == xrange:
                self.xrange = test
                self.start, self.stop, self.step = read_xrange(test)
                return

        # or create one from start, stop, step
        self.start, self.step = 0, None
        if len(inputs) == 1:
            self.stop, = inputs
        elif len(inputs) == 2:
            self.start, self.stop = inputs
        elif len(inputs) == 3:
            self.start, self.stop, self.step = inputs
        else:
            raise ValueError(inputs)

        self.xrange = xrange(self.start, self.stop, self.step if self.step is not None else 1)  # xrange() rejects a None step

    def __iter__(self):
        return iter(self.xrange)

    def __getitem__(self, item):
        if type(item) is int:
            if item < 0:
                item += len(self)

            return self.xrange[item]

        if type(item) is slice:
            # get the indexes, and then convert to the number
            start, stop, step = item.start, item.stop, item.step
            start = start if start is not None else 0 # convert start = None to start = 0
            if start < 0:
                start += len(self)  # negative start counts back from the end
            start = self[start]
            if start < 0: raise IndexError(item)
            step = (self.step if self.step is not None else 1) * (step if step is not None else 1)
            stop = stop if stop is not None else self.xrange[-1]
            if stop < 0:
                stop += len(self)  # negative stop counts back from the end

            stop = self[stop]

            if stop > self.stop:
                raise IndexError
            if start < self.start:
                raise IndexError
            return Xrange(start, stop, step)

    def index(self, value):
        error = ValueError('object.index({0}): {0} not in object'.format(value))
        index = (value - self.start)/self.step
        if index % 1 != 0:
            raise error
        index = int(index)


        try:
            self.xrange[index]
        except (IndexError, TypeError):
            raise error
        return index

    def __len__(self):
        return len(self.xrange)

老实说,我认为整个问题有点傻,无论如何xrange都应该做所有这一切……

I would just like to say that it REALLY isn’t that difficult to get an xrange object with slice and indexing functionality. I have written some code that works pretty dang well and is just as fast as xrange for when it counts (iterations).

from __future__ import division

def read_xrange(xrange_object):
    # returns the xrange object's start, stop, and step
    start = xrange_object[0]
    if len(xrange_object) > 1:
       step = xrange_object[1] - xrange_object[0]
    else:
        step = 1
    stop = xrange_object[-1] + step
    return start, stop, step

class Xrange(object):
    ''' creates an xrange-like object that supports slicing and indexing.
    ex: a = Xrange(20)
    a.index(10)
    will work

    Also a[:5]
    will return another Xrange object with the specified attributes

    Also allows for the conversion from an existing xrange object
    '''
    def __init__(self, *inputs):
        # allow inputs of xrange objects
        if len(inputs) == 1:
            test, = inputs
            if type(test) == xrange:
                self.xrange = test
                self.start, self.stop, self.step = read_xrange(test)
                return

        # or create one from start, stop, step
        self.start, self.step = 0, None
        if len(inputs) == 1:
            self.stop, = inputs
        elif len(inputs) == 2:
            self.start, self.stop = inputs
        elif len(inputs) == 3:
            self.start, self.stop, self.step = inputs
        else:
            raise ValueError(inputs)

        self.xrange = xrange(self.start, self.stop, self.step if self.step is not None else 1)  # xrange() rejects a None step

    def __iter__(self):
        return iter(self.xrange)

    def __getitem__(self, item):
        if type(item) is int:
            if item < 0:
                item += len(self)

            return self.xrange[item]

        if type(item) is slice:
            # get the indexes, and then convert to the number
            start, stop, step = item.start, item.stop, item.step
            start = start if start is not None else 0 # convert start = None to start = 0
            if start < 0:
                start += len(self)  # negative start counts back from the end
            start = self[start]
            if start < 0: raise IndexError(item)
            step = (self.step if self.step is not None else 1) * (step if step is not None else 1)
            stop = stop if stop is not None else self.xrange[-1]
            if stop < 0:
                stop += len(self)  # negative stop counts back from the end

            stop = self[stop]

            if stop > self.stop:
                raise IndexError
            if start < self.start:
                raise IndexError
            return Xrange(start, stop, step)

    def index(self, value):
        error = ValueError('object.index({0}): {0} not in object'.format(value))
        index = (value - self.start)/self.step
        if index % 1 != 0:
            raise error
        index = int(index)


        try:
            self.xrange[index]
        except (IndexError, TypeError):
            raise error
        return index

    def __len__(self):
        return len(self.xrange)

Honestly, I think the whole issue is kind of silly and xrange should do all of this anyway…


回答 7

书中的一个很好的例子:Magnus Lie Hetland的《实用Python》

>>> zip(range(5), xrange(100000000))
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

我不建议在前面的示例中使用range而不是xrange -尽管只需要前五个数字,但range会计算所有数字,这可能会花费很多时间。使用xrange,这不是问题,因为它仅计算所需的那些数字。

是的,我读了@Brian的答案:在python 3中,无论如何range()都是生成器,而xrange()不存在。

A good example given in book: Practical Python By Magnus Lie Hetland

>>> zip(range(5), xrange(100000000))
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

I wouldn’t recommend using range instead of xrange in the preceding example—although only the first five numbers are needed, range calculates all the numbers, and that may take a lot of time. With xrange, this isn’t a problem because it calculates only those numbers needed.

Yes I read @Brian’s answer: In python 3, range() is a generator anyway and xrange() does not exist.
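
For completeness, a small Python 2 sketch: if you also want the pairing itself to be lazy, itertools.izip never builds a list at all (in Python 3 the built-in zip() already behaves this way):

from itertools import izip  # Python 2 only

# izip produces pairs on demand, so only the first five values of the
# huge xrange are ever generated:
pairs = list(izip(xrange(5), xrange(100000000)))
print pairs  # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]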


回答 8

出于以下原因选择范围:

1)xrange将在较新的Python版本中消失。这样可以轻松实现将来的兼容性。

2)range将承担与xrange相关的效率。

Go with range for these reasons:

1) xrange will be going away in newer Python versions. This gives you easy future compatibility.

2) range will take on the efficiencies associated with xrange.


回答 9

好的,对于xrange与range的权衡和优势,每个人都有不同的看法。它们大体上是正确的,xrange是一个迭代器,range充实并创建实际列表。在大多数情况下,您不会真正注意到两者之间的差异。(您可以将map与range一起使用,但不能与xrange一起使用,但是会占用更多内存。)

但是,我认为您真正想听到的是:首选是xrange。由于Python 3中的range是一个迭代器,因此代码转换工具2to3可以将xrange的所有用法正确转换为range,并且会针对range的使用抛出错误或警告。如果要确保将来轻松转换代码,就只使用xrange,并在确定需要列表时使用list(xrange)。这是我今年(2008年)在芝加哥PyCon的CPython冲刺中学到的。

Okay, everyone here has a different opinion as to the tradeoffs and advantages of xrange versus range. They’re mostly correct: xrange is an iterator, and range fleshes out and creates an actual list. For the majority of cases, you won’t really notice a difference between the two. (You can use map with range but not with xrange, but it uses up more memory.)

What I think you really want to hear, however, is that the preferred choice is xrange. Since range in Python 3 is an iterator, the code conversion tool 2to3 will correctly convert all uses of xrange to range, and will throw an error or warning for uses of range. If you want to be sure to easily convert your code in the future, you’ll use xrange only, and list(xrange) when you’re sure that you want a list. I learned this during the CPython sprint at PyCon this year (2008) in Chicago.


回答 10

  • range()range(1, 10)返回1到10个数字的列表,并将整个列表保存在内存中。
  • xrange():和range()一样,但不返回列表,而是返回一个按需生成范围内数字的对象。对于循环,这比range()略快,并且内存效率更高。xrange()对象类似迭代器,按需生成数字(惰性求值)。
In [1]: range(1,10)
Out[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9]

In [2]: xrange(10)
Out[2]: xrange(10)

In [3]: print xrange.__doc__
Out[3]: xrange([start,] stop[, step]) -> xrange object

range()xrange()执行与Python 3中相同的操作,并且在Python 3中不xrange()存在术语。 range()如果多次迭代同一序列,在某些情况下实际上可以更快。xrange()每次都必须重新构造整数对象,但是range()将拥有真正的整数对象。

  • range(): range(1, 10) returns a list from 1 to 10 numbers & hold whole list in memory.
  • xrange(): Like range(), but instead of returning a list, returns an object that generates the numbers in the range on demand. For looping, this is lightly faster than range() and more memory efficient. xrange() object like an iterator and generates the numbers on demand (Lazy Evaluation).
In [1]: range(1,10)
Out[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9]

In [2]: xrange(10)
Out[2]: xrange(10)

In [3]: print xrange.__doc__
Out[3]: xrange([start,] stop[, step]) -> xrange object

In Python 3, range() does what xrange() used to do in Python 2, and the name xrange() no longer exists. range() can actually be faster in some scenarios if you iterate over the same sequence multiple times: xrange() has to reconstruct the integer object every time, but range() will have real integer objects.


回答 11

虽然xrangerange大多数情况下要快,但性能差异却很小。下面的小程序比较了对a range和an的迭代xrange

import timeit
# Try various list sizes.
for list_len in [1, 10, 100, 1000, 10000, 100000, 1000000]:
  # Time doing a range and an xrange.
  rtime = timeit.timeit('a=0;\nfor n in range(%d): a += n'%list_len, number=1000)
  xrtime = timeit.timeit('a=0;\nfor n in xrange(%d): a += n'%list_len, number=1000)
  # Print the result
  print "Loop list of len %d: range=%.4f, xrange=%.4f"%(list_len, rtime, xrtime)

下面的结果显示xrange确实更快,但不足以使工作过度。

Loop list of len 1: range=0.0003, xrange=0.0003
Loop list of len 10: range=0.0013, xrange=0.0011
Loop list of len 100: range=0.0068, xrange=0.0034
Loop list of len 1000: range=0.0609, xrange=0.0438
Loop list of len 10000: range=0.5527, xrange=0.5266
Loop list of len 100000: range=10.1666, xrange=7.8481
Loop list of len 1000000: range=168.3425, xrange=155.8719

因此,请务必使用xrange,但是除非您使用的是受限制的硬件,否则不要为它担心太多。

While xrange is faster than range in most circumstances, the difference in performance is pretty minimal. The little program below compares iterating over a range and an xrange:

import timeit
# Try various list sizes.
for list_len in [1, 10, 100, 1000, 10000, 100000, 1000000]:
  # Time doing a range and an xrange.
  rtime = timeit.timeit('a=0;\nfor n in range(%d): a += n'%list_len, number=1000)
  xrtime = timeit.timeit('a=0;\nfor n in xrange(%d): a += n'%list_len, number=1000)
  # Print the result
  print "Loop list of len %d: range=%.4f, xrange=%.4f"%(list_len, rtime, xrtime)

The results below shows that xrange is indeed faster, but not enough to sweat over.

Loop list of len 1: range=0.0003, xrange=0.0003
Loop list of len 10: range=0.0013, xrange=0.0011
Loop list of len 100: range=0.0068, xrange=0.0034
Loop list of len 1000: range=0.0609, xrange=0.0438
Loop list of len 10000: range=0.5527, xrange=0.5266
Loop list of len 100000: range=10.1666, xrange=7.8481
Loop list of len 1000000: range=168.3425, xrange=155.8719

So by all means use xrange, but unless you’re on constrained hardware, don’t worry too much about it.


空集文字?

问题:空集文字?

[] =空 list

() =空 tuple

{} =空 dict

空有类似的记号set吗?还是我必须写set()

[] = empty list

() = empty tuple

{} = empty dict

Is there a similar notation for an empty set? Or do I have to write set()?


回答 0

不,空集没有文字语法。你必须写set()

No, there’s no literal syntax for the empty set. You have to write set().
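
A quick interactive check of why {} won’t do (Python 2.7 shown; Python 3 prints <class 'dict'> and <class 'set'>):

>>> type({})        # empty braces give a dict, not a set
<type 'dict'>
>>> type(set())
<type 'set'>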


回答 1

请务必使用set()来创建一个空集。

但是,如果您想打动别人,可以告诉他们在Python >= 3.5中(请参阅PEP 448),可以使用字面量和*创建一个空集,方法是:

>>> s = {*()}  # or {*{}} or {*[]}
>>> print(s)
set()

这基本上是{_ for _ in ()}的一种更简洁的写法,但是,请不要这样做。

By all means, please use set() to create an empty set.

But, if you want to impress people, tell them that you can create an empty set using literals and * with Python >= 3.5 (see PEP 448) by doing:

>>> s = {*()}  # or {*{}} or {*[]}
>>> print(s)
set()

this is basically a more condensed way of doing {_ for _ in ()}, but, don’t do this.


回答 2

只是为了扩展公认的答案:

从2.7和3.1版本起,Python已经有了{1,2,3}这种用法的set字面量,但{}本身仍然用于空字典。

Python 2.7(第一行在Python <2.7中无效)

>>> {1,2,3}.__class__
<type 'set'>
>>> {}.__class__
<type 'dict'>

Python 3.x

>>> {1,4,5}.__class__
<class 'set'>
>>> {}.__class__
<class 'dict'>

此处更多内容:https://docs.python.org/3/whatsnew/2.7.html#other-language-changes

Just to extend the accepted answer:

From version 2.7 and 3.1 python has got set literal {} in form of usage {1,2,3}, but {} itself still used for empty dict.

Python 2.7 (first line is invalid in Python <2.7)

>>> {1,2,3}.__class__
<type 'set'>
>>> {}.__class__
<type 'dict'>

Python 3.x

>>> {1,4,5}.__class__
<class 'set'>
>>> {}.__class__
<class 'dict'>

More here: https://docs.python.org/3/whatsnew/2.7.html#other-language-changes


回答 3

这取决于您是否要使用文字进行比较或赋值。

如果要将现有集设为空,则可以使用该.clear()方法,尤其是在要避免创建新对象的情况下。如果要进行比较,请使用set()或检查长度是否为0。

例:

#create a new set    
a=set([1,2,3,'foo','bar'])
#or, using a literal:
a={1,2,3,'foo','bar'}

#create an empty set
a=set()
#or, use the clear method
a.clear()

#comparison to a new blank set
if a==set():
    pass #do something

#length-checking comparison
if len(a)==0:
    pass #do something

It depends on if you want the literal for a comparison, or for assignment.

If you want to make an existing set empty, you can use the .clear() method, especially if you want to avoid creating a new object. If you want to do a comparison, use set() or check if the length is 0.

example:

#create a new set    
a=set([1,2,3,'foo','bar'])
#or, using a literal:
a={1,2,3,'foo','bar'}

#create an empty set
a=set()
#or, use the clear method
a.clear()

#comparison to a new blank set
if a==set():
    pass #do something

#length-checking comparison
if len(a)==0:
    pass #do something

回答 4

更加疯狂的想法是:既然Python 3接受Unicode标识符,您可以声明一个变量ϕ = frozenset()(ϕ为U+03D5)并使用它。

Adding to the crazy ideas: with Python 3 accepting unicode identifiers, you could declare a variable ϕ = frozenset() (ϕ is U+03D5) and use it instead.


回答 5

是。适用于非空dict / set的相同表示法适用于空dict / set。

注意非空dictset文字之间的区别:

{1: 'a', 2: 'b', 3: 'c'}:内部的若干键值对构成一个dict
{'aaa', 'bbb', 'ccc'}:内部的若干值构成一个set

所以:

{}==零个键值对==空dict
{*()}==空值元组==空set

但是事实是您可以做到,但这并不意味着您应该这样做。除非您有很强的理由,否则最好显式构造一个空集,例如:

a = set()

注意:正如评论中注意到的那样,{()}不是一个空集合。这是一个包含1个元素的集合:空元组。

Yes. The same notation that works for non-empty dict/set works for empty ones.

Notice the difference between non-empty dict and set literals:

{1: 'a', 2: 'b', 3: 'c'} — a number of key-value pairs inside makes a dict
{'aaa', 'bbb', 'ccc'} — a tuple of values inside makes a set

So:

{} == zero number of key-value pairs == empty dict
{*()} == empty tuple of values == empty set

However the fact, that you can do it, doesn’t mean you should. Unless you have some strong reasons, it’s better to construct an empty set explicitly, like:

a = set()

NB: As ctrueden noticed in comments, {()} is not an empty set. It’s a set with 1 element: empty tuple.


回答 6

有几种方法可以在Python中创建空Set:

  1. 使用 set()方法
    这是python中的内置方法,可在该变量中创建Empty set。
  2. 使用clear()方法(创造性的工程师技术LOL)
    请参见以下示例:

    sets = {"Hi", "How", "are", "You", "All"}
    type(sets)(此行输出:set)
    sets.clear()
    print(sets)(此行输出:set())
    type(sets)(此行输出:set)

因此,这是创建空Set的2种方法。

There are few ways to create empty Set in Python :

  1. Using set() method
    This is the built-in method in python that creates Empty set in that variable.
  2. Using clear() method (creative Engineer Technique LOL)
    See this Example:

    sets = {"Hi", "How", "are", "You", "All"}
    type(sets)  (This Line Output : set)
    sets.clear()
    print(sets)  (This Line Output : set())
    type(sets)  (This Line Output : set)

So, these are 2 ways to create an empty Set.


如何解决:“ UnicodeDecodeError:’ascii’编解码器无法解码字节”

问题:如何解决:“ UnicodeDecodeError:’ascii’编解码器无法解码字节”

as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File "/usr/local/bin/wok", line 4, in
Engine()
File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in init
self.load_pages()
File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
page.meta['content'] = page.renderer.render(page.original)
File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
return markdown(plain, Markdown.plugins)
File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 419, in markdown
return md.convert(text)
File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 281, in convert
source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!

如何解决?

在其他基于python的静态博客应用中,中文帖子可以成功发布。像这个程序:http : //github.com/vrypan/bucket3。在我的网站http://bc3.brite.biz/中,中文帖子可以成功发布。

as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File "/usr/local/bin/wok", line 4, in
Engine()
File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in init
self.load_pages()
File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
page.meta['content'] = page.renderer.render(page.original)
File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
return markdown(plain, Markdown.plugins)
File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 419, in markdown
return md.convert(text)
File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 281, in convert
source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!

How to fix it?

In some other python-based static blog apps, Chinese post can be published successfully. Such as this app: http://github.com/vrypan/bucket3. In my site http://bc3.brite.biz/, Chinese post can be published successfully.


回答 0

tl; dr /快速修复

  • 不要对Willy Nilly进行解码/编码
  • 不要假设您的字符串是UTF-8编码的
  • 尝试在代码中尽快将字符串转换为Unicode字符串
  • 修复您的语言环境:如何在Python 3.6中解决UnicodeDecodeError?
  • 不要试图使用快速reloadhack

Python 2.x中的Unicode Zen-完整版

在没有看到来源的情况下,很难知道根本原因,因此,我将不得不大体讲。

当您尝试将包含非ASCII的Python 2.x str转换为Unicode字符串,而又未指定原始字符串的编码时,通常就会发生UnicodeDecodeError: 'ascii' codec can't decode byte。

简而言之,Unicode字符串是一种完全独立的Python字符串类型,不包含任何编码。它们仅保存Unicode码点,因此可以保存整个字符集中的任何Unicode码点。字符串包含编码后的文本,无论是UTF-8、UTF-16、ISO-8859-1、GBK还是Big5等。字符串被解码为Unicode,Unicode被编码为字符串。文件和文本数据始终以编码后的字符串传输。

Markdown模块的作者可能会使用unicode()(抛出异常的地方)作为其余代码的质量门-它会转换ASCII或将现有的Unicode字符串重新包装为新的Unicode字符串。Markdown作者不知道传入字符串的编码,因此在传递给Markdown之前,将依靠您将字符串解码为Unicode字符串。

可以使用u字符串前缀在代码中声明Unicode 字符串。例如

>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>

Unicode字符串也可能来自文件,数据库和网络模块。发生这种情况时,您无需担心编码。

陷阱

即使不显式调用unicode(),也可能发生从str到Unicode的转换。

以下情况导致UnicodeDecodeError异常:

# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'         

例子

在下图中,您可以看到单词café如何根据终端类型以“UTF-8”或“Cp1252”编码。在两个示例中,caf都是常规的ascii。在UTF-8中,é使用两个字节编码。在“Cp1252”中,é是0xE9(这恰好也是它的Unicode码点值,这并非巧合)。正确的decode()被调用,并成功转换为Python Unicode:(图:字符串被转换为Python Unicode字符串)

在此图中,decode()以ascii被调用(这与调用unicode()而不给出编码是一样的)。由于ASCII不能包含大于0x7F的字节,这将引发UnicodeDecodeError异常:

将字符串转换为编码错误的Python Unicode字符串的图

Unicode三明治

最好在代码中形成一个Unicode三明治,将所有传入数据解码为Unicode字符串,使用Unicode,然后在输出时编码为strs。这使您不必担心代码中间的字符串编码。

输入/解码

源代码

如果您需要将非ASCII写进源代码,只需在字符串前面加上u前缀来创建Unicode字符串。例如

u'Zürich'

为了允许Python解码您的源代码,您将需要添加一个编码标头以匹配文件的实际编码。例如,如果您的文件编码为“ UTF-8”,则可以使用:

# encoding: utf-8

仅当源代码中包含非ASCII时才需要这样做。

档案

通常从文件接收非ASCII数据。io模块提供了一个TextWrapper,它使用给定的encoding即时解码您的文件。您必须为文件使用正确的编码,它并不容易被猜出。例如,对于UTF-8文件:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
     my_unicode_string = my_file.read() 

这样my_unicode_string就适合传递给Markdown了。如果read()行抛出UnicodeDecodeError,则您可能使用了错误的编码值。

CSV文件

Python 2.7 CSV模块不支持非ASCII字符😩。但是,https://pypi.python.org/pypi/backports.csv提供了帮助

像上面一样使用它,但是将打开的文件传递给它:

from backports import csv
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    for row in csv.reader(my_file):
        yield row

资料库

大多数Python数据库驱动程序都可以Unicode格式返回数据,但是通常需要一些配置。始终对SQL查询使用Unicode字符串。

MySQL

在连接字符串中添加:

charset='utf8',
use_unicode=True

例如

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
PostgreSQL

加:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

网页几乎可以采用任何编码方式进行编码。Content-type报头应包含一个charset字段来提示编码,然后可以根据该值手动解码内容。另外,Python-Requests会在response.text中返回Unicode。

手动地

如果必须手动解码字符串,则可以简单地执行my_string.decode(encoding),其中encoding是适当的编码。此处提供了Python 2.x支持的编解码器:标准编码。同样,如果您得到UnicodeDecodeError,那么您可能用错了编码。

三明治的肉

像正常strs一样使用Unicode。

输出量

标准输出/打印

print通过标准输出流进行写入。Python尝试在stdout上配置编码器,以便将Unicode编码为控制台的编码。例如,如果Linux shell的locale是en_GB.UTF-8,则输出将被编码为UTF-8。在Windows上,您将被限制为8位代码页。

错误配置的控制台(例如损坏的语言环境)可能导致意外的打印错误。PYTHONIOENCODING环境变量可以强制对stdout进行编码。

档案

就像输入一样,io.open可用于将Unicode透明地转换为编码的字节字符串。

数据库

用于读取的相同配置将允许直接编写Unicode。

Python 3

Python 3并不比Python 2.x更具Unicode能力,但是在这个主题上的混乱要少一些。例如,常规的str现在是Unicode字符串,而旧的str现在是bytes。

默认编码为UTF-8,因此,如果您对字节字符串调用.decode()而未提供编码,Python 3将使用UTF-8编码。这大概解决了50%的人的Unicode问题。

此外,open()默认情况下以文本模式运行,因此返回解码后的str(即Unicode)。编码来自您的语言环境,在Un*x系统上通常是UTF-8,在Windows机器上通常是8位代码页,例如Windows-1251。

为什么不应该使用 sys.setdefaultencoding('utf8')

这是一个令人讨厌的hack(这正是您不得不使用reload的原因),它只会掩盖问题并阻碍您迁移到Python 3.x。请理解问题,解决根本原因,并享受Unicode zen。请参阅为什么我们不应该在py脚本中使用sys.setdefaultencoding("utf-8")?了解更多详情

tl;dr / quick fix

  • Don’t decode/encode willy nilly
  • Don’t assume your strings are UTF-8 encoded
  • Try to convert strings to Unicode strings as soon as possible in your code
  • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
  • Don’t be tempted to use quick reload hacks

Unicode Zen in Python 2.x – The Long Version

Without seeing the source it’s difficult to know the root cause, so I’ll have to speak generally.

UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code – it will convert ASCII or re-wrap existing Unicodes strings to a new Unicode string. The Markdown authors can’t know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>

Unicode strings may also come from file, databases and network modules. When this happens, you don’t need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don’t explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'         
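
A minimal sketch of the corresponding fix, assuming the byte string really is UTF-8 encoded: decode it to Unicode before mixing it with Unicode strings.

# -*- coding: utf-8 -*-
price = '€'                      # a UTF-8 byte string (str) in Python 2
price_u = price.decode('utf-8')  # now a unicode object
print u"The currency is: {}".format(price_u)  # no UnicodeDecodeError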

Examples

In the following diagram, you can see how the word café has been encoded in either “UTF-8” or “Cp1252” encoding depending on the terminal type. In both examples, caf is just regular ascii. In UTF-8, é is encoded using two bytes. In “Cp1252”, é is 0xE9 (which also happens to be the Unicode code point value; it’s no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful: Diagram of a string being converted to a Python Unicode string

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can’t contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

Diagram of a string being converted to a Python Unicode string with the wrong encoding

The Unicode Sandwich

It’s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u'Zürich'

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as ‘UTF-8’, you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Files

Usually non-ASCII data is received from a file. The io module provides a TextWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file – it can’t be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
     my_unicode_string = my_file.read() 

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you’ve probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

from backports import csv
import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    for row in csv.reader(my_file):
        yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

charset='utf8',
use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
PostgreSQL

Add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you’ve probably got the wrong encoding.
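
A hedged example of manual decoding in Python 2; the 'replace' error handler is a best-effort fallback for when the encoding is uncertain:

data = 'caf\xc3\xa9'             # the UTF-8 bytes for u'café'
text = data.decode('utf-8')      # u'caf\xe9'

lone_byte = '\xe9'               # not valid UTF-8 on its own
best_effort = lone_byte.decode('utf-8', 'replace')  # u'\ufffd'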

The meat of the sandwich

Work with Unicodes as you would normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console’s encoding. For example, if a Linux shell’s locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8bit code page.

An incorrectly configured console, such as corrupt locale, can lead to unexpected print errors. PYTHONIOENCODING environment variable can force the encoding for stdout.

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people’s Unicode problems.

Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
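
A short Python 3 sketch of the str/bytes split described above (assuming a UTF-8 capable console):

data = 'café'.encode()    # str -> bytes; UTF-8 is the default codec
print(data)               # b'caf\xc3\xa9'
print(data.decode())      # back to str: 'café'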

Why you shouldn’t use sys.setdefaultencoding('utf8')

It’s a nasty hack (there’s a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding(“utf-8”) in a py script? for further details


回答 1

终于我明白了:

as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
# encoding=utf8  
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

让我检查一下:

as3:~/ngokevin-site# python
Python 2.7.6 (default, Dec  6 2013, 14:49:02)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>>

上面显示了python的默认编码为utf8。然后错误不再存在。

Finally I got it:

as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
# encoding=utf8  
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

Let me check:

as3:~/ngokevin-site# python
Python 2.7.6 (default, Dec  6 2013, 14:49:02)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>>

The above shows the default encoding of python is utf8. Then the error is no more.


回答 2

这是经典的“unicode问题”。我认为要完全解释正在发生的事情,超出了一个StackOverflow答案的范围。

这里有很好的解释。

在非常简短的摘要中,您已经将某些内容解释为字节字符串,并将其解码为Unicode字符,但是默认编解码器(ascii)失败了。

我为您指出的演示文稿提供了避免这种情况的建议。使您的代码为“ unicode三明治”。在Python 2中,使用from __future__ import unicode_literals帮助。

更新:如何固定代码:

好的,在变量source中,您有一些字节。从您的问题中看不出它们是如何进来的,也许您是从网络表单中读取的?无论如何,它们不是用ascii编码的,但是python会假设它们是ASCII并尝试将它们转换为unicode。您需要明确告诉它编码是什么,这意味着您需要知道编码是什么!这并不总是容易的,完全取决于此字符串的来源。您可以尝试一些常见的编码,例如UTF-8。将编码作为第二个参数传给unicode():

source = unicode(source, 'utf-8')

This is the classic “unicode issue”. I believe that completely explaining what is happening is beyond the scope of a StackOverflow answer.

It is well explained here.

In very brief summary, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.

The presentation I pointed you to provides advice for avoiding this. Make your code a “unicode sandwich”. In Python 2, the use of from __future__ import unicode_literals helps.

Update: how can the code be fixed:

OK – in your variable “source” you have some bytes. It is not clear from your question how they got in there – maybe you read them from a web form? In any case, they are not encoded with ascii, but python is trying to convert them to unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings – for example UTF-8. You tell unicode() the encoding as a second parameter:

source = unicode(source, 'utf-8')

回答 3

在某些情况下,当您检查默认编码(print sys.getdefaultencoding())时,它将返回您正在使用ASCII。如果更改为UTF-8,则无法使用,具体取决于变量的内容。我发现了另一种方法:

import sys
reload(sys)  
sys.setdefaultencoding('Cp1252')

In some cases, when you check your default encoding (print sys.getdefaultencoding()), it returns that you are using ASCII. If you change to UTF-8, it doesn’t work, depending on the content of your variable. I found another way:

import sys
reload(sys)  
sys.setdefaultencoding('Cp1252')

回答 4

我正在搜索以解决以下错误消息:

unicodedecodeerror:’ascii’编解码器无法解码位置5454的字节0xe2:序数不在范围内(128)

我终于通过指定’encoding’来解决它:

f = open('../glove/glove.6B.100d.txt', encoding="utf-8")

希望它能对您有所帮助。

I was searching to solve the following error message:

unicodedecodeerror: ‘ascii’ codec can’t decode byte 0xe2 in position 5454: ordinal not in range(128)

I finally got it fixed by specifying ‘encoding’:

f = open('../glove/glove.6B.100d.txt', encoding="utf-8")

Wish it could help you too.


回答 5

"UnicodeDecodeError: 'ascii' codec can't decode byte"

发生此错误的原因:input_string必须是unicode,但给出了str

"TypeError: Decoding Unicode is not supported"

发生此错误的原因:尝试将unicode input_string转换为unicode


因此,请首先检查您的input_string是否为str,并在必要时将其转换为unicode:

if isinstance(input_string, str):
   input_string = unicode(input_string, 'utf-8')

其次,以上内容仅更改类型,但不删除非ascii字符。如果要删除非ASCII字符:

if isinstance(input_string, str):
   input_string = input_string.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.

elif isinstance(input_string, unicode):
   input_string = input_string.encode('ascii', 'ignore')
"UnicodeDecodeError: 'ascii' codec can't decode byte"

Cause of this error: input_string must be unicode but str was given

"TypeError: Decoding Unicode is not supported"

Cause of this error: trying to convert unicode input_string into unicode


So first check that your input_string is str and convert to unicode if necessary:

if isinstance(input_string, str):
   input_string = unicode(input_string, 'utf-8')

Secondly, the above just changes the type but does not remove non ascii characters. If you want to remove non-ascii characters:

if isinstance(input_string, str):
   input_string = input_string.decode('ascii', 'ignore').encode('ascii') #note: this removes the character and encodes back to string.

elif isinstance(input_string, unicode):
   input_string = input_string.encode('ascii', 'ignore')

回答 6

我发现最好的方法是始终转换为unicode,但这很难做到,因为在实践中,您必须在自己编写的每一个涉及字符串处理的函数和方法中检查并转换每个参数。

因此,我想出了以下方法来保证从任一输入的unicode或字节字符串。简而言之,请包含并使用以下lambda:

# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt) 
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)

例子:

text='Some string with codes > 127, like Zürich'
utext=u'Some string with codes > 127, like Zürich'
print "==> with _u, _uu"
print _u(text), type(_u(text))
print _u(utext), type(_u(utext))
print _uu(text, utext), type(_uu(text, utext))
print "==> with u8, uu8"
print _u8(text), type(_u8(text))
print _u8(utext), type(_u8(utext))
print _uu8(text, utext), type(_uu8(text, utext))
# with % formatting, always use _u() and _uu()
print "Some unknown input %s" % _u(text)
print "Multiple inputs %s, %s" % _uu(text, text)
# but with string.format be sure to always work with unicode strings
print u"Also works with formats: {}".format(_u(text))
print u"Also works with formats: {},{}".format(*_uu(text, text))
# ... or use _u8 and _uu8, because string.format expects byte strings
print "Also works with formats: {}".format(_u8(text))
print "Also works with formats: {},{}".format(*_uu8(text, text))

这里还有关于此问题的更多讨论。

I find the best is to always convert to unicode – but this is difficult to achieve because in practice you’d have to check and convert every argument to every function and method you ever write that includes some form of string processing.

So I came up with the following approach to either guarantee unicodes or byte strings, from either input. In short, include and use the following lambdas:

# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt) 
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)

Examples:

text='Some string with codes > 127, like Zürich'
utext=u'Some string with codes > 127, like Zürich'
print "==> with _u, _uu"
print _u(text), type(_u(text))
print _u(utext), type(_u(utext))
print _uu(text, utext), type(_uu(text, utext))
print "==> with u8, uu8"
print _u8(text), type(_u8(text))
print _u8(utext), type(_u8(utext))
print _uu8(text, utext), type(_uu8(text, utext))
# with % formatting, always use _u() and _uu()
print "Some unknown input %s" % _u(text)
print "Multiple inputs %s, %s" % _uu(text, text)
# but with string.format be sure to always work with unicode strings
print u"Also works with formats: {}".format(_u(text))
print u"Also works with formats: {},{}".format(*_uu(text, text))
# ... or use _u8 and _uu8, because string.format expects byte strings
print "Also works with formats: {}".format(_u8(text))
print "Also works with formats: {},{}".format(*_uu8(text, text))

Here’s some more reasoning about this.


回答 7

为了在Ubuntu安装中的操作系统级别解决此问题,请检查以下内容:

$ locale charmap

如果你得到

locale: Cannot set LC_CTYPE to default locale: No such file or directory

代替

UTF-8

然后像这样设置LC_CTYPE和LC_ALL:

$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"

In order to resolve this on an operating system level in an Ubuntu installation check the following:

$ locale charmap

If you get

locale: Cannot set LC_CTYPE to default locale: No such file or directory

instead of

UTF-8

then set LC_CTYPE and LC_ALL like this:

$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"

回答 8

编码将unicode对象转换为字符串对象。我认为您正在尝试对字符串对象进行编码。首先将结果转换为unicode对象,然后将该unicode对象编码为’utf-8’。例如

    result = yourFunction()
    result.decode().encode('utf-8')

Encode converts a unicode object in to a string object. I think you are trying to encode a string object. first convert your result into unicode object and then encode that unicode object into ‘utf-8’. for example

    result = yourFunction()
    result.decode().encode('utf-8')

回答 9

我遇到了同样的问题,但是它不适用于Python3。我遵循了这一点,它解决了我的问题:

enc = sys.getdefaultencoding()
file = open(menu, "r", encoding = enc)

读取/写入文件时,必须设置编码。

I had the same problem but it didn’t work for Python 3. I followed this and it solved my problem:

enc = sys.getdefaultencoding()
file = open(menu, "r", encoding = enc)

You have to set the encoding when you are reading/writing the file.


回答 10

我遇到了相同的错误,而这解决了我的问题。谢谢!python 2和python 3在unicode处理方面的差异会使pickle文件难以兼容地加载,因此请使用python pickle的encoding参数。我的文件最初是用python 2.x保存的,当我尝试在python 3.7中打开这些pickle数据时,下面的链接帮助我解决了类似的问题: https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/ 我把load_pickle函数复制到脚本中,并在加载input_data时像这样调用load_pickle(pickle_file):

input_data = load_pickle("my_dataset.pkl")

load_pickle函数在这里:

def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError as e:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data

Got a same error and this solved my error. Thanks! python 2 and python 3 differing in unicode handling is making pickled files quite incompatible to load. So Use python pickle’s encoding argument. Link below helped me solve the similar problem when I was trying to open pickled data from my python 3.7, while my file was saved originally in python 2.x version. https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/ I copy the load_pickle function in my script and called the load_pickle(pickle_file) while loading my input_data like this:

input_data = load_pickle("my_dataset.pkl")

The load_pickle function is here:

def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError as e:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data

回答 11

这为我工作:

    file = open('docs/my_messy_doc.pdf', 'rb')

This worked for me:

    file = open('docs/my_messy_doc.pdf', 'rb')

回答 12

简而言之,为了确保在Python 2中正确处理unicode:

  • 使用io.open读/写文件
  • 采用 from __future__ import unicode_literals
  • 配置其他数据输入/输出(例如数据库,网络)以使用unicode
  • 如果您无法将输出配置为utf-8,则在输出前进行转换:print(text.encode('ascii', 'replace').decode())

有关说明,请参见@Alastair McCormack的详细答案

In short, to ensure proper unicode handling in Python 2:

  • use io.open for reading/writing files
  • use from __future__ import unicode_literals
  • configure other data inputs/outputs (e.g., databases, network) to use unicode
  • if you cannot configure outputs to utf-8, convert your output for them print(text.encode('ascii', 'replace').decode())

For explanations, see @Alastair McCormack’s detailed answer. A combined sketch of these points follows.
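
A minimal sketch under those assumptions (the filename is illustrative):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import io

# decode on the way in, work in unicode, degrade gracefully on the way out
with io.open('notes.txt', 'r', encoding='utf-8') as f:
    text = f.read()                  # unicode throughout the program

print(text.encode('ascii', 'replace').decode())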


回答 13

我遇到相同的错误,URL包含非ascii字符(值大于128的字节),我的解决方案是:

url = url.decode('utf8').encode('utf-8')

注意:utf-8,utf8只是别名。仅使用’utf8’或’utf-8’应该以相同的方式工作

就我而言,在Python 2.7中这样做有效。我猜想这个赋值更改了str内部表示中的“某些内容”,也就是说,它强制对url底层的字节序列进行正确的解码,最终将字符串放入一个utf-8的str中,所有的魔法都发生在正确的位置。Python中的Unicode对我来说是黑魔法。希望有用

I had the same error, with URLs containing non-ascii chars (bytes with values > 128), my solution:

url = url.decode('utf8').encode('utf-8')

Note: utf-8, utf8 are simply aliases . Using only ‘utf8’ or ‘utf-8’ should work in the same way

In my case, worked for me, in Python 2.7, I suppose this assignment changed ‘something’ in the str internal representation–i.e., it forces the right decoding of the backed byte sequence in url and finally puts the string into a utf-8 str with all the magic in the right place. Unicode in Python is black magic for me. Hope useful


回答 14

我在字符串“Pastelería Mallorca”上遇到了相同的问题,并这样解决了:

unicode("Pastelería Mallorca", 'latin-1')

I got the same problem with the string “Pastelería Mallorca” and I solved with:

unicode("Pastelería Mallorca", 'latin-1')

回答 15

在一个Django(1.9.10)/ Python 2.7.5项目中,我经常遇到UnicodeDecodeError异常,主要是在我尝试将unicode字符串提供给日志记录时。我为任意对象创建了一个辅助函数,基本上将其格式化为8位ascii字符串,并将不在码表中的任何字符替换为'?'。我认为这不是最佳解决方案,但由于默认编码为ascii(而且我不想更改它),它可以胜任:

def encode_for_logging(c, encoding='ascii'):
    if isinstance(c, basestring):
        return c.encode(encoding, 'replace')
    elif isinstance(c, Iterable):
        c_ = []
        for v in c:
            c_.append(encode_for_logging(v, encoding))
        return c_
    else:
        return encode_for_logging(unicode(c))

In a Django (1.9.10)/Python 2.7.5 project I have frequent UnicodeDecodeError exceptions; mainly when I try to feed unicode strings to logging. I made a helper function for arbitrary objects to basically format to 8-bit ascii strings and replacing any characters not in the table to ‘?’. I think it’s not the best solution but since the default encoding is ascii (and i don’t want to change it) it will do:

def encode_for_logging(c, encoding='ascii'):
    if isinstance(c, basestring):
        return c.encode(encoding, 'replace')
    elif isinstance(c, Iterable):
        c_ = []
        for v in c:
            c_.append(encode_for_logging(v, encoding))
        return c_
    else:
        return encode_for_logging(unicode(c))

回答 16

当我们的字符串中包含一些非ASCII字符,并且我们在没有正确解码的情况下对该字符串执行操作时,就会发生此错误。这帮助我解决了问题。我正在读取一个包含ID、Text列的CSV文件,并按如下方式解码其中的字符:

train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
    print("ID :" + i[0])
    text = i[1].decode("utf-8",errors="ignore").strip().lower()
    print("Text: " + text)

This error occurs when there are some non ASCII characters in our string and we are performing any operations on that string without proper decoding. This helped me solve my problem. I am reading a CSV file with columns ID,Text and decoding characters in it as below:

train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
    print("ID :" + i[0])
    text = i[1].decode("utf-8",errors="ignore").strip().lower()
    print("Text: " + text)

回答 17

这是我的解决方案,只需添加编码即可。 with open(file, encoding='utf8') as f

并且由于读取GloVe文件会花费很长时间,我建议将GloVe文件转换为numpy文件。下次读取嵌入权重时,这将节省您的时间。

import numpy as np
from tqdm import tqdm


def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)

np.save('glove_embeddings.npy', embeddings) 

要点链接:https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227

Here is my solution, just add the encoding. with open(file, encoding='utf8') as f

And because reading the glove file will take a long time, I recommend converting the glove file to a numpy file. The next time you read the embedding weights, it will save you time.

import numpy as np
from tqdm import tqdm


def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)

np.save('glove_embeddings.npy', embeddings) 

Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227


回答 18

在您的Python文件顶部指定:#encoding = utf-8,它应该可以解决此问题

Specify: # encoding= utf-8 at the top of your Python File, It should fix the issue


如何计算pandas DataFrame列中的NaN值

问题:如何计算pandas DataFrame列中的NaN值

我有一份数据,想统计其中NaN的数量,这样如果它小于某个阈值,我就删除该列。我找了一下,但没能找到实现此功能的函数。有value_counts,但对我来说会很慢,因为大多数值都互不相同,而我只想统计NaN的数量。

I have data in which I want to find the number of NaN, so that if it is less than some threshold, I will drop this column. I looked, but wasn’t able to find any function for this. There is value_counts, but it would be slow for me, because most of the values are distinct and I want the count of NaN only.


回答 0

您可以使用isna()方法(或其别名isnull(),后者还兼容<0.21.0的旧版pandas),然后求和来统计NaN值。对于单独一列:

In [1]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isna().sum()   # or s.isnull().sum() for older pandas versions
Out[4]: 2

对于几列,它也适用:

In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

In [6]: df.isna().sum()
Out[6]:
a    1
b    2
dtype: int64

You can use the isna() method (or its alias isnull() which is also compatible with older pandas versions < 0.21.0) and then sum to count the NaN values. For one column:

In [1]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isna().sum()   # or s.isnull().sum() for older pandas versions
Out[4]: 2

For several columns, it also works:

In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

In [6]: df.isna().sum()
Out[6]:
a    1
b    2
dtype: int64
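
Tying this back to the original goal of dropping columns past a NaN threshold: pandas can do that directly with dropna, whose thresh argument counts required non-NaN values (the threshold below is illustrative):

In [7]: df.dropna(axis=1, thresh=2)   # keep columns with at least 2 non-NaN values
Out[7]:
     a
0  1.0
1  2.0
2  NaN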

回答 1

您可以从非Nan值的计数中减去总长度:

count_nan = len(df) - df.count()

您应该在自己的数据上计时。对于小型Series,与isnull解决方案相比,它获得了3倍的加速。

You could subtract the total length from the count of non-nan values:

count_nan = len(df) - df.count()

You should time it on your data. For a small Series I got a 3x speed up in comparison with the isnull solution.
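
For instance, with the same df as in the accepted answer:

In [7]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

In [8]: len(df) - df.count()
Out[8]:
a    1
b    2
dtype: int64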


回答 2

假设df是一个熊猫DataFrame。

然后,

df.isnull().sum(axis = 0)

这将在每列中提供NaN值的数量。

如果需要,可以在每行中输入NaN值,

df.isnull().sum(axis = 1)

Lets assume df is a pandas DataFrame.

Then,

df.isnull().sum(axis = 0)

This will give number of NaN values in every column.

If you need, NaN values in every row,

df.isnull().sum(axis = 1)
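
With the df from the accepted answer, the per-row variant looks like this:

In [7]: df.isnull().sum(axis = 1)
Out[7]:
0    1
1    0
2    2
dtype: int64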

回答 3

根据投票最多的答案,我们可以轻松定义一个函数,该函数为我们提供一个数据框,以预览每列中的缺失值和缺失值的百分比:

def missing_values_table(df):
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns

Based on the most voted answer we can easily define a function that gives us a dataframe to preview the missing values and the % of missing values in each column:

def missing_values_table(df):
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns
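
A hypothetical call, reusing the small df from the earlier answers; the output is roughly as shown (percentages rounded to one decimal):

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})
missing_values_table(df)
# Your selected dataframe has 2 columns.
# There are 2 columns that have missing values.
#    Missing Values  % of Total Values
# b               2               66.7
# a               1               33.3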

回答 4

从pandas 0.14.1开始,我在这里提出的为value_counts方法增加关键字参数的建议已经实现:

import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
for col in df:
    print df[col].value_counts(dropna=False)

2     1
 1     1
NaN    1
dtype: int64
NaN    2
 1     1
dtype: int64

Since pandas 0.14.1 my suggestion here to have a keyword argument in the value_counts method has been implemented:

import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
for col in df:
    print df[col].value_counts(dropna=False)

2     1
 1     1
NaN    1
dtype: int64
NaN    2
 1     1
dtype: int64

回答 5

如果只是要统计pandas某一列中的nan值,这是一种快速的方法

import pandas as pd
## df1 as an example data frame 
## col1 name of column for which you want to calculate the nan values
sum(pd.isnull(df1['col1']))

if its just counting nan values in a pandas column here is a quick way

import pandas as pd
## df1 as an example data frame 
## col1 name of column for which you want to calculate the nan values
sum(pd.isnull(df1['col1']))

回答 6

如果您正在使用Jupyter Notebook,不妨试试……

 %%timeit
 df.isnull().any().any()

要么

 %timeit 
 df.isnull().values.sum()

或者,数据中是否存在NaN,如果是,在哪里?

 df.isnull().any()

if you are using Jupyter Notebook, How about….

 %%timeit
 df.isnull().any().any()

or

 %timeit 
 df.isnull().values.sum()

or, are there anywhere NaNs in the data, if yes, where?

 df.isnull().any()

回答 7

下面将按降序打印所有Nan列。

df.isnull().sum().sort_values(ascending = False)

要么

下面将按降序打印前15 Nan列。

df.isnull().sum().sort_values(ascending = False).head(15)

The below will print all the Nan columns in descending order.

df.isnull().sum().sort_values(ascending = False)

or

The below will print first 15 Nan columns in descending order.

df.isnull().sum().sort_values(ascending = False).head(15)

回答 8

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])

results 
'''
  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0
'''

您可以使用以下函数,它将以Dataframe的形式给出输出

  • 零值
  • 缺失值
  • 占总价值的百分比
  • 总零缺失值
  • 总零缺失值百分比
  • 数据类型

只需复制并粘贴以下函数,然后通过传递您的pandas Dataframe来调用它

def missing_zero_values_table(df):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(results)

输出量

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

如果要保持简单,则可以使用以下函数获取%的缺失值

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)
'''
Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64
'''
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])

results 
'''
  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0
'''

You can use the following function, which will give you output in a DataFrame with:

  • Zero Values
  • Missing Values
  • % of Total Values
  • Total Zero Missing Values
  • % Total Zero Missing Values
  • Data Type

Just copy and paste the following function and call it, passing in your pandas DataFrame:

def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    # keep only the columns that actually have missing values, worst first
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) + " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index=False)
    return mz_table

missing_zero_values_table(results)

Output

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple, then you can use the following function to get missing values as a percentage:

def missing(dff):
    print(round(dff.isnull().sum() * 100 / len(dff), 2).sort_values(ascending=False))


missing(results)
'''
Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64
'''

回答 9

To count zeroes:

df[df == 0].count(axis=0)

To count NaN:

df.isnull().sum()

or

df.isna().sum()

回答 10

You can use the value_counts method and look up the count stored under np.nan:

s.value_counts(dropna=False)[np.nan]
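
For example, a minimal sketch with a made-up Series:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan])

# dropna=False keeps NaN as its own bucket in the counts
print(s.value_counts(dropna=False)[np.nan])  # 2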

回答 11

Use the below to count NaNs in a particular column:

dataframe.columnName.isnull().sum()

回答 12

df1.isnull().sum()

This will do the trick.


回答 13

Here is the code for counting null values column-wise:

df.isna().sum()

回答 14

There is a nice Dzone article from July 2017 which details various ways of summarising NaN values. Check it out here.

The article I have cited provides additional value by: (1) showing a way to count and display the NaN counts for every column, so that one can easily decide whether or not to discard those columns, and (2) demonstrating a way to select the specific rows which have NaNs so that they may be selectively discarded or imputed.

Here's a quick example to demonstrate the utility of the approach – with only a few columns its usefulness is perhaps not obvious, but I found it to be of help for larger DataFrames.

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# Check whether there are null values in columns
null_columns = df.columns[df.isnull().any()]
print(df[null_columns].isnull().sum())

# One can follow along further per the cited article

回答 15

One other simple option not suggested yet, to just count NaNs, is to filter down to the rows with NaN and read the count off the shape.

df[df['col_name'].isnull()]['col_name'].shape
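
A minimal sketch of the idea on a made-up single-column frame; since .shape is a tuple, .shape[0] yields the count itself:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': [1, np.nan, 3, np.nan]})

print(df[df['col_name'].isnull()]['col_name'].shape)     # (2,)
print(df[df['col_name'].isnull()]['col_name'].shape[0])  # 2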

回答 16

df.isnull().sum() will give the column-wise sum of missing values.

If you want to know the sum of missing values in a particular column, then the following will work: df.column.isnull().sum() (or df['column'].isnull().sum(), which also works when the column name is not a valid attribute name).
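
For instance, a quick sketch on a small made-up frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})

print(df.isnull().sum())       # per-column totals: a -> 1, b -> 2
print(df['a'].isnull().sum())  # a single column: 1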


回答 17

Based on the answers already given, plus some improvements, this is my approach:

def PercentageMissin(Dataset):
    """Return the percentage of missing values per column in a dataset."""
    if isinstance(Dataset, pd.DataFrame):
        adict = {}  # keys are column names, values are the percentage of missing values in that column
        for col in Dataset.columns:
            adict[col] = (np.count_nonzero(Dataset[col].isnull()) * 100) / len(Dataset[col])
        return pd.DataFrame(adict, index=['% of missing'], columns=adict.keys())
    else:
        raise TypeError("can only be used with a pandas DataFrame")
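
A quick usage sketch on a small made-up frame (the percentages in the comment are what pandas would print, to its default six decimal places):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})
print(PercentageMissin(df))
#                       a          b
# % of missing  33.333333  66.666667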

回答 18

In case you need to get the non-NA (non-None) and NA (None) counts across different groups pulled out by groupby:

gdf = df.groupby(['ColumnToGroupBy'])

def countna(x):
    return (x.isna()).sum()

gdf.agg(['count', countna, 'size'])

This returns the counts of non-NA, NA and total number of entries per group.
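
For instance, reusing the countna helper above on a tiny made-up frame (the column names here are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ColumnToGroupBy': ['x', 'x', 'y'],
                   'value': [1.0, np.nan, 2.0]})

gdf = df.groupby(['ColumnToGroupBy'])
print(gdf.agg(['count', countna, 'size'])['value'])
#                  count  countna  size
# ColumnToGroupBy
# x                    1        1     2
# y                    1        0     1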


回答 19

Used the solution proposed by @sushmit in my code.

A possible variation of the same can be:

colNullCnt = []
for col in df1.columns:
    colNullCnt.append([col, sum(pd.isnull(df1[col]))])

The advantage of this is that it returns the result for each of the columns in the df.


回答 20

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# count the NaNs in a column
num_nan_a = df.loc[pd.isna(df['a']), 'a'].shape[0]
num_nan_b = df.loc[pd.isna(df['b']), 'b'].shape[0]

# summarize the counts
print(df)
print(' ')
print(f"There are {num_nan_a} NaNs in column a")
print(f"There are {num_nan_b} NaNs in column b")

Gives as output:

     a    b
0  1.0  NaN
1  2.0  1.0
2  NaN  NaN

There are 1 NaNs in column a
There are 2 NaNs in column b

回答 21

Suppose you want to get the number of missing values (NaN) in a column (series) known as price in a DataFrame called reviews:

#import the dataframe
import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

To get the missing values, with n_missing_prices as the variable, simply do:

n_missing_prices = sum(reviews.price.isnull())
print(n_missing_prices)

sum is the key method here; I was trying to use count before I realized that sum is the right method to use in this context.
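
To see the difference on a small made-up series: summing the boolean mask counts the NaNs, while count() counts the non-NA entries instead.

import numpy as np
import pandas as pd

price = pd.Series([10.0, np.nan, 25.0])

print(sum(price.isnull()))  # 1 -> True values in the mask, i.e. the NaNs
print(price.count())        # 2 -> non-NA values, not what we want here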


回答 22

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.count.html#pandas.Series.count

pandas.Series.count
Series.count(level=None)[source]

Return number of non-NA/null observations in the Series
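
Since count() returns the number of non-NA observations, the NaN count follows by subtraction; a minimal sketch with a made-up Series:

import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, np.nan])

# total length minus non-NA observations = number of NaNs
print(len(s) - s.count())  # 2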


回答 23

For your task you can use pandas.DataFrame.dropna (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4, np.nan],
                   'b': [1, 2, np.nan, 4, np.nan],
                   'c': [np.nan, 2, np.nan, 4, np.nan]})
df = df.dropna(axis='columns', thresh=3)

print(df)

With the thresh parameter you require a minimum number of non-NA values for a column to be kept in the DataFrame (so it effectively caps how many NaN values each column may have).

Code outputs:

     a    b
0  1.0  1.0
1  2.0  2.0
2  3.0  NaN
3  4.0  4.0
4  NaN  NaN