Tag Archive: performance

pandas loc vs. iloc vs. ix vs. at vs. iat?

Question: pandas loc vs. iloc vs. ix vs. at vs. iat?

Recently began branching out from my safe place (R) into Python and am a bit confused by the cell localization/selection in Pandas. I’ve read the documentation but I’m struggling to understand the practical implications of the various localization/selection options.

  • Is there a reason why I should ever use .loc or .iloc over the most general option .ix?
  • I understand that .loc, .iloc, .at, and .iat may provide some guaranteed correctness that .ix can’t offer, but I’ve also read that .ix tends to be the fastest solution across the board.
  • Please explain the real-world, best-practices reasoning behind utilizing anything other than .ix?

Answer 0

loc: works only on labels in the index
iloc: works on integer positions
ix: you can get data from the DataFrame without it being in the index
at: gets scalar values; it’s a very fast loc
iat: gets scalar values; it’s a very fast iloc

http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html

Note: As of pandas 0.20.0, the .ix indexer is deprecated in favour of the more strict .iloc and .loc indexers.
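
A minimal sketch of how the four non-deprecated accessors line up; this tiny DataFrame is my own illustration, not from the original answer:

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])

df.loc['y', 'a']   # by label              -> 20
df.iloc[1, 0]      # by integer position   -> 20
df.at['y', 'a']    # fast scalar, label    -> 20
df.iat[1, 0]       # fast scalar, position -> 20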


Answer 1

Updated for pandas 0.20 given that ix is deprecated. This demonstrates not only how to use loc, iloc, at, iat and set_value, but also how to accomplish mixed positional/label-based indexing.


loc: label based
Allows you to pass 1-D arrays as indexers. Arrays can be either slices (subsets) of the index or column, or they can be boolean arrays which are equal in length to the index or columns.

Special Note: when a scalar indexer is passed, loc can assign a new index or column value that didn’t exist before.

# label based, but we can use position values
# to get the labels from the index object
df.loc[df.index[2], 'ColName'] = 3

df.loc[df.index[1:3], 'ColName'] = 3
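
A small sketch of the Special Note above; 'NewRow' and 'NewCol' are hypothetical labels assumed not to exist in df yet:

# scalar .loc assignment can enlarge the DataFrame
df.loc['NewRow', 'ColName'] = 3    # creates a new row labelled 'NewRow'
df.loc[df.index[2], 'NewCol'] = 3  # creates a new column 'NewCol'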

iloc: position based
Similar to loc except with positions rather than index values. However, you cannot assign new columns or indices.

# position based, but we can get the position
# from the columns object via the `get_loc` method
df.iloc[2, df.columns.get_loc('ColName')] = 3

df.iloc[2, 4] = 3

df.iloc[:3, 2:4] = 3

at: label based
Works very similar to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns.

Advantage over loc is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

# label based, but we can use position values
# to get the labels from the index object
df.at[df.index[2], 'ColName'] = 3

df.at['C', 'ColName'] = 3

iat: position based
Works similarly to iloc. Cannot work in array indexers. Cannot! assign new indices and columns.

Advantage over iloc is that this is faster.
Disadvantage is that you can’t use arrays for indexers.

# position based, but we can get the position
# from the columns object via the `get_loc` method
IBM.iat[2, IBM.columns.get_loc('PNL')] = 3

set_value: label based
Works very similar to loc for scalar indexers. Cannot operate on array indexers. Can! assign new indices and columns

Advantage Super fast, because there is very little overhead!
Disadvantage There is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

# label based, but we can use position values
# to get the labels from the index object
df.set_value(df.index[2], 'ColName', 3)

set_value with takable=True: position based
Works similarly to iloc. Cannot work in array indexers. Cannot! assign new indices and columns.

Advantage Super fast, because there is very little overhead!
Disadvantage There is very little overhead because pandas is not doing a bunch of safety checks. Use at your own risk. Also, this is not intended for public use.

# position based, but we can get the position
# from the columns object via the `get_loc` method
df.set_value(2, df.columns.get_loc('ColName'), 3, takable=True)
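
Side note, not part of the original answer: in later pandas releases set_value was deprecated and eventually removed, so the .at/.iat accessors above are the supported fast scalar setters. A minimal sketch, assuming the same df and 'ColName' column:

# label based fast scalar set, in place of df.set_value(df.index[2], 'ColName', 3)
df.at[df.index[2], 'ColName'] = 3

# position based fast scalar set, in place of the takable=True form
df.iat[2, df.columns.get_loc('ColName')] = 3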

Answer 2

There are two primary ways that pandas makes selections from a DataFrame.

  • By Label
  • By Integer Location

The documentation uses the term position for referring to integer location. I do not like this terminology as I feel it is confusing. Integer location is more descriptive and is exactly what .iloc stands for. The key word here is INTEGER – you must use integers when selecting by integer location.

Before showing the summary let’s all make sure that …

.ix is deprecated and ambiguous and should never be used

There are three primary indexers for pandas. We have the indexing operator itself (the brackets []), .loc, and .iloc. Let’s summarize them:

  • [] – Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns.
  • .loc – selects subsets of rows and columns by label only
  • .iloc – selects subsets of rows and columns by integer location only

I almost never use .at or .iat as they add no additional functionality, just a small performance increase. I would discourage their use unless you have a very time-sensitive application. Regardless, we have their summary:

  • .at selects a single scalar value in the DataFrame by label only
  • .iat selects a single scalar value in the DataFrame by integer location only

In addition to selection by label and integer location, boolean selection also known as boolean indexing exists.


Examples explaining .loc, .iloc, boolean selection and .at and .iat are shown below

We will first focus on the differences between .loc and .iloc. Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each row. Let’s take a look at a sample DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

All the words in bold are the labels. The labels, age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used as labels for the rows. Collectively, these row labels are known as the index.


The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns, but it is easier to just focus on rows for now. Also, each of the indexers uses a set of brackets that immediately follow their name to make their selections.

.loc selects data only by labels

We will first talk about the .loc indexer, which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead default to just the integers from 0 to n-1, where n is the length (number of rows) of the DataFrame.

There are many different inputs you can use for .loc; three of them are

  • A string
  • A list of strings
  • Slice notation using strings as the start and stop values

Selecting a single row with .loc with a string

To select a single row of data, place the index label inside of the brackets following .loc.

df.loc['Penelope']

This returns the row of data as a Series

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

Selecting multiple rows with .loc with a list of strings

df.loc[['Cornelia', 'Jane', 'Dean']]

This returns a DataFrame with the rows in the order specified in the list:

Selecting multiple rows with .loc with slice notation

Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined, but defaults to 1.

df.loc['Aaron':'Dean']

Complex slices can be taken in the same manner as Python lists.

.iloc selects data only by integer location

Let’s now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.

There are many different inputs you can use for .iloc; three of them are

  • An integer
  • A list of integers
  • Slice notation using integers as the start and stop values

Selecting a single row with .iloc with an integer

df.iloc[4]

This returns the 5th row (integer location 4) as a Series

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

Selecting multiple rows with .iloc with a list of integers

df.iloc[[2, -2]]

This returns a DataFrame of the third and second to last rows:

Selecting multiple rows with .iloc with slice notation

df.iloc[:5:3]


Simultaneous selection of rows and columns with .loc and .iloc

One excellent feature of both .loc/.iloc is that they can select rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.

For example, we can select rows Jane and Dean with just the columns height, score and state like this:

df.loc[['Jane', 'Dean'], 'height':]

This uses a list of labels for the rows and slice notation for the columns

We can naturally do similar operations with .iloc using only integers.

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

Simultaneous selection with labels and integer location

.ix was used to make selections simultaneously with labels and integer location, which was useful but confusing and ambiguous at times, and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both of your selections labels or both integer locations.

For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

Or alternatively, convert the index labels to integers with the get_loc index method.

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

Boolean Selection

The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the food and score columns we can do the following:

df.loc[df['age'] > 30, ['food', 'score']] 

You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:

df.iloc[(df['age'] > 30).values, [2, 4]] 

Selecting all rows

It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:

df.loc[:, 'color':'score':2]


The indexing operator, [], can slice and can select rows and columns too, but not simultaneously.

Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

Using a list selects multiple columns

df[['food', 'score']]

What people are less familiar with is that, when slice notation is used, selection happens by row labels or by integer location. This is very confusing and something that I almost never use, but it does work.

df['Penelope':'Christina'] # slice rows by label

df[2:6:2] # slice rows by integer location

The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.

df[3:5, 'color']
TypeError: unhashable type: 'slice'

Selection by .at and .iat

Selection with .at is nearly identical to .loc but it only selects a single ‘cell’ in your DataFrame. We usually refer to this cell as a scalar value. To use .at, pass it both a row and column label separated by a comma.

df.at['Christina', 'color']
'black'

Selection with .iat is nearly identical to .iloc but it only selects a single scalar value. You must pass it an integer for both the row and column locations.

df.iat[2, 5]
'FL'

Answer 3

df = pd.DataFrame({'A':['a', 'b', 'c'], 'B':[54, 67, 89]}, index=[100, 200, 300])

df

                        A   B
                100     a   54
                200     b   67
                300     c   89
In [19]:    
df.loc[100]

Out[19]:
A     a
B    54
Name: 100, dtype: object

In [20]:    
df.iloc[0]

Out[20]:
A     a
B    54
Name: 100, dtype: object

In [24]:    
df2 = df.set_index([df.index,'A'])
df2

Out[24]:
        B
    A   
100 a   54
200 b   67
300 c   89

In [25]:    
df2.ix[100, 'a']

Out[25]:    
B    54
Name: (100, a), dtype: int64

Answer 4

Let’s start with this small df:

import pandas as pd
import time as tm
import numpy as np
n=10
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))

We’ll then have

df
Out[25]: 
        0   1   2   3   4   5   6   7   8   9
    0   0   1   2   3   4   5   6   7   8   9
    1  10  11  12  13  14  15  16  17  18  19
    2  20  21  22  23  24  25  26  27  28  29
    3  30  31  32  33  34  35  36  37  38  39
    4  40  41  42  43  44  45  46  47  48  49
    5  50  51  52  53  54  55  56  57  58  59
    6  60  61  62  63  64  65  66  67  68  69
    7  70  71  72  73  74  75  76  77  78  79
    8  80  81  82  83  84  85  86  87  88  89
    9  90  91  92  93  94  95  96  97  98  99

With this we have:

df.iloc[3,3]
Out[33]: 33

df.iat[3,3]
Out[34]: 33

df.iloc[:3,:3]
Out[35]: 
    0   1   2   3
0   0   1   2   3
1  10  11  12  13
2  20  21  22  23
3  30  31  32  33



df.iat[:3,:3]
Traceback (most recent call last):
   ... omissis ...
ValueError: At based indexing on an integer index can only have integer indexers

Thus we cannot use .iat for subsets; for those we must use .iloc only.

But let’s try both to select from a larger df and check the speed…

# -*- coding: utf-8 -*-
"""
Created on Wed Feb  7 09:58:39 2018

@author: Fabio Pomi
"""

import pandas as pd
import time as tm
import numpy as np
n=1000
a=np.arange(0,n**2)
df=pd.DataFrame(a.reshape(n,n))
t1=tm.time()
for j in df.index:
    for i in df.columns:
        a=df.iloc[j,i]
t2=tm.time()
for j in df.index:
    for i in df.columns:
        a=df.iat[j,i]
t3=tm.time()
loc=t2-t1
at=t3-t2
prc = loc/at *100
print('\nloc:%f at:%f prc:%f' %(loc,at,prc))

loc:10.485600 at:7.395423 prc:141.784987

So with .iloc we can manage subsets, while .iat accesses only a single scalar, but .iat is faster than .iloc.

:-)
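
A lighter-weight way to compare the same scalar access is timeit; this is my own sketch, not from the original answer:

import timeit
import numpy as np
import pandas as pd

n = 1000
df = pd.DataFrame(np.arange(n**2).reshape(n, n))

# one scalar read per call, 100000 calls each
print(timeit.timeit(lambda: df.iloc[500, 500], number=100000))
print(timeit.timeit(lambda: df.iat[500, 500], number=100000))  # typically faster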


Why is iterating over a small string slower than iterating over a small list?

Question: Why is iterating over a small string slower than iterating over a small list?

I was playing around with timeit and noticed that doing a simple list comprehension over a small string took longer than doing the same operation on a list of small single character strings. Any explanation? It’s almost 1.35 times as much time.

>>> from timeit import timeit
>>> timeit("[x for x in 'abc']")
2.0691067844831528
>>> timeit("[x for x in ['a', 'b', 'c']]")
1.5286479570345861

What’s happening on a lower level that’s causing this?


Answer 0

TL;DR

  • The actual speed difference is closer to 70% (or more) once a lot of the overhead is removed, for Python 2.

  • Object creation is not at fault. Neither method creates a new object, as one-character strings are cached.

  • The difference is unobvious, but is likely created from a greater number of checks on string indexing, with regards to the type and well-formedness. It is also quite likely thanks to the need to check what to return.

  • List indexing is remarkably fast.



>>> python3 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.388 usec per loop

>>> python3 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.436 usec per loop

This disagrees with what you’ve found…

You must be using Python 2, then.

>>> python2 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.212 usec per loop

Let’s explain the difference between the versions. I’ll examine the compiled code.

For Python 3:

import dis

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   4           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b118a0, file "", line 4>)
#>>>               3 LOAD_CONST               2 ('list_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('a')
#>>>              12 LOAD_CONST               4 ('b')
#>>>              15 LOAD_CONST               5 ('c')
#>>>              18 BUILD_LIST               3
#>>>              21 GET_ITER
#>>>              22 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              25 POP_TOP
#>>>              26 LOAD_CONST               0 (None)
#>>>              29 RETURN_VALUE

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>  21           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d06b17150, file "", line 21>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               3 ('abc')
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

You see here that the list variant is likely to be slower due to the building of the list each time.

This is the

 9 LOAD_CONST   3 ('a')
12 LOAD_CONST   4 ('b')
15 LOAD_CONST   5 ('c')
18 BUILD_LIST   3

part. The string variant only has

 9 LOAD_CONST   3 ('abc')

You can check that this does seem to make a difference:

def string_iterate():
    [item for item in ("a", "b", "c")]

dis.dis(string_iterate)
#>>>  35           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f4d068be660, file "", line 35>)
#>>>               3 LOAD_CONST               2 ('string_iterate.<locals>.<listcomp>')
#>>>               6 MAKE_FUNCTION            0
#>>>               9 LOAD_CONST               6 (('a', 'b', 'c'))
#>>>              12 GET_ITER
#>>>              13 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
#>>>              16 POP_TOP
#>>>              17 LOAD_CONST               0 (None)
#>>>              20 RETURN_VALUE

This produces just

 9 LOAD_CONST               6 (('a', 'b', 'c'))

as tuples are immutable. Test:

>>> python3 -m timeit '[x for x in ("a", "b", "c")]'
1000000 loops, best of 3: 0.369 usec per loop

Great, back up to speed.

For Python 2:

def list_iterate():
    [item for item in ["a", "b", "c"]]

dis.dis(list_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('a')
#>>>               6 LOAD_CONST               2 ('b')
#>>>               9 LOAD_CONST               3 ('c')
#>>>              12 BUILD_LIST               3
#>>>              15 GET_ITER            
#>>>         >>   16 FOR_ITER                12 (to 31)
#>>>              19 STORE_FAST               0 (item)
#>>>              22 LOAD_FAST                0 (item)
#>>>              25 LIST_APPEND              2
#>>>              28 JUMP_ABSOLUTE           16
#>>>         >>   31 POP_TOP             
#>>>              32 LOAD_CONST               0 (None)
#>>>              35 RETURN_VALUE        

def string_iterate():
    [item for item in "abc"]

dis.dis(string_iterate)
#>>>   2           0 BUILD_LIST               0
#>>>               3 LOAD_CONST               1 ('abc')
#>>>               6 GET_ITER            
#>>>         >>    7 FOR_ITER                12 (to 22)
#>>>              10 STORE_FAST               0 (item)
#>>>              13 LOAD_FAST                0 (item)
#>>>              16 LIST_APPEND              2
#>>>              19 JUMP_ABSOLUTE            7
#>>>         >>   22 POP_TOP             
#>>>              23 LOAD_CONST               0 (None)
#>>>              26 RETURN_VALUE        

The odd thing is that we have the same building of the list, but it’s still faster for this. Python 2 is acting strangely fast.

Let’s remove the comprehensions and re-time. The _ = is to prevent it getting optimised out.

>>> python3 -m timeit '_ = ["a", "b", "c"]'
10000000 loops, best of 3: 0.0707 usec per loop

>>> python3 -m timeit '_ = "abc"'
100000000 loops, best of 3: 0.0171 usec per loop

We can see that initialization is not significant enough to account for the difference between the versions (those numbers are small)! We can thus conclude that Python 3 has slower comprehensions. This makes sense as Python 3 changed comprehensions to have safer scoping.
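
(As a hypothetical aside, not part of the original benchmark, the scoping change is easy to see:)

x = 'outer'
[x for x in 'abc']
print(x)
# Python 3 prints 'outer'  (the comprehension runs in its own scope)
# Python 2 prints 'c'      (the loop variable leaks into the enclosing scope)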

Well, now let’s improve the benchmark (I’m just removing overhead that isn’t iteration). This removes the building of the iterable by pre-assigning it:

>>> python3 -m timeit -s 'iterable = "abc"'           '[x for x in iterable]'
1000000 loops, best of 3: 0.387 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
1000000 loops, best of 3: 0.368 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"'           '[x for x in iterable]'
1000000 loops, best of 3: 0.309 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
10000000 loops, best of 3: 0.164 usec per loop

We can check if calling iter is the overhead:

>>> python3 -m timeit -s 'iterable = "abc"'           'iter(iterable)'
10000000 loops, best of 3: 0.099 usec per loop

>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.1 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"'           'iter(iterable)'
10000000 loops, best of 3: 0.0913 usec per loop

>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.0854 usec per loop

No. No it is not. The difference is too small, especially for Python 3.

So let’s remove yet more unwanted overhead… by making the whole thing slower! The aim is just to have a longer iteration so the time hides overhead.

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 3.12 msec per loop

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.77 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 2.32 msec per loop

>>> python2 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.09 msec per loop

This hasn’t actually changed much, but it’s helped a little.

So remove the comprehension. It’s overhead that’s not part of the question:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.71 msec per loop

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 1.36 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.27 msec per loop

>>> python2 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 935 usec per loop

That’s more like it! We can get slightly faster still by using deque to iterate. It’s basically the same, but it’s faster:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 805 usec per loop

>>> python2 -m timeit -s 'import random; from collections import deque; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
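
(For reference, deque(iterable, maxlen=0) is just the standard consume-an-iterator idiom; a minimal sketch, not from the original answer:)

from collections import deque

def consume(iterable):
    # exhaust an iterable at C speed without keeping any of the items
    deque(iterable, maxlen=0)

consume("abc")            # iterates the string, discarding each character
consume(["a", "b", "c"])  # same for the list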

What impresses me is that Unicode is competitive with bytestrings. We can check this explicitly by trying bytes and unicode in both:

  • bytes

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)).encode("ascii") for _ in range(100000))' 'deque(iterable, maxlen=0)'                                                                    :(
    1000 loops, best of 3: 571 usec per loop
    
    >>> python3 -m timeit -s 'import random; from collections import deque; iterable =         [chr(random.randint(0, 127)).encode("ascii") for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127))                 for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 757 usec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable =         [chr(random.randint(0, 127))                 for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 438 usec per loop
    

    Here you see Python 3 actually faster than Python 2.

  • unicode

    >>> python3 -m timeit -s 'import random; from collections import deque; iterable = u"".join(   chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 800 usec per loop
    
    >>> python3 -m timeit -s 'import random; from collections import deque; iterable =         [   chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 394 usec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable = u"".join(unichr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 1.07 msec per loop
    
    >>> python2 -m timeit -s 'import random; from collections import deque; iterable =         [unichr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
    1000 loops, best of 3: 469 usec per loop
    

    Again, Python 3 is faster, although this is to be expected (str has had a lot of attention in Python 3).

In fact, this unicode-to-bytes difference is very small, which is impressive.

So let’s analyse this one case, seeing as it’s fast and convenient for me:

>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop

>>> python3 -m timeit -s 'import random; from collections import deque; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop

We can actually rule out Tim Peters’ 10-times-upvoted answer!

>>> foo = iterable[123]
>>> iterable[36] is foo
True

These are not new objects!

But this is worth mentioning: indexing costs. The difference will likely be in the indexing, so remove the iteration and just index:

>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'iterable[123]'
10000000 loops, best of 3: 0.0397 usec per loop

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable[123]'
10000000 loops, best of 3: 0.0374 usec per loop

The difference seems small, but at least half of the cost is overhead:

>>> python3 -m timeit -s 'import random; iterable =        [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable; 123'
100000000 loops, best of 3: 0.0173 usec per loop

so the speed difference is sufficient to decide to blame it. I think.

So why is indexing a list so much faster?

Well, I’ll come back to you on that, but my guess is that it’s down to the check for interned strings (or cached characters, if it’s a separate mechanism). This will be less fast than optimal. But I’ll go check the source (although I’m not comfortable in C…) :).


So here’s the source:

static PyObject *
unicode_getitem(PyObject *self, Py_ssize_t index)
{
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 ch;
    PyObject *res;

    if (!PyUnicode_Check(self) || PyUnicode_READY(self) == -1) {
        PyErr_BadArgument();
        return NULL;
    }
    if (index < 0 || index >= PyUnicode_GET_LENGTH(self)) {
        PyErr_SetString(PyExc_IndexError, "string index out of range");
        return NULL;
    }
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    ch = PyUnicode_READ(kind, data, index);
    if (ch < 256)
        return get_latin1_char(ch);

    res = PyUnicode_New(1, ch);
    if (res == NULL)
        return NULL;
    kind = PyUnicode_KIND(res);
    data = PyUnicode_DATA(res);
    PyUnicode_WRITE(kind, data, 0, ch);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;
}

Walking from the top, we’ll have some checks. These are boring. Then some assigns, which should also be boring. The first interesting line is

ch = PyUnicode_READ(kind, data, index);

but we’d hope that is fast, as we’re reading from a contiguous C array by indexing it. The result, ch, will be less than 256 so we’ll return the cached character in get_latin1_char(ch).

So we’ll run (dropping the first checks)

kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
return get_latin1_char(ch);

Where

#define PyUnicode_KIND(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)),            \
     ((PyASCIIObject *)(op))->state.kind)

(which is boring because asserts get ignored in debug [so I can check that they’re fast] and ((PyASCIIObject *)(op))->state.kind) is (I think) an indirection and a C-level cast);

#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) :   \
     _PyUnicode_NONCOMPACT_DATA(op))

(which is also boring for similar reasons, assuming the macros (Something_CAPITALIZED) are all fast),

#define PyUnicode_READ(kind, data, index) \
    ((Py_UCS4) \
    ((kind) == PyUnicode_1BYTE_KIND ? \
        ((const Py_UCS1 *)(data))[(index)] : \
        ((kind) == PyUnicode_2BYTE_KIND ? \
            ((const Py_UCS2 *)(data))[(index)] : \
            ((const Py_UCS4 *)(data))[(index)] \
        ) \
    ))

(which involves indexes but really isn’t slow at all) and

static PyObject*
get_latin1_char(unsigned char ch)
{
    PyObject *unicode = unicode_latin1[ch];
    if (!unicode) {
        unicode = PyUnicode_New(1, ch);
        if (!unicode)
            return NULL;
        PyUnicode_1BYTE_DATA(unicode)[0] = ch;
        assert(_PyUnicode_CheckConsistency(unicode, 1));
        unicode_latin1[ch] = unicode;
    }
    Py_INCREF(unicode);
    return unicode;
}

Which confirms my suspicion that:

  • This is cached:

    PyObject *unicode = unicode_latin1[ch];
    
  • This should be fast. The if (!unicode) is not run, so it’s literally equivalent in this case to

    PyObject *unicode = unicode_latin1[ch];
    Py_INCREF(unicode);
    return unicode;
    

Honestly, after testing the asserts are fast (by disabling them [I think it works on the C-level asserts…]), the only plausibly-slow parts are:

PyUnicode_IS_COMPACT(op)
_PyUnicode_COMPACT_DATA(op)
_PyUnicode_NONCOMPACT_DATA(op)

Which are:

#define PyUnicode_IS_COMPACT(op) \
    (((PyASCIIObject*)(op))->state.compact)

(fast, as before),

#define _PyUnicode_COMPACT_DATA(op)                     \
    (PyUnicode_IS_ASCII(op) ?                   \
     ((void*)((PyASCIIObject*)(op) + 1)) :              \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

(fast if the macro IS_ASCII is fast), and

#define _PyUnicode_NONCOMPACT_DATA(op)                  \
    (assert(((PyUnicodeObject*)(op))->data.any),        \
     ((((PyUnicodeObject *)(op))->data.any)))

(also fast as it’s an assert plus an indirection plus a cast).

So we’re down (the rabbit hole) to:

PyUnicode_IS_ASCII

which is

#define PyUnicode_IS_ASCII(op)                   \
    (assert(PyUnicode_Check(op)),                \
     assert(PyUnicode_IS_READY(op)),             \
     ((PyASCIIObject*)op)->state.ascii)

Hmm… that seems fast too…


Well, OK, but let’s compare it to PyList_GetItem. (Yeah, thanks Tim Peters for giving me more work to do :P.)

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    if (!PyList_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (i < 0 || i >= Py_SIZE(op)) {
        if (indexerr == NULL) {
            indexerr = PyUnicode_FromString(
                "list index out of range");
            if (indexerr == NULL)
                return NULL;
        }
        PyErr_SetObject(PyExc_IndexError, indexerr);
        return NULL;
    }
    return ((PyListObject *)op) -> ob_item[i];
}

We can see that on non-error cases this is just going to run:

PyList_Check(op)
Py_SIZE(op)
((PyListObject *)op) -> ob_item[i]

Where PyList_Check is

#define PyList_Check(op) \
     PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_LIST_SUBCLASS)

(TABS! TABS!!!) (issue21587) That got fixed and merged in 5 minutes. Like… yeah. Damn. They put Skeet to shame.

#define Py_SIZE(ob)             (((PyVarObject*)(ob))->ob_size)
#define PyType_FastSubclass(t,f)  PyType_HasFeature(t,f)
#ifdef Py_LIMITED_API
#define PyType_HasFeature(t,f)  ((PyType_GetFlags(t) & (f)) != 0)
#else
#define PyType_HasFeature(t,f)  (((t)->tp_flags & (f)) != 0)
#endif

So this is normally really trivial (two indirections and a couple of boolean checks) unless Py_LIMITED_API is on, in which case… ???

Then there’s the indexing and a cast (((PyListObject *)op) -> ob_item[i]) and we’re done.

So there are definitely fewer checks for lists, and the small speed differences certainly imply that it could be relevant.


I think in general, there’s just more type-checking and indirection (->) for Unicode. It seems I’m missing a point, but what?


回答 1

当您遍历大多数容器对象(列表、元组、字典……)时,迭代器会把容器中已有的对象逐个传递出来。

但是,当您遍历字符串时,必须为传递的每个字符创建一个新对象——字符串并不是列表那种意义上的“容器”。在迭代创建这些对象之前,字符串中的各个字符并不作为独立的对象存在。

When you iterate over most container objects (lists, tuples, dicts, …), the iterator delivers the objects in the container.

But when you iterate over a string, a new object has to be created for each character delivered – a string is not “a container” in the same sense a list is a container. The individual characters in a string don’t exist as distinct objects before iteration creates those objects.
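A quick way to see this from the interpreter (a sketch; the identity checks rely on a CPython implementation detail — only single characters below U+0100 are cached, so a non-Latin-1 character is used to show the effect):

# CPython demo: a list hands back the objects it already stores,
# while indexing a str builds a fresh 1-character string unless the
# character falls in the cached Latin-1 range.
chars = ['漢', '字']
text = '漢字'

print(chars[0] is chars[0])   # True  - the same stored object every time
print(text[0] is text[0])     # False - a new 1-char string on each access
print('a'[0] is 'a'[0])       # True  - Latin-1 characters come from the cache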


回答 2

创建字符串的迭代器可能会带来额外的开销,而数组在实例化时就已经包含了一个迭代器。

编辑:

>>> timeit("[x for x in ['a','b','c']]")
0.3818681240081787
>>> timeit("[x for x in 'abc']")
0.3732869625091553

这是在我的 MacBook Pro i7 上使用 Python 2.7 运行的。这可能是系统配置不同导致的结果。

You could be incurring an overhead for creating the iterator for the string, whereas the array already contains an iterator upon instantiation.

EDIT:

>>> timeit("[x for x in ['a','b','c']]")
0.3818681240081787
>>> timeit("[x for x in 'abc']")
0.3732869625091553

This was run using 2.7, but on my MacBook Pro i7. This could be the result of a system configuration difference.


“x < y < z”比“x < y and y < z”快吗?

问题:“x < y < z”比“x < y and y < z”快吗?

从此页面,我们知道:

链式比较比使用and运算符要快。写x < y < z而不是x < y and y < z

但是,测试以下代码片段时,我得到了不同的结果:

$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.8" "x < y < z"
1000000 loops, best of 3: 0.322 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.8" "x < y and y < z"
1000000 loops, best of 3: 0.22 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.1" "x < y < z"
1000000 loops, best of 3: 0.279 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.1" "x < y and y < z"
1000000 loops, best of 3: 0.215 usec per loop

看来x < y and y < z比x < y < z快。为什么?

在搜索了该站点的一些帖子(如本篇文章)之后,我知道“仅求值一次”是x < y < z的关键,但我仍然感到困惑。为了进一步研究,我使用dis.dis反汇编了这两个函数:

import dis

def chained_compare():
        x = 1.2
        y = 1.3
        z = 1.1
        x < y < z

def and_compare():
        x = 1.2
        y = 1.3
        z = 1.1
        x < y and y < z

dis.dis(chained_compare)
dis.dis(and_compare)

输出为:

## chained_compare ##

  4           0 LOAD_CONST               1 (1.2)
              3 STORE_FAST               0 (x)

  5           6 LOAD_CONST               2 (1.3)
              9 STORE_FAST               1 (y)

  6          12 LOAD_CONST               3 (1.1)
             15 STORE_FAST               2 (z)

  7          18 LOAD_FAST                0 (x)
             21 LOAD_FAST                1 (y)
             24 DUP_TOP
             25 ROT_THREE
             26 COMPARE_OP               0 (<)
             29 JUMP_IF_FALSE_OR_POP    41
             32 LOAD_FAST                2 (z)
             35 COMPARE_OP               0 (<)
             38 JUMP_FORWARD             2 (to 43)
        >>   41 ROT_TWO
             42 POP_TOP
        >>   43 POP_TOP
             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE

## and_compare ##

 10           0 LOAD_CONST               1 (1.2)
              3 STORE_FAST               0 (x)

 11           6 LOAD_CONST               2 (1.3)
              9 STORE_FAST               1 (y)

 12          12 LOAD_CONST               3 (1.1)
             15 STORE_FAST               2 (z)

 13          18 LOAD_FAST                0 (x)
             21 LOAD_FAST                1 (y)
             24 COMPARE_OP               0 (<)
             27 JUMP_IF_FALSE_OR_POP    39
             30 LOAD_FAST                1 (y)
             33 LOAD_FAST                2 (z)
             36 COMPARE_OP               0 (<)
        >>   39 POP_TOP
             40 LOAD_CONST               0 (None)

看来,x < y and y < z反汇编出的指令比x < y < z少。我应该认为x < y and y < z比x < y < z更快吗?

在Intel®Xeon®CPU E5640 @ 2.67GHz上使用Python 2.7.6进行了测试。

From this page, we know that:

Chained comparisons are faster than using the and operator. Write x < y < z instead of x < y and y < z.

However, I got a different result testing the following code snippets:

$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.8" "x < y < z"
1000000 loops, best of 3: 0.322 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.8" "x < y and y < z"
1000000 loops, best of 3: 0.22 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.1" "x < y < z"
1000000 loops, best of 3: 0.279 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.1" "x < y and y < z"
1000000 loops, best of 3: 0.215 usec per loop

It seems that x < y and y < z is faster than x < y < z. Why?

After searching some posts in this site (like this one) I know that “evaluated only once” is the key for x < y < z, however I’m still confused. To do further study, I disassembled these two functions using dis.dis:

import dis

def chained_compare():
        x = 1.2
        y = 1.3
        z = 1.1
        x < y < z

def and_compare():
        x = 1.2
        y = 1.3
        z = 1.1
        x < y and y < z

dis.dis(chained_compare)
dis.dis(and_compare)

And the output is:

## chained_compare ##

  4           0 LOAD_CONST               1 (1.2)
              3 STORE_FAST               0 (x)

  5           6 LOAD_CONST               2 (1.3)
              9 STORE_FAST               1 (y)

  6          12 LOAD_CONST               3 (1.1)
             15 STORE_FAST               2 (z)

  7          18 LOAD_FAST                0 (x)
             21 LOAD_FAST                1 (y)
             24 DUP_TOP
             25 ROT_THREE
             26 COMPARE_OP               0 (<)
             29 JUMP_IF_FALSE_OR_POP    41
             32 LOAD_FAST                2 (z)
             35 COMPARE_OP               0 (<)
             38 JUMP_FORWARD             2 (to 43)
        >>   41 ROT_TWO
             42 POP_TOP
        >>   43 POP_TOP
             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE

## and_compare ##

 10           0 LOAD_CONST               1 (1.2)
              3 STORE_FAST               0 (x)

 11           6 LOAD_CONST               2 (1.3)
              9 STORE_FAST               1 (y)

 12          12 LOAD_CONST               3 (1.1)
             15 STORE_FAST               2 (z)

 13          18 LOAD_FAST                0 (x)
             21 LOAD_FAST                1 (y)
             24 COMPARE_OP               0 (<)
             27 JUMP_IF_FALSE_OR_POP    39
             30 LOAD_FAST                1 (y)
             33 LOAD_FAST                2 (z)
             36 COMPARE_OP               0 (<)
        >>   39 POP_TOP
             40 LOAD_CONST               0 (None)

It seems that x < y and y < z has fewer disassembled instructions than x < y < z. Should I consider x < y and y < z faster than x < y < z?

Tested with Python 2.7.6 on an Intel(R) Xeon(R) CPU E5640 @ 2.67GHz.


回答 0

区别在于,在x < y < z中,y仅被求值一次。如果y是一个变量,这并没有太大的区别;但当它是一个需要花时间计算的函数调用时,差异就会很大。

from time import sleep
def y():
    sleep(.2)
    return 1.3
%timeit 1.2 < y() < 1.8
10 loops, best of 3: 203 ms per loop
%timeit 1.2 < y() and y() < 1.8
1 loops, best of 3: 405 ms per loop

The difference is that in x < y < z y is only evaluated once. This does not make a large difference if y is a variable, but it does when it is a function call, which takes some time to compute.

from time import sleep
def y():
    sleep(.2)
    return 1.3
%timeit 1.2 < y() < 1.8
10 loops, best of 3: 203 ms per loop
%timeit 1.2 < y() and y() < 1.8
1 loops, best of 3: 405 ms per loop

回答 1

您定义的两个函数的最佳字节码将是

          0 LOAD_CONST               0 (None)
          3 RETURN_VALUE

因为比较的结果没有被使用。让我们返回比较结果,使情况变得更有趣一些,同时也让结果在编译时无法得知。

def interesting_compare(y):
    x = 1.1
    z = 1.3
    return x < y < z  # or: x < y and y < z

同样,这两个版本的比较在语义上是相同的,因此两种写法的最佳字节码也相同。就我所能推断,它看起来像下面这样。我在每个操作码的注释中用Forth记法(栈顶在右,--分隔执行前后,结尾的?表示可能存在也可能不存在的内容)标出了执行前后的栈内容。请注意,RETURN_VALUE会丢弃返回值下方栈中残留的所有内容。

          0 LOAD_FAST                0 (y)    ;          -- y
          3 DUP_TOP                           ; y        -- y y
          4 LOAD_CONST               0 (1.1)  ; y y      -- y y 1.1
          7 COMPARE_OP               4 (>)    ; y y 1.1  -- y pred
         10 JUMP_IF_FALSE_OR_POP     19       ; y pred   -- y
         13 LOAD_CONST               1 (1.3)  ; y        -- y 1.3
         16 COMPARE_OP               0 (<)    ; y 1.3    -- pred
     >>  19 RETURN_VALUE                      ; y? pred  --

如果某个语言实现(CPython、PyPy等)没有为这两种变体生成这样的字节码(或与之等价的操作序列),那就说明该字节码编译器质量不佳。从您贴出的字节码序列变换到上面这种形式是一个已解决的问题(我认为这种情况下您需要的只是常量折叠、死代码消除以及对栈内容更好的建模;公共子表达式消除也会既廉价又有价值),现代语言实现没有不这样做的借口。

现在,碰巧该语言的所有当前实现都具有劣质的字节码编译器。但是您在编码时应该忽略这一点!假装字节码编译器很好,并编写最易读的代码。无论如何它可能足够快。如果不是这样,请首先寻找算法上的改进,然后尝试Cython-与您可能应用的任何表达式级调整相比,在相同的工作量下将提供更多的改进。

Optimal bytecode for both of the functions you defined would be

          0 LOAD_CONST               0 (None)
          3 RETURN_VALUE

because the result of the comparison is not used. Let’s make the situation more interesting by returning the result of the comparison. Let’s also have the result not be knowable at compile time.

def interesting_compare(y):
    x = 1.1
    z = 1.3
    return x < y < z  # or: x < y and y < z

Again, the two versions of the comparison are semantically identical, so the optimal bytecode is the same for both constructs. As best I can work it out, it would look like this. I’ve annotated each line with the stack contents before and after each opcode, in Forth notation (top of stack at right, -- divides before and after, trailing ? indicates something that might or might not be there). Note that RETURN_VALUE discards everything that happens to be left on the stack underneath the value returned.

          0 LOAD_FAST                0 (y)    ;          -- y
          3 DUP_TOP                           ; y        -- y y
          4 LOAD_CONST               0 (1.1)  ; y y      -- y y 1.1
          7 COMPARE_OP               4 (>)    ; y y 1.1  -- y pred
         10 JUMP_IF_FALSE_OR_POP     19       ; y pred   -- y
         13 LOAD_CONST               1 (1.3)  ; y        -- y 1.3
         16 COMPARE_OP               0 (<)    ; y 1.3    -- pred
     >>  19 RETURN_VALUE                      ; y? pred  --

If an implementation of the language, CPython, PyPy, whatever, does not generate this bytecode (or its own equivalent sequence of operations) for both variations, that demonstrates the poor quality of that bytecode compiler. Getting from the bytecode sequences you posted to the above is a solved problem (I think all you need for this case is constant folding, dead code elimination, and better modeling of the contents of the stack; common subexpression elimination would also be cheap and valuable), and there’s really no excuse for not doing it in a modern language implementation.

Now, it happens that all current implementations of the language have poor-quality bytecode compilers. But you should ignore that while coding! Pretend the bytecode compiler is good, and write the most readable code. It will probably be plenty fast enough anyway. If it isn’t, look for algorithmic improvements first, and give Cython a try second — that will provide far more improvement for the same effort than any expression-level tweaks you might apply.


回答 2

由于输出的差异似乎是由于缺乏优化所致,我认为在大多数情况下您应该忽略这个差异——它很可能会消失。差异的原因在于y只应被求值一次,而这是通过在栈上复制它来实现的,这需要一个额外的POP_TOP;不过使用LOAD_FAST的方案也可能是可行的。

但是,真正重要的区别在于:在x<y and y<z中,如果x<y为真,y会被求值两次;如果对y的求值需要相当长的时间或带有副作用,这就会产生影响。

在大多数情况下,尽管x<y<z稍慢一些,您仍然应该使用它。

Since the difference in the output seems to be due to lack of optimization, I think you should ignore that difference for most cases – it could be that the difference will go away. The difference is because y should only be evaluated once, and that is solved by duplicating it on the stack, which requires an extra POP_TOP – a solution using LOAD_FAST might be possible though.

The important difference though is that in x<y and y<z, y is evaluated a second time if x<y evaluates to true; this has implications if the evaluation of y takes considerable time or has side effects.

In most scenarios you should use x<y<z despite the fact it’s somewhat slower.
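A small illustration of that point (a sketch; the global counter is only there to count evaluations):

calls = 0

def y():
    global calls
    calls += 1          # count how often the middle term is evaluated
    return 1.3

x, z = 1.2, 1.8

calls = 0
x < y() < z             # chained form: y() evaluated once
print(calls)            # 1

calls = 0
x < y() and y() < z     # 'and' form: y() evaluated twice when x < y() is true
print(calls)            # 2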


回答 3

首先,您的比较几乎没有意义,因为这两种不同的写法并不是为了提升性能而引入的,所以您不应该据此决定用其中一种代替另一种。

x < y < z结构:

  1. 其含义更清晰,更直接。
  2. 它的语义正是您从比较的“数学含义”中所期望的:对x、y和z各求值一次,并检查整个条件是否成立。使用and会因为对y多次求值而改变语义,这可能会改变结果。

因此,请根据您想要的语义来选择其一;如果二者等价,再看哪一个更具可读性。

这就是说:更多的反汇编代码并不一定意味着更慢的代码。不过,执行更多的字节码操作意味着每个操作更简单,但每个操作都需要主循环的一次迭代。这意味着,如果您执行的操作非常快(例如这里的局部变量查找),那么执行更多字节码操作的开销就可能变得重要。

但要注意,这个结果并不适用于更一般的情况,它只适用于您碰巧测量到的这种“最坏情况”。正如其他人指出的那样,如果把y换成需要更多时间的东西,您会看到结果发生变化,因为链式写法只对它求值一次。

总结:

  • 在考虑性能之前先考虑语义。
  • 兼顾可读性。
  • 不要相信微基准测试。始终使用不同类型的参数进行测量,以了解函数/表达式的耗时相对于这些参数如何变化,并考虑您打算如何使用它。

First of all, your comparison is pretty much meaningless because the two different constructs were not introduced to provide a performance improvement, so you shouldn’t decide whether to use one in place of the other based on that.

The x < y < z construct:

  1. Is clearer and more direct in its meaning.
  2. Its semantics is what you’d expect from the “mathematical meaning” of the comparison: evaluate x, y and z once and check if the whole condition holds. Using and changes the semantics by evaluating y multiple times, which can change the result.

So choose one in place of the other depending on the semantics you want and, if they are equivalent, whether one is more readable than the other.

This said: more disassembled code does not imply slower code. However, executing more bytecode operations means that each operation is simpler, yet each one requires an iteration of the main loop. This means that if the operations you are performing are extremely fast (e.g. local variable lookup as you are doing there), then the overhead of executing more bytecode operations can matter.

But note that this result does not hold in the more general situation, only in the “worst case” that you happen to profile. As others have noted, if you change y to something that takes even a bit more time you’ll see that the results change, because the chained notation evaluates it only once.

Summarizing:

  • Consider semantics before performance.
  • Take into account readability.
  • Don’t trust micro benchmarks. Always profile with different kinds of parameters to see how a function/expression timing behaves in relation to said parameters, and consider how you plan to use it.

在Python 3中加速数百万个正则表达式的替换

问题:在Python 3中加速数百万个正则表达式的替换

我正在使用Python 3.5.2

我有两个清单

  • 大约750,000个“句子”(长字符串)的列表
  • 我想从我的750,000个句子中删除的大约20,000个“单词”的列表

因此,我必须遍历750,000个句子并执行大约20,000个替换,但前提是我的单词实际上是“单词”,并且不属于较大的字符串。

我这样做的方式是预编译我的单词,使它们两侧带有\b元字符

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

然后我遍历我的“句子”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

这个嵌套循环每秒处理大约50个句子,这很好,但是处理我所有的句子仍需要几个小时。

  • 有没有一种方法可以使用该str.replace方法(我认为该方法更快),但仍然要求仅在单词边界处进行替换?

  • 或者,有没有办法加快re.sub方法?当单词的长度大于句子的长度时,我会跳过re.sub,这已经略微提高了速度,但改进并不大。

感谢您的任何建议。

I’m using Python 3.5.2

I have two lists

  • a list of about 750,000 “sentences” (long strings)
  • a list of about 20,000 “words” that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to using the str.replace method (which I believe is faster), but still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it’s not much of an improvement.

Thank you for any suggestions.


回答 0

您可以尝试做的一件事是编译一个单一模式,例如"\b(word1|word2|word3)\b"

由于re依靠C代码进行实际匹配,因此节省的开销可能相当可观。

正如@pvg在评论中指出的,它也受益于单遍匹配。

如果您的单词不是正则表达式,那么Eric的答案会更快。

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regex, Eric’s answer is faster.
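A minimal sketch of that idea (assuming the banned words are plain strings; re.escape guards against any regex metacharacters sneaking in):

import re

banned_words = ["word1", "word2", "word3"]   # stand-ins for the 20,000 real words
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, banned_words)) + r")\b")

sentences = ["word1 is here", "no match here", "prefixword2 stays"]
cleaned = [pattern.sub("", sentence) for sentence in sentences]
print(cleaned)   # [' is here', 'no match here', 'prefixword2 stays']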


回答 1

TLDR

如果您想要最快的解决方案,请使用此方法(带有设置的查找)。对于类似于OP的数据集,它比接受的答案快大约2000倍。

如果您坚持使用正则表达式进行查找,请使用此基于Trie的版本,该版本仍比正则表达式联合快1000倍。

理论

如果您的句子不是特别巨大的字符串,每秒处理远多于50个句子应该是可行的。

如果将所有禁止的单词保存到集合中,则可以非常快速地检查该集合中是否包含另一个单词。

将逻辑打包到一个函数中,将此函数作为参数提供给re.sub您,您就完成了!

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile('\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

转换后的句子为:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

注意:

  • 搜索不区分大小写(感谢lower()
  • 用替换一个单词""可能会留下两个空格(如您的代码中所示)
  • 使用python3,\w+还可以匹配带重音符号的字符(例如"ångström")。
  • 任何非单词字符(制表符,空格,换行符,标记等)都将保持不变。

性能

一百万个句子,banned_words近十万个单词,脚本运行时间不到7秒。

相比之下,Liteye的答案处理1万个句子需要160秒。

设n为单词总数,m为禁用词数量,OP和Liteye的代码复杂度为O(n*m)。

相比之下,我的代码应以O(n+m)运行。考虑到句子远多于禁用词,该算法实际上是O(n)。

正则表达式联合测试

使用'\b(word1|word2|...|wordN)\b'模式进行正则表达式搜索的复杂性是什么?是O(N)还是O(1)

很难了解正则表达式引擎的工作方式,因此让我们编写一个简单的测试。

此代码将10**i随机的英语单词提取到列表中。它创建相应的正则表达式联合,并用不同的词对其进行测试:

  • 一个显然不是单词(以#开头)
  • 一个是列表中的第一个单词
  • 一个是列表中的最后一个单词
  • 一个看起来像一个单词,但不是


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

它输出:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

因此,看起来使用'\b(word1|word2|...|wordN)\b'模式搜索单个单词的复杂度为:

  • O(1) 最好的情况
  • O(n/2) 一般情况,仍然 O(n)
  • O(n) 最糟糕的情况

这些结果与简单的循环搜索一致。

regex联合的一种更快的替代方法是从trie创建regex模式

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, give this function as argument to re.sub and you’re done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile('\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With python3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, …) will stay untouched.

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye’s answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, OP’s and Liteye’s code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.

This code extracts 10**i random english words into a list. It creates the corresponding regex union, and tests it with different words :

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn’t


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.


回答 2

TLDR

如果您想要最快的基于正则表达式的解决方案,请使用此方法。对于类似于OP的数据集,它比接受的答案快大约1000倍。

如果您不关心正则表达式,请使用此基于集合的版本,它比正则表达式联合快2000倍。

使用Trie优化正则表达式

当禁用词很多时,简单的正则表达式联合(union)方法会变得很慢,这是因为正则表达式引擎在优化这种模式方面做得并不好。

可以使用所有禁止的单词创建Trie并编写相应的正则表达式。生成的trie或regex并不是真正的人类可读的,但是它们确实允许非常快速的查找和匹配。

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

该列表将转换为特里:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

然后到此正则表达式模式:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

巨大的优势在于:要测试zoo是否匹配,正则表达式引擎只需比较第一个字符(不匹配)即可,而无需逐一尝试这5个单词。对5个单词来说这种预处理是小题大做,但它对成千上万个单词展现出了可观的效果。

请注意,使用(?:)非捕获组是因为:

  • foobar|baz将匹配foobar或baz,但不能匹配foobaz
  • foo(bar|baz)将不需要的信息保存到捕获组

这是一个经过稍微修改的gist,我们可以将其用作trie.py库:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

测试

这是一个小测试(与前面的测试相同):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

它输出:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

作为参考,正则表达式的开头如下:

(?:a(?:(?:\'s|a(?:\'s|chen|liyah(?:\'s)?|r(?:dvark(?:(?:\'s|s))?|on))|b(?:\'s|a(?:c(?:us(?:(?:\'s|es))?|[ik])|ft|lone(?:(?:\'s|s))?|ndon(?:(?:ed|ing|ment(?:\'s)?|s))?|s(?:e(?:(?:ment(?:\'s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\'s)?|[ds]))?|ing|toir(?:(?:\'s|s))?))|…

这确实让人难以理解,但是对于100000个禁用词的列表而言,此Trie regex比简单的regex联合快1000倍!

这是完整trie的图,使用trie-python-graphviz和graphviz的twopi导出。

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.

If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn’t do a very good job of optimizing the pattern.

It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn’t match), instead of trying the 5 words. It’s a preprocess overkill for 5 words, but it shows promising results for many thousand words.

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar or baz, but not foobaz
  • foo(bar|baz) would save unneeded information to a capturing group

Code

Here’s a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

Test

Here’s a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\’s|a(?:\’s|chen|liyah(?:\’s)?|r(?:dvark(?:(?:\’s|s))?|on))|b(?:\’s|a(?:c(?:us(?:(?:\’s|es))?|[ik])|ft|lone(?:(?:\’s|s))?|ndon(?:(?:ed|ing|ment(?:\’s)?|s))?|s(?:e(?:(?:ment(?:\’s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\’s)?|[ds]))?|ing|toir(?:(?:\’s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\’s|es))?|y(?:(?:\’s|s))?)|ot(?:(?:\’s|t(?:\’s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|y(?:\’s)?|\é(?:(?:\’s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|om(?:en(?:(?:\’s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\’s|s))?)|or(?:(?:\’s|s))?|s))?|l(?:\’s)?))|e(?:(?:\’s|am|l(?:(?:\’s|ard|son(?:\’s)?))?|r(?:deen(?:\’s)?|nathy(?:\’s)?|ra(?:nt|tion(?:(?:\’s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\’s|s))?|d)|ing|or(?:(?:\’s|s))?)|s))?|yance(?:\’s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\’s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\’s)?)|gail|l(?:ene|it(?:ies|y(?:\’s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\’s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\’s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\’s|s))?|y)|m\’s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\’s)?))|r(?:\’s)?)|ormal(?:(?:it(?:ies|y(?:\’s)?)|ly))?)|o(?:ard|de(?:(?:\’s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\’s|ist(?:(?:\’s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|r(?:igin(?:al(?:(?:\’s|s))?|e(?:(?:\’s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\’s|ist(?:(?:\’s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\’s|board))?)|r(?:a(?:cadabra(?:\’s)?|d(?:e[ds]?|ing)|ham(?:\’s)?|m(?:(?:\’s|s))?|si(?:on(?:(?:\’s|s))?|ve(?:(?:\’s|ly|ness(?:\’s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\’s|s))?|[ds]))?|ing|ment(?:(?:\’s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\’s)?))?)|s(?:alom|c(?:ess(?:(?:\’s|e[ds]|ing))?|issa(?:(?:\’s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\’s|s))?|t(?:(?:e(?:e(?:(?:\’s|ism(?:\’s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\’s|e(?:\’s)?))?|o(?:l(?:ut(?:e(?:(?:\’s|ly|st?))?|i(?:on(?:\’s)?|sm(?:\’s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\’s)?|t(?:(?:\’s|s))?)|d)|ing|s))?|pti…

It’s really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!
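To actually delete words with it, the same trie-built pattern can be handed to re.sub — a sketch reusing the Trie class and helper above (the substitution step is an assumption on my part; the original answer only benchmarks matching):

import re
from trie import Trie   # the trie.py library shown above

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

pattern = trie_regex_from_words(['foobar', 'foobah', 'fooxar', 'foozap', 'fooza'])
print(pattern.sub("", "A foobar and a fooza walked past the zoo."))
# -> "A  and a  walked past the zoo."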

Here’s a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi.


回答 3

您可能想尝试的一件事是对句子进行预处理以对单词边界进行编码。基本上,通过划分单词边界将每个句子变成单词列表。

这应该更快,因为要处理一个句子,您只需要逐步检查每个单词并检查它是否匹配即可。

当前,正则表达式搜索每次必须再次遍历整个字符串,以查找单词边界,然后在下一次遍历之前“舍弃”这项工作的结果。

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.

Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
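A sketch of what that could look like (assumptions: splitting on \W+ is an acceptable notion of a word boundary here, and the separators are captured so each sentence can be rebuilt):

import re

banned_words = {"word1", "word2"}        # hypothetical banned set
splitter = re.compile(r"(\W+)")          # capture group keeps the separators

def clean(sentence):
    # Split once into alternating words and separators, drop banned words,
    # then glue everything back together.
    parts = splitter.split(sentence)
    return "".join("" if part.lower() in banned_words else part for part in parts)

print(clean("word1, then word2 and word3."))   # ", then  and word3."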


回答 4

好吧,这是一个快速简单的解决方案,带有测试仪。

取胜策略:

re.sub(r"\w+", repl, sentence)搜索单词。

“ repl”可以是可调用的。我使用了一个执行字典查找的函数,该字典包含要搜索和替换的单词。

这是最简单,最快的解决方案(请参见下面的示例代码中的函数replace4)。

次好的

想法是使用re.split将句子拆分为单词,同时保留分隔符以稍后重建句子。然后,通过简单的字典查找完成替换。

(请参见下面的示例代码中的函数replace3)。

功能示例的时间:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…和代码:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

编辑:如果您传入的是小写的句子列表并按下面的方式修改repl,还可以在检查时忽略大小写

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

Well, here’s a quick and easy solution, with test set.

Winning strategy:

re.sub(“\w+”,repl,sentence) searches for words.

“repl” can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in example code below).

Timings for example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…and code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

Edit: You can also ignore case when checking, if you pass in a lowercased list of sentences and edit repl

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

回答 5

也许Python不是这里合适的工具。下面是用Unix工具链实现的一个方案

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

假设您的黑名单文件已经过预处理,并添加了字边界。步骤是:将文件转换为双倍行距,将每个句子拆分为每行一个单词,从文件中批量删除黑名单单词,然后合并回行。

这应该至少快一个数量级。

用下面的命令从单词列表(每行一个单词)预处理出黑名单文件

sed 's/.*/\\b&\\b/' words > blacklist

Perhaps Python is not the right tool here. Here is one with the Unix toolchain

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line)

sed 's/.*/\\b&\\b/' words > blacklist

回答 6

这个怎么样:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

这些解决方案在单词边界处进行拆分,并在集合中查找每个单词。它们应该比用re.sub做单词替换(Liteye的解决方案)更快,因为这些解决方案是O(n)的(n为输入的大小),这得益于均摊O(1)的集合查找;而使用正则表达式的交替项会迫使正则引擎在每个字符处检查单词匹配,而不仅仅是在单词边界处。我的解决方案还特别注意保留原始文本中的空白(即不压缩空白,并保留制表符、换行符和其他空白字符),但如果您决定不关心这些,从输出中删除它们应该相当简单。

我在corpus.txt上进行了测试,corpus.txt是从Gutenberg Project下载的多本电子书的串联,banned_words.txt是从Ubuntu的单词表(/usr/share/dict/american-english)中随机挑选的20000个单词。处理862462个句子大约需要30秒(在PyPy上约为其一半)。我将句子定义为以". "分隔的任何内容。

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy特别受益于第二种方法,而CPython在第一种方法上表现更好。上面的代码在Python 2和Python 3上都可以使用。

How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub of word alternates (Liteye’s solution) as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to have to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don’t compress whitespace and preserve tabs, newlines, and other whitespace characters), but if you decide that you don’t care about it, it should be fairly straightforward to remove them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy particularly benefit more from the second approach, while CPython fared better on the first approach. The above code should work on both Python 2 and 3.


回答 7

实用方法

下述解决方案使用大量内存将所有文本存储在同一字符串中,并降低了复杂度。如果RAM是一个问题,请在使用前三思。

使用join/ split技巧,您可以完全避免循环,从而可以加快算法的速度。

  • 用特殊分隔符连接句子,这些特殊分隔符不包含在句子中:
  • merged_sentences = ' * '.join(sentences)

  • 使用|“或”正则表达式语句为需要从句子中摆脱的所有单词编译一个正则表达式:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • 用已编译的正则表达式对单词下标,并用特殊的分隔符将其拆分回单独的句子:
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

    性能

    "".join复杂度为O(n)。这是非常直观的,但是无论如何都会有一个简短的报价来源:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);

    因此,join/split 的复杂度为 O(words) + 2 * O(sentences),仍然是线性复杂度,而初始方法为 2 * O(N²)。


    顺便说一句,不要使用多线程。GIL将阻止每个操作,因为您的任务严格地受CPU限制,因此GIL没有机会被释放,但是每个线程将同时发送滴答声,这会导致额外的工作量,甚至导致操作达到无穷大。

    Practical approach

    The solution described below uses a lot of memory to store all the text in a single string and to reduce the complexity level. If RAM is an issue, think twice before using it.

    With the join/split trick you can avoid loops entirely, which should speed up the algorithm.

  • Concatenate the sentences with a special delimiter which is not contained in any of the sentences:
  • merged_sentences = ' * '.join(sentences)
    

  • Compile a single regex for all the words you need to remove from the sentences, using the | (“or”) regex operator:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
    

  • Substitute the words using the compiled regex and split the result on the special delimiter back into separate sentences:
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
    

    Performance

    "".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);
    

    Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs 2*O(N²) with the initial approach.


    By the way, don’t use multithreading. The GIL will block each operation because your task is strictly CPU bound, so the GIL has no chance to be released, but each thread will send ticks concurrently, which causes extra work and can even slow the operation down drastically.
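
    Putting the three steps together, here is a minimal self-contained sketch of the trick described above (re.escape is added here as a safeguard for words containing regex metacharacters; it is not part of the original snippets):

    import re

    sentences = ["the quick brown fox", "jumps over the lazy dog", "hello world"]
    words = ["the", "lazy", "hello"]

    # 1. join with a delimiter that never occurs in the text
    merged_sentences = ' * '.join(sentences)

    # 2. one regex of alternatives for all banned words, case insensitive
    regex = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, words))), re.I)

    # 3. a single substitution, then split back into separate sentences
    clean_sentences = regex.sub("", merged_sentences).split(' * ')
    print(clean_sentences)  # banned words removed; leftover whitespace is not compressed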


    回答 8

    将所有句子连接到一个文档中。使用Aho-Corasick算法的任何实现(这里有一个)来查找所有“不好”的单词。遍历文件,替换每个坏词,并更新其后找到的单词的偏移量,等等。

    Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
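
    For example, with the pyahocorasick package (the package name and the exact replacement strategy below are my own assumptions, not part of this answer), a rough sketch might look like this:

    import ahocorasick  # pip install pyahocorasick

    def strip_banned(document, banned_words):
        automaton = ahocorasick.Automaton()
        for word in banned_words:
            automaton.add_word(word, word)   # store the matched word as the value
        automaton.make_automaton()

        pieces, last_end = [], 0
        for end_index, word in automaton.iter(document):  # end_index is inclusive
            start = end_index - len(word) + 1
            if start >= last_end:            # ignore matches overlapping an earlier one
                pieces.append(document[last_end:start])
                last_end = end_index + 1
        pieces.append(document[last_end:])
        return ''.join(pieces)

    Note that this sketch matches raw substrings; restricting matches to whole words (the \b behaviour of the regex solutions) would need an extra boundary check around each match.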


    字典与对象-哪个更有效,为什么?

    问题:字典与对象-哪个更有效,为什么?

    在内存使用和CPU消耗方面,在Python中更有效的方法是-字典还是对象?

    背景: 我必须将大量数据加载到Python中。我创建了一个只是字段容器的对象。创建4M实例并将其放入字典中大约需要10分钟和约6GB的内存。字典准备就绪后,只需眨眼即可访问。

    示例: 为了检查性能,我编写了两个简单的程序,它们执行相同的操作-一个使用对象,另一个使用字典:

    对象(执行时间〜18sec):

    class Obj(object):
      def __init__(self, i):
        self.i = i
        self.l = []
    all = {}
    for i in range(1000000):
      all[i] = Obj(i)

    字典(执行时间约12秒):

    all = {}
    for i in range(1000000):
      o = {}
      o['i'] = i
      o['l'] = []
      all[i] = o

    问题: 我做错什么了吗?字典比对象快?如果确实字典表现更好,有人可以解释为什么吗?

    What is more efficient in Python in terms of memory usage and CPU consumption – Dictionary or Object?

    Background: I have to load a huge amount of data into Python. I created an object that is just a field container. Creating 4M instances and putting them into a dictionary took about 10 minutes and ~6GB of memory. Once the dictionary is ready, accessing it takes the blink of an eye.

    Example: To check the performance I wrote two simple programs that do the same thing – one using objects, the other a dictionary:

    Object (execution time ~18sec):

    class Obj(object):
      def __init__(self, i):
        self.i = i
        self.l = []
    all = {}
    for i in range(1000000):
      all[i] = Obj(i)
    

    Dictionary (execution time ~12sec):

    all = {}
    for i in range(1000000):
      o = {}
      o['i'] = i
      o['l'] = []
      all[i] = o
    

    Question: Am I doing something wrong, or is a dictionary just faster than an object? If a dictionary indeed performs better, can somebody explain why?


    回答 0

    您是否尝试过使用__slots__

    文档中

    默认情况下,新旧类的实例都有用于属性存储的字典。这浪费了具有很少实例变量的对象的空间。创建大量实例时,空间消耗会变得非常大。

    可以通过__slots__在新式类定义中进行定义来覆盖默认值。该__slots__声明采用一系列实例变量,并且在每个实例中仅保留足够的空间来容纳每个变量的值。因为__dict__未为每个实例创建空间,所以节省了空间。

    那么,这样既节省时间又节省内存吗?

    比较计算机上的三种方法:

    test_slots.py:

    class Obj(object):
      __slots__ = ('i', 'l')
      def __init__(self, i):
        self.i = i
        self.l = []
    all = {}
    for i in range(1000000):
      all[i] = Obj(i)

    test_obj.py:

    class Obj(object):
      def __init__(self, i):
        self.i = i
        self.l = []
    all = {}
    for i in range(1000000):
      all[i] = Obj(i)

    test_dict.py:

    all = {}
    for i in range(1000000):
      o = {}
      o['i'] = i
      o['l'] = []
      all[i] = o

    test_namedtuple.py(在2.6中受支持):

    import collections
    
    Obj = collections.namedtuple('Obj', 'i l')
    
    all = {}
    for i in range(1000000):
      all[i] = Obj(i, [])

    运行基准测试(使用CPython 2.5):

    $ lshw | grep product | head -n 1
              product: Intel(R) Pentium(R) M processor 1.60GHz
    $ python --version
    Python 2.5
    $ time python test_obj.py && time python test_dict.py && time python test_slots.py 
    
    real    0m27.398s (using 'normal' object)
    real    0m16.747s (using __dict__)
    real    0m11.777s (using __slots__)

    使用CPython 2.6.2,包括命名的元组测试:

    $ python --version
    Python 2.6.2
    $ time python test_obj.py && time python test_dict.py && time python test_slots.py && time python test_namedtuple.py 
    
    real    0m27.197s (using 'normal' object)
    real    0m17.657s (using __dict__)
    real    0m12.249s (using __slots__)
    real    0m12.262s (using namedtuple)

    因此,是的(不是很意外),使用__slots__是一种性能优化。使用命名元组的性能与相似__slots__

    Have you tried using __slots__?

    From the documentation:

    By default, instances of both old and new-style classes have a dictionary for attribute storage. This wastes space for objects having very few instance variables. The space consumption can become acute when creating large numbers of instances.

    The default can be overridden by defining __slots__ in a new-style class definition. The __slots__ declaration takes a sequence of instance variables and reserves just enough space in each instance to hold a value for each variable. Space is saved because __dict__ is not created for each instance.

    So does this save time as well as memory?

    Comparing the three approaches on my computer:

    test_slots.py:

    class Obj(object):
      __slots__ = ('i', 'l')
      def __init__(self, i):
        self.i = i
        self.l = []
    all = {}
    for i in range(1000000):
      all[i] = Obj(i)
    

    test_obj.py:

    class Obj(object):
      def __init__(self, i):
        self.i = i
        self.l = []
    all = {}
    for i in range(1000000):
      all[i] = Obj(i)
    

    test_dict.py:

    all = {}
    for i in range(1000000):
      o = {}
      o['i'] = i
      o['l'] = []
      all[i] = o
    

    test_namedtuple.py (supported in 2.6):

    import collections
    
    Obj = collections.namedtuple('Obj', 'i l')
    
    all = {}
    for i in range(1000000):
      all[i] = Obj(i, [])
    

    Run benchmark (using CPython 2.5):

    $ lshw | grep product | head -n 1
              product: Intel(R) Pentium(R) M processor 1.60GHz
    $ python --version
    Python 2.5
    $ time python test_obj.py && time python test_dict.py && time python test_slots.py 
    
    real    0m27.398s (using 'normal' object)
    real    0m16.747s (using __dict__)
    real    0m11.777s (using __slots__)
    

    Using CPython 2.6.2, including the named tuple test:

    $ python --version
    Python 2.6.2
    $ time python test_obj.py && time python test_dict.py && time python test_slots.py && time python test_namedtuple.py 
    
    real    0m27.197s (using 'normal' object)
    real    0m17.657s (using __dict__)
    real    0m12.249s (using __slots__)
    real    0m12.262s (using namedtuple)
    

    So yes (not really a surprise), using __slots__ is a performance optimization. Using a named tuple has similar performance to __slots__.


    回答 1

    对象中的属性访问使用幕后的字典访问-因此,使用属性访问会增加额外的开销。另外,在对象情况下,由于例如额外的内存分配和代码执行(例如__init__方法的执行),您将承担额外的开销。

    在您的代码中,如果o是一个Obj实例,o.attr则等效于o.__dict__['attr']少量的额外开销。

    Attribute access in an object uses dictionary access behind the scenes – so by using attribute access you are adding extra overhead. Plus in the object case, you are incurring additional overhead because of e.g. additional memory allocations and code execution (e.g. of the __init__ method).

    In your code, if o is an Obj instance, o.attr is equivalent to o.__dict__['attr'] with a small amount of extra overhead.
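
    A minimal sketch (my own illustration, not from the original answer) that makes the equivalence concrete:

    class Obj(object):
        def __init__(self, i):
            self.i = i

    o = Obj(42)
    # Attribute access is backed by the instance __dict__, so both
    # expressions read the very same stored object:
    print(o.i)                      # 42
    print(o.__dict__['i'])          # 42
    print(o.i is o.__dict__['i'])   # True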


    回答 2

    您是否考虑过使用namedtuple?(python 2.4 / 2.5的链接

    这是表示结构化数据的新标准方式,可为您提供元组的性能和类的便利性。

    与字典相比,它的唯一缺点是(如元组)它不具有创建后更改属性的能力。

    Have you considered using a namedtuple? (link for python 2.4/2.5)

    It’s the new standard way of representing structured data that gives you the performance of a tuple and the convenience of a class.

    Its only downside compared with dictionaries is that (like tuples) it doesn’t give you the ability to change attributes after creation.
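
    A quick illustration of the idea (hypothetical field names, just for demonstration):

    from collections import namedtuple

    Point = namedtuple('Point', ['x', 'y'])
    p = Point(3, 4)
    print(p.x, p.y)    # 3 4  -- attribute access like a class
    print(p[0], p[1])  # 3 4  -- index access like a tuple
    # p.x = 10         # AttributeError: fields cannot be reassigned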


    回答 3

    这是python 3.6.1的@hughdbrown答案的副本,我将计数增加了5倍,并在每次运行结束时添加了一些代码来测试python进程的内存占用量。

    在不愿接受投票的人之前,请注意,这种计算对象大小的方法并不准确。

    from datetime import datetime
    import os
    import psutil
    
    process = psutil.Process(os.getpid())
    
    
    ITER_COUNT = 1000 * 1000 * 5
    
    RESULT=None
    
    def makeL(i):
        # Use this line to negate the effect of the strings on the test 
        # return "Python is smart and will only create one string with this line"
    
        # Use this if you want to see the difference with 5 million unique strings
        return "This is a sample string %s" % i
    
    def timeit(method):
        def timed(*args, **kw):
            global RESULT
            s = datetime.now()
            RESULT = method(*args, **kw)
            e = datetime.now()
    
            sizeMb = process.memory_info().rss / 1024 / 1024
            sizeMbStr = "{0:,}".format(round(sizeMb, 2))
    
            print('Time Taken = %s, \t%s, \tSize = %s' % (e - s, method.__name__, sizeMbStr))
    
        return timed
    
    class Obj(object):
        def __init__(self, i):
           self.i = i
           self.l = makeL(i)
    
    class SlotObj(object):
        __slots__ = ('i', 'l')
        def __init__(self, i):
           self.i = i
           self.l = makeL(i)
    
    from collections import namedtuple
    NT = namedtuple("NT", ["i", 'l'])
    
    @timeit
    def profile_dict_of_nt():
        return [NT(i=i, l=makeL(i)) for i in range(ITER_COUNT)]
    
    @timeit
    def profile_list_of_nt():
        return dict((i, NT(i=i, l=makeL(i))) for i in range(ITER_COUNT))
    
    @timeit
    def profile_dict_of_dict():
        return dict((i, {'i': i, 'l': makeL(i)}) for i in range(ITER_COUNT))
    
    @timeit
    def profile_list_of_dict():
        return [{'i': i, 'l': makeL(i)} for i in range(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_obj():
        return dict((i, Obj(i)) for i in range(ITER_COUNT))
    
    @timeit
    def profile_list_of_obj():
        return [Obj(i) for i in range(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_slot():
        return dict((i, SlotObj(i)) for i in range(ITER_COUNT))
    
    @timeit
    def profile_list_of_slot():
        return [SlotObj(i) for i in range(ITER_COUNT)]
    
    profile_dict_of_nt()
    profile_list_of_nt()
    profile_dict_of_dict()
    profile_list_of_dict()
    profile_dict_of_obj()
    profile_list_of_obj()
    profile_dict_of_slot()
    profile_list_of_slot()

    这些是我的结果

    Time Taken = 0:00:07.018720,    provile_dict_of_nt,     Size = 951.83
    Time Taken = 0:00:07.716197,    provile_list_of_nt,     Size = 1,084.75
    Time Taken = 0:00:03.237139,    profile_dict_of_dict,   Size = 1,926.29
    Time Taken = 0:00:02.770469,    profile_list_of_dict,   Size = 1,778.58
    Time Taken = 0:00:07.961045,    profile_dict_of_obj,    Size = 1,537.64
    Time Taken = 0:00:05.899573,    profile_list_of_obj,    Size = 1,458.05
    Time Taken = 0:00:06.567684,    profile_dict_of_slot,   Size = 1,035.65
    Time Taken = 0:00:04.925101,    profile_list_of_slot,   Size = 887.49

    我的结论是:

    1. 插槽具有最佳的内存占用,并且速度合理。
    2. dict是最快的,但使用最多的内存。

    Here is a copy of @hughdbrown’s answer for Python 3.6.1; I’ve made the count 5x larger and added some code to test the memory footprint of the Python process at the end of each run.

    Before the downvoters have at it, be advised that this method of measuring the size of objects is not accurate.

    from datetime import datetime
    import os
    import psutil
    
    process = psutil.Process(os.getpid())
    
    
    ITER_COUNT = 1000 * 1000 * 5
    
    RESULT=None
    
    def makeL(i):
        # Use this line to negate the effect of the strings on the test 
        # return "Python is smart and will only create one string with this line"
    
        # Use this if you want to see the difference with 5 million unique strings
        return "This is a sample string %s" % i
    
    def timeit(method):
        def timed(*args, **kw):
            global RESULT
            s = datetime.now()
            RESULT = method(*args, **kw)
            e = datetime.now()
    
            sizeMb = process.memory_info().rss / 1024 / 1024
            sizeMbStr = "{0:,}".format(round(sizeMb, 2))
    
            print('Time Taken = %s, \t%s, \tSize = %s' % (e - s, method.__name__, sizeMbStr))
    
        return timed
    
    class Obj(object):
        def __init__(self, i):
           self.i = i
           self.l = makeL(i)
    
    class SlotObj(object):
        __slots__ = ('i', 'l')
        def __init__(self, i):
           self.i = i
           self.l = makeL(i)
    
    from collections import namedtuple
    NT = namedtuple("NT", ["i", 'l'])
    
    @timeit
    def profile_dict_of_nt():
        return [NT(i=i, l=makeL(i)) for i in range(ITER_COUNT)]
    
    @timeit
    def profile_list_of_nt():
        return dict((i, NT(i=i, l=makeL(i))) for i in range(ITER_COUNT))
    
    @timeit
    def profile_dict_of_dict():
        return dict((i, {'i': i, 'l': makeL(i)}) for i in range(ITER_COUNT))
    
    @timeit
    def profile_list_of_dict():
        return [{'i': i, 'l': makeL(i)} for i in range(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_obj():
        return dict((i, Obj(i)) for i in range(ITER_COUNT))
    
    @timeit
    def profile_list_of_obj():
        return [Obj(i) for i in range(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_slot():
        return dict((i, SlotObj(i)) for i in range(ITER_COUNT))
    
    @timeit
    def profile_list_of_slot():
        return [SlotObj(i) for i in range(ITER_COUNT)]
    
    profile_dict_of_nt()
    profile_list_of_nt()
    profile_dict_of_dict()
    profile_list_of_dict()
    profile_dict_of_obj()
    profile_list_of_obj()
    profile_dict_of_slot()
    profile_list_of_slot()
    

    And these are my results

    Time Taken = 0:00:07.018720,    provile_dict_of_nt,     Size = 951.83
    Time Taken = 0:00:07.716197,    provile_list_of_nt,     Size = 1,084.75
    Time Taken = 0:00:03.237139,    profile_dict_of_dict,   Size = 1,926.29
    Time Taken = 0:00:02.770469,    profile_list_of_dict,   Size = 1,778.58
    Time Taken = 0:00:07.961045,    profile_dict_of_obj,    Size = 1,537.64
    Time Taken = 0:00:05.899573,    profile_list_of_obj,    Size = 1,458.05
    Time Taken = 0:00:06.567684,    profile_dict_of_slot,   Size = 1,035.65
    Time Taken = 0:00:04.925101,    profile_list_of_slot,   Size = 887.49
    

    My conclusion is:

    1. Slots have the best memory footprint and are reasonable on speed.
    2. dicts are the fastest, but use the most memory.

    回答 4

    from datetime import datetime
    
    ITER_COUNT = 1000 * 1000
    
    def timeit(method):
        def timed(*args, **kw):
            s = datetime.now()
            result = method(*args, **kw)
            e = datetime.now()
    
            print method.__name__, '(%r, %r)' % (args, kw), e - s
            return result
        return timed
    
    class Obj(object):
        def __init__(self, i):
           self.i = i
           self.l = []
    
    class SlotObj(object):
        __slots__ = ('i', 'l')
        def __init__(self, i):
           self.i = i
           self.l = []
    
    @timeit
    def profile_dict_of_dict():
        return dict((i, {'i': i, 'l': []}) for i in xrange(ITER_COUNT))
    
    @timeit
    def profile_list_of_dict():
        return [{'i': i, 'l': []} for i in xrange(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_obj():
        return dict((i, Obj(i)) for i in xrange(ITER_COUNT))
    
    @timeit
    def profile_list_of_obj():
        return [Obj(i) for i in xrange(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_slotobj():
        return dict((i, SlotObj(i)) for i in xrange(ITER_COUNT))
    
    @timeit
    def profile_list_of_slotobj():
        return [SlotObj(i) for i in xrange(ITER_COUNT)]
    
    if __name__ == '__main__':
        profile_dict_of_dict()
        profile_list_of_dict()
        profile_dict_of_obj()
        profile_list_of_obj()
        profile_dict_of_slotobj()
        profile_list_of_slotobj()

    结果:

    hbrown@hbrown-lpt:~$ python ~/Dropbox/src/StackOverflow/1336791.py 
    profile_dict_of_dict ((), {}) 0:00:08.228094
    profile_list_of_dict ((), {}) 0:00:06.040870
    profile_dict_of_obj ((), {}) 0:00:11.481681
    profile_list_of_obj ((), {}) 0:00:10.893125
    profile_dict_of_slotobj ((), {}) 0:00:06.381897
    profile_list_of_slotobj ((), {}) 0:00:05.860749
    from datetime import datetime
    
    ITER_COUNT = 1000 * 1000
    
    def timeit(method):
        def timed(*args, **kw):
            s = datetime.now()
            result = method(*args, **kw)
            e = datetime.now()
    
            print method.__name__, '(%r, %r)' % (args, kw), e - s
            return result
        return timed
    
    class Obj(object):
        def __init__(self, i):
           self.i = i
           self.l = []
    
    class SlotObj(object):
        __slots__ = ('i', 'l')
        def __init__(self, i):
           self.i = i
           self.l = []
    
    @timeit
    def profile_dict_of_dict():
        return dict((i, {'i': i, 'l': []}) for i in xrange(ITER_COUNT))
    
    @timeit
    def profile_list_of_dict():
        return [{'i': i, 'l': []} for i in xrange(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_obj():
        return dict((i, Obj(i)) for i in xrange(ITER_COUNT))
    
    @timeit
    def profile_list_of_obj():
        return [Obj(i) for i in xrange(ITER_COUNT)]
    
    @timeit
    def profile_dict_of_slotobj():
        return dict((i, SlotObj(i)) for i in xrange(ITER_COUNT))
    
    @timeit
    def profile_list_of_slotobj():
        return [SlotObj(i) for i in xrange(ITER_COUNT)]
    
    if __name__ == '__main__':
        profile_dict_of_dict()
        profile_list_of_dict()
        profile_dict_of_obj()
        profile_list_of_obj()
        profile_dict_of_slotobj()
        profile_list_of_slotobj()
    

    Results:

    hbrown@hbrown-lpt:~$ python ~/Dropbox/src/StackOverflow/1336791.py 
    profile_dict_of_dict ((), {}) 0:00:08.228094
    profile_list_of_dict ((), {}) 0:00:06.040870
    profile_dict_of_obj ((), {}) 0:00:11.481681
    profile_list_of_obj ((), {}) 0:00:10.893125
    profile_dict_of_slotobj ((), {}) 0:00:06.381897
    profile_list_of_slotobj ((), {}) 0:00:05.860749
    

    回答 5

    这没有什么疑问。
    您拥有的数据没有其他属性(没有方法,什么都没有)。因此,您有的是一个数据容器(在本例中为字典)。

    我通常更喜欢在数据建模方面进行思考。如果存在巨大的性能问题,那么我可以放弃抽象中的某些内容,但是只有非常好的理由。
    编程是关于管理复杂性的,维护正确的抽象常常是实现这种结果的最有用的方法之一。

    关于物体变慢的原因,我认为您的测量不正确。
    您在for循环内执行的分配太少,因此看到的实例化dict(本机对象)和“ custom”对象所需的时间不同。尽管从语言角度看它们是相同的,但它们的实现却大不相同。
    之后,两者的分配时间应几乎相同,因为最终成员将保留在词典中。

    There is no question.
    You have data, with no other attributes (no methods, nothing). Hence you have a data container (in this case, a dictionary).

    I usually prefer to think in terms of data modeling. If there is some huge performance issue, then I can give up something in the abstraction, but only with very good reasons.
    Programming is all about managing complexity, and maintaining the correct abstraction is very often one of the most useful ways to achieve that result.

    About the reasons an object is slower, I think your measurement is not correct.
    You are performing too few assignments inside the for loop, and therefore what you see is the different time necessary to instantiate a dict (an intrinsic object) and a “custom” object. Although from the language perspective they are the same, they have quite different implementations.
    After that, the assignment time should be almost the same for both, as in the end members are maintained inside a dictionary.


    回答 6

    如果数据结构不应包含参考周期,则还有另一种减少内存使用的方法。

    让我们比较两个类:

    class DataItem:
        __slots__ = ('name', 'age', 'address')
        def __init__(self, name, age, address):
            self.name = name
            self.age = age
            self.address = address

    $ pip install recordclass
    
    >>> import sys
    >>> from recordclass import structclass
    >>> DataItem2 = structclass('DataItem', 'name age address')
    >>> inst = DataItem('Mike', 10, 'Cherry Street 15')
    >>> inst2 = DataItem2('Mike', 10, 'Cherry Street 15')
    >>> print(inst2)
    >>> print(sys.getsizeof(inst), sys.getsizeof(inst2))
    DataItem(name='Mike', age=10, address='Cherry Street 15')
    64 40

    这之所以成为可能,是因为基于structclass的类不支持循环垃圾回收,而在这种情况下也不需要它。

    与基于__slots__的类相比,还有一个优点:您可以添加额外的属性:

    >>> DataItem3 = structclass('DataItem', 'name age address', usedict=True)
    >>> inst3 = DataItem3('Mike', 10, 'Cherry Street 15')
    >>> inst3.hobby = ['drawing', 'singing']
    >>> print(inst3)
    >>> print(sys.getsizeof(inst3), 'has dict:',  bool(inst3.__dict__))
    DataItem(name='Mike', age=10, address='Cherry Street 15', **{'hobby': ['drawing', 'singing']})
    48 has dict: True

    There is yet another way to reduce memory usage if data structure isn’t supposed to contain reference cycles.

    Let’s compare two classes:

    class DataItem:
        __slots__ = ('name', 'age', 'address')
        def __init__(self, name, age, address):
            self.name = name
            self.age = age
            self.address = address
    

    and

    $ pip install recordclass
    
    >>> import sys
    >>> from recordclass import structclass
    >>> DataItem2 = structclass('DataItem', 'name age address')
    >>> inst = DataItem('Mike', 10, 'Cherry Street 15')
    >>> inst2 = DataItem2('Mike', 10, 'Cherry Street 15')
    >>> print(inst2)
    >>> print(sys.getsizeof(inst), sys.getsizeof(inst2))
    DataItem(name='Mike', age=10, address='Cherry Street 15')
    64 40
    

    It became possible because structclass-based classes don’t support cyclic garbage collection, which is not needed in such cases.

    There is also one advantage over __slots__-based classes: you are able to add extra attributes:

    >>> DataItem3 = structclass('DataItem', 'name age address', usedict=True)
    >>> inst3 = DataItem3('Mike', 10, 'Cherry Street 15')
    >>> inst3.hobby = ['drawing', 'singing']
    >>> print(inst3)
    >>> print(sys.getsizeof(inst3), 'has dict:',  bool(inst3.__dict__))
    DataItem(name='Mike', age=10, address='Cherry Street 15', **{'hobby': ['drawing', 'singing']})
    48 has dict: True
    

    回答 7

    这是我对@ Jarrod-Chesney非常好的脚本的测试运行。为了进行比较,我还针对python2运行了它,将“ range”替换为“ xrange”。

    出于好奇,我还使用OrderedDict(ordict)添加了类似的测试以进行比较。

    Python 3.6.9:

    Time Taken = 0:00:04.971369,    profile_dict_of_nt,     Size = 944.27
    Time Taken = 0:00:05.743104,    profile_list_of_nt,     Size = 1,066.93
    Time Taken = 0:00:02.524507,    profile_dict_of_dict,   Size = 1,920.35
    Time Taken = 0:00:02.123801,    profile_list_of_dict,   Size = 1,760.9
    Time Taken = 0:00:05.374294,    profile_dict_of_obj,    Size = 1,532.12
    Time Taken = 0:00:04.517245,    profile_list_of_obj,    Size = 1,441.04
    Time Taken = 0:00:04.590298,    profile_dict_of_slot,   Size = 1,030.09
    Time Taken = 0:00:04.197425,    profile_list_of_slot,   Size = 870.67
    
    Time Taken = 0:00:08.833653,    profile_ordict_of_ordict, Size = 3,045.52
    Time Taken = 0:00:11.539006,    profile_list_of_ordict, Size = 2,722.34
    Time Taken = 0:00:06.428105,    profile_ordict_of_obj,  Size = 1,799.29
    Time Taken = 0:00:05.559248,    profile_ordict_of_slot, Size = 1,257.75

    Python 2.7.15+:

    Time Taken = 0:00:05.193900,    profile_dict_of_nt,     Size = 906.0
    Time Taken = 0:00:05.860978,    profile_list_of_nt,     Size = 1,177.0
    Time Taken = 0:00:02.370905,    profile_dict_of_dict,   Size = 2,228.0
    Time Taken = 0:00:02.100117,    profile_list_of_dict,   Size = 2,036.0
    Time Taken = 0:00:08.353666,    profile_dict_of_obj,    Size = 2,493.0
    Time Taken = 0:00:07.441747,    profile_list_of_obj,    Size = 2,337.0
    Time Taken = 0:00:06.118018,    profile_dict_of_slot,   Size = 1,117.0
    Time Taken = 0:00:04.654888,    profile_list_of_slot,   Size = 964.0
    
    Time Taken = 0:00:59.576874,    profile_ordict_of_ordict, Size = 7,427.0
    Time Taken = 0:10:25.679784,    profile_list_of_ordict, Size = 11,305.0
    Time Taken = 0:05:47.289230,    profile_ordict_of_obj,  Size = 11,477.0
    Time Taken = 0:00:51.485756,    profile_ordict_of_slot, Size = 11,193.0

    因此,在两个主要版本上,@ Jarrod-Chesney的结论仍然看起来不错。

    Here are my test runs of the very nice script of @Jarrod-Chesney. For comparison, I also ran it against Python 2 with “range” replaced by “xrange”.

    Out of curiosity, I also added similar tests with OrderedDict (ordict) for comparison.

    Python 3.6.9:

    Time Taken = 0:00:04.971369,    profile_dict_of_nt,     Size = 944.27
    Time Taken = 0:00:05.743104,    profile_list_of_nt,     Size = 1,066.93
    Time Taken = 0:00:02.524507,    profile_dict_of_dict,   Size = 1,920.35
    Time Taken = 0:00:02.123801,    profile_list_of_dict,   Size = 1,760.9
    Time Taken = 0:00:05.374294,    profile_dict_of_obj,    Size = 1,532.12
    Time Taken = 0:00:04.517245,    profile_list_of_obj,    Size = 1,441.04
    Time Taken = 0:00:04.590298,    profile_dict_of_slot,   Size = 1,030.09
    Time Taken = 0:00:04.197425,    profile_list_of_slot,   Size = 870.67
    
    Time Taken = 0:00:08.833653,    profile_ordict_of_ordict, Size = 3,045.52
    Time Taken = 0:00:11.539006,    profile_list_of_ordict, Size = 2,722.34
    Time Taken = 0:00:06.428105,    profile_ordict_of_obj,  Size = 1,799.29
    Time Taken = 0:00:05.559248,    profile_ordict_of_slot, Size = 1,257.75
    

    Python 2.7.15+:

    Time Taken = 0:00:05.193900,    profile_dict_of_nt,     Size = 906.0
    Time Taken = 0:00:05.860978,    profile_list_of_nt,     Size = 1,177.0
    Time Taken = 0:00:02.370905,    profile_dict_of_dict,   Size = 2,228.0
    Time Taken = 0:00:02.100117,    profile_list_of_dict,   Size = 2,036.0
    Time Taken = 0:00:08.353666,    profile_dict_of_obj,    Size = 2,493.0
    Time Taken = 0:00:07.441747,    profile_list_of_obj,    Size = 2,337.0
    Time Taken = 0:00:06.118018,    profile_dict_of_slot,   Size = 1,117.0
    Time Taken = 0:00:04.654888,    profile_list_of_slot,   Size = 964.0
    
    Time Taken = 0:00:59.576874,    profile_ordict_of_ordict, Size = 7,427.0
    Time Taken = 0:10:25.679784,    profile_list_of_ordict, Size = 11,305.0
    Time Taken = 0:05:47.289230,    profile_ordict_of_obj,  Size = 11,477.0
    Time Taken = 0:00:51.485756,    profile_ordict_of_slot, Size = 11,193.0
    

    So, on both major versions, the conclusions of @Jarrod-Chesney are still looking good.


    在NumPy数组的每个单元中高效评估函数

    问题:在NumPy数组的每个单元中高效评估函数

    给定一个NumPy数组A,将相同的函数f应用于每个单元的最快/最有效的方法是什么?

    1. 假设我们要将 f(A(i,j)) 赋值给 A(i,j)。

    2. 函数f没有二进制输出,因此mask(ing)操作将无济于事。

    “显而易见的”双循环迭代(通过每个单元)是否是最佳解决方案?

    Given a NumPy array A, what is the fastest/most efficient way to apply the same function, f, to every cell?

    1. Suppose that we will assign f(A(i,j)) to A(i,j).

    2. The function, f, doesn’t have a binary output, thus the mask(ing) operations won’t help.

    Is the “obvious” double loop iteration (through every cell) the optimal solution?


    回答 0

    您可以对函数进行矢量化处理,然后在每次需要时将其直接应用于Numpy数组:

    import numpy as np
    
    def f(x):
        return x * x + 3 * x - 2 if x > 0 else x * 5 + 8
    
    f = np.vectorize(f)  # or use a different name if you want to keep the original f
    
    result_array = f(A)  # if A is your Numpy array

    向量化时最好直接指定一个显式输出类型:

    f = np.vectorize(f, otypes=[np.float])

    You could just vectorize the function and then apply it directly to a Numpy array each time you need it:

    import numpy as np
    
    def f(x):
        return x * x + 3 * x - 2 if x > 0 else x * 5 + 8
    
    f = np.vectorize(f)  # or use a different name if you want to keep the original f
    
    result_array = f(A)  # if A is your Numpy array
    

    It’s probably better to specify an explicit output type directly when vectorizing:

    f = np.vectorize(f, otypes=[np.float])
    

    回答 1

    一个类似的问题是:将NumPy数组映射到适当的位置。如果可以为f()找到一个ufunc,则应使用out参数。

    A similar question is: Mapping a NumPy array in place. If you can find a ufunc for your f(), then you should use the out parameter.
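
    For instance, if f can be expressed with an existing ufunc (the functions below are just illustrations), the out parameter lets you write the result back into A without creating a temporary array:

    import numpy as np

    A = np.arange(9, dtype=np.float64).reshape(3, 3)
    np.exp(A, out=A)           # A[i, j] = exp(A[i, j]), in place
    np.multiply(A, A, out=A)   # A[i, j] = A[i, j] * A[i, j], in place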


    回答 2

    如果您使用数字和f(A(i,j)) = f(A(j,i)),则可以使用scipy.spatial.distance.cdist将f定义为A(i)和之间的距离A(j)

    If you are working with numbers and f(A(i,j)) = f(A(j,i)), you could use scipy.spatial.distance.cdist defining f as a distance between A(i) and A(j).
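
    A small sketch of what that could look like (the pairwise function here is only an illustration):

    import numpy as np
    from scipy.spatial.distance import cdist

    A = np.random.rand(5, 3)              # 5 "points", one per row
    f = lambda u, v: np.abs(u - v).sum()  # any symmetric pairwise function

    D = cdist(A, A, metric=f)             # D[i, j] == f(A[i], A[j])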


    回答 3

    我相信我找到了更好的解决方案。将函数更改为python通用函数的想法(请参阅文档),可以在后台进行并行计算。

    你可以用C编写自己的自定义ufunc,这肯定会更高效;也可以通过调用内置工厂方法np.frompyfunc来创建。经过测试,此方法比np.vectorize更高效:

    f = lambda x, y: x * y
    f_arr = np.frompyfunc(f, 2, 1)
    vf = np.vectorize(f)
    arr = np.linspace(0, 1, 10000)
    
    %timeit f_arr(arr, arr) # 307ms
    %timeit vf(arr, arr)    # 450ms

    我还测试了较大的样本,并且改进成比例。有关其他方法的性能比较,请参阅这篇文章

    I believe I have found a better solution. The idea is to change the function into a Python universal function (see the documentation), which can exercise parallel computation under the hood.

    One can write their own customised ufunc in C, which is surely more efficient, or create one by invoking np.frompyfunc, which is a built-in factory method. After testing, this is more efficient than np.vectorize:

    f = lambda x, y: x * y
    f_arr = np.frompyfunc(f, 2, 1)
    vf = np.vectorize(f)
    arr = np.linspace(0, 1, 10000)
    
    %timeit f_arr(arr, arr) # 307ms
    %timeit vf(arr, arr)    # 450ms
    

    I have also tested larger samples, and the improvement is proportional. For a comparison of the performance of other methods, see this post


    回答 4

    当2d数组(或nd数组)为C或F连续时,将函数映射到2d数组的任务实际上与将函数映射到1d数组的任务相同-我们只是必须以这种方式查看它,例如通过np.ravel(A,'K')

    例如,这里讨论了1d阵列的可能解决方案。

    但是,当2d数组的内存不连续时,情况会稍微复杂一些,因为如果轴处理顺序错误,则希望避免可能发生的高速缓存未命中。

    Numpy已经有机器以最佳顺序加工轴。使用这种机械的一种可能性是np.vectorize。但是,numpy的文档np.vectorize指出“主要是为了方便而不是为了性能而提供”-慢速python函数保持慢速python函数以及所有相关的开销!另一个问题是其巨大的内存消耗-例如,参见此SO-post

    当想要具有C函数的性能但要使用numpy的机器时,一个好的解决方案是使用numba创建ufunc,例如:

    # runtime generated C-function as ufunc
    import numba as nb
    @nb.vectorize(target="cpu")
    def nb_vf(x):
        return x+2*x*x+4*x*x*x

    它很容易击败,np.vectorize但是当执行相同的功能作为numpy-array乘法/加法时,即

    # numpy-functionality
    def f(x):
        return x+2*x*x+4*x*x*x
    
    # python-function as ufunc
    import numpy as np
    vf=np.vectorize(f)
    vf.__name__="vf"

    有关时间测量代码,请参见此答案的附录:

    Numba的版本(绿色)比python函数(即np.vectorize)快约100倍,这并不奇怪。但这也比numpy功能快约10倍,因为numbas版本不需要中间数组,因此可以更有效地使用缓存。


    尽管numba的ufunc方法是可用性和性能之间的良好折衷,但它仍然不是我们能做的最好的选择。然而,没有灵丹妙药或最适合任何任务的方法-人们必须了解什么是局限性以及如何减轻它们。

    例如,对于超越函数(例如expsincos)numba不提供超过任何优点numpy的的np.exp(有没有创建临时数组-高速化的主要来源)。但是,我的Anaconda安装使用Intel的VML来处理大于8192的矢量-如果内存不连续,则无法执行。因此,最好将元素复制到连续内存中,以便能够使用英特尔的VML:

    import numba as nb
    @nb.vectorize(target="cpu")
    def nb_vexp(x):
        return np.exp(x)
    
    def np_copy_exp(x):
        copy = np.ravel(x, 'K')
        return np.exp(copy).reshape(x.shape) 

    为了公平起见,我关闭了VML的并行化功能(请参见附录中的代码):

    可以看到,一旦VML开始运行,复制的开销就远远超过了补偿。但是,一旦数据对于L3高速缓存而言变得太大,则优势就变得微不足道了,因为任务再次变得与内存带宽绑定。

    在另一方面,numba可以使用英特尔的SVML为好,在解释这个帖子

    from llvmlite import binding
    # set before import
    binding.set_option('SVML', '-vector-library=SVML')
    
    import numba as nb
    
    @nb.vectorize(target="cpu")
    def nb_vexp_svml(x):
        return np.exp(x)

    并使用具有并行化功能的VML生成:

    numba的版本开销较小,但是对于某些大小,VML甚至比SVML都要高,尽管有额外的复制开销-这并不奇怪,因为numba的ufunc没有并行化。


    清单:

    A.多项式函数的比较:

    import perfplot
    perfplot.show(
        setup=lambda n: np.random.rand(n,n)[::2,::2],
        n_range=[2**k for k in range(0,12)],
        kernels=[
            f,
            vf, 
            nb_vf
            ],
        logx=True,
        logy=True,
        xlabel='len(x)'
        ) 

    B.比较exp

    import perfplot
    import numexpr as ne # using ne is the easiest way to set vml_num_threads
    ne.set_vml_num_threads(1)
    perfplot.show(
        setup=lambda n: np.random.rand(n,n)[::2,::2],
        n_range=[2**k for k in range(0,12)],
        kernels=[
            nb_vexp, 
            np.exp,
            np_copy_exp,
            ],
        logx=True,
        logy=True,
        xlabel='len(x)',
        )

    When the 2d-array (or nd-array) is C- or F-contiguous, then this task of mapping a function onto a 2d-array is practically the same as the task of mapping a function onto a 1d-array – we just have to view it that way, e.g. via np.ravel(A,'K').

    Possible solution for 1d-array have been discussed for example here.

    However, when the memory of the 2d-array isn’t contiguous, the situation is a little bit more complicated, because one would like to avoid possible cache misses if the axes are handled in the wrong order.

    Numpy already has machinery in place to process axes in the best possible order. One possibility to use this machinery is np.vectorize. However, numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance” – a slow python function stays a slow python function, with all the associated overhead! Another issue is its huge memory consumption – see for example this SO-post.

    When one wants to have a performance of a C-function but to use numpy’s machinery, a good solution is to use numba for creation of ufuncs, for example:

    # runtime generated C-function as ufunc
    import numba as nb
    @nb.vectorize(target="cpu")
    def nb_vf(x):
        return x+2*x*x+4*x*x*x
    

    It easily beats np.vectorize, and it also wins when the same function is performed as numpy-array multiplication/addition, i.e.

    # numpy-functionality
    def f(x):
        return x+2*x*x+4*x*x*x
    
    # python-function as ufunc
    import numpy as np
    vf=np.vectorize(f)
    vf.__name__="vf"
    

    See the appendix of this answer for the time-measurement code:

    Numba’s version (green) is about 100 times faster than the python function (i.e. np.vectorize), which is not surprising. But it is also about 10 times faster than the numpy functionality, because numba’s version doesn’t need intermediate arrays and thus uses the cache more efficiently.


    While numba’s ufunc approach is a good trade-off between usability and performance, it is still not the best we can do. Yet there is no silver bullet or approach that is best for every task – one has to understand what the limitations are and how they can be mitigated.

    For example, for transcendental functions (e.g. exp, sin, cos) numba doesn’t provide any advantage over numpy’s np.exp (no temporary arrays are created – the main source of the speed-up). However, my Anaconda installation utilizes Intel’s VML for vectors bigger than 8192 – and it just cannot do that if the memory is not contiguous. So it might be better to copy the elements to contiguous memory in order to be able to use Intel’s VML:

    import numba as nb
    @nb.vectorize(target="cpu")
    def nb_vexp(x):
        return np.exp(x)
    
    def np_copy_exp(x):
        copy = np.ravel(x, 'K')
        return np.exp(copy).reshape(x.shape) 
    

    For the fairness of the comparison, I have switched off VML’s parallelization (see code in the appendix):

    As one can see, once VML kicks in, the overhead of copying is more than compensated for. Yet once the data becomes too big for the L3 cache, the advantage is minimal, as the task becomes memory-bandwidth-bound once again.

    On the other hand, numba could use Intel’s SVML as well, as explained in this post:

    from llvmlite import binding
    # set before import
    binding.set_option('SVML', '-vector-library=SVML')
    
    import numba as nb
    
    @nb.vectorize(target="cpu")
    def nb_vexp_svml(x):
        return np.exp(x)
    

    and using VML with parallelization yields:

    numba’s version has less overhead, but for some sizes VML beats SVML even despite the additional copying overhead – which isn’t a big surprise, as numba’s ufuncs aren’t parallelized.


    Listings:

    A. comparison of polynomial function:

    import perfplot
    perfplot.show(
        setup=lambda n: np.random.rand(n,n)[::2,::2],
        n_range=[2**k for k in range(0,12)],
        kernels=[
            f,
            vf, 
            nb_vf
            ],
        logx=True,
        logy=True,
        xlabel='len(x)'
        ) 
    

    B. comparison of exp:

    import perfplot
    import numexpr as ne # using ne is the easiest way to set vml_num_threads
    ne.set_vml_num_threads(1)
    perfplot.show(
        setup=lambda n: np.random.rand(n,n)[::2,::2],
        n_range=[2**k for k in range(0,12)],
        kernels=[
            nb_vexp, 
            np.exp,
            np_copy_exp,
            ],
        logx=True,
        logy=True,
        xlabel='len(x)',
        )
    

    回答 5

    以上所有答案比较起来都不错,但是如果您需要使用自定义函数进行映射,并且拥有numpy.ndarray,则需要保留数组的形状。

    我只比较了两个,但它将保留的形状ndarray。我已将具有100万个条目的数组用于比较。在这里我使用平方函数。我正在介绍n维数组的一般情况。对于二维,只需制作iter2D。

    import numpy, time
    
    def A(e):
        return e * e
    
    def timeit():
        y = numpy.arange(1000000)
        now = time.time()
        numpy.array([A(x) for x in y.reshape(-1)]).reshape(y.shape)        
        print(time.time() - now)
        now = time.time()
        numpy.fromiter((A(x) for x in y.reshape(-1)), y.dtype).reshape(y.shape)
        print(time.time() - now)
        now = time.time()
        numpy.square(y)  
        print(time.time() - now)

    输出量

    >>> timeit()
    1.162431240081787    # list comprehension and then building numpy array
    1.0775556564331055   # from numpy.fromiter
    0.002948284149169922 # using inbuilt function

    在这里你可以清楚地看到 numpy.fromiter用户平方函数,可以使用任何选择。如果你的功能是依赖于i, j 那就是数组的索引,迭代上数组的大小一样for ind in range(arr.size),用numpy.unravel_index得到i, j, ..基于阵列的您1D指数和形状numpy.unravel_index

    这个答案是受到我对其他问题的回答的启发 这里

    All the above answers compare well, but if you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array, read on.

    I have compared just two options, both of which retain the shape of the ndarray. I have used an array with 1 million entries for the comparison. Here I use the square function. I am presenting the general case for an n-dimensional array; for two dimensions, just make the iteration 2D.

    import numpy, time
    
    def A(e):
        return e * e
    
    def timeit():
        y = numpy.arange(1000000)
        now = time.time()
        numpy.array([A(x) for x in y.reshape(-1)]).reshape(y.shape)        
        print(time.time() - now)
        now = time.time()
        numpy.fromiter((A(x) for x in y.reshape(-1)), y.dtype).reshape(y.shape)
        print(time.time() - now)
        now = time.time()
        numpy.square(y)  
        print(time.time() - now)
    

    Output

    >>> timeit()
    1.162431240081787    # list comprehension and then building numpy array
    1.0775556564331055   # from numpy.fromiter
    0.002948284149169922 # using inbuilt function
    

    Here you can clearly see that numpy.fromiter uses the square function; use any function of your choice. If your function depends on i, j, that is, the indices of the array, iterate over the size of the array as in for ind in range(arr.size) and use numpy.unravel_index to get i, j, … from your 1D index and the shape of the array: numpy.unravel_index

    This answer is inspired by my answer to another question here


    []和{}与list()和dict(),哪个更好?

    问题:[]和{}与list()和dict(),哪个更好?

    我知道它们本质上是同一件事,但是就样式而言,哪个是创建空列表或字典的更好(更Pythonic的)?

    I understand that they are both essentially the same thing, but in terms of style, which is the better (more Pythonic) one to use to create an empty list or dict?


    回答 0

    在速度方面,空列表/字典没有竞争:

    >>> from timeit import timeit
    >>> timeit("[]")
    0.040084982867934334
    >>> timeit("list()")
    0.17704233359267718
    >>> timeit("{}")
    0.033620194745424214
    >>> timeit("dict()")
    0.1821558326547077
    

    对于非空:

    >>> timeit("[1,2,3]")
    0.24316302770330367
    >>> timeit("list((1,2,3))")
    0.44744206316727286
    >>> timeit("list(foo)", setup="foo=(1,2,3)")
    0.446036018543964
    >>> timeit("{'a':1, 'b':2, 'c':3}")
    0.20868602015059423
    >>> timeit("dict(a=1, b=2, c=3)")
    0.47635635255323905
    >>> timeit("dict(bar)", setup="bar=[('a', 1), ('b', 2), ('c', 3)]")
    0.9028228448029267
    

    另外,使用方括号表示法还可以使您使用列表和字典理解,这可能就足够了。

    In terms of speed, it’s no competition for empty lists/dicts:

    >>> from timeit import timeit
    >>> timeit("[]")
    0.040084982867934334
    >>> timeit("list()")
    0.17704233359267718
    >>> timeit("{}")
    0.033620194745424214
    >>> timeit("dict()")
    0.1821558326547077
    

    and for non-empty:

    >>> timeit("[1,2,3]")
    0.24316302770330367
    >>> timeit("list((1,2,3))")
    0.44744206316727286
    >>> timeit("list(foo)", setup="foo=(1,2,3)")
    0.446036018543964
    >>> timeit("{'a':1, 'b':2, 'c':3}")
    0.20868602015059423
    >>> timeit("dict(a=1, b=2, c=3)")
    0.47635635255323905
    >>> timeit("dict(bar)", setup="bar=[('a', 1), ('b', 2), ('c', 3)]")
    0.9028228448029267
    

    Also, using the bracket notation lets you use list and dictionary comprehensions, which may be reason enough.
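
    For example, the literal syntax extends naturally to comprehensions:

    squares_list = [x * x for x in range(5)]     # [0, 1, 4, 9, 16]
    squares_dict = {x: x * x for x in range(5)}  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}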


    回答 1

    我认为[]并且{}是创建空列表/字典的最Python易懂的方法。

    不过请注意set(),例如:

    this_set = {5}
    some_other_set = {}
    

    可能会造成混乱。第一个创建一个带有一个元素的集合,第二个创建一个空字典而不是一个集合。

    In my opinion [] and {} are the most pythonic and readable ways to create empty lists/dicts.

    Be wary of set()‘s though, for example:

    this_set = {5}
    some_other_set = {}
    

    This can be confusing. The first creates a set with one element; the second creates an empty dict, not a set.
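
    To make the pitfall explicit:

    this_set = {5}
    some_other_set = {}
    empty_set = set()            # the only way to spell an empty set

    print(type(this_set))        # <class 'set'>
    print(type(some_other_set))  # <class 'dict'>
    print(type(empty_set))       # <class 'set'>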


    回答 2

    字典字面量可能会快一点点,因为它的字节码更短:

    In [1]: import dis
    In [2]: a = lambda: {}
    In [3]: b = lambda: dict()
    
    In [4]: dis.dis(a)
      1           0 BUILD_MAP                0
                  3 RETURN_VALUE
    
    In [5]: dis.dis(b)
      1           0 LOAD_GLOBAL              0 (dict)
                  3 CALL_FUNCTION            0
                  6 RETURN_VALUE
    

    同样适用于listVS[]

    The dict literal might be a tiny bit faster as its bytecode is shorter:

    In [1]: import dis
    In [2]: a = lambda: {}
    In [3]: b = lambda: dict()
    
    In [4]: dis.dis(a)
      1           0 BUILD_MAP                0
                  3 RETURN_VALUE
    
    In [5]: dis.dis(b)
      1           0 LOAD_GLOBAL              0 (dict)
                  3 CALL_FUNCTION            0
                  6 RETURN_VALUE
    

    Same applies to the list vs []


    回答 3

    恕我直言,使用list()dict()让你的Python看起来像C.唉。

    IMHO, using list() and dict() makes your Python look like C. Ugh.


    回答 4

    对于[]和list()之间的差异,存在一个陷阱,我没有看到其他人指出过。如果将字典用作列表的成员,则两者将给出完全不同的结果:

    In [1]: foo_dict = {"1":"foo", "2":"bar"}
    
    In [2]: [foo_dict]
    Out [2]: [{'1': 'foo', '2': 'bar'}]
    
    In [3]: list(foo_dict)
    Out [3]: ['1', '2'] 
    

    In the case of the difference between [] and list(), there is a pitfall that I haven’t seen anyone else point out. If you use a dictionary as a member of the list, the two will give entirely different results:

    In [1]: foo_dict = {"1":"foo", "2":"bar"}
    
    In [2]: [foo_dict]
    Out [2]: [{'1': 'foo', '2': 'bar'}]
    
    In [3]: list(foo_dict)
    Out [3]: ['1', '2'] 
    

    回答 5

    list()和[]的工作方式不同:

    >>> def a(p=None):
    ...     print(id(p))
    ... 
    >>> for r in range(3):
    ...     a([])
    ... 
    139969725291904
    139969725291904
    139969725291904
    >>> for r in range(3):
    ...     a(list())
    ... 
    139969725367296
    139969725367552
    139969725367616
    

    list()总是在堆中创建新对象,但是[]可以出于多种原因重用存储单元。

    list() and [] work differently:

    >>> def a(p):
    ...     print(id(p))
    ... 
    >>> for r in range(3):
    ...     a([])
    ... 
    139969725291904
    139969725291904
    139969725291904
    >>> for r in range(3):
    ...     a(list())
    ... 
    139969725367296
    139969725367552
    139969725367616
    

    list() always creates a new object on the heap, but [] can reuse a memory cell for many reasons.


    回答 6

    如下所示,[]和list()在行为上有一个区别。如果要返回数字列表,则需要使用list(),否则我们将获得一个map对象!不确定如何解释。

    sth = [(1,2), (3,4),(5,6)]
    sth2 = map(lambda x: x[1], sth) 
    print(sth2) # print returns object <map object at 0x000001AB34C1D9B0>
    
    sth2 = [map(lambda x: x[1], sth)]
    print(sth2) # print returns object <map object at 0x000001AB34C1D9B0>
    type(sth2) # list 
    type(sth2[0]) # map
    
    sth2 = list(map(lambda x: x[1], sth))
    print(sth2) #[2, 4, 6]
    type(sth2) # list
    type(sth2[0]) # int

    There is one difference in behavior between [] and list(), as the example below shows. We need to use list() if we want the list of numbers returned; otherwise we get a map object! Not sure how to explain it though.

    sth = [(1,2), (3,4),(5,6)]
    sth2 = map(lambda x: x[1], sth) 
    print(sth2) # print returns object <map object at 0x000001AB34C1D9B0>
    
    sth2 = [map(lambda x: x[1], sth)]
    print(sth2) # print returns object <map object at 0x000001AB34C1D9B0>
    type(sth2) # list 
    type(sth2[0]) # map
    
    sth2 = list(map(lambda x: x[1], sth))
    print(sth2) #[2, 4, 6]
    type(sth2) # list
    type(sth2[0]) # int
    

    回答 7

    一对方括号表示一个列表对象,或者一个索引下标,如 my_List[x]。

    一对大括号表示一个字典对象。

    a_list = ['on', 'off', 1, 2]

    a_dict = {'on': 1, 'off': 2}

    A square bracket pair denotes either a list object or an index subscript, as in my_List[x].

    A curly brace pair denotes a dictionary object.

    a_list = ['on', 'off', 1, 2]

    a_dict = {'on': 1, 'off': 2}


    回答 8

    大多数时候,这主要是选择问题。这是一个偏好问题。

    但是请注意,例如,如果您有数字键,则不能这样做:

    mydict = dict(1="foo", 2="bar")

    你所要做的:

    mydict = {"1":"foo", "2":"bar"}

    It’s mainly a matter of choice most of the time. It’s a matter of preference.

    Note however that if you have numeric keys, for example, you can’t do:

    mydict = dict(1="foo", 2="bar")
    

    You have to do:

    mydict = {"1":"foo", "2":"bar"}
    

    Python中异常处理程序的成本

    问题:Python中异常处理程序的成本

    另一个问题中,可接受的答案建议用try / except块替换Python代码中的(非常便宜)if语句,以提高性能。

    除了编码样式问题外,并假设永远不会触发异常,拥有一个异常处理程序与没有一个异常处理程序,与拥有与零比较的if语句相比,在性能方面有何不同?

    In another question, the accepted answer suggested replacing a (very cheap) if statement in Python code with a try/except block to improve performance.

    Coding style issues aside, and assuming that the exception is never triggered, how much difference does it make (performance-wise) to have an exception handler, versus not having one, versus having a compare-to-zero if-statement?


    回答 0

    为什么不使用timeit模块进行测量?这样,您可以查看它是否与您的应用程序相关。

    好,所以我已经尝试了以下方法:

    import timeit
    
    statements=["""\
    try:
        b = 10/a
    except ZeroDivisionError:
        pass""",
    """\
    if a:
        b = 10/a""",
    "b = 10/a"]
    
    for a in (1,0):
        for s in statements:
            t = timeit.Timer(stmt=s, setup='a={}'.format(a))
            print("a = {}\n{}".format(a,s))
            print("%.2f usec/pass\n" % (1000000 * t.timeit(number=100000)/100000))

    结果:

    a = 1
    try:
        b = 10/a
    except ZeroDivisionError:
        pass
    0.25 usec/pass
    
    a = 1
    if a:
        b = 10/a
    0.29 usec/pass
    
    a = 1
    b = 10/a
    0.22 usec/pass
    
    a = 0
    try:
        b = 10/a
    except ZeroDivisionError:
        pass
    0.57 usec/pass
    
    a = 0
    if a:
        b = 10/a
    0.04 usec/pass
    
    a = 0
    b = 10/a
    ZeroDivisionError: int division or modulo by zero

    因此,正如预期的那样,没有任何异常处理程序会稍快一些(但在发生异常时会炸毁您的脸),并且只要不满足条件,它try/except就会比显式的要快if

    但这一切都在同一数量级内,并且无论哪种方式都不太重要。仅当实际满足条件时,if版本才会明显更快。

    Why don’t you measure it using the timeit module? That way you can see whether it’s relevant to your application.

    OK, so I’ve just tried the following:

    import timeit
    
    statements=["""\
    try:
        b = 10/a
    except ZeroDivisionError:
        pass""",
    """\
    if a:
        b = 10/a""",
    "b = 10/a"]
    
    for a in (1,0):
        for s in statements:
            t = timeit.Timer(stmt=s, setup='a={}'.format(a))
            print("a = {}\n{}".format(a,s))
            print("%.2f usec/pass\n" % (1000000 * t.timeit(number=100000)/100000))
    

    Result:

    a = 1
    try:
        b = 10/a
    except ZeroDivisionError:
        pass
    0.25 usec/pass
    
    a = 1
    if a:
        b = 10/a
    0.29 usec/pass
    
    a = 1
    b = 10/a
    0.22 usec/pass
    
    a = 0
    try:
        b = 10/a
    except ZeroDivisionError:
        pass
    0.57 usec/pass
    
    a = 0
    if a:
        b = 10/a
    0.04 usec/pass
    
    a = 0
    b = 10/a
    ZeroDivisionError: int division or modulo by zero
    

    So, as expected, not having any exception handler is slightly faster (but blows up in your face when the exception happens), and try/except is faster than an explicit if as long as the condition is not met.

    But it’s all within the same order of magnitude and unlikely to matter either way. Only if the condition is actually met, then the if version is significantly faster.


    回答 1

    设计和历史常见问题解答中实际上回答了这个问题

    如果没有引发异常,则try / except块非常有效。实际上捕获异常是昂贵的。

    This question is actually answered in the Design and History FAQ:

    A try/except block is extremely efficient if no exceptions are raised. Actually catching an exception is expensive.


    回答 2

    这个问题是误导的。如果您假设从未触发该异常,那么这两个都不是最佳代码。

    如果您假设异常是作为错误条件的一部分而触发的,那么您已经超出了想要最佳代码的范围(而且您可能始终无法像这样细粒度地对其进行处理)。

    如果您将异常用作标准控制流程的一部分(这是Pythonic的“请求原谅,而不是请求许可”的方式),那么异常将会被触发,其代价取决于异常的种类、if 语句的种类,以及您估计异常发生的时间百分比。

    This question is misleading. If you assume the exception is never triggered, neither one is optimal code.

    If you assume the exception is triggered as part of an error condition, you are already outside the realm of wanting optimal code (and you probably aren’t handling it at a fine-grained level like that anyway).

    If you are using the exception as part of the standard control flow – which is the Pythonic “ask forgiveness, not permission” way – then the exception is going to be triggered, and the cost depends on the kind of exception, the kind of if, and what percentage of time you estimate the exception happens.
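
    A concrete (purely illustrative) instance of that trade-off is the classic dictionary lookup:

    d = {'a': 1}

    # "Ask forgiveness": the hit path pays almost nothing,
    # the miss path pays the full cost of raising and catching KeyError.
    try:
        value = d['b']
    except KeyError:
        value = 0

    # "Ask permission": the membership check is paid on every lookup, hit or miss.
    value = d['b'] if 'b' in d else 0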


    Python vs Bash-在性能方面,每个任务都胜过其他任务?

    问题:Python vs Bash-在性能方面,每个任务都胜过其他任务?

    显然,Python更加用户友好,在Google上进行的快速搜索显示了许多结果,这些结果表明,由于Python是字节编译的,因此通常速度更快。我什至发现声称可以在基于字典的操作上看到超过2000%的改进。

    您对此事有何经验?每个人在哪种任务上都是明显的赢家?

    Obviously Python is more user friendly, and a quick search on Google shows many results saying that, since Python is byte-compiled, it is usually faster. I even found this, which claims that you can see an improvement of over 2000% on dictionary-based operations.

    What is your experience on this matter? In which kinds of tasks is each one a clear winner?


    回答 0

    典型的大型机流程…

    Input Disk/Tape/User (runtime) --> Job Control Language (JCL) --> Output Disk/Tape/Screen/Printer
                                       |                          ^
                                       v                          |
                                       `--> COBOL Program --------' 

    典型的Linux流程…

    Input Disk/SSD/User (runtime) --> sh/bash/ksh/zsh/... ----------> Output Disk/SSD/Screen/Printer
                                       |                          ^
                                       v                          |
                                       `--> Python script --------'
                                       |                          ^
                                       v                          |
                                       `--> awk script -----------'
                                       |                          ^
                                       v                          |
                                       `--> sed script -----------'
                                       |                          ^
                                       v                          |
                                       `--> C/C++ program --------'
                                       |                          ^
                                       v                          |
                                       `--- Java program ---------'
                                       |                          ^
                                       v                          |
                                       :                          :

    外壳是Linux的粘合剂

    像sh / ksh / bash / … 这样的Linux shell 提供输入/输出/流控制指定功能,就像旧的大型机Job Control Language …一样,但是在类固醇上!它们本身就是图灵完整的语言,同时经过优化以有效地将数据和控制传递给O / S支持的以任何语言编写的其他执行过程和从其他执行过程进行传递。

    大多数Linux应用程序,无论程序的大部分语言是哪种语言,都取决于shell脚本,而Bash已成为最常见的应用程序。单击桌面上的图标通常会运行一个简短的Bash脚本。该脚本直接或间接知道所有所需文件的位置,并设置变量和命令行参数,最后调用程序。这是shell最简单的用法。

    然而,众所周知,如果没有成千上万的外壳脚本来启动系统,响应事件,控制执行优先级以及编译,配置和运行程序,几乎就不会是Linux。其中许多都是非常大而复杂的。

    Shell提供了一种基础结构,使我们可以使用在运行时而不是编译时链接在一起的预构建组件。这些组件本身就是独立的程序,可以单独使用或以其他组合使用,而无需重新编译。调用它们的语法与Bash内置命令的语法没有区别,实际上,有许多内置命令,在系统上也有独立的可执行文件,这些命令通常具有其他选项。

    PythonBash在性能上没有语言范围的差异。这完全取决于每个代码的编码方式以及调用哪些外部工具。

    任何众所周知的工具,例如awk,sed,grep,bc,dc,tr等,都将以任何一种语言进行操作。然后,对于没有图形用户界面的任何事物,Bash都是首选的,因为从Bash之类的工具中调用和传递数据比从Python那里更容易,更有效。

    性能

    它的总体吞吐量和/或响应能力是否好于等效的Python取决于Bash shell脚本调用的程序及其对子任务的适用性。使事情复杂化的是,Python和大多数语言一样,也可以调用其他可执行文件,尽管它比较麻烦,因此不常用。

    用户界面

    一个领域的Python是明显的赢家是用户界面。这使其成为构建本地或客户端服务器应用程序的极佳语言,因为它本身支持GTK图形,并且比Bash直观得多。

    Bash仅能理解文本。必须为GUI调用其他工具,并从这些工具传回数据。一个Python的脚本是一个选项。更快但更不灵活的选项是YAD,Zenity和GTKDialog之类的二进制文件。

    虽然像Bash这样的shell 可以与YadGtkDialog(GTK +函数的嵌入式XML相似的接口)dialogxmessage等GUI很好地配合使用,但Python的功能更加强大,因此对于复杂的GUI窗口也更好。

    摘要

    使用Shell脚本进行构建就像使用台式机组装具有现成组件的计算机一样。

    使用PythonC ++或大多数其他语言进行构建更像是通过像智能手机一样将芯片(库)和其他电子零件焊接在一起来构建计算机。

    最好的结果通常是通过使用多种语言来获得的,每种语言都可以尽其所能。一个开发人员称此为“ 多语言编程 ”。

    Typical mainframe flow…

    Input Disk/Tape/User (runtime) --> Job Control Language (JCL) --> Output Disk/Tape/Screen/Printer
                                       |                          ^
                                       v                          |
                                       `--> COBOL Program --------' 
    

    Typical Linux flow…

    Input Disk/SSD/User (runtime) --> sh/bash/ksh/zsh/... ----------> Output Disk/SSD/Screen/Printer
                                       |                          ^
                                       v                          |
                                       `--> Python script --------'
                                       |                          ^
                                       v                          |
                                       `--> awk script -----------'
                                       |                          ^
                                       v                          |
                                       `--> sed script -----------'
                                       |                          ^
                                       v                          |
                                       `--> C/C++ program --------'
                                       |                          ^
                                       v                          |
                                       `--- Java program ---------'
                                       |                          ^
                                       v                          |
                                       :                          :
    

    Shells are the glue of Linux

    Linux shells like sh/ksh/bash/… provide input/output/flow-control designation facilities much like the old mainframe Job Control Language… but on steroids! They are Turing complete languages in their own right while being optimized to efficiently pass data and control to and from other executing processes written in any language the O/S supports.

    Most Linux applications, regardless what language the bulk of the program is written in, depend on shell scripts and Bash has become the most common. Clicking an icon on the desktop usually runs a short Bash script. That script, either directly or indirectly, knows where all the files needed are and sets variables and command line parameters, finally calling the program. That’s a shell’s simplest use.

    Linux as we know it however would hardly be Linux without the thousands of shell scripts that startup the system, respond to events, control execution priorities and compile, configure and run programs. Many of these are quite large and complex.

    Shells provide an infrastructure that lets us use pre-built components that are linked together at run time rather than compile time. Those components are free-standing programs in their own right that can be used alone or in other combinations without recompiling. The syntax for calling them is indistinguishable from that of a Bash builtin command, and there are in fact numerous builtin commands for which there is also a stand-alone executable on the system, often having additional options.

    There is no language-wide difference between Python and Bash in performance. It entirely depends on how each is coded and which external tools are called.

    Any of the well known tools like awk, sed, grep, bc, dc, tr, etc. will leave doing those operations in either language in the dust. Bash then is preferred for anything without a graphical user interface since it is easier and more efficient to call and pass data back from a tool like those with Bash than Python.

    Performance

    It depends on which programs the Bash shell script calls and their suitability for the subtask they are given whether the overall throughput and/or responsiveness will be better or worse than the equivalent Python. To complicate matters Python, like most languages, can also call other executables, though it is more cumbersome and thus not as often used.
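    To make the "more cumbersome" point concrete, here is a minimal sketch of driving the same external tools from Python via the standard subprocess module. The log file name and the cut/sort/wc pipeline are made up for illustration; the shell one-liner in the comment does the identical job.

    import subprocess

    # Shell one-liner this mimics (access.log is a hypothetical file):
    #   cut -d' ' -f1 access.log | sort -u | wc -l
    cut = subprocess.Popen(["cut", "-d", " ", "-f1", "access.log"],
                           stdout=subprocess.PIPE)
    sort = subprocess.Popen(["sort", "-u"],
                            stdin=cut.stdout, stdout=subprocess.PIPE)
    cut.stdout.close()          # let cut receive SIGPIPE if sort exits early
    out = subprocess.run(["wc", "-l"], stdin=sort.stdout,
                         capture_output=True, text=True)
    sort.stdout.close()
    print(out.stdout.strip())   # number of unique client IPs

    The shell version is one line; the Python version needs explicit plumbing for every pipe, which is exactly the trade-off this answer describes.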

    User Interface

    One area where Python is the clear winner is user interface. That makes it an excellent language for building local or client-server applications as it natively supports GTK graphics and is far more intuitive than Bash.

    Bash only understands text. Other tools must be called for a GUI and data passed back from them. A Python script is one option. Faster but less flexible options are the binaries like YAD, Zenity, and GTKDialog.

    While shells like Bash work well with GUIs like Yad, GtkDialog (embedded XML-like interface to GTK+ functions), dialog, and xmessage, Python is much more capable and so better for complex GUI windows.

    Summary

    Building with shell scripts is like assembling a computer with off-the-shelf components the way desktop PCs are.

    Building with Python, C++ or most any other language is more like building a computer by soldering the chips (libraries) and other electronic parts together the way smartphones are.

    The best results are usually obtained by using a combination of languages where each can do what they do best. One developer calls this “polyglot programming“.


    回答 1

    通常,只有在python不可用的环境中,bash才能比python更好。:)

    认真地讲,我每天都要和这两种语言打交道,如果可以选择,我会毫不犹豫地选 Python 而不是 bash。唉,我被迫在某些“小型”平台上使用 bash,因为有人(依我之见是错误地)认定 python“太大”,装不下。

    虽然对于某些特定任务,bash 确实可能比 python 快,但它在开发速度和可维护性上永远比不上 python(至少在代码超过 10 行左右之后)。相对于 python、ruby、lua 等,Bash 唯一的强项就是它的无处不在。

    Generally, bash works better than python only in those environments where python is not available. :)

    Seriously, I have to deal with both languages daily, and will take python instantly over bash if given the choice. Alas, I am forced to use bash on certain “small” platforms because someone has (mistakenly, IMHO) decided that python is “too large” to fit.

    While it is true that bash might be faster than python for some select tasks, it can never be as quick to develop with, or as easy to maintain (at least after you get past 10 lines of code or so). Bash’s sole strong point wrt python or ruby or lua, etc., is its ubiquity.


    回答 2

    在bash和Python都是明智的选择的情况下,开发人员的效率对我而言更为重要。

    有些任务很适合 bash,另一些则适合 Python。对我来说,先把某件事写成 bash 脚本,随着它在几周里不断演化再改写成 Python,并不罕见。

    Python 的一大优势在于处理文件名的各种边角情况,同时它还提供 glob、shutil、subprocess 等模块来满足常见的脚本需求。

    Developer efficiency matters much more to me in scenarios where both bash and Python are sensible choices.

    Some tasks lend themselves well to bash, and others to Python. It also isn’t unusual for me to start something as a bash script and change it to Python as it evolves over several weeks.

    A big advantage Python has is in corner cases around filename handling, while it has glob, shutil, subprocess, and others for common scripting needs.
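    As a small illustration of those corner cases, here is a hedged sketch using glob and shutil; the data/ and backup/ directory names are hypothetical. Because each filename arrives as a single Python string, names containing spaces or newlines need no extra quoting, which is where naive shell loops often go wrong.

    import glob
    import os
    import shutil

    # Copy every .csv under data/ into backup/ (directory names are
    # made up for the example).  Each path is one Python string, so
    # odd characters in the filename need no quoting or escaping here.
    os.makedirs("backup", exist_ok=True)
    for path in glob.glob("data/*.csv"):
        shutil.copy2(path, "backup")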


    回答 3

    在编写脚本时,性能并不重要(在大多数情况下)。
    如果您关心性能,“ Python vs Bash”是一个错误的问题。

    Python
    +易于编写
    +易于维护
    +更容易复用代码(你不妨试试在 sh 里找出一种通用且不易出错的方式来包含存放公共代码的文件)
    +您也可以使用OOP!
    +更轻松的参数解析。好吧,其实也谈不上轻松,在我看来仍然太啰嗦,但 python 内置了 argparse。
    -丑陋的 subprocess。试着用它串联几条命令,你就会为代码变得多丑而哭出来,尤其是当你还在意退出码的时候。

    Bash
    +如前所述,无处不在。
    +简单的命令串联。这正是你用简单方式把不同命令粘合在一起的手段。另外,Bash(而非 sh)还有一些改进,例如 pipefail,使得串联命令既简短又富有表现力。
    +不需要安装第三方程序。可以立即执行。
    -天哪,到处都是陷阱。IFS,CDPATH ..数千种。

    如果编写的脚本大于100 LOC:请选择Python
    如果需要在脚本中进行路径操作:请选择Python(3)
    如果需要一些类似于alias但有点复杂的:请选择Bash / sh

    无论如何,两种语言都应该亲自试一试,才能了解它们各自能做什么。

    也许可以通过打包和IDE支持点来扩展答案,但是我对此并不熟悉。

    与往常一样,你只能在两个糟糕透顶的选项之间做选择。请记住,就在几年前,Perl 还是新的希望,看看它现在在哪儿。

    When you writing scripts performance does not matter (in most cases).
    If you care about performance ‘Python vs Bash’ is a false question.

    Python:
    + easier to write
    + easier to maintain
    + easier code reuse (try to find universal error-proof way to include files with common code in sh, I dare you)
    + you can do OOP with it too!
    + easier arguments parsing. well, not easier, exactly. it still will be too wordy to my taste, but python have argparse facility built in.
    – ugly ugly ‘subprocess’. try to chain commands and not to cry a river how ugly your code will become. especially if you care about exit codes.

    Bash:
    + ubiquity, as was said earlier, indeed.
    + simple commands chaining. that’s how you glue together different commands in a simple way. Also Bash (not sh) have some improvements, like pipefail, so chaining is really short and expressive.
    + do not require 3rd-party programs to be installed. can be executed right away.
    – god, it’s full of gotchas. IFS, CDPATH.. thousands of them.

    If one writing a script bigger than 100 LOC: choose Python
    If one need path manipulation in script: choose Python(3)
    If one need somewhat like alias but slightly complicated: choose Bash/sh

    Anyway, one should try both sides to get the idea what are they capable of.

    Maybe answer can be extended with packaging and IDE support points, but I’m not familiar with this sides.

    As always you have to choose from turd sandwich and giant douche. And remember, just a few years ago Perl was new hope. Where it is now.
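    To illustrate the "argparse facility built in" point from the Python list above, here is a minimal, hedged sketch; the script's purpose and option names are invented for the example.

    import argparse

    # Option names and the script's purpose are made up for illustration.
    parser = argparse.ArgumentParser(description="Resize images in a directory.")
    parser.add_argument("directory", help="directory containing the images")
    parser.add_argument("-q", "--quality", type=int, default=85,
                        help="output quality, 1-100 (default: 85)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print each file as it is processed")
    args = parser.parse_args()

    if args.verbose:
        print(f"Processing {args.directory} at quality {args.quality}")

    Running the script with -h produces a formatted usage message for free, which is the part that takes real effort to reproduce in plain sh.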


    回答 4

    在进程启动时间方面,bash 的性能优于 python。

    以下是我的运行Linux Mint的核心i7笔记本电脑的一些测量结果:

    Starting process                       Startup time
    
    empty /bin/sh script                   1.7 ms
    empty /bin/bash script                 2.8 ms
    empty python script                    11.1 ms
    python script with a few libs*         110 ms

    * Python 加载的库是:os、os.path、json、time、requests、threading、subprocess

    这显示出巨大的差异,但是一旦 bash 需要做任何实际的工作,由于它通常必须调用外部进程,执行时间就会迅速变差。

    如果您关心性能,请仅将bash用于:

    • 非常简单且经常调用的脚本
    • 主要调用其他进程的脚本
    • 当您希望手动运维操作与脚本化之间的摩擦最小时:快速验证几条命令,然后把它们放进 file.sh

    Performance-wise bash outperforms python in the process startup time.

    Here are some measurements from my core i7 laptop running Linux Mint:

    Starting process                       Startup time
    
    empty /bin/sh script                   1.7 ms
    empty /bin/bash script                 2.8 ms
    empty python script                    11.1 ms
    python script with a few libs*         110 ms
    

    *Python loaded libs are: os, os.path, json, time, requests, threading, subprocess

    This shows a huge difference however bash execution time degrades quickly if it has to do anything sensible since it usually must call external processes.

    If you care about performance use bash only for:

    • really simple and frequently called scripts
    • scripts that mainly call other processes
    • when you need minimal friction between manual administrative actions and scripting – fast check a few commands and place them in the file.sh
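    The table above can be roughly reproduced with a short Python script; this is only a sketch, and the absolute numbers depend entirely on the machine, OS caching, and interpreter versions.

    import subprocess
    import time

    def startup_ms(cmd, runs=20):
        """Average wall-clock time to start and exit the given command."""
        start = time.perf_counter()
        for _ in range(runs):
            subprocess.run(cmd, check=True)
        return (time.perf_counter() - start) / runs * 1000

    for cmd in (["/bin/sh", "-c", ":"],
                ["/bin/bash", "-c", ":"],
                ["python3", "-c", "pass"]):
        print(f"{' '.join(cmd):<25} {startup_ms(cmd):6.1f} ms")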

    回答 5

    Bash 主要是一种批处理/shell 脚本语言,对各种数据类型的支持要弱得多,控制结构也有各种怪癖,更不用说兼容性问题了。

    哪个更快?都不是,因为您在这里比较的根本不是同类事物。如果您要对一个 ascii 文本文件排序,并且使用的是 zcat、sort、uniq、sed 之类的工具,那么在性能上您会把 Python 远远甩在后面。

    但是,如果您需要一个支持浮点和各种控制流的真正编程环境,那么 Python 无疑完胜。如果您分别用 Bash 和 Python 写同一个递归算法,Python 版本会快一个数量级甚至更多。

    Bash is primarily a batch / shell scripting language with far less support for various data types and all sorts of quirks around control structures — not to mention compatibility issues.

    Which is faster? Neither, because you are not comparing apples to apples here. If you had to sort an ascii text file and you were using tools like zcat, sort, uniq, and sed then you will smoke Python performance wise.

    However, if you need a proper programming environment that supports floating point and various control flow, then Python wins hands down. If you wrote say a recursive algorithm in Bash and Python, the Python version will win in an order of magnitude or more.
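    As a hedged illustration of that claim, the sketch below times the same naive recursive Fibonacci in Python and in a Bash function run through subprocess; the input is kept small because the Bash version forks a subshell per recursive call and can already take a few seconds.

    import subprocess
    import time

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    # Equivalent naive recursion in Bash; every recursive call is a
    # command substitution, i.e. a forked subshell.
    BASH_FIB = r'''
    fib() { local n=$1
      if (( n < 2 )); then echo "$n"
      else echo $(( $(fib $((n - 1))) + $(fib $((n - 2))) )); fi
    }
    fib 15
    '''

    start = time.perf_counter()
    print("python:", fib(15), f"{time.perf_counter() - start:.4f}s")

    start = time.perf_counter()
    res = subprocess.run(["bash", "-c", BASH_FIB], capture_output=True, text=True)
    print("bash:  ", res.stdout.strip(), f"{time.perf_counter() - start:.4f}s")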


    回答 6

    如果您希望以最小的努力拼凑快速的实用程序,那么bash就是不错的选择。对于应用程序的包装器而言,bash不可估量。

    任何可能让您反复回来添加改进的东西,(尽管并非总是)更适合用 Python 之类的语言来写,因为超过 1000 行的 Bash 代码会变得非常难以维护。Bash 代码一旦变长,调试起来也很让人头疼。

    根据我的经验,这类问题的一部分难点在于 shell 脚本通常都是定制化的任务。我遇到的 shell 脚本任务中,很少有现成的免费解决方案可用。

    If you are looking to cobble together a quick utility with minimal effort, bash is good. For a wrapper round an application, bash is invaluable.

    Anything that may have you coming back over and over to add improvements is probably (though not always) better suited to a language like Python as Bash code comprising over a 1000 lines gets very painful to maintain. Bash code is also irritating to debug when it gets long…….

    Part of the problem with these kind of questions is, from my experience, that shell scripts are usually all custom tasks. There have been very few shell scripting tasks that I have come across where there is already a solution freely available.


    回答 7

    我认为在以下两种场景中,Bash 的性能至少不相上下:

    • 命令行实用程序的脚本
    • 只需很短时间即可执行的脚本;在其中启动Python解释器需要比操作本身更多的时间

    话虽如此,我通常并不真正在意脚本语言本身的性能。如果性能真的成了问题,那就不该写脚本,而是该写程序(可能用 Python)。

    There are 2 scenario’s where Bash performance is at least equal I believe:

    • Scripting of command line utilities
    • Scripts which take only a short time to execute; where starting the Python interpreter takes more time than the operation itself

    That said, I usually don’t really concern myself with performance of the scripting language itself. If performance is a real issue you don’t script but program (possibly in Python).


    回答 8

    我之所以发布此最新答案,主要是因为Google喜欢这个问题。

    我认为这个问题和它的背景真正关乎的是工作流程,而不是工具。总的理念始终是“用合适的工具干活”。但在这之前还有一条,许多人一旦迷失在工具里就会忘记:“把活干完”。

    当我遇到一个尚未完全定义清楚的问题时,我几乎总是从 Bash 开始。我曾用既易读又可维护的大型 Bash 脚本解决过一些相当棘手的问题。

    但是,问题从什么时候开始超出了应该让 Bash 去做的范围呢?我有几条检查项用来提醒自己:

    1. 我是否在盼着 Bash 能有二维(或更高维)数组?如果是,那就该意识到 Bash 不是一种出色的数据处理语言。
    2. 我花在为其他实用程序准备数据上的工夫,是否已经超过了实际运行这些程序本身?如果是,同样该意识到 Bash 不是一种出色的数据处理语言。
    3. 我的脚本是不是单纯因为太大而难以管理了?如果是,那么要意识到,尽管 Bash 可以导入脚本库,但它缺少其他语言那样的软件包系统。与大多数语言相比,它确实是一种“自己动手”的语言。话说回来,它内置了大量功能(有人说太多了……)

    这个清单还可以继续列下去。归根结底,当您花在维持脚本正常运行上的精力超过了添加新功能的精力时,就该离开 Bash 了。

    假设您已决定将工作移至Python。如果您的Bash脚本干净,则初始转换非常简单。甚至还有几个转换器/翻译器将为您做第一遍。

    下一个问题是:您放弃转向Python的什么?

    1. 必须把对外部实用程序的所有调用包装进 subprocess 模块(或等效模块)提供的某个接口里。做法有很多种,而且在 3.7 之前要把它用对还需要费些功夫(3.7 改进了 subprocess.run(),使其能够自行处理所有常见情况)。

    2. 令人惊讶的是,Python 没有标准的、跨平台的、带超时的非阻塞键盘(stdin)轮询工具。Bash 的 read 命令是进行简单用户交互的绝佳工具。我最常见的用法是显示一个旋转指示符直到用户按下某个键,同时(在每一步旋转时)运行一个轮询函数以确保一切仍在正常运行。这个问题比乍看起来要棘手,所以我经常干脆直接调用 Bash:代价不小,但恰好满足了我的需求。

    3. 如果您是在嵌入式或受内存限制的系统上进行开发,Python的内存占用量可能是Bash的很多倍(取决于手头的任务)。另外,内存中几乎总是有一个Bash实例,而Python可能并非如此。

    4. 对于只运行一次并快速退出的脚本,Python的启动时间可能比Bash的启动时间长得多。但是,如果脚本中包含大量计算,Python会迅速前进。

    5. Python 拥有地球上最完善的软件包体系。当 Bash 脚本稍微变复杂时,Python 多半已经有一个包,能把一整段 Bash 代码变成一次调用。不过,找到合适的包恰恰是成为 Pythonista 过程中最大、也最令人生畏的部分。幸运的是,Google 和 StackExchange 是您的朋友。

    I’m posting this late answer primarily because Google likes this question.

    I believe the issue and context really should be about the workflow, not the tools. The overall philosophy is always “Use the right tool for the job.” But before this comes one that many often forget when they get lost in the tools: “Get the job done.”

    When I have a problem that isn’t completely defined, I almost always start with Bash. I have solved some gnarly problems in large Bash scripts that are both readable and maintainable.

    But when does the problem start to exceed what Bash should be asked to do? I have some checks I use to give me warnings:

    1. Am I wishing Bash had 2D (or higher) arrays? If yes, it’s time to realize that Bash is not a great data processing language.
    2. Am I doing more work preparing data for other utilities than I am actually running those utilities? If yes, time again to realize Bash is not a great data processing language.
    3. Is my script simply getting too large to manage? If yes, it is important to realize that while Bash can import script libraries, it lacks a package system like other languages. It’s really a “roll your own” language compared to most others. Then again, it has an enormous amount of functionality built-in (some say too much…)

    The list goes on. Bottom-line, when you are working harder to keep your scripts running than you do adding features, it’s time to leave Bash.

    Let’s assume you’ve decided to move your work to Python. If your Bash scripts are clean, the initial conversion is quite straightforward. There are even several converters / translators that will do the first pass for you.

    The next question is: What do you give up moving to Python?

    1. All calls to external utilities must be wrapped in something from the subprocess module (or equivalent). There are multiple ways to do this, and until 3.7 it took some effort to get it right (3.7 improved subprocess.run() to handle all common cases on its own).

    2. Surprisingly, Python has no standard platform-independent non-blocking utility (with timeout) for polling the keyboard (stdin). The Bash read command is an awesome tool for simple user interaction. My most common use is to show a spinner until the user presses a key, while also running a polling function (with each spinner step) to make sure things are still running well. This is a harder problem than it would appear at first, so I often simply make a call to Bash: Expensive, but it does precisely what I need.

    3. If you are developing on an embedded or memory-constrained system, Python’s memory footprint can be many times larger than Bash’s (depending on the task at hand). Plus, there is almost always an instance of Bash already in memory, which may not be the case for Python.

    4. For scripts that run once and exit quickly, Python’s startup time can be much longer than Bash’s. But if the script contains significant calculations, Python quickly pulls ahead.

    5. Python has the most comprehensive package system on the planet. When Bash gets even slightly complex, Python probably has a package that makes whole chunks of Bash become a single call. However, finding the right package(s) to use is the biggest and most daunting part of becoming a Pythonista. Fortunately, Google and StackExchange are your friends.
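    As a hedged sketch of point 1, this is roughly what wrapping an external utility looks like with the post-3.7 subprocess.run(); the grep command and log path are only an example.

    import subprocess

    # Illustrative only: count lines containing ERROR in a log file.
    result = subprocess.run(
        ["grep", "-c", "ERROR", "/var/log/syslog"],
        capture_output=True,   # capture_output was added in Python 3.7
        text=True,
    )

    if result.returncode == 0:
        print("error lines:", result.stdout.strip())
    elif result.returncode == 1:
        print("no error lines found")
    else:
        # grep exits with 2 on real failures, e.g. an unreadable file
        print("grep failed:", result.stderr.strip())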


    回答 9

    我不知道这种说法是否准确,但我发现 python/ruby 在涉及大量数学计算的脚本中表现好得多。否则你就得用 dc 或其他“任意精度计算器”,那实在非常痛苦。使用 python,您可以更好地控制浮点数和整数,而且进行大量计算要容易得多。

    特别是,我永远不会使用bash脚本来处理二进制信息或字节。相反,我会使用python(也许)或C ++或什至Node.JS之类的东西。

    I don’t know if this is accurate, but I have found that python/ruby works much better for scripts that have a lot of mathematical computations. Otherwise you have to use dc or some other “arbitrary precision calculator”. It just becomes a very big pain. With python you have much more control over floats vs ints and it is much easier to perform a lot of computations.

    In particular, I would never work with a bash script to handle binary information or bytes. Instead I would use something like python (maybe) or C++ or even Node.JS.
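    A brief sketch of the kind of work meant here, using only the standard library; the file path is just an example. Arbitrary-precision integers, Decimal arithmetic, and raw byte handling are all available without shelling out to bc, dc, or od.

    from decimal import Decimal, getcontext

    # Integers are arbitrary precision out of the box.
    print(2 ** 200)

    # When float rounding matters, Decimal replaces shelling out to bc/dc.
    getcontext().prec = 50
    print(Decimal(1) / Decimal(7))

    # Raw byte handling is equally direct; the path is only an example.
    with open("/bin/sh", "rb") as f:
        header = f.read(4)
    print(header.hex())   # an ELF binary starts with 7f454c46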


    回答 10

    在性能方面,两者可以做同样的事情,所以问题就变成了节省更多开发时间的问题?

    Bash依赖于调用其他命令,并通过管道传递它们来创建新命令。这样做的好处是,无论他们使用什么编程语言,都可以使用从其他人那里借来的代码快速创建新程序。

    这也具有很好的抵抗子命令更改的副作用,因为它们之间的界面只是纯文本。

    另外,Bash在如何编写方面非常宽容。这意味着它可以在更广泛的上下文中很好地工作,但是它也依赖于程序员以一种干净安全的方式进行编码的意图。否则,Bash不会阻止您制造混乱。

    Python的样式更加结构化,因此凌乱的程序员不会那么凌乱。它也可以在Linux以外的操作系统上运行,如果需要这种可移植性,使其立即变得更合适。

    但它在调用其他命令这件事上就没那么简单了。因此,如果您的操作系统是 Unix,您很可能会发现用 Bash 开发是最快的方式。

    何时使用Bash:

    • 它是一个非图形程序,或者是图形程序的引擎。
    • 仅适用于Unix。

    何时使用Python:

    • 这是一个图形程序。
    • 它可以在Windows上运行。

    Performance wise both can do equally the same, so the question becomes which saves more development time?

    Bash relies on calling other commands, and piping them for creating new ones. This has the advantage that you can quickly create new programs just with the code borrowed from other people, no matter what programming language they used.

    This also has the side effect of resisting change in sub-commands pretty well, as the interface between them is just plain text.

    Additionally Bash is very permissive on how you can write on it. This means it will work well for a wider variety of context, but it also relies on the programmer having the intention of coding in a clean safe manner. Otherwise Bash won’t stop you from building a mess.

    Python is more structured on style, so a messy programmer won’t be as messy. It will also work on operating systems outside Linux, making it instantly more appropriate if you need that kind of portability.

    But it isn’t as simple for calling other commands. So if your operating system is Unix most likely you will find that developing on Bash is the fastest way to develop.

    When to use Bash:

    • It’s a non graphical program, or the engine of a graphical one.
    • It’s only for Unix.

    When to use Python:

    • It’s a graphical program.
    • It shall work on Windows.

    为什么pow(a,d,n)比a ** d%n快得多?

    问题:为什么pow(a,d,n)比a ** d%n快得多?

    我正在尝试实现 Miller-Rabin 素性测试,并对为什么处理中等大小的数字(约 7 位数)要花这么长时间(> 20 秒)感到困惑。我最终发现以下这行代码是问题的根源:

    x = a**d % n

    (其中 a、d、n 都是大小相近但互不相等的中等大小数字,** 是幂运算符,% 是取模运算符)

    然后,我尝试将其替换为以下内容:

    x = pow(a, d, n)

    相比之下,它几乎是瞬时的。

    作为背景,这是原始函数:

    from random import randint
    
    def primalityTest(n, k):
        if n < 2:
            return False
        if n % 2 == 0:
            return False
        s = 0
        d = n - 1
        while d % 2 == 0:
            s += 1
            d >>= 1
        for i in range(k):
            rand = randint(2, n - 2)
            x = rand**d % n         # offending line
            if x == 1 or x == n - 1:
                continue
            for r in range(s):
                toReturn = True
                x = pow(x, 2, n)
                if x == 1:
                    return False
                if x == n - 1:
                    toReturn = False
                    break
            if toReturn:
                return False
        return True
    
    print(primalityTest(2700643,1))

    定时计算示例:

    from timeit import timeit
    
    a = 2505626
    d = 1520321
    n = 2700643
    
    def testA():
        print(a**d % n)
    
    def testB():
        print(pow(a, d, n))
    
    print("time: %(time)fs" % {"time":timeit("testA()", setup="from __main__ import testA", number=1)})
    print("time: %(time)fs" % {"time":timeit("testB()", setup="from __main__ import testB", number=1)})

    输出(与PyPy 1.9.0一起运行):

    2642565
    time: 23.785543s
    2642565
    time: 0.000030s

    输出(在Python 3.3.0中运行,2.7.2返回的时间非常相似):

    2642565
    time: 14.426975s
    2642565
    time: 0.000021s

    还有一个相关的问题,为什么使用Python 2或3运行时,这种计算几乎比使用PyPy时快两倍,而通常PyPy却要快得多

    I was trying to implement a Miller-Rabin primality test, and was puzzled why it was taking so long (> 20 seconds) for midsize numbers (~7 digits). I eventually found the following line of code to be the source of the problem:

    x = a**d % n
    

    (where a, d, and n are all similar, but unequal, midsize numbers, ** is the exponentiation operator, and % is the modulo operator)

    I then I tried replacing it with the following:

    x = pow(a, d, n)
    

    and it by comparison it is almost instantaneous.

    For context, here is the original function:

    from random import randint
    
    def primalityTest(n, k):
        if n < 2:
            return False
        if n % 2 == 0:
            return False
        s = 0
        d = n - 1
        while d % 2 == 0:
            s += 1
            d >>= 1
        for i in range(k):
            rand = randint(2, n - 2)
            x = rand**d % n         # offending line
            if x == 1 or x == n - 1:
                continue
            for r in range(s):
                toReturn = True
                x = pow(x, 2, n)
                if x == 1:
                    return False
                if x == n - 1:
                    toReturn = False
                    break
            if toReturn:
                return False
        return True
    
    print(primalityTest(2700643,1))
    

    An example timed calculation:

    from timeit import timeit
    
    a = 2505626
    d = 1520321
    n = 2700643
    
    def testA():
        print(a**d % n)
    
    def testB():
        print(pow(a, d, n))
    
    print("time: %(time)fs" % {"time":timeit("testA()", setup="from __main__ import testA", number=1)})
    print("time: %(time)fs" % {"time":timeit("testB()", setup="from __main__ import testB", number=1)})
    

    Output (run with PyPy 1.9.0):

    2642565
    time: 23.785543s
    2642565
    time: 0.000030s
    

    Output (run with Python 3.3.0, 2.7.2 returns very similar times):

    2642565
    time: 14.426975s
    2642565
    time: 0.000021s
    

    And a related question, why is this calculation almost twice as fast when run with Python 2 or 3 than with PyPy, when usually PyPy is much faster?


    回答 0

    请参阅 Wikipedia 上有关模幂运算的文章。基本上,当您计算 a**d % n 时,实际上必须先算出 a**d,而它可能非常大。但存在不必真正算出 a**d 就能得到 a**d % n 的方法,这正是 pow 所做的。** 运算符做不到这一点,因为它无法“预见未来”,不知道您接下来马上要取模。

    See the Wikipedia article on modular exponentiation. Basically, when you do a**d % n, you actually have to calculate a**d, which could be quite large. But there are ways of computing a**d % n without having to compute a**d itself, and that is what pow does. The ** operator can’t do this because it can’t “see into the future” to know that you are going to immediately take the modulus.
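    For readers who want to see the trick itself, here is a minimal square-and-multiply sketch of modular exponentiation. It is the standard textbook technique, not necessarily CPython's exact implementation; the built-in three-argument pow() uses a more optimized variant of the same idea.

    def mod_pow(a, d, n):
        """Right-to-left square-and-multiply: computes a**d % n without
        ever forming the huge intermediate value a**d."""
        result = 1
        a %= n
        while d > 0:
            if d & 1:                # current bit of the exponent is set
                result = (result * a) % n
            a = (a * a) % n          # square the base at each step
            d >>= 1
        return result

    # Matches the built-in three-argument pow() on the question's numbers:
    assert mod_pow(2505626, 1520321, 2700643) == pow(2505626, 1520321, 2700643)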


    回答 1

    BrenBarn回答了您的主要问题。除了您:

    为什么用Python 2或3运行时,它的速度几乎是PyPy的两倍,而通常PyPy要快得多?

    如果您阅读了PyPy的性能页面,这正是PyPy不擅长的事情-实际上,他们给出的第一个示例是:

    糟糕的例子包括使用大长整数(large longs)进行计算:这部分工作是由无法被优化的支持代码完成的。

    从理论上讲,把“先算巨大的幂再取模”转换为模幂运算(至少在第一遍之后)是 JIT 有可能做到的一种变换……但 PyPy 的 JIT 做不到。

    顺带一提,如果您需要用巨大的整数做计算,可以看看像 gmpy 这样的第三方模块。在主流用途之外的某些场景下,它有时会比 CPython 的原生实现快得多,而且还提供了许多您原本得自己编写的附加功能,代价是使用起来没那么方便。

    BrenBarn answered your main question. For your aside:

    why is it almost twice as fast when run with Python 2 or 3 than PyPy, when usually PyPy is much faster?

    If you read PyPy’s performance page, this is exactly the kind of thing PyPy is not good at—in fact, the very first example they give:

    Bad examples include doing computations with large longs – which is performed by unoptimizable support code.

    Theoretically, turning a huge exponentiation followed by a mod into a modular exponentiation (at least after the first pass) is a transformation a JIT might be able to make… but not PyPy’s JIT.

    As a side note, if you need to do calculations with huge integers, you may want to look at third-party modules like gmpy, which can sometimes be much faster than CPython’s native implementation in some cases outside the mainstream uses, and also has a lot of additional functionality that you’d otherwise have to write yourself, at the cost of being less convenient.
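    A hedged sketch of what that might look like, assuming the gmpy2 package (the maintained successor of gmpy) is installed, e.g. via pip install gmpy2:

    import gmpy2

    a, d, n = 2505626, 1520321, 2700643

    # gmpy2.powmod performs modular exponentiation on GMP integers and
    # should agree with the built-in three-argument pow().
    print(gmpy2.powmod(a, d, n))
    print(pow(a, d, n))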


    回答 2

    进行模幂运算是有捷径的:例如,对于从 1 到 log(d) 的每个 i,您可以先求出 a**(2^i) mod n,再把所需的中间结果相乘(mod n)。像三参数 pow() 这样的专用模幂函数可以利用这类技巧,因为它知道您正在做模运算。Python 解析器无法从裸表达式 a**d % n 中识别出这一点,因此它会执行完整的计算(这会花费长得多的时间)。

    There are shortcuts to doing modular exponentiation: for instance, you can find a**(2^i) mod n for every i from 1 to log(d) and multiply together (mod n) the intermediate results you need. A dedicated modular-exponentiation function like 3-argument pow() can leverage such tricks because it knows you’re doing modular arithmetic. The Python parser can’t recognize this given the bare expression a**d % n, so it will perform the full calculation (which will take much longer).


    回答 3

    x = a**d % n 的计算方式是先把 a 自乘到 d 次幂,再对结果取模 n。首先,如果 a 很大,这会先产生一个巨大的数字,然后才被截断。而 x = pow(a, d, n) 很可能做了优化,只跟踪末尾的若干位数字,而这正是计算对某个数取模的乘法所需的全部。

    The way x = a**d % n is calculated is to raise a to the d power, then modulo that with n. Firstly, if a is large, this creates a huge number which is then truncated. However, x = pow(a, d, n) is most likely optimized so that only the last n digits are tracked, which are all that are required for calculating multiplication modulo a number.