Tag Archives: vectorization

Efficient evaluation of a function at every cell of a NumPy array

Question: Efficient evaluation of a function at every cell of a NumPy array

Given a NumPy array A, what is the fastest/most efficient way to apply the same function, f, to every cell?

  1. Suppose that we will assign f(A(i,j)) to A(i,j).

  2. The function, f, doesn’t have a binary output, thus the mask(ing) operations won’t help.

Is the “obvious” double loop iteration (through every cell) the optimal solution?
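For reference, the “obvious” double loop might look like the sketch below (f here is an arbitrary stand-in for the real scalar function):

import numpy as np

def f(x):
    # arbitrary stand-in for the real scalar function
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8

A = np.random.rand(4, 5)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i, j] = f(A[i, j])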


Answer 0

You could just vectorize the function and then apply it directly to a Numpy array each time you need it:

import numpy as np

def f(x):
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8

f = np.vectorize(f)  # or use a different name if you want to keep the original f

result_array = f(A)  # if A is your Numpy array

It’s probably better to specify an explicit output type directly when vectorizing:

f = np.vectorize(f, otypes=[float])  # np.float was removed from NumPy; plain float (or np.float64) works

Answer 1

A similar question is: Mapping a NumPy array in place. If you can find a ufunc for your f(), then you should use the out parameter.
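A minimal sketch of that idea, assuming f can be expressed through existing ufuncs (here f(x) = x*x + 1 is an arbitrary example):

import numpy as np

A = np.random.rand(1000, 1000)

# compute f(A) = A*A + 1 entirely in place via the out parameter,
# so no intermediate arrays are allocated
np.multiply(A, A, out=A)
np.add(A, 1.0, out=A)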


Answer 2

If you are working with numbers and f(A(i,j)) = f(A(j,i)), you could use scipy.spatial.distance.cdist defining f as a distance between A(i) and A(j).
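A sketch of that suggestion, assuming f is symmetric in its two arguments; cdist accepts a Python callable as the metric, and here it receives 1-element arrays:

import numpy as np
from scipy.spatial.distance import cdist

a = np.array([1.0, 2.0, 3.0])

def f(u, v):
    # arbitrary symmetric example playing the role of a "distance"
    return (u[0] - v[0]) ** 2

result = cdist(a[:, None], a[:, None], metric=f)  # result[i, j] == f(a[i], a[j])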


Answer 3

I believe I have found a better solution. The idea is to change the function into a Python universal function (see the documentation), which can exercise parallel computation under the hood.

One can write a custom ufunc in C, which is surely more efficient, or create one by invoking np.frompyfunc, which is a built-in factory method. After testing, this is more efficient than np.vectorize:

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)

%timeit f_arr(arr, arr)  # 307 ms
%timeit vf(arr, arr)     # 450 ms

I have also tested larger samples, and the improvement is proportional. For a performance comparison of other methods, see this post.


Answer 4

When the 2d-array (or nd-array) is C- or F-contiguous, then this task of mapping a function onto a 2d-array is practically the same as the task of mapping a function onto a 1d-array – we just have to view it that way, e.g. via np.ravel(A,'K').

Possible solutions for 1d-arrays have been discussed, for example, here.

However, when the memory of the 2d-array isn’t contiguous, the situation is a little more complicated, because one would like to avoid the cache misses that occur if the axes are handled in the wrong order.
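A quick illustration of that distinction (illustrative, not from the original answer):

import numpy as np

A = np.arange(16.0).reshape(4, 4)
B = A[::2, ::2]                       # strided view: memory is not contiguous

print(A.flags['C_CONTIGUOUS'])        # True  -> np.ravel(A, 'K') is a cheap view
print(B.flags['C_CONTIGUOUS'])        # False
print(np.ravel(A, 'K').base is A)     # True: no copy was needed
print(np.ravel(B, 'K').base is None)  # True: ravel had to copy the data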

Numpy already has machinery in place to process the axes in the best possible order. One possibility to use this machinery is np.vectorize. However, numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance” – a slow python function stays a slow python function with all the associated overhead! Another issue is its huge memory consumption – see for example this SO-post.

When one wants the performance of a C function while still using numpy’s machinery, a good solution is to use numba to create ufuncs, for example:

# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

It easily beats np.vectorize, and it also beats the version where the same function is performed as numpy-array multiplication/addition, i.e.

# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

See the appendix of this answer for the time-measurement code:

Numba’s version (green) is about 100 times faster than the python-function (i.e. np.vectorize), which is not surprising. But it is also about 10 times faster than the numpy-functionality, because numba’s version doesn’t need intermediate arrays and thus uses the cache more efficiently.


While numba’s ufunc approach is a good trade-off between usability and performance, it is still not the best we can do. Yet there is no silver bullet or single approach that is best for every task – one has to understand what the limitations are and how they can be mitigated.

For example, for transcendental functions (e.g. exp, sin, cos), numba doesn’t provide any advantage over numpy’s np.exp (there are no temporary arrays created – the main source of the speed-up). However, my Anaconda installation utilizes Intel’s VML for vectors bigger than 8192 elements – and it just cannot do that if the memory is not contiguous. So it might be better to copy the elements to contiguous memory in order to be able to use Intel’s VML:

import numba as nb
@nb.vectorize(target="cpu")
def nb_vexp(x):
    return np.exp(x)

def np_copy_exp(x):
    copy = np.ravel(x, 'K')
    return np.exp(copy).reshape(x.shape) 

For fairness of comparison, I have switched off VML’s parallelization (see the code in the appendix):

As one can see, once VML kicks in, the overhead of copying is more than compensated. Yet once the data becomes too big for the L3 cache, the advantage is minimal, as the task once again becomes memory-bandwidth-bound.

On the other hand, numba could use Intel’s SVML as well, as explained in this post:

from llvmlite import binding
# set before import
binding.set_option('SVML', '-vector-library=SVML')

import numba as nb

@nb.vectorize(target="cpu")
def nb_vexp_svml(x):
    return np.exp(x)

and using VML with parallelization yields:

numba’s version has less overhead, but for some sizes VML beats SVML even despite the additional copying overhead – which isn’t a big surprise, as numba’s ufuncs aren’t parallelized.


Listings:

A. comparison of polynomial function:

import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n,n)[::2,::2],
    n_range=[2**k for k in range(0,12)],
    kernels=[
        f,
        vf, 
        nb_vf
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    ) 

B. comparison of exp:

import perfplot
import numexpr as ne # using ne is the easiest way to set vml_num_threads
ne.set_vml_num_threads(1)
perfplot.show(
    setup=lambda n: np.random.rand(n,n)[::2,::2],
    n_range=[2**k for k in range(0,12)],
    kernels=[
        nb_vexp, 
        np.exp,
        np_copy_exp,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)',
    )

Answer 5

All of the above answers compare well, but what if you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array?

I have compared just two approaches, but both retain the shape of the ndarray. I have used an array with 1 million entries for the comparison. Here I use the square function. I am presenting the general case for an n-dimensional array; for two dimensions, just iterate over the 2D array in the same way.

import numpy, time

def A(e):
    return e * e

def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([A(x) for x in y.reshape(-1)]).reshape(y.shape)        
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((A(x) for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)  
    print(time.time() - now)

Output

>>> timeit()
1.162431240081787    # list comprehension and then building numpy array
1.0775556564331055   # from numpy.fromiter
0.002948284149169922 # using inbuilt function

Here you can clearly see that numpy.fromiter works with a user-defined square function – use any function of your choice. If your function depends on i, j, that is, the indices of the array, iterate over the size of the array with for ind in range(arr.size) and use numpy.unravel_index to get i, j, … based on your 1D index and the shape of the array.
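A small sketch of that index-dependent case, with g as a hypothetical function of both the indices and the value:

import numpy

arr = numpy.arange(12).reshape(3, 4)

def g(i, j, value):
    # hypothetical function that depends on the cell's indices as well
    return value + 10 * i + j

out = numpy.fromiter(
    (g(*numpy.unravel_index(ind, arr.shape), arr.flat[ind])
     for ind in range(arr.size)),
    arr.dtype,
).reshape(arr.shape)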

This answer is inspired by my answer to another question here.


Are for-loops in pandas really bad? When should I care?

Question: Are for-loops in pandas really bad? When should I care?

Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach?1

I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.

However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?


1 – While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.


Answer 0

TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (…depending on what you’re doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let’s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure working

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster than you think. What’s even better is that list comprehensions (which create lists through for loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code is dependent on index alignment, this will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced here: Counter wins out over both vectorized methods for a larger range of small N (up to ~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the takeaway here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there is always cython and numba. Let’s add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension

Positional List Indexing
Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): map, the str.get accessor method, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends heavily on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just query the group or groups attribute of the match object.

For str.extractall, change p.search to p.findall.
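A hedged sketch of that substitution, reusing p and df from above and falling back to NaN when nothing matches:

def matcher_all(x):
    # findall returns every match per string; empty list -> NaN
    return p.findall(str(x)) or np.nan

df['col2'] = [matcher_all(x) for x in df['col']]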

String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits preceded by an upper case letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.


Conclusion

As shown in the examples above, iteration shines when working with small DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

Answer 1

In short

  • for loop + iterrows is extremely slow. Overhead isn’t significant on ~1k rows, but noticeable on 10k+ rows.
  • for loop + itertuples is much faster than iterrows or apply.
  • vectorization is usually much faster than itertuples.

Benchmark
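Since the benchmark code itself is not reproduced here, a minimal sketch of how such a comparison could be set up (illustrative function and data, not the original benchmark):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(10_000), 'b': np.random.rand(10_000)})

def loop_iterrows(df):
    return sum(row['a'] + row['b'] for _, row in df.iterrows())

def loop_itertuples(df):
    return sum(row.a + row.b for row in df.itertuples())

def vectorized(df):
    return (df['a'] + df['b']).sum()

# in IPython:
# %timeit loop_iterrows(df)    # slowest by far
# %timeit loop_itertuples(df)  # much faster
# %timeit vectorized(df)       # fastest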


Difference between map, applymap and apply methods in Pandas

Question: Difference between map, applymap and apply methods in Pandas

Can you tell me when to use these vectorization methods with basic examples?

I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!


Answer 0

Straight from Wes McKinney’s Python for Data Analysis book, pg. 132 (I highly recommend this book):

Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]: 
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]: 
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
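For instance, the built-in reductions give the same results here without apply:

frame.sum()             # column-wise, same result as frame.apply(np.sum)
frame.mean(axis=1)      # row-wise, same result as frame.apply(np.mean, axis=1)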

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]: 
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]: 
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object

Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.


Answer 1

Comparing map, applymap and apply: Context Matters

First major difference: DEFINITION

  • map is defined on Series ONLY
  • applymap is defined on DataFrames ONLY
  • apply is defined on BOTH

Second major difference: INPUT ARGUMENT

  • map accepts dicts, Series, or callable
  • applymap and apply accept callables only

Third major difference: BEHAVIOR

  • map is elementwise for Series
  • applymap is elementwise for DataFrames
  • apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.

Fourth major difference (the most important one): USE CASE

  • map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
  • applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
  • apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))
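A runnable sketch of these three use cases on toy data (illustrative, not from the original answer):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [' x ', ' y ', ' z '],
                   'C': ['one. two.', 'three.', 'four. five.']})

print(df['A'].map({1: 'a', 2: 'b', 3: 'c'}))   # map: dict-based value translation
print(df[['B']].applymap(str.strip))           # applymap: elementwise across columns
print(df['C'].apply(lambda s: s.split('. ')))  # apply: a function that is hard to vectorise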

Summarising

Footnotes

  1. map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
  2. applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.

  3. map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.

  4. Series.apply returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.

Answer 2

There’s great information in these answers, but I’m adding my own to clearly summarize which methods work array-wise versus element-wise. jeremiahbuddha mostly did this but did not mention Series.apply. I don’t have the rep to comment.

  • DataFrame.apply operates on entire rows or columns at a time.

  • DataFrame.applymap, Series.apply, and Series.map operate on one element at a time.

There is a lot of overlap between the capabilities of Series.apply and Series.map, meaning that either one will work in most cases. They do have some slight differences though, some of which were discussed in osa’s answer.
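One way to see this difference is to inspect what each method hands to your function (a small illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print(df.apply(lambda x: type(x).__name__))     # 'Series' per column: array-wise
print(df.applymap(lambda x: type(x).__name__))  # a scalar per element
print(df['a'].map(lambda x: type(x).__name__))  # a scalar per element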


Answer 3

Adding to the other answers, in a Series there are also map and apply.

Apply can make a DataFrame out of a series; however, map will just put a series in every cell of another series, which is probably not what you want.

In [40]: p=pd.Series([1,2,3])
In [41]: p
Out[31]:
0    1
1    2
2    3
dtype: int64

In [42]: p.apply(lambda x: pd.Series([x, x]))
Out[42]: 
   0  1
0  1  1
1  2  2
2  3  3

In [43]: p.map(lambda x: pd.Series([x, x]))
Out[43]: 
0    0    1
1    1
dtype: int64
1    0    2
1    2
dtype: int64
2    0    3
1    3
dtype: int64
dtype: object

Also if I had a function with side effects, such as “connect to a web server”, I’d probably use apply just for the sake of clarity.

series.apply(download_file_for_every_element) 

Map can use not only a function, but also a dictionary or another series. Let’s say you want to manipulate permutations.

Take

1 2 3 4 5
2 1 4 5 3

The square of this permutation is

1 2 3 4 5
1 2 5 3 4

You can compute it using map. Not sure if self-application is documented, but it works in 0.15.1.

In [39]: p=pd.Series([1,0,3,4,2])

In [40]: p.map(p)
Out[40]: 
0    0
1    1
2    4
3    2
4    3
dtype: int64

Answer 4

@jeremiahbuddha mentioned that apply works on rows/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation:

    frame.apply(np.sqrt)
    Out[102]: 
                   b         d         e
    Utah         NaN  1.435159       NaN
    Ohio    1.098164  0.510594  0.729748
    Texas        NaN  0.456436  0.697337
    Oregon  0.359079       NaN       NaN

    frame.applymap(np.sqrt)
    Out[103]: 
                   b         d         e
    Utah         NaN  1.435159       NaN
    Ohio    1.098164  0.510594  0.729748
    Texas        NaN  0.456436  0.697337
    Oregon  0.359079       NaN       NaN

Answer 5

Just wanted to point out, as I struggled with this for a bit

def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x

df.applymap(f)
df.describe()

this does not modify the dataframe itself; it has to be reassigned:

df = df.applymap(f)
df.describe()

Answer 6

Probably the simplest explanation of the difference between apply and applymap:

apply takes the whole column as a parameter and then assigns the result to this column

applymap takes the separate cell value as a parameter and assigns the result back to this cell.

NB: If apply returns a single value, you will have this value instead of the column after assigning, and will eventually have just a row instead of a matrix.
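A small illustration of that note on toy data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

print(df.applymap(lambda x: x + 1))  # still a 2x2 DataFrame
print(df.apply(np.sum))              # one value per column: a single row (a Series)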


Answer 7

My understanding:

From the function point of view:

If the function has variables that need to be compared within a column/row, use apply.

e.g.: lambda x: x.max()-x.mean().

If the function is to be applied to each element:

1> If a column/row is located, use apply

2> If apply to entire dataframe, use applymap

majority = lambda x : x > 17
df2['legal_drinker'] = df2['age'].apply(majority)

def times10(x):
  if type(x) is int:
    x *= 10 
  return x
df2.applymap(times10)

Answer 8

Based on the answer of cs95:

  • map is defined on Series ONLY
  • applymap is defined on DataFrames ONLY
  • apply is defined on BOTH

here are some examples:

In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [4]: frame
Out[4]:
            b         d         e
Utah    0.129885 -0.475957 -0.207679
Ohio   -2.978331 -1.015918  0.784675
Texas  -0.256689 -0.226366  2.262588
Oregon  2.605526  1.139105 -0.927518

In [5]: myformat=lambda x: f'{x:.2f}'

In [6]: frame.d.map(myformat)
Out[6]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [7]: frame.d.apply(myformat)
Out[7]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [8]: frame.applymap(myformat)
Out[8]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93

In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93


In [10]: myfunc=lambda x: x**2

In [11]: frame.applymap(myfunc)
Out[11]:
            b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

In [12]: frame.apply(myfunc)
Out[12]:
            b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

Answer 9

FOMO:

The following example shows apply and applymap applied to a DataFrame.

The map function is something you apply on a Series only. You cannot apply map on a DataFrame.

The thing to remember is that apply can do anything applymap can, but apply has eXtra options.

The X factor options are: axis and result_type where result_type only works when axis=1 (for columns).

import numpy as np
import pandas as pd

df = pd.DataFrame(1, columns=list('abc'),
                  index=list('1234'))
print(df)

f = lambda x: np.log(x)
print(df.applymap(f)) # apply to the whole dataframe
print(np.log(df)) # applied to the whole dataframe
print(df.applymap(np.sum)) # reducing can be applied for rows only

# apply can take different options (vs. applymap cannot)
print(df.apply(f)) # same as applymap
print(df.apply(sum, axis=1))  # reducing example
print(df.apply(np.log, axis=1)) # cannot reduce
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')) # expand result

As a side note, the Series map function should not be confused with the Python map function.

The first one is applied on Series, to map the values, and the second one to every item of an iterable.
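For example (illustrative):

import pandas as pd

ser = pd.Series([1, 2, 3])

print(ser.map(lambda x: x * 10))         # pandas: returns a new Series
print(list(map(lambda x: x * 10, ser)))  # builtin: iterates over the items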


Lastly, don’t confuse the DataFrame apply method with the groupby apply method.