标签归档:apply

我何时应该在代码中使用pandas apply()?

问题:我何时应该在代码中使用pandas apply()?

我已经看到许多有关使用Pandas方法的堆栈溢出问题的答案apply。我还看到用户在他们的下面发表评论,说“ apply缓慢,应避免使用”。

我已经阅读了许多有关性能的文章,这些文章解释apply得很慢。我还在文档中看到了关于免除apply传递UDF的便捷功能的免责声明(现在似乎找不到)。因此,普遍的共识是,apply应尽可能避免。但是,这引起了以下问题:

  1. 如果apply太糟糕了,那为什么在API中呢?
  2. 我应该如何以及何时使代码apply免费?
  3. 在任何情况下apply都有良好的情况(比其他可能的解决方案更好)吗?

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that “apply is slow, and should be avoided”.

I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can’t seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:

  1. If apply is so bad, then why is it in the API?
  2. How and when should I make my code apply-free?
  3. Are there ever any situations where apply is good (better than other possible solutions)?

回答 0

apply,您不需要的便利功能

我们首先在OP中逐一解决问题。

如果应用是如此糟糕,那么为什么要在API中使用它呢?

DataFrame.applySeries.apply是分别在DataFrame和Series对象上定义的便捷函数apply接受任何在DataFrame上应用转换/聚合的用户定义函数。apply实际上是完成任何现有熊猫功能无法完成的灵丹妙药。

一些事情apply可以做:

  • 在DataFrame或Series上运行任何用户定义的函数
  • 在DataFrame上按行(axis=1)或按列()应用函数axis=0
  • 应用功能时执行索引对齐
  • 使用用户定义的函数执行汇总(但是,我们通常更喜欢aggtransform在这种情况下)
  • 执行逐元素转换
  • 将汇总结果广播到原始行(请参阅result_type参数)。
  • 接受位置/关键字参数以传递给用户定义的函数。

…其他 有关更多信息,请参见文档中的行或列函数应用程序

那么,具有所有这些功能,为什么apply不好?这是因为apply 缓慢的。Pandas对功能的性质不做任何假设,因此在必要时将您的功能迭代地应用于每个行/列。此外,处理上述所有情况均意味着apply每次迭代都会产生一些重大开销。此外,apply会消耗更多的内存,这对于内存受限的应用程序是一个挑战。

在极少数情况下,apply适合使用(以下更多内容)。如果不确定是否应该使用apply,则可能不应该使用。


让我们解决下一个问题。

如何当我应该让我的代码申请-免费?

重新说明一下,这是一些常见的情况,在这些情况下您将希望摆脱对的任何调用apply

数值数据

如果您正在使用数字数据,则可能已经有一个矢量化的cython函数可以完全实现您要执行的操作(如果没有,请在Stack Overflow上提问或在GitHub上打开功能请求)。

对比一下apply简单加法运算的性能。

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

在性能方面,没有任何可比的,被cythonized的等效物要快得多。不需要图表,因为即使对于玩具数据,差异也很明显。

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

即使您启用带有raw参数的原始数组传递,它的速度仍然是原来的两倍。

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

另一个例子:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

通常,如果可能寻找向量化的替代方案。

字符串/正则表达式

在大多数情况下,Pandas提供“矢量化”字符串函数,但是在极少数情况下,这些函数不会…“应用”,可以这么说。

一个常见的问题是检查同一行的另一列中是否存在一列中的值。

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

这应该返回第二行和第三行,因为“唐纳德”和“米妮”出现在它们各自的“标题”列中。

使用apply,这将使用

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool

df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

但是,使用列表推导存在更好的解决方案。

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

这里要注意的是apply,由于开销较低,因此迭代例程的运行速度比快。如果您需要处理NaN和无效的dtype,则可以使用自定义函数在此基础上进行构建,然后再使用列表推导中的参数进行调用。

有关何时应该将列表理解视为一个不错的选择的更多信息,请参见我的文章:对于熊猫循环-我何时应该关心?

注意
日期和日期时间操作也具有矢量化版本。因此,例如,您应该更喜欢pd.to_datetime(df['date'])df['date'].apply(pd.to_datetime)

docs上阅读更多内容 。

一个常见的陷阱:列表的爆炸列

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

人们很想使用apply(pd.Series)。就性能而言,这太可怕了。

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

更好的选择是列出该列并将其传递给pd.DataFrame。

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

最后,

有什么情况 apply 是好的吗?

Apply是一项便利功能,因此在某些情况下开销可以忽略不计,可以原谅。它实际上取决于函数被调用多少次。

为系列矢量化的函数,但不是数据帧的函数
如果要对多列应用字符串操作该怎么办?如果要将多列转换为日期时间怎么办?这些函数仅针对系列进行矢量化处理,因此必须将它们应用于要转换/操作的每一列。

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object

这是以下情况的可接受案例apply

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

请注意,这对于stack还是有意义的,或者仅使用显式循环。所有这些选项都比使用稍微快一点apply,但是差异很小,可以原谅。

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

您可以对其他操作(例如字符串操作或转换为类别)进行类似的设置。

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype(category))

伏/秒

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype(category)

等等…

将Series转换为strastypevsapply

这似乎是API的特质。与使用相比,apply用于将Series中的整数转换为字符串的方法具有可比性(有时更快)astype

使用该perfplot库绘制该图。

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

使用浮点数时,我看到的astype速度始终与一样快,或略快于apply。因此,这与测试中的数据是整数类型有关。

GroupBy 链式转换操作

GroupBy.apply到目前为止尚未进行讨论,但是GroupBy.apply它也是一个迭代便利函数,用于处理现有GroupBy函数未处理的任何事情。

一个常见的要求是执行GroupBy,然后执行两个主要操作,例如“滞后的累积量”:

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

您需要在此处进行两个连续的groupby调用:

df.groupby('A').B.cumsum().groupby(df.A).shift()

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

使用apply,您可以将其缩短为一个电话。

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

量化性能非常困难,因为它取决于数据。但是总的来说,apply如果目标是减少groupby通话,这是一个可以接受的解决方案(因为groupby它也很昂贵)。


其他注意事项

除了上述注意事项外,还值得一提的是apply在第一行(或列)上执行两次。这样做是为了确定该功能是否有任何副作用。如果不是,则apply可能能够使用快速路径来评估结果,否则将退回到缓慢的实施方式。

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

GroupBy.apply小于0.25的熊猫版本中也可以看到此行为(已固定为0.25,有关更多信息请参见此处。)

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

“If apply is so bad, then why is it in the API?”

DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series object respectively. apply accepts any user defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument).
  • Accept positional/keyword arguments to pass to the user-defined functions.

…Among others. For more information, see Row or Column-wise Function Application in the documentation.

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory bounded applications.

There are very few situations where apply is appropriate to use (more on that below). If you’re not sure whether you should be using apply, you probably shouldn’t.



Let’s address the next question.

“How and when should I make my code apply-free?”

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

Numeric Data

If you’re working with numeric data, there is likely already a vectorized cython function that does exactly what you’re trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply for a simple addition operation.

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4

<!- ->

df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance wise, there’s no comparison, the cythonized equivalent is much faster. There’s no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it’s still twice as slow.

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, seek out vectorized alternatives if possible.


String/Regex

Pandas provides “vectorized” string functions in most situations, but there are rare cases where those functions do not… “apply”, so to speak.

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name  Value                       Title
0  mickey     20                  wonderland
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

This should return the row second and third row, since “donald” and “minnie” are present in their respective “Title” columns.

Using apply, this would be done using

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool
 
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

<!- ->

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.

For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']), over, say, df['date'].apply(pd.to_datetime).

Read more at the docs.


A Common Pitfall: Exploding Columns of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2

<!- ->

%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Lastly,

“Are there any situations where apply is good?”

Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object
    

This is an admissible case for apply:

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype(category))

v/s

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype(category)

And so on…


Converting Series to str: astype versus apply

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable (and sometimes faster) than using astype.

The graph was plotted using the perfplot library.

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see the astype is consistently as fast as, or slightly faster than apply. So this has to do with the fact that the data in the test is integer type.


GroupBy operations with chained transformations

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy and then two prime operations such as a “lagged cumsum”:

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10

<!- ->

You’d need two successive groupby calls here:

df.groupby('A').B.cumsum().groupby(df.A).shift()
 
0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a a single call.

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).



Other Caveats

Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)


回答 1

并非所有人apply都一样

下图建议何时考虑apply1。绿色意味着高效。红色避免。

其中一些是直观的:pd.Series.apply是Python级的逐行循环,同上是pd.DataFrame.apply逐行(axis=1)。这些的滥用是广泛的。另一篇文章更深入地探讨了它们。流行的解决方案是使用矢量化方法,列表推导(假定数据干净)或有效的工具pd.DataFrame(例如构造函数)(例如避免使用apply(pd.Series))。

如果使用pd.DataFrame.apply逐行方式,则指定raw=True(如果可能)通常是有益的。在这个阶段,numba通常是一个更好的选择。

GroupBy.apply:普遍偏爱

groupby避免重复操作apply会损害性能。GroupBy.apply只要您在自定义函数中使用的方法本身是矢量化的,通常在这里就可以了。有时,没有适用于希望应用的逐组聚合的本地Pandas方法。在这种情况下,对于少数apply具有自定义功能的组可能仍会提供合理的性能。

pd.DataFrame.apply 专栏式:混合袋

pd.DataFrame.apply按列(axis=0)是一个有趣的情况。对于少量的行而不是大量的列,几乎总是很昂贵的。对于相对于列的大量行(更常见的情况),使用以下命令有时可能看到显着的性能改进apply

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms

1有exceptions,但通常很少或很少。几个例子:

  1. df['col'].apply(str)可能略胜一筹df['col'].astype(str)
  2. df.apply(pd.to_datetime)与常规for循环相比,对字符串进行处理无法很好地适应行缩放。

Not all applys are alike

The below chart suggests when to consider apply1. Green means possibly efficient; red avoid.

Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).

If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.

GroupBy.apply: generally favoured

Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.

pd.DataFrame.apply column-wise: a mixed bag

pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it’s almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms

1 There are exceptions, but these are usually marginal or uncommon. A couple of examples:

  1. df['col'].apply(str) may slightly outperform df['col'].astype(str).
  2. df.apply(pd.to_datetime) working on strings doesn’t scale well with rows versus a regular for loop.

回答 2

对于axis=1(即按行函数),则可以使用以下函数代替apply。我想知道为什么这不是pandas行为。(未经复合索引测试,但确实比快得多apply

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

For axis=1 (i.e. row-wise functions) then you can just use the following function in lieu of apply. I wonder why this isn’t the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply)

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f:v for f,v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

回答 3

有没有什么情况apply是好的?是的,有时。

任务:解码Unicode字符串。

import numpy as np
import pandas as pd
import unidecode

s = pd.Series(['mañana','Ceñía'])
s.head()
0    mañana
1     Ceñía


s.apply(unidecode.unidecode)
0    manana
1     Cenia

更新
我绝不是提倡使用apply,只是考虑到NumPy无法解决上述情况,因此它可能是一个很好的选择pandas apply。但是由于@jpp的提醒,我忘记了普通的ol列表理解。

Are there ever any situations where apply is good? Yes, sometimes.

Task: decode Unicode strings.

import numpy as np
import pandas as pd
import unidecode

s = pd.Series(['mañana','Ceñía'])
s.head()
0    mañana
1     Ceñía


s.apply(unidecode.unidecode)
0    manana
1     Cenia

Update
I was by no means advocating for the use of apply, just thinking since the NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol list comprehension thanks to the reminder by @jpp.


从熊猫返回多列apply()

问题:从熊猫返回多列apply()

我有一个熊猫DataFrame ,df_test。它包含一列“大小”,以字节为单位表示大小。我已经使用以下代码计算了KB,MB和GB:

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')

df_test


             dir       size       size_kb   size_mb size_gb
0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB

[2 rows x 5 columns]

我已经运行了超过120,000行,并且根据%timeit,每列花费的时间约为2.97秒* 3 =〜9秒。

无论如何,我可以使它更快吗?例如,我是否可以代替一次套用并运行3次而不是一次返回一列,可以一次返回所有三列以将其插入回原始数据帧吗?

我发现的其他问题都希望采用多个值并返回一个值。我想要一个值并返回多列

I have a pandas DataFrame, df_test. It contains a column ‘size’ which represents size in bytes. I’ve calculated KB, MB, and GB using the following code:

df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')

df_test


             dir       size       size_kb   size_mb size_gb
0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB

[2 rows x 5 columns]

I’ve run this over 120,000 rows and time it takes about 2.97 seconds per column * 3 = ~9 seconds according to %timeit.

Is there anyway I can make this faster? For example, can I instead of returning one column at a time from apply and running it 3 times, can I return all three columns in one pass to insert back into the original dataframe?

The other questions I’ve found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.


回答 0

这是一个古老的问题,但是为了完整起见,您可以从包含新数据的应用函数中返回一个Series,从而避免了需要迭代3次的麻烦。传递axis=1到apply函数sizes会将函数应用于数据框的每一行,并返回一个序列以添加到新的数据框。该系列s包含新值以及原始数据。

def sizes(s):
    s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s

df_test = df_test.append(rows_list)
df_test = df_test.apply(sizes, axis=1)

This is an old question, but for completeness, you can return a Series from the applied function that contains the new data, preventing the need to iterate three times. Passing axis=1 to the apply function applies the function sizes to each row of the dataframe, returning a series to add to a new dataframe. This series, s, contains the new values, as well as the original data.

def sizes(s):
    s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s

df_test = df_test.append(rows_list)
df_test = df_test.apply(sizes, axis=1)

回答 1

使用apply和zip将比Series方式快3倍。

def sizes(s):    
    return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB', \
        locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
        locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))

测试结果为:

Separate df.apply(): 

    100 loops, best of 3: 1.43 ms per loop

Return Series: 

    100 loops, best of 3: 2.61 ms per loop

Return tuple:

    1000 loops, best of 3: 819 µs per loop

Use apply and zip will 3 times fast than Series way.

def sizes(s):    
    return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB', \
        locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
        locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))

Test result are:

Separate df.apply(): 

    100 loops, best of 3: 1.43 ms per loop

Return Series: 

    100 loops, best of 3: 2.61 ms per loop

Return tuple:

    1000 loops, best of 3: 819 µs per loop

回答 2

当前的某些回复效果很好,但我想提供另一个也许更“惊慌”的选择。这适用于我当前的大熊猫0.23(不确定是否可以在以前的版本中使用):

import pandas as pd

df_test = pd.DataFrame([
  {'dir': '/Users/uname1', 'size': 994933},
  {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes(s):
  a = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
  b = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
  c = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
  return a, b, c

df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

注意,技巧在的result_type参数上apply,它将把结果扩展为DataFrame,可以直接分配给新/旧列。

Some of the current replies work fine, but I want to offer another, maybe more “pandifyed” option. This works for me with the current pandas 0.23 (not sure if it will work in previous versions):

import pandas as pd

df_test = pd.DataFrame([
  {'dir': '/Users/uname1', 'size': 994933},
  {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes(s):
  a = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
  b = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
  c = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
  return a, b, c

df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

Notice that the trick is on the result_type parameter of apply, that will expand its result into a DataFrame that can be directly assign to new/old columns.


回答 3

只是另一种可读的方式。这段代码将添加三个新列及其值,并在apply函数中返回不使用任何参数的序列。

def sizes(s):

    val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return pd.Series([val_kb,val_mb,val_gb],index=['size_kb','size_mb','size_gb'])

df[['size_kb','size_mb','size_gb']] = df.apply(lambda x: sizes(x) , axis=1)

来自以下的一般示例:https : //pandas.pydata.org/pandas-docs/stable/genic/pandas.DataFrame.apply.html

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

#foo  bar
#0    1    2
#1    1    2
#2    1    2

Just another readable way. This code will add three new columns and its values, returning series without use parameters in the apply function.

def sizes(s):

    val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return pd.Series([val_kb,val_mb,val_gb],index=['size_kb','size_mb','size_gb'])

df[['size_kb','size_mb','size_gb']] = df.apply(lambda x: sizes(x) , axis=1)

A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

#foo  bar
#0    1    2
#1    1    2
#2    1    2

回答 4

真的很酷的答案!感谢Jesse和jaumebonet!关于以下方面的一些观察:

  • zip(* ...
  • ... result_type="expand")

尽管expand有点更优雅(平庸),但是zip至少快** 2倍。在下面这个简单的示例中,我的速度提高4倍

import pandas as pd

dat = [ [i, 10*i] for i in range(1000)]

df = pd.DataFrame(dat, columns = ["a","b"])

def add_and_sub(row):
    add = row["a"] + row["b"]
    sub = row["a"] - row["b"]
    return add, sub

df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))

Really cool answers! Thanks Jesse and jaumebonet! Just some observation in regards to:

  • zip(* ...
  • ... result_type="expand")

Although expand is kind of more elegant (pandifyed), zip is at least **2x faster. On this simple example bellow, I got 4x faster.

import pandas as pd

dat = [ [i, 10*i] for i in range(1000)]

df = pd.DataFrame(dat, columns = ["a","b"])

def add_and_sub(row):
    add = row["a"] + row["b"]
    sub = row["a"] - row["b"]
    return add, sub

df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))

回答 5

顶部答案之间的性能显著变化,和杰西&famaral42已经讨论过这一点,但它是值得分享的顶部答案之间的公平比较,并制定杰西的回答的一个微妙但重要的细节:在传递给说法功能,也会影响性能

(Python 3.7.4,Pandas 1.0.3)

import pandas as pd
import locale
import timeit


def create_new_df_test():
    df_test = pd.DataFrame([
      {'dir': '/Users/uname1', 'size': 994933},
      {'dir': '/Users/uname2', 'size': 109338711},
    ])
    return df_test


def sizes_pass_series_return_series(series):
    series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return series


def sizes_pass_series_return_tuple(series):
    a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

结果如下:

# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

请注意如何返回元组是最快的方法,但是什么传递作为一个参数,也是影响性能。代码上的差异很细微,但性能却有显着提高。

测试#4(传递一个值)的速度是测试#3(传递一个值)的两倍,即使表面上执行的操作相同。

但是还有更多…

# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

在某些情况下(#1a和#4a),将函数应用到已经存在输出列的DataFrame上比从函数创建它们更快。

这是运行测试的代码:

# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)

print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")

print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))

print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))

The performance between the top answers is significantly varied, and Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers, and elaborating on a subtle but important detail of Jesse’s answer: the argument passed in to the function, also affects performance.

(Python 3.7.4, Pandas 1.0.3)

import pandas as pd
import locale
import timeit


def create_new_df_test():
    df_test = pd.DataFrame([
      {'dir': '/Users/uname1', 'size': 994933},
      {'dir': '/Users/uname2', 'size': 109338711},
    ])
    return df_test


def sizes_pass_series_return_series(series):
    series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return series


def sizes_pass_series_return_tuple(series):
    a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

Here are the results:

# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice how returning tuples is the fastest method, but what is passed in as an argument, also affects the performance. The difference in the code is subtle but the performance improvement is significant.

Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical.

But there’s more…

# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

Here is the code for running the tests:

# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)

print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")

print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))

print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))

回答 6

我相信1.1版会破坏此处最佳答案中建议的行为。

import pandas as pd
def test_func(row):
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

上面的代码在pandas 1.1.0返回上运行:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

在大熊猫1.0.5中返回:

   a   b    c  d
0  1   i   1i  2
1  2   j   2j  3
2  3   k   3k  4

我认为这是您所期望的。

不知道怎么的发行说明解释这种行为,然而由于解释这里通过复制他们复活旧的行为避免了原始行的突变。即:

def test_func(row):
    row = row.copy()   #  <---- Avoid mutating the original reference
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

I believe the 1.1 version breaks the behavior suggested in the top answer here.

import pandas as pd
def test_func(row):
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)

The above code ran on pandas 1.1.0 returns:

   a  b   c  d
0  1  i  1i  2
1  1  i  1i  2
2  1  i  1i  2

While in pandas 1.0.5 it returned:

   a   b    c  d
0  1   i   1i  2
1  2   j   2j  3
2  3   k   3k  4

Which I think is what you’d expect.

Not sure how the release notes explain this behavior, however as explained here avoiding mutation of the original rows by copying them resurrects the old behavior. i.e.:

def test_func(row):
    row = row.copy()   #  <---- Avoid mutating the original reference
    row['c'] = str(row['a']) + str(row['b'])
    row['d'] = row['a'] + 1
    return row

回答 7

通常,要返回多个值,这就是我要做的

def gimmeMultiple(group):
    x1 = 1
    x2 = 2
    return array([[1, 2]])
def gimmeMultipleDf(group):
    x1 = 1
    x2 = 2
    return pd.DataFrame(array([[1,2]]), columns=['x1', 'x2'])
df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)

明确地返回数据帧具有其优势,但有时并非必需。您可以查看apply()收益,并使用这些功能;)

Generally, to return multiple values, this is what I do

def gimmeMultiple(group):
    x1 = 1
    x2 = 2
    return array([[1, 2]])
def gimmeMultipleDf(group):
    x1 = 1
    x2 = 2
    return pd.DataFrame(array([[1,2]]), columns=['x1', 'x2'])
df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)

Returning a dataframe definitively has its perks, but sometimes not required. You can look at what the apply() returns and play a bit with the functions ;)


回答 8

它提供了一个新数据框,其中包含原始列的两列。

import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row:pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),axis = 1)

It gives a new dataframe with two columns from the original one.

import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row:pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),axis = 1)

python pandas:将带有参数的函数应用于一系列

问题:python pandas:将带有参数的函数应用于一系列

我想将带有参数的函数应用于python pandas中的系列:

x = my_series.apply(my_function, more_arguments_1)
y = my_series.apply(my_function, more_arguments_2)
...

文档描述了对apply方法的支持,但不接受任何参数。是否存在接受参数的其他方法?另外,我是否缺少一个简单的解决方法?

更新(2017年10月): 请注意,由于最初询问此问题以来,apply()已对熊猫进行了更新以处理位置和关键字参数,并且上面的文档链接现在反映了这一点,并说明了如何包括这两种类型的参数。

I want to apply a function with arguments to a series in python pandas:

x = my_series.apply(my_function, more_arguments_1)
y = my_series.apply(my_function, more_arguments_2)
...

The documentation describes support for an apply method, but it doesn’t accept any arguments. Is there a different method that accepts arguments? Alternatively, am I missing a simple workaround?

Update (October 2017): Note that since this question was originally asked that pandas apply() has been updated to handle positional and keyword arguments and the documentation link above now reflects that and shows how to include either type of argument.


回答 0

较新版本的pandas 确实允许您传递额外的参数(请参阅新文档)。现在,您可以执行以下操作:

my_series.apply(your_function, args=(2,3,4), extra_kw=1)

位置参数添加系列元素之后


对于旧版本的熊猫:

文档对此进行了清晰的解释。apply方法接受应具有单个参数的python函数。如果要传递更多参数,则应functools.partial按照Joel Cornett在其评论中的建议使用。

一个例子:

>>> import functools
>>> import operator
>>> add_3 = functools.partial(operator.add,3)
>>> add_3(2)
5
>>> add_3(7)
10

您也可以使用传递关键字参数partial

另一种方法是创建一个lambda:

my_series.apply((lambda x: your_func(a,b,c,d,...,x)))

但我认为使用partial会更好。

Newer versions of pandas do allow you to pass extra arguments (see the new documentation). So now you can do:

my_series.apply(your_function, args=(2,3,4), extra_kw=1)

The positional arguments are added after the element of the series.


For older version of pandas:

The documentation explains this clearly. The apply method accepts a python function which should have a single parameter. If you want to pass more parameters you should use functools.partial as suggested by Joel Cornett in his comment.

An example:

>>> import functools
>>> import operator
>>> add_3 = functools.partial(operator.add,3)
>>> add_3(2)
5
>>> add_3(7)
10

You can also pass keyword arguments using partial.

Another way would be to create a lambda:

my_series.apply((lambda x: your_func(a,b,c,d,...,x)))

But I think using partial is better.


回答 1

脚步:

  1. 创建一个数据框
  2. 创建一个功能
  3. 在apply语句中使用函数的命名参数。

x=pd.DataFrame([1,2,3,4])  

def add(i1, i2):  
    return i1+i2

x.apply(add,i2=9)

此示例的结果是,数据框中的每个数字都将添加到数字9中。

    0
0  10
1  11
2  12
3  13

说明:

“添加”功能具有两个参数:i1,i2。第一个参数将是数据帧中的值,第二个参数是我们传递给“ apply”函数的值。在这种情况下,我们使用关键字参数“ i2”将“ 9”传递给apply函数。

Steps:

  1. Create a dataframe
  2. Create a function
  3. Use the named arguments of the function in the apply statement.

Example

x=pd.DataFrame([1,2,3,4])  

def add(i1, i2):  
    return i1+i2

x.apply(add,i2=9)

The outcome of this example is that each number in the dataframe will be added to the number 9.

    0
0  10
1  11
2  12
3  13

Explanation:

The “add” function has two parameters: i1, i2. The first parameter is going to be the value in data frame and the second is whatever we pass to the “apply” function. In this case, we are passing “9” to the apply function using the keyword argument “i2”.


回答 2

Series.apply(func, convert_dtype=True, args=(), **kwds)

args : tuple

x = my_series.apply(my_function, args = (arg1,))
Series.apply(func, convert_dtype=True, args=(), **kwds)

args : tuple

x = my_series.apply(my_function, args = (arg1,))

回答 3

您可以将任何数量的参数传递给apply正在通过未命名参数传递,作为元组传递给args参数或通过内部由关键字捕获为字典的其他关键字参数传递给函数的kwds函数。

例如,让我们构建一个函数,该函数对于3到6之间的值返回True,否则返回False。

s = pd.Series(np.random.randint(0,10, 10))
s

0    5
1    3
2    1
3    1
4    6
5    0
6    3
7    4
8    9
9    6
dtype: int64

s.apply(lambda x: x >= 3 and x <= 6)

0     True
1     True
2    False
3    False
4     True
5    False
6     True
7     True
8    False
9     True
dtype: bool

这个匿名函数不是很灵活。让我们创建一个带有两个参数的普通函数,以控制我们在系列中所需的最小值和最大值。

def between(x, low, high):
    return x >= low and x =< high

我们可以通过将未命名的参数传递给来复制第一个函数的输出args

s.apply(between, args=(3,6))

或者我们可以使用命名参数

s.apply(between, low=3, high=6)

或两者兼而有之

s.apply(between, args=(3,), high=6)

You can pass any number of arguments to the function that apply is calling through either unnamed arguments, passed as a tuple to the args parameter, or through other keyword arguments internally captured as a dictionary by the kwds parameter.

For instance, let’s build a function that returns True for values between 3 and 6, and False otherwise.

s = pd.Series(np.random.randint(0,10, 10))
s

0    5
1    3
2    1
3    1
4    6
5    0
6    3
7    4
8    9
9    6
dtype: int64

s.apply(lambda x: x >= 3 and x <= 6)

0     True
1     True
2    False
3    False
4     True
5    False
6     True
7     True
8    False
9     True
dtype: bool

This anonymous function isn’t very flexible. Let’s create a normal function with two arguments to control the min and max values we want in our Series.

def between(x, low, high):
    return x >= low and x =< high

We can replicate the output of the first function by passing unnamed arguments to args:

s.apply(between, args=(3,6))

Or we can use the named arguments

s.apply(between, low=3, high=6)

Or even a combination of both

s.apply(between, args=(3,), high=6)

为什么我的Pandas的“应用”功能不能引用多个列?[关闭]

问题:为什么我的Pandas的“应用”功能不能引用多个列?[关闭]

当将多个列与以下数据框一起使用时,Pandas Apply函数存在一些问题

df = DataFrame ({'a' : np.random.randn(6),
                 'b' : ['foo', 'bar'] * 3,
                 'c' : np.random.randn(6)})

和以下功能

def my_test(a, b):
    return a % b

当我尝试使用以下功能时:

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

我收到错误消息:

NameError: ("global name 'a' is not defined", u'occurred at index 0')

我不明白此消息,我正确定义了名称。

非常感谢您对此问题的帮助

更新资料

谢谢你的帮助。我确实在代码中犯了一些语法错误,索引应该放在”。但是,使用更复杂的功能仍然会遇到相同的问题,例如:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 

I have some problems with the Pandas apply function, when using multiple columns with the following dataframe

df = DataFrame ({'a' : np.random.randn(6),
                 'b' : ['foo', 'bar'] * 3,
                 'c' : np.random.randn(6)})

and the following function

def my_test(a, b):
    return a % b

When I try to apply this function with :

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

I get the error message:

NameError: ("global name 'a' is not defined", u'occurred at index 0')

I do not understand this message, I defined the name properly.

I would highly appreciate any help on this issue

Update

Thanks for your help. I made indeed some syntax mistakes with the code, the index should be put ”. However I still get the same issue using a more complex function such as:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 

回答 0

好像您忘记了''字符串。

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
Out[44]:
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417

在我看来,顺便说一句,以下方式更为优雅:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']
....:     

In [54]: df['Value'] = df.apply(my_test2, axis=1)

Seems you forgot the '' of your string.

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
Out[44]:
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417

BTW, in my opinion, following way is more elegant:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']
....:     

In [54]: df['Value'] = df.apply(my_test2, axis=1)

回答 1

如果您只想计算(a栏)%(b栏),则不需要apply,只需直接执行:

In [7]: df['a'] % df['c']                                                                                                                                                        
Out[7]: 
0   -1.132022                                                                                                                                                                    
1   -0.939493                                                                                                                                                                    
2    0.201931                                                                                                                                                                    
3    0.511374                                                                                                                                                                    
4   -0.694647                                                                                                                                                                    
5   -0.023486                                                                                                                                                                    
Name: a

If you just want to compute (column a) % (column b), you don’t need apply, just do it directly:

In [7]: df['a'] % df['c']                                                                                                                                                        
Out[7]: 
0   -1.132022                                                                                                                                                                    
1   -0.939493                                                                                                                                                                    
2    0.201931                                                                                                                                                                    
3    0.511374                                                                                                                                                                    
4   -0.694647                                                                                                                                                                    
5   -0.023486                                                                                                                                                                    
Name: a

回答 2

假设我们要对DataFrame df的列“ a”和“ b”应用add5函数

def add5(x):
    return x+5

df[['a', 'b']].apply(add5)

Let’s say we want to apply a function add5 to columns ‘a’ and ‘b’ of DataFrame df

def add5(x):
    return x+5

df[['a', 'b']].apply(add5)

回答 3

以上所有建议均有效,但如果您希望提高计算效率,则应利用numpy向量运算(如此处所述)

import pandas as pd
import numpy as np


df = pd.DataFrame ({'a' : np.random.randn(6),
             'b' : ['foo', 'bar'] * 3,
             'c' : np.random.randn(6)})

示例1:循环pandas.apply()

%%timeit
def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

最慢的运行时间比最快的运行时间长7.49倍。这可能意味着正在缓存中间结果。1000个循环,最佳3:每个循环481 µs

示例2:使用进行矢量化pandas.apply()

%%timeit
df['a'] % df['c']

最慢的运行时间比最快的运行时间长458.85倍。这可能意味着正在缓存中间结果。10000次循环,最好为3次:每个循环70.9 µs

示例3:使用numpy数组进行向量化:

%%timeit
df['a'].values % df['c'].values

最慢的运行时间比最快的运行时间长7.98倍。这可能意味着正在缓存中间结果。100000次循环,每循环3:6.39 µs最佳

因此,使用numpy数组进行向量化将速度提高了近两个数量级。

All of the suggestions above work, but if you want your computations to by more efficient, you should take advantage of numpy vector operations (as pointed out here).

import pandas as pd
import numpy as np


df = pd.DataFrame ({'a' : np.random.randn(6),
             'b' : ['foo', 'bar'] * 3,
             'c' : np.random.randn(6)})

Example 1: looping with pandas.apply():

%%timeit
def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop

Example 2: vectorize using pandas.apply():

%%timeit
df['a'] % df['c']

The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop

Example 3: vectorize using numpy arrays:

%%timeit
df['a'].values % df['c'].values

The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop

So vectorizing using numpy arrays improved the speed by almost two orders of magnitude.


回答 4

这与先前的解决方案相同,但是我已经在df.apply本身中定义了该函数:

df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

This is same as the previous solution but I have defined the function in df.apply itself:

df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

回答 5

我已经比较了上面讨论的所有三个。

使用值

%timeit df ['value'] = df ['a']。values%df ['c']。values

每个回路139 µs±1.91 µs(平均±标准偏差,共运行7次,每个回路10000个)

没有价值

%timeit df ['value'] = df ['a']%df ['c'] 

每个循环216 µs±1.86 µs(平均±标准偏差,共运行7次,每个循环1000个)

套用功能

%timeit df ['Value'] = df.apply(lambda row:row ['a']%row ['c'],axis = 1)

每个回路474 µs±5.07 µs(平均±标准偏差,共运行7次,每个回路1000个)

I have given the comparison of all three discussed above.

Using values

%timeit df['value'] = df['a'].values % df['c'].values

139 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without values

%timeit df['value'] = df['a']%df['c'] 

216 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Apply function

%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

474 µs ± 5.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


熊猫根据其他列的值创建新列/逐行应用多列的功能

问题:熊猫根据其他列的值创建新列/逐行应用多列的功能

我想申请我的自定义函数(它使用的if-else梯)这六个列(ERI_HispanicERI_AmerInd_AKNatvERI_AsianERI_Black_Afr.AmerERI_HI_PacIslERI_White我的数据帧的每一行中)。

我尝试了与其他问题不同的方法,但似乎仍然找不到适合我问题的正确答案。关键在于,如果该人被视为西班牙裔,就不能被视为其他任何人。即使他们在另一个种族栏中的得分为“ 1”,他们仍然被视为西班牙裔,而不是两个或两个以上的种族。同样,如果所有ERI列的总和大于1,则将它们计为两个或多个种族,并且不能计为唯一的种族(西班牙裔除外)。希望这是有道理的。任何帮助将不胜感激。

这几乎就像在每行中进行一个for循环一样,如果每条记录都符合条件,则将它们添加到一个列表中并从原始列表中删除。

从下面的数据框中,我需要根据以下SQL规范来计算新列:

===================================================== =======

IF [ERI_Hispanic] = 1 THEN RETURN Hispanic
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN Two or More
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN A/I AK Native
ELSE IF [ERI_Asian] = 1 THEN RETURN Asian
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN Black/AA
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN White

评论:如果西班牙裔ERI标志为True(1),则该雇员被分类为“西班牙裔”

注释:如果多个非西班牙裔ERI标志为真,则返回“两个或更多”

======================数据帧===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I’ve tried different methods from other questions but still can’t seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can’t be counted as anything else. Even if they have a “1” in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can’t be counted as a unique ethnicity(except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

Its almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

========================= CRITERIA ===============================

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

====================== DATAFRAME ===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

回答 0

好的,执行此步骤有两个步骤-首先是编写一个可以执行所需翻译的函数-我已根据您的伪代码将一个示例放在一起:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

您可能想要解决这个问题,但这似乎可以解决问题-请注意,进入函数的参数被视为标记为“行”的Series对象。

接下来,在熊猫中使用apply函数来应用该函数-例如

df.apply (lambda row: label_race(row), axis=1)

请注意axis = 1说明符,这意味着应用程序是在行而不是列级别完成的。结果在这里:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

如果您对这些结果感到满意,请再次运行它,将结果保存到原始数据框中的新列中。

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

结果数据框如下所示(向右滚动以查看新列):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

OK, two steps to this – first is to write a function that does the translation you want – I’ve put an example together based on your pseudo-code:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

You may want to go over this, but it seems to do the trick – notice that the parameter going into the function is considered to be a Series object labelled “row”.

Next, use the apply function in pandas to apply the function – e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you’re happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

回答 1

由于这是Google针对“来自其他人的熊猫专栏”的第一个结果,因此下面是一个简单的示例:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

如果得到了,SettingWithCopyWarning您也可以通过以下方式进行操作:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

资料来源:https : //stackoverflow.com/a/12555510/243392

如果列名包含空格,则可以使用如下语法:

df = df.assign(**{'some column name': col.values})

这是applyAssign的文档。

Since this is the first Google result for ‘pandas new column from others’, here’s a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here’s the documentation for apply, and assign.


回答 2

上面的答案是完全正确的,但是存在矢量化的解决方案,形式为numpy.select。这使您可以定义条件,然后为这些条件定义输出,这比使用apply以下命令更有效:


首先,定义条件:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

现在,定义相应的输出:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

最后,使用numpy.select

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

为什么要numpy.select用完apply?以下是一些性能检查:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

使用numpy.select给了我们极大地提高了性能,并且随着数据增长的差异只会增加。

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.


回答 3

.apply()以函数作为第一个参数;这样传递label_race函数:

df['race_label'] = df.apply(label_race, axis=1)

您无需使一个lambda函数即可传递函数。

.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1)

You don’t need to make a lambda function to pass in a function.


回答 4

尝试这个,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O / P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

使用.loc代替apply

它改善了向量化。

.loc 以简单的方式工作,根据条件屏蔽行,将值应用于冻结行。

有关更多详细信息,请访问.loc docs

性能指标:

接受的答案:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

每个循环1.15 s±46.5 ms(平均±标准偏差,共7次运行,每个循环1次)

我的建议答案:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

每个循环24.7 ms±1.7 ms(平均±标准偏差,运行7次,每个循环10个)

try this,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

use .loc instead of apply.

it improves vectorization.

.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.

for more details visit, .loc docs

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)