

I have some problems with the Pandas apply function when using multiple columns with the following dataframe:

df = DataFrame ({'a' : np.random.randn(6),
                 'b' : ['foo', 'bar'] * 3,
                 'c' : np.random.randn(6)})

and the following function:

def my_test(a, b):
    return a % b

When I try to apply this function with:

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

I get the error message:

NameError: ("global name 'a' is not defined", u'occurred at index 0')

I do not understand this message; as far as I can tell, I defined the name properly.

I would highly appreciate any help on this issue.


Thanks for your help. I did indeed make some syntax mistakes in the code: the column labels should be put in quotes (''). However, I still get the same issue using a more complex function such as:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 
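
The follow-up does not quote the exact error, so the following is only a sketch under one assumption: that the failure comes from `df.index()`, since `index` is an attribute of the DataFrame and not a method. The extra `frame` parameter is an illustrative addition (not part of the original function) so that the function does not rely on a global `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})


def my_test(a, frame):
    # Sum of (a - value) over every value in column 'a' of `frame`.
    cum_diff = 0
    for ix in frame.index:  # `index` is an attribute: no parentheses
        cum_diff = cum_diff + (a - frame['a'][ix])
    return cum_diff


# Apply the function to each element of column 'a'.
df['Value'] = df['a'].apply(lambda a: my_test(a, df))

# The same quantity without an explicit loop: n * a - sum(a).
vectorized = len(df) * df['a'] - df['a'].sum()
print(np.allclose(df['Value'], vectorized))
```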

Answer 0


It seems you forgot the quotes ('') around your column names, which are strings.

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417

BTW, in my opinion, the following way is more elegant:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']

In [54]: df['Value'] = df.apply(my_test2, axis=1)
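
To make the difference concrete, here is a small self-contained sketch that runs outside IPython. It assumes no variables named `a` or `c` exist in the session, which is what triggers the NameError in the question; the `error_name` variable is only there for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})


def my_test(a, b):
    return a % b


# Unquoted, `a` and `c` are looked up as Python variables, which do not exist.
error_name = None
try:
    df.apply(lambda row: my_test(row[a], row[c]), axis=1)
except NameError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# Quoted, 'a' and 'c' are column labels looked up in each row.
df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
print(df)
```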

Answer 1


If you just want to compute (column a) % (column c), you don't need apply; just do it directly:

In [7]: df['a'] % df['c']                                                                                                                                                        
0   -1.132022                                                                                                                                                                    
1   -0.939493                                                                                                                                                                    
2    0.201931                                                                                                                                                                    
3    0.511374                                                                                                                                                                    
4   -0.694647                                                                                                                                                                    
5   -0.023486                                                                                                                                                                    
Name: a
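
As a sketch of how the direct form could be used to fill the new column from the question, assuming the same three-column layout (the seeded generator is only there to make the run repeatable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the run is repeatable
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

# Element-wise modulo of two columns; rows are matched by index label.
df['Value'] = df['a'] % df['c']

# Series.mod is the method form of the same operation.
same = df['a'].mod(df['c'])
print(df)
```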

Answer 2

Let's say we want to apply a function add5 to columns 'a' and 'b' of DataFrame df:

def add5(x):
    return x+5

df[['a', 'b']].apply(add5)
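
Note that in the question's DataFrame, column 'b' holds strings, so adding 5 to it would be expected to fail. The sketch below therefore selects the two numeric columns 'a' and 'c' instead; the small fixed values are made up for illustration.

```python
import pandas as pd

# Illustrative data with the same column layout as the question.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': ['foo', 'bar', 'foo'],
                   'c': [10.0, 20.0, 30.0]})


def add5(x):
    return x + 5


# With the default axis=0, apply passes each selected column to add5
# as a Series, so the addition is done one whole column at a time.
result = df[['a', 'c']].apply(add5)
print(result)
```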

Answer 3


All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations (as pointed out here).

import pandas as pd
import numpy as np

df = pd.DataFrame ({'a' : np.random.randn(6),
             'b' : ['foo', 'bar'] * 3,
             'c' : np.random.randn(6)})

Example 1: looping with pandas.apply():

def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop

Example 2: vectorize using pandas Series:

df['a'] % df['c']

The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop

Example 3: vectorize using numpy arrays:

df['a'].values % df['c'].values

The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop

So vectorizing using numpy arrays improved the speed by almost two orders of magnitude.
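
Before choosing the fastest variant, it may be worth checking that all three produce the same numbers. This is a minimal sketch of such a check; the variable names `by_apply`, `by_series` and `by_numpy` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the run is repeatable
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

# Example 1: row-by-row with apply (returns a Series).
by_apply = df.apply(lambda row: row['a'] % row['c'], axis=1).astype(float)

# Example 2: vectorized on pandas Series (returns a Series).
by_series = df['a'] % df['c']

# Example 3: vectorized on the underlying numpy arrays (returns an ndarray,
# so the index is dropped and has to be re-attached on assignment).
by_numpy = df['a'].values % df['c'].values

print(np.allclose(by_apply, by_series), np.allclose(by_series, by_numpy))
```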

Answer 4


This is the same as the previous solution, but I have defined the function inside df.apply itself:

df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

Answer 5



Here is a timing comparison of all three approaches discussed above.

Using values

%timeit df['value'] = df['a'].values % df['c'].values

139 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without values

%timeit df['value'] = df['a']%df['c'] 

216 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Apply function

%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

474 µs ± 5.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
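
The `%timeit` magic only works inside IPython or Jupyter. For a plain script, a rough equivalent can be sketched with the standard-library `timeit` module. The loop count of 200 and the `timings` dictionary are arbitrary choices for illustration, and the absolute numbers will depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.random.randn(6)})

n = 200  # number of repetitions per variant (arbitrary)

# Total seconds for n calls of each variant.
timings = {
    'using values': timeit.timeit(
        lambda: df['a'].values % df['c'].values, number=n),
    'without values': timeit.timeit(
        lambda: df['a'] % df['c'], number=n),
    'apply function': timeit.timeit(
        lambda: df.apply(lambda row: row['a'] % row['c'], axis=1), number=n),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds / n * 1e6:.1f} µs per loop")
```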