如何将函数应用于Pandas数据框的两列

问题:如何将函数应用于Pandas数据框的两列

假设我有一个df包含的列'ID', 'col_1', 'col_2'。我定义一个函数:

f = lambda x, y : my_function_expression

现在,我要应用fdf的两列'col_1', 'col_2',以逐元素的计算新列'col_3',有点像:

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

怎么做 ?

** 如下添加详细样本 ***

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

Suppose I have a df which has columns of 'ID', 'col_1', 'col_2'. And I define a function :

f = lambda x, y : my_function_expression.

Now I want to apply the f to df‘s two columns 'col_1', 'col_2' to element-wise calculate a new column 'col_3' , somewhat like :

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

How to do ?

** Add detail sample as below ***

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

回答 0

这是apply在数据框上使用的示例,我正在用进行调用axis = 1

请注意,区别在于,与其尝试将两个值传递给函数f,不如重写函数以接受pandas Series对象,然后对Series进行索引以获取所需的值。

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

根据您的用例,有时创建一个pandas group对象然后apply在组中使用会很有帮助。

Here’s an example using apply on the dataframe, which I am calling with axis = 1.

Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.


回答 1

在Pandas中,有一种简单的方法可以做到这一点:

df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

这允许 f成为具有多个输入值的用户定义函数,并使用(安全)列名而不是(不安全)数字索引来访问列。

数据示例(基于原始问题):

import pandas as pd

df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)

输出print(df)

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

如果您的列名包含空格或与现有的dataframe属性共享名称,则可以使用方括号进行索引:

df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)

There is a clean, one-line way of doing this in Pandas:

df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on original question):

import pandas as pd

df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)

Output of print(df):

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:

df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)

回答 2

一个简单的解决方案是:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)

A simple solution is:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)

回答 3

一个有趣的问题!我的回答如下:

import pandas as pd

def sublst(row):
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(sublst,axis=1)
print df

输出:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

我将列名称更改为ID,J1,J2,J3以确保ID <J1 <J2 <J3,因此列以正确的顺序显示。

另一个简短的版本:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print df

A interesting question! my answer as below:

import pandas as pd

def sublst(row):
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(sublst,axis=1)
print df

Output:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

I changed the column name to ID,J1,J2,J3 to ensure ID < J1 < J2 < J3, so the column display in right sequence.

One more brief version:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print df

回答 4

您正在寻找的方法是Series.combine。但是,似乎必须谨慎处理数据类型。在您的示例中,您会(就像我在测试答案时所做的那样)天真地调用

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

但是,这将引发错误:

ValueError: setting an array element with a sequence.

我最好的猜测是,似乎期望结果与调用该方法的系列(此处为df.col_1)具有相同的类型。但是,以下工作原理:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

   ID   col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

My best guess is that it seems to expect the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

   ID   col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

回答 5

您的书写方式需要两个输入。如果查看错误消息,它表示您没有为f提供两个输入,仅一个。错误消息是正确的。
之所以不匹配,是因为df [[”col1’,’col2′]]返回一个包含两列而不是两列的单个数据帧。

你需要改变你的f,将它需要一个输入,保持上述数据帧作为输入,然后把它分解成X,Y 内部函数体。然后执行所需的任何操作并返回一个值。

您需要此函数签名,因为语法为.apply(f),因此f需要采用单个对象=数据帧,而不是当前f所期望的两件事。

由于您尚未提供f的正文,因此我无济于事-但这应提供解决方法,而无需从根本上更改您的代码或使用其他方法(而不是应用)

The way you have written f it needs two inputs. If you look at the error message it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[[‘col1′,’col2’]] returns a single dataframe with two columns, not two separate columns.

You need to change your f so that it takes a single input, keep the above data frame as input, then break it up into x,y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f) So f needs to take the single thing = dataframe and not two things which is what your current f expects.

Since you haven’t provided the body of f I can’t help in anymore detail – but this should provide the way out without fundamentally changing your code or using some other methods rather than apply


回答 6

我将对np.vectorize进行投票。它允许您仅拍摄x列数,而不处理函数中的数据框,因此非常适合您无法控制的函数或执行向函数发送2列和常量之类的操作(例如col_1,col_2, ‘foo’)。

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

ID  col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

I’m going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it’s great for functions you don’t control or doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, ‘foo’).

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

ID  col_1   col_2   col_3
0   1   0   1   [a, b]
1   2   2   4   [c, d, e]
2   3   3   5   [d, e, f]

回答 7

从中返回列表apply是危险的操作,因为不能保证结果对象不是Series还是DataFrame。在某些情况下可能会引发exceptions情况。让我们来看一个简单的例子:

df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
   a  b  c
0  4  0  0
1  2  0  1
2  2  2  2
3  1  2  2
4  3  0  0

从以下位置返回列表可能会导致三种结果 apply

1)如果返回列表的长度不等于列数,则返回一系列列表。

df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
0    [0, 1]
1    [0, 1]
2    [0, 1]
3    [0, 1]
4    [0, 1]
dtype: object

2)当返回列表的长度等于列数时,则返回一个DataFrame,并且每一列都获得列表中的相应值。

df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
   a  b  c
0  0  1  2
1  0  1  2
2  0  1  2
3  0  1  2
4  0  1  2

3)如果返回列表的长度等于第一行的列数,但至少具有一行,其中列表的元素数与列数不同,则引发ValueError。

i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1) 
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)

不申请就回答问题

apply与axis = 1一起使用非常慢。使用基本的迭代方法可能会获得更好的性能(尤其是在较大的数据集上)。

创建更大的数据框

df1 = df.sample(100000, replace=True).reset_index(drop=True)

时机

# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@托马斯回答

%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Returning a list from apply is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let’s walk through a simple example:

df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
   a  b  c
0  4  0  0
1  2  0  1
2  2  2  2
3  1  2  2
4  3  0  0

There are three possible outcomes with returning a list from apply

1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.

df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
0    [0, 1]
1    [0, 1]
2    [0, 1]
3    [0, 1]
4    [0, 1]
dtype: object

2) When the length of the returned list is equal to the number of columns then a DataFrame is returned and each column gets the corresponding value in the list.

df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
   a  b  c
0  0  1  2
1  0  1  2
2  0  1  2
3  0  1  2
4  0  1  2

3) If the length of the returned list equals the number of columns for the first row but has at least one row where the list has a different number of elements than number of columns a ValueError is raised.

i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1) 
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)

Answering the problem without apply

Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

Create larger dataframe

df1 = df.sample(100000, replace=True).reset_index(drop=True)

Timings

# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Thomas answer

%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

回答 8

我敢肯定这不如使用Pandas或Numpy操作的解决方案快,但是如果您不想重写函数,则可以使用map。使用原始示例数据-

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list

我们可以通过这种方式将任意数量的参数传递给函数。输出就是我们想要的

ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

I’m sure this isn’t as fast as the solutions using Pandas or Numpy operations, but if you don’t want to rewrite your function you can use map. Using the original example data –

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list

We could pass as many arguments as we wanted into the function this way. The output is what we wanted

ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

回答 9

我的问题示例:

def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

My example to your questions:

def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

回答 10

如果您有庞大的数据集,则可以使用简单但更快的(执行时间)方式使用swifter:

import pandas as pd
import swifter

def fnc(m,x,c):
    return m*x+c

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)

If you have a huge data-set, then you can use an easy but faster(execution time) way of doing this using swifter:

import pandas as pd
import swifter

def fnc(m,x,c):
    return m*x+c

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)

回答 11

我想您不想更改get_sublist功能,而只想使用DataFrame的apply方法来完成这项工作。为了获得所需的结果,我编写了两个帮助函数:get_sublist_listunlist。顾名思义,首先获取子列表,然后从列表中提取该子列表。最后,我们需要调用apply函数以df[['col_1','col_2']]随后将这两个函数应用于DataFrame。

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    return [get_sublist(cols[0],cols[1])]

def unlist(list_of_lists):
    return list_of_lists[0]

df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)

df

如果不使用[]get_sublist函数,则该get_sublist_list函数将返回一个纯列表,它会引发ValueError: could not broadcast input array from shape (3) into shape (2)@Ted Petrou提到的列表。

I suppose you don’t want to change get_sublist function, and just want to use DataFrame’s apply method to do the job. To get the result you want, I’ve wrote two help functions: get_sublist_list and unlist. As the function name suggest, first get the list of sublist, second extract that sublist from that list. Finally, We need to call apply function to apply those two functions to the df[['col_1','col_2']] DataFrame subsequently.

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    return [get_sublist(cols[0],cols[1])]

def unlist(list_of_lists):
    return list_of_lists[0]

df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)

df

If you don’t use [] to enclose the get_sublist function, then the get_sublist_list function will return a plain list, it’ll raise ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou had mentioned.