
How to apply a function to two columns of a Pandas dataframe

Question: How to apply a function to two columns of a Pandas dataframe

Suppose I have a df which has columns 'ID', 'col_1' and 'col_2', and I define a function:

f = lambda x, y: my_function_expression

Now I want to apply f to df's two columns 'col_1' and 'col_2' to calculate a new column 'col_3' element-wise, somewhat like:

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

How can I do this?

*** Added a detailed sample below ***

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

Answer 0

Here’s an example using apply on the dataframe, which I am calling with axis = 1.

Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.
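
For instance, here is a minimal sketch of that group-then-apply pattern, reusing the question's sample df; grouping by 'ID' is chosen purely for illustration:

import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

# apply receives each group's sub-frame; here we reduce it to one number per group
per_group = df.groupby('ID').apply(lambda g: (g['col_1'] + g['col_2']).sum())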


Answer 1

There is a clean, one-line way of doing this in Pandas:

df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on original question):

import pandas as pd

df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)

Output of print(df):

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:

df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)

Answer 2

A simple solution is:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
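
Applied concretely (a sketch, not part of the original answer; a two-argument placeholder stands in for f, and scalar returns avoid the list-return pitfalls discussed in a later answer):

import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def f(x, y):
    return x + y  # placeholder for my_function_expression

# *x unpacks the two-element row Series into the two positional arguments
df['col_3'] = df[['col_1', 'col_2']].apply(lambda x: f(*x), axis=1)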

Answer 3

An interesting question! My answer is below:

import pandas as pd

def sublst(row):
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(sublst,axis=1)
print(df)

Output:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

I changed the column names to ID, J1, J2, J3 to ensure ID < J1 < J2 < J3, so the columns display in the right order.

One more brief version:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print(df)

Answer 4

The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

My best guess is that it seems to expect the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

Answer 5

The way you have written f, it needs two inputs. If you look at the error message, it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[['col1','col2']] returns a single dataframe with two columns, not two separate columns.

You need to change your f so that it takes a single input, keeps the above data frame as that input, and then breaks it up into x and y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f), so f needs to take a single thing (the dataframe) rather than the two things your current f expects.

Since you haven't provided the body of f, I can't help in any more detail, but this should provide a way out without fundamentally changing your code or switching to some method other than apply. A hedged sketch of that rewrite follows.
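
A minimal sketch of that rewrite, assuming row-wise application (axis=1) and a placeholder computation in place of the unknown body of f:

import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def f(row):
    # apply(axis=1) hands each row over as a single Series
    x, y = row['col_1'], row['col_2']
    return x + y  # placeholder for the real computation

df['col_3'] = df[['col_1', 'col_2']].apply(f, axis=1)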


Answer 6

I'm going to put in a vote for np.vectorize. It allows you to just shoot over any number of columns and not deal with the dataframe in the function, so it's great for functions you don't control, or for doing something like sending two columns and a constant into a function (i.e. col_1, col_2, 'foo').

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

Answer 7

Returning a list from apply is a dangerous operation, as it is not guaranteed whether the resulting object will be a Series or a DataFrame, and exceptions may be raised in certain cases. Let's walk through a simple example:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
   a  b  c
0  4  0  0
1  2  0  1
2  2  2  2
3  1  2  2
4  3  0  0

There are three possible outcomes when returning a list from apply:

1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.

df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
0    [0, 1]
1    [0, 1]
2    [0, 1]
3    [0, 1]
4    [0, 1]
dtype: object

2) When the length of the returned list is equal to the number of columns then a DataFrame is returned and each column gets the corresponding value in the list.

df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
   a  b  c
0  0  1  2
1  0  1  2
2  0  1  2
3  0  1  2
4  0  1  2

3) If the length of the returned list equals the number of columns for the first row, but at least one other row's list has a different number of elements than the number of columns, a ValueError is raised.

i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1) 
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)

Answering the problem without apply

Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

Create larger dataframe

df1 = df.sample(100000, replace=True).reset_index(drop=True)

Timings

# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Thomas's answer

%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Answer 8

I’m sure this isn’t as fast as the solutions using Pandas or Numpy operations, but if you don’t want to rewrite your function you can use map. Using the original example data –

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list

We could pass as many arguments as we want into the function this way. The output is what we wanted:

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

Answer 9

My example for your question:

def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

Answer 10

If you have a huge dataset, then an easy but faster (in execution time) way of doing this is to use swifter:

import pandas as pd
import swifter

def fnc(m,x,c):
    return m*x+c

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)

Answer 11

I suppose you don't want to change the get_sublist function, and just want to use DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the names suggest, the first gets the sublist and wraps it in a list, and the second extracts that sublist from the enclosing list. Finally, we call apply to apply those two functions to the df[['col_1','col_2']] DataFrame in sequence.

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    return [get_sublist(cols[0],cols[1])]

def unlist(list_of_lists):
    return list_of_lists[0]

df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)

df

If you don't use [] to enclose the get_sublist result, then get_sublist_list will return a plain list, and apply will raise ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou mentioned.


The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Question: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Having an issue filtering my result dataframe with an or condition. I want my result df to extract all var column values that are above 0.25 or below -0.25.

The logic below gives me an ambiguous truth value; however, it works when I split this filtering into two separate operations. What is happening here? I'm not sure where to use the suggested a.empty(), a.bool(), a.item(), a.any() or a.all().

 result = result[(result['var']>0.25) or (result['var']<-0.25)]

Answer 0

The Python or and and statements require truth values. For pandas these are considered ambiguous, so you should use the "bitwise" | (or) or & (and) operations:

result = result[(result['var']>0.25) | (result['var']<-0.25)]

These are overloaded for these kinds of data structures to yield the element-wise or (or and).


Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What you hit was a place where the operator implicitly converted the operands to bool (you used or but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Besides these four statements, there are several Python functions that hide bool calls (like any, all, filter, …). These are normally not problematic with pandas.Series, but for completeness I wanted to mention them.


In your case the exception isn’t really helpful, because it doesn’t mention the right alternatives. For and and or you can use (if you want element-wise comparisons):

  • numpy.logical_or:

    >>> import numpy as np
    >>> np.logical_or(x, y)
    

    or simply the | operator:

    >>> x | y
    
  • numpy.logical_and:

    >>> np.logical_and(x, y)
    

    or simply the & operator:

    >>> x & y
    

If you're using the operators, then make sure you set your parentheses correctly because of operator precedence.

There are several logical numpy functions which should work on pandas.Series.


The alternatives mentioned in the Exception are more suited if you encountered it when doing if or while. I’ll shortly explain each of these:

  • If you want to check if your Series is empty:

    >>> x = pd.Series([])
    >>> x.empty
    True
    >>> x = pd.Series([1])
    >>> x.empty
    False
    

    Python normally interprets the length of containers (like list, tuple, …) as truth-value if it has no explicit boolean interpretation. So if you want the python-like check, you could do: if x.size or if not x.empty instead of if x.

  • If your Series contains one and only one boolean value:

    >>> x = pd.Series([100])
    >>> (x > 50).bool()
    True
    >>> (x < 50).bool()
    False
    
  • If you want to check the first and only item of your Series (like .bool(), but it works even for non-boolean contents):

    >>> x = pd.Series([100])
    >>> x.item()
    100
    
  • If you want to check if all or any item is not-zero, not-empty or not-False:

    >>> x = pd.Series([0, 1, 2])
    >>> x.all()   # because one element is zero
    False
    >>> x.any()   # because one (or more) elements are non-zero
    True
    

Answer 1

For boolean logic, use & and |.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))

>>> df
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

To see what is happening, you get a column of booleans for each comparison, e.g.

df.C > 0.25
0     True
1    False
2    False
3     True
4     True
Name: C, dtype: bool

When you have multiple criteria, you will get multiple columns of booleans returned. This is why the join logic is ambiguous. Using and or or treats each column separately, so you first need to reduce each column to a single boolean value, for example by checking whether any value or all values in the column are True.

# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()
True

# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()
False

One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.

>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

For more details, refer to Boolean Indexing in the docs.


Answer 2

Well, pandas uses the bitwise operators '&' and '|', and each condition should be wrapped in '()'.

For example, the following works:

data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]

But the same query without proper brackets does not, because & binds more tightly than the comparison operators:

data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]

Answer 3

Or, alternatively, you could use the operator module. More detailed information is in the Python docs:

import operator
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

Answer 4

This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:

result = result.query("(var > 0.25) or (var < -0.25)")

See also http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.

(Some tests with a dataframe I’m currently working with suggest that this method is a bit slower than using the bitwise operators on series of booleans: 2 ms vs. 870 µs)

A piece of warning: At least one situation where this is not straightforward is when column names happen to be python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2) and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"

I obtained the following exception cascade:

  • KeyError: 'log2'
  • UndefinedVariableError: name 'log2' is not defined
  • ValueError: "log2" is not a supported function

I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.

A possible workaround is proposed here.
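
For reference, one common workaround (my own sketch, not necessarily the one linked above) is to rename the offending column to a plain identifier before querying:

# hypothetical illustration: make the column name a valid identifier first
df = df.rename(columns={'log2(WT_38hph_IP_2/WT_38hph_input_2)': 'log2_ratio'})
result = df.query('(log2_ratio > 1) and (WT_38hph_IP_2 > 20)')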


Answer 5

I encountered the same error and got stalled with a pyspark dataframe for a few days. I was able to resolve it successfully by filling na values with 0, since I was comparing integer values from two fields.
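
A minimal PySpark sketch of that fix (the column names a and b are assumptions for illustration; fillna here is the standard PySpark DataFrame method):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, 3)], ['a', 'b'])

# replace nulls with 0 before comparing the two integer fields
df = df.fillna(0, subset=['a', 'b'])
result = df.filter(df['a'] > df['b'])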


How to get a value from a cell of a dataframe?

Question: How to get a value from a cell of a dataframe?

I have constructed a condition that extracts exactly one row from my data frame:

d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]

Now I would like to take a value from a particular column:

val = d2['col_name']

But as a result I get a data frame that contains one row and one column (i.e. one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?


Answer 0

If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:

In [3]: sub_df
Out[3]:
          A         B
2 -0.133653 -0.030854

In [4]: sub_df.iloc[0]
Out[4]:
A   -0.133653
B   -0.030854
Name: 2, dtype: float64

In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493

Answer 1

These are fast accessors for scalars:

In [15]: df = pandas.DataFrame(numpy.random.randn(5,3),columns=list('ABC'))

In [16]: df
Out[16]: 
          A         B         C
0 -0.074172 -0.090626  0.038272
1 -0.128545  0.762088 -0.714816
2  0.201498 -0.734963  0.558397
3  1.563307 -1.186415  0.848246
4  0.205171  0.962514  0.037709

In [17]: df.iat[0,0]
Out[17]: -0.074171888537611502

In [18]: df.at[0,'A']
Out[18]: -0.074171888537611502

Answer 2

You can turn your 1×1 dataframe into a numpy array, then access the first and only value of that array:

val = d2['col_name'].values[0]

Answer 3

Most answers are using iloc, which is good for selection by position.

If you need selection by label, loc would be more convenient.

For getting a value explicitly (equivalent to the deprecated df.get_value('a','A')):

# this is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A'] 
Out[55]: 0.13200317033032932

Answer 4

I needed the value of one cell, selected by column and index names. This solution worked for me:

original_conversion_frequency.loc[1,:].values[0]


Answer 5

It looks like things changed after pandas 0.10.1/0.13.1.

I upgraded from 0.10.1 to 0.13.1; before that, iloc was not available.

Now with 0.13.1, iloc[0]['label'] gets a single-value array rather than a scalar.

Like this:

lastprice=stock.iloc[-1]['Close']

Output:

date
2014-02-26    118.2
Name: Close, dtype: float64

Answer 6

The quickest/easiest options I have found are the following; 501 represents the row index. (Note that get_value was deprecated in later pandas versions and removed in pandas 1.0; df.at is the supported equivalent.)

df.at[501,'column_name']
df.get_value(501,'column_name')

Answer 7

For pandas 0.10, where iloc is unavailable, filter a DF and get the first row's data for the column VALUE:

df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')

If more than one row is filtered, this obtains the first row's value. An exception is raised if the filter results in an empty data frame.


Answer 8

Not sure if this is a good practice, but I noticed I can also get just the value by casting the series as float, e.g.:

>>> rate
3    0.042679
Name: Unemployment_rate, dtype: float64

>>> float(rate)
0.0426789


Answer 9

It doesn’t need to be complicated:

val = df.loc[df.wd==1, 'col_name'].values[0]

Answer 10

df_gdp.columns

Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code', u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967', u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975', u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007', u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015', u'2016'], dtype='object')

df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]

8100000000000.0


Pandas: filter rows of a DataFrame with operator chaining

Question: Pandas: filter rows of a DataFrame with operator chaining

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc), but the only way I’ve found to filter rows is via normal bracket indexing

df_filtered = df[df['column'] == value]

This is unappealing as it requires I assign df to a variable before being able to filter on its values. Is there something more like the following?

df_filtered = df.mask(lambda x: x['column'] == value)

Answer 0

I’m not entirely sure what you want, and your last line of code does not help either, but anyway:

“Chained” filtering is done by “chaining” the criteria in the boolean index.

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

If you want to chain methods, you can add your own mask method and use that one.

In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4,4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.loc['d','A'] = df.loc['a', 'A']  # .ix is removed in modern pandas; .loc is the label-based equivalent

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6

Answer 1

Filters can be chained using a Pandas query:

df = pd.DataFrame(np.random.randn(30, 3), columns=['a','b','c'])
df_filtered = df.query('a > 0').query('0 < b < 2')

Filters can also be combined in a single query:

df_filtered = df.query('a > 0 and 0 < b < 2')

Answer 2

The answer from @lodagro is great. I would extend it by generalizing the mask function as:

def mask(df, f):
  return df[f(df)]

Then you can do stuff like:

df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)

Answer 3

Since version 0.18.1 the .loc method accepts a callable for selection. Together with lambda functions you can create very flexible chainable filters:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.loc[lambda df: df.A == 80]  # equivalent to df[df.A == 80] but chainable

df.sort_values('A').loc[lambda df: df.A > 80].loc[lambda df: df.B > df.A]

If all you’re doing is filtering, you can also omit the .loc.
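
For example, reusing the df defined above (a one-line sketch of that shorthand):

df[lambda df: df.A == 80]  # same result as df.loc[lambda df: df.A == 80]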


Answer 4

I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

I’ll add other edits to make this post more useful.

pandas.DataFrame.query
query was made for exactly this purpose. Consider the dataframe df

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(10, size=(10, 5)),
    columns=list('ABCDE')
)

df

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

Let’s use query to filter all rows where D > B

df.query('D > B')

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

Which we can chain:

df.query('D > B').query('C > B')
# equivalent to
# df.query('D > B and C > B')
# but defeats the purpose of demonstrating chaining

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
4  3  6  7  7  4
5  5  3  7  5  9
7  6  2  6  6  5

Answer 5

I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

But I found that, if you wrap each condition in (... == True) and join the criteria with a pipe, the criteria are combined in an OR condition, satisfied whenever either of them is true:

df[((df.A==1) == True) | ((df.D==6) == True)]

Answer 6

pandas provides two alternatives to Wouter Overmeire’s answer which do not require any overriding. One is .loc[.] with a callable, as in

df_filtered = df.loc[lambda x: x['column'] == value]

the other is .pipe(), as in

df_filtered = df.pipe(lambda x: x[x['column'] == value])  # the callable must return the filtered frame, not just the boolean mask

Answer 7

My answer is similar to the others. If you do not want to create a new function you can use what pandas has defined for you already. Use the pipe method.

df.pipe(lambda d: d[d['column'] == value])

Answer 8

If you would like to apply all of the common boolean masks as well as a general purpose mask you can chuck the following in a file and then simply assign them all as follows:

pd.DataFrame = apply_masks()

Usage:

A = pd.DataFrame(np.random.randn(4, 4), columns=["A", "B", "C", "D"])
A.le_mask("A", 0.7).ge_mask("B", 0.2)  # may be chained as many times as necessary

It’s a little bit hacky but it can make things a little bit cleaner if you’re continuously chopping and changing datasets according to filters. There’s also a general purpose filter adapted from Daniel Velkov above in the gen_mask function which you can use with lambda functions or otherwise if desired.

File to be saved (I use masks.py):

import pandas as pd

def eq_mask(df, key, value):
    return df[df[key] == value]

def ge_mask(df, key, value):
    return df[df[key] >= value]

def gt_mask(df, key, value):
    return df[df[key] > value]

def le_mask(df, key, value):
    return df[df[key] <= value]

def lt_mask(df, key, value):
    return df[df[key] < value]

def ne_mask(df, key, value):
    return df[df[key] != value]

def gen_mask(df, f):
    return df[f(df)]

def apply_masks():

    pd.DataFrame.eq_mask = eq_mask
    pd.DataFrame.ge_mask = ge_mask
    pd.DataFrame.gt_mask = gt_mask
    pd.DataFrame.le_mask = le_mask
    pd.DataFrame.lt_mask = lt_mask
    pd.DataFrame.ne_mask = ne_mask
    pd.DataFrame.gen_mask = gen_mask

    return pd.DataFrame

if __name__ == '__main__':
    pass

Answer 9

This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don’t need to download the entire repo: saving the file and doing

from where import where as W

should suffice. Then you use it like this:

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False], 
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire - or subset of a - DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

By the way: even in the case in which you are just using boolean cols,

df.loc[W['cond1']].loc[W['cond2']]

can be much more efficient than

df.loc[W['cond1'] & W['cond2']]

because it evaluates cond2 only where cond1 is True.

DISCLAIMER: I first gave this answer elsewhere because I hadn’t seen this.


Answer 10

Just want to add a demonstration using loc to filter not only by rows but also by columns, with some remarks on the merits of the chained operation.

The code below can filter the rows by value.

df_filtered = df.loc[df['column'] == value]

By modifying it a bit you can filter the columns as well.

df_filtered = df.loc[df['column'] == value, ['year', 'column']]

So why do we want a chained method? The answer is that it is simple to read if you have many operations. For example,

res =  df\
    .loc[df['station']=='USA', ['TEMP', 'RF']]\
    .groupby('year')\
    .agg(np.nanmean)

Answer 11

This is unappealing as it requires I assign df to a variable before being able to filter on its values.

df[df["column_name"] != 5].groupby("other_column_name")

seems to work: you can nest the [] operator as well. Maybe they added it since you asked the question.


Answer 12

If you set the columns you want to search on as the index, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(3, size=(10, 5)),
    columns=list('ABCDE')
)

df
# Out[55]: 
#    A  B  C  D  E
# 0  0  2  2  2  2
# 1  1  1  2  0  2
# 2  0  2  0  0  2
# 3  0  2  2  0  1
# 4  0  1  1  2  0
# 5  0  0  0  1  2
# 6  1  0  1  1  1
# 7  0  0  2  0  2
# 8  2  2  2  2  2
# 9  1  2  0  2  1

df.set_index(['A', 'D']).xs([0, 2]).reset_index()
# Out[57]: 
#    A  D  B  C  E
# 0  0  2  2  2  2
# 1  0  2  1  1  0

Answer 13

You can also leverage the numpy library for logical operations. It's pretty fast.

df[np.logical_and(df['A'] == 1, df['B'] == 6)]
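
And if more than two conditions are involved, numpy's reduce form keeps it to a single call (a sketch; the third condition on a column C is an assumption for illustration):

df[np.logical_and.reduce([df['A'] == 1, df['B'] == 6, df['C'] > 0])]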

Pandas - how to flatten a hierarchical index in columns

Question: Pandas - how to flatten a hierarchical index in columns

I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg operation):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf       
                                     sum   sum   sum    sum   amax   amin
0  702730  26451  1993      1    1     1     0    12     13  30.92  24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00  24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00   6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04   3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94  10.94

I want to flatten it, so that it looks like this (names aren’t critical – I could rename):

     USAF   WBAN  year  month  day  s_PC  s_CL  s_CD  s_CNT  tempf_amax  tmpf_amin   
0  702730  26451  1993      1    1     1     0    12     13  30.92          24.98
1  702730  26451  1993      1    2     0     0    13     13  32.00          24.98
2  702730  26451  1993      1    3     1    10     2     13  23.00          6.98
3  702730  26451  1993      1    4     1     0    12     13  10.04          3.92
4  702730  26451  1993      1    5     3     0    10     13  19.94          10.94

How do I do this? (I’ve tried a lot, to no avail.)

Per a suggestion, here is the head in dict form

{('USAF', ''): {0: '702730',
  1: '702730',
  2: '702730',
  3: '702730',
  4: '702730'},
 ('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
 ('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
 ('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 ('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
 ('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
 ('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
 ('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
 ('tempf', 'amax'): {0: 30.920000000000002,
  1: 32.0,
  2: 23.0,
  3: 10.039999999999999,
  4: 19.939999999999998},
 ('tempf', 'amin'): {0: 24.98,
  1: 24.98,
  2: 6.9799999999999969,
  3: 3.9199999999999982,
  4: 10.940000000000001},
 ('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}

Answer 0

I think the easiest way to do this would be to set the columns to the top level:

df.columns = df.columns.get_level_values(0)

Note: if the top level has a name, you can also access it by that name rather than by 0.

If you want to combine/join your MultiIndex into one Index (assuming you have just string entries in your columns) you could:

df.columns = [' '.join(col).strip() for col in df.columns.values]

Note: we must strip the whitespace for when there is no second index.

In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]: 
['USAF',
 'WBAN',
 'day',
 'month',
 's_CD sum',
 's_CL sum',
 's_CNT sum',
 's_PC sum',
 'tempf amax',
 'tempf amin',
 'year']
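
A small variant of the same join, assuming you want the underscore-separated names shown in the question (tempf_amax, tempf_amin):

df.columns = ['_'.join(col).strip('_') for col in df.columns.values]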

Answer 1

pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only

回答 2

该线程上的所有当前答案都必须已过时。从pandas0.24.0版开始,.to_flat_index()您需要做什么。

从熊猫自己的文档中

MultiIndex.to_flat_index()

将MultiIndex转换为包含级别值的元组索引。

文档中的一个简单示例:

import pandas as pd
print(pd.__version__) # '0.23.4'
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

应用 to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

用它来替换现有的pandas

一个在 dat 上使用它的示例;dat 是一个带有 MultiIndex 列的 DataFrame:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')

All of the current answers on this thread must be a bit dated. As of pandas version 0.24.0, .to_flat_index() does what you need.

From panda’s own documentation:

MultiIndex.to_flat_index()

Convert a MultiIndex to an Index of Tuples containing the level values.

A simple example from its documentation:

import pandas as pd
print(pd.__version__) # requires '0.24.0' or later
index = pd.MultiIndex.from_product(
        [['foo', 'bar'], ['baz', 'qux']],
        names=['a', 'b'])

print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
#           codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
#           names=['a', 'b'])

Applying to_flat_index():

index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')

Using it to replace existing pandas column

An example of how you’d use it on dat, which is a DataFrame with a MultiIndex column:

dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
#            codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])

dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'),  ('class_size', 'mean'),
#     ('class_size', 'std'),   ('class_size', 'min'),
#     ('class_size', '25%'),   ('class_size', '50%'),
#     ('class_size', '75%'),   ('class_size', 'max')],
#  dtype='object')
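
需要注意,to_flat_index() 返回的仍是由元组组成的 Index;如果想得到单层的字符串列名,可以再连接一次(一个小示意,沿用上文的 dat):

dat.columns = ["_".join(col) for col in dat.columns.to_flat_index()]
# 例如 ('class_size', 'mean') 会变成 'class_size_mean'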

回答 3

安迪·海登(Andy Hayden)的答案当然是最简单的方法-如果要避免重复的列标签,则需要进行一些调整

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993

Andy Hayden’s answer is certainly the easiest way — if you want to avoid duplicate column labels you need to tweak a bit

In [34]: df
Out[34]: 
     USAF   WBAN  day  month  s_CD  s_CL  s_CNT  s_PC  tempf         year
                               sum   sum    sum   sum   amax   amin      
0  702730  26451    1      1    12     0     13     1  30.92  24.98  1993
1  702730  26451    2      1    13     0     13     0  32.00  24.98  1993
2  702730  26451    3      1     2    10     13     1  23.00   6.98  1993
3  702730  26451    4      1    12     0     13     1  10.04   3.92  1993
4  702730  26451    5      1    10     0     13     3  19.94  10.94  1993


In [35]: mi = df.columns

In [36]: mi
Out[36]: 
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]


In [37]: mi.tolist()
Out[37]: 
[('USAF', ''),
 ('WBAN', ''),
 ('day', ''),
 ('month', ''),
 ('s_CD', 'sum'),
 ('s_CL', 'sum'),
 ('s_CNT', 'sum'),
 ('s_PC', 'sum'),
 ('tempf', 'amax'),
 ('tempf', 'amin'),
 ('year', '')]

In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)

In [40]: df.columns = ind




In [46]: df
Out[46]: 
     USAF   WBAN  day  month  s_CDsum  s_CLsum  s_CNTsum  s_PCsum  tempfamax  tempfamin  \
0  702730  26451    1      1       12        0        13        1      30.92      24.98   
1  702730  26451    2      1       13        0        13        0      32.00      24.98   
2  702730  26451    3      1        2       10        13        1      23.00       6.98   
3  702730  26451    4      1       12        0        13        1      10.04       3.92   
4  702730  26451    5      1       10        0        13        3      19.94      10.94   




   year  
0  1993  
1  1993  
2  1993  
3  1993  
4  1993

回答 4

df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]
df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]
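
在本问题的天气数据上,这行代码(作为示意)会得到如下列名;rstrip('_') 去掉了第二级为空的列末尾多余的下划线:

['USAF', 'WBAN', 'day', 'month', 's_CD_sum', 's_CL_sum', 's_CNT_sum',
 's_PC_sum', 'tempf_amax', 'tempf_amin', 'year']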

回答 5

而且,如果您想保留第二级多索引中的任何聚合信息,则可以尝试以下操作:

In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [2]: df.columns = new_cols

And if you want to retain any of the aggregation info from the second level of the multiindex you can try this:

In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
 'WBAN',
 'day',
 'month',
 's_CDsum',
 's_CLsum',
 's_CNTsum',
 's_PCsum',
 'tempfamax',
 'tempfamin',
 'year']

In [2]: df.columns = new_cols

回答 6

最 pythonic 的做法是使用 map 函数。

df.columns = df.columns.map(' '.join).str.strip()

输出print(df.columns)

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

使用Python 3.6+和f字符串进行更新:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

输出:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

The most pythonic way to do this is to use the map function.

df.columns = df.columns.map(' '.join).str.strip()

Output print(df.columns):

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

Update using Python 3.6+ with f string:

df.columns = [f'{f} {s}' if s != '' else f'{f}' 
              for f, s in df.columns]

print(df.columns)

Output:

Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
       's_PC sum', 'tempf amax', 'tempf amin', 'year'],
      dtype='object')

回答 7

对我来说,最简单,最直观的解决方案是使用get_level_values组合列名称。当您在同一列上执行多个聚合时,这可以防止重复的列名称:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

如果想在各级别之间加上分隔符,可以这样做。这会返回与 Seiji Armstrong 在已接受答案下的评论相同的结果:只有在两个索引级别上都有值的列才会加下划线:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two

我知道这与Andy Hayden的出色答案具有相同的作用,但我认为这种方式更直观,并且更容易记住(因此,我不必继续引用此线程),尤其是对于熊猫新手用户。

在您可能具有3个列级别的情况下,此方法也可以扩展。

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three

The easiest and most intuitive solution for me was to combine the column names using get_level_values. This prevents duplicate column names when you do more than one aggregation on the same column:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two

If you want a separator between columns, you can do this. This will return the same thing as Seiji Armstrong’s comment on the accepted answer that only includes underscores for columns with values in both index levels:

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two

I know this does the same thing as Andy Hayden’s great answer above, but I think it is a bit more intuitive this way and is easier to remember (so I don’t have to keep referring to this thread), especially for novice pandas users.

This method is also more extensible in the case where you may have 3 column levels.

level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three

回答 8

阅读完所有答案后,我想到了:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

用法:

给定一个数据框:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • 单一聚合方法:结果变量与源变量同名:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    • 等同于 df.groupby(by="grouper", as_index=False).agg(...),或在 .agg(...) 之后调用 .reset_index()
    • ----- before -----
                 val1  2
        grouper         
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
  • 单个源变量、多个聚合:结果变量以统计量命名:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    • 等同于 a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index()
    • ----- before -----
                  val1    
                 min max
        grouper         
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
  • 多个变量,多个聚合:名为(varname)_(statname)的结果变量:

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    • 底层会运行 a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values](因为这种形式的 agg() 会在列上产生 MultiIndex)。
    • 如果您没有 my_flatten_cols 辅助函数,直接写 @Seigi 建议的方案可能更省事:a.columns = ["_".join(t).rstrip("_") for t in a.columns.values],在这种情况下效果类似(但如果列上有数字标签就会失败)
    • 要处理列上的数字标签,可以使用 @jxstanford 和 @Nolan Conaway 建议的方案(a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]),不过我不明白为什么需要调用 tuple(),而且我认为只有当某些列带有形如 ("colname", "") 的标签时才需要 rstrip()(如果在修复 .columns 之前先调用了 reset_index(),就可能出现这种情况)
    • ----- before -----
                 val1           2     
                 min       sum    size
        grouper              
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
  • 要手动命名结果变量:(这种写法自 pandas 0.20.0 起已被弃用,且截至 0.23 还没有合适的替代方案)

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    • 其他建议包括:手动设置列名 res.columns = ['A_sum', 'B_sum', 'count'],或者对多个 groupby 语句的结果进行 .join()。
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12

辅助函数处理的情形

  • 级别名称可能不是字符串,例如当列名是整数(DataFrame 按列号建立 Index)时,所以我们必须用 map(str, ..) 转换
  • 它们也可以是空的,所以我们必须 filter(None, ..)
  • 对于单级列(即,除MultiIndex之外的任何内容),columns.values返回名称(str,而不是元组)
  • 取决于您使用 .agg() 的方式,您可能需要只保留某列最底层的标签,或者把多个标签连接起来
  • (也许因为我是 pandas 新手?)我多半希望调用 reset_index(),以便能按常规方式使用 group-by 列,因此默认就这样做

After reading through all the answers, I came up with this:

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
                    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index() if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols

Usage:

Given a data frame:

df = pd.DataFrame({"grouper": ["x","x","y","y"], "val1": [0,2,4,6], 2: [1,3,5,7]}, columns=["grouper", "val1", 2])

  grouper  val1  2
0       x     0  1
1       x     2  3
2       y     4  5
3       y     6  7
  • Single aggregation method: resulting variables named the same as source:

    df.groupby(by="grouper").agg("min").my_flatten_cols()
    
    • Same as df.groupby(by="grouper", as_index=False) or .agg(...).reset_index()
    • ----- before -----
                 val1  2
        grouper         
      
      ------ after -----
        grouper  val1  2
      0       x     0  1
      1       y     4  5
      
  • Single source variable, multiple aggregations: resulting variables named after statistics:

    df.groupby(by="grouper").agg({"val1": [min,max]}).my_flatten_cols("last")
    
    • Same as a = df.groupby(..).agg(..); a.columns = a.columns.droplevel(0); a.reset_index().
    • ----- before -----
                  val1    
                 min max
        grouper         
      
      ------ after -----
        grouper  min  max
      0       x    0    2
      1       y    4    6
      
  • Multiple variables, multiple aggregations: resulting variables named (varname)_(statname):

    df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols()
    # you can combine the names in other ways too, e.g. use a different delimiter:
    #df.groupby(by="grouper").agg({"val1": min, 2:[sum, "size"]}).my_flatten_cols(" ".join)
    
    • Runs a.columns = ["_".join(filter(None, map(str, levels))) for levels in a.columns.values] under the hood (since this form of agg() results in MultiIndex on columns).
    • If you don’t have the my_flatten_cols helper, it might be easier to type in the solution suggested by @Seigi: a.columns = ["_".join(t).rstrip("_") for t in a.columns.values], which works similarly in this case (but fails if you have numeric labels on columns)
    • To handle the numeric labels on columns, you could use the solution suggested by @jxstanford and @Nolan Conaway (a.columns = ["_".join(tuple(map(str, t))).rstrip("_") for t in a.columns.values]), but I don’t understand why the tuple() call is needed, and I believe rstrip() is only required if some columns have a descriptor like ("colname", "") (which can happen if you reset_index() before trying to fix up .columns)
    • ----- before -----
                 val1           2     
                 min       sum    size
        grouper              
      
      ------ after -----
        grouper  val1_min  2_sum  2_size
      0       x         0      4       2
      1       y         4     12       2
      
  • You want to name the resulting variables manually: (this is deprecated since pandas 0.20.0 with no adequate alternative as of 0.23)

    df.groupby(by="grouper").agg({"val1": {"sum_of_val1": "sum", "count_of_val1": "count"},
                                       2: {"sum_of_2":    "sum", "count_of_2":    "count"}}).my_flatten_cols("last")
    
    • Other suggestions include: setting the columns manually: res.columns = ['A_sum', 'B_sum', 'count'] or .join()ing multiple groupby statements.
    • ----- before -----
                         val1                      2         
                count_of_val1 sum_of_val1 count_of_2 sum_of_2
        grouper                                              
      
      ------ after -----
        grouper  count_of_val1  sum_of_val1  count_of_2  sum_of_2
      0       x              2            2           2         4
      1       y              2           10           2        12
      

Cases handled by the helper function

  • level names can be non-string, e.g. Index pandas DataFrame by column numbers, when column names are integers, so we have to convert with map(str, ..)
  • they can also be empty, so we have to filter(None, ..)
  • for single-level columns (i.e. anything except MultiIndex), columns.values returns the names (str, not tuples)
  • depending on how you used .agg() you may need to keep the bottom-most label for a column or concatenate multiple labels
  • (since I’m new to pandas?) more often than not, I want reset_index() to be able to work with the group-by columns in the regular way, so it does that by default

回答 9

处理多个级别和混合类型的通用解决方案:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]

A general solution that handles multiple levels and mixed types:

df.columns = ['_'.join(tuple(map(str, t))) for t in df.columns.values]
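
一个混合类型的小示意(当第二级含有整数标签时,map(str, ..) 可以避免 join 报错;注意这个写法不做 rstrip,第二级为空的列会留下结尾下划线):

cols = [('val', 1), ('val', 2), ('name', '')]
['_'.join(tuple(map(str, t))) for t in cols]
# ['val_1', 'val_2', 'name_']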

回答 10

也许有些晚,但是如果您不担心重复的列名:

df.columns = df.columns.tolist()

A bit late maybe, but if you are not worried about duplicate column names:

df.columns = df.columns.tolist()

回答 11

如果您希望在各个级别之间使用分隔符,则此功能会很好用。

def flattenHierarchicalCol(col,sep = '_'):
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

df.columns = df.columns.map(flattenHierarchicalCol)

In case you want to have a separator in the name between levels, this function works well.

def flattenHierarchicalCol(col,sep = '_'):
    if not type(col) is tuple:
        return col
    else:
        new_col = ''
        for leveli,level in enumerate(col):
            if not level == '':
                if not leveli == 0:
                    new_col += sep
                new_col += level
        return new_col

df.columns = df.columns.map(flattenHierarchicalCol)

回答 12

继 @jxstanford 和 @tvt173 之后,我写了一个快速的小函数,无论列名是字符串还是 int,都能完成这项工作:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_') 
        for t in df.columns.values
        ]
    return df

Following @jxstanford and @tvt173, I wrote a quick function which should do the trick, regardless of string/int column names:

def flatten_cols(df):
    df.columns = [
        '_'.join(tuple(map(str, t))).rstrip('_') 
        for t in df.columns.values
        ]
    return df

回答 13

您也可以按照以下步骤进行操作。考虑df是您的数据框,并假设一个二级索引(在您的示例中就是这种情况)

df.columns = [df.columns[i][0] + '_' + df.columns[i][1] for i in range(len(df.columns))]

You could also do as below. Consider df to be your dataframe and assume a two level index (as is the case in your example)

df.columns = [df.columns[i][0] + '_' + df.columns[i][1] for i in range(len(df.columns))]

回答 14

我将分享一种对我有用的简单方法。

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed

I’ll share a straight-forward way that worked for me.

[" ".join([str(elem) for elem in tup]) for tup in df.columns.tolist()]
#df = df.reset_index() if needed
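
注意,上面的推导式只是生成新列名的列表,还需要赋值回去才会生效(示意):

df.columns = [" ".join(str(elem) for elem in tup) for tup in df.columns.tolist()]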

回答 15

要在其他DataFrame方法链内展平MultiIndex,请定义如下函数:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

然后在 DataFrame 方法链中使用 pipe 方法,在 groupby 和 agg 之后、链中任何其他方法之前应用此函数:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')

To flatten a MultiIndex inside a chain of other DataFrame methods, define a function like this:

def flatten_index(df):
  df_copy = df.copy()
  df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
  return df_copy.reset_index()

Then use the pipe method to apply this function in the chain of DataFrame methods, after groupby and agg but before any other methods in the chain:

my_df \
  .groupby('group') \
  .agg({'value': ['count']}) \
  .pipe(flatten_index) \
  .sort_values('value_count')

回答 16

另一个简单的例程。

def flatten_columns(df, sep='.'):
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)
    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns

Another simple routine.

def flatten_columns(df, sep='.'):
    def _remove_empty(column_name):
        return tuple(element for element in column_name if element)
    def _join(column_name):
        return sep.join(column_name)

    new_columns = [_join(_remove_empty(column)) for column in df.columns.values]
    df.columns = new_columns
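
用法示意(该函数会就地修改列名):

flatten_columns(df)
print(df.columns.tolist())
# 例如 ('tempf', 'amax') 会变成 'tempf.amax',而 ('USAF', '') 仍是 'USAF'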

iloc,ix和loc有何不同?

问题:iloc,ix和loc有何不同?

有人可以解释这三种切片方法有何不同吗?
我看过文档,也看过这些 答案,但仍然发现自己无法解释这三者之间的区别。在我看来,它们在很大程度上似乎是可互换的,因为它们处于切片的较低级别。

例如,假设我们要获取 DataFrame 的前五行。为什么这三种写法都行得通?

df.loc[:5]
df.ix[:5]
df.iloc[:5]

有人能举出三个能更清楚区分三者用法的例子吗?

Can someone explain how these three methods of slicing are different?
I’ve seen the docs, and I’ve seen these answers, but I still find myself unable to explain how the three are different. To me, they seem interchangeable in large part, because they are at the lower levels of slicing.

For example, say we want to get the first five rows of a DataFrame. How is it that all three of these work?

df.loc[:5]
df.ix[:5]
df.iloc[:5]

Can someone present three cases where the distinction in uses are clearer?


回答 0

注意:在 pandas 0.20.0 及更高版本中,ix 已被弃用,建议改用 loc 和 iloc。我保留了答案中描述 ix 的部分,供使用早期版本 pandas 的用户参考。下面还添加了示例,展示 ix 的替代方案。


首先,以下是三种方法的概述:

  • loc从索引中获取带有特定标签的行(或列)。
  • iloc在索引中的特定位置获取行(或列)(因此仅获取整数)。
  • ix 通常会尝试表现得像 loc,但如果索引中不存在该标签,就会回退为 iloc 的行为。

重要的是要注意一些细微之处,这些细微之处可能会使ix使用起来有些棘手:

  • 如果索引是整数类型,ix 将只使用基于标签的索引,而不会回退到基于位置的索引。如果标签不在索引中,则会引发错误。

  • 如果索引并非只包含整数,那么当给定一个整数时,ix 会立即使用基于位置的索引,而非基于标签的索引。不过,如果给 ix 传入其他类型(例如字符串),它仍可以使用基于标签的索引。


为了说明这三种方法之间的差异,请考虑以下系列:

>>> s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> s
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN

我们来看看用整数值 3 进行切片的情况。

在这种情况下,s.iloc[:3] 返回前 3 行(因为它把 3 当作位置),而 s.loc[:3] 返回前 8 行(因为它把 3 当作标签):

>>> s.iloc[:3] # slice the first three rows
49   NaN
48   NaN
47   NaN

>>> s.loc[:3] # slice up to and including label 3
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN

>>> s.ix[:3] # the integer is in the index so s.ix[:3] works like loc
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN

注意,s.ix[:3] 返回与 s.loc[:3] 相同的 Series,因为它会先查找标签,而不是按位置工作(s 的索引是整数类型)。

如果我们尝试使用不在索引中的整数标签(例如6)怎么办?

此处 s.iloc[:6] 按预期返回 Series 的前 6 行。但是,由于 6 不在索引中,s.loc[:6] 会引发 KeyError。

>>> s.iloc[:6]
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN

>>> s.loc[:6]
KeyError: 6

>>> s.ix[:6]
KeyError: 6

根据上面提到的细微之处,s.ix[:6] 现在会引发 KeyError,因为它试图像 loc 那样工作,却在索引中找不到 6。由于我们的索引是整数类型,ix 不会回退为 iloc 的行为。

但是,如果我们的索引是混合类型,那么给定整数时,ix 会立即表现得像 iloc,而不是引发 KeyError:

>>> s2 = pd.Series(np.nan, index=['a','b','c','d','e', 1, 2, 3, 4, 5])
>>> s2.index.is_mixed() # index is mix of different types
True
>>> s2.ix[:6] # now behaves like iloc given integer
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
1   NaN

请记住,ix 仍然可以接受非整数并表现得像 loc:

>>> s2.ix[:'c'] # behaves like loc given non-integer
a   NaN
b   NaN
c   NaN

作为一般建议,如果您只用标签做索引,或者只用整数位置做索引,请坚持使用 loc 或 iloc,以避免意外结果;尽量不要使用 ix。


结合基于位置和基于标签的索引

有时在给定DataFrame的情况下,您将需要为行和列混合使用标签和位置索引方法。

例如,考虑以下 DataFrame。怎样才能最好地切出到 'c'(含)为止的行,并取前四列?

>>> df = pd.DataFrame(np.nan, 
                      index=list('abcde'),
                      columns=['x','y','z', 8, 9])
>>> df
    x   y   z   8   9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

在早期版本的 pandas(0.20.0 之前)中,ix 可以让您非常简洁地完成此操作:我们可以按标签对行切片,按位置对列切片(请注意,对于列,由于 4 不是列名,ix 会默认使用基于位置的切片):

>>> df.ix[:'c', :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

在更高版本的熊猫中,我们可以使用iloc并借助另一种方法来获得此结果:

>>> df.iloc[:df.index.get_loc('c') + 1, :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

get_loc() 是一种索引方法,意思是“获取该标签在此索引中的位置”。请注意,由于 iloc 切片不包含端点,如果还想包含行 'c',就必须在该值上加 1。

此处的熊猫文档中还有其他示例。

Note: in pandas version 0.20.0 and above, ix is deprecated and the use of loc and iloc is encouraged instead. I have left the parts of this answer that describe ix intact as a reference for users of earlier versions of pandas. Examples have been added below showing alternatives to ix.


First, here’s a recap of the three methods:

  • loc gets rows (or columns) with particular labels from the index.
  • iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
  • ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index.

It’s important to note some subtleties that can make ix slightly tricky to use:

  • if the index is of integer type, ix will only use label-based indexing and not fall back to position-based indexing. If the label is not in the index, an error is raised.

  • if the index does not contain only integers, then given an integer, ix will immediately use position-based indexing rather than label-based indexing. If however ix is given another type (e.g. a string), it can use label-based indexing.


To illustrate the differences between the three methods, consider the following Series:

>>> s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> s
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN

We’ll look at slicing with the integer value 3.

In this case, s.iloc[:3] returns us the first 3 rows (since it treats 3 as a position) and s.loc[:3] returns us the first 8 rows (since it treats 3 as a label):

>>> s.iloc[:3] # slice the first three rows
49   NaN
48   NaN
47   NaN

>>> s.loc[:3] # slice up to and including label 3
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN

>>> s.ix[:3] # the integer is in the index so s.ix[:3] works like loc
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN

Notice s.ix[:3] returns the same Series as s.loc[:3] since it looks for the label first rather than working on the position (and the index for s is of integer type).

What if we try with an integer label that isn’t in the index (say 6)?

Here s.iloc[:6] returns the first 6 rows of the Series as expected. However, s.loc[:6] raises a KeyError since 6 is not in the index.

>>> s.iloc[:6]
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN

>>> s.loc[:6]
KeyError: 6

>>> s.ix[:6]
KeyError: 6

As per the subtleties noted above, s.ix[:6] now raises a KeyError because it tries to work like loc but can’t find a 6 in the index. Because our index is of integer type ix doesn’t fall back to behaving like iloc.

If, however, our index was of mixed type, given an integer ix would behave like iloc immediately instead of raising a KeyError:

>>> s2 = pd.Series(np.nan, index=['a','b','c','d','e', 1, 2, 3, 4, 5])
>>> s2.index.is_mixed() # index is mix of different types
True
>>> s2.ix[:6] # now behaves like iloc given integer
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
1   NaN

Keep in mind that ix can still accept non-integers and behave like loc:

>>> s2.ix[:'c'] # behaves like loc given non-integer
a   NaN
b   NaN
c   NaN

As general advice, if you’re only indexing using labels, or only indexing using integer positions, stick with loc or iloc to avoid unexpected results – try not to use ix.


Combining position-based and label-based indexing

Sometimes given a DataFrame, you will want to mix label and positional indexing methods for the rows and columns.

For example, consider the following DataFrame. How best to slice the rows up to and including ‘c’ and take the first four columns?

>>> df = pd.DataFrame(np.nan, 
                      index=list('abcde'),
                      columns=['x','y','z', 8, 9])
>>> df
    x   y   z   8   9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN

In earlier versions of pandas (before 0.20.0) ix lets you do this quite neatly – we can slice the rows by label and the columns by position (note that for the columns, ix will default to position-based slicing since 4 is not a column name):

>>> df.ix[:'c', :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

In later versions of pandas, we can achieve this result using iloc and the help of another method:

>>> df.iloc[:df.index.get_loc('c') + 1, :4]
    x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

get_loc() is an index method meaning “get the position of the label in this index”. Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row ‘c’ as well.

There are further examples in pandas’ documentation here.
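
反过来,也可以把列的位置转换成标签,然后全部交给 loc 处理(一个小示意,沿用上面的 df):

df.loc[:'c', df.columns[:4]]
#     x   y   z   8
# a NaN NaN NaN NaN
# b NaN NaN NaN NaN
# c NaN NaN NaN NaN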


回答 1

iloc 基于整数位置工作。因此,无论行标签是什么,您总是可以这样获取(例如)第一行:

df.iloc[0]

或最后五行

df.iloc[-5:]

您也可以在列上使用它。这将检索第三列:

df.iloc[:, 2]    # the : in the first position indicates all rows

您可以将它们结合起来以获得行和列的交集:

df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)

另一方面,.loc使用命名索引。让我们设置一个带有字符串作为行和列标签的数据框:

df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])

然后我们可以得到第一行

df.loc['a']     # equivalent to df.iloc[0]

并这样获取 'date' 列从 'b' 开始的后两行:

df.loc['b':, 'date']   # equivalent to df.iloc[1:, 1]

等等。现在,值得指出的是,DataFrame 的默认行列索引是从 0 开始的整数,在这种情况下 iloc 和 loc 的工作方式相同。这就是为什么您的三个示例是等效的。如果您的索引不是数字(例如字符串或日期时间),df.loc[:5] 就会引发错误。

另外,您还可以只通过数据框的 __getitem__ 来检索列:

df['time']    # equivalent to df.loc[:, 'time']

现在假设您想混合使用位置索引和命名索引,即对行用名称、对列用位置进行索引(澄清一下,我指的是从数据框中做选择,而不是创建一个行索引为字符串、列索引为整数的数据框)。这就是 .ix 的用武之地:

df.ix[:2, 'time']    # the first two rows of the 'time' column

我认为也值得一提的是,您也可以将布尔向量传递给该loc方法。例如:

 b = [True, False, True]
 df.loc[b] 

将返回 df 的第一行和第三行。就选择而言,这等效于 df[b],但布尔向量也可以用来赋值:

df.loc[b, 'name'] = 'Mary', 'John'

iloc works based on integer positioning. So no matter what your row labels are, you can always, e.g., get the first row by doing

df.iloc[0]

or the last five rows by doing

df.iloc[-5:]

You can also use it on the columns. This retrieves the 3rd column:

df.iloc[:, 2]    # the : in the first position indicates all rows

You can combine them to get intersections of rows and columns:

df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)

On the other hand, .loc use named indices. Let’s set up a data frame with strings as row and column labels:

df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])

Then we can get the first row by

df.loc['a']     # equivalent to df.iloc[0]

and the second two rows of the 'date' column by

df.loc['b':, 'date']   # equivalent to df.iloc[1:, 1]

and so on. Now, it’s probably worth pointing out that the default row and column indices for a DataFrame are integers from 0 and in this case iloc and loc would work in the same way. This is why your three examples are equivalent. If you had a non-numeric index such as strings or datetimes, df.loc[:5] would raise an error.

Also, you can do column retrieval just by using the data frame’s __getitem__:

df['time']    # equivalent to df.loc[:, 'time']

Now suppose you want to mix position and named indexing, that is, indexing using names on rows and positions on columns (to clarify, I mean select from our data frame, rather than creating a data frame with strings in the row index and integers in the column index). This is where .ix comes in:

df.ix[:2, 'time']    # the first two rows of the 'time' column

I think it’s also worth mentioning that you can pass boolean vectors to the loc method as well. For example:

 b = [True, False, True]
 df.loc[b] 

Will return the 1st and 3rd rows of df. This is equivalent to df[b] for selection, but it can also be used for assigning via boolean vectors:

df.loc[b, 'name'] = 'Mary', 'John'
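
由于 ix 已被弃用,上面混合行位置与列名的例子,在新版 pandas 中可以这样改写(示意):

df.iloc[:2, df.columns.get_loc('time')]   # 'time' 列的前两行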

回答 2

在我看来,被采纳的答案令人困惑,因为它使用的 DataFrame 全是缺失值。我也不喜欢用“基于位置”来描述 .iloc,而更喜欢“整数位置”,因为后者更具描述性,也正是 .iloc 的含义所在。关键词是 INTEGER:.iloc 需要整数。

请参阅我关于子集选择的非常详细的博客系列,以了解更多信息


.ix已弃用且含糊不清,切勿使用

由于.ix已弃用,因此我们仅关注.loc和之间的差异.iloc

在讨论差异之前,重要的是要了解DataFrames具有标签,这些标签可帮助标识每个列和每个索引。让我们看一个示例DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

           age  color    food  height  score state
Jane        30   blue   Steak     165    4.6    NY
Nick         2  green    Lamb      70    8.3    TX
Aaron       12    red   Mango     120    9.0    FL
Penelope     4  white   Apple      80    3.3    AL
Dean        32   gray  Cheese     180    1.8    AK
Christina   33  black   Melon     172    9.5    TX
Cornelia    69    red   Beans     150    2.2    TX

输出中的行名和列名都是标签。标签 age、color、food、height、score 和 state 用于列;另一些标签 Jane、Nick、Aaron、Penelope、Dean、Christina、Cornelia 用于索引。


在DataFrame中选择特定行的主要方法是使用.loc.iloc索引器。这些索引器中的每一个也可以用于同时选择列,但是现在只关注行更容易。同样,每个索引器都使用紧跟其名称的一组括号进行选择。

.loc仅通过标签选择数据

我们首先讨论 .loc 索引器,它只通过索引或列标签选择数据。在示例 DataFrame 中,我们为索引提供了有意义的名称。很多 DataFrame 没有任何有意义的名称,而是默认使用 0 到 n-1 的整数,其中 n 是 DataFrame 的长度。

您可以对 .loc 使用三种不同的输入:

  • 一个字符串
  • 字符串列表
  • 以字符串作为起始值和终止值的切片符号

用带字符串的.loc选择单行

要选择一行数据,请把索引标签放在 .loc 后面的方括号内。

df.loc['Penelope']

这会把该行数据作为 Series 返回:

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

使用.loc与字符串列表选择多行

df.loc[['Cornelia', 'Jane', 'Dean']]

这将返回一个 DataFrame,其中的数据行按列表中指定的顺序排列:

          age color    food  height  score state
Cornelia   69   red   Beans     150    2.2    TX
Jane       30  blue   Steak     165    4.6    NY
Dean       32  gray  Cheese     180    1.8    AK

使用带有切片符号的.loc选择多行

切片符号由起始值、终止值和步长定义。按标签切片时,pandas 会把终止值包含在返回结果中。以下切片从 Aaron 到 Dean(含两端),步长未显式指定,默认为 1。

df.loc['Aaron':'Dean']

          age  color    food  height  score state
Aaron      12    red   Mango     120    9.0    FL
Penelope    4  white   Apple      80    3.3    AL
Dean       32   gray  Cheese     180    1.8    AK

可以采用与Python列表相同的方式获取复杂的切片。

.iloc仅按整数位置选择数据

现在转到 .iloc。DataFrame 中的每一行和每一列都有一个定义它的整数位置,这是对输出中直观显示的标签的补充。整数位置就是从顶部/左侧数起、从 0 开始的行/列序号。

您可以对 .iloc 使用三种不同的输入:

  • 一个整数
  • 整数列表
  • 使用整数作为起始值和终止值的切片符号

用带整数的.iloc选择单行

df.iloc[4]

这将以 Series 形式返回第 5 行(整数位置 4):

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

用.iloc选择带有整数列表的多行

df.iloc[[2, -2]]

这将返回第三行和倒数第二行的DataFrame:

           age  color   food  height  score state
Aaron       12    red  Mango     120    9.0    FL
Christina   33  black  Melon     172    9.5    TX

使用带切片符号的.iloc选择多行

df.iloc[:5:3]

          age  color   food  height  score state
Jane       30   blue  Steak     165    4.6    NY
Penelope    4  white  Apple      80    3.3    AL


使用.loc和.iloc同时选择行和列

.loc/.iloc 的一项出色能力是可以同时选择行和列。在上面的示例中,每次选择都返回了所有列。我们可以用与选择行相同类型的输入来选择列,只需用逗号把行选择和列选择分隔开即可。

例如,我们可以选择 Jane 和 Dean 两行,并且只取 height、score 和 state 列,像这样:

df.loc[['Jane', 'Dean'], 'height':]

      height  score state
Jane     165    4.6    NY
Dean     180    1.8    AK

这对行使用标签列表,对列使用切片符号

我们自然也可以只用整数,通过 .iloc 完成类似的操作。

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

带标签和整数位置的同时选择

.ix 曾被用来同时以标签和整数位置进行选择,这很有用,但有时令人困惑、模棱两可,所幸它已被弃用。如果您需要混合使用标签和整数位置进行选择,就必须把它们统一转换为标签,或统一转换为整数位置。

例如,如果我们想选择 Nick 和 Cornelia 两行以及第 2、4 列,可以先把整数转换为标签,再使用 .loc:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

或者,也可以用索引的 get_loc 方法把索引标签转换为整数。

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

布尔选择

.loc索引器还可以进行布尔选择。例如,如果我们有兴趣查找年龄在30岁以上的所有行,并仅返回foodscore列,则可以执行以下操作:

df.loc[df['age'] > 30, ['food', 'score']] 

您可以用 .iloc 复现这一操作,但不能直接给它传布尔 Series,必须先把布尔 Series 转换成 numpy 数组,像这样:

df.iloc[(df['age'] > 30).values, [2, 4]] 

选择所有行

也可以只用 .loc/.iloc 做列选择。您可以像下面这样用冒号选中所有行:

df.loc[:, 'color':'score':2]

           color  height
Jane        blue     165
Nick       green      70
Aaron        red     120
Penelope   white      80
Dean        gray     180
Christina  black     172
Cornelia     red     150


索引运算符[]可以选择行和列,但不能同时选择。

大多数人都熟悉 DataFrame 索引运算符的主要用途,即选择列。字符串会选择单个列并返回 Series,而字符串列表会选择多个列并返回 DataFrame。

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

使用列表选择多个列

df[['food', 'score']]

             food  score
Jane        Steak    4.6
Nick         Lamb    8.3
Aaron       Mango    9.0
Penelope    Apple    3.3
Dean       Cheese    1.8
Christina   Melon    9.5
Cornelia    Beans    2.2

人们所不熟悉的是,当使用切片符号时,选择是通过行标签或整数位置进行的。这非常令人困惑,我几乎从未使用过,但是确实可以使用。

df['Penelope':'Christina'] # slice rows by label

           age  color    food  height  score state
Penelope     4  white   Apple      80    3.3    AL
Dean        32   gray  Cheese     180    1.8    AK
Christina   33  black   Melon     172    9.5    TX

df[2:6:2] # slice rows by integer location

       age color    food  height  score state
Aaron   12   red   Mango     120    9.0    FL
Dean    32  gray  Cheese     180    1.8    AK

强烈建议用 .loc/.iloc 来显式地选择行。单独的索引运算符无法同时选择行和列。

df[3:5, 'color']
TypeError: unhashable type: 'slice'

In my opinion, the accepted answer is confusing, since it uses a DataFrame with only missing values. I also do not like the term position-based for .iloc and instead, prefer integer location as it is much more descriptive and exactly what .iloc stands for. The key word is INTEGER – .iloc needs INTEGERS.

See my extremely detailed blog series on subset selection for more


.ix is deprecated and ambiguous and should never be used

Because .ix is deprecated we will only focus on the differences between .loc and .iloc.

Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each index. Let’s take a look at a sample DataFrame:

df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
                   'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height':[165, 70, 120, 80, 180, 172, 150],
                   'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

           age  color    food  height  score state
Jane        30   blue   Steak     165    4.6    NY
Nick         2  green    Lamb      70    8.3    TX
Aaron       12    red   Mango     120    9.0    FL
Penelope     4  white   Apple      80    3.3    AL
Dean        32   gray  Cheese     180    1.8    AK
Christina   33  black   Melon     172    9.5    TX
Cornelia    69    red   Beans     150    2.2    TX

The row and column names in the output are the labels. The labels age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used for the index.


The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.

.loc selects data only by labels

We will first talk about the .loc indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length of the DataFrame.

There are three different inputs you can use for .loc

  • A string
  • A list of strings
  • Slice notation using strings as the start and stop values

Selecting a single row with .loc with a string

To select a single row of data, place the index label inside of the brackets following .loc.

df.loc['Penelope']

This returns the row of data as a Series

age           4
color     white
food      Apple
height       80
score       3.3
state        AL
Name: Penelope, dtype: object

Selecting multiple rows with .loc with a list of strings

df.loc[['Cornelia', 'Jane', 'Dean']]

This returns a DataFrame with the rows in the order specified in the list:

          age color    food  height  score state
Cornelia   69   red   Beans     150    2.2    TX
Jane       30  blue   Steak     165    4.6    NY
Dean       32  gray  Cheese     180    1.8    AK

Selecting multiple rows with .loc with slice notation

Slice notation is defined by a start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined but defaulted to 1.

df.loc['Aaron':'Dean']

          age  color    food  height  score state
Aaron      12    red   Mango     120    9.0    FL
Penelope    4  white   Apple      80    3.3    AL
Dean       32   gray  Cheese     180    1.8    AK

Complex slices can be taken in the same manner as Python lists.

.iloc selects data only by integer location

Let’s now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.

There are three different inputs you can use for .iloc

  • An integer
  • A list of integers
  • Slice notation using integers as the start and stop values

Selecting a single row with .iloc with an integer

df.iloc[4]

This returns the 5th row (integer location 4) as a Series

age           32
color       gray
food      Cheese
height       180
score        1.8
state         AK
Name: Dean, dtype: object

Selecting multiple rows with .iloc with a list of integers

df.iloc[[2, -2]]

This returns a DataFrame of the third and second to last rows:

           age  color   food  height  score state
Aaron       12    red  Mango     120    9.0    FL
Christina   33  black  Melon     172    9.5    TX

Selecting multiple rows with .iloc with slice notation

df.iloc[:5:3]

          age  color   food  height  score state
Jane       30   blue  Steak     165    4.6    NY
Penelope    4  white  Apple      80    3.3    AL


Simultaneous selection of rows and columns with .loc and .iloc

One excellent ability of both .loc/.iloc is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.

For example, we can select rows Jane, and Dean with just the columns height, score and state like this:

df.loc[['Jane', 'Dean'], 'height':]

      height  score state
Jane     165    4.6    NY
Dean     180    1.8    AK

This uses a list of labels for the rows and slice notation for the columns

We can naturally do similar operations with .iloc using only integers.

df.iloc[[1,4], 2]
Nick      Lamb
Dean    Cheese
Name: food, dtype: object

Simultaneous selection with labels and integer location

.ix was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.

For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:

col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names] 

Or alternatively, convert the index labels to integers with the get_loc index method.

labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]

Boolean Selection

The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and return just the food and score columns we can do the following:

df.loc[df['age'] > 30, ['food', 'score']] 

You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:

df.iloc[(df['age'] > 30).values, [2, 4]] 

Selecting all rows

It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:

df.loc[:, 'color':'score':2]

           color  height
Jane        blue     165
Nick       green      70
Aaron        red     120
Penelope   white      80
Dean        gray     180
Christina  black     172
Cornelia     red     150


The indexing operator, [], can select rows and columns too but not simultaneously.

Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.

df['food']

Jane          Steak
Nick           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: food, dtype: object

Using a list selects multiple columns

df[['food', 'score']]

             food  score
Jane        Steak    4.6
Nick         Lamb    8.3
Aaron       Mango    9.0
Penelope    Apple    3.3
Dean       Cheese    1.8
Christina   Melon    9.5
Cornelia    Beans    2.2

What people are less familiar with, is that, when slice notation is used, then selection happens by row labels or by integer location. This is very confusing and something that I almost never use but it does work.

df['Penelope':'Christina'] # slice rows by label

           age  color    food  height  score state
Penelope     4  white   Apple      80    3.3    AL
Dean        32   gray  Cheese     180    1.8    AK
Christina   33  black   Melon     172    9.5    TX

df[2:6:2] # slice rows by integer location

       age color    food  height  score state
Aaron   12   red   Mango     120    9.0    FL
Dean    32  gray  Cheese     180    1.8    AK

The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.

df[3:5, 'color']
TypeError: unhashable type: 'slice'

如何处理熊猫中的SettingWithCopyWarning?

问题:如何处理熊猫中的SettingWithCopyWarning?

背景

我刚刚将熊猫从0.11升级到0.13.0rc1。现在,该应用程序弹出了许多新警告。其中之一是这样的:

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE

我想知道到底是什么意思?我需要改变什么吗?

如果我坚持要用 quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE 这种写法,该如何抑制这个警告?

产生错误的功能

def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""

    from cStringIO import StringIO

    str_of_all = "".join(list_of_150_stk_str)

    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT']     = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

    return quote_df

更多错误讯息

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

Background

I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now, the application is popping out many new warnings. One of them like this:

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE

I want to know what exactly it means? Do I need to change something?

How should I suspend the warning if I insist to use quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?

The function that gives errors

def _decode_stock_quote(list_of_150_stk_str):
    """decode the webpage and return dataframe"""

    from cStringIO import StringIO

    str_of_all = "".join(list_of_150_stk_str)

    quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
    quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
    quote_df['TClose'] = quote_df['TPrice']
    quote_df['RT']     = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
    quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
    quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
    quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
    quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
    quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

    return quote_df

More error messages

E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TVol']   = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TAmt']   = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  quote_df['TDate']  = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

回答 0

创建 SettingWithCopyWarning 是为了标记可能令人困惑的“链式”赋值,比如下面这种写法,它并不总是按预期工作,尤其是当第一次选择返回的是副本时。[背景讨论请参见 GH5390 和 GH5597。]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df

该警告提出了如下重写建议:

df.loc[df['A'] > 2, 'B'] = new_val

但是,这不适合您的用法,相当于:

df = df[df['A'] > 2]
df['B'] = new_val

显然,您并不关心写操作是否会传回原始 DataFrame(因为您正在覆盖对它的引用),但不幸的是,这种模式无法与第一个链式赋值示例区分开,因此出现了(误报的)警告。如果您想深入了解,索引相关文档中讨论了误报的可能性。您可以通过以下赋值安全地禁用这个新警告。

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

The SettingWithCopyWarning was created to flag potentially confusing “chained” assignments, such as the following, which does not always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]

df[df['A'] > 2]['B'] = new_val  # new_val not set in df

The warning offers a suggestion to rewrite as follows:

df.loc[df['A'] > 2, 'B'] = new_val

However, this doesn’t fit your usage, which is equivalent to:

df = df[df['A'] > 2]
df['B'] = new_val

While it’s clear that you don’t care about writes making it back to the original frame (since you are overwriting the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example. Hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you’d like to read further. You can safely disable this new warning with the following assignment.

import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
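
就提问者的具体函数而言,另一种让警告消失的思路(仅为示意;沿用了提问者代码里的 .ix,新版 pandas 应改用 .iloc)是在切片之后显式复制一份,确保后续赋值作用在独立的对象上:

quote_df = quote_df.ix[:, [0,3,2,1,4,5,8,9,30,31]].copy()
# 之后的 quote_df['TClose'] = ... 等赋值就不会再触发 SettingWithCopyWarning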

回答 1

如何应对 Pandas 中的 SettingWithCopyWarning?

这篇文章的读者对象是:

  1. 想了解此警告的含义
  2. 想了解抑制此警告的不同方法
  3. 想了解如何改进其代码并遵循良好做法,以避免将来出现此警告。

设定

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
   A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1

什么是SettingWithCopyWarning

要知道如何处理此警告,重要的是要理解它的含义以及为什么首先提出它。

过滤 DataFrame 时,对其切片/索引可能返回视图(view),也可能返回副本(copy),具体取决于内部布局和各种实现细节。顾名思义,“视图”是对原始数据的一个视图,因此修改视图可能会改动原始对象。另一方面,“副本”是对原始数据的复制,修改副本不会影响原始数据。

如其他答案所述,创建 SettingWithCopyWarning 就是为了标记“链式赋值”操作。考虑上面设置中的 df。假设您想选择“A”列的值 > 5 的那些行在“B”列中的值。Pandas 允许您用多种方式做到这一点,其中有些写法比另一些更正确。例如,

df[df.A > 5]['B']

1    3
2    6
Name: B, dtype: int64

和,

df.loc[df.A > 5, 'B']

1    3
2    6
Name: B, dtype: int64

这些写法返回相同的结果,因此如果您只是读取这些值,就没有区别。那么问题出在哪里?链式赋值的问题在于,通常很难预测返回的是视图还是副本,所以当您尝试把值赋回去时,这基本上就成了问题。基于前面的示例,请考虑解释器如何执行这段代码:

df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)

它只对 df 做一次 __setitem__ 调用。另一方面,请考虑这段代码:

df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)

现在,取决于 __getitem__ 返回的是视图还是副本,__setitem__ 操作可能不起作用。

一般而言,您应将 loc 用于基于标签的赋值,将 iloc 用于基于整数/位置的赋值,因为规范保证它们始终在原始对象上操作。此外,设置单个单元格时应使用 at 和 iat。

可以在文档中找到更多信息

注意:
所有用 loc 完成的布尔索引操作也都可以用 iloc 完成。唯一的区别是,iloc 对行索引期望整数/位置或由布尔值组成的 numpy 数组,对列则期望整数/位置索引。

例如,

df.loc[df.A > 5, 'B'] = 4

可以写成

df.iloc[(df.A > 5).values, 1] = 4

和,

df.loc[1, 'A'] = 100

可以写成

df.iloc[1, 0] = 100

等等。


告诉我如何抑制警告!

考虑对 df 的“A”列做一个简单操作。选择“A”并除以 2 会发出警告,但操作本身会成功。

df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

df2
     A
0  2.5
1  4.5
2  3.5

有两种方法可以直接静默此警告:

  1. 做一个 deepcopy

    df2 = df[['A']].copy(deep=True)
    df2['A'] /= 2
  2. 更改pd.options.mode.chained_assignment
    可以设置为 None、"warn" 或 "raise"。"warn" 是默认值。None 会完全抑制警告,而 "raise" 会抛出 SettingWithCopyError,阻止操作执行。

    pd.options.mode.chained_assignment = None
    df2['A'] /= 2

@Peter Cotton 在评论中提出了一种不错的方法:用上下文管理器以非侵入的方式更改模式(修改自这个要点 gist),只在需要时设置该模式,结束后再把它重置回原来的状态。

class ChainedAssignent:
    def __init__(self, chained=None):
        acceptable = [None, 'warn', 'raise']
        assert chained in acceptable, "chained must be in " + str(acceptable)
        self.swcw = chained

    def __enter__(self):
        self.saved_swcw = pd.options.mode.chained_assignment
        pd.options.mode.chained_assignment = self.swcw
        return self

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = self.saved_swcw

用法如下:

# some code here
with ChainedAssignent():
    df2['A'] /= 2
# more code follows

或者,引发异常

with ChainedAssignent(chained='raise'):
    df2['A'] /= 2

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

“XY 问题”:我到底做错了什么?

很多时候,用户试图寻找抑制此警告的方法,却没有完全理解它为什么会出现。这是 XY 问题的一个很好的例子:用户试图解决的问题“Y”,实际上只是根源问题“X”的症状。下面将以遇到此警告时的常见场景来提出问题,然后给出相应的解决方案。

问题1
我有一个DataFrame

df
       A  B  C  D  E
    0  5  0  3  3  7
    1  9  3  5  2  4
    2  7  6  8  8  1

我想把“A” > 5 的那些行的“A”值设为 1000。我的预期输出是:

      A  B  C  D  E
0     5  0  3  3  7
1  1000  3  5  2  4
2  1000  6  8  8  1

错误的方法:

df.A[df.A > 5] = 1000         # works, because df.A returns a view
df[df.A > 5]['A'] = 1000      # does not work
df.loc[df.A > 5]['A'] = 1000  # does not work

正确的方法是使用 loc:

df.loc[df.A > 5, 'A'] = 1000


问题 2 [1]
我正在尝试将单元格(1,’D’)中的值设置为12345。我的预期输出是

   A  B  C      D  E
0  5  0  3      3  7
1  9  3  5  12345  4
2  7  6  8      8  1

我尝试了多种访问此单元格的方法,例如 df['D'][1]。做这个的最好方式是什么?

1.这个问题与警告并不特别相关,但是最好了解如何正确执行此特定操作,以避免将来可能出现警告的情况。

您可以使用以下任何一种方法来执行此操作。

df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345


问题3
我试图根据某些条件对值进行子集化。我有一个DataFrame

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

我想在“C” == 5 的那些行里,把“D”的值设为 123。我尝试了:

df2.loc[df2.C == 5, 'D'] = 123

看起来不错,但我仍然收到 SettingWithCopyWarning!我该如何解决?

实际上,这可能是由管道中更靠前的代码造成的。您是否是从一个更大的 DataFrame 创建出 df2 的,例如:

df2 = df[df.A > 5]

?在这种情况下,布尔索引会返回一个视图,因此 df2 会引用原始数据。您需要做的是把一个副本赋给 df2:

df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]


问题4
我试图从下面的 DataFrame 中删除列“C”:

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

但是使用

df2.drop('C', axis=1, inplace=True)

抛出SettingWithCopyWarning。为什么会这样呢?

这是因为 df2 必定是通过另一个切片操作创建出来的视图,例如:

df2 = df[df.A > 5]

这里的解决方案同样是对 df 执行 copy(),或者像前面那样使用 loc。
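
即(示意):

df2 = df[df.A > 5].copy()
df2.drop('C', axis=1, inplace=True)   # 不再触发警告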

How to deal with SettingWithCopyWarning in Pandas?

This post is meant for readers who,

  1. Would like to understand what this warning means
  2. Would like to understand different ways of suppressing this warning
  3. Would like to understand how to improve their code and follow good practices to avoid this warning in the future.

Setup

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
   A  B  C  D  E
0  5  0  3  3  7
1  9  3  5  2  4
2  7  6  8  8  1

What is the SettingWithCopyWarning?

To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.

When filtering DataFrames, it is possible to slice/index a frame to return either a view, or a copy, depending on the internal layout and various implementation details. A “view” is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a “copy” is a replication of data from the original, and modifying the copy has no effect on the original.

As mentioned by other answers, the SettingWithCopyWarning was created to flag “chained assignment” operations. Consider df in the setup above. Suppose you would like to select all values in column “B” where values in column “A” is > 5. Pandas allows you to do this in different ways, some more correct than others. For example,

df[df.A > 5]['B']

1    3
2    6
Name: B, dtype: int64

And,

df.loc[df.A > 5, 'B']

1    3
2    6
Name: B, dtype: int64

These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:

df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)

With a single __setitem__ call to df. OTOH, consider this code:

df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)

Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.

In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.

More can be found in the documentation.

Note
All boolean indexing operations done with loc can also be done with iloc. The only difference is that iloc expects either integers/positions for index or a numpy array of boolean values, and integer/position indexes for the columns.

For example,

df.loc[df.A > 5, 'B'] = 4

Can be written as

df.iloc[(df.A > 5).values, 1] = 4

And,

df.loc[1, 'A'] = 100

Can be written as

df.iloc[1, 0] = 100

And so on.


Just tell me how to suppress the warning!

Consider a simple operation on the “A” column of df. Selecting “A” and dividing by 2 will raise the warning, but the operation will work.

df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

df2
     A
0  2.5
1  4.5
2  3.5

There are a couple ways of directly silencing this warning:

  1. Make a deepcopy

    df2 = df[['A']].copy(deep=True)
    df2['A'] /= 2
    
  2. Change pd.options.mode.chained_assignment
    Can be set to None, "warn", or "raise". "warn" is the default. None will suppress the warning entirely, and "raise" will throw a SettingWithCopyError, preventing the operation from going through.

    pd.options.mode.chained_assignment = None
    df2['A'] /= 2
    

@Peter Cotton, in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager, to set the mode only as long as it is required, and then reset it back to the original state when finished.

class ChainedAssignent:
    def __init__(self, chained=None):
        acceptable = [None, 'warn', 'raise']
        assert chained in acceptable, "chained must be in " + str(acceptable)
        self.swcw = chained

    def __enter__(self):
        self.saved_swcw = pd.options.mode.chained_assignment
        pd.options.mode.chained_assignment = self.swcw
        return self

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = self.saved_swcw

The usage is as follows:

# some code here
with ChainedAssignent():
    df2['A'] /= 2
# more code follows

Or, to raise the exception

with ChainedAssignent(chained='raise'):
    df2['A'] /= 2

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The “XY Problem”: What am I doing wrong?

A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem “Y” that is actually a symptom of a deeper rooted problem “X”. Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.

Question 1
I have a DataFrame

df
       A  B  C  D  E
    0  5  0  3  3  7
    1  9  3  5  2  4
    2  7  6  8  8  1

I want to assign values in col “A” > 5 to 1000. My expected output is

      A  B  C  D  E
0     5  0  3  3  7
1  1000  3  5  2  4
2  1000  6  8  8  1

Wrong way to do this:

df.A[df.A > 5] = 1000         # works, because df.A returns a view
df[df.A > 5]['A'] = 1000      # does not work
df.loc[df.A > 5]['A'] = 1000  # does not work

Right way using loc:

df.loc[df.A > 5, 'A'] = 1000


Question 2 [1]
I am trying to set the value in cell (1, ‘D’) to 12345. My expected output is

   A  B  C      D  E
0  5  0  3      3  7
1  9  3  5  12345  4
2  7  6  8      8  1

I have tried different ways of accessing this cell, such as df['D'][1]. What is the best way to do this?

[1] This question isn’t specifically related to the warning, but it is good to understand how to do this particular operation correctly so as to avoid situations where the warning could potentially arise in future.

You can use any of the following methods to do this.

df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345


Question 3
I am trying to subset values based on some condition. I have a DataFrame

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

I would like to assign values in “D” to 123 such that “C” == 5. I tried

df2.loc[df2.C == 5, 'D'] = 123

Which seems fine but I am still getting the SettingWithCopyWarning! How do I fix this?

This is actually probably because of code higher up in your pipeline. Did you create df2 from something larger, like

df2 = df[df.A > 5]

? In this case, boolean indexing will return a view, so df2 will reference the original. What you’d need to do is assign df2 to a copy:

df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]


Question 4
I’m trying to drop column “C” in-place from

   A  B  C  D  E
1  9  3  5  2  4
2  7  6  8  8  1

But using

df2.drop('C', axis=1, inplace=True)

Throws SettingWithCopyWarning. Why is this happening?

This is because df2 must have been created as a view from some other slicing operation, such as

df2 = df[df.A > 5]

The solution here is to either make a copy() of df, or use loc, as before.
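Concretely, with the df from the Setup section above, either of the following avoids the warning:

df2 = df[df.A > 5].copy()            # explicit copy, so in-place ops are safe
df2.drop('C', axis=1, inplace=True)

# or select the surviving columns with loc in one step
df2 = df.loc[df.A > 5, ['A', 'B', 'D', 'E']]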


回答 2

通常,SettingWithCopyWarning 的目的是向用户(尤其是新用户)表明,他们可能正在操作一个副本,而不是他们以为的原始对象。其中存在误报(换句话说,如果你清楚自己在做什么,这样做也可能没问题)。一种可能的做法就是按照 @Garrett 的建议,直接关闭这个(默认为 warn 的)警告。

这是另一个选择:

In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: dfa = df.ix[:, [1, 0]]

In [3]: dfa.is_copy
Out[3]: True

In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  #!/usr/local/bin/python

您可以将is_copy标志设置为False,以有效关闭该对象的检查:

In [5]: dfa.is_copy = False

In [6]: dfa['A'] /= 2

如果您明确复制,则不会发生进一步的警告:

In [7]: dfa = df.ix[:, [1, 0]].copy()

In [8]: dfa['A'] /= 2

OP 在上面展示的代码是合法的,也可能是我自己会写的代码,但从技术上讲,它确实属于此警告所针对的情形,而不是误报。避免警告的另一种方法是通过 reindex 来完成选择操作,例如

quote_df = quote_df.reindex(columns=['STK', ...])

要么,

quote_df = quote_df.reindex(['STK', ...], axis=1)  # v.0.21

In general the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not the original as they think. There are false positives (IOW if you know what you are doing it could be ok). One possibility is simply to turn off the (by default warn) warning as @Garrett suggests.

Here is another option:

In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [2]: dfa = df.ix[:, [1, 0]]

In [3]: dfa.is_copy
Out[3]: True

In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  #!/usr/local/bin/python

You can set the is_copy flag to False, which will effectively turn off the check, for that object:

In [5]: dfa.is_copy = False

In [6]: dfa['A'] /= 2

If you explicitly copy then no further warning will happen:

In [7]: dfa = df.ix[:, [1, 0]].copy()

In [8]: dfa['A'] /= 2

The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to not have the warning would be to do the selection operation via reindex, e.g.

quote_df = quote_df.reindex(columns=['STK', ...])

Or,

quote_df = quote_df.reindex(['STK', ...], axis=1)  # v.0.21

回答 3

熊猫数据框复制警告

当您去做这样的事情时:

quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

pandas.ix 在这种情况下将返回一个新的独立数据帧。

您决定在此数据框中更改的任何值都不会更改原始数据框。

这就是熊猫试图警告您的内容。


为什么 .ix是个坏主意

.ix 对象试图同时做不止一件事,对于任何读过整洁代码(clean code)相关内容的人来说,这都是一种强烈的坏味道。

给定此数据框:

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})

两种行为:

dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2

行为一:dfcopy现在是一个独立的数据框。改变它不会改变df

df.ix[0, "a"] = 3

行为二:更改原始数据框。


使用.loc替代

熊猫开发者(据推测)意识到 .ix 对象的设计不佳,因此创建了两个新对象来帮助数据的访问和赋值(另一个是 .iloc)。

.loc 速度更快,因为它不会尝试创建数据副本。

.loc 旨在就地修改现有数据框,从而提高内存效率。

.loc 是可预测的,它具有一种行为。


解决方案

在代码示例中,您正在执行的操作是加载一个包含许多列的大文件,然后将其修改为较小的文件。

pd.read_csv功能可以帮助您解决很多问题,还可以加快文件的加载速度。

所以不要这样做

quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

做这个

columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns

这只会读取您感兴趣的列,并给它们正确命名,无需使用“邪恶的”.ix 对象来做魔法般的操作。

Pandas dataframe copy warning

When you go and do something like this:

quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

pandas.ix in this case returns a new, stand alone dataframe.

Any values you decide to change in this dataframe, will not change the original dataframe.

This is what pandas tries to warn you about.


Why .ix is a bad idea

The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.

Given this dataframe:

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})

Two behaviors:

dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2

Behavior one: dfcopy is now a stand alone dataframe. Changing it will not change df

df.ix[0, "a"] = 3

Behavior two: This changes the original dataframe.


Use .loc instead

The pandas developers recognized that the .ix object was quite smelly[speculatively] and thus created two new objects which helps in the accession and assignment of data. (The other being .iloc)

.loc is faster, because it does not try to create a copy of the data.

.loc is meant to modify your existing dataframe inplace, which is more memory efficient.

.loc is predictable, it has one behavior.


The solution

What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.

The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.

So instead of doing this

quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]

Do this

columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns

This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.


回答 4

在这里,我直接回答这个问题。怎么处理呢?

切片之后做一次 .copy(deep=False)。参见 pandas.DataFrame.copy。

等等,切片不是返回副本吗?毕竟,这不正是警告消息想表达的意思吗?请阅读下面的详细解答:

import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})

这给出了警告:

df0 = df[df.x>2]
df0['foo'] = 'bar'

这不是:

df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'

两者df0df1都是DataFrame对象,但它们之间的某些不同之处使熊猫能够打印警告。让我们找出它是什么。

import inspect
slice= df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)

使用选择的差异工具,您将看到,除了几个地址之外,唯一的实质区别是:

|          | slice   | slice_copy |
| _is_copy | weakref | None       |

决定是否发出警告的方法是 DataFrame._check_setitem_copy,它检查的是 _is_copy。所以答案就在这里:做一次 copy,让您的 DataFrame 不再带有 _is_copy。

警告建议使用 .loc,但如果在一个带有 _is_copy 的数据帧上使用 .loc,您仍会收到同样的警告。误导吗?是的。烦人吗?当然。有帮助吗?在涉及链式赋值时或许有。但它无法准确检测链式赋值,会不加区分地打印警告。

Here I answer the question directly. How to deal with it?

Make a .copy(deep=False) after you slice. See pandas.DataFrame.copy.

Wait, doesn’t a slice return a copy? After all, this is what the warning message is attempting to say? Read the long answer:

import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})

This gives a warning:

df0 = df[df.x>2]
df0['foo'] = 'bar'

This does not:

df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'

Both df0 and df1 are DataFrame objects, but something about them is different that enables pandas to print the warning. Let’s find out what it is.

import inspect
slice= df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)

Using your diff tool of choice, you will see that beyond a couple of addresses, the only material difference is this:

|          | slice   | slice_copy |
| _is_copy | weakref | None       |

The method that decides whether to warn is DataFrame._check_setitem_copy which checks _is_copy. So here you go. Make a copy so that your DataFrame is not _is_copy.

The warning is suggesting to use .loc, but if you use .loc on a frame that _is_copy, you will still get the same warning. Misleading? Yes. Annoying? You bet. Helpful? Potentially, when chained assignment is used. But it cannot correctly detect chain assignment and prints the warning indiscriminately.


回答 5

这个话题在 Pandas 中确实容易让人困惑。幸运的是,它有一个相对简单的解决方案。

问题在于,并不总是清楚数据过滤操作(例如loc)是否返回DataFrame的副本或视图。因此,这种过滤后的DataFrame的进一步使用可能会造成混淆。

简单的解决方案是(除非您需要处理非常大的数据集):

每当需要更新任何值时,请始终确保在赋值之前显式地复制 DataFrame。

df  # Some DataFrame
df = df.loc[:, 0:2]  # Some filtering (unsure whether a view or copy is returned)
df = df.copy()  # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny"  # Assignment can be done now (no warning)

This topic is really confusing with Pandas. Luckily, it has a relatively simple solution.

The problem is that it is not always clear whether data filtering operations (e.g. loc) return a copy or a view of the DataFrame. Further use of such filtered DataFrame could therefore be confusing.

The simple solution is (unless you need to work with very large sets of data):

Whenever you need to update any values, always make sure that you explicitly copy the DataFrame before the assignment.

df  # Some DataFrame
df = df.loc[:, 0:2]  # Some filtering (unsure whether a view or copy is returned)
df = df.copy()  # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny"  # Assignment can be done now (no warning)


回答 6

为了消除任何疑问,我的解决方案是对切片做深拷贝,而不是常规拷贝。视您的具体情况,这种做法可能并不适用(内存限制/切片的大小、潜在的性能下降,特别是当复制像我的情况一样发生在循环中时,等等)。

需要明确的是,这是我收到的警告:

/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

插图

我怀疑警告是因为我在切片的副本上删除(drop)了一列而引发的。虽然从技术上讲这并不是在切片的副本中设置值,但它仍然是对切片副本的修改。以下是我为确认这一猜想而采取的(简化)步骤,希望能帮助那些试图理解这个警告的人。

示例1:在原始数据帧上删除一列会影响副本

我们已经知道这一点,但这是个有益的提醒。这并不是该警告所针对的情况。

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123


>> df2 = df1
>> df2

A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    B
0   121
1   122
2   123

可以避免让对 df1 的更改影响到 df2:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    A   B
0   111 121
1   112 122
2   113 123

示例2:在副本上删除一列可能会影响原始数据帧

这实际上说明了警告。

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> df2 = df1
>> df2

    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1

B
0   121
1   122
2   123

可以避免让对 df2 的更改影响到 df1:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2

A   B
0   111 121
1   112 122
2   113 123

>> df2.drop('A', axis=1, inplace=True)
>> df1

A   B
0   111 121
1   112 122
2   113 123

干杯!

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy. This may not be applicable depending on your context (Memory constraints / size of the slice, potential for performance degradation – especially if the copy occurs in a loop like it did for me, etc…)

To be clear, here is the warning I received:

/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Illustration

I had doubts that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice. Below are the (simplified) steps I have taken to confirm the suspicion; I hope it will help those of us who are trying to understand the warning.

Example 1: dropping a column on the original affects the copy

We knew that already but this is a healthy reminder. This is NOT what the warning is about.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123


>> df2 = df1
>> df2

A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    B
0   121
1   122
2   123

It is possible to prevent changes made on df1 from affecting df2:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
    A   B
0   111 121
1   112 122
2   113 123

Example 2: dropping a column on the copy may affect the original

This actually illustrates the warning.

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> df2 = df1
>> df2

    A   B
0   111 121
1   112 122
2   113 123

# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1

B
0   121
1   122
2   123

It is possible to prevent changes made on df2 from affecting df1:

>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1

    A   B
0   111 121
1   112 122
2   113 123

>> import copy
>> df2 = copy.deepcopy(df1)
>> df2

A   B
0   111 121
1   112 122
2   113 123

>> df2.drop('A', axis=1, inplace=True)
>> df1

A   B
0   111 121
1   112 122
2   113 123

Cheers!


回答 7

这样应该可行:

quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

This should work:

quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

回答 8

有些人可能想简单地消除警告:

class SupressSettingWithCopyWarning:
    def __enter__(self):
        pd.options.mode.chained_assignment = None

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = 'warn'

with SupressSettingWithCopyWarning():
    #code that produces warning

Some may want to simply suppress the warning:

class SupressSettingWithCopyWarning:
    def __enter__(self):
        pd.options.mode.chained_assignment = None

    def __exit__(self, *args):
        pd.options.mode.chained_assignment = 'warn'

with SupressSettingWithCopyWarning():
    #code that produces warning

回答 9

如果您已将切片分配给变量,并希望使用变量进行设置,如下所示:

df2 = df[df['A'] > 2]
df2['B'] = value

而且,由于计算 df2 的条件太长,或出于其他原因您不想使用 Jeff 的解决方案,那么您可以使用以下方法:

df.loc[df2.index.tolist(), 'B'] = value

df2.index.tolist() 返回df2中所有条目的索引,然后将这些索引用于设置原始数据帧中的B列。

If you have assigned the slice to a variable and want to set using the variable as in the following:

df2 = df[df['A'] > 2]
df2['B'] = value

And you do not want to use Jeff’s solution because your condition computing df2 is too long or for some other reason, then you can use the following:

df.loc[df2.index.tolist(), 'B'] = value

df2.index.tolist() returns the indices from all entries in df2, which will then be used to set column B in the original dataframe.
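As a side note, .loc accepts an Index object directly, so the tolist() call should not be necessary:

df.loc[df2.index, 'B'] = value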


回答 10

对我来说,此问题出现在下面的简化示例中。我也找到了解决办法(希望是一个正确的解决方案):

带有警告的旧代码:

def update_old_dataframe(old_dataframe, new_dataframe):
    for new_index, new_row in new_dataframe.iterrows():
        old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)

def update_row(old_row, new_row):
    for field in [list_of_columns]:
        # line with warning because of chain indexing old_dataframe[new_index][field]
        old_row[field] = new_row[field]  
    return old_row

这会在 old_row[field] = new_row[field] 这一行打印警告。

由于 update_row 方法中的行实际上是 Series 类型,因此我把这一行替换为:

old_row.at[field] = new_row.at[field]

即 Series 的访问/查找方法。尽管两种写法都能正常工作且结果相同,但这样我就不必禁用警告(= 把警告保留下来,用于发现其他地方的链式索引问题)。

我希望这可以帮助某人。

For me this issue occured in a following >simplified< example. And I was also able to solve it (hopefully with a correct solution):

old code with warning:

def update_old_dataframe(old_dataframe, new_dataframe):
    for new_index, new_row in new_dataframe.iterrows():
        old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)

def update_row(old_row, new_row):
    for field in [list_of_columns]:
        # line with warning because of chain indexing old_dataframe[new_index][field]
        old_row[field] = new_row[field]  
    return old_row

This printed the warning for the line old_row[field] = new_row[field]

Since the rows in update_row method are actually type Series, I replaced the line with:

old_row.at[field] = new_row.at[field]

i.e. the method for accessing/lookups on a Series. Even though both work just fine and the result is the same, this way I don’t have to disable the warnings (= keep them for other chained indexing issues somewhere else).

I hope this may help someone.
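A self-contained sketch of that .at-on-a-Series idea (the labels and values here are made up for illustration):

import pandas as pd

old_row = pd.Series({'a': 1, 'b': 2})      # what old_dataframe.loc[new_index] hands you
new_row = pd.Series({'a': 10, 'b': 20})

for field in ['a', 'b']:
    old_row.at[field] = new_row.at[field]  # label-based scalar access on a Series

print(old_row.tolist())                    # [10, 20]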


回答 11

我相信您可以像下面这样完全避免这个问题:

return (
    pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    .rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'})  # no inplace=True: it would return None and break the chain
    .ix[:,[0,3,2,1,4,5,8,9,30,31]]
    .assign(
        TClose=lambda df: df['TPrice'],
        RT=lambda df: 100 * (df['TPrice']/df['TPCLOSE'] - 1),
        TVol=lambda df: df['TVol']/TVOL_SCALE,
        TAmt=lambda df: df['TAmt']/TAMT_SCALE,
        STK_ID=lambda df: df['STK'].str.slice(13,19),
        STK_Name=lambda df: df['STK'].str.slice(21,30),  #.decode('gb2312')
        TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
    )
)

使用 assign。文档中写道:将新列赋给 DataFrame,返回一个新对象(副本),其中除了新列之外还包含所有原始列。

参见 Tom Augspurger 关于 pandas 方法链(method chaining)的文章:https://tomaugspurger.github.io/method-chaining

You could avoid the whole problem like this, I believe:

return (
    pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
    .rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'})  # no inplace=True: it would return None and break the chain
    .ix[:,[0,3,2,1,4,5,8,9,30,31]]
    .assign(
        TClose=lambda df: df['TPrice'],
        RT=lambda df: 100 * (df['TPrice']/df['TPCLOSE'] - 1),
        TVol=lambda df: df['TVol']/TVOL_SCALE,
        TAmt=lambda df: df['TAmt']/TAMT_SCALE,
        STK_ID=lambda df: df['STK'].str.slice(13,19),
        STK_Name=lambda df: df['STK'].str.slice(21,30),  #.decode('gb2312')
        TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
    )
)

Using Assign. From the documentation: Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

See Tom Augspurger’s article on method chaining in pandas: https://tomaugspurger.github.io/method-chaining
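Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so on current versions the positional-selection step needs .iloc instead. A minimal self-contained sketch of the same read-rename-select-assign chain, with invented file contents and column names:

import pandas as pd
from io import StringIO

csv_text = "1,foo,10\n2,bar,20\n"  # stand-in for the OP's str_of_all

out = (
    pd.read_csv(StringIO(csv_text), sep=',', names=list('ABC'))
    .rename(columns={'A': 'ID', 'B': 'Name', 'C': 'Value'})
    .iloc[:, [0, 2]]                            # positional column selection replaces .ix
    .assign(Double=lambda df: df['Value'] * 2)  # returns a copy, so no warning
)
print(out)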


回答 12

后续初学者问题/备注

也许可以为像我这样的其他初学者做个澄清(我来自 R,它在底层的工作方式似乎有所不同)。下面这段看起来无害且功能正常的代码不断产生 SettingWithCopy 警告,而我一直想不明白为什么。我已经阅读并理解了“链式索引”的问题,但我的代码中并不包含任何链式索引:

def plot(pdb, df, title, **kw):
    df['target'] = (df['ogg'] + df['ugg']) / 2
    # ...

但是后来,太晚了,我查看了plot()函数的调用位置:

    df = data[data['anz_emw'] > 0]
    pixbuf = plot(pdb, df, title)

因此,“df”并不只是一个数据帧,而是一个以某种方式记得自己是通过索引另一个数据帧创建出来的对象(所以它是一个视图?),这将使 plot() 中的这一行

 df['target'] = ...

相当于

 data[data['anz_emw'] > 0]['target'] = ...

这就是链式索引。我理解得对吗?

无论如何,

def plot(pdb, df, title, **kw):
    df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2

解决了问题。

Followup beginner question / remark

Maybe a clarification for other beginners like me (I come from R which seems to work a bit differently under the hood). The following harmless-looking and functional code kept producing the SettingWithCopy warning, and I couldn’t figure out why. I had both read and understood the issues with “chained indexing”, but my code doesn’t contain any:

def plot(pdb, df, title, **kw):
    df['target'] = (df['ogg'] + df['ugg']) / 2
    # ...

But then, later, much too late, I looked at where the plot() function is called:

    df = data[data['anz_emw'] > 0]
    pixbuf = plot(pdb, df, title)

So “df” isn’t a data frame but an object that somehow remembers that it was created by indexing a data frame (so is that a view?) which would make the line in plot()

 df['target'] = ...

equivalent to

 data[data['anz_emw'] > 0]['target'] = ...

which is a chained indexing. Did I get that right?

Anyway,

def plot(pdb, df, title, **kw):
    df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2

fixed it.


回答 13

由于这个问题已在现有答案中得到充分的解释和讨论,这里我只提供一种简洁的 pandas 上下文管理器做法,使用 pandas.option_context(指向文档和示例的链接),完全不需要创建带有各种 dunder 方法和其他花哨功能的自定义类。

首先,上下文管理器代码本身:

from contextlib import contextmanager

@contextmanager
def SuppressPandasWarning():
    with pd.option_context("mode.chained_assignment", None):
        yield

再举一个例子:

import pandas as pd
from string import ascii_letters

a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})

mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask]  # .copy(deep=False)

# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2

# Does not!
with SuppressPandasWarning():
    b["B"] = b["B"] * 2

值得一提的是,这两种方法都没有修改 a,这让我有点惊讶;甚至用 .copy(deep=False) 得到的 df 浅拷贝也足以阻止此警告的产生(按我的理解,浅拷贝至少应该会连带修改 a,但它并没有。pandas 的魔法。)。

As this question is already fully explained and discussed in existing answers I will just provide a neat pandas approach to the context manager using pandas.option_context (links to docs and example) – there is absolutely no need to create a custom class with all the dunder methods and other bells and whistles.

First the context manager code itself:

from contextlib import contextmanager

@contextmanager
def SuppressPandasWarning():
    with pd.option_context("mode.chained_assignment", None):
        yield

Then an example:

import pandas as pd
from string import ascii_letters

a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})

mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask]  # .copy(deep=False)

# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2

# Does not!
with SuppressPandasWarning():
    b["B"] = b["B"] * 2

Worth noticing is that both approaches do not modify a, which is a bit surprising to me, and even a shallow df copy with .copy(deep=False) would prevent this warning from being raised (as far as I understand a shallow copy should at least modify a as well, but it doesn’t. pandas magic.).
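In fact, pandas.option_context is itself already a context manager, so for a one-off suppression the wrapper above can be skipped and it can be used directly (reusing b from the example above):

import pandas as pd

with pd.option_context("mode.chained_assignment", None):
    b["B"] = b["B"] * 2  # warning suppressed only inside this block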


回答 14

当我在一个用 .query() 方法得到的已有数据帧上,通过 .apply() 为新数据帧赋值时,一直遇到这个问题。例如:

prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

就会返回此错误。在这种情况下,似乎可以解决该错误的改法是:

prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

但是,由于必须进行新的复制,因此这在使用大型数据帧时效率不高。

如果您是在用 .apply() 方法生成新列及其值,那么一个既能解决错误、效率又更高的改法是添加 .reset_index(drop=True):

prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)

I had been getting this issue with .apply() when assigning a new dataframe from a pre-existing dataframe on which i’ve used the .query() method. For instance:

prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

Would return this error. The fix that seems to resolve the error in this case is by changing this to:

prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)

However, this is NOT efficient especially when using large dataframes, due to having to make a new copy.

If you’re using the .apply() method in generating a new column and its values, a fix that resolves the error and is more efficient is by adding .reset_index(drop=True):

prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)

将Pandas GroupBy输出从Series转换为DataFrame

问题:将Pandas GroupBy输出从Series转换为DataFrame

我从这样的输入数据开始

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

打印时显示为:

   City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory

分组非常简单:

g1 = df1.groupby( [ "Name", "City"] ).count()

打印产生一个GroupBy对象:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1

但是我最终想要的是另一个DataFrame对象,该对象包含GroupBy对象中的所有行。换句话说,我想得到以下结果:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1

我在pandas文档中看不到如何完成此操作。任何提示都将受到欢迎。

I’m starting with input data like this

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

Which when printed appears as this:

   City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory

Grouping is simple enough:

g1 = df1.groupby( [ "Name", "City"] ).count()

and printing yields a GroupBy object:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1

But what I want eventually is another DataFrame object that contains all the rows in the GroupBy object. In other words I want to get the following result:

                  City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1

I can’t quite see how to accomplish this in the pandas documentation. Any hints would be welcome.


回答 0

g1 是一个 DataFrame。不过,它具有层次结构索引:

In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame

In [20]: g1.index
Out[20]: 
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
       ('Mallory', 'Seattle')], dtype=object)

也许您想要这样的东西?

In [21]: g1.add_suffix('_Count').reset_index()
Out[21]: 
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1

或类似的东西:

In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]: 
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1

g1 here is a DataFrame. It has a hierarchical index, though:

In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame

In [20]: g1.index
Out[20]: 
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
       ('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

In [21]: g1.add_suffix('_Count').reset_index()
Out[21]: 
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1

Or something like:

In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]: 
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1

回答 1

我想稍微更改Wes给出的答案,因为版本0.16.2需要as_index=False。如果未设置,则会得到一个空的数据框。

资料来源

当 as_index=True(默认值)时,如果被聚合的分组依据是命名列,聚合函数不会把这些列作为普通列返回;分组列将成为返回对象的索引。

传递 as_index=False 时,如果分组依据是命名列,则会把这些列作为普通列返回。

聚合函数是那些会降低返回对象维度的函数,例如:mean、sum、size、count、std、var、sem、describe、first、last、nth、min、max。例如,当您执行 DataFrame.sum() 并得到一个 Series 时,就是这种情况。

nth 既可以充当归约器(reducer),也可以充当过滤器,请参见此处。

import pandas as pd

df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                    "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print df1
#
#       City     Name
#0   Seattle    Alice
#1   Seattle      Bob
#2  Portland  Mallory
#3   Seattle  Mallory
#4   Seattle      Bob
#5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print g1
#
#                  City  Name
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1
#

编辑:

在 0.17.1 及更高版本中,您可以在 count 中使用 subset,并对 size 的结果使用带 name 参数的 reset_index:

print df1.groupby(["Name", "City"], as_index=False ).count()
#IndexError: list index out of range

print df1.groupby(["Name", "City"]).count()
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

print df1.groupby(["Name", "City"])[['Name','City']].count()
#                  Name  City
#Name    City                
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1

print df1.groupby(["Name", "City"]).size().reset_index(name='count')
#      Name      City  count
#0    Alice   Seattle      1
#1      Bob   Seattle      2
#2  Mallory  Portland      2
#3  Mallory   Seattle      1

count 和 size 的区别在于:size 会把 NaN 值计算在内,而 count 不会。

I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don’t set it, you get an empty dataframe.

Source:

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter, see here.

import pandas as pd

df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                    "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
print df1
#
#       City     Name
#0   Seattle    Alice
#1   Seattle      Bob
#2  Portland  Mallory
#3   Seattle  Mallory
#4   Seattle      Bob
#5  Portland  Mallory
#
g1 = df1.groupby(["Name", "City"], as_index=False).count()
print g1
#
#                  City  Name
#Name    City
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1
#

EDIT:

In version 0.17.1 and later you can use subset in count and reset_index with parameter name in size:

print df1.groupby(["Name", "City"], as_index=False ).count()
#IndexError: list index out of range

print df1.groupby(["Name", "City"]).count()
#Empty DataFrame
#Columns: []
#Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

print df1.groupby(["Name", "City"])[['Name','City']].count()
#                  Name  City
#Name    City                
#Alice   Seattle      1     1
#Bob     Seattle      2     2
#Mallory Portland     2     2
#        Seattle      1     1

print df1.groupby(["Name", "City"]).size().reset_index(name='count')
#      Name      City  count
#0    Alice   Seattle      1
#1      Bob   Seattle      2
#2  Mallory  Portland      2
#3  Mallory   Seattle      1

The difference between count and size is that size counts NaN values while count does not.
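A quick illustration of that difference, with invented data containing one NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})

print(df.groupby('g')['v'].size())   # a: 2, b: 1  (NaN counted)
print(df.groupby('g')['v'].count())  # a: 1, b: 1  (NaN ignored)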


回答 2

简单来说,下面这样就能完成任务:

import pandas as pd

grouped_df = df1.groupby( [ "Name", "City"] )

pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))

在这里,grouped_df.size() 得到每个分组的计数,reset_index() 方法把列重置为您想要的名称。最后,调用 pandas 的 DataFrame() 函数创建一个 DataFrame 对象。

Simply, this should do the task:

import pandas as pd

grouped_df = df1.groupby( [ "Name", "City"] )

pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))

Here, grouped_df.size() pulls up the unique groupby count, and reset_index() method resets the name of the column you want it to be. Finally, the pandas Dataframe() function is called upon to create a DataFrame object.


回答 3

关键是使用reset_index()方法。

采用:

import pandas

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()

现在,您的新数据框就在 g1 中:

结果数据框

The key is to use the reset_index() method.

Use:

import pandas

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

g1 = df1.groupby( [ "Name", "City"] ).count().reset_index()

Now you have your new dataframe in g1:

result dataframe


回答 4

也许我误解了这个问题,但是如果您想将groupby转换回数据框,则可以使用.to_frame()。我想在执行此操作时重设索引,所以我也包括了该部分。

与问题无关的示例代码

df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # 'Name' is the only index level after the groupby

Maybe I misunderstand the question but if you want to convert the groupby back to a dataframe you can use .to_frame(). I wanted to reset the index when I did this so I included that part as well.

example code unrelated to question

df = df['TIME'].groupby(df['Name']).min()
df = df.to_frame()
df = df.reset_index()  # 'Name' is the only index level after the groupby

回答 5

我发现这对我有用。

import numpy as np
import pandas as pd

df1 = pd.DataFrame({ 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

df1['City_count'] = 1
df1['Name_count'] = 1

df1.groupby(['Name', 'City'], as_index=False).count()

I found this worked for me.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({ 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

df1['City_count'] = 1
df1['Name_count'] = 1

df1.groupby(['Name', 'City'], as_index=False).count()

回答 6

下面的解决方案可能更简单:

df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()

Below solution may be simpler:

df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()

回答 7

我按数量(Qty)对数据进行了聚合,并将结果存入数据框:

almo_grp_data = pd.DataFrame({'Qty_cnt' :
almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
          )['Qty'].sum()}).reset_index()

I have aggregated the data Qty-wise and stored it to a dataframe:

almo_grp_data = pd.DataFrame({'Qty_cnt' :
almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
          )['Qty'].sum()}).reset_index()

回答 8

这些解决方案仅对我部分起作用,因为我正在进行多个聚合。这是我要转换为数据框的分组的示例输出:

分组输出

因为我想要的不止是 reset_index() 提供的计数,所以我写了一个手动方法,把上图所示的分组结果转换成数据帧。我知道这不是最符合 pythonic/pandas 风格的做法,因为它相当冗长和直白,但这正是我所需要的。基本思路是:先用上面说明的 reset_index() 方法搭一个“脚手架”数据框,然后遍历分组后数据框中的各个分组,取出其索引,针对未分组的数据框执行计算,再把结果写入新的聚合数据框。

df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)

# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0

def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }

for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)

    for key in calcDict:
        #Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]

如果您不喜欢用字典,也可以在 for 循环中内联地应用这些计算:

    df_aggregated['Male Count'].loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                                (df_aggregated['Job Title'] == columns[1])] = df['Male Count'].iloc[ix].sum()

These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample output of my grouped by that I wanted to convert to a dataframe:

Groupby Output

Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the image above into a dataframe. I understand this is not the most pythonic/pandas way of doing this as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a “scaffolding” dataframe, then loop through the group pairings in the grouped dataframe, retrieve the indices, perform your calculations against the ungrouped dataframe, and set the value in your new aggregated dataframe.

df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)

# Grouped gives us the indices we want for each grouping
# We cannot convert a groupedby object back to a dataframe, so we need to do it manually
# Create a new dataframe to work against
df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
df_aggregated['Male Count'] = 0
df_aggregated['Female Count'] = 0
df_aggregated['Job Rate'] = 0

def manualAggregations(indices_array):
    temp_df = df.iloc[indices_array]
    return {
        'Male Count': temp_df['Male Count'].sum(),
        'Female Count': temp_df['Female Count'].sum(),
        'Job Rate': temp_df['Hourly Rate'].max()
    }

for name, group in df_grouped:
    ix = df_grouped.indices[name]
    calcDict = manualAggregations(ix)

    for key in calcDict:
        #Salary Basis, Job Title
        columns = list(name)
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                          (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]

If a dictionary isn’t your thing, the calculations could be applied inline in the for loop:

    df_aggregated['Male Count'].loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                                (df_aggregated['Job Title'] == columns[1])] = df['Male Count'].iloc[ix].sum()
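As an aside, the same per-group sums and max can usually be expressed with agg and a column-to-function mapping, which avoids the manual loop entirely. A sketch, assuming the same column names as in the answer above:

df_aggregated = (
    df.groupby(['Salary Basis', 'Job Title'], as_index=False)
      .agg({'Male Count': 'sum', 'Female Count': 'sum', 'Hourly Rate': 'max'})
      .rename(columns={'Hourly Rate': 'Job Rate'})
)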

在pandas / python中的数据框中合并两列文本

问题:在pandas / python中的数据框中合并两列文本

我在 Python 中用 pandas 处理一个 20 x 4000 的数据框。其中两列分别名为 Year 和 quarter。我想创建一个名为 period 的变量,把 Year = 2000 和 quarter = q2 合并成 2000q2。

有人可以帮忙吗?

I have a 20 x 4000 dataframe in Python using pandas. Two of these columns are named Year and quarter. I’d like to create a variable called period that makes Year = 2000 and quarter= q2 into 2000q2.

Can anyone help with that?


回答 0

如果两个列都是字符串,则可以直接将它们连接起来:

df["period"] = df["Year"] + df["quarter"]

如果其中一列(或两列)不是字符串类型,则应首先将其转换为字符串:

df["period"] = df["Year"].astype(str) + df["quarter"]

这样做时要小心NaN!


如果需要连接多个字符串列,则可以使用agg

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

其中“-”是分隔符。

If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns are not string typed, you should convert it (them) first,

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!


If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where “-” is the separator.
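To make the NaN caveat above concrete: concatenation propagates missing values, so filling them first is one option (invented data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': ['2000', np.nan], 'quarter': ['q2', 'q3']})

print(df['Year'] + df['quarter'])                 # row 1 becomes NaN
print(df['Year'].fillna('????') + df['quarter'])  # row 1 becomes ????q3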


回答 1

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

产生此数据框

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

通过把 df[['Year', 'quarter']] 替换为数据框的任意列切片,例如 df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1),此方法可以推广到任意数量的字符串列。

您可以在此处查看有关apply()方法的更多信息

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).

You can check more information about apply() method here


回答 2

小型数据集(<150行)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

或稍慢但更紧凑:

df.Year.str.cat(df.quarter)

更大的数据集(> 150行)

df['Year'].astype(str) + df['quarter']

更新:Pandas 0.23.4 的时序图


让我们在200K行DF上进行测试:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

更新:使用Pandas 0.19.0的新计时

不使用 CPU/GPU 优化的计时(按从最快到最慢排序):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

使用 CPU/GPU 优化的计时:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

此回答由 @anton-vbr 贡献

Small data-sets (< 150rows)

[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150rows)

df['Year'].astype(str) + df['quarter']

UPDATE: Timing graph Pandas 0.23.4


Let’s test it on 200K rows DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop

Answer contribution by @anton-vbr


回答 3

.str 访问器的 cat() 方法可以很好地完成这项任务:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() 甚至允许您添加分隔符。例如,假设年份和季度都只是整数,则可以这样做:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

连接多列时,只需把一个 Series 列表,或一个包含除第一列之外所有列的数据框,作为参数传给在第一列(Series)上调用的 str.cat():

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

请注意,如果您的pandas数据框/系列具有空值,则需要包括参数na_rep以用字符串替换NaN值,否则合并的列将默认为NaN。

The method cat() of the .str accessor works really well for this:

>>> import pandas as pd
>>> df = pd.DataFrame([["2014", "q1"], 
...                    ["2015", "q3"]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014      q1
1  2015      q3
>>> df['Period'] = df.Year.str.cat(df.Quarter)
>>> print(df)
   Year Quarter  Period
0  2014      q1  2014q1
1  2015      q3  2015q3

cat() even allows you to add a separator so, for example, suppose you only have integers for year and period, you can do this:

>>> import pandas as pd
>>> df = pd.DataFrame([[2014, 1],
...                    [2015, 3]],
...                   columns=('Year', 'Quarter'))
>>> print(df)
   Year Quarter
0  2014       1
1  2015       3
>>> df['Period'] = df.Year.astype(str).str.cat(df.Quarter.astype(str), sep='q')
>>> print(df)
   Year Quarter  Period
0  2014       1  2014q1
1  2015       3  2015q3

Joining multiple columns is just a matter of passing either a list of series or a dataframe containing all but the first column as a parameter to str.cat() invoked on the first column (Series):

>>> df = pd.DataFrame(
...     [['USA', 'Nevada', 'Las Vegas'],
...      ['Brazil', 'Pernambuco', 'Recife']],
...     columns=['Country', 'State', 'City'],
... )
>>> df['AllTogether'] = df['Country'].str.cat(df[['State', 'City']], sep=' - ')
>>> print(df)
  Country       State       City                   AllTogether
0     USA      Nevada  Las Vegas      USA - Nevada - Las Vegas
1  Brazil  Pernambuco     Recife  Brazil - Pernambuco - Recife

Do note that if your pandas dataframe/series has null values, you need to include the parameter na_rep to replace the NaN values with a string, otherwise the combined column will default to NaN.
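For example, with an invented missing quarter:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', np.nan]})
print(df.Year.str.cat(df.Quarter, na_rep='?'))
# 0    2014q1
# 1    2015?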


回答 4

这次是配合 string.format() 使用 lambda 函数。

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print df
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
print df

  Quarter  Year
0      q1  2014
1      q2  2015
  Quarter  Year YearQuarter
0      q1  2014      2014q1
1      q2  2015      2015q2

这使您可以根据需要使用非字符串并重新格式化值。

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print df.dtypes
print df

df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
print df

Quarter     int64
Year       object
dtype: object
   Quarter  Year
0        1  2014
1        2  2015
   Quarter  Year YearQuarter
0        1  2014      2014q1
1        2  2015      2015q2

Use of a lambda function this time with string.format().

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': ['q1', 'q2']})
print df
df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
print df

  Quarter  Year
0      q1  2014
1      q2  2015
  Quarter  Year YearQuarter
0      q1  2014      2014q1
1      q2  2015      2015q2

This allows you to work with non-strings and reformat values as needed.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'Quarter': [1, 2]})
print df.dtypes
print df

df['YearQuarter'] = df[['Year','Quarter']].apply(lambda x : '{}q{}'.format(x[0],x[1]), axis=1)
print df

Quarter     int64
Year       object
dtype: object
   Quarter  Year
0        1  2014
1        2  2015
   Quarter  Year YearQuarter
0        1  2014      2014q1
1        2  2015      2015q2

回答 5

您问题的简单答案。

    year    quarter
0   2000    q1
1   2000    q2

> df['year_quarter'] = df['year'] + '' + df['quarter']

> print(df['year_quarter'])
  2000q1
  2000q2

Simple answer for your question.

    year    quarter
0   2000    q1
1   2000    q2

> df['year_quarter'] = df['year'] + '' + df['quarter']

> print(df['year_quarter'])
  2000q1
  2000q2

回答 6

虽然 @silvado 的答案很好,但如果把 df.map(str) 改成 df.astype(str),速度会更快:

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop

Although the @silvado answer is good if you change df.map(str) to df.astype(str) it will be faster:

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

In [131]: %timeit df["Year"].map(str)
10000 loops, best of 3: 132 us per loop

In [132]: %timeit df["Year"].astype(str)
10000 loops, best of 3: 82.2 us per loop

回答 7

假设您的数据框是 df,包含 Year 和 Quarter 两列。

import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})

假设我们要看数据框;

df
>>>  Quarter    Year
   0    q1      2000
   1    q2      2000
   2    q3      2000
   3    q4      2000

最后,按如下方式将 Year 和 Quarter 连接起来。

df['Period'] = df['Year'] + ' ' + df['Quarter']

现在执行 print df,即可查看得到的数据框。

df
>>>  Quarter    Year    Period
    0   q1      2000    2000 q1
    1   q2      2000    2000 q2
    2   q3      2000    2000 q3
    3   q4      2000    2000 q4

如果您不想在年份和季度之间留空格,删除它即可;

df['Period'] = df['Year'] + df['Quarter']

Let us suppose your dataframe is df with columns Year and Quarter.

import pandas as pd
df = pd.DataFrame({'Quarter':'q1 q2 q3 q4'.split(), 'Year':'2000'})

Suppose we want to see the dataframe;

df
>>>  Quarter    Year
   0    q1      2000
   1    q2      2000
   2    q3      2000
   3    q4      2000

Finally, concatenate the Year and the Quarter as follows.

df['Period'] = df['Year'] + ' ' + df['Quarter']

You can now print df to see the resulting dataframe.

df
>>>  Quarter    Year    Period
    0   q1      2000    2000 q1
    1   q2      2000    2000 q2
    2   q3      2000    2000 q3
    3   q4      2000    2000 q4

If you do not want the space between the year and quarter, simply remove it by doing;

df['Period'] = df['Year'] + df['Quarter']

回答 8

这是我发现非常通用的实现:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog

Here is an implementation that I find very versatile:

In [1]: import pandas as pd 

In [2]: df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
   ...:                    [1, 'fox', 'jumps', 'over'], 
   ...:                    [2, 'the', 'lazy', 'dog']],
   ...:                   columns=['c0', 'c1', 'c2', 'c3'])

In [3]: def str_join(df, sep, *cols):
   ...:     from functools import reduce
   ...:     return reduce(lambda x, y: x.astype(str).str.cat(y.astype(str), sep=sep), 
   ...:                   [df[col] for col in cols])
   ...: 

In [4]: df['cat'] = str_join(df, '-', 'c0', 'c1', 'c2', 'c3')

In [5]: df
Out[5]: 
   c0   c1     c2     c3                cat
0   0  the  quick  brown  0-the-quick-brown
1   1  fox  jumps   over   1-fox-jumps-over
2   2  the   lazy    dog     2-the-lazy-dog
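
For comparison, a sketch of the same join using pandas' built-in Series.str.cat, which accepts a list of Series and avoids the explicit reduce (column names follow the example above):

import pandas as pd

df = pd.DataFrame([[0, 'the', 'quick', 'brown'],
                   [1, 'fox', 'jumps', 'over'],
                   [2, 'the', 'lazy', 'dog']],
                  columns=['c0', 'c1', 'c2', 'c3'])

cols = ['c0', 'c1', 'c2', 'c3']
# str.cat joins the listed Series element-wise, inserting sep between values
df['cat'] = df[cols[0]].astype(str).str.cat(
    [df[c].astype(str) for c in cols[1:]], sep='-')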

回答 9

既然您的数据已经在数据框中,此命令应该可以解决您的问题:

df['period'] = df[['Year', 'quarter']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

Since your data are already in a dataframe, this command should solve your problem:

df['period'] = df[['Year', 'quarter']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

回答 10

更高效的做法是:

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

这是一个时间测试:

import numpy as np
import pandas as pd

from time import time


def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)


def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
           df[5] + df[6] + df[7] + df[8] + df[9]


def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)


def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)

    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))


if __name__ == '__main__':
    main()

最后要注意:当使用 sum(concat_df_str2)时,结果并不是简单的拼接,而是可能被转换为整数。

A more efficient approach is:

def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)

and here is a time test:

import numpy as np
import pandas as pd

from time import time


def concat_df_str1(df):
    """ run time: 1.3416s """
    return pd.Series([''.join(row.astype(str)) for row in df.values], index=df.index)


def concat_df_str2(df):
    """ run time: 5.2758s """
    return df.astype(str).sum(axis=1)


def concat_df_str3(df):
    """ run time: 5.0076s """
    df = df.astype(str)
    return df[0] + df[1] + df[2] + df[3] + df[4] + \
           df[5] + df[6] + df[7] + df[8] + df[9]


def concat_df_str4(df):
    """ run time: 7.8624s """
    return df.astype(str).apply(lambda x: ''.join(x), axis=1)


def main():
    df = pd.DataFrame(np.zeros(1000000).reshape(100000, 10))
    df = df.astype(int)

    time1 = time()
    df_en = concat_df_str4(df)
    print('run time: %.4fs' % (time() - time1))
    print(df_en.head(10))


if __name__ == '__main__':
    main()

Finally, note that when sum (concat_df_str2) is used, the result is not simply a concatenation; it can be converted to an integer.


回答 11

推广到多列时,为什么不这样做:

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

Generalising to multiple columns, why not:

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

回答 12

使用zip甚至可以更快:

df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

图形:


import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
"df['Year'].astype(str) + df['quarter']":
    lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
    lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
    lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
    lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
    "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
    lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k,v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t/iters
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

Using zip could be even quicker:

df["period"] = [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]

Graph:


import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})

myfuncs = {
"df['Year'].astype(str) + df['quarter']":
    lambda: df['Year'].astype(str) + df['quarter'],
"df['Year'].map(str) + df['quarter']":
    lambda: df['Year'].map(str) + df['quarter'],
"df.Year.str.cat(df.quarter)":
    lambda: df.Year.str.cat(df.quarter),
"df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df.loc[:, ['Year','quarter']].astype(str).sum(axis=1),
"df[['Year','quarter']].astype(str).sum(axis=1)":
    lambda: df[['Year','quarter']].astype(str).sum(axis=1),
    "df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)":
    lambda: df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1),
    "[''.join(i) for i in zip(dataframe['Year'].map(str),dataframe['quarter'])]":
    lambda: [''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k,v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        d[k][lendf] = t/iters
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()
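
The zip approach generalizes naturally to any number of columns; a small sketch under the same Year/quarter setup:

import pandas as pd

df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
cols = ['Year', 'quarter']
# zip iterates the chosen columns in parallel; join concatenates each row's values
df['period'] = [''.join(map(str, vals)) for vals in zip(*(df[c] for c in cols))]
print(df)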

回答 13

最简单的解决方案:

通用解决方案

df['combined_col'] = df[['col1', 'col2']].astype(str).apply('-'.join, axis=1)

特定问题的解决方案

df['quarter_year'] = df[['quarter', 'year']].astype(str).apply(''.join, axis=1)

在 .join 之前的引号内指定首选的分隔符。

Simplest Solution:

Generic Solution

df['combined_col'] = df[['col1', 'col2']].astype(str).apply('-'.join, axis=1)

Question specific solution

df['quarter_year'] = df[['quarter', 'year']].astype(str).apply(''.join, axis=1)

Specify the preferred delimiter inside the quotes before .join
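
In recent pandas versions the same one-liner can be written with DataFrame.agg, which hands each row to the join directly; a sketch using the column names from the generic solution above:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
# agg with axis=1 passes each row (already cast to str) to '-'.join
df['combined_col'] = df[['col1', 'col2']].astype(str).agg('-'.join, axis=1)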


回答 14

此解决方案使用一个中间步骤,将 DataFrame 的两列压缩为一个包含值列表的单列。这不仅适用于字符串,也适用于各种列 dtype。

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)

结果:

   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2

This solution uses an intermediate step compressing two columns of the DataFrame to a single column containing a list of the values. This works not only for strings but for all kinds of column dtypes.

import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['list']=df[['Year','quarter']].values.tolist()
df['period']=df['list'].apply(''.join)
print(df)

Result:

   Year quarter        list  period
0  2014      q1  [2014, q1]  2014q1
1  2015      q2  [2015, q2]  2015q2
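
One caveat: ''.join itself only accepts strings, so when the columns are not already strings the values need a str conversion inside the join; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'Year': [2014, 2015], 'quarter': ['q1', 'q2']})  # Year is int here
df['list'] = df[['Year', 'quarter']].values.tolist()
# map(str, ...) converts each element so join works regardless of dtype
df['period'] = df['list'].apply(lambda v: ''.join(map(str, v)))
print(df)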

回答 15

如前所述,您必须将每一列转换为字符串,然后使用加号运算符组合两个字符串列。使用NumPy可以大大提高性能。

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As many have mentioned previously, you must convert each column to string and then use the plus operator to combine two string columns. You can get a large performance improvement by using NumPy.

%timeit df['Year'].values.astype(str) + df.quarter
71.1 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df['Year'].astype(str) + df['quarter']
565 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

回答 16

我认为在pandas中组合列的最好方法是将两个列都转换为整数,然后转换为str。

df[['Year', 'quarter']] = df[['Year', 'quarter']].astype(int).astype(str)
df['Period']= df['Year'] + 'q' + df['quarter']

I think the best way to combine the columns in pandas is by converting both the columns to integer and then to str.

df[['Year', 'quarter']] = df[['Year', 'quarter']].astype(int).astype(str)
df['Period']= df['Year'] + 'q' + df['quarter']

回答 17

以下是我对上述方案的总结:使用分隔符,将一列 int 值与一列 str 值连接/合并为新列。共有三种可行的写法。

# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError

separator = "&&" 

# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"

df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)

Here is my summary of the above solutions for concatenating / combining two columns with int and str values into a new column, using a separator between the column values. Three solutions work for this purpose.

# be cautious about the separator, some symbols may cause "SyntaxError: EOL while scanning string literal".
# e.g. ";;" as separator would raise the SyntaxError

separator = "&&" 

# pd.Series.str.cat() method does not work to concatenate / combine two columns with int value and str value. This would raise "AttributeError: Can only use .cat accessor with a 'category' dtype"

df["period"] = df["Year"].map(str) + separator + df["quarter"]
df["period"] = df[['Year','quarter']].apply(lambda x : '{} && {}'.format(x[0],x[1]), axis=1)
df["period"] = df.apply(lambda x: f'{x["Year"]} && {x["quarter"]}', axis=1)

回答 18

使用.combine_first

df['Period'] = df['Year'].combine_first(df['Quarter'])

Use .combine_first.

df['Period'] = df['Year'].combine_first(df['Quarter'])
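
Note that combine_first does not concatenate strings: it fills missing values in the first Series with values from the second, so it only resembles concatenation when exactly one of the two columns is populated per row. A small sketch of its actual behaviour:

import numpy as np
import pandas as pd

s1 = pd.Series(['2014', np.nan])
s2 = pd.Series([np.nan, 'q2'])
# each NaN in s1 is replaced by the value at the same position in s2
print(s1.combine_first(s2))  # 0: 2014, 1: q2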

回答 19

from functools import reduce

import numpy as np
import pandas as pd


def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)

例如:

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

    Year    quarter period
0   2000    q1  2000q1
1   2000    q2  2000q2
2   2000    q3  2000q3
3   2000    q4  2000q4

from functools import reduce

import numpy as np
import pandas as pd


def madd(x):
    """Performs element-wise string concatenation with multiple input arrays.

    Args:
        x: iterable of np.array.

    Returns: np.array.
    """
    for i, arr in enumerate(x):
        if type(arr.item(0)) is not str:
            x[i] = x[i].astype(str)
    return reduce(np.core.defchararray.add, x)

For example:

data = list(zip([2000]*4, ['q1', 'q2', 'q3', 'q4']))
df = pd.DataFrame(data=data, columns=['Year', 'quarter'])
df['period'] = madd([df[col].values for col in ['Year', 'quarter']])

df

    Year    quarter period
0   2000    q1  2000q1
1   2000    q2  2000q2
2   2000    q3  2000q3
3   2000    q4  2000q4

回答 20

可以使用 DataFrame 的 assign 方法:

df = (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
        .assign(period=lambda x: x.Year + x.quarter))

One can use assign method of DataFrame:

df = (pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
        .assign(period=lambda x: x.Year + x.quarter))

回答 21

dataframe["period"] = dataframe["Year"].astype(str).add(dataframe["quarter"])

或者,如果值类似于 [2000] 和 [4],而想得到 [2000q4]:

dataframe["period"] = dataframe["Year"].astype(str).add('q').add(dataframe["quarter"]).astype(str)

用 .map(str) 替代 .astype(str) 也可以。

dataframe["period"] = dataframe["Year"].astype(str).add(dataframe["quarter"])

or if values are like [2000] [4] and want to make [2000q4]

dataframe["period"] = dataframe["Year"].astype(str).add('q').add(dataframe["quarter"]).astype(str)

substituting .astype(str) with .map(str) works too.


如何检查Pandas DataFrame中的值是否为NaN

问题:如何检查Pandas DataFrame中的值是否为NaN

在Python Pandas中,检查DataFrame是否具有一个(或多个)NaN值的最佳方法是什么?

我知道函数pd.isnan,但是这会为每个元素返回一个布尔值的DataFrame。此处的帖子也无法完全回答我的问题。

In Python Pandas, what’s the best way to check whether a DataFrame has one (or more) NaN values?

I know about the function pd.isnan, but this returns a DataFrame of booleans for each element. This post right here doesn’t exactly answer my question either.


回答 0

jwilner 的回答一针见血。我一直在探索是否有更快的选择,因为根据我的经验,对扁平数组求和(奇怪的是)比计数更快。这段代码看起来更快:

df.isnull().values.any()

例如:

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = pd.np.nan

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

df.isnull().sum().sum() 速度稍慢,但它当然还提供了额外信息,即 NaN 的数量。

jwilner‘s response is spot on. I was exploring to see if there’s a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()

For example:

In [2]: df = pd.DataFrame(np.random.randn(1000,1000))

In [3]: df[df > 0.9] = pd.np.nan

In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop

In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop

In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop

In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop

df.isnull().sum().sum() is a bit slower, but of course, has additional information — the number of NaNs.
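
A side note on the setup above: the pd.np alias used in df[df > 0.9] = pd.np.nan was deprecated (and removed in later pandas releases), so on a current install the same benchmark setup would import numpy directly; a sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 1000))
df[df > 0.9] = np.nan              # np.nan instead of the removed pd.np.nan
print(df.isnull().values.any())    # True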


回答 1

您有两种选择。

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

现在数据框看起来像这样:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • 选项 1:df.isnull().any().any() - 返回一个布尔值

您已经知道 isnull() 会返回这样的数据帧:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

如果您使用 df.isnull().any(),就只会得到含有 NaN 值的列:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

再加一个 .any(),即可知道上面是否有任何一项为 True:

> df.isnull().any().any()
True
  • 选项 2:df.isnull().sum().sum() - 返回 NaN 值总数的整数:

其工作方式与 .any().any() 相同:先对每一列中 NaN 值的数量求和,再对这些和求和:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

最后,要获取DataFrame中NaN值的总数:

df.isnull().sum().sum()
5

You have a couple of options.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10,6))
# Make a few areas have NaN values
df.iloc[1:3,1] = np.nan
df.iloc[5,3] = np.nan
df.iloc[7:9,5] = np.nan

Now the data frame looks something like this:

          0         1         2         3         4         5
0  0.520113  0.884000  1.260966 -0.236597  0.312972 -0.196281
1 -0.837552       NaN  0.143017  0.862355  0.346550  0.842952
2 -0.452595       NaN -0.420790  0.456215  1.203459  0.527425
3  0.317503 -0.917042  1.780938 -1.584102  0.432745  0.389797
4 -0.722852  1.704820 -0.113821 -1.466458  0.083002  0.011722
5 -0.622851 -0.251935 -1.498837       NaN  1.098323  0.273814
6  0.329585  0.075312 -0.690209 -3.807924  0.489317 -0.841368
7 -1.123433 -1.187496  1.868894 -2.046456 -0.949718       NaN
8  1.133880 -0.110447  0.050385 -1.158387  0.188222       NaN
9 -0.513741  1.196259  0.704537  0.982395 -0.585040 -1.693810
  • Option 1: df.isnull().any().any() – This returns a boolean value

You know of the isnull() which would return a dataframe like this:

       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False   True  False  False  False  False
2  False   True  False  False  False  False
3  False  False  False  False  False  False
4  False  False  False  False  False  False
5  False  False  False   True  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False   True
8  False  False  False  False  False   True
9  False  False  False  False  False  False

If you make it df.isnull().any(), you can find just the columns that have NaN values:

0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

One more .any() will tell you if any of the above are True

> df.isnull().any().any()
True
  • Option 2: df.isnull().sum().sum() – This returns an integer of the total number of NaN values:

This operates the same way as the .any().any() does, by first giving a summation of the number of NaN values in a column, then the summation of those values:

df.isnull().sum()
0    0
1    2
2    0
3    1
4    0
5    2
dtype: int64

Finally, to get the total number of NaN values in the DataFrame:

df.isnull().sum().sum()
5

回答 2

要找出特定列中具有NaN的行:

nan_rows = df[df['name column'].isnull()]

To find out which rows have NaNs in a specific column:

nan_rows = df[df['name column'].isnull()]

回答 3

如果您需要知道带有“一个或多个NaNs”的行数:

df.isnull().T.any().T.sum()

或者,如果您需要拉出这些行并进行检查:

nan_rows = df[df.isnull().T.any().T]

If you need to know how many rows there are with “one or more NaNs”:

df.isnull().T.any().T.sum()

Or if you need to pull out these rows and examine them:

nan_rows = df[df.isnull().T.any().T]
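
The double transpose is not actually required: any accepts an axis argument, so the same row-wise reduction can be written directly; an equivalent sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
# rows containing at least one NaN, without transposing
nan_rows = df[df.isnull().any(axis=1)]
# count of such rows
print(df.isnull().any(axis=1).sum())  # 2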

回答 4

df.isnull().any().any() 就可以做到。

df.isnull().any().any() should do it.


回答 5

作为对 hobs 精彩回答的补充:我对 Python 和 Pandas 还很陌生,如有错误请指出。

要找出哪些行具有NaN:

nan_rows = df[df.isnull().any(1)]

通过将 any() 的 axis 指定为 1 来检查每一行中是否存在 True,无需转置即可执行相同的操作。

Adding to hobs' brilliant answer, I am very new to Python and Pandas, so please point out if I am wrong.

To find out which rows have NaNs:

nan_rows = df[df.isnull().any(1)]

would perform the same operation without the need for transposing by specifying the axis of any() as 1 to check if ‘True’ is present in rows.
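
One small modernization, if I'm not mistaken: newer pandas releases warn when axis is passed positionally to reductions like any, so the keyword spelling is the future-proof form (reusing df from above):

nan_rows = df[df.isnull().any(axis=1)]  # same behaviour as .any(1)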


回答 6

超级简单语法: df.isna().any(axis=None)

从 v0.23.2 开始,可以使用 DataFrame.isna + DataFrame.any(axis=None),其中 axis=None 表示对整个 DataFrame 进行逻辑归约。

# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
     A    B
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0

df.isna()

       A      B
0  False   True
1  False  False
2   True  False

df.isna().any(axis=None)
# True

有用的替代方案

numpy.isnan
如果您运行的是旧版本的 pandas,这是另一个高性能的选择。

np.isnan(df.values)

array([[False,  True],
       [False, False],
       [ True, False]])

np.isnan(df.values).any()
# True

或者,检查总和:

np.isnan(df.values).sum()
# 2

np.isnan(df.values).sum() > 0
# True

Series.hasnans
您也可以迭代调用Series.hasnans。例如,要检查单个列是否具有NaN,

df['A'].hasnans
# True

要检查是否有任何一列含有 NaN,可以将生成器表达式与 any 结合使用(any 会短路求值)。

any(df[c].hasnans for c in df)
# True

这实际上非常快。

Super Simple Syntax: df.isna().any(axis=None)

Starting from v0.23.2, you can use DataFrame.isna + DataFrame.any(axis=None) where axis=None specifies logical reduction over the entire DataFrame.

# Setup
df = pd.DataFrame({'A': [1, 2, np.nan], 'B' : [np.nan, 4, 5]})
df
     A    B
0  1.0  NaN
1  2.0  4.0
2  NaN  5.0

df.isna()

       A      B
0  False   True
1  False  False
2   True  False

df.isna().any(axis=None)
# True

Useful Alternatives

numpy.isnan
Another performant option if you’re running older versions of pandas.

np.isnan(df.values)

array([[False,  True],
       [False, False],
       [ True, False]])

np.isnan(df.values).any()
# True

Alternatively, check the sum:

np.isnan(df.values).sum()
# 2

np.isnan(df.values).sum() > 0
# True

Series.hasnans
You can also iteratively call Series.hasnans. For example, to check if a single column has NaNs,

df['A'].hasnans
# True

And to check if any column has NaNs, you can use a comprehension with any (which is a short-circuiting operation).

any(df[c].hasnans for c in df)
# True

This is actually very fast.


回答 7

由于还没有人提到,这里再补充一个名为 hasnans 的属性。

如果 pandas Series 中有一个或多个值为 NaN,df[i].hasnans 会输出 True,否则输出 False。请注意,它不是函数。

pandas 版本 '0.19.2' 和 '0.20.2'

Since none have mentioned, there is just another variable called hasnans.

df[i].hasnans will output True if one or more of the values in the pandas Series is NaN, False if not. Note that it's not a function.

pandas version ‘0.19.2’ and ‘0.20.2’


回答 8

由于 pandas 在实现 DataFrame.dropna() 时必须解决这个问题,我查看了他们的实现方式,发现他们利用了 DataFrame.count(),它会统计 DataFrame 中所有非空值的数量。参见 pandas 源代码。我尚未对该技术做基准测试,但我认为库的作者很可能已经就实现方式做出了明智的选择。

Since pandas has to find this out for DataFrame.dropna(), I took a look to see how they implement it and discovered that they made use of DataFrame.count(), which counts all non-null values in the DataFrame. Cf. pandas source code. I haven’t benchmarked this technique, but I figure the authors of the library are likely to have made a wise choice for how to do it.


回答 9

设 df 为 Pandas DataFrame 的名称,且任何值为 numpy.nan 的值均视为空值。

  1. 如果要查看哪些列为空,哪些不为空(仅True和False)
    df.isnull().any()
  2. 如果只想查看具有空值的列
    df.loc[:, df.isnull().any()].columns
  3. 如果要查看每列中的空值计数
    df.isna().sum()
  4. 如果要查看每列中空值的百分比

    df.isna().sum()/(len(df))*100
  5. 如果要查看仅包含空值的列中的空值百分比: df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100

编辑1:

如果要直观地查看数据丢失的位置:

import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])

let df be the name of the Pandas DataFrame and any value that is numpy.nan is a null value.

  1. If you want to see which columns has nulls and which not(just True and False)
    df.isnull().any()
    
  2. If you want to see only the columns that has nulls
    df.loc[:, df.isnull().any()].columns
    
  3. If you want to see the count of nulls in every column
    df.isna().sum()
    
  4. If you want to see the percentage of nulls in every column

    df.isna().sum()/(len(df))*100
    
  5. If you want to see the percentage of nulls in columns only with nulls: df.loc[:,list(df.loc[:,df.isnull().any()].columns)].isnull().sum()/(len(df))*100

EDIT 1:

If you want to see where your data is missing visually:

import missingno
missingdata_df = df.columns[df.isnull().any()].tolist()
missingno.matrix(df[missingdata_df])

回答 10

仅使用 math.isnan(x),如果x是一个NaN(不是数字),则返回True,否则返回False。

Just using math.isnan(x), Return True if x is a NaN (not a number), and False otherwise.
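
Caveat: math.isnan only works on real-number scalars and raises a TypeError for None or strings, so pd.isna is the safer scalar check inside DataFrame code; a sketch:

import math
import pandas as pd

print(math.isnan(float('nan')))   # True
# math.isnan(None) would raise TypeError: must be real number, not NoneType
print(pd.isna(None))              # True: pd.isna handles None and NaN alike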


回答 11

df.isnull().sum()

这将为您提供DataFrame各个列中存在的所有NaN值的计数。

df.isnull().sum()

This will give you count of all NaN values present in the respective coloums of the DataFrame.


回答 12

这是另一种有趣的方式:找到空值,并用计算出的值进行替换。

    #Creating the DataFrame

    testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3     NaN
    3       40       4     NaN
    4       50       5   250.0

    #Identifying the rows with empty columns
    nan_rows = testdf2[testdf2['Yearly'].isnull()]
    >>> nan_rows
       Monthly  Tenure  Yearly
    2       30       3     NaN
    3       40       4     NaN

    #Getting the rows# into a list
    >>> index = list(nan_rows.index)
    >>> index
    [2, 3]

    # Replacing null values with calculated value
    >>> for i in index:
        testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3    90.0
    3       40       4   160.0
    4       50       5   250.0

Here is another interesting way of finding null and replacing with a calculated value

    #Creating the DataFrame

    testdf2 = pd.DataFrame({'Tenure':[1,2,3,4,5],'Monthly':[10,20,30,40,50],'Yearly':[10,40,np.nan,np.nan,250]})
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3     NaN
    3       40       4     NaN
    4       50       5   250.0

    #Identifying the rows with empty columns
    nan_rows = testdf2[testdf2['Yearly'].isnull()]
    >>> nan_rows
       Monthly  Tenure  Yearly
    2       30       3     NaN
    3       40       4     NaN

    #Getting the rows# into a list
    >>> index = list(nan_rows.index)
    >>> index
    [2, 3]

    # Replacing null values with calculated value
    >>> for i in index:
        testdf2['Yearly'][i] = testdf2['Monthly'][i] * testdf2['Tenure'][i]
    >>> testdf2
       Monthly  Tenure  Yearly
    0       10       1    10.0
    1       20       2    40.0
    2       30       3    90.0
    3       40       4   160.0
    4       50       5   250.0
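
A note on the loop above: assigning through chained indexing (testdf2['Yearly'][i] = ...) can trigger pandas' SettingWithCopyWarning; the same fill can be done in one vectorized step with fillna, an equivalent sketch:

import numpy as np
import pandas as pd

testdf2 = pd.DataFrame({'Tenure': [1, 2, 3, 4, 5],
                        'Monthly': [10, 20, 30, 40, 50],
                        'Yearly': [10, 40, np.nan, np.nan, 250]})
# fill each missing Yearly value with Monthly * Tenure for that row
testdf2['Yearly'] = testdf2['Yearly'].fillna(testdf2['Monthly'] * testdf2['Tenure'])
print(testdf2)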

回答 13

我一直使用以下方法:把值强制转换为字符串,再检查是否为 nan:

   (str(df.at[index, 'column']) == 'nan')

这使我可以检查序列中的特定值,而不仅仅是返回该值是否包含在序列中。

I’ve been using the following and type casting it to a string and checking for the nan value

   (str(df.at[index, 'column']) == 'nan')

This allows me to check specific value in a series and not just return if this is contained somewhere within the series.


回答 14

或者,您可以对 DataFrame 使用 .info(),例如:

df.info(null_counts=True) 会返回每列中非空行的数量,例如:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches                          3276314 non-null int64
avg_pic_distance                   3276314 non-null float64

Or you can use .info() on the DF such as :

df.info(null_counts=True) which returns the number of non_null rows in a columns such as:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276314 entries, 0 to 3276313
Data columns (total 10 columns):
n_matches                          3276314 non-null int64
avg_pic_distance                   3276314 non-null float64
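
If I recall correctly, the null_counts parameter was deprecated in pandas 1.2 and later removed; on current versions the equivalent flag is show_counts:

df.info(show_counts=True)  # modern spelling of null_counts=True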

回答 15

最好是使用:

df.isna().any().any()

原因如下:isna() 被用来定义 isnull(),两者当然完全相同。

这甚至比被接受的答案更快,并且适用于所有二维 pandas 数组。

The best would be to use:

df.isna().any().any()

Here is why. So isna() is used to define isnull(), but both of these are identical of course.

This is even faster than the accepted answer and covers all 2D panda arrays.


回答 16

import missingno as msno
msno.matrix(df)  # just to visualize. no missing value.


import missingno as msno
msno.matrix(df)  # just to visualize. no missing value.



回答 17

df.apply(axis=0, func=lambda x : any(pd.isnull(x)))

将逐列检查是否包含 NaN。

df.apply(axis=0, func=lambda x : any(pd.isnull(x)))

Will check for each column if it contains Nan or not.


回答 18

我们可以使用 seaborn 模块生成热图,以查看数据集中存在的空值:

import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)

We can see the null values present in the dataset by generating a heatmap using the seaborn module:

import pandas as pd
import seaborn as sns
dataset=pd.read_csv('train.csv')
sns.heatmap(dataset.isnull(),cbar=False)

回答 19

您不仅可以检查是否存在 NaN,还可以使用以下命令获取每一列中 NaN 的百分比:

df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})  
df  

   col1 col2  
0   1   6.0  
1   2   NaN  
2   3   8.0  
3   4   9.0  
4   5   10.0  


df.isnull().sum()/len(df)  
col1    0.0  
col2    0.2  
dtype: float64

You could not only check if any ‘NaN’ exist but also get the percentage of ‘NaN’s in each column using the following,

df = pd.DataFrame({'col1':[1,2,3,4,5],'col2':[6,np.nan,8,9,10]})  
df  

   col1 col2  
0   1   6.0  
1   2   NaN  
2   3   8.0  
3   4   9.0  
4   5   10.0  


df.isnull().sum()/len(df)  
col1    0.0  
col2    0.2  
dtype: float64
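
Since the mean of a boolean column is just the fraction of True values, the same percentage can be written more compactly (same result as the sum/len division above):

df.isnull().mean() * 100
# col1     0.0
# col2    20.0
# dtype: float64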

回答 20

根据要处理的数据类型,您还可以通过将dropna设置为False来在执行EDA时获取每一列的值计数。

for col in df:
   print(df[col].value_counts(dropna=False))

对于分类变量效果很好;当唯一值很多时,效果就不那么好了。

Depending on the type of data you’re dealing with, you could also just get the value counts of each column while performing your EDA by setting dropna to False.

for col in df:
   print(df[col].value_counts(dropna=False))

Works well for categorical variables, not so much when you have many unique values.